torchcodec-vulkan

A Vulkan video decode backend for torchcodec. It decodes on AMD, Intel, or NVIDIA GPUs through VK_KHR_video_decode, then color-converts and resizes in a compute shader.

Why

torchcodec has one GPU decode path, NVDEC, which is NVIDIA only. On AMD and Intel it decodes on the CPU. For VLM and video-model inference the decode and resize step is usually what starves the GPU, so moving it onto the GPU on non-NVIDIA hardware is the gap this fills.

The backend registers through torchcodec's third-party device interface, so there is no fork of torchcodec involved.

Status

v0.1.0. Working and tested on a single GPU (AMD RX 6900 XT, RADV). It hardware-decodes 8-bit H.264 and HEVC, matches torchcodec's CPU output, and is faster on the decode-and-downscale workload. Anything it cannot decode on the GPU (AV1 without a Vulkan decoder, 10-bit, unusual profiles) falls back to the CPU and still produces correct frames.

Read Limitations before depending on it. The short version: one GPU validated, source build only, CPU output only.

Install

There is no wheel yet. Build from source against your system FFmpeg (6.1 or newer, built with Vulkan) and a matching torchcodec. BUILDING.md has the full recipe, including a sudo-free way to get the FFmpeg dev headers.

Usage

torchcodec has no public kwarg for picking a device variant, so selection goes through a context manager that sets torchcodec's internal backend variable.

import torchcodec_vulkan
from torchcodec.decoders import VideoDecoder
from torchcodec_vulkan.api import use_vulkan_backend

with use_vulkan_backend():
    dec = VideoDecoder("clip.mp4", device="cpu")

frames = dec.get_frames_in_range(start=0, stop=32)
print(frames.data.shape)

Or the convenience wrapper, which decodes and resizes on the GPU in one pass:

import torchcodec_vulkan as tcv

batch = tcv.decode("clip.mp4", num_frames=32, resize=(224, 224))

On a multi-GPU machine, set TORCHCODEC_VULKAN_DEVICE to choose the adapter (an index or a DRM node, following FFmpeg's Vulkan device string).

Benchmarks

AMD RX 6900 XT, RADV, 60 frames, median of 7 runs. The workload is decode plus resize to 224x224, which is the frame-sampling shape VLMs use.

Input	CPU backend	Vulkan backend	Speedup
720p to 224	180 ms	109 ms	1.65x
1080p to 224	383 ms	124 ms	3.09x
1080p, no resize	~296 ms	~296 ms	~1.0x

The speedup grows with the downscale ratio, because the GPU decodes the full frame and only a small resized buffer is copied back. With no resize the readback is the whole frame and the two backends come out about even. The CPU stays free during decode in every case.

Correctness: mean absolute difference against the CPU backend is roughly 1 to 2.5 out of 255, with the larger differences confined to chroma edges, where the shader's bilinear chroma upsampling differs slightly from swscale.

Codec support

Codec	Path
H.264 8-bit	Vulkan decode
HEVC 8-bit	Vulkan decode
10-bit (P010), AV1 without a Vulkan decoder, others	CPU fallback

The fallback is transparent and its output is identical to torchcodec's CPU backend. The decoder prints one line to stderr the first time it falls back, so it is visible when a stream is not getting hardware acceleration.

How it works

FFmpeg demuxes and runs VK_KHR_video_decode, leaving the decoded NV12 frame in VRAM. A compute shader samples the Y and CbCr planes, applies the YUV to RGB matrix for the frame's color space and range, and resizes to the requested output dimensions in the same pass. The result is read back to a CPU tensor, which is where the readback cost shows up at full resolution.

Two details that took time to get right, both written up in SPEC.md:

The hardware frames context has to come from avcodec_get_hw_frames_parameters and then have MUTABLE_FORMAT added. Building it by hand misses the decode setup and crashes FFmpeg's decode init.
The readback buffer must be HOST_CACHED. The default HOST_COHERENT memory is write-combined and uncached on discrete GPUs, which made the readback about ten times slower. That one change took the workload from slower than the CPU to faster.

Limitations

Untested, meaning unknown rather than known broken:

Only validated on AMD RX 6900 XT with RADV. Intel (ANV), NVIDIA, and other AMD and Mesa versions have not been run. The parts most likely to behave differently are the HOST_CACHED readback memory, the per-plane R8 and R8G8 views, 8-bit SSBO storage, and the MUTABLE_FORMAT decode images.
Resolution changes mid-stream. Views are created per frame so there is no stale-handle risk, but the frames-context rebuild path itself is not exercised.
Stream counts beyond the small concurrency tests.

Known constraints in v0.1.0, by design:

Source build only, no wheel. Needs system FFmpeg with Vulkan, its dev headers, Python 3.13 or older, and torchcodec==0.14.*.
Selection depends on a torchcodec internal (_decoder_utils._CUDA_BACKEND) and on the 0.14 plugin ABI, which has changed across releases. The package warns at import if the installed torchcodec is not the version it was tested against.
10-bit and HDR run on the CPU fallback, not the GPU.
A decoder is single-in-flight and not safe to share across threads, which matches torchcodec's per-decoder contract. Separate decoders are fine, but each currently creates its own Vulkan device, which is a cost under high concurrency.
Output is a CPU tensor. Zero-copy onto a GPU tensor is the obvious next step for GPU-resident inference. It is blocked on AMD here (no ROCm torchcodec build, and RADV to ROCm memory sharing is not supported) and would land cleanly on NVIDIA. See SPEC.md.

Layout

csrc/                       C++ backend and the compute shader
  VulkanDeviceInterface.*   the torchcodec device interface
  VkContext.*               Vulkan device, pipeline, queue
  NV12ToRGB.*               convert dispatch and readback
  ColorMatrix.*             YUV to RGB matrix per color space and range
  shaders/nv12_to_rgb.comp  YUV to RGB plus resize
src/torchcodec_vulkan/      Python package
tests/                      26 tests
bench/                      the benchmark above
spike/                      standalone on-GPU shader check
SPEC.md, BUILDING.md        design notes and build recipe

License

BSD-3-Clause. See LICENSE.

Written by Dennis de Vulder with development assistance from Claude (Opus 4.8). The commit history records this with Assisted-by trailers.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
bench		bench
csrc		csrc
spike		spike
src/torchcodec_vulkan		src/torchcodec_vulkan
test_assets		test_assets
tests		tests
.gitignore		.gitignore
BUILDING.md		BUILDING.md
CMakeLists.txt		CMakeLists.txt
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SPEC.md		SPEC.md
pyproject.toml		pyproject.toml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

torchcodec-vulkan

Why

Status

Install

Usage

Benchmarks

Codec support

How it works

Limitations

Layout

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

torchcodec-vulkan

Why

Status

Install

Usage

Benchmarks

Codec support

How it works

Limitations

Layout

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages