A Vulkan video decode backend for torchcodec.
It decodes on AMD, Intel, or NVIDIA GPUs through VK_KHR_video_decode, then
color-converts and resizes in a compute shader.
torchcodec has one GPU decode path, NVDEC, which is NVIDIA only. On AMD and Intel it decodes on the CPU. For VLM and video-model inference the decode and resize step is usually what starves the GPU, so moving it onto the GPU on non-NVIDIA hardware is the gap this fills.
The backend registers through torchcodec's third-party device interface, so there is no fork of torchcodec involved.
v0.1.0. Working and tested on a single GPU (AMD RX 6900 XT, RADV). It hardware-decodes 8-bit H.264 and HEVC, matches torchcodec's CPU output, and is faster on the decode-and-downscale workload. Anything it cannot decode on the GPU (AV1 without a Vulkan decoder, 10-bit, unusual profiles) falls back to the CPU and still produces correct frames.
Read Limitations before depending on it. The short version: one GPU validated, source build only, CPU output only.
There is no wheel yet. Build from source against your system FFmpeg (6.1 or newer, built with Vulkan) and a matching torchcodec. BUILDING.md has the full recipe, including a sudo-free way to get the FFmpeg dev headers.
torchcodec has no public kwarg for picking a device variant, so selection goes through a context manager that sets torchcodec's internal backend variable.
import torchcodec_vulkan
from torchcodec.decoders import VideoDecoder
from torchcodec_vulkan.api import use_vulkan_backend
with use_vulkan_backend():
dec = VideoDecoder("clip.mp4", device="cpu")
frames = dec.get_frames_in_range(start=0, stop=32)
print(frames.data.shape)Or the convenience wrapper, which decodes and resizes on the GPU in one pass:
import torchcodec_vulkan as tcv
batch = tcv.decode("clip.mp4", num_frames=32, resize=(224, 224))On a multi-GPU machine, set TORCHCODEC_VULKAN_DEVICE to choose the adapter (an
index or a DRM node, following FFmpeg's Vulkan device string).
AMD RX 6900 XT, RADV, 60 frames, median of 7 runs. The workload is decode plus resize to 224x224, which is the frame-sampling shape VLMs use.
| Input | CPU backend | Vulkan backend | Speedup |
|---|---|---|---|
| 720p to 224 | 180 ms | 109 ms | 1.65x |
| 1080p to 224 | 383 ms | 124 ms | 3.09x |
| 1080p, no resize | ~296 ms | ~296 ms | ~1.0x |
The speedup grows with the downscale ratio, because the GPU decodes the full frame and only a small resized buffer is copied back. With no resize the readback is the whole frame and the two backends come out about even. The CPU stays free during decode in every case.
Correctness: mean absolute difference against the CPU backend is roughly 1 to 2.5 out of 255, with the larger differences confined to chroma edges, where the shader's bilinear chroma upsampling differs slightly from swscale.
| Codec | Path |
|---|---|
| H.264 8-bit | Vulkan decode |
| HEVC 8-bit | Vulkan decode |
| 10-bit (P010), AV1 without a Vulkan decoder, others | CPU fallback |
The fallback is transparent and its output is identical to torchcodec's CPU backend. The decoder prints one line to stderr the first time it falls back, so it is visible when a stream is not getting hardware acceleration.
FFmpeg demuxes and runs VK_KHR_video_decode, leaving the decoded NV12 frame in
VRAM. A compute shader samples the Y and CbCr planes, applies the YUV to RGB
matrix for the frame's color space and range, and resizes to the requested
output dimensions in the same pass. The result is read back to a CPU tensor,
which is where the readback cost shows up at full resolution.
Two details that took time to get right, both written up in SPEC.md:
- The hardware frames context has to come from
avcodec_get_hw_frames_parametersand then haveMUTABLE_FORMATadded. Building it by hand misses the decode setup and crashes FFmpeg's decode init. - The readback buffer must be
HOST_CACHED. The defaultHOST_COHERENTmemory is write-combined and uncached on discrete GPUs, which made the readback about ten times slower. That one change took the workload from slower than the CPU to faster.
Untested, meaning unknown rather than known broken:
- Only validated on AMD RX 6900 XT with RADV. Intel (ANV), NVIDIA, and other AMD
and Mesa versions have not been run. The parts most likely to behave
differently are the
HOST_CACHEDreadback memory, the per-plane R8 and R8G8 views, 8-bit SSBO storage, and theMUTABLE_FORMATdecode images. - Resolution changes mid-stream. Views are created per frame so there is no stale-handle risk, but the frames-context rebuild path itself is not exercised.
- Stream counts beyond the small concurrency tests.
Known constraints in v0.1.0, by design:
- Source build only, no wheel. Needs system FFmpeg with Vulkan, its dev headers,
Python 3.13 or older, and
torchcodec==0.14.*. - Selection depends on a torchcodec internal (
_decoder_utils._CUDA_BACKEND) and on the 0.14 plugin ABI, which has changed across releases. The package warns at import if the installed torchcodec is not the version it was tested against. - 10-bit and HDR run on the CPU fallback, not the GPU.
- A decoder is single-in-flight and not safe to share across threads, which matches torchcodec's per-decoder contract. Separate decoders are fine, but each currently creates its own Vulkan device, which is a cost under high concurrency.
- Output is a CPU tensor. Zero-copy onto a GPU tensor is the obvious next step for GPU-resident inference. It is blocked on AMD here (no ROCm torchcodec build, and RADV to ROCm memory sharing is not supported) and would land cleanly on NVIDIA. See SPEC.md.
csrc/ C++ backend and the compute shader
VulkanDeviceInterface.* the torchcodec device interface
VkContext.* Vulkan device, pipeline, queue
NV12ToRGB.* convert dispatch and readback
ColorMatrix.* YUV to RGB matrix per color space and range
shaders/nv12_to_rgb.comp YUV to RGB plus resize
src/torchcodec_vulkan/ Python package
tests/ 26 tests
bench/ the benchmark above
spike/ standalone on-GPU shader check
SPEC.md, BUILDING.md design notes and build recipe
BSD-3-Clause. See LICENSE.
Written by Dennis de Vulder with development assistance from Claude (Opus 4.8).
The commit history records this with Assisted-by trailers.