GPU-accelerated video decoding

Picking up where I left off last time, I’d like to discuss a few more aspects of using video decoding libraries in your own graphics pipeline. The previous article was actually just meant as the introduction to the more technical implementation details, but I got carried away.

Video decoding… it’s a thing. A solved thing, if all you want is your basic “I have a video and a window, and I want to display the video in the window”. There are various libraries that can do that for you (DirectShow, Media Foundation, VLC, FFmpeg etc.), and generally they will use the GPU to accelerate the decoding. Which is pretty much a requirement for more high-end content, such as 4K 60 fps video.

But I want to talk about just how thin a line GPU-accelerated decoding walks. Because as soon as you want to do anything other than just displaying the video content in a window managed by the library, you run into limitations. If you want to do anything with the video frames, you usually just want to get a pointer to the pixel data of the frame in some way.

And that is where things tend to fall apart. Such a pointer will have to be in system memory. Worst case (and this used to happen quite often), this triggers a chain reaction of the library doing everything in system memory, which means it will also use the CPU to decode, rather than the GPU. See, as long as the library can manage the entire decoding chain from start to finish, and has the freedom to decide which buffers to allocate where, and how to output the data, things are fine. But as soon as you want to have access to these buffers in some way, it may fall apart.

In the average case, it may use GPU acceleration for the actual decoding, but then copy the internal GPU buffer to a system buffer. And then you will have to copy it BACK to the GPU in your own texture, to do some actual rendering with it. The higher the resolution and framerate, the more annoying this GPU<->CPU traffic is, because it takes up a lot of precious CPU time and bandwidth.

But there’s a tiny bit more to it…

RGB32 vs NV12 format

In the modern world of truecolour graphics, we tend to use RGB pixel formats for textures, the most common being 8 bits per component, packed into a 32-bit word. The remaining 8 bits may be left undefined, or used as an extra alpha (A) component. The exact order may differ between different hardware/software, so we can have RGBA, BGRA, ARGB and whatnot, but let’s call this class ‘RGB32’, as in: “some variation of RGB, stored in a 32-bit word”. That is the variation you will generally want to use when rendering with textures.

For video, however, this is not ideal. YUV colour models were used by JPEG and MPEG, among other formats, because they have interesting properties for compression. A YUV colour model (again an umbrella term for various different, but related pixel formats) takes human perception into consideration. It decomposes a colour into luminance (brightness) and chrominance (colour) values. The human eye is more sensitive to luminance than to chrominance, which means that you can store the chrominance values at a lower resolution than the luminance values, without much of an effect on the perceived visual quality.

In fact, getting back to the old analog PAL and NTSC formats: These formats were originally black-and-white, so they contained only the luminance of the signal. When colour information (chrominance) was added later, it was added at a lower resolution than the luminance. PAL actually uses YUV, and NTSC uses the similar YIQ encoding. The lower resolution of the chroma signal leads to the phenomenon of artifacting, which was exploited on CGA in 8088 MPH.

In the digital world, the Y (luminance) component is stored at the full resolution, and the U and V (chrominance) components are stored at a reduced resolution. A common format is 4:2:0, which means that for every 4 Y samples, 1 U sample and 1 V sample are stored. In other words, for every 2×2 block of pixels, all Y-values are stored, and only the average U and V values of the block are stored.
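To make that concrete, here is a small sketch (the function name is my own, purely for illustration) of how a full-resolution chroma plane would be reduced to 4:2:0 by averaging each 2×2 block into a single sample:

```cpp
#include <cstdint>
#include <vector>

// Subsample a full-resolution chroma plane (U or V) to 4:2:0 by
// averaging each 2x2 block into one sample.
// Width and height are assumed to be even.
std::vector<uint8_t> subsample420(const std::vector<uint8_t>& plane,
                                  int width, int height)
{
    std::vector<uint8_t> out((width / 2) * (height / 2));
    for (int by = 0; by < height / 2; ++by) {
        for (int bx = 0; bx < width / 2; ++bx) {
            int sum = plane[(by * 2)     * width + bx * 2]
                    + plane[(by * 2)     * width + bx * 2 + 1]
                    + plane[(by * 2 + 1) * width + bx * 2]
                    + plane[(by * 2 + 1) * width + bx * 2 + 1];
            out[by * (width / 2) + bx] = static_cast<uint8_t>(sum / 4);
        }
    }
    return out;
}
```

Real encoders may use fancier filters than a plain average, but the principle is the same: one U and one V sample per 2×2 block of pixels.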

When converting back to RGB, the U and V components can be scaled back up, usually with a bilinear filter or similar. This can easily be implemented in hardware, so the pixel data can be stored in the more compact YUV layout, reducing the required memory footprint and bandwidth when decoding video frames. With RGB32, you need 32 bits per pixel. With a YUV 4:2:0 format, for 4 pixels you need to store a total of 6 samples, so 6×8 = 48 bits. That is effectively 48/4 = 12 bits per pixel, so only 37.5% of RGB32. That matters.
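A quick sanity check on those numbers (helper names are mine):

```cpp
#include <cstddef>

// Size in bytes of one tightly packed video frame.
// Width and height are assumed to be even.

size_t rgb32Size(size_t width, size_t height)
{
    return width * height * 4;  // 32 bits per pixel
}

size_t nv12Size(size_t width, size_t height)
{
    // Full-resolution Y plane, plus one U and one V byte per 2x2 block:
    // on average 8 bits Y + 2 bits U + 2 bits V = 12 bits per pixel.
    return width * height + (width / 2) * (height / 2) * 2;
}
```

At 1920×1080 that works out to roughly 8.3 MB per frame for RGB32 versus 3.1 MB for 4:2:0, which adds up quickly at 60 fps.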

When you want to get access to the frame data yourself, you generally have to tell the decoder which format to decode to. This is another point where things may fall apart, performance-wise. That is, a lot of hardware-accelerated decoders will decode into a YUV layout. If you specify that you want the frame decoded into an RGB32 format, this may cause the decoder to choose a decoding path that is partially or even entirely run on the CPU, and as such performs considerably worse.

In practice, the most common format that accelerated decoders will decode to is NV12. For an overview of NV12 and various other pixel formats, see this MSDN page. In short, NV12 is a format that stores YUV data in a single buffer, with the Y component first, and then the U and V components packed together:

Figure 10: NV12 memory layout

This format is supported in hardware on a wide range of devices, and is your best bet for efficient accelerated GPU decoding.
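Assuming a tightly packed NV12 buffer (a real decoder will typically give you a row pitch larger than the width, which you would then have to use instead of `width` for the strides), locating the samples for a given pixel looks like this (names are my own):

```cpp
#include <cstddef>

// Byte offsets into a tightly packed NV12 buffer for the pixel at (x, y).
// The Y plane comes first (width*height bytes), followed by interleaved
// U/V pairs at half resolution in both directions.
struct Nv12Offsets { size_t y, u, v; };

Nv12Offsets nv12Offsets(size_t x, size_t y, size_t width, size_t height)
{
    size_t yOffset = y * width + x;
    size_t uvBase  = width * height;  // start of the interleaved UV plane
    // One UV pair per 2x2 block; the UV plane's row stride equals width,
    // since each row holds width/2 pairs of 2 bytes.
    size_t uOffset = uvBase + (y / 2) * width + (x / 2) * 2;
    return { yOffset, uOffset, uOffset + 1 };
}
```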

What’s more: this format is also supported as a texture format, so for example with Direct3D11, you can use NV12 textures directly inside a shader. The translation from YUV to RGB is not done automatically for you though, but can be done inside the shader.

The format is a bit quirky. As it is a single buffer that contains two sets of data, at different resolutions, Direct3D11 solves this by allowing you to create two shader views on the texture. For the Y component, you create an ID3D11ShaderResourceView with the DXGI_FORMAT_R8_UNORM format. For the U and V components, you create an ID3D11ShaderResourceView with the DXGI_FORMAT_R8G8_UNORM format. You can then bind these views as two separate textures to the pipeline, and read the Y component from the R component of the R8_UNORM view, and the U and V components from the R and G components of the R8G8_UNORM view respectively. From there you can do the usual conversion to RGB.
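That conversion is just a small matrix transform, which maps naturally onto a few lines of pixel shader code. Here is a sketch of the math (in C++ rather than HLSL, and assuming BT.601 limited-range coefficients; which colour matrix and range the decoder actually used is something you need to get from the media type, so treat the constants below as an assumption):

```cpp
#include <algorithm>
#include <cstdint>

// BT.601 limited-range 8-bit YUV (as stored in NV12) to 8-bit RGB.
// In a pixel shader this would be the same math on normalized floats.
struct Rgb { uint8_t r, g, b; };

static uint8_t clamp255(float v)
{
    // Round to nearest and clamp to the 0..255 range.
    return static_cast<uint8_t>(std::min(255.0f, std::max(0.0f, v + 0.5f)));
}

Rgb yuvToRgb(uint8_t y, uint8_t u, uint8_t v)
{
    float c = y - 16.0f;   // luminance: 16..235 maps to black..white
    float d = u - 128.0f;  // chroma components are biased around 128
    float e = v - 128.0f;
    return {
        clamp255(1.164f * c + 1.596f * e),
        clamp255(1.164f * c - 0.391f * d - 0.813f * e),
        clamp255(1.164f * c + 2.018f * d),
    };
}
```

On the GPU you would let the sampler’s bilinear filter do the chroma upscaling for free, and only pay for the multiply-adds above per pixel.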

So the ideal way to decode video is to have the hardware routine decode it to NV12, and then let you have access to the NV12 buffer.

Using Media Foundation

With Media Foundation, it is possible to share your Direct3D11 device between your application and the Media Foundation accelerated decoders. This can be done via the IMFDXGIDeviceManager, which you can create with the MFCreateDXGIDeviceManager function. You can then use IMFDXGIDeviceManager::ResetDevice() to connect your D3D11 device to Media Foundation. It is important to set your device to multithread-protected via the ID3D10Multithread interface first.

This IMFDXGIDeviceManager can then be connected, for example, to your IMFSourceReader by setting its MF_SOURCE_READER_D3D_MANAGER attribute. As a result, any GPU acceleration done through D3D11 will now be done with your device, so the resources created will belong to your device, and can be accessed directly.

A quick-and-dirty way to get to the underlying DXGI buffers is to query the IMFMediaBuffer of a video sample for its IMFDXGIBuffer interface. This interface allows you to get to the underlying ID3D11Texture2D via its GetResource method. And there you are. You have access to the actual D3D11 texture that was used by the GPU-accelerated decoder.

You probably still need to copy this texture to a texture of your own with the same format, because you need a texture that has the D3D11_BIND_SHADER_RESOURCE flag set if you want to use it in a shader, and the decoder usually does not set that flag. But since it is all done on the GPU, this is reasonably efficient.

Timing on external clock

Another non-standard use of video decoding frameworks is to take matters into your own hands, and output the audio and video frames synchronized to an external clock. By default, the decoder framework will just output the frames in realtime, based on whatever clock source it uses internally. But if you want to output to a device with an external clock, you need to sync the frames yourself.

With DirectShow and Media Foundation, this is not that difficult: every audio and video sample that is decoded is provided with a timestamp, with an accuracy of 100 ns. So you can simply buffer a number of samples, and send them out based on their timestamps, relative to the reference clock of your choice.
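A minimal sketch of such a buffering scheme (types and names are mine; in practice the queued items would be IMFSamples or DirectShow media samples carrying these timestamps):

```cpp
#include <cstdint>
#include <deque>

// Decoded sample with its presentation time in 100-nanosecond ticks,
// the unit that both DirectShow and Media Foundation use.
struct Sample { int64_t timestamp; /* plus frame data */ };

class PresentationQueue {
public:
    void push(Sample s) { queue_.push_back(s); }

    // Pops the front sample into *out and returns true once the external
    // reference clock has reached (or passed) its timestamp.
    bool popDue(int64_t clockTime, Sample* out)
    {
        if (queue_.empty() || queue_.front().timestamp > clockTime)
            return false;
        *out = queue_.front();
        queue_.pop_front();
        return true;
    }

private:
    std::deque<Sample> queue_;
};
```

You would run the decoder slightly ahead to keep this queue filled, and poll popDue() from whatever drives your external clock (e.g. the output device’s vsync or audio callback).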

For some reason, LibVLC only provides timestamps with the audio samples, not with the video samples it decodes. That makes it difficult to use LibVLC in this way. Initially it did not have an easy way to decode frames on demand at all, but recently a libvlc_media_player_next_frame() function was added to skip to the next frame manually. Then it is up to you to figure out what the frame time should be exactly.

One issue here though, is that if you let the library decode the video in realtime, it will also automatically compensate for any performance problems, applying frame skipping when required. If you are decoding manually, at your own speed, then you need to handle the situation yourself where you cannot keep your decode buffer full, because the decoder cannot keep up. You may need to manually skip the playback position in the decoder ahead, to keep in sync with the video output speed.
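The skip decision itself can be as simple as comparing the head of your queue against the reference clock; a sketch (the one-frame tolerance is an arbitrary choice of mine):

```cpp
#include <cstdint>

// Decide how many queued frames to drop when decoding falls behind.
// If the frame at the head of the queue is older than the clock by more
// than one frame duration, skip ahead instead of presenting every late
// frame and drifting further out of sync.
// Timestamps and durations in 100 ns ticks (400000 = 40 ms, i.e. 25 fps).
int framesToDrop(int64_t headTimestamp, int64_t clockTime,
                 int64_t frameDuration)
{
    int64_t behind = clockTime - headTimestamp;
    if (behind <= frameDuration)
        return 0;  // at most one frame late: still present it
    return static_cast<int>(behind / frameDuration);
}
```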

All in all, things aren’t always that straightforward when you don’t just let the video library decode the video by itself, and let it time and display the output itself.

This entry was posted in Software development. Bookmark the permalink.

9 Responses to GPU-accelerated video decoding

  1. Martins Mozeiko says:

    Actually most common YUV format is 4:2:0 (which NV12 is), and not 4:1:1.
    4:1:1 does not mean that per 4 Y samples you’re storing 1 U and 1 V sample.

    A:B:C means that for a width of A pixels, there will be B U&V samples on the first row, and C extra U&V samples for the next row of A pixels. Which means that 4:1:1 operates on a 4×2 pixel region (so 8 Y samples), where the first row has 1 U&V sample and the second row also has 1 U&V sample. So for every 4 horizontal Y samples there is 1 U and 1 V sample, on each row independently (not a 2×2 block).

    The more common format is 4:2:0 – again a 4×2 region of pixels (8 Y samples); on the first row we now have two U&V samples, and no new samples on the second row (it reuses the same U and V as above). That means each 2×2 block has 4 Y samples and 1 U and 1 V sample.

    Wikipedia has good illustrations on this:

  2. erdema says:

    Good writing. Thanks.

    What about APUs like the Ryzen 5000G series? They work on system memory. So there is no need to copy textures from main memory to GPU memory, no overhead…

    This might be added.

    • Scali says:

      You still do, because a texture is an object within the DirectX runtime environment. You can’t just use any pointer to system memory as a texture directly. So technically shared memory looks the same as dedicated VRAM: you have to create a DirectX texture, and copy the data into it.
      In fact, I mostly work with relatively cheap low-end devices, with AMD and Intel integrated GPUs, and thus shared memory.
      It’s a struggle to get them to play 4k content acceptably.

      • erdema says:

        I “believe” that there should be a mechanism to share memory regions with the GPU space. Or the GPU should be able to reach system memory through some mechanism, without copying data…

        At least, that is what I remember AMD’s datasheets promised… I don’t know if it is compatible with the current state of the DX libs, but such a mechanism should be there. Otherwise, there would be no point in having the CPU/GPU on the same silicon.

        Maybe the CPU can create the required buffer in GPU space directly. Since it’s also system RAM, there should be no need to copy the buffer from CPU space to GPU space afterwards.

        I don’t write GPU/OpenCL programs myself. But AFAIK the term is “zero-copy shared-memory accesses”, for googling about it. I am really sure there is such a mechanism to avoid the memory copy, but I can’t prove it with my own code yet. And the mechanism might not work perfectly: there will still be memory-locking delays that reduce the perceived bandwidth, but I bet it’s still much faster than copying the data…

      • Scali says:

        Well, even discrete cards allow you to create textures in system memory, and use a shared pool of memory (it used to be called the ‘AGP aperture’; not sure if it even has a name in the PCI-e era). That’s not the issue. The issue is that it’s very difficult to control what your video decoder library does, memory-wise, vs what your 3D rendering code does.
        So most of the time you should consider yourself lucky if you can access the memory at all in a way that is compatible with DirectX, as opposed to having to copy the pixels one by one via the CPU.

        I think the best way may be to implement some of the video decoding interfaces yourself, so that you can implement the memory allocation functions that the decoder uses. But so far I haven’t figured out how to do that. There seems to be a problem there, where it is all-or-nothing: either you write the entire video decoder, or you just use all third-party code for the decoding.
        In this specific case you want a bit of both.

      • Martins Mozeiko says:

        > But AFAIK term is “zero-copy shared-memory accesses” for googling about it.

        When people talk about “zero copy” for hw accelerated video decoding, they typically mean not copying around decoded pixels before displaying them.

        So you push the encoded buffer to the GPU for decoding, decode it into GPU memory/a texture, and then display it from this texture – with whatever extra postprocessing you need, like converting YUV to RGB.

        Without “zero copy”, it would mean downloading the decoded pixels to CPU memory to pass them to your drawing/rendering pipeline – which may upload them back to the GPU again.

  3. erdema says:

    Thank you for the clarifications. Maybe Linux has answers for such problems, so there is no need to be compatible with DX. But I agree that it’s a little complex, and the gains are really low compared to the work that needs to be done.

    Modern GPUs already come with video decoder ASICs, and those are dominating the market. They are powerful & efficient. So GPU acceleration for video decoding has already become obsolete, at least for modern chips. Soon, probably, all of this codec work will be handled by CPU-integrated FPGAs (and probably this is why AMD bought Xilinx).


    • Scali says:

      In Linux you’d use OpenGL or Vulkan, which afaik have the same problems with textures: they are handles to objects that belong to the runtime, so you can’t just ‘cast’ memory from anywhere into the texture you want. The texture object needs to be created by your instance of the API runtime if you want to use it.
