Picking up where I left off last time, I’d like to discuss a few more things to keep in mind when using video decoding libraries in your own graphics pipeline. Actually, the previous article was meant as just the introduction to the more technical implementation details, but I got carried away.
Video decoding… it’s a thing. A solved thing, if you just want your basic “I have a video and a window, and I want to display the video in the window”. There are various libraries that can do that for you (DirectShow, Media Foundation, VLC, FFmpeg etc.), and generally they will use the GPU to accelerate the decoding. Which is pretty much a requirement for higher-end content, such as 4k 60 fps video.
But I want to talk about just how thin the ice is with GPU-accelerated decoding. Because as soon as you want to do anything other than displaying the video content in a window managed by the library, you run into limitations. If you want to do anything with the video frames yourself, you usually want to get a pointer to the pixel data of the frame in some way.
And that is where things tend to fall apart. Such a pointer will have to point to system memory. Worst case (which used to be quite common), this triggers a chain reaction where the library does everything in system memory, which means it will also use the CPU to decode, rather than the GPU. See, as long as the library can manage the entire decoding chain from start to finish, and has the freedom to decide which buffers to allocate where, and how to output the data, things are fine. But as soon as you want access to these buffers in some way, it may fall apart.
In the average case, it may use GPU acceleration for the actual decoding, but then copy the internal GPU buffer to a system memory buffer. And then you will have to copy it BACK to the GPU into your own texture, to do some actual rendering with it. The higher the resolution and framerate, the more painful this GPU<->CPU traffic becomes, because it takes up a lot of precious CPU time and bandwidth.
But there’s a tiny bit more to it…
RGB32 vs NV12 format
In the modern world of truecolour graphics, we tend to use RGB pixelformats for textures, the most common being 8 bits per component, packed into a 32-bit word. The remaining 8 bits may be left undefined, or used as an extra alpha (A) component. The exact order may differ between different hardware/software, so we can have RGBA, BGRA, ARGB and whatnot, but let’s call this class ‘RGB32’, as in: “some variation of RGB, stored in a 32-bit word”. That is the variation you will generally want to use when rendering with textures.
For video however, this is not ideal. YUV colour models were used by JPEG and MPEG, among other formats, because they have interesting properties for compression. A YUV (again an umbrella term for various different, but related pixel formats) colour model takes human perception into consideration. It decomposes a colour into luminance (brightness) and chrominance (colour) values. The human eye is more sensitive to luminance than to chrominance, which means that you can store the chrominance values with a lower resolution than the luminance values, without having much of an effect on the perceived visual quality.
In fact, getting back to the old analog PAL and NTSC formats: These formats were originally black-and-white, so they contained only the luminance of the signal. When colour information (chrominance) was added later, it was added at a lower resolution than the luminance. PAL actually uses YUV, and NTSC uses the similar YIQ encoding. The lower resolution of the chroma signal leads to the phenomenon of artifacting, which was exploited on CGA in 8088 MPH.
In the digital world, the Y (luminance) component is stored at the full resolution, and the U and V (chrominance) components are stored at a reduced resolution. A common format is 4:2:0, which means that for every 4 Y samples, 1 U sample and 1 V sample are stored. In other words, for every 2×2 block of pixels, all Y-values are stored, and only the average U and V values of the block are stored.
When converting back to RGB, the U and V components can be scaled back up, usually with a bilinear filter or such. This can easily be implemented in hardware, so the pixel data can be stored in the more compact YUV layout, reducing the required memory footprint and bandwidth when decoding video frames. With RGB32, you need 32 bits per pixel. With a YUV 4:2:0 format, you need to store a total of 6 samples for every 4 pixels, so 6*8 = 48 bits. That is effectively 48/4 = 12 bits per pixel, so only 37.5% of RGB32. That matters.
When you want to get access to the frame data yourself, you generally have to tell the decoder which format to decode to. This is another pitfall where things may fall apart, performance-wise. That is, a lot of hardware-accelerated decoders will decode into a YUV-layout. If you specify that you want to decode the frame into an RGB32 format, this may cause the decoder to choose a decoding path that is partially or even entirely run on the CPU, and as such will perform considerably worse.
In practice, the most common format that accelerated decoders will decode to is NV12. For an overview of NV12 and various other pixel formats, see this MSDN page. In short, NV12 is a format that stores YUV data in a single buffer, with the full-resolution Y plane first, followed by the U and V samples packed together in interleaved pairs.
This format is supported in hardware on a wide range of devices, and is your best bet for efficient accelerated GPU decoding.
What’s more: this format is also supported as a texture format, so for example with Direct3D11, you can use NV12 textures directly inside a shader. The translation from YUV to RGB is not done automatically for you, though; you can do it inside the shader.
The format is a bit quirky. As it is a single buffer that contains two sets of data at different resolutions, Direct3D11 solves this by letting you create two shader views on the texture. For the Y component, you create an ID3D11ShaderResourceView with the DXGI_FORMAT_R8_UNORM format. For the U and V components, you create an ID3D11ShaderResourceView with the DXGI_FORMAT_R8G8_UNORM format. You can then bind these views as two separate textures to the pipeline, and read the Y component from the R channel of the R8_UNORM view, and the U and V components from the R and G channels of the R8G8_UNORM view respectively. From there you can do the usual conversion to RGB.
So the ideal way to decode video is to have the hardware routine decode it to NV12, and then let you have access to the NV12 buffer.
Using Media Foundation
With Media Foundation, it is possible to share your Direct3D11 device between your application and the Media Foundation accelerated decoders. This is done via the IMFDXGIDeviceManager, which you create with the MFCreateDXGIDeviceManager function. You then use IMFDXGIDeviceManager::ResetDevice() to connect your D3D11 device to Media Foundation. It is important to make your device multithread-protected via the ID3D10Multithread interface first, since Media Foundation will access it from its own worker threads.
This IMFDXGIDeviceManager can then be connected, for example, to an IMFSourceReader by setting the MF_SOURCE_READER_D3D_MANAGER attribute when creating the reader. As a result, any GPU acceleration done through D3D11 will now use your device, so the resources created belong to your device, and can be accessed directly.
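Pulled together, the setup looks roughly like this. This is a minimal sketch with all error handling elided; it assumes the device was created with D3D11_CREATE_DEVICE_VIDEO_SUPPORT, and the function name is mine:

```cpp
#include <d3d11.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>

// Sketch: share an existing D3D11 device with a Media Foundation
// source reader. All HRESULTs should be checked in real code.
IMFSourceReader* CreateReaderOnDevice(ID3D11Device* device, const wchar_t* url)
{
    // Media Foundation uses the device from its own threads, so it
    // must be made multithread-protected first.
    ID3D10Multithread* multithread = nullptr;
    device->QueryInterface(IID_PPV_ARGS(&multithread));
    multithread->SetMultithreadProtected(TRUE);
    multithread->Release();

    // Wrap the device in a DXGI device manager.
    UINT resetToken = 0;
    IMFDXGIDeviceManager* deviceManager = nullptr;
    MFCreateDXGIDeviceManager(&resetToken, &deviceManager);
    deviceManager->ResetDevice(device, resetToken);

    // Hand the manager to the source reader at creation time.
    IMFAttributes* attributes = nullptr;
    MFCreateAttributes(&attributes, 1);
    attributes->SetUnknown(MF_SOURCE_READER_D3D_MANAGER, deviceManager);

    IMFSourceReader* reader = nullptr;
    MFCreateSourceReaderFromURL(url, attributes, &reader);
    attributes->Release();
    deviceManager->Release();
    return reader;
}
```

This only compiles and runs on Windows with the Media Foundation libraries (mfplat, mfreadwrite) linked in.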
A quick-and-dirty way to get to the underlying DXGI buffers is to query the IMFMediaBuffer of a video sample for its IMFDXGIBuffer interface. That interface gives you the underlying ID3D11Texture2D via its GetResource method. And there you are: you have access to the actual D3D11 texture that the GPU-accelerated decoder used.
You probably still need to copy this texture to a texture of your own with the same format, because using it in a shader requires a texture with the D3D11_BIND_SHADER_RESOURCE flag set, and the decoder usually does not set that flag. But since it is all done on the GPU, this is reasonably efficient.
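A sketch of those two steps, again with error handling elided and the function name mine. One extra wrinkle I account for here: decoders often output slices of a texture array, so you should ask the IMFDXGIBuffer for the subresource index and copy just that slice:

```cpp
#include <d3d11.h>
#include <mfobjects.h>
#include <mfreadwrite.h>

// Sketch: pull the decoder's D3D11 texture out of an IMFSample and copy
// it into a shader-usable texture of our own. 'sample' comes from
// IMFSourceReader::ReadSample on the video stream.
void CopyDecodedFrame(ID3D11Device* device, ID3D11DeviceContext* context,
                      IMFSample* sample, ID3D11Texture2D** shaderTexture)
{
    IMFMediaBuffer* buffer = nullptr;
    sample->GetBufferByIndex(0, &buffer);

    IMFDXGIBuffer* dxgiBuffer = nullptr;
    buffer->QueryInterface(IID_PPV_ARGS(&dxgiBuffer));

    ID3D11Texture2D* decoded = nullptr;
    dxgiBuffer->GetResource(IID_PPV_ARGS(&decoded));

    // The decoder may hand out one slice of a texture array.
    UINT subresource = 0;
    dxgiBuffer->GetSubresourceIndex(&subresource);

    if (*shaderTexture == nullptr)
    {
        // Same size/format as the decoder's texture (typically NV12),
        // but with the bind flag we need for sampling in a shader.
        D3D11_TEXTURE2D_DESC desc;
        decoded->GetDesc(&desc);
        desc.ArraySize = 1;
        desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
        desc.MiscFlags = 0;
        device->CreateTexture2D(&desc, nullptr, shaderTexture);
    }
    // GPU-side copy; no round-trip through system memory.
    context->CopySubresourceRegion(*shaderTexture, 0, 0, 0, 0,
                                   decoded, subresource, nullptr);
    decoded->Release();
    dxgiBuffer->Release();
    buffer->Release();
}
```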
Timing on external clock
Another non-standard use of video decoding frameworks is to take matters into your own hands, and output the audio and video frames synchronized to an external clock. By default, the decoder framework will just output the frames in realtime, based on whatever clock source it uses internally. But if you want to output to a device with an external clock, you need to sync the frames yourself.
With DirectShow and Media Foundation, this is not that difficult: every audio and video sample that is decoded is provided with a timestamp, with an accuracy of 100 ns. So you can simply buffer a number of samples, and send them out based on their timestamp, relative to the reference clock of your choice.
For some reason, LibVLC only provides timestamps with the audio samples, not with the video samples it decodes. So that makes it difficult to use LibVLC in this way. Initially it did not have an easy way to decode frames on-demand at all, but recently they added a libvlc_media_player_next_frame() function to skip to the next frame manually. Then it is up to you to figure out what the frame time should be exactly.
One issue here, though: if you let the library decode the video in realtime, it also automatically compensates for any performance problems, applying frame skipping when required. If you are decoding manually, at your own pace, you need to handle the situation where the decoder cannot keep up and your decode buffer runs dry. In that case you may need to manually skip the playback position in the decoder ahead, to keep in sync with the video output speed.
All in all, things aren’t always that straightforward when you don’t just let the video library decode the video by itself, and let it time and display the output itself.