Today I randomly stumbled upon some discussions about DirectX 12, Mantle and whatnot. It seems a lot of people somehow think that the whole idea of reducing draw call overhead was new for Mantle and DirectX 12. While some commenters managed to point out that even in the days of DirectX 11, there were examples of presentations from various vendors talking about reducing draw call overhead, that seemed to be as far back as they could go.
I on the other hand have witnessed the evolution of OpenGL and DirectX from an early stage, and I know that the issue of draw call overhead has always been around. In fact, it really came to the forefront when the first T&L hardware arrived. One example was the Quake renderer, which used a BSP tree to effectively depth-sort the triangles. This was a very poor case for hardware T&L, because it created a draw call for every individual triangle. Hardware T&L was fast if it could process large batches of triangles in a single go, but the overhead of setting the GPU up for hardware T&L was quite large, given that you had to initialize the whole pipeline with the correct state. So sending triangles one at a time in individual draw calls was very inefficient on that type of hardware. This was not an issue when all T&L was done on the CPU, since all the state was CPU-side anyway, and CPUs are efficient at branching, random memory access and so on.
This led to the development of ‘leafy BSP trees’, where triangles would not be sorted down to the individual triangle level. Instead, batches of triangles were grouped together into a single node, so that you could easily send larger batches of triangles to the GPU in a single draw call, and let the hardware T&L do its thing more efficiently. To give an idea of how old this concept is, a quick Google search turned up a discussion on BSP trees and their efficiency with T&L hardware on Gamedev.net from 2001.
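The payoff of that grouping can be sketched with some toy arithmetic (hypothetical function names for illustration, not Quake’s actual code): one draw call per triangle versus one per leaf.

```cpp
#include <cstddef>

// One draw call per triangle: the classic fine-grained BSP traversal.
std::size_t DrawCallsPerTriangle(std::size_t triangleCount) {
    return triangleCount;
}

// 'Leafy' BSP: splitting stops once a node holds up to batchSize triangles,
// and each leaf is submitted as a single draw call.
std::size_t DrawCallsLeafy(std::size_t triangleCount, std::size_t batchSize) {
    return (triangleCount + batchSize - 1) / batchSize;  // one call per leaf
}
```

For a 10,000-triangle scene that is 10,000 draw calls per frame with per-triangle sorting, but only 20 calls with 500-triangle leaves, at the cost of a coarser depth-sort.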
But one classic presentation from NVIDIA that has always stuck in my mind is their Batch Batch Batch presentation from the Game Developers Conference in 2003. This presentation was meant to ‘educate’ developers on the true cost of draw calls on hardware T&L and early programmable shader hardware. To put it in perspective, they use an Athlon XP 2700+ (2.167 GHz) CPU and a GeForce FX5800 as their high-end system in that presentation, which would have been cutting-edge at the time.
What they explain is that even in those days, the CPU was a huge bottleneck for rendering. There was so much time spent on processing a single draw call and setting up the GPU that you basically got thousands of triangles ‘for free’ if you just added them to that single call. At 130 triangles or fewer per batch, you are completely CPU-bound, even with the fastest CPU of the day.
So they explain that the key is not how many triangles you can draw per frame, but how many batches per frame. There is quite a hard limit to the number of batches you can render per frame, at a given framerate. They measured about 170k batches per second on their high-end system (and that was a synthetic test doing only the bare draw calls, nothing fancy). So if you would assume 60 fps, you’d get 170k/60 = 2833 batches per frame. At one extreme of the spectrum, that means that if you only send one triangle per batch, you could not render more than 2833 triangles per frame at 60 fps. And in practical situations, with complex materials, geometry, and the rest of the game logic running on the CPU as well, the number of batches will be a lot smaller.
At the other extreme however, you can take these 2833 batches per frame, and chuck each of them full of triangles ‘for free’. As they say, if you make a single batch 500 triangles, or even 1000 triangles large, it makes absolutely no difference. So with larger batches, you could easily get 2.83 million triangles on screen at the same 60 fps.
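The arithmetic above can be written out explicitly, using their measured 170k calls per second as the only input:

```cpp
// Back-of-the-envelope sketch using the presentation's synthetic measurement
// of roughly 170,000 bare draw calls per second on their high-end system.
constexpr int kBatchesPerSecond = 170000;

// The per-frame batch budget at a given framerate.
constexpr int BatchesPerFrame(int fps) {
    return kBatchesPerSecond / fps;            // 170000 / 60 = 2833
}

// Triangles per frame if every batch carries the same triangle count.
constexpr int TrianglesPerFrame(int fps, int trianglesPerBatch) {
    return BatchesPerFrame(fps) * trianglesPerBatch;
}
// One triangle per batch at 60 fps: 2,833 triangles per frame.
// 1,000 triangles per batch, at the same per-call CPU cost: 2,833,000.
```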
And even in 2003 they already warned that this situation was only going to get worse, since the trend was, and still is, that GPU performance scales much more quickly than CPU performance over time. So basically, ever since the early days of hardware T&L, the whole CPU overhead problem has been a thing. Not just since DirectX 11 or 12. These were the days of DirectX 7, 8 and 9 (they included numbers for GeForce2 and GeForce4 MX cards, which are DX7-level, and they all suffer the same issue; even a GeForce2 MX can do nearly 20 million triangles per second if fed efficiently by the CPU).
So as you can imagine, a lot of effort has been put into both hardware and software to try and make draw calls more efficient. Like the use of instancing, rendering to vertex buffers, supporting texture fetches from vertex shaders, redesigned state management, deferred contexts and whatnot. The current generation of APIs (DirectX 12, Vulkan, Mantle and Metal) are another step in reducing the bottlenecks surrounding draw calls. But although they reduce the cost of draw calls, they do not solve the problem altogether. It is still expensive to send a batch of triangles to the GPU, so you still need to feed the data efficiently. These APIs certainly don’t make draw calls free, and we’re nowhere near the ideal situation where you can fire off draw calls for single triangles and expect decent performance.
I hope you liked this bit of historical perspective. The numbers in the Batch Batch Batch presentation are very interesting.
Reducing draw call overhead mostly appears to be a side effect of the design of modern graphics APIs rather than an intentional outcome. IMO, there was a far bigger emphasis on introducing radically new concepts like pipeline state objects and making the APIs stateless, and that was the biggest contributor to reducing the perceived overhead.
Going forward into the future, I expect that minimizing changes to the PSOs will be key to keeping the overhead down.
Well, I don’t think PSOs are new. They go back to DirectX 10. If you look at DirectX 9, it was still basically a traditional state machine, where every state parameter could be set with an individual call.
In DirectX 10 they grouped the state into a small number of state objects (ID3D10BlendState, ID3D10RasterizerState, ID3D10DepthStencilState, ID3D10SamplerState). Each state object could be created once, which would allow the driver to ‘compile’ it to a hardware-specific ‘blob’ (and also perform any validation upfront), and then it could be bound to the pipeline in a single call.
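That create-once/bind-cheaply pattern can be sketched with a toy model (the names below are made up for illustration, this is not the actual D3D10 API; real code goes through e.g. ID3D10Device::CreateRasterizerState):

```cpp
#include <cstdint>
#include <memory>

// Application-side mutable description, analogous to a D3D10 *_DESC struct.
struct RasterizerDesc {
    bool wireframe = false;
    bool cullBack  = true;
};

// Immutable state object: validation and translation to a hardware-specific
// blob happen once, at creation time, instead of on every state change.
struct RasterizerState {
    std::uint32_t compiledBlob;
};

std::shared_ptr<const RasterizerState> CreateRasterizerState(const RasterizerDesc& d) {
    // The expensive part lives here and runs exactly once per unique state.
    std::uint32_t blob = (d.wireframe ? 1u : 0u) | (d.cullBack ? 2u : 0u);
    return std::make_shared<const RasterizerState>(RasterizerState{ blob });
}

// Binding the precompiled object each frame is a single cheap pointer store.
struct PipelineContext {
    std::shared_ptr<const RasterizerState> rasterizer;
    void RSSetState(std::shared_ptr<const RasterizerState> s) { rasterizer = std::move(s); }
};
```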
The PSOs in DX12 are basically an evolution of that: they now group all state into a single object. The reason for this is that hardware has evolved, and some of the state that was in separate objects in DX10/11 has now actually moved to a different part of the pipeline (often inserted as actual instructions into the shader code). This means that if you set one state object in DX10/11, it also triggers updating of other internal state, and often results in having to recompile the current shaders to process the new state.
So the idea is basically the same as it was in DX10, and can hardly be called radical in DX12. Evolutions in hardware just made the grouping chosen for DX10 ineffective, so a coarser grouping of state is now more efficient.
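The recompilation hazard can be made concrete with a toy driver model (entirely hypothetical names, no real driver code): with separately-set state, the driver may only discover at draw time that the current shader/state combination needs a fresh compile, while a PSO bakes the combination up front.

```cpp
struct Shader { int sourceId; };

struct CompiledShader {
    int sourceId;
    int bakedStateBits;  // state the hardware needs inside the shader code
};

// Expensive: stands in for patching state-dependent instructions into microcode.
CompiledShader CompileForHardware(Shader s, int stateBits) {
    return { s.sourceId, stateBits };
}

// D3D10/11-style: state and shaders are set separately, so a state change can
// force a recompile of the same shader in the middle of a frame.
struct SeparateStateDriver {
    CompiledShader cached{ -1, -1 };
    int recompiles = 0;
    void Draw(Shader s, int stateBits) {
        if (cached.sourceId != s.sourceId || cached.bakedStateBits != stateBits) {
            cached = CompileForHardware(s, stateBits);  // draw-time hitch
            ++recompiles;
        }
        // ... issue the draw ...
    }
};

// D3D12-style: the shader/state combination is compiled up front into a PSO;
// switching PSOs at draw time is just a pointer swap, never a recompile.
struct Pso { CompiledShader compiled; };
Pso CreatePso(Shader s, int stateBits) { return { CompileForHardware(s, stateBits) }; }
```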
I do think that reducing draw call overhead was a key point in the design of DX12 though. I think the (re)introduction of command lists serves two purposes:
1) Each individual call can now easily be handled in user mode only. You just record calls as simple instructions in a list. The driver can then process the entire list in one go, which is more efficient than processing it one call at a time.
2) Command lists decouple the setup of drawing from the actual rendering itself, which means that you can create the command lists on any thread you like, in the background, while rendering is running asynchronously. This does not necessarily reduce the draw call overhead itself, but it removes the synchronization overhead (and makes you add synchronization explicitly where you need it), which means you can leverage multiple cores/threads better.
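Both points can be sketched in a toy model (hypothetical names, not the real D3D12 interfaces): a command list is just a user-mode array that any thread can record into, and submission replays the whole array in one pass.

```cpp
#include <functional>
#include <vector>

// Recording only appends to a user-mode vector; nothing reaches the GPU yet.
struct CommandList {
    std::vector<std::function<void(int&)>> commands;
    void Draw(int triangleCount) {
        commands.push_back([triangleCount](int& total) { total += triangleCount; });
    }
};

// 'Submission': the driver walks the whole list in one go, which is cheaper
// than crossing into the driver once per individual call.
int Execute(const CommandList& list) {
    int trianglesDrawn = 0;
    for (const auto& cmd : list.commands) cmd(trianglesDrawn);
    return trianglesDrawn;
}
```

Because recording only touches the list itself, a worker thread can build the next frame’s list while the current frame renders, with an explicit join before submission standing in for the synchronization the API now makes you write yourself.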
Aside from that, the validation is now mostly removed from the API calls, which also reduces overhead. That was also a deliberate choice.
IMO, there’s hardly much resemblance between D3D12’s PSOs and D3D11’s state objects. PSOs, as you said, are an amalgamation of all the static state in the pipeline, but the shader programs are coupled to them as well. D3D11 state objects, on the other hand, group a set of smaller states according to the fixed-function stages of the pipeline; what’s more, D3D11 also uses separate shader objects.
It is this decoupling of state objects and shader objects that can become a potential source of the shader recompilations implied above.
It is this new found understanding of pipelines in modern gfx APIs that breaks our previous assumption that pipelines in older gfx APIs only contained the rendering state. The inclusion of binary blobs in PSOs is a profound conceptual change from D3D11’s state objects, and it has an adverse impact on certain sets of hardware.
Well, firstly, I was talking about D3D10, not D3D11, as D3D11 is pretty much the same API as far as state and shader management goes. Secondly, I think you’re saying the same thing as I did… more or less. You say “new found understanding of pipelines in modern gfx APIs”. But that’s viewing it backwards. The API is only there to drive a GPU. The pipeline is in the GPU, so the GPU dictates what the API should do. Which explains why older APIs had different types of state/shader management (they were aimed at the GPU architectures and pipelines of that era). As I already said. The whole shader recompilation thing is a recent development because newer GPU architectures moved more functionality from the fixed pipeline to the shader. Which means that certain state now has to be compiled into the shader, where it was physically separate from the shader units, in a fixed unit, in earlier GPUs.
I don’t think “new found understanding” makes much sense here. It’s not like people were clueless when they designed the earlier APIs, and suddenly with D3D12 the lightbulb went on. Even with these older APIs, people knew exactly what they were doing. It’s just that they were designing for different CPUs and GPUs, so the design requirements were entirely different from today. One thing that characterizes Direct3D is that Microsoft refreshes the entire API every few years, to keep it up-to-date with modern GPUs. The fact that we’re on D3D12, and not D3D2, shows that D3D12 is not exactly the first time that the API was changed. Especially D3D7 (hardware T&L, buffers in VRAM), D3D8 (programmable shaders), D3D10 (the aforementioned grouped state management, and removing the fixed-function pipeline altogether, focusing entirely on shader-based rendering) and D3D12 made big changes to the whole pipeline/programming model.
Not to mention that Microsoft actually designs the Direct3D APIs together with the major GPU vendors and game developers. So all the knowledge is there.
Other than that, apparently you see that as something revolutionary in D3D12, where I see D3D12’s state management as a logical evolution of the trend that started with D3D10. You’re entitled to your own opinion, but I’m not sure if it’s worth debating something like this. The technical story above is most important: when and why is shader recompilation triggered, and how can you avoid it, to be as efficient as possible?
Quick note: Command lists were first introduced in DirectX 11: