Today I randomly stumbled upon some discussions about DirectX 12, Mantle and whatnot. It seems a lot of people somehow think that the whole idea of reducing draw call overhead was new for Mantle and DirectX 12. While some commenters managed to point out that even in the days of DirectX 11, there were examples of presentations from various vendors talking about reducing draw call overhead, that seemed to be as far back as they could go.
I, on the other hand, have witnessed the evolution of OpenGL and DirectX from an early stage, and I know that the issue of draw call overhead has always been around. In fact, it really came to the forefront when the first T&L hardware arrived. One example was the Quake renderer, which used a BSP tree to effectively depth-sort the triangles. This was a very poor case for hardware T&L, because it created a draw call for every individual triangle. Hardware T&L was fast if it could process large batches of triangles in a single go. But the overhead of setting the GPU up for hardware T&L was quite large, given that you had to initialize the whole pipeline with the correct state. So sending triangles one at a time in individual draw calls was very inefficient on that type of hardware. This was not an issue when all T&L was done on the CPU, since all the state was CPU-side anyway, and CPUs are efficient at branching, random memory access etc.
This led to the development of ‘leafy BSP trees’, where the geometry would not be sorted down to the individual triangle level. Instead, batches of triangles were grouped together into a single node, so that you could easily send larger batches of triangles to the GPU in a single draw call, and make the hardware T&L do its thing more efficiently. To give an idea of how old this concept is, a quick Google search turned up a discussion on BSP trees and their efficiency with T&L hardware on Gamedev.net from 2001.
But one classic presentation from NVIDIA that has always stuck in my mind is their Batch Batch Batch presentation from the Game Developers Conference in 2003. This presentation was meant to ‘educate’ developers on the true cost of draw calls on hardware T&L and early programmable shader hardware. To put it in perspective, they use an Athlon XP 2700+ CPU and a GeForce FX 5800 as their high-end system in that presentation, which would have been cutting-edge at the time.
What they explain is that even in those days, the CPU was a huge bottleneck for GPUs. There was so much time spent on processing a single call and setting up the GPU, that you basically got thousands of triangles ‘for free’ if you would just add them to that single call. At 130 triangles or fewer per batch, you are completely CPU-bound, even with the fastest CPU of the day.
So they explain that the key is not how many triangles you can draw per frame, but how many batches per frame. There is quite a hard limit to the number of batches you can render per frame, at a given framerate. They measured about 170k batches per second on their high-end system (and that was a synthetic test doing only the bare draw calls, nothing fancy). So if you assume 60 fps, you’d get 170k/60 ≈ 2833 batches per frame. At one extreme of the spectrum, that means that if you only send one triangle per batch, you could not render more than 2833 triangles per frame at 60 fps. And in practical situations, with complex materials, geometry, and the rest of the game logic running on the CPU as well, the number of batches will be a lot smaller.
At the other extreme however, you can take these 2833 batches per frame, and chuck each of them full of triangles ‘for free’. As they say, if you make a single batch 500 triangles, or even 1000 triangles large, it makes absolutely no difference. So with larger batches, you could easily get 2.83 million triangles on screen at the same 60 fps.
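The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model of the CPU-side ceiling, using the 170k batches per second and 60 fps figures quoted from the presentation. The function names are made up for illustration, and the model assumes the GPU itself never becomes the limit, which is only plausible for the batch sizes discussed here.

```python
# Back-of-the-envelope batch budget, based on the figures quoted above.
# Assumption: the CPU's batch submission rate is the only limit; the GPU's
# own triangle throughput is ignored. Function names are hypothetical.

BATCHES_PER_SECOND = 170_000  # measured rate quoted from the presentation
FPS = 60                      # target framerate


def batches_per_frame(batches_per_second: int, fps: int) -> int:
    """Whole number of batches the CPU can submit within one frame."""
    return batches_per_second // fps


def max_triangles_per_frame(batches_per_second: int, fps: int,
                            triangles_per_batch: int) -> int:
    """Upper bound on triangles per frame for a given batch size."""
    return batches_per_frame(batches_per_second, fps) * triangles_per_batch


if __name__ == "__main__":
    budget = batches_per_frame(BATCHES_PER_SECOND, FPS)
    print(f"batches per frame: {budget}")
    for tris in (1, 500, 1000):
        total = max_triangles_per_frame(BATCHES_PER_SECOND, FPS, tris)
        print(f"{tris:>4} triangles per batch -> {total:>9,} triangles per frame")
```

The batch budget stays fixed at 2833 per frame, so the triangle count scales linearly with the batch size: 2833 triangles at one triangle per batch, and 2,833,000 at 1000 triangles per batch.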
And even in 2003 they already warned that this situation was only going to get worse, since the trend was, and still is, that GPU performance scales much more quickly than CPU performance over time. So basically, the whole CPU overhead problem has been a thing since the early days of hardware T&L, not just since DirectX 11 or 12. These were the days of DirectX 7, 8 and 9. They included numbers for GeForce2 and GeForce4MX cards, which are DX7-level, and they all suffer from the same issue. Even a GeForce2MX can do nearly 20 million triangles per second if fed efficiently by the CPU.
So as you can imagine, a lot of effort has been put into both hardware and software to try and make draw calls more efficient: instancing, rendering to vertexbuffers, texture fetches from vertex shaders, redesigned state management, deferred contexts and whatnot. The current generation of APIs (DirectX 12, Vulkan, Mantle and Metal) is another step in reducing the bottlenecks surrounding draw calls. But although they reduce the cost of draw calls, they do not solve the problem altogether. It is still expensive to send a batch of triangles to the GPU, so you still need to feed the data efficiently. These APIs certainly don’t make draw calls free, and we’re nowhere near the ideal situation where you can fire off draw calls for single triangles and expect decent performance.
I hope you liked this bit of historical perspective. The numbers in the Batch Batch Batch presentation are very interesting.