DirectX 12 and Vulkan: what it is, and what it isn’t

I often read comments in the vein of: “… but vendor A’s hardware is designed more for DX12/Vulkan than vendor B’s”. It’s a bit more complicated than that, because it is somewhat of a chicken-and-egg problem. So I thought I’d do a quick blog to try and explain it.

APIs vs hardware features

A large part of the confusion seems to be because the capabilities of hardware tend to be categorized by versions of the DirectX API. In a way that makes sense, since each new version of the DirectX API also introduces support for new hardware features. So this became a de-facto way of categorizing the hardware capabilities of GPUs. Since DirectX 11, we even have different feature levels that can be referred to.

As you can see, the main new features in DirectX 12 are Conservative Rasterization, Volume Tiled Resources and Rasterizer Ordered Views. But, as you can also see, these have been ‘backported’ to DirectX 11.3 as well, so apparently they are not specific to the DirectX 12 API.

But what is an API really? API stands for Application Programming Interface. It is the ‘thing’ that you ‘talk to’ when programming something, in this case graphics. And the ‘new thing’ about DirectX 12, Vulkan (and Metal and Mantle) is that the interface follows a new paradigm, a new programming model. In earlier versions of DirectX, the driver was responsible for tasks such as resource management and synchronization (e.g., if you first render to a buffer, and later want to use that buffer as a texture on some surface, the driver makes sure that the rendering to the buffer is complete before rendering with the texture starts).

These ‘next-gen’ APIs, however, work on a lower level, and give the programmer control over such tasks. Leaving it to the driver can work well in the general case, and makes things easier and less error-prone for the programmer. However, the driver has to work with all software, and will use a generic approach. By giving the programmer fine-grained control over the hardware, these tasks can be optimized specifically for an engine or game. This way the programmer can shave off redundant work and reduce overhead on the CPU side. The API calls are now lighter and simpler, because they don’t have to take care of all the bookkeeping, validation and other types of management. These have now been pushed to the engine code instead. On the GPU side, however, things generally stay the same, but more on that later.
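To make the ‘bookkeeping moves into the engine’ point a bit more concrete, here is a minimal sketch of resource state tracking in plain C++. It does not use any real graphics API: the `ResourceState` values, the `Barrier` struct and the `StateTracker` class are all names made up for this illustration. It only shows the kind of work the driver used to do for you: remember what state each resource is in, and record a transition only when the state actually changes.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical resource states, loosely modelled on the kind of states
// an explicit API asks the application to track.
enum class ResourceState { Present, RenderTarget, ShaderResource };

// One recorded state transition (a stand-in for a real resource barrier).
struct Barrier {
    std::string resource;
    ResourceState before;
    ResourceState after;
};

// Engine-side bookkeeping: remembers the current state of every resource
// and records a barrier only when the state actually changes.
class StateTracker {
public:
    void registerResource(const std::string& name, ResourceState initial) {
        states_[name] = initial;
    }

    // Request that 'name' be usable in state 'wanted'. Returns true if a
    // barrier was recorded, false if the resource was already in that state.
    // Throws std::out_of_range if the resource was never registered.
    bool require(const std::string& name, ResourceState wanted) {
        ResourceState& current = states_.at(name);
        if (current == wanted) {
            return false;  // redundant transition skipped
        }
        barriers_.push_back({name, current, wanted});
        current = wanted;
        assert(states_.at(name) == wanted);
        return true;
    }

    const std::vector<Barrier>& barriers() const { return barriers_; }

private:
    std::unordered_map<std::string, ResourceState> states_;
    std::vector<Barrier> barriers_;
};
```

In the render-to-buffer example above, the engine would call something like `require("shadowMap", ResourceState::ShaderResource)` before sampling the buffer as a texture. A generic driver has to work this out for every resource on every call, whereas an engine often knows in advance where the transitions are.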

Command lists

Another change in the programming model is that GPU commands are now ‘decoupled’ at the API side. Sending rendering commands to the GPU is now a two-step process:

Add all the commands you want to execute to a list
Execute your command list(s)

Classic rendering APIs are ‘immediate’ renderers: with an API call you send the command directly to the driver/GPU. Drivers might internally buffer this, and create/optimize their own command lists, but this is transparent to the programmer. A big problem in this programming model is that the order in which the commands are executed is important. That basically means that you can only use a single thread to send commands. If you were to use multiple threads, you’d have to synchronize them so that they all sent their commands in-order, which basically would mean they’d run one-after-another, so you might as well use a single thread.
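The difference between the two models can be sketched in a few lines of plain C++. This is a toy, not a real API: `FakeGpu`, `ImmediateContext`, `CommandList` and `executeCommandList` are invented names, and the ‘GPU’ is just a log of strings. The point is only that an immediate call reaches the GPU right away, while a recorded command does nothing until its list is closed and executed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the GPU: it just logs the commands it receives, in order.
struct FakeGpu {
    std::vector<std::string> log;
    void submit(const std::string& command) { log.push_back(command); }
};

// 'Immediate' style: every call goes straight to the driver/GPU.
class ImmediateContext {
public:
    explicit ImmediateContext(FakeGpu& gpu) : gpu_(gpu) {}
    void clear() { gpu_.submit("clear"); }
    void draw(int indexCount) { gpu_.submit("draw " + std::to_string(indexCount)); }

private:
    FakeGpu& gpu_;
};

// Two-step style: calls are only recorded; nothing reaches the GPU
// until the list has been closed and executed.
class CommandList {
public:
    void clear() { record("clear"); }
    void draw(int indexCount) { record("draw " + std::to_string(indexCount)); }
    void close() { closed_ = true; }
    bool closed() const { return closed_; }
    const std::vector<std::string>& commands() const { return commands_; }

private:
    void record(const std::string& command) {
        assert(!closed_ && "cannot record into a closed list");
        commands_.push_back(command);
    }
    std::vector<std::string> commands_;
    bool closed_ = false;
};

// Step 2: execute a closed list; its commands reach the GPU in recorded order.
inline void executeCommandList(FakeGpu& gpu, const CommandList& list) {
    assert(list.closed() && "close the list before executing it");
    for (const std::string& command : list.commands()) {
        gpu.submit(command);
    }
}
```

Because recording touches only the list itself, and not the GPU, it no longer matters which thread does the recording. Only the order in which closed lists are executed matters.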

DirectX 11 tried to work around this by introducing ‘deferred’ contexts. You would have one ‘immediate’ context, which would execute all commands immediately. But you could create additional contexts, which would buffer commands in a list, which you could later hand down to the immediate context to execute.

There were however two problems with this approach:

The deferred contexts supported only a subset of all commands
Only nVidia’s implementation managed to get significant performance from this

To clarify that second point, Futuremark built an API overhead test, which includes tests with DX11 using immediate and deferred contexts, with a single or multiple threads. See AnandTech’s review of this test.

As you can see, this feature does absolutely nothing on AMD hardware. They are stuck at 1.1M calls regardless of what technique you use, or how many cores you throw at it.

With nVidia however, you see that with 4 or 6 cores, it goes up to 2.2M-2.3M calls. Funnily enough, nVidia’s performance on the single-threaded DX11 code also goes up with the 6-core machine, so the total gains from this technique are not very dramatic. Apparently nVidia already performs some parallel processing inside the driver.

DirectX 12 takes this concept further. You now have a command queue, in which you can queue up command lists, which will be executed in-order. The commands inside the command list will also be executed in-order. You can create multiple command lists, and create a thread for each list, to add the commands to it, so that they can all work in parallel. There are no restrictions on the command lists anymore, like there were with the deferred context in DX11 (technically you no longer have an ‘immediate’ context in DX12, they are all ‘deferred’).

An added advantage is that you can re-use these command lists. In various cases, you want to send the same commands every frame (to render the same objects and such), so you can now remove redundant work by just using the same command list over and over again.
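As a rough illustration of both ideas (parallel recording and re-use), here is a small sketch in standard C++ with `std::thread`. Nothing here is DirectX: a ‘command list’ is simply a vector of strings, and `recordSlice`, `recordInParallel` and `CommandQueue` are hypothetical names. Each thread fills its own list, so no locking is needed, and the queue then executes the lists strictly in submission order.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

using CommandList = std::vector<std::string>;  // recorded commands, in order

// Records the draw calls for one slice of the scene into its own list.
// Each worker thread owns exactly one list, so no locking is needed.
void recordSlice(CommandList& list, int firstObject, int objectCount) {
    for (int i = 0; i < objectCount; ++i) {
        list.push_back("draw object " + std::to_string(firstObject + i));
    }
}

// Records 'listCount' lists in parallel, one thread per list.
std::vector<CommandList> recordInParallel(int listCount, int objectsPerList) {
    assert(listCount > 0 && objectsPerList >= 0);
    std::vector<CommandList> lists(static_cast<std::size_t>(listCount));
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < lists.size(); ++i) {
        const int firstObject = static_cast<int>(i) * objectsPerList;
        workers.emplace_back(recordSlice, std::ref(lists[i]), firstObject, objectsPerList);
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
    return lists;
}

// Stand-in for a command queue: lists are executed in submission order,
// and the commands inside each list in recording order.
struct CommandQueue {
    std::vector<std::string> executed;
    void execute(const std::vector<CommandList>& lists) {
        for (const CommandList& list : lists) {
            executed.insert(executed.end(), list.begin(), list.end());
        }
    }
};
```

Calling `execute` with the same lists on the next frame replays exactly the same commands without recording anything again, which is the re-use case. In a real engine you would of course only do that for lists whose contents do not change between frames.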

Honourable mention for Direct3D 1 here: The first version of Direct3D actually used a very similar concept to command lists, known as ‘execute buffers’. You would first store your commands as bytecode in an ‘execute buffer’, and then execute the buffer. Technically this could be used in multi-threaded environments in much the same way: use multiple threads, which each fill their own execute buffer in parallel.

Asynchronous compute

Why is there a queue for the command lists, you might ask? Can’t you just send the command lists directly to an Execute( myList ) function? The answer is: there can be more than one queue. You can see this as a form of ‘GPU multithreading’: you can have multiple command lists executing at the same time. If you want to compare it to CPU multithreading, you could view a command queue as a thread, and a command list as an instruction stream (a ‘ThreadProc’ that is called when the thread is running).

There are three different classes of queues and command lists:

Graphics
Compute
DMA/Copy

The idea behind this is that modern GPUs are capable of performing multiple tasks at the same time, since they use different parts of the GPU. E.g., you can upload a texture to VRAM via DMA while you are also rendering and/or performing compute tasks (previously this was done automatically by the driver).

The most interesting new feature here is that you can run a graphics task and a compute task together. The classic example of how you can use this is rendering shadowmaps. Shadowmaps do not need any pixel shading, they just need to store a depth value. So you are mainly running vertex shaders and using the rasterizer. In most cases, your geometry is not all that complex, so there are relatively few vertices that need processing, leaving a lot of ALUs on the GPU sitting idle. With these next-gen APIs you can now execute a compute task at the same time, and make use of the ALUs that would otherwise sit idle (compute does not need the rasterizer).

This is called ‘asynchronous’ compute, because, like with conventional multithreading on the CPU, you are scheduling two (or more) tasks to run concurrently, and you don’t really care about which order they run in. They can be run at the same time if the hardware is capable of it, or they can be run one-after-another, or they can switch multiple times (time-slicing) until they are both complete. CPUs likewise have a number of ways to run multiple threads (single-core, multi-core, multi-CPU, HyperThreading), and the OS will use a combination of techniques to schedule threads on the available hardware (see also my earlier blog). You may care about the priority of both, so that you can allocate more resources to one of them, to make it complete faster. But in general, they are running asynchronously. You need to re-synchronize by checking that they have both triggered their event to signal that they have completed.
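Sticking with the CPU analogy, the ‘run two queues, then re-synchronize’ pattern can be sketched with ordinary threads. This is only an analogy in standard C++, not GPU code: `Fence`, `Queue` and `runFrame` are names invented for this sketch, each ‘queue’ is simply a thread, and the two workloads are dummy loops. The fence is a counter that a queue bumps when it finishes, and that the caller waits on.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>

// Minimal fence: a counter that a queue bumps when it reaches a certain
// point, and that other code can wait on.
class Fence {
public:
    void signal(std::uint64_t value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (value > value_) value_ = value;
        }
        reached_.notify_all();
    }
    void wait(std::uint64_t value) {
        std::unique_lock<std::mutex> lock(mutex_);
        reached_.wait(lock, [&] { return value_ >= value; });
    }

private:
    std::mutex mutex_;
    std::condition_variable reached_;
    std::uint64_t value_ = 0;
};

// Stand-in for a command queue: runs its work on its own thread and
// signals the fence when the work is done.
class Queue {
public:
    void submit(std::function<void()> work, Fence& fence, std::uint64_t signalValue) {
        assert(!thread_.joinable() && "this sketch allows one submission per queue");
        thread_ = std::thread([work, &fence, signalValue] {
            work();
            fence.signal(signalValue);
        });
    }
    ~Queue() {
        if (thread_.joinable()) thread_.join();
    }

private:
    std::thread thread_;
};

// Runs a 'graphics' and a 'compute' task on separate queues, in no
// particular order, and re-synchronizes by waiting on both fences.
inline int runFrame() {
    int shadowMapTexels = 0;  // written only by the graphics queue
    int postFxPixels = 0;     // written only by the compute queue
    Fence graphicsDone;
    Fence computeDone;
    {
        Queue graphics;
        Queue compute;
        graphics.submit([&] { for (int i = 0; i < 1000; ++i) ++shadowMapTexels; }, graphicsDone, 1);
        compute.submit([&] { for (int i = 0; i < 500; ++i) ++postFxPixels; }, computeDone, 1);
        graphicsDone.wait(1);
        computeDone.wait(1);
    }
    return shadowMapTexels + postFxPixels;
}
```

Whether the two tasks really overlap, or run one after the other, is up to the scheduler. The code is correct either way, because it only relies on both fences having been signalled before the results are read.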

Now that the introduction is over…

So, how does it look when you actually want to render something? Well, let’s have a (slightly simplified) look at rendering an object in DirectX 11:

// Set the viewport and scissor rectangle.
D3D11_VIEWPORT viewport = m_deviceResources->GetScreenViewport();
m_immediateContext->RSSetViewports(1, &viewport);
m_immediateContext->RSSetScissorRects(1, &m_scissorRect);

// Send drawing commands.
ID3D11RenderTargetView* renderTargetView = m_deviceResources->GetRenderTargetView();
ID3D11DepthStencilView* depthStencilView = m_deviceResources->GetDepthStencilView();
m_immediateContext->ClearRenderTargetView(renderTargetView, DirectX::Colors::CornflowerBlue);
m_immediateContext->ClearDepthStencilView(depthStencilView, D3D11_CLEAR_DEPTH, 1.0f, 0);

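// The depth/stencil view is passed directly; only the render target views are passed as an array.
m_immediateContext->OMSetRenderTargets(1, &renderTargetView, depthStencilView);

m_immediateContext->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
// DirectX 11 also wants a stride and offset per vertex buffer.
// m_vertexStride is assumed to hold the size of one vertex; the buffers are assumed to be raw ID3D11Buffer pointers.
UINT offset = 0;
m_immediateContext->IASetVertexBuffers(0, 1, &m_vertexBuffers, &m_vertexStride, &offset);
m_immediateContext->IASetIndexBuffer(m_indexBuffer, DXGI_FORMAT_R16_UINT, 0);
m_immediateContext->DrawIndexedInstanced(36, 1, 0, 0, 0);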

And in DirectX 12 it would look like this (again, somewhat simplified):

// Set the viewport and scissor rectangle.
D3D12_VIEWPORT viewport = m_deviceResources->GetScreenViewport();
m_commandList->RSSetViewports(1, &viewport);
m_commandList->RSSetScissorRects(1, &m_scissorRect);

// Indicate this resource will be in use as a render target.
CD3DX12_RESOURCE_BARRIER renderTargetResourceBarrier =
	CD3DX12_RESOURCE_BARRIER::Transition(m_deviceResources->GetRenderTarget(), D3D12_RESOURCE_STATE_PRESENT, D3D12_RESOURCE_STATE_RENDER_TARGET);
m_commandList->ResourceBarrier(1, &renderTargetResourceBarrier);

// Record drawing commands.
D3D12_CPU_DESCRIPTOR_HANDLE renderTargetView = m_deviceResources->GetRenderTargetView();
D3D12_CPU_DESCRIPTOR_HANDLE depthStencilView = m_deviceResources->GetDepthStencilView();
m_commandList->ClearRenderTargetView(renderTargetView, DirectX::Colors::CornflowerBlue, 0, nullptr);
m_commandList->ClearDepthStencilView(depthStencilView, D3D12_CLEAR_FLAG_DEPTH, 1.0f, 0, 0, nullptr);

m_commandList->OMSetRenderTargets(1, &renderTargetView, false, &depthStencilView);

m_commandList->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
m_commandList->IASetVertexBuffers(0, 1, &m_vertexBufferView);
m_commandList->IASetIndexBuffer(&m_indexBufferView);
m_commandList->DrawIndexedInstanced(36, 1, 0, 0, 0);

// Indicate that the render target will now be used to present when the command list is done executing.
CD3DX12_RESOURCE_BARRIER presentResourceBarrier =
	CD3DX12_RESOURCE_BARRIER::Transition(m_deviceResources->GetRenderTarget(), D3D12_RESOURCE_STATE_RENDER_TARGET, D3D12_RESOURCE_STATE_PRESENT);
m_commandList->ResourceBarrier(1, &presentResourceBarrier);

m_commandList->Close();

// Execute the command list.
ID3D12CommandList* ppCommandLists[] = { m_commandList.Get() };
m_deviceResources->GetCommandQueue()->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);

As you can see, the actual calls are very similar. The functions mostly have the same names, and even the parameters are mostly the same. At a higher level, most of what you do is exactly the same: you use a rendertarget, a depth/stencil surface, you set up a viewport and scissor rectangle. Then you clear the rendertarget and depth/stencil for a new frame, and send a list of triangles to the GPU, which is stored in a vertex buffer and index buffer pair.

I left some parts out for simplicity: you would already have initialized a vertex shader and pixel shader at an earlier stage, and already uploaded the geometry to the vertex and index buffers. The code there is again very similar between the APIs, although DirectX 12 again requires a bit more code, because you have to tell the API in more detail what you actually want (uploading the geometry also requires a command list there).

So what the GPU actually has to do is exactly the same, regardless of whether you use DirectX 11 or DirectX 12. The differences are mainly on the CPU side, as you can see.

The same argument also extends to Vulkan. The code may look a bit different from what you’re doing in DirectX 12, but in essence, you’re still creating the same vertex and index buffers, and sending the same triangle list draw command to the GPU, rendering to a rendertarget and depth/stencil buffer.

So, what this means is that you do not really need to ‘design’ your hardware for DirectX 12 or Vulkan at all. The changes are mainly on the API side, and affect the workload of the CPU and driver, not the GPU. Which is also why DirectX 12 supports feature levels of 11_x: the API can also support hardware that pre-dates DirectX 12.

Chickens and eggs

But, how exactly did these new hardware features arrive in the DirectX 11.3 and 12 APIs? And why exactly did this new programming model emerge in these new APIs?

The first thing to point out is that Microsoft does not develop hardware. This means that Microsoft can’t just think up new hardware features out-of-the-blue and hope that hardware will support it. For each update of DirectX, Microsoft will have meetings with the big players on the hardware market, such as Intel, nVidia and AMD (in recent years, also Qualcomm, for mobile devices). These IHVs will give Microsoft input on what kind of features they would like to include in the next DirectX. Together with Microsoft, these features will then be standardized in a way that they can be implemented by all IHVs. So this is a somewhat democratic process.

Aside from IHVs, Microsoft also includes some of the larger game engine developers, to get more input from the software/API side of things. Together they will try to work out solutions to current problems and bottlenecks in the API, and also work out ways to include the new features presented by the IHVs. In some cases, the IHVs will think of ways to solve these bottlenecks by adding new hardware features. After all, game engines and rendering algorithms also evolve over time, as does the way they use the API and hardware. For example, in the early days of shaders, you wrote separate shaders for each material, and these used few parameters. These days, you tend to use the same shader for most materials, and only the parameters change. So at one point, switching between shaders efficiently was important, but now updating shader parameters efficiently is more important. Different requirements call for different APIs (so it’s not like older APIs were ‘bad’, they were just designed against different requirements, a different era with different hardware and different rendering approaches).

So, as you can see, it is a bit of cross-pollination between all the parties. Sometimes an IHV comes up with a new approach first, and it is included in the API. Other times, developers come up with problems/bottlenecks first, and then the API is modified, and hardware redesigned to make it work. Since not all hardware is equal, it is also often the case that one vendor had a feature already, while others had to modify their hardware to include it. So for one vendor, the chicken came first, for others the egg came first.

The development of the API is an iterative process, where the IHVs will work closely with Microsoft over a period of time, to make new hardware and drivers available for testing, and move towards a final state of the API and driver model for the new version of DirectX.

But what does it mean?

In short, it means it is pretty much impossible to say how much of a given GPU was ‘designed for API X or Y’, and how much of the API was ‘designed for GPU A or B’. It is a combination of things.

In DirectX 12/Vulkan, it seems clear that Rasterizer Ordered Views came from Intel, since they already had the feature on their DirectX 11 hardware. It looks like nVidia designed this feature into Maxwell v2 for the upcoming DX11.3 and DX12 APIs. AMD has yet to implement the feature.

Conservative rasterization is not entirely clear. nVidia was the first to market with the feature, in Maxwell v2. However, Intel followed not much later, and implemented it at Tier 3 level. So I cannot say with certainty whether it originated from nVidia or Intel. AMD has yet to implement the feature.

Asynchronous compute is a bit of a special case. The API does not consider it a specific hardware feature, and leaves it up to the driver how to handle multiple queues. The idea most likely originated from AMD, since they have had support for running graphics and compute at the same time since the first GCN architecture, and the only way to make good use of that is to have multiple command queues. nVidia added limited support in Maxwell v2 (they had asynchronous compute support in CUDA since Kepler, but they could not run graphics tasks in parallel), and more flexible/efficient support in Pascal. Intel has yet to support this feature (that is, they support code that uses multiple queues, but as far as I know, they cannot actually run graphics and compute tasks in parallel, so they cannot use it to improve performance by better ALU usage).

Also, you can compare performance differences from DirectX 11 to 12, or from OpenGL to Vulkan… but it is impossible to draw conclusions from these results. Is the DirectX 11 driver that good, or is the DirectX 12 engine that bad for a given game/GPU? Or perhaps the other way around? Was OpenGL that bad, and is the Vulkan engine that good for a given game/GPU?

Okay, but what about performance?

The main advantage of this new programming model in DirectX 12/Vulkan is also a potential disadvantage. I see a parallel with the situation of compilers versus assemblers on the CPU: it is possible for an assembly programmer to outperform a compiler, but there are two main issues here:

Compilers have become very good at what they do, so you have to be REALLY good to even write assembly that is on par with what a modern compiler will generate.
Optimizing assembly code can only be done for one CPU at a time. Chances are that the tricks you use to maximize performance on CPU A, will not work, or even be detrimental to performance on CPU B. Writing code that works well on all CPUs is even more difficult. Compilers however are very good at this, and can also easily optimize code for multiple CPUs, and include multiple codepaths in the executable.

In the case of DirectX 12/Vulkan, you will be taking on the driver development team. In DirectX 11/OpenGL, you had the advantage that the low-level resource management, synchronization and such, was always done by the driver, which was optimized for a specific GPU, by the people who built that GPU. So like with compilers, you had a very good baseline of performance. As an engine developer, you have to design and optimize your engine very well, before you get on par with these drivers (writing a benchmark that shows that you can do more calls per second in DX12 than in DX11 is one thing. Rendering an actual game more efficiently is another).

Likewise, because of the low-level nature of DirectX 12/Vulkan, you need to pay more attention to the specific GPUs and videocards you are targeting. The best way to manage your resources on GPU A might not be the best way on GPU B. Normally the driver would take care of it. Now you may need to write multiple paths, and select the fastest one for each GPU.
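One way to handle ‘multiple paths, pick the fastest’ is to measure each candidate on the machine the game is actually running on, rather than hard-coding a choice per vendor. A minimal sketch, assuming the engine can somehow time a path: `RenderPath`, `selectFastestPath` and the `measureFrameMs` callback are all hypothetical, and how you obtain a trustworthy measurement is left open here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// A named way of doing the same rendering work (hypothetical).
struct RenderPath {
    std::string name;
};

// Returns the index of the path with the lowest measured frame time.
// 'measureFrameMs' is supplied by the caller; in a real engine it would
// render a few representative frames with that path on the current GPU.
inline std::size_t selectFastestPath(
        const std::vector<RenderPath>& paths,
        const std::function<double(const RenderPath&)>& measureFrameMs) {
    assert(!paths.empty());
    std::size_t best = 0;
    double bestMs = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < paths.size(); ++i) {
        const double ms = measureFrameMs(paths[i]);
        if (ms < bestMs) {
            bestMs = ms;
            best = i;
        }
    }
    return best;
}
```

This does not make the problem go away, it only moves it: you still have to write and maintain every path, and the measurement only tells you which of your own paths is fastest, not whether any of them beats what a tuned driver would have done.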

Asynchronous compute is especially difficult to optimize for. Running two things at the same time means you have to share your resources. If this is not balanced well, then one task may be starving the other of resources, and you may actually get lower performance than if you would just run the tasks one after another.

What makes it even more complicated is that this balance is specific not only to the GPU architecture, but even to the specific model of video card, to a certain extent. If we take the above example of rendering shadowmaps while doing a compute task (say postprocessing the previous frame)… What if GPU A renders shadowmaps quickly and compute tasks slowly, but GPU B renders the shadowmaps slowly and compute tasks quickly? This would throw off the balance. For example, once the shadowmaps are done, the next graphics task might require a lot of ALU power, and the compute task that is still running will be starving that graphics task.

And things like rendering speed will depend on various factors, including the relative speed of the rasterizers to the VRAM. So, even if two videocards use the same GPU with the same rasterizers, variations in VRAM bandwidth could still disturb the balance.
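To put some numbers on this balance, here is a deliberately crude toy model. It is not based on measurements of any real GPU: the `GpuProfile` fields, the formulas and the values you feed it are all assumptions made up for illustration. The model says that compute work which fits into the ALUs left idle during the shadowmap pass is free, while work that spills over into the next, ALU-heavy graphics pass has to share the ALUs, and costs more than it would have cost on its own (the `sharingEfficiency` factor).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// All times are in the same arbitrary unit (say, milliseconds).
struct GpuProfile {
    double shadowTime;         // duration of the shadowmap pass
    double idleAluFraction;    // share of ALUs left idle during that pass (0..1)
    double computeWork;        // time the compute task needs with all ALUs to itself
    double nextGraphicsWork;   // time the following ALU-heavy pass needs on its own
    double sharingEfficiency;  // (0..1]: how efficiently contended ALUs are used
};

// Everything runs one after another.
inline double serialFrameTime(const GpuProfile& g) {
    return g.shadowTime + g.computeWork + g.nextGraphicsWork;
}

// Compute runs alongside the shadowmap pass; whatever does not fit
// spills into the next graphics pass and pays a contention penalty.
inline double asyncFrameTime(const GpuProfile& g) {
    assert(g.sharingEfficiency > 0.0 && g.sharingEfficiency <= 1.0);
    const double hidden = std::min(g.computeWork, g.idleAluFraction * g.shadowTime);
    const double spill = g.computeWork - hidden;
    return g.shadowTime + g.nextGraphicsWork + spill / g.sharingEfficiency;
}
```

Take a made-up GPU B with a slow shadowmap pass and fast compute: `{2.0, 0.5, 1.0, 3.0, 0.6}`. All of the compute work hides behind the shadowmaps, so the frame drops from 6.0 to 5.0. Now take a made-up GPU A with fast shadowmaps and slow compute: `{0.5, 0.5, 2.0, 3.0, 0.6}`. Only 0.25 of the compute work is hidden, the remaining 1.75 spills over, and the frame goes from 5.5 to roughly 6.4. Same code, same technique, opposite outcome.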

On Xbox One and PlayStation 4, asynchronous compute makes perfect sense. You only have a single target to optimize for, so you can carefully tune your code for the best performance. On a Windows system however, things are quite unpredictable. Especially looking to the future. Even if you were to optimize for all videocards available today, that is no guarantee that the code will still perform well on future videocards.

So we will have to see what the future brings. Firstly, will engine developers actually be able to extract significant gains from this new programming model? Secondly, will these gains stand the test of time? As in, are these gains still available 1, 2 or 3 generations of GPUs from now, or will some code actually become suboptimal on future GPUs? Code which, when handled by an optimized ‘high-level’ driver such as in DirectX 11 or OpenGL, will actually be faster than the DirectX 12/Vulkan equivalent in the engine code? I think this is a more interesting aspect than which GPU is currently better for a given API.

11 Responses to DirectX 12 and Vulkan: what it is, and what it isn’t

dealwithit says:

August 13, 2016 at 2:25 am

Good article about the subject

stefgtx says:

August 13, 2016 at 2:58 am

Nice article Scali!! I have one question though, being an 8350 and a GTX1070 user and given the “moar cores” of the FX, I ran the 3DMark API overhead test and I got for DX11 ST: 1.29M, DX11 MT: 2.2M and with DX12 I got 17.3M draw calls. I don’t get that massive scaling tho. Enabling DX12 in ROTTR for example I get performance increases in graphics-intensive places… I have read many of your articles about multithreading and multicore and I know how FX compares to the Core i series, but it seems a little weird that DX11 MT and DX12 have so much difference. Also the scaling between DX11 ST and DX11 MT is roughly 1.somethingX. Even with i5’s the scaling is dramatic.

- Scali says:
  
  August 13, 2016 at 10:46 am
  
  Not sure what your question is… but the API Overhead test just tries to do as many calls as possible. It shows you a ‘best case’ performance of each API, but it’s not representative of performance in actual games, where the primary goal is to render a game, not to do as many calls as possible.
  So as I said elsewhere in earlier blogs on Mantle and such, the massive increase in draw calls will generally translate into 10-15% better performance in games, when you have a ‘normal’ CPU paired to the GPU (rather than an extremely low-end one).
  Games do a lot more than just making draw calls, and the draw calls themselves will be heavier on the GPU than in the API overhead test. None of that speeds up by using a different API, so the draw call overhead only makes a small difference in the bigger picture.
  
Alexandar Ž says:

August 18, 2016 at 8:20 am

Interesting read.
On a related note – is there anything that can be done with DX12, but not DX 11.3 in terms of visual effects?
As far as I can remember every new version of DX so far brought some new eye-candy that was previously not possible, but every game with DX12/Vulkan so far looks exactly the same under the new API as it does under the older DX11 renderer.
If it’s just about performance benefits – seems like a LOT of work for not that much gain.

- Scali says:
  
  August 18, 2016 at 11:36 am
  
  Not directly, no. DX11.3 basically supports the same hardware features as DX12 does, so you should be able to do the same visual effects on both.
  On DX12 you can just tweak the code at a lower level to get more performance.
  In fact, I believe the only game that even uses DX11.3 features currently is Rise of the Tomb Raider (it has VXAO which uses conservative rasterization). Everything else is pretty much just using the vanilla DX11.0 features (aside from one or two games that can use rasterizer ordered views on Intel hardware via an extension, they predate DX11.3, so not standardized).
  
  So yes, as I said before, it feels a lot like tessellation: AMD doesn’t have the new DX11.3/FL12_1 features, so they are holding games back from using it (the only thing that DX12/Vulkan exposes to them is async compute, which explains why they’re so focused on this.. aside from Vulkan extensions of course, which they use in DOOM). As a result, we’re still getting the same level of graphics we’ve had for the past 5 years or so, with a small performance boost, if you’re lucky. Not something I would get excited about.
  
  - Sam says:
    
    December 8, 2016 at 8:40 am
    
    AMD made a good political move by getting the contract for both the PS4 and Xbox1. Doing so they pretty much dictated the standard feature set that games will be built around.
  - Scali says:
    
    December 8, 2016 at 8:51 am
    
    I think it’s the opposite rather: the GPUs for PS4 and Xbox1 were designed around the DX11 API and its featureset. You can’t dictate much, if at all, with consoles these days.
    
    The thing that concerns me more is this: DX12 offers various additional features. Both Intel and NV have had support for these features. AMD does not.
  - Sam says:
    
    December 8, 2016 at 8:59 am
    
    Well, in a way consoles dictate what features game developers build their game around – programmer time will be spent on getting the most out of those features and art will be designed around said features.
    
    Most game developers aren’t going to spend time using a feature only a small subset of PC gamers with top of the line rigs have and build assets around it.
    
    Seriously, most can’t even be bothered to create a decent UI for the PC release. ><
  - Scali says:
    
    December 8, 2016 at 9:19 am
    
    Depends. When consoles are new, then perhaps. But once they get older, the console limitations become irrelevant for the PC game. Look at the Xbox360/PS3 era. These consoles were DX9-class. DX10 was released not much later on PC. And DX11 was also released before a console supported it. During this period, many games started using DX10 and DX11, and eventually games didn’t even support DX9 at all anymore.

Sam says:

December 8, 2016 at 8:37 am

So the scheduling of multiple queues for async compute is handled by the driver/GPU(hardware scheduler?)?

Overall performance from async compute will be partly dependent on how well the queues are scheduled to maximize ALU usage?

- Scali says:
  
  December 8, 2016 at 8:49 am
  
  Yes, so overly simplified there are two variables here that determine performance:
  1) How well are the ALUs used during serial workloads?
  2) How well can the scheduler combine workloads in parallel to improve ALU usage?
