AMD Zen: a bit of a deja-vu?

AMD has released the first proper information on their new Zen architecture. Anandtech seems to have done some of the most in-depth coverage, as usual. My first impression is that of a deja-vu… in more than one way.

Firstly, it reminds me of what AMD did a few years ago on the GPU front: they ditched their VLIW-based architecture and moved to a SIMD-based architecture, which was remarkably similar to nVidia's (nVidia had been using SIMD-based architectures since their 8800GTX). In this case, Zen seems to follow Intel's Core i7 architecture quite closely. They are moving back to high-IPC cores, just as in their K7/K8 heyday (which at the time followed Intel's P6 architecture closely), and they seem to target lower clockspeeds, around the 3-4 GHz area where Intel also operates. They are also adopting a micro-op cache, something Intel has been doing for a long time.

Secondly, AMD is abandoning their CMT approach and going for a more conventional SMT approach. This is another one of those "I told you so" moments. Even before Bulldozer was launched, I already said that having 2 ALUs hardwired per core was not going to work well. Zen now uses 4 ALUs per two logical cores, so technically they still have the same number of ALUs per 'module'. However, like the Core i7, each thread can now use all 4 ALUs, so you get much better IPC for single threads. This again is something I said years ago. AMD apparently agrees. Their fanbase did not, sadly.

We can only wonder why AMD did not go for SMT right away with Bulldozer. I personally think that AMD knew all along that SMT was the better option; after all, their CMT was effectively a 'lightweight' SMT, where only the FPU portion did proper SMT. I think it may have been a combination of two factors:

  1. SMT was originally developed by IBM, and Intel has been using their HyperThreading variation for many years. Both companies have collected various patents on the technology over the years. Perhaps for AMD it was not worthwhile to use full-blown SMT, because it would touch on too many patents and the licensing costs would be prohibitive. It could be that some of these patents have now expired, so the equation has changed in AMD's favour. It could also be that AMD is now willing to take a bigger risk, because they have to get back in the CPU race at all costs.
  2. Doing a full-blown SMT implementation for the entire CPU may have been too much of a step for AMD in a single generation. AMD only has a limited R&D budget, so they may have had to spread SMT out over two generations. We don't know how long it took Intel to develop HyperThreading, but we do know that even though their first implementation in the Pentium 4 worked well enough in practice, there were still various small bugs and glitches in their implementations, not just stability-wise but also security-wise. The concept of SMT is not that complicated, but shoehorning it into the massively complex x86 architecture, with its tons of legacy software that needs to keep working flawlessly, is an entirely different matter. This is quite a risky undertaking, and proper validation can take a long time.

At any rate, Zen looks more promising than Bulldozer ever did. I think AMD made a wise choice in going back to 'follow the leader' mode. Not necessarily because Intel's architecture is the right one, but because Intel's architecture is the most widespread one. I have said the same thing about the Pentium 4 in the past: the architecture itself was not necessarily as bad as people think. Its biggest disadvantage was that it did not handle code optimized for the P6 architecture very well, and most applications had been developed for P6. If all applications had been recompiled with Pentium 4 optimizations, it would already have made quite a different impression, let alone if developers had actually optimized their code specifically for the Pentium 4's strengths (something we mainly saw with video encoding/decoding and 3D rendering).

Bulldozer was facing a similar problem: it required a different type of software. If Intel couldn’t pull off a big change in software optimization with the Pentium 4, then a smaller player like AMD certainly wouldn’t either. That is the main reason why I never understood Bulldozer.


DirectX 12 and Vulkan: what it is, and what it isn’t

I often read comments in the vein of: “… but vendor A’s hardware is designed more for DX12/Vulkan than vendor B’s”. It’s a bit more complicated than that, because it is somewhat of a chicken-and-egg problem. So I thought I’d do a quick blog to try and explain it.

APIs vs hardware features

A large part of the confusion seems to be because the capabilities of hardware tend to be categorized by versions of the DirectX API. In a way that makes sense, since each new version of the DirectX API also introduces support for new hardware features. So this became a de-facto way of categorizing the hardware capabilities of GPUs. Since DirectX 11, we even have different feature levels that can be referred to.

The main new hardware features in DirectX 12 are Conservative Rasterization, Volume Tiled Resources and Rasterizer Ordered Views. But these have been 'backported' to DirectX 11.3 as well, so apparently they are not specific to the DirectX 12 API.

But what is an API really? API stands for Application Programming Interface. It is the ‘thing’ that you ‘talk to’ when programming something, in this case graphics. And the ‘new thing’ about DirectX 12, Vulkan (and Metal and Mantle) is that the interface follows a new paradigm, a new programming model. In earlier versions of DirectX, the driver was responsible for tasks such as resource management and synchronization (eg, if you first render to a buffer, and later want to use that buffer as a texture on some surface, the driver makes sure that the rendering to the buffer is complete before rendering with the texture starts).

These ‘next-gen’ APIs however, work on a lower level, and give the programmer control over such tasks. Leaving it to the driver can work well in the general case, and makes things easier and less error-prone for the programmer. However, the driver has to work with all software, and will use a generic approach. By giving the programmer fine-grained control over the hardware, these tasks can be optimized specifically for an engine or game. This way the programmer can shave off redundant work and reduce overhead on the CPU side. The API calls are now lighter and simpler, because they don’t have to take care of all the bookkeeping, validation and other types of management. These have now been pushed to the engine code instead. On the GPU side, things generally stay the same however, but more on that later.
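To give a rough idea of what that looks like in practice, here is a minimal sketch of the example above: where a DirectX 11 driver would silently wait for rendering to a buffer to finish before letting you sample it as a texture, in DirectX 12 you state that transition explicitly with a resource barrier (the member names are placeholders, not from any particular engine):

// Minimal sketch: tell the GPU that rendering to this buffer must be finished
// before it is read as a texture. In D3D11 the driver did this for you.
CD3DX12_RESOURCE_BARRIER toTexture = CD3DX12_RESOURCE_BARRIER::Transition(
	m_intermediateBuffer.Get(),                   // the buffer we just rendered to (placeholder name)
	D3D12_RESOURCE_STATE_RENDER_TARGET,           // old state: render target
	D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE);  // new state: texture read by the pixel shader
m_commandList->ResourceBarrier(1, &toTexture);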

Command lists

Another change in the programming model is that GPU commands are now ‘decoupled’ at the API side: Sending rendering commands to the GPU is now a two-step process:

  1. Add all the commands you want to execute to a list
  2. Execute your command list(s)

Classic rendering APIs are 'immediate' renderers: with an API call you send the command directly to the driver/GPU. Drivers might internally buffer this and create/optimize their own command lists, but this is transparent to the programmer. A big problem with this programming model is that the order in which the commands are executed is important. That basically means that you can only use a single thread to send commands. If you were to use multiple threads, you'd have to synchronize them so that they all sent their commands in order, which would basically mean they run one after another, so you might as well use a single thread.

DirectX 11 tried to work around this by introducing ‘deferred’ contexts. You would have one ‘immediate’ context, which would execute all commands immediately. But you could create additional contexts, which would buffer commands in a list, which you could later hand down to the immediate context to execute.
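In code, that looked roughly like this (a minimal sketch with placeholder member names, and error handling omitted):

// Create a deferred context; commands recorded on it are only buffered, not executed.
ID3D11DeviceContext* deferredContext = nullptr;
m_device->CreateDeferredContext(0, &deferredContext);

// Record some commands on the deferred context (typically on a worker thread).
deferredContext->OMSetRenderTargets(1, &renderTargetView, depthStencilView);
deferredContext->DrawIndexedInstanced(36, 1, 0, 0, 0);

// Finish recording into a command list...
ID3D11CommandList* commandList = nullptr;
deferredContext->FinishCommandList(FALSE, &commandList);

// ...and later hand it to the immediate context for actual execution.
m_immediateContext->ExecuteCommandList(commandList, TRUE);

commandList->Release();
deferredContext->Release();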

There were however two problems with this approach:

  1. The deferred contexts supported only a subset of all commands
  2. Only nVidia's implementation managed to extract significant performance gains from it

To clarify that second point, FutureMark built an API overhead test, which includes tests with DX11 using immediate and deferred contexts, with a single or multiple threads. See Anandtech’s review of this test.

As you can see, this feature does absolutely nothing on AMD hardware. They are stuck at 1.1M calls regardless of what technique you use, or how many cores you throw at it.

With nVidia however, you see that with 4 or 6 cores, it goes up to 2.2M-2.3M calls. Funnily enough, nVidia's performance on the single-threaded DX11 code also goes up on the 6-core machine, so the total gains from this technique are not very dramatic. Apparently nVidia already performs some parallel processing inside the driver.

DirectX 12 takes this concept further. You now have a command queue, in which you can queue up command lists, which will be executed in-order. The commands inside the command list will also be executed in-order. You can create multiple command lists, and create a thread for each list, to add the commands to it, so that they can all work in parallel. There are no restrictions on the command lists anymore, like there were with the deferred context in DX11 (technically you no longer have an ‘immediate’ context in DX12, they are all ‘deferred’).
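As a rough sketch of what that enables (names such as m_device, m_commandQueue, numThreads and RecordScenePart are placeholders of mine, and error handling and cleanup are omitted), each thread records its own command list, and the lists are then submitted to the queue in one go:

// Each worker thread records its own command list; the queue executes them in submission order.
// (Requires <thread> and <vector> in addition to d3d12.h.)
std::vector<ID3D12CommandList*> lists(numThreads);
std::vector<std::thread> workers;

for (UINT i = 0; i < numThreads; i++)
{
	workers.emplace_back([&, i]
	{
		// Every thread needs its own allocator and command list.
		ID3D12CommandAllocator* allocator = nullptr;
		ID3D12GraphicsCommandList* commandList = nullptr;
		m_device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT, IID_PPV_ARGS(&allocator));
		m_device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT, allocator, nullptr, IID_PPV_ARGS(&commandList));

		RecordScenePart(commandList, i);	// record the draw calls for this thread's part of the scene
		commandList->Close();
		lists[i] = commandList;
	});
}

for (auto& worker : workers)
	worker.join();

// Submit all lists at once; they execute in-order on the queue.
m_commandQueue->ExecuteCommandLists(static_cast<UINT>(lists.size()), lists.data());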

An added advantage is that you can re-use these command lists. In various cases, you want to send the same commands every frame (to render the same objects and such), so you can now remove redundant work by just using the same command list over and over again.
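A minimal sketch of that idea (again with placeholder names, and assuming you wait for the previous frame to finish, e.g. with a fence, before reusing the list):

// Record the static part of the scene once...
RecordStaticScene(m_staticCommandList.Get());	// placeholder helper that records the draw calls
m_staticCommandList->Close();

ID3D12CommandList* lists[] = { m_staticCommandList.Get() };
while (rendering)
{
	// ...and replay the same list every frame, without re-recording it.
	m_commandQueue->ExecuteCommandLists(_countof(lists), lists);
	Present();	// placeholder for presenting the frame and synchronizing with a fence
}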

Honourable mention for Direct3D 1 here: The first version of Direct3D actually used a very similar concept to command lists, known as ‘execute buffers’. You would first store your commands as bytecode in an ‘execute buffer’, and then execute the buffer. Technically this could be used in multi-threaded environments in much the same way: use multiple threads, which each fill their own execute buffer in parallel.

Asynchronous compute

Why is there a queue for the command lists, you might ask? Can't you just send the command lists directly to an Execute( myList ) function? The answer is: there can be more than one queue. You can see this as a form of 'GPU multithreading': you can have multiple command lists executing at the same time. If you want to compare it to CPU multithreading, you could view a command queue as a thread, and a command list as an instruction stream (a 'ThreadProc' that is called when the thread is running).

There are three different classes of queues and command lists:

  1. Graphics
  2. Compute
  3. DMA/Copy

The idea behind this is that modern GPUs are capable of performing multiple tasks at the same time, since they use different parts of the GPU. Eg, you can upload a texture to VRAM via DMA while you are also rendering and/or performing compute tasks (previously this was done automatically by the driver).

The most interesting new feature here is that you can run a graphics task and a compute task together. The classic example of how you can use this is rendering shadowmaps: shadowmaps do not need any pixel shading, they just need to store a depth value. So you are mainly running vertex shaders and using the rasterizer. In most cases, your geometry is not all that complex, so there are relatively few vertices that need processing, leaving a lot of ALUs on the GPU sitting idle. With these next-gen APIs you can now execute a compute task at the same time, and make use of the ALUs that would otherwise sit idle (compute does not need the rasterizer).

This is called 'asynchronous' compute because, like with conventional multithreading on the CPU, you are scheduling two (or more) tasks to run concurrently, and you don't really care about which order they run in. They can run at the same time if the hardware is capable of it, or one after another, or they can switch multiple times (time-slicing) until they are both complete (on CPUs there are a number of ways to run multiple threads: single-core, multi-core, multi-CPU, HyperThreading, and the OS will use a combination of techniques to schedule threads on the available hardware; see also my earlier blog). You may care about the priority of each, so that you can allocate more resources to one of them to make it complete faster. But in general, they are running asynchronously. You need to re-synchronize by checking that they have both triggered their event to signal that they have completed.
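A minimal sketch of what that looks like with the DirectX 12 API (the member names, the fence value and the split into a 'shadowmap' list and a 'postprocess' list are just for illustration):

// Create a second queue of the 'compute' class, next to the regular graphics queue.
D3D12_COMMAND_QUEUE_DESC computeDesc = {};
computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
ID3D12CommandQueue* computeQueue = nullptr;
m_device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

// Kick off both workloads; the GPU may run them concurrently if it is capable of it.
ID3D12CommandList* shadowLists[]  = { m_shadowMapCommandList.Get() };	// mostly rasterizer/vertex work
ID3D12CommandList* computeLists[] = { m_postProcessCommandList.Get() };	// mostly ALU work
m_graphicsQueue->ExecuteCommandLists(_countof(shadowLists), shadowLists);
computeQueue->ExecuteCommandLists(_countof(computeLists), computeLists);

// Re-synchronize: the compute queue signals a fence when it is done...
computeQueue->Signal(m_computeFence.Get(), ++m_computeFenceValue);
// ...and the graphics queue waits for that fence before it consumes the results.
m_graphicsQueue->Wait(m_computeFence.Get(), m_computeFenceValue);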

Now that the introduction is over…

So, how does it look when you actually want to render something? Well, let’s have a (slightly simplified) look at rendering an object in DirectX 11:

// Set the viewport and scissor rectangle.
D3D11_VIEWPORT viewport = m_deviceResources->GetScreenViewport();
m_immediateContext->RSSetViewports(1, &viewport);
m_immediateContext->RSSetScissorRects(1, &m_scissorRect);

// Send drawing commands.
ID3D11RenderTargetView* renderTargetView = m_deviceResources->GetRenderTargetView();
ID3D11DepthStencilView* depthStencilView = m_deviceResources->GetDepthStencilView();
m_immediateContext->ClearRenderTargetView(renderTargetView, DirectX::Colors::CornflowerBlue);
m_immediateContext->ClearDepthStencilView(depthStencilView, D3D11_CLEAR_DEPTH, 1.0f, 0);

m_immediateContext->OMSetRenderTargets(1, &renderTargetView, depthStencilView);

UINT stride = sizeof(Vertex);	// stride of your vertex structure ('Vertex' is illustrative)
UINT offset = 0;
m_immediateContext->IASetVertexBuffers(0, 1, &m_vertexBuffers, &stride, &offset);
m_immediateContext->IASetIndexBuffer(m_indexBuffer, DXGI_FORMAT_R16_UINT, 0);
m_immediateContext->DrawIndexedInstanced(36, 1, 0, 0, 0);

And in DirectX 12 it would look like this (again, somewhat simplified):

// Set the viewport and scissor rectangle.
D3D12_VIEWPORT viewport = m_deviceResources->GetScreenViewport();
m_commandList->RSSetViewports(1, &viewport);
m_commandList->RSSetScissorRects(1, &m_scissorRect);

// Indicate this resource will be in use as a render target.
CD3DX12_RESOURCE_BARRIER renderTargetResourceBarrier =
	CD3DX12_RESOURCE_BARRIER::Transition(m_deviceResources->GetRenderTarget(),
		D3D12_RESOURCE_STATE_PRESENT, D3D12_RESOURCE_STATE_RENDER_TARGET);
m_commandList->ResourceBarrier(1, &renderTargetResourceBarrier);

// Record drawing commands.
D3D12_CPU_DESCRIPTOR_HANDLE renderTargetView = m_deviceResources->GetRenderTargetView();
D3D12_CPU_DESCRIPTOR_HANDLE depthStencilView = m_deviceResources->GetDepthStencilView();
m_commandList->ClearRenderTargetView(renderTargetView, DirectX::Colors::CornflowerBlue, 0, nullptr);
m_commandList->ClearDepthStencilView(depthStencilView, D3D12_CLEAR_FLAG_DEPTH, 1.0f, 0, 0, nullptr);

m_commandList->OMSetRenderTargets(1, &renderTargetView, false, &depthStencilView);

m_commandList->IASetVertexBuffers(0, 1, &m_vertexBufferView);
m_commandList->IASetIndexBuffer(&m_indexBufferView);	// index buffer view for the same geometry (name illustrative)
m_commandList->DrawIndexedInstanced(36, 1, 0, 0, 0);

// Indicate that the render target will now be used to present when the command list is done executing.
CD3DX12_RESOURCE_BARRIER presentResourceBarrier =
	CD3DX12_RESOURCE_BARRIER::Transition(m_deviceResources->GetRenderTarget(),
		D3D12_RESOURCE_STATE_RENDER_TARGET, D3D12_RESOURCE_STATE_PRESENT);
m_commandList->ResourceBarrier(1, &presentResourceBarrier);

// Close the command list and execute it on the command queue.
m_commandList->Close();
ID3D12CommandList* ppCommandLists[] = { m_commandList.Get() };
m_deviceResources->GetCommandQueue()->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);

As you can see, the actual calls are very similar. The functions mostly have the same names, and even the parameters are mostly the same. At a higher level, most of what you do is exactly the same: you use a rendertarget and a depth/stencil surface, you set up a viewport and scissor rectangle, you clear the rendertarget and depth/stencil for a new frame, and you send a list of triangles to the GPU, stored in a vertex buffer and index buffer pair. (You would already have initialized a vertex shader and pixel shader at an earlier stage, and already uploaded the geometry to the vertex and index buffers, but I left those parts out for simplicity. The code there is again very similar between the APIs, although DirectX 12 again requires a bit more code, because you have to tell the API in more detail what you actually want. Uploading the geometry also requires a command list there.)

So what the GPU actually has to do is exactly the same, regardless of whether you use DirectX 11 or DirectX 12. The differences are mainly on the CPU side, as you can see.

The same argument also extends to Vulkan. The code may look a bit different from what you’re doing in DirectX 12, but in essence, you’re still creating the same vertex and index buffers, and sending the same triangle list draw command to the GPU, rendering to a rendertarget and depth/stencil buffer.

So, what this means is that you do not really need to ‘design’ your hardware for DirectX 12 or Vulkan at all. The changes are mainly on the API side, and affect the workload of the CPU and driver, not the GPU. Which is also why DirectX 12 supports feature levels of 11_x: the API can also support hardware that pre-dates DirectX 12.
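That is also visible in how you create a device: a minimal sketch (the adapter pointer is a placeholder and may be null for the default adapter) is to ask for a DirectX 12 device at a pre-DirectX 12 feature level:

// A D3D12 device can be created on hardware that only supports feature level 11_0.
ID3D12Device* device = nullptr;
HRESULT hr = D3D12CreateDevice(
	adapter,                   // IDXGIAdapter* to use, or nullptr for the default adapter
	D3D_FEATURE_LEVEL_11_0,    // minimum feature level the application requires
	IID_PPV_ARGS(&device));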

Chickens and eggs

But, how exactly did these new hardware features arrive in the DirectX 11.3 and 12 APIs? And why exactly did this new programming model emerge in these new APIs?

The first thing to point out is that Microsoft does not develop hardware. This means that Microsoft can’t just think up new hardware features out-of-the-blue and hope that hardware will support it. For each update of DirectX, Microsoft will have meetings with the big players on the hardware market, such as Intel, nVidia and AMD (in recent years, also Qualcomm, for mobile devices). These IHVs will give Microsoft input on what kind of features they would like to include in the next DirectX. Together with Microsoft, these features will then be standardized in a way that they can be implemented by all IHVs. So this is a somewhat democratic process.

Aside from IHVs, Microsoft also includes some of the larger game engine developers, to get more input from the software/API side of things. Together they will try to work out solutions to current problems and bottlenecks in the API, and also work out ways to include the new features presented by the IHVs. In some cases, the IHVs will think of ways to solve these bottlenecks by adding new hardware features. After all, game engines and rendering algorithms also evolve over time, as does the way they use the API and hardware. For example, in the early days of shaders, you wrote separate shaders for each material, and these used few parameters. These days, you tend to use the same shader for most materials, and only the parameters change. So at one point, switching between shaders efficiently was important, but now updating shader parameters efficiently is more important. Different requirements call for different APIs (so it’s not like older APIs were ‘bad’, they were just designed against different requirements, a different era with different hardware and different rendering approaches).

So, as you can see, it is a bit of cross-pollination between all the parties. Sometimes an IHV comes up with a new approach first, and it is included in the API. Other times, developers come up with problems/bottlenecks first, and then the API is modified, and hardware redesigned to make it work. Since not all hardware is equal, it is also often the case that one vendor had a feature already, while others had to modify their hardware to include it. So for one vendor, the chicken came first, for others the egg came first.

The development of the API is an iterative process, where the IHVs will work closely with Microsoft over a period of time, to make new hardware and drivers available for testing, and move towards a final state of the API and driver model for the new version of DirectX.

But what does it mean?

In short, it means it is pretty much impossible to say how much of a given GPU was ‘designed for API X or Y’, and how much of the API was ‘designed for GPU A or B’. It is a combination of things.

In DirectX 12/Vulkan, it seems clear that Rasterizer Ordered Views came from Intel, since they already had the feature on their DirectX 11 hardware. It looks like nVidia designed this feature into Maxwell v2 for the upcoming DX11.3 and DX12 APIs. AMD has yet to implement the feature.

Conservative rasterization is not entirely clear. nVidia was the first to market with the feature, in Maxwell v2. However, Intel followed not much later, and implemented it at Tier 3 level. So I cannot say with certainty whether it originated from nVidia or Intel. AMD has yet to implement the feature.
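For completeness, this is roughly how an application can detect these two features at runtime (a minimal sketch using the standard capability query, with 'device' being an already-created ID3D12Device):

// Query the optional feature caps to see whether ROVs and conservative rasterization are supported.
D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options));

if (options.ROVsSupported)
{
	// Rasterizer Ordered Views are available.
}
if (options.ConservativeRasterizationTier >= D3D12_CONSERVATIVE_RASTERIZATION_TIER_1)
{
	// Conservative rasterization is available (tier 1 or higher).
}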

Asynchronous compute is a bit of a special case. The API does not consider it a specific hardware feature, and leaves it up to the driver how to handle multiple queues. The idea most likely originated from AMD, since they have had support for running graphics and compute at the same time since the first GCN architecture, and the only way to make good use of that is to have multiple command queues. nVidia added limited support in Maxwell v2 (they had asynchronous compute support in CUDA since Kepler, but they could not run graphics tasks in parallel), and more flexible/efficient support in Pascal. Intel has yet to support this feature (that is, they support code that uses multiple queues, but as far as I know, they cannot actually run graphics and compute tasks in parallel, so they cannot use it to improve performance by better ALU usage).

Also, you can compare performance differences from DirectX 11 to 12, or from OpenGL to Vulkan… but it is impossible to draw conclusions from these results. Is the DirectX 11 driver that good, or is the DirectX 12 engine that bad for a given game/GPU? Or perhaps the other way around? Was OpenGL that bad, and is the Vulkan engine that good for a given game/GPU?

Okay, but what about performance?

The main advantage of this new programming model in DirectX 12/Vulkan is also a potential disadvantage. I see a parallel with the situation of compilers versus hand-written assembly on the CPU: it is possible for an assembly programmer to outperform a compiler, but there are two main issues here:

  1. Compilers have become very good at what they do, so you have to be REALLY good to even write assembly that is on par with what a modern compiler will generate.
  2. Optimizing assembly code can only be done for one CPU at a time. Chances are that the tricks you use to maximize performance on CPU A will not work on CPU B, or will even be detrimental to performance there. Writing code that works well on all CPUs is even more difficult. Compilers, however, are very good at this, and can easily optimize code for multiple CPUs and include multiple codepaths in the executable.

In the case of DirectX 12/Vulkan, you will be taking on the driver development team. In DirectX 11/OpenGL, you had the advantage that the low-level resource management, synchronization and such, was always done by the driver, which was optimized for a specific GPU, by the people who built that GPU. So like with compilers, you had a very good baseline of performance. As an engine developer, you have to design and optimize your engine very well, before you get on par with these drivers (writing a benchmark that shows that you can do more calls per second in DX12 than in DX11 is one thing. Rendering an actual game more efficiently is another).

Likewise, because of the low-level nature of DirectX 12/Vulkan, you need to pay more attention to the specific GPUs and videocards you are targeting. The best way to manage your resources on GPU A might not be the best way on GPU B. Normally the driver would take care of it. Now you may need to write multiple paths, and select the fastest one for each GPU.
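A trivial sketch of what that can look like in practice (the RenderPath names and the renderPath variable are placeholders; the PCI vendor IDs are the well-known values for the three vendors):

// Pick a tuned code path based on which vendor's GPU the DXGI adapter reports.
DXGI_ADAPTER_DESC adapterDesc = {};
adapter->GetDesc(&adapterDesc);

switch (adapterDesc.VendorId)
{
case 0x10DE:	// nVidia
	renderPath = RenderPath::TunedForNVidia;
	break;
case 0x1002:	// AMD
	renderPath = RenderPath::TunedForAMD;
	break;
case 0x8086:	// Intel
	renderPath = RenderPath::TunedForIntel;
	break;
default:
	renderPath = RenderPath::Generic;	// fall back to a safe generic path
	break;
}

In practice you would probably go further than just the vendor ID, and also look at the specific device or architecture, for exactly the reasons described above.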

Asynchronous compute is especially difficult to optimize for. Running two things at the same time means you have to share your resources. If this is not balanced well, then one task may be starving the other of resources, and you may actually get lower performance than if you would just run the tasks one after another.

What makes it even more complicated is that this balance is specific not only to the GPU architecture, but even to the specific model of video card, to a certain extent. If we take the above example of rendering shadowmaps while doing a compute task (say postprocessing the previous frame)… What if GPU A renders shadowmaps quickly and compute tasks slowly, but GPU B renders the shadowmaps slowly and compute tasks quickly? This would throw off the balance. For example, once the shadowmaps are done, the next graphics task might require a lot of ALU power, and the compute task that is still running will be starving that graphics task.

And things like rendering speed will depend on various factors, including the relative speed of the rasterizers to the VRAM. So, even if two videocards use the same GPU with the same rasterizers, variations in VRAM bandwidth could still disturb the balance.

On Xbox One and PlayStation 4, asynchronous compute makes perfect sense. You only have a single target to optimize for, so you can carefully tune your code for the best performance. On a Windows system however, things are quite unpredictable. Especially looking to the future. Even if you were to optimize for all videocards available today, that is no guarantee that the code will still perform well on future videocards.

So we will have to see what the future brings. Firstly, will engine developers actually be able to extract significant gains from this new programming model? Secondly, will these gains stand the test of time? As in, are these gains still available 1, 2 or 3 generations of GPUs from now, or will some code actually become suboptimal on future GPUs? Code which, when handled by an optimized ‘high-level’ driver such as in DirectX 11 or OpenGL, will actually be faster than the DirectX 12/Vulkan equivalent in the engine code? I think this is a more interesting aspect than which GPU is currently better for a given API.


FutureMark’s Time Spy: some people still don’t get it

Today I read a review of AMD's new Radeon RX470 by Jelle Stuip. He used Time Spy as a benchmark, and added the following description (translated):

There has recently been some controversy about 3DMark Time Spy, because it turns out that the benchmark does not use vendor-specific code paths, but one generic code path for all hardware. Futuremark sees that as a plus: it does not matter which video card you test, because when running Time Spy they all follow the same code path, so you can make fair comparisons. That also means that there is no specific optimization for AMD GPUs, and that AMD's implementation of asynchronous compute is not fully exploited. In games that do make use of it, the balance between AMD and Nvidia GPUs can be different from what the Time Spy benchmark represents.

Sorry Jelle, but you don’t get it.

Indeed, Time Spy does not use vendor-specific code paths. However, 'vendor' is a misnomer here anyway. I mean, you can write a path specific to AMD's or nVidia's current GPU architecture, but that is no guarantee that it is going to be any good on architectures from the past, or architectures from the future. What people of the "vendor-specific code path" school of thought really mean is architecture-specific paths. In this case, it is not just the microarchitecture itself: the actual configuration of the video card also has a direct effect on how async compute code performs (the balance between the number of shaders, shader performance, memory bandwidth and other such factors).

However, in practice that is not going to happen of course, because that means that games have to receive updates to their code indefinitely, for each new videocard that arrives, until the end of time. So in practice your shiny new videocard will not have any specific paths for yesterday’s games either.

But, more importantly… He completely misinterprets the results of Time Spy. Yes, there is less of a difference between Pascal and Polaris than in most games/benchmarks using Async Compute. However, the reason for that is obvious: Currently Time Spy is the only piece of software where Async Compute is enabled on nVidia devices *and* results in a performance gain. The performance gains on AMD hardware are as expected (around 10-15%). However, since nVidia hardware now benefits from this feature as well, the difference between AMD and nVidia hardware is smaller than in other async compute scenarios.

It is also important to note that both nVidia and AMD are part of FutureMark's Benchmark Development Program.

As such, both vendors have been actively involved in the development process, had access to the source code throughout the development of the benchmark, and have actively worked with FutureMark on designing the tests and optimizing the code for their hardware. If anything, Time Spy might not be representative of games because it is actually MORE fair than various games out there, which are skewed towards one vendor.

So not only does Time Spy exploit async compute very well on AMD hardware (as AMD themselves attest to), but Time Spy *also* exploits async compute well on nVidia hardware. Most other async compute games/benchmarks were optimized by/for AMD hardware alone, and as such do not represent how nVidia hardware would perform with this feature, since it is not even enabled in the first place. We will probably see more games that benefit as much as Time Spy does on nVidia hardware, once developers start optimizing for the Pascal architecture as well. And once that happens, we can judge how well Time Spy has predicted the performance. Currently, DX12/Vulkan titles are still too much of a vendor-specific mess to draw any fair conclusions (eg. AoTS and Hitman are AMD-sponsored, ROTR is nVidia-sponsored, and DOOM's Vulkan path doesn't have async compute enabled for nVidia (yet?) and uses AMD-specific shader extensions).

Too bad, Jelle. Next time, please try to do some research on the topic, and get your facts straight.


GeForce GTX1060: nVidia brings Pascal to the masses

Right, we can be short about the GTX1060… It does exactly what you'd expect: it scales down Pascal as we know it from the GTX1070 and GTX1080 to a smaller, cheaper chip, aiming at the mainstream market. The card is functionally exactly the same, apart from missing an SLI connector.

But let’s compare it to the competition, the RX480. And as this is a technical blog, I will disregard price. Instead, I will concentrate on the technical features and specs.

Radeon RX480:

Die size: 230 mm²
Process: GloFo 14 nm FinFET
Transistor count: 5.7 billion
Memory bandwidth: 256 GB/s
Memory bus: 256-bit
Memory size: 4/8 GB
TDP: 150W
DirectX Feature level: 12_0

GeForce GTX1060:

Die size: 200 mm²
Process: TSMC 16 nm FinFET
Transistor count: 4.4 billion
Memory bandwidth: 192 GB/s
Memory bus: 192-bit
Memory size: 6 GB
TDP: 120W
DirectX Feature level: 12_1

And well, if we were to just go by these numbers, the Radeon RX480 looks like a sure winner. On paper it all looks very strong. You'd almost think it's a slightly more high-end card, given the higher TDP, the larger die, the higher transistor count, the higher TFLOPS rating, more memory and more bandwidth (the specs are ~30% higher than the GTX1060's). In fact, the memory specs are identical to those of the GTX1070, as is the TDP.

But that is exactly where Pascal shines: due to the excellent efficiency of this architecture, the GTX1060 is as fast as or faster than the RX480 in pretty much every benchmark you care to throw at it. If this came down to a price war, nVidia would easily win it: their GPU is smaller, their PCB can be simpler because of the narrower memory interface and the lower power consumption, and they can use a smaller/cheaper cooler because they have less heat to dissipate. So the cost of building a GTX1060 will be lower than that of an RX480.

Anyway, speaking of benchmarks…

Time Spy

FutureMark recently released a new benchmark called Time Spy, which uses DirectX 12, and makes use of that dreaded async compute functionality. As you may know, this was one of the points that AMD has marketed heavily in their DX12-campaign, to the point where a lot of people thought that:

  1. AMD was the only one supporting the feature
  2. Async compute is the *only* new feature in DX12
  3. All gains that DX12 gets, come from using async compute (rather than the redesign of the API itself to reduce validation, implicit synchronization and other things that may reduce efficiency and add CPU overhead)

Now, the problem is… Time Spy actually showed that GTX10x0-cards gained performance when async compute was enabled! Not a surprise to me of course, as I already explained earlier that nVidia can do async compute as well. But many people were convinced that nVidia could not do async compute at all, not even on Pascal. In fact, they seemed to believe that nVidia hardware could not even process in parallel period. And if you take that as absolute truth, then you have to ‘explain’ this by FutureMark/nVidia cheating in Time Spy!

Well, of course FutureMark and nVidia are not cheating, so FutureMark revised their excellent Technical Guide to deal with the criticisms, and also published an additional press release regarding the ‘criticism’.

This gives a great overview of how the DX12 API works with async compute, and how FutureMark made use of this feature to boost performance.

And if you want to know more about the hardware-side, then AnandTech has just published an excellent in-depth review of the GTX1070/1080, and they dive deep into how nVidia performs asynchronous compute and fine-grained pre-emption.

I was going to write something about that myself, but I think Ryan Smith did an excellent job, and I don’t have anything to add to that. TL;DR: nVidia could indeed do async compute, even on Maxwell v2. The scheduling was not very flexible however, which made it difficult to tune your workload to get proper gains. If you got it wrong, you could receive considerable performance hits instead. Therefore nVidia decided not to run async code in parallel by default, but just serialize it. The plan may have been to ‘whitelist’ games that are properly optimized, and do get gains. We see that even in DOOM, the async compute path is not enabled yet on Pascal. But the hardware certainly is capable of it, to a certain extent, as I have also said before. Question is: will anyone ever optimize for Maxwell v2, now that Pascal has arrived?

Update: AMD has put a blog post online talking about how happy they are with Time Spy, and how well it pushes their hardware with async compute.

I suppose we can say that AMD has given Time Spy its official seal-of-approval (publicly, that is. They already approved it within the FutureMark BDP of course).


AMD’s Polaris debuts in Radeon RX480: I told you so

In a recent blogpost, after dealing with the nasty antics of a deluded AMD fanboy, I already discussed what we should and should not expect from AMD’s upcoming Radeon RX480.

Today, the NDA was lifted, and reviews appear everywhere on the internet. Cards are also becoming available in shops, and street prices become known. I will make this blogpost very short, because I really can’t be bothered:

I told you so. I told you:

  1. If AMD rates the cards at 150W TDP, they are not magically going to be significantly below that. They will be in the same range of power as the GTX970 and GTX1070.
  2. If AMD makes a comparison against the GTX970 and GTX980 in some slides, then that is apparently what they think they will be targeting.
  3. If AMD does not mention anything about DX12_1 or other fancy new features, it won’t have any such things.
  4. You only go for an aggressive pricing strategy if you don't have any other unique selling point.

And indeed, all this rings true. Well, with point 3 there is a tiny little surprise: AMD does actually make some vague claims about a 'foveated rendering' feature. But at this point it is not entirely clear what it does, how developers should use it, let alone how it performs.

So, all this shows just how good nVidia's Maxwell really is. As I said, AMD is one step behind, because they missed the refresh cycle that nVidia did with Maxwell. And this becomes painfully clear now: even though AMD moved to 14nm FinFET, their architecture is so much less efficient that they can only now match the performance-per-watt that Maxwell achieved at 28 nm. Pascal is on a completely different level. Aside from that, Maxwell already has the DX12_1 featureset.

All this adds up to Polaris being too little, too late, which has become a time-honoured AMD tradition by now. At first this was only in the CPU department, but lately the GPU department appears to have fallen into the same pattern.

So what do you do? You undercut the prices of the competition. Another time-honoured AMD tradition. This is all well-and-good for the short term. But nVidia is going to launch those GTX1050/1060 cards eventually (and rumour has it that it will be sooner rather than later), and then nVidia will have the full Pascal efficiency at its disposal to compete with AMD on price. This is a similar situation to the CPU department again, where Intel’s CPUs are considerably more efficient, so Intel can reach the same performance/price levels with much smaller CPUs, which are cheaper to produce. So AMD is always on the losing end of a price war.

Sadly, the street prices are currently considerably higher than what AMD promised us a few weeks ago. So even that is not really working out for them.

Right, I think that’s enough for today. We’ll probably pick this up again soon when the GTX1060 surfaces.


GameWorks vs GPUOpen: closed vs open does not work the way you think it does

I often read people claiming that GameWorks is unfair to AMD, because they don't get access to the source code, and therefore they cannot optimize for it. I cringe every time I read this, because it is wrong on so many levels. So I decided to write a blog about it, to explain how it REALLY works.

The first obvious mistake is that although GPUOpen itself may be open source, the games that use it are not. What this means is that when a game decides to use an open source library, the code is basically ‘frozen in time’ as soon as they build the binaries, which you eventually install when you want to play the game. So even though you may have the source code for the effect framework, what are you going to do with it? You do not have the ability to modify the game code (eg DRM and/or anti-cheat will prevent you from doing this). So if the game happens to be unoptimized for a given GPU, there is nothing you can do about it, even if you do have the source.

The second obvious mistake is the assumption that you need the source code to see what an effect does. This is not the case, and in fact, in the old days, GPU vendors generally did not have access to the source code anyway (it might sound crazy, but in the old days, game developers actually developed games, as in, they developed the whole renderer and engine. Hardware suppliers supplied the hardware). These days, they have developer relations programs, and tend to work with developers more closely, which also involves getting access to source code in some cases (sometimes to the point where the GPU vendor actually does some/a lot of the hard work for them). But certainly not always (especially when the game is under the banner of the competing GPU vendor, such as Gaming Evolved or The Way It’s Meant To Be Played).

So, assuming you don't get access to the source code, is there nothing you can do? Well no, on the contrary. In most cases, games and effect frameworks (even GameWorks) generally just perform standard Direct3D or OpenGL API calls. There are various game development tools available to analyze D3D or OpenGL code; for example, there is Visual Studio Graphics Diagnostics.

Basically, every game developer already has the tools to study which API calls are made, which shaders are run, which textures and geometry are used, how long every call takes etc. Since AMD and nVidia develop these Direct3D and OpenGL drivers themselves, they can include even more debugging and analysis options into their driver, if they so choose.

So in short, it is basically quite trivial for a GPU vendor to analyze a game, find the bottlenecks, and then optimize the driver for a specific game or effect (you cannot modify the game, as stated, so you have to modify the driver, even if you would have the source code). The source code isn’t even very helpful with this, because you want to find the bottlenecks, and it’s much easier to just run the code through an analysis-tool than it is to study the code and try to deduce which parts of the code will be the biggest bottlenecks on your hardware.

The only time GPU vendors actually want/need access to the source code is when they want to make fundamental changes to how a game works, either to improve performance, or to fix some bug. But even then, they don’t literally need access to the source code, they need the developer to change the code for them and release a patch to their users. Sometimes this requires taking the developer by the hand through the source code, and making sure they change what needs to be changed.

So the next time you hear someone claiming that GameWorks is unfair because AMD can’t optimize for it, please tell them they’re wrong, and explain why.



The damage that AMD marketing does

Some of you may have seen the actions of a user who goes by the name of Redneckerz on a recent blogpost of mine. That guy posts one wall of text after the next, full of anti-nVidia rhetoric, shameless AMD promotion, and an endless slew of personal attacks and fallacies.

He even tries to school me on what I may or may not post on my own blog, and how I should conduct myself. Which effectively comes down to me having to post *his* opinions. I mean, really? This is a *personal* blog. Which means that it is about the topics that *I* want to discuss, and I will give *my* opinion on them. You don’t have to agree with that, and that is fine. You don’t have to visit my blog if you don’t like to read what I have to say on a given topic. In fact, I even allow people to comment on my blogs, and they are free to express their disagreements.

But there are limits. You can express your disagreements once, twice, perhaps even three times. But at some point, when I've already given several warnings that we are not going to 'discuss' this further and that things should stay on-topic, you just have to stop. If not, I will make you stop by removing (parts of) your comments that are off-limits. After all, nobody is waiting for people to endlessly spew the same insults and keep making the same demands. It's just a lot of noise that prevents other people from having a pleasant discussion (and before you call me a hypocrite: I may delete the umpteenth repeat of a given post, but I left the earlier ones alone, so it's not like I don't allow you to express your views at all).

In fact, I think even without the insults, the endless walls of text that Redneckerz produces are annoying enough. He keeps repeating himself everywhere. And that is not just my opinion. Literally all other commenters on that item have expressed their disapproval of Redneckerz’ posting style (which is more than a little ironic, given the fact that at least part of Redneckerz’ agenda is to try and paint my posting style as annoying and unwanted).

Speaking about the feedback of other users, they also called him out on having an agenda, namely promoting AMD. Which seems highly likely, given the sheer amount of posts he fires off, and the fact that their content is solely about promoting AMD and discrediting nVidia.

The question arose mainly whether he was just a brainwashed victim of AMD’s marketing, or whether AMD would actually be compensating him for the work he puts in. Now, as you can tell from the start of the ‘conversation’, this was not my first brush with Redneckerz. I had encountered him on another forum some time ago, and things went mostly the same. He attacked me in various topics where I contributed, in very much the same way as here: an endless stream of replies with walls-of-text, and poorly conceived ideas. At some point he would even respond to other people, mentioning my name and speculating what my reply would have been. However, I have not had contact with him since, and Redneckerz just came to my blog out of the blue, and started posting like a maniac here. One can only speculate what triggered him to do that at this moment (is it a coincidence that both nVidia and AMD are in the process of launching their new 16nm GPU lineups?)

Now, if Redneckerz were just a random forum user, we could leave it at that. But in fact, he is an editor for a Dutch gaming website.

That makes him a member of the press, so the plot thickens… I contacted that website, to inform them that one of their editors had gone rampant on my blog and other forums, and that they might want to take action, because it’s not exactly good publicity for their site either. I got some nonsensical response about how they were not responsible for what their editors post on other sites. So I replied that this isn’t about who is responsible, but what they could do is talk some sense into him, for the benefit of us all.

Again, they were hiding behind the ‘no responsibility’-guise. So basically they support his conduct. Perhaps they are in on the same pro-AMD thing that he is, whatever that is exactly.

I've already talked about that before, in general, in my blog related to the release of DirectX 12: about how the general public is being played by AMD, developers and journalists. Things like Mantle, async compute, HBM, how AMD allegedly has an advantage in games because they supply console APUs, and whatnot. This nonsense has become so omnipresent that people think it is actually the reality, even though benchmarks and sales figures prove the opposite (eg, nVidia's GTX960 and GTX970 are the most popular cards among Steam users by a margin).

Just like we have to listen to people claiming Polaris is going to save AMD. Really? The writing is already on the wall: AMD's promotional material showed us a slide with two all-important bits of information:



First, we see them compare against the GeForce GTX970/980. Secondly, we see them stating a TDP of 150W. So the performance target will probably be between the GTX970 and GTX980 (and the TFLOPS rating also indicates that ballpark), and the power envelope will be around 150W. They didn't just put these numbers on there at random. The low-balling price tag is also a tell-tale sign. AMD is not a charitable organization; they're in this business to make money. They don't sell their cards at $199 to make us happy. They sell them at $199 because they've done the maths and $199 is their sweet spot for regaining marketshare while still making enough profit, and for desperately trying to keep people from buying more of those GTX960/970/980 cards until AMD gets their new cards on the market. If they had a killer architecture, they'd charge a premium, because they could get away with it. nVidia should have little trouble matching that price/performance target with their upcoming 1050/1060.

Which matches exactly with how I described the situation AMD is in: they are one ‘refresh’ behind on nVidia, architecture-wise, since they ‘skipped’ Maxwell, where nVidia concentrated on maximizing performance/watt, since they were still stuck at 28 nm. I said that it would be too risky for AMD to do the shrink to 16 nm and at the same time, also do a major architectural overhaul. So it would be unlikely for AMD to completely close the gap that nVidia had opened with Maxwell. And that appears to be what we see with Polaris. When I said it, I was accused of being overly negative towards AMD. In fact, Kyle Bennett of HardOCP said basically the same thing. And he was also met by a lot of pro-AMD people who attacked him. After AMD released their information on Polaris however, things went a bit quiet on that side. We’ll have to wait for the actual release and reviews at the end of this month, but the first signs don’t point to AMD having an answer to match Pascal.

The sad part is that it always has to go this way. You can’t say anything about AMD without tons of people attacking you. Even if it’s the truth. Remember John Fruehe? Really guys, I’m trying to do everyone a favour by giving reliable technical info, instead of marketing BS. I can do that, because I actually have a professional background in the field, and have a good hands-on understanding of CPU internals, GPU internals, rendering algorithms and APIs. Not because I’m being paid to peddle someone’s products, no matter how good or bad they are.

In fact, a lot of the comments I make aren’t so much about AMD’s products themselves, but rather about their inflated and skewed representation in the media.
