As you are probably aware by now, nVidia has released its new Pascal architecture, in the form of the GTX 1080, the ‘mainstream’ version of the architecture, codenamed GP104. nVidia had already presented the Tesla variant of the high-end version earlier, codenamed GP100 (which has HBM2 memory). When they did that, they also published a whitepaper on the architecture.
It’s quite obvious that this is a big leap in performance. Then again, that was to be expected, given that GPUs are finally moving from 28 nm to 14/16 nm process technology. Aside from that, we have the new HBM2 and GDDR5X technologies to increase memory bandwidth. But you can read all about that on the usual benchmark sites.
I would like to talk about the features instead. And although Pascal doesn’t improve dramatically over Maxwell v2 in the feature-department, there are a few things worth mentioning.
A cool trick that Pascal can do is ‘Simultaneous Multi-Projection’. This basically boils down to being able to render the same geometry with multiple different projections, that is, from multiple viewports, in a single pass. Sadly I have not found any information yet on how you would actually implement this in terms of shaders and API state, but I suspect it will be similar to the old geometry shader functionality, where you could feed the same geometry to your shaders multiple times with different view/projection matrices. That allowed you to render a scene to a cubemap in a single pass, for example. Since the implementation in the geometry shader was not very efficient, this never caught on. This time however, nVidia is showcasing the performance gains for VR usage and such, so apparently the new approach is all about efficiency.
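To illustrate the concept (this is just the idea, not nVidia’s actual API, which has not been documented yet; all names here are mine), here is a minimal CPU-side sketch: the geometry is traversed once, and each vertex is emitted once per viewport with a different projection matrix:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical 4x4 matrix/vector types, just for illustration.
struct Vec4 { float x, y, z, w; };
using Mat4 = std::array<float, 16>; // row-major

static Vec4 transform(const Mat4& m, const Vec4& v) {
    return {
        m[0]*v.x + m[1]*v.y + m[2]*v.z  + m[3]*v.w,
        m[4]*v.x + m[5]*v.y + m[6]*v.z  + m[7]*v.w,
        m[8]*v.x + m[9]*v.y + m[10]*v.z + m[11]*v.w,
        m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w,
    };
}

// One traversal of the vertex data produces output for every projection:
// this is the "single pass, multiple viewports" idea in a nutshell.
std::vector<std::vector<Vec4>> multiProject(
        const std::vector<Vec4>& vertices,
        const std::vector<Mat4>& viewProjections) {
    std::vector<std::vector<Vec4>> out(viewProjections.size());
    for (const Vec4& v : vertices)                    // geometry read once...
        for (std::size_t i = 0; i < viewProjections.size(); ++i)
            out[i].push_back(transform(viewProjections[i], v)); // ...emitted N times
    return out;
}
```

The point is that the vertex data is only fetched and processed once, where the naive approach would re-submit the whole scene once per viewport.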
Secondly, there is conservative rasterization. Maxwell v2 was the first architecture to give us this new rendering feature, although it only supported tier 1. Pascal bumps this up to tier 2 support. And there we have the first ‘enigma’ of DirectX 12: for some reason hardly anyone is talking about this cool new rendering feature. It can bump up the level of visual realism another notch, because it allows you to do volumetric rendering on the GPU in a more efficient way (which means more dynamic/physically accurate lighting and fewer pre-baked lightmaps). Yet, nobody cares.
Lastly, we have to mention Asynchronous Compute Shaders, obviously. There’s no getting around that one, I’m afraid. This is the second ‘enigma’ of DirectX 12: for some reason everyone is talking about this one. I personally do not care much about this feature (and neither do various other developers; note how they also point out that it can even make performance worse if it is not tuned properly, yes, also on AMD hardware. Starting to see what I meant earlier?). It may or may not make your mix of rendering/compute tasks run faster or more efficiently, but that’s about it. It does not dramatically improve performance, nor does it allow you to render things in a new way or use more advanced algorithms, like some other new features of DirectX 12 do. So I’m puzzled why the internet pretty much equates this particular feature with ‘DX12’, and ignores everything else.
If you want to know what it is (and what it isn’t), I will direct you to Microsoft’s official documentation of the feature on MSDN. In a nutshell, you can think of it as multi-threading for shaders. Now, shaders tend to be presented as ‘threaded’ anyway, but GPUs have their own flavour of ‘threads’, which is more related to SIMD/MIMD: they view a piece of SIMD/MIMD code as a set of ‘scalar threads’, where all threads in a block share the same program counter, so they all run the same instruction at the same time. The way asynchronous shaders work in DX12 is more like how threads are handled on a CPU: each thread has its own context, the system can switch contexts at any given time, and the order in which contexts/threads are switched can be determined in a number of ways.
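To make that contrast concrete, here is a toy sketch (purely illustrative, no real GPU works this way internally): a ‘warp’ of scalar threads shares a single program counter and executes each instruction for all lanes before moving on, while CPU-style threads each carry their own program counter and can be interleaved freely:

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <vector>

// A toy "instruction" operates on one lane's register.
using Instruction = std::function<void(int& reg)>;

// Lockstep execution: a single shared program counter steps through the
// program, and each instruction is applied to ALL lanes before the next
// one starts. This mirrors the GPU's "set of scalar threads" view.
template <std::size_t Lanes>
std::array<int, Lanes> runLockstep(const std::vector<Instruction>& program,
                                   std::array<int, Lanes> regs) {
    for (const Instruction& ins : program)   // one PC for the whole warp
        for (int& reg : regs)                // every lane runs the same instruction
            ins(reg);
    return regs;
}

// CPU-style execution: each "thread" has its own program counter (its
// position in the program) and can be advanced independently, in any
// interleaving the scheduler chooses. Here: round-robin, one step each.
template <std::size_t Threads>
std::array<int, Threads> runIndependent(const std::vector<Instruction>& program,
                                        std::array<int, Threads> regs) {
    std::array<std::size_t, Threads> pc{};   // one PC per thread
    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t t = 0; t < Threads; ++t) {
            if (pc[t] < program.size()) {
                program[pc[t]++](regs[t]);
                progress = true;
            }
        }
    }
    return regs;
}
```

For a straight-line program both produce the same results, of course; the difference only starts to matter once threads can block, diverge or be preempted, which is exactly what the asynchronous model is about.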
Then it is also no surprise that Microsoft’s examples here include synchronization primitives that we also know from the CPU-side, such as barriers/fences. Namely, the nature of asynchronous execution of code implies that you do not know exactly when which piece of code is running, or at what time a given point in the code will be reached.
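As an analogy (modeled on the semantics of fences, not on any actual driver implementation), a fence is essentially a monotonically increasing counter that one side signals and the other side waits on:

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// A minimal fence: a producer signals ever-increasing values, a consumer
// blocks until the fence has reached the value it is interested in.
// The point: you never know *when* the signal happens, only that the
// wait returns once it has.
class Fence {
public:
    void signal(std::uint64_t value) {
        {
            std::lock_guard<std::mutex> lock(m_);
            if (value > completed_) completed_ = value;  // never goes backwards
        }
        cv_.notify_all();
    }
    void waitFor(std::uint64_t value) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return completed_ >= value; });
    }
    std::uint64_t completedValue() {
        std::lock_guard<std::mutex> lock(m_);
        return completed_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::uint64_t completed_ = 0;
};
```

In D3D12 you would use ID3D12Fence together with ID3D12CommandQueue::Signal for this; the idea is the same.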
The underlying idea is basically the same as that of threading on the CPU: Instead of the GPU spending all its time on rendering, and then spending all its time on compute, you can now start a ‘background thread’ of compute-work while the GPU is rendering in the foreground. Or variations on that theme, such as temporarily halting one thread, so that another thread can use more resources to finish its job sooner (a ‘priority boost’).
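On the CPU, that pattern looks like this (a plain C++ analogy, with made-up workloads; on the GPU, the ‘threads’ would be command queues rather than std::async tasks):

```cpp
#include <future>
#include <numeric>
#include <vector>

// Foreground "rendering" and background "compute" overlap: the compute
// job is kicked off asynchronously, the foreground keeps doing its own
// work, and only at the point where the result is actually needed do we
// synchronize (the get() call, the equivalent of waiting on a fence).
int renderWithBackgroundCompute(const std::vector<int>& computeInput,
                                int frames) {
    std::future<int> compute = std::async(std::launch::async, [&] {
        // background thread: some long-running reduction
        return std::accumulate(computeInput.begin(), computeInput.end(), 0);
    });

    int rendered = 0;
    for (int i = 0; i < frames; ++i)
        ++rendered;                      // foreground: pretend to render frames

    return rendered + compute.get();     // sync point: wait for the compute result
}
```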
Now, here is where the confusion seems to start. Namely, most people seem to think that there is only one possible scenario, and therefore only one way to approach this problem. But, getting back to the analogy with CPUs and threading, it should be obvious that there are various ways to execute multiple threads. We have multi-CPU systems and multi-core CPUs, then there are technologies such as SMT/HyperThreading, and of course there is still good old timeslicing, which we have used since the dawn of time to execute multiple threads/asynchronous workloads on a system with a single single-core CPU. I wrote an article on that some years ago; you might want to give it a look.
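Good old timeslicing can be sketched in a few lines (names and structure are mine, just to show the principle): a single ‘core’ gives each task one slice of work in round-robin fashion until all tasks are done:

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>

// A task runs one "slice" of work per call and reports whether it is done.
using Task = std::function<bool()>;

// One single-threaded "core" interleaves many tasks: each gets a slice,
// and unfinished tasks go to the back of the queue. This is timeslicing,
// the oldest way to run multiple workloads on one execution unit.
std::string runTimesliced(std::deque<std::pair<char, Task>> tasks) {
    std::string trace;                          // records the interleaving
    while (!tasks.empty()) {
        auto current = std::move(tasks.front());
        tasks.pop_front();
        trace += current.first;                 // "this task got the CPU"
        if (!current.second())                  // run one slice
            tasks.push_back(std::move(current)); // not done yet: requeue
    }
    return trace;
}
```

The returned trace shows the tasks taking turns on the single core, which is exactly the effect a user of such a system observes: everything appears to run ‘at the same time’.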
Different approaches in hardware and software will have different advantages and disadvantages. And in some cases, different approaches may yield similar results in practice. For example, in the CPU world we see AMD competing with many cores with relatively low performance per core. Intel, on the other hand, uses fewer cores, but with more performance per core. In various scenarios, Intel’s quadcores compete with AMD’s octacores. So more than one road leads to Rome.
Getting back to the Pascal whitepaper, nVidia writes the following:
Compute Preemption is another important new hardware and software feature added to GP100 that allows compute tasks to be preempted at instruction-level granularity, rather than thread block granularity as in prior Maxwell and Kepler GPU architectures. Compute Preemption prevents long-running applications from either monopolizing the system (preventing other applications from running) or timing out. Programmers no longer need to modify their long-running applications to play nicely with other GPU applications. With Compute Preemption in GP100, applications can run as long as needed to process large datasets or wait for various conditions to occur, while scheduled alongside other tasks. For example, both interactive graphics tasks and interactive debuggers can run in concert with long-running compute tasks.
So that is the way nVidia approaches multiple workloads. They have very fine granularity in when they are able to switch between workloads. This approach bears similarities to time-slicing, and perhaps also to SMT, in being able to switch between contexts down to the instruction level. This should lend itself very well to low-latency scenarios with a mostly serial nature. Scheduling can be done just-in-time.
Edit: Recent developments lead me to clarify the above statement. I did not mean to imply that nVidia’s hardware is entirely serial in nature, and only runs a single task at a time. I thought it was common knowledge that nVidia has been able to run multiple concurrent compute tasks on their hardware for years now (introduced on Kepler as ‘Hyper-Q’). However, it seems that many people are now somehow convinced that nVidia’s hardware can only run one task at a time (really? Never tried to run two or more windowed 3D applications at the same time? You should try it sometime, you’ll find it works just fine! Add some compute-enabled stuff, and still, it works fine). I am strictly speaking about the scheduling of the tasks here. Because, as you probably know from CPUs, even though you may have multiple cores, you will generally have more processes/threads than you have cores, and some processes/threads will go idle, waiting for some event to occur. So periodically these processes/threads have to be switched, so that they all receive processing time and idle time is minimized. What I am saying here deals with the approach that nVidia and AMD take in handling this.
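The difference in preemption granularity can be modeled very roughly like this (a toy model of my own, not nVidia’s actual scheduler): with instruction-level preemption a task can be stopped almost immediately, while with block-level preemption it must at least finish its current block:

```cpp
#include <cstddef>

// Toy model: a "task" is N blocks of M instructions each. With
// instruction-level preemption we can stop after any instruction; with
// block-level preemption we must at least finish the current block.
// Returns how many instructions still run after preemption is requested.
std::size_t instructionsAfterPreempt(std::size_t blocks,
                                     std::size_t instrPerBlock,
                                     std::size_t preemptAt,  // instruction index
                                     bool instructionLevel) {
    std::size_t total = blocks * instrPerBlock;
    if (preemptAt >= total) return 0;        // task finished before the request
    if (instructionLevel) return 0;          // can stop right away
    std::size_t intoBlock = preemptAt % instrPerBlock;
    if (intoBlock == 0) return 0;            // already at a block boundary
    return instrPerBlock - intoBlock;        // must finish the current block
}
```

The worst-case latency of the block-level scheme grows with the block size, which is why instruction-level preemption matters for things like long-running compute jobs sharing the GPU with interactive graphics.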
AMD on the other hand seems to approach it more like a ‘multi-core’ system, where you have multiple ‘asynchronous compute engines’ or ACEs (up to 8 currently), each of which processes its own queues of work. This is nice for inherently parallel/concurrent workloads, but less flexible in terms of scheduling. It’s more of a fire-and-forget approach: once you drop your workload into the queue of a given ACE, it will be executed by that ACE, regardless of what the others are doing. So scheduling seems to be more ahead-of-time (at the high level; the ACEs take care of interleaving the code at the lower level, much like how out-of-order execution works on a conventional CPU).
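This fire-and-forget model can be sketched as follows (my own simplification, not AMD’s actual hardware): each ‘engine’ owns a queue, placement is decided up front, and each engine drains its own queue independently of the others:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Each "engine" owns its own queue of work. Dispatch decides up front
// which engine a job lands on (here: simple round-robin), and from then
// on that engine executes it regardless of what the others are doing.
class Engines {
public:
    explicit Engines(std::size_t count) : queues_(count) {}

    void submit(std::function<void()> job) {        // fire-and-forget
        queues_[next_].push(std::move(job));
        next_ = (next_ + 1) % queues_.size();       // ahead-of-time placement
    }

    void drainAll() {                               // each engine drains its own queue
        for (auto& q : queues_)
            while (!q.empty()) { q.front()(); q.pop(); }
    }

    std::size_t pending(std::size_t engine) const {
        return queues_[engine].size();
    }

private:
    std::vector<std::queue<std::function<void()>>> queues_;
    std::size_t next_ = 0;
};
```

Note that once a job is in a queue, no amount of idleness elsewhere moves it to another engine; that is the flexibility trade-off compared to just-in-time scheduling.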
Sadly, neither vendor gives any actual details on how they fill and process their queues, so we can only guess at the exact scheduling algorithms and parameters. And until we have a decent collection of software making use of this feature, it’s very difficult to say which approach will be best suited to the real world. And even then, the situation may arise where there are two equally valid workloads in widespread use, one favouring one architecture and the other favouring the other, so there is no single answer to which architecture will be best in practice.
Oh, and one final note on the “Founders Edition” cards. People seem to just call them ‘reference’ cards, and complain that they are expensive. However, these “Founders Edition” cards have an advanced cooler with a vapor chamber system. So it is quite a high-end cooling solution (previously, nVidia only used vapor chambers on the high-end, such as the Titan and 980Ti, not the regular 980 and 970). In most cases, a ‘reference’ card is just a basic card, with a simple cooler that is ‘good enough’, but not very expensive. Third-party designs are generally more expensive, and allow for better cooling/overclocking. The reference card is generally the cheapest option on the market.
In this case however, nVidia has opened up the possibility for third-party designs to come up with cheaper coolers, and deliver cheaper cards with the same performance, but possibly less overclocking potential. At the same time, it will be more difficult for third-party designs to deliver better cooling than the reference cooler, at a similar price. Aside from that, nVidia also claims that the whole card design is a ‘premium’ design, using high-quality components and plenty of headroom for overclocking.
So the “Founders Edition” is a ‘reference card’, but not as we know it. It’s not a case of “this is the simplest/cheapest way to make a reliable videocard, and OEMs can take it from there and improve on it”. Also, some people seem to think that nVidia sells these cards directly, under their own brand, but as far as I know, it’s the OEMs that build and sell these cards, under the “Founders Edition” label. For example, the MSI one, or the Inno3D one.
These can be ordered directly from the nVidia site.