GeForce GTX1060: nVidia brings Pascal to the masses

Right, we can be short about the GTX1060… It does exactly what you’d expect: it scales down Pascal as we know it from the GTX1080 and GTX1070 to a smaller, cheaper chip, aimed at the mainstream market. The card is functionally exactly the same, apart from the missing SLI connector.

But let’s compare it to the competition, the RX480. And as this is a technical blog, I will disregard price. Instead, I will concentrate on the technical features and specs.

RX480:
Die size: 230 mm²
Process: GloFo 14 nm FinFET
Transistor count: 5.7 billion
TFLOPS: 5.1
Memory bandwidth: 256 GB/s
Memory bus: 256-bit
Memory size: 4/8 GB
TDP: 150W
DirectX Feature level: 12_0

GTX1060:
Die size: 200 mm²
Process: TSMC 16 nm FinFET
Transistor count: 4.4 billion
TFLOPS: 3.8
Memory bandwidth: 192 GB/s
Memory bus: 192-bit
Memory size: 6 GB
TDP: 120W
DirectX Feature level: 12_1

And well, if we were to go just by these numbers, the Radeon RX480 looks like a sure winner. On paper it all looks very strong. You’d almost think it’s a slightly more high-end card, given the higher TDP, the larger die, the higher transistor count, the higher TFLOPS rating, more memory and more bandwidth (the specs are roughly 30% higher than the GTX1060’s). In fact, the memory specs are identical to those of the GTX1070, as is the TDP.

But that is exactly where Pascal shines: thanks to the excellent efficiency of this architecture, the GTX1060 is as fast as or faster than the RX480 in pretty much every benchmark you care to throw at it. If it came to a price war, nVidia would easily win: their GPU is smaller, their PCB can be simpler because of the narrower memory interface and the lower power consumption, and they can use a smaller, cheaper cooler because there is less heat to dissipate. So a GTX1060 should be cheaper to build than an RX480.

Anyway, speaking of benchmarks…

Time Spy

FutureMark recently released a new benchmark called Time Spy, which uses DirectX 12 and makes use of that dreaded async compute functionality. As you may know, this is one of the points that AMD has marketed heavily in its DX12 campaign, to the point where a lot of people came to think that:

  1. AMD was the only one supporting the feature
  2. Async compute is the *only* new feature in DX12
  3. All gains that DX12 gets come from using async compute (rather than from the redesign of the API itself to reduce validation, implicit synchronization and other things that may reduce efficiency and add CPU overhead; see the sketch below this list)
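To make point 3 a bit more concrete, here is a minimal, hypothetical D3D12 fragment (the function and resource names are mine, not taken from any particular engine): in DX11 the driver tracks this kind of resource hazard implicitly on every call, whereas in DX12 the application declares the transition itself, which is where much of the reduced validation and CPU overhead comes from.

#include <d3d12.h>

// Hypothetical fragment: an explicit resource state transition in D3D12.
// In D3D11 the driver detects this hazard (render target -> shader input)
// behind your back on every call; in D3D12 the application declares it,
// so the runtime can skip most per-call validation and implicit sync.
void TransitionToShaderResource(ID3D12GraphicsCommandList* cmdList,
                                ID3D12Resource* texture)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type  = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
    barrier.Transition.pResource   = texture;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    cmdList->ResourceBarrier(1, &barrier);
}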

Now, the problem is… Time Spy actually showed that GTX10x0 cards gained performance when async compute was enabled! No surprise to me, of course, as I had already explained earlier that nVidia can do async compute as well. But many people were convinced that nVidia could not do async compute at all, not even on Pascal. In fact, they seemed to believe that nVidia hardware could not even process graphics and compute in parallel, period. And if you take that as absolute truth, then the only way to ‘explain’ these results is that FutureMark and nVidia must be cheating in Time Spy!

Well, of course FutureMark and nVidia are not cheating, so FutureMark revised their excellent Technical Guide to address the criticism, and also published an additional press release on the matter.

This gives a great overview of how the DX12 API works with async compute, and how FutureMark made use of this feature to boost performance.
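For those who want to see what that looks like at the API level, here is a minimal sketch, assuming a working D3D12 setup (the parameter names are placeholders, and this is not FutureMark’s code): the application creates a second command queue of type COMPUTE, submits work on it, and orders it against the graphics queue with a fence. Note that the API only expresses the opportunity for overlap; whether anything actually runs concurrently is up to the driver and the hardware, which is exactly why the results differ per architecture.

#include <d3d12.h>

// Hypothetical sketch (not FutureMark's code): run compute work on its own
// queue, next to the graphics (direct) queue. 'computeList' is assumed to be
// recorded from a COMPUTE-type command allocator. D3D12 only expresses the
// opportunity for overlap; the driver and hardware decide what actually
// runs concurrently.
void SubmitAsyncCompute(ID3D12Device* device,
                        ID3D12CommandQueue* graphicsQueue,
                        ID3D12GraphicsCommandList* computeList,
                        ID3D12Fence* fence,
                        UINT64& fenceValue)
{
    // 1. Create a dedicated compute queue (in a real engine this is done
    //    once at startup, not per submission).
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ID3D12CommandQueue* computeQueue = nullptr;
    device->CreateCommandQueue(&desc, __uuidof(ID3D12CommandQueue),
                               reinterpret_cast<void**>(&computeQueue));

    // 2. Submit the recorded compute command list on the compute queue.
    ID3D12CommandList* lists[] = { computeList };
    computeQueue->ExecuteCommandLists(1, lists);

    // 3. Signal a fence when the compute work has finished...
    computeQueue->Signal(fence, ++fenceValue);

    // 4. ...and make the graphics queue wait for that fence before it
    //    consumes the results. Anything the graphics queue submitted before
    //    this Wait() is free to overlap with the compute work.
    graphicsQueue->Wait(fence, fenceValue);

    computeQueue->Release();
}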

And if you want to know more about the hardware-side, then AnandTech has just published an excellent in-depth review of the GTX1070/1080, and they dive deep into how nVidia performs asynchronous compute and fine-grained pre-emption.

I was going to write something about that myself, but I think Ryan Smith did an excellent job, and I don’t have anything to add. TL;DR: nVidia could indeed do async compute, even on Maxwell v2. The scheduling was not very flexible, however, which made it difficult to tune your workload to get proper gains. If you got it wrong, you could suffer considerable performance hits instead. Therefore nVidia decided not to run async code in parallel by default, but simply to serialize it. The plan may have been to ‘whitelist’ games that are properly optimized and do see gains. We see that even in DOOM, the async compute path is not enabled yet on Pascal. But the hardware certainly is capable of it, to a certain extent, as I have said before. The question is: will anyone ever optimize for Maxwell v2, now that Pascal has arrived?
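The practical upshot for developers is that you cannot tell from the API whether splitting work over multiple queues is actually buying you anything; you have to measure it per architecture. Here is a minimal, hypothetical sketch of doing that in D3D12 with timestamp queries (the query heap and readback buffer are assumed to be created elsewhere; all names are placeholders):

#include <d3d12.h>

// Hypothetical sketch: bracket a stretch of GPU work with timestamp queries,
// so the same workload can be timed with and without a separate compute
// queue and compared per GPU. The query heap (type TIMESTAMP, 2 entries) and
// the readback buffer (2 * sizeof(UINT64), READBACK heap) are assumed to be
// created elsewhere.
void TimeWorkload(ID3D12GraphicsCommandList* cmdList,
                  ID3D12QueryHeap* timestampHeap,
                  ID3D12Resource* readbackBuffer)
{
    cmdList->EndQuery(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP, 0);

    // ... record the draws/dispatches you want to measure here ...

    cmdList->EndQuery(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP, 1);

    // Copy both timestamps into the CPU-readable buffer. Once the fence for
    // this command list has passed, map the buffer and compute
    // (t1 - t0) / the queue's timestamp frequency
    // (ID3D12CommandQueue::GetTimestampFrequency) to get GPU time in seconds.
    cmdList->ResolveQueryData(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP,
                              0, 2, readbackBuffer, 0);
}

Run the same workload once with the compute work serialized on the direct queue and once on a dedicated compute queue, and compare the two timings per GPU; given the above, the serialized path may well come out ahead on Maxwell v2, while the overlapped path should win on GCN and Pascal when tuned properly.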

Update: AMD has put a blog post online talking about how happy they are with Time Spy, and how well it pushes their hardware with async compute: http://radeon.com/radeon-wins-3dmark-dx12

I suppose we can say that AMD has given Time Spy its official seal of approval (publicly, that is; they had already approved it within the FutureMark Benchmark Development Program (BDP), of course).


21 Responses to GeForce GTX1060: nVidia brings Pascal to the masses

  1. Mkilbride says:

    Your blog brings me a small measure of sanity. I go to websites and read people saying that “NVIDIA is dead, Vulkan is the future, AMD is the future!”, and when I try to explain to them the points you made above, all I get is “NVIDIA fanboy can’t handle the truth!”

    • Scali says:

      Yea, AMD is desperately trying to make DX12/Vulkan/async compute into their unique selling point. Fanboys promote that as “future-proof”. Some reviews take a more realistic stance: “nVidia’s cards are competitive at DX12/Vulkan, but if you look at DX11 games, they deliver even better value for money”.

      • Mkilbride says:

        Also, I’d be amazed if we got another 3 games using Vulkan in the next few years. Many people I talk to are saying stuff like “Every game is going to use Vulkan now!” When I try to explain to them the history of Vulkan, OpenGL, DX, etc., they don’t listen.

        DX11 will still be the standard for many years, and in DX12, NVIDIA & AMD are pretty close.

      • Scali says:

        I think DX12 and Vulkan are plotting their own downfall as we speak. Even with extensive hand-holding from IHVs, they can barely get their engines to outperform the DX11 ones in various cases. IHVs can’t keep doing that for every game out there… And they also can’t re-optimize for every new GPU they introduce. It might drive developers back to DX11/OpenGL, if it continues to yield more consistent results.

  2. Alex says:

    I do wonder how the GTX 1060 can compete with the theoretically more powerful RX 480. What’s Nvidia’s secret sauce?

    • Scali says:

      I don’t think there’s one or two isolated reasons.
      Rather, just as with Intel’s CPUs, nVidia’s GPUs have evolved faster than AMD’s, so they are simply more efficient in many areas. The performance is the sum of all parts, and all the parts just perform better than AMD’s and are well balanced against each other.
      Because on the whole, since Fermi and GCN, nVidia’s and AMD’s architectures are actually quite similar. There are no longer huge architectural differences such as SIMD/SIMT vs VLIW processing. It’s more about the details and the fine-tuning.

  3. Edison says:

    Is there any detailed info on the differences between the current NVIDIA and AMD implementations?

    • Scali says:

      Difference of what?

      • Edison says:

        The async compute.

      • Scali says:

        The most detailed info on nVidia is probably the AnandTech article I linked above. For AMD I have not seen any detailed info about their implementation.

        This GCN preview has some details on the ACEs, but no details on how exactly they switch contexts, how they prioritize, how quickly they can do that, etc.: http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/5
        At a higher level it sounds very similar at least… the scheduling seems to occur ‘per CU’, which is very similar to the ‘per SM’ approach that nVidia has taken. In both cases it’s groups of SIMD units.
        So that implies that AMD does a very similar type of partitioning. They just had the dynamic load balancing in there at an early stage, whereas nVidia didn’t introduce that until Pascal.

        I suppose AMD may have been a bit too early with that feature, and nVidia just a bit too late.

  4. siavashserver says:

    Hi. How is one supposed to develop and optimize async compute code on nVidia hardware (without the need to be whitelisted by nVidia to run in parallel)?

    Thanks!

    • Scali says:

      You’ll have to ask nVidia about that. Things may be different for Maxwell v2 and Pascal.
      They could also be blacklisting rather than whitelisting.
      Perhaps there’s a registry key to enable it for development.

  5. dogen says:

    Hey Scali, have you seen this article?
    http://www.hardwareunboxed.com/gtx-1060-vs-rx-480-in-6-year-old-amd-and-intel-computers/

    The Doom/Vulkan results are especially interesting. AMD seems to take a larger hit when going down in CPU power while using Vulkan than nVidia does even in OpenGL. It’s very surprising that AMD can have higher overhead in their Vulkan driver than nVidia’s OpenGL driver (let alone nVidia’s own Vulkan driver).

    Your opinion on this? Perhaps it’s just an optimization issue that’ll be improved over time by AMD?

    • Scali says:

      Well, that’s quite sad, isn’t it? AMD, who had issues with CPU overhead in the old APIs… and Vulkan was going to be the answer to that… Apparently not.
      I don’t know if AMD can solve this.
      It only makes it more obvious that the RX480’s performance in DOOM under Vulkan is not the result of a more efficient API and lower CPU overhead, but rather of vendor-specific optimizations that reduce the GPU workload.

      • dogen says:

        Well, I think it’s a combination of both. I mean, the overhead seems to be a little better in Vulkan (and the game’s own performance display shows an improvement), but the GCN-specific optimizations must also be a large part of it.

        It’ll be interesting to see what happens when id releases the nvidia optimization patch.

  6. Alex says:

    AMD is claiming a 40% IPC improvement in Zen versus current AMD chips. How does that compare to Intel chips?

  7. dealwithit says:

    It seems that some interesting things are happening with Radeon:
