The return of Larrabee, and Kepler’s true form

A little over a year ago, Intel announced their future plans for the Larrabee project, their attempt at a massively parallel architecture, much like nVidia’s and AMD’s GPGPU architectures. Initially Larrabee was also supposed to power a graphics card. However, Intel eventually decided against releasing a graphics card, at least for now, since it would not have been competitive enough.

Last week, Intel released the first actual products based on their new architecture, the Xeon Phi. It contains 60 cores running at 1 GHz, capable of running 240 threads simultaneously (via 4-way HyperThreading). It delivers about 1 TFLOPS of double-precision performance. All this is crammed into about 5 billion transistors on 22 nm.

As luck would have it, nVidia also released their latest GPGPU product last week, the Tesla K20/K20X. This makes for an interesting comparison. On paper, Tesla is faster, with about 1.17 TFLOPS (double precision) for the K20 and 1.31 TFLOPS for the K20X. However, this lead is small enough that in practice it will come down to which architecture suits your particular algorithmic needs best, and who can make the most efficient compilers.

One of the strong points of Kepler is supposed to be its high efficiency: nVidia claims up to 93%, thanks to Kepler’s dynamic parallelism and what they call ‘Hyper-Q’. Hyper-Q is the ability to have multiple CPU cores queue work for the GPGPU in parallel, which allows workloads to be fed to the GPGPU more efficiently and removes some of the CPU bottleneck issues. If you want to read more on that, see Anandtech’s coverage of the new architecture.

What is also interesting is that the K20 and K20X have 2496 and 2688 Cuda cores respectively. This sounds like a whole lot more than Xeon Phi’s 60 cores and 240 threads. But that is partly because Intel still counts in a regular x86 fashion, which is vastly different from how GPUs count their resources.

Namely, each x86 core has a 512-bit SIMD unit. This effectively means that it has 16 scalar single-precision ‘cores’ in parallel, or 8 double-precision ‘cores’. If you were to multiply that by the number of threads, then you’d get 3840 single-precision ‘threads’ or 1920 double-precision ‘threads’. Which is a whole lot closer to Kepler’s figures. This is purely theoretical though, since you can’t directly compare HyperThreading’s threads with the way threads are run on the Kepler architecture.
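The counting above can be written out as a small sketch. The core count, thread count, SIMD width and clock are the figures quoted in this post; the factor of two flops per lane per cycle (a fused multiply-add) is my own assumption and is only there to show how the roughly 1 TFLOPS figure could come about.

```python
# Figures quoted in the post for the Xeon Phi.
CORES = 60
THREADS_PER_CORE = 4   # 4-way HyperThreading
SIMD_BITS = 512        # width of each core's vector unit
CLOCK_GHZ = 1.0


def lanes(simd_bits, element_bits):
    """Number of scalar elements one vector instruction operates on."""
    return simd_bits // element_bits


sp_lanes = lanes(SIMD_BITS, 32)  # single precision: 16 lanes
dp_lanes = lanes(SIMD_BITS, 64)  # double precision: 8 lanes

hw_threads = CORES * THREADS_PER_CORE  # 240 hardware threads

# GPU-style count used in the post: hardware threads times SIMD lanes.
sp_thread_lanes = hw_threads * sp_lanes
dp_thread_lanes = hw_threads * dp_lanes

# Physical lanes: the threads of one core share that core's SIMD unit.
sp_physical = CORES * sp_lanes
dp_physical = CORES * dp_lanes

# Assumption (not from the post): one fused multiply-add per lane per
# cycle, counted as 2 floating-point operations.
FLOPS_PER_LANE_PER_CYCLE = 2
peak_dp_gflops = dp_physical * FLOPS_PER_LANE_PER_CYCLE * CLOCK_GHZ

print("SP thread-lanes:", sp_thread_lanes)  # 3840
print("DP thread-lanes:", dp_thread_lanes)  # 1920
print("Physical SP/DP lanes:", sp_physical, dp_physical)  # 960 480
print("Estimated peak DP GFLOPS:", peak_dp_gflops)  # 960.0
```

Note the difference between the two counts: the 3840 and 1920 figures multiply by threads, while the number of lanes that can actually compute in a given cycle is the per-core count.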

On the other hand, the closest thing to a core in x86 CPU terms that Kepler has is the ‘streaming multiprocessor’ or SMX. The full Kepler chip has 15 of those. Each of these multiprocessors is essentially a CPU-like piece of logic in the sense that it has its own program counter. The ‘Cuda cores’ inside each SMX can be seen as big SIMD blocks. Aside from that, there is also a form of simultaneous multithreading, so that multiple threads can be active at the same time on the Cuda cores.
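To relate the Cuda core counts to SMX units, here is a quick sketch. The figure of 192 Cuda cores per SMX is not mentioned in this post; it is taken from nVidia’s published Kepler specifications and should be treated as an assumption of the example. The helper name is hypothetical.

```python
# Assumption (not from the post): each Kepler SMX holds 192 Cuda cores.
CUDA_CORES_PER_SMX = 192

# Cuda core counts quoted in the post.
boards = {"K20": 2496, "K20X": 2688}


def active_smx(cuda_cores, per_smx=CUDA_CORES_PER_SMX):
    """Number of SMX units implied by a Cuda core count."""
    smx, remainder = divmod(cuda_cores, per_smx)
    if remainder:
        raise ValueError("core count is not a whole number of SMX units")
    return smx


for name, cores in boards.items():
    print(name, "->", active_smx(cores), "SMX")
# K20 -> 13 SMX
# K20X -> 14 SMX
```

Under that assumption, neither board appears to use all 15 SMX units of the chip, so the ‘x86-style’ core count for these products would be 13 and 14 rather than 15.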

There are about as many similarities as there are differences between the two architectures. It will be interesting to see how well Intel can compete in the world of GPGPU. Based on the design, it looks like Intel’s approach will be more flexible in handling dynamic code (as in many different kernels, or kernels with a lot of dynamic branching and such), and nVidia’s approach will be better when the code is more straightforward and just requires raw grunt. However, their performance in practice may be closer together than the different architectures imply.
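Why dynamic branching might favour narrower SIMD groups can be shown with a toy model. This is only a sketch: it assumes that a group of work items running in lockstep has to execute both sides of an if/else whenever its items disagree, and it ignores memory, scheduling and everything else that matters on real hardware. The group width of 32 stands in for an nVidia warp and 8 for Xeon Phi’s double-precision lanes; the function name and the branch pattern are made up for the example.

```python
def lockstep_utilisation(takes_branch, width, cost_then, cost_else):
    """Fraction of issued lane-cycles doing useful work for an if/else.

    Work items are processed in lockstep groups of `width`. A group
    executes a path (on all of its lanes) if at least one item needs it.
    """
    issued = 0
    useful = 0
    for start in range(0, len(takes_branch), width):
        group = takes_branch[start:start + width]
        n_then = sum(group)              # items taking the 'then' path
        n_else = len(group) - n_then     # items taking the 'else' path
        if n_then:
            issued += width * cost_then
        if n_else:
            issued += width * cost_else
        useful += n_then * cost_then + n_else * cost_else
    return useful / issued


# Hypothetical branch pattern: runs of 8 items alternate between paths.
pattern = [(i // 8) % 2 == 0 for i in range(64)]

print(lockstep_utilisation(pattern, 8, 10, 10))   # 1.0
print(lockstep_utilisation(pattern, 32, 10, 10))  # 0.5
```

With this particular pattern every group of 8 agrees on its branch, while every group of 32 is split half and half and has to run both paths. A different pattern would give different numbers, so this illustrates the mechanism rather than predicting real performance.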

What is interesting though is that this is the ‘full’ Kepler. As I said in my blog about the GeForce GTX680 release a while ago, that was a ‘mainstream’ chip design, and a high-end version of the architecture was probably still on its way. So here we have it then. Comparing the full Kepler to the mainstream derivative in the GTX680, it is also clear that the GTX680 is ‘artificially limited’ in terms of GPGPU performance. It only has about 3.5B transistors, compared to the incredible 7.1B transistors of the full Kepler at 28 nm. Also, compare this to ‘only’ 5B transistors for Xeon Phi, and that at 22 nm: Intel may well be scaling up their game in the near future.

At this point it is not clear whether nVidia will also create a graphics card based on the full Kepler GK110 chip, but if they do, it is going to be nothing short of a monster.


5 Responses to The return of Larrabee, and Kepler’s true form

    • Scali says:

      Well, my first impression is that this was written with a terrible pro-Intel bias. I don’t have any experience developing with Xeon Phi yet, but I can’t imagine the difference with Cuda being anywhere near as significant as that article tries to make out. Cuda supports standard languages like C++ and Fortran as well, and porting existing code is nowhere near as painful as they seem to suggest.

      I am surprised they don’t go deeper into a less controversial standpoint, and that is price: Intel’s chips are smaller and cheaper to produce (smaller die, 22 nm vs 28 nm), and Intel prices are rumoured to be as low as $400 per card. That is only a fraction of the prices that nVidia charges for its Tesla range (probably $2000+ for the K20).
      If anything, THAT is going to be truly revolutionary.

  1. Brendon says:

    What other areas do you think Intel will get into? I get the impression they are trying to diversify as the PC market is shrinking. Sorry if it is a little off topic, just curious about your opinion.

    • Scali says:

      Well, Intel went into SSD production not too long ago. Their Atom-based SoCs are now also good enough to be used in smartphones and tablets. And now there are GPGPU cards. It seems like Intel is covering quite a lot of ground already. The most obvious area that Intel could still get into is discrete video cards. That was originally the plan for Larrabee, but so far they’re only going for GPGPU. I expect that they still want to make full GPUs out of this technology though, so discrete video cards are probably on the roadmap. Intel has also come a long way with their integrated GPUs in recent years, so they may pull it off this time.
