The myth of CMT (Cluster-based Multithreading)

The first time I heard someone use the term ‘CMT’, I was somewhat surprised. Was there a different kind of CPU multithreading technology that I had somehow missed? But when I looked it up, things became quite clear. If you google the term, you mainly land on AMD marketing material explaining ‘cluster-based multithreading’ (or sometimes ‘clustered multithreading’).

This in itself is strange, because one of the pages you will also find is this paper: http://dl.acm.org/citation.cfm?id=640477.640525

Triggered by the ever increasing advancements in processor and networking technology, a cluster of PCs connected by a high-speed network has become a viable and cost-effective platform for the execution of computation intensive parallel multithreaded applications.

So apparently the term ‘cluster-based multithreading’ was already in use before AMD’s CMT, and in a far less confusing sense: it simply refers to conventional clustering of PCs to build a virtual supercomputer.

So CMT is just an ‘invention’ by AMD’s marketing department. They coined a term that sounds close to SMT (Simultaneous Multithreading), in an attempt to compete with Intel’s HyperThreading. Now clearly, HyperThreading is just a marketing term as well, but it is Intel’s name for their implementation of SMT, which is a commonly accepted term for a multithreading approach in CPU design, and one that had been in use long before Intel implemented HyperThreading (IBM started researching it in 1968, to give you an idea of the historical perspective here).

Now the problem I have with CMT is that people are actually buying it. They seem to think that CMT is just as valid a technology as SMT. And worse, they think that the two are closely related, or even equivalent. As a result, they are comparing CMT with SMT in benchmarks, as I found in this Anandtech review a few days ago: http://www.anandtech.com/show/5279/the-opteron-6276-a-closer-look/6

AMD claimed more than once that Clustered Multi Threading (CMT) is a much more efficient way to crunch through server applications than Simultaneous Multi Threading (SMT), aka Hyper-Threading (HTT).

Now, I have a problem with comparisons like these… Let’s compare the benchmarked systems here: http://www.anandtech.com/show/5279/the-opteron-6276-a-closer-look/2

Okay, so all systems have two CPUs. So let’s look at the CPUs themselves:

  • Opteron 6276: 8-module/16-thread, which has two Bulldozer dies of 1.2B transistors each, total 2.4B transistors
  • Opteron 6220: 4-module/8-thread, one Bulldozer die of 1.2B transistors
  • Opteron 6174: 12-core/12-thread, which has two dies of 0.9B transistors each, total 1.8B transistors
  • Xeon X5650: 6-core/12-thread, 1.17B transistors

Now, it’s obvious where things go wrong here, just by looking at the transistor count: the Opteron 6276 is more than twice as large as the Xeon. So how can this be a fair comparison of the merits of CMT vs SMT? If you throw twice as much hardware at the problem, it is bound to handle more threads better. The chip is already at an advantage anyway, since it can handle 16 simultaneous threads, whereas the Xeon can only handle 12.
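
To put that size difference in numbers, here is a quick back-of-the-envelope sketch in Python. It is purely illustrative and uses only the transistor and thread counts listed above:

    # Rough transistor budget per hardware thread, using the counts listed above.
    chips = {
        "Opteron 6276 (CMT)": {"transistors_b": 2.4,  "threads": 16},
        "Opteron 6220 (CMT)": {"transistors_b": 1.2,  "threads": 8},
        "Opteron 6174 (SMP)": {"transistors_b": 1.8,  "threads": 12},
        "Xeon X5650 (SMT)":   {"transistors_b": 1.17, "threads": 12},
    }

    for name, chip in chips.items():
        millions_per_thread = chip["transistors_b"] * 1000 / chip["threads"]
        print(f"{name:20s} {millions_per_thread:6.1f}M transistors per thread")

    # Opteron 6276: ~150M transistors per thread, Xeon X5650: ~97.5M per thread.
    # The CMT chip spends roughly 1.5x as many transistors per thread as the Xeon does.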

But if we look at the actual benchmarks, we see that the reality is different: AMD actually NEEDS those two dies to keep up with Intel’s single die. And even then, Intel’s chip excels in keeping response times short. The new CMT-based Opterons are not all that convincing compared to the smaller, older Opteron 6174 either, which can handle only 12 threads instead of 16, and just uses vanilla SMP for multithreading.

Let’s inspect things even more closely… What are we benchmarking here? A series of database scenarios, with MySQL and MSSQL. This is integer code. Well, that *is* interesting. Because, what exactly was it that CMT did? Oh yes, it didn’t do anything special for integers! Each module simply has two dedicated integer cores. It is the FPU that is shared between the two threads inside a module. But we are not using it here. Well, lucky AMD: best-case scenario for CMT.
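
To picture what is and isn’t shared, here is a small, deliberately simplified sketch in Python. The groupings are mine and only meant to illustrate the argument, not to be an exact description of either microarchitecture:

    # Simplified model of resource sharing between two threads (illustrative only).
    bulldozer_module = {
        "threads": 2,
        "per_thread": ["dedicated integer core (2 ALUs)"],
        "shared":     ["fetch/decode front end", "branch predictor",
                       "FPU/SIMD cluster", "L2 cache"],
    }

    xeon_smt_core = {
        "threads": 2,
        "per_thread": ["architectural state (registers)"],
        "shared":     ["fetch/decode front end", "all execution ports (3 ALUs)", "caches"],
    }

    # The database benchmarks are integer-only, so the shared FPU in the Bulldozer
    # module sits idle: each thread effectively gets its own dedicated integer core.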

But let’s put that in perspective… Let’s take a simplified look at the execution resources, focusing on the integer ALUs in each CPU.

The Opteron 6276 with CMT disabled has:

  • 8 modules
  • 8 threads
  • 4 ALUs per module
  • 2 ALUs per thread (the ALUs cannot be shared between threads, so disabling CMT disables half the threads, and as a result also half the ALUs)
  • 16 ALUs in total

With CMT enabled, this becomes:

  • 8 modules
  • 16 threads
  • 4 ALUs per module
  • 2 ALUs per thread
  • 32 ALUs in total

So nothing happens, really. Since CMT doesn’t share the ALUs, it works exactly the same as the usual SMP approach. So you would expect the same scaling, since the execution units are dedicated per thread anyway. Enabling CMT just gives you more threads.
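
To make that concrete, here is the same arithmetic as the two lists above, in a few lines of Python (a sketch, assuming the ALU counts given above):

    # ALUs per thread on the Opteron 6276: 8 modules, 4 ALUs per module,
    # hard-partitioned into 2 ALUs per integer core (never shared between threads).
    modules = 8

    for cmt_enabled in (False, True):
        threads = modules * (2 if cmt_enabled else 1)
        usable_alus = threads * 2          # disabling CMT also disables half the ALUs
        print(f"CMT {'on ' if cmt_enabled else 'off'}: {threads:2d} threads, "
              f"{usable_alus:2d} usable ALUs, {usable_alus / threads:.1f} ALUs per thread")

    # CMT off:  8 threads, 16 usable ALUs, 2.0 ALUs per thread
    # CMT on : 16 threads, 32 usable ALUs, 2.0 ALUs per thread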

The Xeon X5650 with SMT disabled has:

  • 6 cores
  • 6 threads
  • 3 ALUs per core
  • 3 ALUs per thread
  • 18 ALUs in total

With SMT enabled, this becomes:

  • 6 cores
  • 12 threads
  • 3 ALUs per core
  • 3 ALUs per 2 threads, effectively ~1.5 ALUs per thread
  • 18 ALUs in total

So here the difference between CMT and SMT becomes quite clear: with only one thread running per core or module, each thread has more ALUs with SMT (3) than with CMT (2). With both threads running, each thread effectively has fewer ALUs with SMT (~1.5) than with CMT (2).
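
The same sketch for the Xeon X5650 makes the contrast explicit (again just the arithmetic from the lists above):

    # ALUs per thread on the Xeon X5650: 6 cores, 3 ALUs per core.
    # SMT shares the ALUs between two threads instead of partitioning them.
    cores, alus_per_core = 6, 3

    for smt_enabled in (False, True):
        threads = cores * (2 if smt_enabled else 1)
        total_alus = cores * alus_per_core     # unchanged: SMT shares, it does not disable
        print(f"SMT {'on ' if smt_enabled else 'off'}: {threads:2d} threads, "
              f"{total_alus:2d} ALUs, {total_alus / threads:.1f} ALUs per thread (effective)")

    # SMT off:  6 threads, 18 ALUs, 3.0 ALUs per thread
    # SMT on : 12 threads, 18 ALUs, 1.5 ALUs per thread (effective)
    # Compare with the CMT sketch above: 2.0 ALUs per thread, with CMT on or off.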

And that’s why SMT works, and CMT doesn’t: AMD’s previous CPUs also had 3 ALUs per thread. But in order to reduce the size of the modules, AMD chose to use only 2 ALUs per thread this time. It is a case of cutting off one’s nose to spite one’s face: CMT struggles in single-threaded scenarios, compared to both the previous-generation Opterons and the Xeons.

At the same time, CMT is not actually saving a lot of die space: there are still 4 ALUs per module in total. And yes, obviously, when you dedicate more resources to the two threads inside a module, and single-threaded performance is poor anyway, you would expect it to scale better than SMT does.

But what does CMT bring, effectively? Nothing. AMD’s chips are much larger than the competition’s, or even than their own previous generation. And since the Xeon is so much better at single-threaded performance, it can stay ahead in heavily multithreaded scenarios, despite the fact that SMT does not scale as well as CMT or SMP. The real advantage that SMT brings is that it is a very efficient solution: it takes up very little die space. Intel could do the same as AMD does and put two dies in a single package. That would result in a chip with 12 cores, running 24 threads, and it would absolutely devour AMD’s CMT in terms of performance.
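
Such a dual-die Xeon never existed, of course, but the arithmetic behind that claim is simple. A hypothetical sketch, derived only from the figures quoted earlier:

    # Hypothetical dual-die Xeon package built from two X5650 dies (not a real product).
    x5650 = {"cores": 6, "threads": 12, "transistors_b": 1.17}

    dual_die_xeon = {key: 2 * value for key, value in x5650.items()}
    print(dual_die_xeon)
    # {'cores': 12, 'threads': 24, 'transistors_b': 2.34}
    # Still slightly below the Opteron 6276's 2.4B transistors, with 24 threads vs 16.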

So I’m not sure why AMD thinks that CMT is ‘more efficient’, since they need a much larger chip, which also consumes more power, to get the same performance as a Xeon that is not even a high-end model. The Opteron 6276 tested by Anandtech is the top of the line. The Xeon X5650, on the other hand, is a midrange model clocked at 2.66 GHz. The top model of that series is the X5690, clocked at 3.46 GHz. Which shows another advantage of smaller chips: better clock speed scaling.

So, let’s not pretend that CMT is a valid technology, comparable to SMT. Let’s just treat it as what it is: a hollow marketing term. I don’t take CMT seriously, or people who try to use the term in a serious context, for that matter.


10 Responses to The myth of CMT (Cluster-based Multithreading)

  1. NewImprovedjdwii says:

    Simple, they want to be like HP/Apple/Nintendo and be different. Now I will say SMT usually scales around 20-30%, where CMT can be 55-80%, but I will agree with you and say it is harder to do since it is a bigger die, and it just means AMD doesn’t make as much money as Intel.

  2. Pingback: AMD Steamroller | Scali's OpenBlog™

  3. Pingback: Anonymous

  4. Pingback: AMD's New High Performance Processor Cores Coming Sometime in 2015 - Giving Up on Modular Architecture

  5. Pingback: AMD Confirms Development of High-Performance x86 Core With Completely New Architecture

  6. Pingback: AMD’s New High Performance Processor Cores Coming Sometime in 2015 … « Reviews Technology

  7. Ventisca says:

    so you’re saying that AMD’s CMT is nothing but a marketing gimmick?
    I’m no expert, but after reading your article, (maybe) I have a similar opinion. :D
    The module is actually two cores, but under a single instruction fetch and decode unit. So what AMD has done is not the same level of technology as SMT; instead, they just run more threads on more cores.
    AMD’s newer core architecture, Steamroller, splits the decode unit so that each module has 2 instruction decoders, which makes it clear that they really are two “separate” cores.

    • Randoms says:

      They are still sharing the branch predictor, the fetch unit and the SIMD cluster.

      So they are still not fully separated cores. It is a step back from the original CMT design, but it is still a CMT design.

  8. Pingback: AMD FX Series Making a Comeback Within Two Years - APU 14 Conference Reveals Future Roadmaps
