The myth of CMT (Cluster-based Multithreading)

The first time I heard someone use the term ‘CMT’, I was somewhat surprised. Was there a different kind of CPU multithreading technology that I somehow missed? But when I looked it up, things became quite clear. If you google the term, you’ll mainly land on AMD marketing material, explaining ‘cluster-based multithreading’ (or sometimes also ‘clustered multithreading’):

This in itself is strange, as one page you will also find is this: http://dl.acm.org/citation.cfm?id=640477.640525

Triggered by the ever increasing advancements in processor and networking technology, a cluster of PCs connected by a high-speed network has become a viable and cost-effective platform for the execution of computation intensive parallel multithreaded applications.

So apparently the term ‘cluster-based multithreading’ has been used before AMD’s CMT, and is a lot less confusing: it just speaks of conventional clustering of PCs to build a virtual supercomputer.

So CMT is just an ‘invention’ by AMD’s marketing department. They invented a term that sounds close to SMT (Simultaneous Multithreading), in an attempt to compete with Intel’s HyperThreading. Now clearly, HyperThreading is just a marketing term as well, but it is Intel’s term for their implementation of SMT, which is a commonly accepted term for a multithreading approach in CPU design, and has been in use long before Intel implemented HyperThreading (IBM started researching it in 1968, to give you an idea of the historical perspective here).

Now the problem I have with CMT is that people are actually buying it. They seem to think that CMT is just as valid a technology as SMT. And worse, they think that the two are closely related, or even equivalent. As a result, they are comparing CMT with SMT in benchmarks, as I found in this Anandtech review a few days ago: http://www.anandtech.com/show/5279/the-opteron-6276-a-closer-look/6

AMD claimed more than once that Clustered Multi Threading (CMT) is a much more efficient way to crunch through server applications than Simultaneous Multi Threading (SMT), aka Hyper-Threading (HTT).

Now, I have a problem with comparisons like these… Let’s compare the benchmarked systems here: http://www.anandtech.com/show/5279/the-opteron-6276-a-closer-look/2

Okay, so all systems have two CPUs. So let’s look at the CPUs themselves:

  • Opteron 6276: 8-module/16-thread, which has two Bulldozer dies of 1.2B transistors each, total 2.4B transistors
  • Opteron 6220: 4-module/8-thread, one Bulldozer die of 1.2B transistors
  • Opteron 6174: 12-core/12-thread, which has two dies of 0.9B transistors each, total 1.8B transistors
  • Xeon X5650: 6-core/12-thread, 1.17B transistors

Now, it’s obvious where things go wrong here, by just looking at the transistor count: the Opteron 6276 is more than twice as large as the Xeon. So how can you have a fair comparison of the merits of CMT vs SMT? If you throw twice as much hardware at the problem, it’s bound to be able to handle more threads better. The chip is already at an advantage anyway, since it can handle 16 simultaneous threads, where the Xeon can only handle 12.
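To put numbers on that comparison, here is a small Python sketch that uses only the transistor and thread counts listed above. The dictionary and function names are made up for illustration, and transistor count is only a rough proxy for die size, so treat the results as a ballpark rather than a measurement.

```python
# Per-package figures as listed in the article (transistors in billions).
# All names below are illustrative, not taken from any vendor tool.
TRANSISTORS_BN = {
    "Opteron 6276": 2 * 1.2,   # two Bulldozer dies
    "Opteron 6220": 1.2,       # one Bulldozer die
    "Opteron 6174": 2 * 0.9,   # two dies
    "Xeon X5650": 1.17,        # single die
}

THREADS = {
    "Opteron 6276": 16,
    "Opteron 6220": 8,
    "Opteron 6174": 12,
    "Xeon X5650": 12,
}


def size_ratio(cpu: str, baseline: str) -> float:
    """How many times more transistors `cpu` has than `baseline`."""
    return TRANSISTORS_BN[cpu] / TRANSISTORS_BN[baseline]


def transistors_per_thread(cpu: str) -> float:
    """Billions of transistors spent per hardware thread."""
    return TRANSISTORS_BN[cpu] / THREADS[cpu]


if __name__ == "__main__":
    for cpu in TRANSISTORS_BN:
        print(
            f"{cpu}: {size_ratio(cpu, 'Xeon X5650'):.2f}x the Xeon, "
            f"{transistors_per_thread(cpu):.3f}B transistors per thread"
        )
```

Under these figures the Opteron 6276 spends roughly one and a half times as many transistors per hardware thread as the Xeon X5650, which is the opposite of what an 'efficient' multithreading scheme would suggest.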

But if we look at the actual benchmarks, we see that the reality is different: AMD actually NEEDS those two dies to keep up with Intel’s single die. And even then, Intel’s chip excels in keeping response times short. The new CMT-based Opterons are not all that convincing compared to the smaller, older Opteron 6174 either, which can handle only 12 threads instead of 16, and just uses vanilla SMP for multithreading.

Let’s inspect things even closer… What are we benchmarking here? A series of database scenarios, with MySQL and MSSQL. This is integer code. Well, that *is* interesting. Because, what exactly was it that CMT did? Oh yes, it didn’t do anything special for integers! Each module simply has two dedicated integer cores. It is the FPU that is shared between two threads inside a module. But we are not using it here. Well, lucky AMD, best case scenario for CMT.

But let’s put that in perspective… Let’s have a simplified look at the execution resources, looking at the integer ALUs in each CPU.

The Opteron 6276 with CMT disabled has:

  • 8 modules
  • 8 threads
  • 4 ALUs per module
  • 2 ALUs per thread (the ALUs can not be shared between threads, so disabling CMT disables half the threads, and as a result also half the ALUs)
  • 16 ALUs in total

With CMT enabled, this becomes:

  • 8 modules
  • 16 threads
  • 4 ALUs per module
  • 2 ALUs per thread
  • 32 ALUs in total

So nothing happens, really. Since CMT doesn’t share the ALUs, it works exactly the same as the usual SMP approach. So you would expect the same scaling, since the execution units are dedicated per thread anyway. Enabling CMT just gives you more threads.

The Xeon X5650 with SMT disabled has:

  • 6 cores
  • 6 threads
  • 3 ALUs per core
  • 3 ALUs per thread
  • 18 ALUs in total

With SMT enabled, this becomes:

  • 6 cores
  • 12 threads
  • 3 ALUs per core
  • 3 ALUs per 2 threads, effectively ~1.5 ALUs per thread
  • 18 ALUs in total

So here the difference between CMT and SMT becomes quite clear: with single-threading, each thread has more ALUs with SMT than with CMT (3 versus 2). With multithreading, each thread effectively has fewer ALUs with SMT than with CMT (about 1.5 versus 2).
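The four configurations above can be reproduced with a few lines of Python. This is a deliberately simplified model, assuming (as the article does) that integer ALUs are the only resource that matters, that CMT dedicates a fixed number of ALUs to each thread, and that SMT lets all threads on a core share that core's ALUs. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CpuConfig:
    label: str
    units: int                 # modules (Opteron) or cores (Xeon)
    threads: int               # hardware threads enabled
    alus_per_unit: int         # physical integer ALUs per module/core
    # Set for CMT, where each thread owns a fixed set of ALUs.
    # None for SMT, where the threads of a core share all its ALUs.
    dedicated_alus_per_thread: Optional[int] = None

    def usable_alus(self) -> int:
        """ALUs that the enabled threads can actually reach."""
        if self.dedicated_alus_per_thread is not None:
            return self.dedicated_alus_per_thread * self.threads
        return self.alus_per_unit * self.units

    def alus_per_thread(self) -> float:
        """Effective ALUs per thread when all threads are busy."""
        return self.usable_alus() / self.threads


CONFIGS = [
    CpuConfig("Opteron 6276, CMT off", 8, 8, 4, dedicated_alus_per_thread=2),
    CpuConfig("Opteron 6276, CMT on", 8, 16, 4, dedicated_alus_per_thread=2),
    CpuConfig("Xeon X5650, SMT off", 6, 6, 3),
    CpuConfig("Xeon X5650, SMT on", 6, 12, 3),
]

if __name__ == "__main__":
    for cfg in CONFIGS:
        print(
            f"{cfg.label}: {cfg.usable_alus()} usable ALUs, "
            f"{cfg.alus_per_thread():.1f} per thread"
        )
```

The model makes the trade-off visible: turning on CMT doubles the usable ALUs but never gives a single thread more than 2, while turning on SMT leaves the ALU total unchanged and only changes how many threads compete for them.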

And that’s why SMT works, and CMT doesn’t: AMD’s previous CPUs also had 3 ALUs per thread. But in order to reduce the size of the modules, AMD chose to use only 2 ALUs per thread now. It is a case of cutting off one’s nose to spite one’s face: CMT is struggling in single-threaded scenarios, compared to both the previous-generation Opterons and the Xeons.

At the same time, CMT is not actually saving a lot of die-space: There are 4 ALUs in a module in total. Yes, obviously, when you have more resources for two threads inside a module, and the single-threaded performance is poor anyway, one would expect it to scale better than SMT.

But what does CMT bring, effectively? Nothing. Their chips are much larger than the competition’s, or even their own previous generation. And since the Xeon is so much better with single-threaded performance, it can stay ahead in heavy multithreaded scenarios, despite the fact that SMT does not scale as well as CMT or SMP. But the real advantage that SMT brings is that it is a very efficient solution: it takes up very little die-space. Intel could do the same as AMD does, and put two dies in a single package. But that would result in a chip with 12 cores, running 24 threads, and it would absolutely devour AMD’s CMT in terms of performance.

So I’m not sure where AMD thinks that CMT is ‘more efficient’, since they need a much larger chip, which also consumes more power, to get the same performance as a Xeon, which is not even a high-end model. The Opteron 6276 tested by Anandtech is the top of the line. The Xeon X5650 on the other hand is a midrange model clocked at 2.66 GHz. The top model of that series is the X5690, clocked at 3.46 GHz. Which shows another advantage of smaller chips: better clockspeed scaling.
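As a rough illustration of that clock headroom, the sketch below compares the two Xeon clock speeds quoted above. The names are illustrative, and the assumption that performance scales with clock speed is an idealization that ignores memory and other bottlenecks.

```python
# Base clocks in GHz, as quoted in the article.
XEON_CLOCKS_GHZ = {
    "X5650": 2.66,  # the midrange part Anandtech tested
    "X5690": 3.46,  # the top model of the same series
}


def clock_headroom(tested: str, top: str) -> float:
    """Ratio of the top model's clock to the tested model's clock."""
    return XEON_CLOCKS_GHZ[top] / XEON_CLOCKS_GHZ[tested]


if __name__ == "__main__":
    ratio = clock_headroom("X5650", "X5690")
    print(f"X5690 runs at {ratio:.2f}x the clock of the X5650")
```

So the tested Xeon leaves about 30% of clock speed on the table within its own product line, while the Opteron it is compared against is already AMD's top model.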

So, let’s not pretend that CMT is a valid technology, comparable to SMT. Let’s just treat it as what it is: a hollow marketing term. I don’t take CMT seriously, or people who try to use the term in a serious context, for that matter.


92 Responses to The myth of CMT (Cluster-based Multithreading)

  1. NewImprovedjdwii says:

    Simple: they want to be like HP/Apple/Nintendo, and that is to be different. Now I will say SMT usually scales around 20-30% where CMT can be 55-80%. But I will agree with you and say it’s harder to do since it’s a bigger die, and it just means AMD doesn’t make as much money as Intel.

  2. Pingback: AMD Steamroller | Scali's OpenBlog™

  3. Pingback: Anonymous

  4. Pingback: AMD's New High Performance Processor Cores Coming Sometime in 2015 - Giving Up on Modular Architecture

  5. Pingback: AMD Confirms Development of High-Performance x86 Core With Completely New Architecture

  6. Pingback: AMD’s New High Performance Processor Cores Coming Sometime in 2015 … « Reviews Technology

  7. Ventisca says:

    so you’re saying that AMD’s CMT is nothing but marketing gimmick?
    I’m no expert, but after reading your article, (maybe) I have a similar opinion.😀
    The module is actually two core, but just under one instruction fetch and decode. So what’s AMD done is not same level of technology of SMT, instead, they just do more thread in more core.
    The new-ish AMD’s core architecture, Steamroller, split the decode unit for each core in the module so each module has 2 instruction decoder, so it’s clear that that they are actually two “separated” core.

    • Randoms says:

      They are still sharing the branch predictor, fetch and the SIMD cluster.

      So they are still not two separate cores. It is a step backwards from the original CMT design, but it is still a CMT design.

  8. Pingback: AMD FX Series Making a Comeback Within Two Years - APU 14 Conference Reveals Future Roadmaps

  9. Pingback: F.A.Q pertanyaan yang sering diajukan tentang Arsitektur AMD CMT yang ada di AMD APU dan FX - SutamatamasuSutamatamasu

  10. Lionel Alva says:

    Would you know of any tenable alternatives to SMT then?

    • Scali says:

      Well no… There is no alternative. Why should there be an alternative? That’s like asking “What is an alternative to cache?” or “What is an alternative to pipelining instructions?”
      There are no alternatives, they are just techniques to improve performance in a CPU design.

  11. Scali says:

    Yay, gets posted on Reddit for the umpteenth time… Cognitive dissonance ensues with posters there, trying to discredit this piece hard… with far-fetched and non-sensical arguments (actually going against the AMD marketing material that I put directly on here. If you have to argue against AMD’s own marketing material in order to discredit my article, you know you’ve completely lost it)… But nobody is man enough to comment here.
    The reason I can’t wait for AMD going bankrupt is that it is hopefully the end of these AMD fanboys.
    I am tired of their endless insults and backstabbing.

    • UIGoWild says:

      Do you think the cpu market will be better without competition? It doesn’t take a marketing degree to understand that without competition, prices would sky-rocket and innovation would go slower. Now I guess you’re thinking that I’m an AMD fan and all that, but that’s just childish. I’m not trying to defend people who insulted you; being a fanboy of a company and never thinking twice is not clever at all.

      Although, by saying:

      The reason I can’t wait for AMD going bankrupt is that it is hopefully the end of these AMD fanboys.

      You kinda show that you’re just the opposite. An “Anti-AMD”. That’s not better than a fan boy. I hope AMD will get better and that we’ll see real competition now that they announced that they’re going for SMT, not because I’m an AMD fan, but because I want the best for the customers.

      • Scali says:

        Do you think the cpu market will be better without competition?

        This is the fallacy known as a ‘leading question’.

        It doesn’t take a marketing degree to understand that without competition, prices would sky-rocket and innovation would go slower.

        This is the fallacy known as ‘slippery slope’.

        You kinda show that you’re just the opposite. An “Anti-AMD”. That’s not better than a fan boy.

        Nice try, but I’m anti-fanboy, not anti-AMD.

        Anyway, if you take a glimpse at reality for a moment, you’ll see that we’ve effectively been without any real competition for many years in the CPU-market. Prices didn’t exactly skyrocket so far, and innovation didn’t exactly slow down. What we do see is that innovation has moved into other areas than just CPU-performance at all cost (such as the breakneck GHz-race in the Pentium3/4-era, which customers didn’t exactly benefit from. They received poor, immature products with a tendency to overheat, become unstable or just break down, from both sides).
        Currently there’s innovation in things like better power-efficiency, Intel scaling down their x86 architectures to also move into tablet/smartphone/embedded markets, and more focus on graphics acceleration and features (for the first time ever, Intel is actually the leader in terms of GPU features, with the most complete DX12 GPUs on the market).

      • UIGoWild says:

        Okay. Let’s say I haven’t been perfectly clear. And yeah, my comment may have looked like an attack or something, but I was just thinking that you were at risk of ruining your credibility by saying that you wished for AMD to go bankrupt.

        You said:
        Nice try, but I’m anti-fanboy, not anti-AMD.

        So okay, I might have been reacting a bit too quickly. Actually, I totally agree with you on that point. Being a fanboy of a company, any company, is not a clever choice. But I still hold to my point: I would rather keep AMD in the race just to be sure there’s a “tangible” competitor to intel (or nvidia for that matter). I would be saying the same thing if Intel was the one lagging behind. I may be pessimistic, but I don’t like the idea of having only one company holding more than 70% of a market. (Which is already a huge chunk and the actual share of intel at the moment [ps. don’t quote me on that but I’m pretty sure it’s close to that].)

        And even though the competition over performance wasn’t really strong (it’s been forever since AMD was close to Intel), I still think that this competition was good for the customers in the end.

      • Klimax says:

        @UIGoWild
        You are still massively wrong. There is still competition. It is called older Intel’s chips. If there are no improvements and prices are higher, then the only new chips sold will be replacements and a trickle of new computers, alongside a massive second-hand market. Therefore no price change is to be expected. Look up monopoly pricing. It is not what you think it is. Not even remotely.

  12. Justin Ayers says:

    “There is still competition. It is called older Intel’s chips.” But the key you’re missing is that competition between businesses is essential.

    • Klimax says:

      Not necessary for some markets, like the CPU market. Because even five-year-old chips can be good enough for many people, they form effective competition to new chips: potential buyers don’t have a pressing need to upgrade, and if new chips were substantially more expensive, then even new buyers could skip them and get old chips.

      That is one of the reasons why monopolies are not illegal; only abuse of a dominant/monopoly position is. And you forgot that we are already there. AMD ceased to be a competitor to Intel about four to six years ago.

      • HowDoMagnetsWork says:

        Let’s assume that Intel actually will end up increasing their prices, believing they’d make more money. Then customers buy more older chips. Years pass, barely any new Intel CPUs are bought, most of the old ones are out of stock. What now? If AMD is in the race, people switch to AMD, even if their devices are half as good as Intel’s. If AMD is not in the race, customers will be forced to pay Intel tremendous prices or just not use their products. Of course, if the company is full of good people, they would never do that, rendering competition useless. But what company is full of good people? Competition is very important for any market.

      • Scali says:

        People aren’t forced to buy new CPUs. CPUs don’t really break down or wear out (in case you missed it, earlier this year, I was part of the team that released 8088 MPH, a demo that runs on the original IBM PC 5150 from 1981. We used plenty of original 80s PCs during development, with their original 8088 CPUs, and they still worked fine, 30+ years after they were made).
        There’s no point in buying older chips if you already have an older chip like that.
        Likewise, performance-per-dollar is a delicate balance. If Intel makes their CPUs too expensive, people simply will not upgrade, because they cannot justify the cost vs the extra performance (perhaps you youngsters don’t know this, but in the good old days when Intel was the only x86-supplier, it often took many years for a new CPU architecture to become mainstream. For example, the 386 was introduced in 1985, but didn’t become mainstream until around 1990. It was just too expensive for most people, so they bought 8088/286 systems instead).

        This means that Intel is always competing against itself, and has only limited room for increasing prices. At the same time they constantly need to improve performance at least a little, to keep upgrades attractive enough.
        If they don’t, they will price themselves out of their own market. If people don’t buy new CPUs, Intel has no income. Which is obviously a scenario that Intel needs to avoid at all costs.

        AMD is really completely irrelevant in most of today’s market already, because their fastest CPUs can barely keep up with mainstream Intel CPUs of a few generations ago. A lot of people have already upgraded to these CPUs or better, and have no interest in getting an AMD CPU at all, even if AMD would give them away for free.
        So we’ve already had the scenario of Intel competing against its older products for many years now. Not much will change if AMD disappears completely.

        It seems a lot of AMD fanboys think that the whole CPU market is in the sub-$200 price bracket where AMD operates. In reality most of it is above that.

  13. Reality Cop says:

    Scali, you’re damn blind. In those “good old days when Intel was the only x86 supplier”:

    1. x86 wasn’t the only option. You had PCs built with MOS, Motorola, and Zilog CPUs all over the place. You had Sun SPARC workstations.

    2. Intel was NOT the only x86 supplier. AMD, NEC, TI, and other were making x86 clones before 1990.

    • Scali says:

      Oh really now?

      1. x86 wasn’t the only option. You had PCs built with MOS, Motorola, and Zilog CPUs all over the place. You had Sun SPARC workstations.

      You think I didn’t know that? I suggest you read some of my Just keeping it real articles. You could have figured it out anyway, since I explicitly said ‘x86 supplier’.

      2. Intel was NOT the only x86 supplier. AMD, NEC, TI, and other were making x86 clones before 1990.

      They were not clones, they were ‘second source’. These fabs made CPUs of Intel’s design, commissioned by Intel. That’s like saying TSMC makes ‘Radeon and GeForce clones’ because they build the actual GPUs that nVidia and AMD design.
      For all intents and purposes, these second source CPUs are Intel CPUs. Intel was the only one designing the x86 CPUs, even if other fabs also manufactured them (which was the point in that context anyway).

      What is your point?

      • k1net1cs says:

        “What is your point?”

        Likely trying to look overly smart.
        At least he tried…but IGN said “6/10 for looking up Wikipedia”.

        Funny how a “Reality Cop” who tried to call you out has to be directed to a collection of articles titled “Just Keeping It Real” for actual, real info on what you’ve done.

  14. OrgblanDemiser says:

    Sooo… who cares if AMD continues to exist? Does it hurt anyone? Personally, as long as my computer works fine and doesn’t cost me too much, I’m happy with that.

    • Scali says:

      It’s mostly AMD’s marketing and its fanboy following, which distort the truth, misleading/hurting customers.

      • OrgblanDemiser says:

        True. But isn’t that the case with most companies nowadays? I mean, just looking at some HDMI cable boxes makes me laugh sometimes (e.g. High speed 1080P ready, Gold plated and such). Internet providers displaying speeds in megabits instead of megabytes. Apple showcasing a good old tablet pen, calling it an “innovation”. (I’ll be careful and not extrapolate on this.) And to be topical with recent news: (“recent”) Volkswagen. (No need to add more :P)

        At this point it seems like the customer is taken for a fool at every corner. Fanboy or not, I guess you have to be careful and seek the truth backed by facts and not by advertisement money.

        So again, with AMD, I think people have to admit that when you buy their chips, you buy sub-par components. For budget builds I agree the price might be a valuable argument, but it’s sub-par nonetheless.

      • Scali says:

        Fanboy or not, I guess you have to be careful and seek the truth backed by facts and not by advertisement money.

        That is what this blog is here for.

      • semitope says:

        “That is what this blog is here for.”

        hahahhahaha

        you can’t be serious with that line. You know you are always bashing AMD. Really strange but you are a hateboy.

      • Scali says:

        But I am serious. Thing is, AMD has far more dubious marketing than most other companies (I mean take the recent HBM scam… you can’t really be defending that nonsense can you?), so I don’t cover Intel and nVidia as often. They do come along every now and then. You know what’s strange? AMD is just a marginal player in the CPU and GPU market these days, with < 20% marketshare in both arenas… So they aren't selling a whole lot of products compared to Intel and nVidia. Yet there are so many people always thrashing me and my AMD-related blogs (and only those blogs, the other blogs don’t receive such thrash-posts at all). It's amazing how rabid the following is of such a small and meaningless company.

      • semitope says:

        AMD doesn’t have dubious marketing, at best they have weak marketing. Dubious marketing is lying about 970 specs so it doesn’t look weaker than its competition. Dubious marketing is gameworks etc. What AMD does is at best a little misunderstanding here and poor statements there. Like overclockers dream, when they likely meant it can take tons of power and has watercooling.

        You are just extremely biased against them. There aren’t many people posting against you here and I just ended up back here after doing a search. Yet you think they are rabid, never mind the mountains of ignorant nvidia consumers who get crapped on at every turn. But you completely ignore what nvidia does to its consumers.

      • Scali says:

        Sounds like you don’t know what dubious is, and what isn’t. The claims about HBM’s bandwidth compensating for the lower memory capacity is an outright lie. GameWorks is not dubious at all. It does exactly as it says on the tin: it offers added value for nVidia hardware.
        If anything, the 970 specs were a misunderstanding/poor statement. nVidia responded by explaining how the 970 uses its 4 GB of memory in detail, so that is cleared up.

        I am not biased at all, but your statements clearly are. You are only proving my point further with posts such as this one.

      • semitope says:

        again, if its a lie fury x should not keep up with 980ti when vram becomes very important.

        Instead of claiming its a lie, why not figure out how the hell the lie seems true?

      • Scali says:

        But it isn’t true. There’s plenty of evidence of Fury X performance tanking when the vram-wall is hit.
        This review for example measures average frame times and such: http://techreport.com/review/28513/amd-radeon-r9-fury-x-graphics-card-reviewed/14
        As you can tell from their charts, and their conclusion, in certain memory-heavy games, there are spikes in the framerate, and it is not as smooth in 4k as the GeForce cards, which have more memory.
        Here is another review that shows similar data: http://hexus.net/tech/reviews/graphics/84170-amd-radeon-r9-fury-x/?page=12
        As they also say, it isn’t as smooth.
        Here is a third review, concluding the same: http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/22
        And an entire article investigating the issue here: http://www.extremetech.com/gaming/213069-is-4gb-of-vram-enough-amds-fury-x-faces-off-with-nvidias-gtx-980-ti-titan-x/2

      • semitope says:

        That looks like a mix.

        eg fury x renders more frames under 25ms than 980ti (33% vs 16%) even though slowest 1% (ONE PERCENT) takes longer on average. The witcher 3 results are not bad. Not confirmed to be VRAM related.

        Don’t see point of techreport link. Results aren’t bad

        For the extreme tech link the problem with assuming its HBM is that the same difference is seen at other resolutions. The issue exists for all the resolutions. It simply gets worse because the resolution and demand is higher.

        Anandtech assumes its due to HBM size, but at 1440 the same trend persists. Should we assume the fury x has more spikes due the game and driver or jump to assuming the hbm is an issue even though it is not an issue during most of the test. Also, when were the frame dips? during frame transitions?

      • Scali says:

        That’s the difference between you and me. I write my own graphics engines, and I can easily create scenarios that use a lot of memory, and benchmark them. I don’t need to rely on games to do that (games whose code AMD and NVidia have also analysed, and created driver ‘optimizations’ for, so you’re never sure what you’re testing exactly anyway). Even so, it is clear that everyone concludes that the AMD hardware doesn’t run as smoothly. So you will have to accept that as fact, even if you want to continue being in denial about 4 GB being the reason (what else could it possibly be?).

        “Drivers” would be a poor reason obviously, since Fury is not a new architecture. It’s the same GCN 1.2 they’ve been using for a few years now, so drivers should be quite mature.

      • semitope says:

        Are you an unreal engine developer?

      • semitope says:

        I was taking dubious to mean something worse. Its really a meaningless word in the capacity being used. Almost all companies concerned are very guilty of this.

        I dont get how you can defend nvidia, yet bash AMD for minor things. What huddy said could be a simple way of explaining what their engineers are doing with HBM. The important thing I remember is they said they had engineers specifically assigned to make the memory limitation less of an issue.

        yet here you are claiming it was a lie rather than realizing it works and trying to figure out what he was really saying.

      • Scali says:

        I am not defending nVidia. Difference is, nVidia admitted that they had published the wrong specs for 970, and explained how it worked. AMD doesn’t admit their lies, they just keep piling on new lies time and time again.

        And please, don’t try defending Huddy. He’s just a clueless marketing guy. I am a graphics developer with decades of experience in the field. I know the ins and outs of CPUs, GPUs and APIs, and what he says simply is not true, for technical reasons I have already explained earlier. No point in further discussion.
        So stop wasting your time.

      • semitope says:

        Huddy is not a clueless marketing guy. He has technical experience.

        Nvidia only spoke about the 970 issue when it was found out. lets take a guess if they would if nobody pointed it out.

        What lies should AMD come clean on? Odds are they are just perceived lies on your part.

      • Scali says:

        Really? There are a number of lies documented on this blog, which AMD has never come clean on.
        Eg, claims of Barcelona being 40% faster than any other CPU on the market at launch. Or Bulldozer having more than 17% higher IPC than Barcelona.
        Then there’s the tessellation issues.
        And what about all those claims about Mantle? Being an open standard, being a console API and whatever else.
        And now HBM.
        It has been proven on all counts that AMD’s claims were false. I don’t “perceive lies”, I am an expert in the field.
        Huddy understands as little about technology as the average fanboy. He’s proven that much. He even commented on some of my blogs personally, but wasn’t able to have a technical discussion. He just threw insults and threats around.
        But you probably understand even less about technology than he does, if you believe his crap. I suggest you talk to someone who actually has a clue. You’ll see that nobody with a clue will be able to contest anything I write on this blog on a technical level. Everything is 100% true and verified. I stand by that. Feel free to try and prove me wrong, but you’ll have to do that with technical arguments and proper evidence. I see neither.

      • semitope says:

        Barcelona looked like it ran into major bugs and severely cut the clock speed and some silicon. With Barcelona and Bulldozer there is always the question of “in what?” a single number won’t represent all cases of comparisons. if its 40% faster in anything and 17% faster in anything, its not a lie.

        Some of the claims you respond to are from users.

        eg. mantle being a console API? who said that?

        Mantle was never released. APPARENTLY things can change. Mantle is gone to vulkan which is good enough. I really do not get that complaint. They told us their plans, their plans changed and mantle went into vulkan. What is the issue? If dx12 wasn’t what it is they probably would have put out mantle IMO, they decided to back dx12 instead iirc.

        Not sure what tesselation you are talking about but users see it in their own gaming so I don’t know how its AMD lying about something. If users see no benefit to going over 16x tess yet suffer huge performance loss, then that’s that. Even disregarding AMD, its not to our benefit. AMD’s complaint is that using that much tessellation is overdoing it, gamers complain about the same thing. Even nvidia’s gamers complain about it. What is the issue?

        Already responded to your HBM claim. Huddy did not say that its because of faster swap to system RAM, he said the working set can be kept in HBM while swapping with system RAM doesn’t get in the way of the GPU. What you were responding to was likely some interpretation you heard from lay people.

        Go ahead and link to huddy’s comments, with proof it was huddy. From his education and history he seems technically competent. And you are clearly biased far more than he could be considering he has worked for AMD’s competition after working for AMD. Not sure why you feel you need to belittle the guy. He has more experience in these things than you. Being a graphics designer means next to nothing when he’s involved with much more than you. https://www.linkedin.com/in/richardhuddy

        I would understand if he chose to not spend much time with you. You do not see your own bias but its very obvious and he would be wasting his time.

        I realize you are just ultra sensitive to anything AMD does.

      • Scali says:

        So you’re still in denial? Well, it’s your choice to be ignorant.
        I see no point in discussing these issues with you again. Everything is already explained in the blog posts. Your excuses are pathetic in light of all the information already presented.

        Linking to Huddy’s comments is simple:
        https://scalibq.wordpress.com/2012/05/03/richard-huddy-comments-on-my-blog/
        https://scalibq.wordpress.com/2012/05/09/richard-huddy-responds-again/

        Feel free to email him to ask if it was really him. I did.
        Note that Andrew Copland (also a game developer, look him up in LinkedIn if you like) also asks the same questions… but Huddy is unable to answer anything of a technical nature.

        Note also that at the time there was no information about Mantle yet… but even so, everything said back then turned out to be true. We’re not going to drop DirectX, and we’re not going to drop hardware abstraction. Even Mantle, being specific to the GCN-architecture, had some level of abstraction, instead of just programming the GPU directly.
        As you see, neither Johan Andersson, Michael Gluck, Andrew Copland, nor myself were expecting to drop the API. We’re all graphics developers. Huddy was the only one claiming something different, and he was wrong.

        Also, see the link already in the HBM article: https://www.reddit.com/r/Amd/comments/3xn0zf/fury_xs_4gb_hbm_exceeds_capabilities_of_8gb_or/
        As Huddy is quoted there: 4GB HBM “exceeds capabilities of 8GB or 12GB”.
        So he literally says that 4 GB of HBM is better than 4 GB, or even 8 to 12 GB of GDDR5. Since the only difference between HBM and GDDR5 is bandwidth, he attributes more bandwidth to be a substitute for 8 or 12 GB.
        It’s all there, you’re just trying hard to remain in denial.

      • semitope says:

        I just looked through the Huddy stuff. In the post you said he replied to, you were working off an assumption that he wanted there to be no API. You claim this is dumb, but why do you assume he wouldn’t know an API is necessary, and just jump to assuming he must really mean there should be no API? Then someone corrects you (basically destroys your entire post) by pointing out Huddy was not the one asking for the API to go away, and I would assume developers did not mean for the API to literally go away, but to get less in the way.

        The fact that you missed something so obvious in the interview (VERY obvious) and decided to bash Huddy for it should clue you in to your bias. The text you linked to says:

        “Huddy says that one of the most common requests he gets from game developers is: ‘Make the API go away.'”

        Not hard to know where the statement comes from. You should not have titled the post “Richard Huddy talks nonsense again”.

        I see Huddy points out it wasn’t his opinion. You still attack the guy even though there’s no error in what he is saying. Some developers said that. End of story.

        You do not agree with those developers apparently, but why take issue with Huddy?

        Copland’s comment was misguided as well. He should email Andersson and ask Huddy for other developers who shared the sentiment so he can have his question answered. I suspect he made that comment because you made it seem like these were Huddy’s comments. In the very blog post, the only thing Huddy said was that repi gets it. The larger quote was from Andersson himself.

        Also, developers would not be forced into anything. Just because lower-level access might be possible does not mean it has to be used. E.g., IIRC Naughty Dog did really low-level stuff for the Uncharted games on PS3; that doesn’t mean other developers had to get that hardcore. Simply having the capability to do it might be what some developers want.

        Huddy wasn’t claiming the API needed to go away, he was saying developers said that. AND a reasonable assumption would be that the developers do not mean it should go away, but say that in the sense that it should get OUT of the way, because DX11 might have been a problem for them.

        Again, hypersensitive.

        Exceeds capabilities, not capacity. I assumed they were estimating, and stretching it with up to 12GB, but it depends on how VRAM works and is used by the GPU. If you’re going to store more than 4GB of data to be used right away, then probably no. But clearly their claim is along the lines that GDDR5 has a lot of inefficiencies, and the way HBM works with their optimizations gets rid of some of them.

        I doubt any GDDR5 GPU would be able to process that much at the same time anyway.

      • Scali says:

        “Destroys your post”… heh, not quite. Anyway, I already point to that ‘other developer’, that’s Johan Andersson of DICE (repi), and I quoted him verbatim. He does NOT ask for the API to go away, that much is clear from his quote. Huddy misrepresented Andersson’s statement. I even contacted Andersson himself about it, but he did not want to side with Huddy on this. I think we’re done here anyway. You can’t even keep track of all the things that have already been discussed, and your posts aren’t adding any value. It’s just noise.

        Stupid nonsense about GDDR5 vs HBM. As already stated, it’s just memory. The only difference is bandwidth. Memory management is done in software, not in GDDR5 or HBM itself. So nothing changes there. AMD basically claims that they can suddenly do much more efficient memory management now that they have HBM. Which is a load of BS.
        Besides, as already said… when you run out of vram, the memory bottleneck is the PCI-e interface to system memory. This is orders of magnitude slower than GDDR5 or HBM, making the vram technology completely irrelevant for performance in this case.
        I mean, how little do you understand about computers?
        If you want to copy from A to B, and your access speed from A is lower than that to B, you will never be able to copy faster than the speed of A.
        Look it up for yourself. Even a Skylake with overclocked DDR4 can do only about 50 GB/s max: http://www.legitreviews.com/ddr4-memory-scaling-intel-z170-finding-the-best-ddr4-memory-kit-speed_170340/2
        Even a mainstream GDDR5-based card such as the GTX950 already has twice that: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-950/specifications
        A GTX 980Ti or Titan X’s memory is more than three times as fast still.
        So if the memory you’re copying from is so much slower than your GDDR5 or HBM, how is HBM going to make a difference? It isn’t. The bottleneck is on the other side.
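
To put rough numbers on the bottleneck argument above, here is a minimal Python sketch. The function name and the figures are illustrative assumptions: about 50 GB/s for system RAM and about 100 GB/s for a GTX 950 follow the links above, while the PCI-e 3.0 x16 figure (roughly 16 GB/s per direction) and the 512 GB/s HBM figure for the Fury X are approximate peak numbers, not measurements.

```python
def copy_throughput_gbps(source_gbps: float, link_gbps: float, dest_gbps: float) -> float:
    """Sustained copy rate is capped by the slowest stage on the path."""
    return min(source_gbps, link_gbps, dest_gbps)


# Approximate peak bandwidths in GB/s (assumed, for illustration only).
SYSTEM_RAM = 50.0     # overclocked DDR4 on Skylake, per the review linked above
PCIE3_X16 = 16.0      # PCI-e 3.0 x16, one direction
GDDR5_GTX950 = 100.0  # "twice that", per the spec page linked above
HBM_FURY_X = 512.0    # first-generation HBM on Fury X

to_gddr5 = copy_throughput_gbps(SYSTEM_RAM, PCIE3_X16, GDDR5_GTX950)
to_hbm = copy_throughput_gbps(SYSTEM_RAM, PCIE3_X16, HBM_FURY_X)

# Both copies are limited by the PCI-e link, so the VRAM type makes no difference.
print(to_gddr5, to_hbm)
```

Under these assumptions both copies run at the speed of the PCI-e link, which is the point being made: once data has to come from system memory, the bandwidth of the VRAM on the receiving end does not change the result.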

      • semitope says:

        Did Huddy even link to that statement from Andersson? Anyway, you are taking a negative interpretation as usual: first assuming Huddy must be an idiot for saying something he didn’t, then, when you find out he didn’t, you try to claim he must be lying that developers said that, and even if they did, they didn’t.

        A sensible interpretation would be they say they want the API out of the way. w.e.

        Andersson did not want to side with Huddy or did he just not bother replying to you? I am sure those two speak often enough. I doubt repi is the type to waste his time on someone like you.

        The only difference between HBM and GDDR5 is bandwidth? And you claim Huddy says dumb things…
        Between the internal operation of HBM, the memory controller, the connection to the GPU and their own software optimizations, maybe they can do more.

        Yes when you “run out of VRAM” i.e. when what you have in VRAM is not what the GPU needs, you have an issue. Nobody denies that I think.

        Maybe message AMD’s engineers for an explanation of how their use of HBM differs from GDDR5’s typical usage.

      • Scali says:

        What the hell do you want anyway?
        It’s crystal-clear: Andersson simply spoke out that he would have liked a more low-level interface to the hardware, like on consoles (but not direct hardware access without any kind of abstraction layer).
        Huddy misinterpreted that, and in the bit-tech article you can clearly see him using the words “drop the API”, and other parts of his story also indicate that he is pushing for direct hardware access. That is NOT what developers were talking about. They know the downside of direct hardware access, and they know it’s never going to work on a heterogeneous platform such as the PC.

        And yes, the only difference between HBM and GDDR5 is indeed bandwidth, at least as far as the rest of the system is concerned. The internal operation of HBM, the memory controller and connection to the GPU aren’t relevant. The rest of the system doesn’t see this, and it isn’t relevant. The net result of these differences is just higher bandwidth for the system.
        “Software optimizations”… that’s nonsense of course. You can perform the same optimizations for any type of memory, and these have been done for years already.

        Yes when you “run out of VRAM” i.e. when what you have in VRAM is not what the GPU needs, you have an issue. Nobody denies that I think.

        You are. Because you’re the one arguing that 4 GB suddenly isn’t 4 GB when it’s HBM. So ‘running out of VRAM’ is somehow different when you have HBM?
        In the real world, 4 GB is 4 GB, regardless of what memory technology or speed. It fits exactly 4 GB. So you always run out of memory at the exact same point, namely at the 4 GB threshold.

        Why would I need to message AMD engineers? I already know it isn’t going to work. Magic doesn’t exist. Besides, if there was some kind of magic to it, then it would be Huddy’s job to talk to the engineers, and make some kind of press release about this magic. Instead, we got smoke and mirrors… and cards that clearly exhibit spiky performance in memory-hungry games.

      • semitope says:

        This is just too weird. Let me get this straight: your main objection to dropping the API etc. is just that it would be difficult and therefore no developer would want it? It would be tough, so Huddy must be an idiot for even talking about that kind of situation? No way Andersson would ever want that kind of access to a GPU? You’re using your personal opinion to call him, and possibly any developer who actually did voice these views, an idiot. You try to say Andersson didn’t because you do not want to look a fool for calling someone like him an idiot. Why excuse him for his statement by claiming it’s an ideal, and not anyone else? What if he really wanted that situation to come about?

        So what if the rest of the system does not see it? It’s still a factor, and one the software could exploit in a way not possible with GDDR5.

        Not claiming 4GB isn’t 4GB, I am saying how you use 4GB of HBM can be different from how you use 8GB of GDDR5.

        Your evidence for issues with high memory usage was dubious.
        You would ask the engineers because you do not know. You are just brushing everything aside and pretending HBM is GDDR5. You did no research before going off on your biased witch hunt, most of which (probably all) would never have happened if NVIDIA was the one making the case for 4GB HBM etc.

      • Scali says:

        STFU and RTFA, I have explained in great detail why an abstraction layer is required.

        Also, I *am* an engineer, unlike Huddy. I don’t need to ask AMD’s engineers, I know everything they do… and by the looks of it, a considerable deal more, given the fact that I pointed out in great detail that Bulldozer wasn’t going to work in practice, more than a year before they had the actual CPU on the market.

      • semitope says:

        You explained why it’s preferred, IIRC. It’s not a requirement, and you cannot say that just because you think it should be so, all developers would want it so.

        AFAIK Bulldozer CPUs work in practice. Otherwise AMD would be in a lot more trouble after selling CPUs that wouldn’t turn on.

        Sorry if you have been involved in GDDR5 and HBM development and know the tech in detail; I thought you didn’t.

      • Scali says:

        So you don’t get it. TL;DR: If you don’t use an abstraction layer, your code will only work on the exact hardware you’ve targeted. You lose any kind of backward and forward compatibility. I suggest you read up on 8088 MPH and just how picky it is in regards to hardware (and you thought PC compatible meant compatible…). Acceptable for a demoscene prod, but not for any games or other commercial software.
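
The compatibility argument above can be sketched in a few lines of Python. Everything here is hypothetical and invented for illustration (the `GraphicsDevice` interface, the two vendor classes and their "register" strings); it is not any real driver or API, only a minimal picture of what an abstraction layer buys you.

```python
from abc import ABC, abstractmethod


class GraphicsDevice(ABC):
    """Hypothetical abstraction layer: the application only sees this interface."""

    @abstractmethod
    def clear(self, color: int) -> str:
        ...


class VendorA(GraphicsDevice):
    def clear(self, color: int) -> str:
        # Made-up hardware detail: this chip clears via a register write.
        return f"A: write 0x{color:06X} to reg 0x10"


class VendorB(GraphicsDevice):
    def clear(self, color: int) -> str:
        # Made-up hardware detail: this chip clears via a command packet.
        return f"B: push FILL cmd 0x{color:06X}"


def render_frame(device: GraphicsDevice) -> str:
    """Application code written against the interface runs on any device."""
    return device.clear(0x000000)


def render_frame_direct(device: object) -> str:
    """Code written for VendorA's hardware only: anything else is unsupported."""
    if not isinstance(device, VendorA):
        raise RuntimeError("unsupported hardware")
    return "A: write 0x000000 to reg 0x10"


print(render_frame(VendorA()))
print(render_frame(VendorB()))
```

`render_frame` keeps working when a new device class is added, while `render_frame_direct` only works on the exact hardware it targeted, which is the loss of backward and forward compatibility described above.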

      • semitope says:

        What I am curious about is what position you take on this:

  15. aron says:

    Talking about the used market is pretty short-sighted. Think about it: given the growth in people and businesses using computers, if new processors don’t compete with used ones, that is going to drive the price of old processors up. It won’t drop the prices of new processors, though. Even so, not everyone can go used; eventually those would be gone. Either that, or we would be using really, really shitty processors that don’t support modern stuff (such as DDR3/4 RAM, PCI slots even for graphics cards (want to start digging around for those old AGP cards?), motherboards), and it would have an adverse effect on the other lines of production. Who would purchase DDR4 memory if the only CPU they can find is an Intel Q6600 or an old AMD x64? Either that, or in order for those companies to remain in business, we would have to go back to older technologies.

    • Scali says:

      Most people don’t buy processors, they buy complete systems. You won’t be able to buy a new system with a used processor. And especially for business users, used systems aren’t an option. They need a reliable system with a proper service contract. So new is the only option.

  16. Stashix says:

    Firstly, thanks for the blog; it makes for some interesting reading.
    I would be interested in your take on the whole GameWorks deal, especially since it seems to me there are a lot of unsubstantiated claims circulating.

  17. Dawis says:

    Clearly there has always been some confusion between marketing terms and the technology behind them.
    I myself have seen an advertisement in a newspaper for “Intel Hyper Trading”, so not even Intel can spell what it is selling sometimes.

    But overall I do not get the frustration with CMT. Clearly, as described, it should not be positioned as CMT vs SMT, but rather CMT vs Hyper Threading (besides, was it not Intel that relabeled their HT as SMT?). Though when you look at it, it makes sense to compare CMT vs HT, and it does make sense to say CMT scales better than HT. If we take out of the equation such things as die space, CMT clearly adds more execution resources to the table per thread, so the performance scales better per virtual core. Of course, the problem is that each actual core is larger and it is more expensive to add one core.

    If you analyse the architecture historically, Intel implemented this tech mostly as a marketing tool, as the NetBurst architecture was not really built for SMT. NetBurst was quite a narrow core with a lack of parallel execution units, so it gained little from HT. The irony is that the same technology implemented back then for K7 would arguably have given much better results, as the K7 core had better theoretical IPC and potentially more unused execution resources.

    In general I think what AMD did by adding 4 ALU’s to one core makes sense if you want to work with SMT. The problem is that with CMT they limited it to 2 ALU’s per thread, which really hurt their IPC for single-threaded applications.

    On the other hand, it might make sense in the server space, where for a large number of workloads the average IPC is usually <1, so CMT would guarantee 2 ALU's for each thread, which could theoretically improve performance for heavily multi-threaded server workloads, where single-threaded performance does not matter.
    Of course, if the single-core performance of Intel CPU's is sometimes 50% or more better than AMD's due to an edge in architecture, process node and application-level optimizations, your benchmark results will show one picture regardless of whether CMT scales better for server loads. The effect of CMT will not be able to close the huge gap, even if it was the right decision.

    Overall I think CMT was not the right move for AMD, as it hurt their single-threaded performance too much, which is what matters in most benchmarks and in the desktop space. If there was any effect on server workloads, it was too little to close the gap in those few server workload benchmarks where it surfaced.

    • Scali says:

      Clearly, as described, it should not be positioned as CMT vs SMT, but rather CMT vs Hyper Threading (besides, was it not Intel that relabeled their HT as SMT?).

      No, as per my argument, HT and SMT are pretty much equivalent, where CMT is not equivalent to either.

      Though when you look at it, it makes sense to compare CMT vs HT, and it does make sense to say CMT scales better than HT.

      This depends on your definition of ‘scaling’.
      As I point out, if you look at ‘transistors-per-workload’ scaling, HT/SMT is far superior to CMT.

      If we take out of the equation such things as die space, CMT clearly adds more execution resources to the table per thread, so the performance scales better per virtual core. Of course, the problem is that each actual core is larger and it is more expensive to add one core.

      Since the whole point of SMT is to improve multithreading performance, we can only conclude that CMT does not work. CMT is basically throwing a huge amount of extra silicon at the problem, which is what we were doing anyway before we had SMT.

      If you analyse the architecture historically, Intel implemented this tech mostly as a marketing tool, as the NetBurst architecture was not really built for SMT. NetBurst was quite a narrow core with a lack of parallel execution units, so it gained little from HT.

      Okay, now we’re getting into complete nonsense-territory.
      Firstly, ‘lack of parallel execution units’? Pentium 4 has plenty of parallel execution units:

      Secondly, the argument for Pentium 4’s HT has always been that it works because of the Pentium 4’s low IPC, where a lot of execution units sit idle (so pretty much the opposite of what you try to claim now).
      I was one of the few people at the time who pointed out that the HT-principles would also work on CPUs with high IPC, which Intel has since proven with the Core i7 line.

      • Dawis says:

        The picture you posted is a bit misleading.
        It makes it look like Netburst had 4 ALU’s, where in fact it had 2 double pumped ALU’s which could execute some transactions at double speed.
        Also, if I remember correctly, Prescott and later cores dropped those double pumped ALU’s in favor of an even longer pipeline and higher frequencies.

        Nevertheless, 2 double pumped ALU’s is not really the same as 4 ALU’s in terms of parallel execution units. Also note that complex ALU instructions had only one slow ALU block.
        In practice I still claim NetBurst was pretty narrow compared to K7 at the time. To say it had a lack of execution units would be false, yes, but compared to other architectures like K7 or Pentium III it was a narrow architecture with a super long pipeline.
        Do not get me wrong: it was a very interesting and clever attempt by Intel engineers, and for certain scenarios it was able to show very good performance. As a general-purpose architecture it was eventually losing to the more conventional K7, even though AMD was always at least one process node generation behind Intel. The best proof of that is that Intel went back and took the PIII architecture as the basis for their next-generation flagship architecture.

        In practical scenarios P4 did not gain much from HT, which is why I claim it was more marketing than performance technology at the time. Even today the leading performance factor is still single-threaded performance, which is exactly why CMT fails.

        The last point you make is not very humble though: you claim to be one of the few people at the time who pointed out that HT principles would also work on CPU’s with high IPC? I mean, SMT technology has been researched since 1968 and Intel was not the first to implement it either. Surely more than a few noticed it was good, especially with CPU’s with high IPC.

      • Scali says:

        The picture you posted is a bit misleading. It makes it look like Netburst had 4 ALU’s, where in fact it had 2 double pumped ALU’s which could execute some transactions at double speed.

        There’s nothing misleading about it, it lists two double-pumped ALUs. It would be misleading if it had 4 blocks marked ‘ALU’, rather than two blocks marked ‘2x ALU’.

        Nevertheless, 2 double pumped ALU’s is not really the same as 4 ALU’s in terms of parallel execution units.

        Nobody made any such claim.

        Point remains that it has two ALUs, two AGUs, and various pipelined FPU/SIMD execution units. Given that the P4 can retire only 6 ops per 2 clks, it is far wider than required (in a perfect world it would only need three execution units that retire an instruction every cycle to get this throughput. It has far more than 3 execution units, and some of these units can even retire 2 ops per clk because they’re double-pumped).
        Hence, claiming it is a ‘narrow’ CPU is completely ridiculous. It is wider than the P6 architecture that went before it:
        http://img.tomshardware.com/us/1999/08/09/the_new_athlon_processor/p3-architecture.jpg
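
The width arithmetic in this reply can be checked with a short Python sketch. The retire rate (6 ops per 2 clocks) and the unit counts (two double-pumped ALUs, two AGUs) are taken from the comment above; the helper name and the assumption that each AGU handles one op per clock are illustrative, and the FPU/SIMD units are left out entirely.

```python
import math
from fractions import Fraction


def min_units_needed(retire_rate: Fraction, ops_per_unit_per_clk: int) -> int:
    """Fewest execution units that could sustain the retire rate in a perfect world."""
    return math.ceil(retire_rate / ops_per_unit_per_clk)


# Retire bandwidth quoted above: 6 ops per 2 clocks.
retire_rate = Fraction(6, 2)

# In a perfect world, units that each retire one op per clock would suffice.
needed = min_units_needed(retire_rate, 1)

# Peak integer-side capacity per clock from the units named above.
alu_peak = 2 * 2  # two ALUs, double-pumped: 2 ops per clock each
agu_peak = 2 * 1  # two AGUs, assumed 1 op per clock each
integer_side_peak = alu_peak + agu_peak

print(retire_rate, needed, integer_side_peak)
```

Under these assumptions the integer side alone can issue more ops per clock than the core can retire, before the FPU/SIMD units are even counted.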

        In practical scenarios P4 did not gain much from HT, which is why I claim it was more marketing than performance technology at the time.

        Based on what metric?
        There are examples where a Pentium HT gets 30-50% more performance from a second thread.
        Of course, in the early days, OSes and software weren’t very good at putting HT to good use. Windows XP received various tweaks to make its synchronization primitives more HT-friendly for example.

        you claim to be one of the few people at the time who pointed out that HT principles would also work on CPU’s with high IPC? I mean, SMT technology has been researched since 1968 and Intel was not the first to implement it either. Surely more than a few noticed it was good, especially with CPU’s with high IPC.

        Most people only know x86, and only have a very limited understanding of it at that (you are one of them, by the way).
        I am one of the few who understands SMT in general, in more than just the x86-context (also having knowledge of IBM’s POWER architecture and Sun’s Niagara for example). Which means indeed I am among a very small minority who understands that HT would also work on an x86 CPU with high IPC. Most people were arguing that it was just a ‘dirty hack’ to mask Pentium 4’s poor IPC, where Athlons wouldn’t need it because they were more efficient.
        I said: “Just wait and see”, and we did. Not only did Core i7 deliver higher IPC than any x86 architecture that went before it, but HT also works very well on this type of CPU.

  18. Dawis says:

    I am also not claiming CMT works. I think the engineers had some idea though, most likely server-space applications, where it is important to guarantee that a particular thread will get execution resources (hence at least 2 ALU’s available), but it is not important for any thread to get a lot of them (4 ALU’s), since the average IPC of the workload does not demand it.

    I am also not a fan of CMT as it hurts desktop performance, but they might have had a point for server workloads. Besides, comparing them to Intel does not really compare CMT with HT, since there are many more factors: Intel running on a more efficient process node, workloads being optimized for Intel, Intel in general outperforming AMD.

    A fair comparison would be if AMD supported both CMT and SMT as modes of operation. Then we could compare different workloads and see if some of those scenarios benefit from a CMT setup.

    I am just saying that locking 2 ALU’s per thread might deliver advantages in some scenarios.

    • Scali says:

      I am just saying that locking 2 ALU’s per thread might deliver advantages in some scenarios.

      Don’t you see how backward this logic is?
      “Locking 2 ALU’s per thread?”
      That doesn’t even make sense.
      If you don’t implement any special kind of threading technology, but just make a vanilla multicore CPU, then by definition your ALUs and all other resources are ‘locked per thread’.
      You make it sound like this is a feature of the CPU, where it isn’t.

      The problem is… older AMD CPUs had 3 ALUs per thread, now they only have 2. They had to remove a few ALUs in order to get the transistor count down, so that they could scale the core count up.
      Once you understand that, you understand my position: SMT also scales the core count up, but instead of having to remove logic and hurt your IPC, you simply add some extra logic to run two threads on one physical core. It’s a very elegant approach.

      My guess is that AMD wanted to avoid the patents around SMT, so they came up with something that is not-quite SMT, so they don’t have to pay any licensing fees.
      These patents have now expired, so AMD will probably be moving from their ‘not-quite’ SMT to full SMT.

      • Dawis says:

        I fully understand and support your position that scaling down ALU’s was a bad move. I even mentioned it several times, that only 2 ALU’s available per thread certainly limits IPC, which is why this generation of AMD CPU’s (was it K10?) offered lower performance per clock cycle than their previous generation that had 3 ALU’s per thread.

        My point was that adding one ALU per core was probably a step in the right direction, as it does not add too much silicon and widens the core. Unfortunately they limited the ALU count to two per thread, but they probably had some research to back it up. Without knowing their considerations it is quite hard to say that it was all bad.

        Do I agree it was the wrong move? Yes, but again, I would be surprised if they did not have a good reason for it. It would be very stupid to implement such an architectural decision without an engineering reason, or without any research or data to back it up.

        Your guess for AMD wanting to avoid patents is probably right.

        My guess was that it was easier for them to implement CMT, as they do not have to think about balancing thread performance that way. SMT offers more flexibility, but it might be harder to implement, and you also have to make sure that no thread is starved of execution resources. For desktop it probably makes no sense, but in the server space it is probably important to guarantee a minimum performance level for all threads.
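
The trade-off discussed in this exchange (a fixed 2 ALUs per thread versus a shared pool of 4) can be illustrated with a deliberately simple toy model in Python. This is only a sketch under stated assumptions: "demand" is the number of ALU ops a thread could issue in one cycle, the function names are made up, and the round-robin sharing policy is an invented stand-in, not how Bulldozer or any real SMT scheduler works.

```python
from typing import List


def static_partition(demands: List[int], alus_per_thread: int = 2) -> List[int]:
    """CMT-like toy: each thread owns its ALUs and cannot borrow idle ones."""
    return [min(d, alus_per_thread) for d in demands]


def shared_pool(demands: List[int], total_alus: int = 4) -> List[int]:
    """SMT-like toy: hand out ALUs one at a time, round-robin, so no thread is starved."""
    grants = [0] * len(demands)
    remaining = total_alus
    progressed = True
    while remaining > 0 and progressed:
        progressed = False
        for i, demand in enumerate(demands):
            if remaining > 0 and grants[i] < demand:
                grants[i] += 1
                remaining -= 1
                progressed = True
    return grants


# One busy thread next to an idle one: the shared pool lets it use all 4 ALUs.
print(static_partition([4, 0]), shared_pool([4, 0]))
# Two busy threads: both schemes end up giving each thread 2 ALUs.
print(static_partition([3, 3]), shared_pool([3, 3]))
```

In this toy model the shared pool never does worse than the fixed split and does better whenever one thread leaves ALUs idle, while the fixed split needs no arbitration logic at all, which is roughly the flexibility-versus-simplicity trade-off being debated here.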

  19. Dawis says:

    You are being a little arrogant:
    “Most people only know x86, and only have a very limited understanding of it at that (you are one of them, by the way).”

    You have no basis at all to make that claim, so forgive me if I think you are a bit arrogant.
    You don’t have to put someone down to be smart and feel special. Everyone notices it is a good blog with well-argued opinions about CPU architecture and many other topics. But you should know that a well-argued opinion is not thereby universally true. Insulting the other party does not contribute to a productive discussion, and arguments against your claims are not an insult to your intelligence.

    And technically you were wrong. Most people don’t even know or care what x86 is. It does not make them dumb either.
    But I get what you meant. Most people who think they know great deal about CPU’s probably don’t even know that most of their knowledge is about x86 architecture.

    x86 is also probably too wide a term, as today it is only an instruction set with no implications about architecture. It was born as a true CISC architecture, but these days it’s more like RISC execution units with a CISC instruction set.

    • Scali says:

      You have no basis at all to make that claim,

      Sure I do. Non-x86 systems only have a very small marketshare, hence only a very small amount of people use them. And only a small subset of those develop on these systems. A smaller subset of them still knows about these systems at the instruction/microarchitecture-level. Those are facts.

      And technically you were wrong

      Wrong on what?

      x86 is also probably too wide a term, as today it is only an instruction set with no implications about architecture. It was born as a true CISC architecture, but these days it’s more like RISC execution units with a CISC instruction set.

      If you’re trying to sound smart, you failed.
      This is completely irrelevant to the point I was making. In the x86-world (which includes all x86-compatible CPUs ever made, by any vendor), the only SMT-implementation comes from Intel. For most people, the Pentium 4 was the first and only SMT-capable CPU they knew.
      Their ignorance is what led them to claim it wouldn’t work on high-IPC architectures. Despite the fact that various other architectures implemented various kinds of SMT very successfully in the non-x86 world.
      And even if you only know x86, if you understand the instruction set well enough, and know how well the execution units are used even in an architecture that we consider high IPC (in x86 terms, that is), you’d have to come to the conclusion that there are still plenty of idle units.
      But most people don’t actually know a whole lot about CPUs. They just think they do, and then start cheerleading for their favourite brand.

      • Dawis says:

        “Sure I do. Non-x86 systems only have a very small marketshare, hence only a very small amount of people use them. And only a small subset of those develop on these systems. A smaller subset of them still knows about these systems at the instruction/microarchitecture-level. Those are facts”

        Yes these are facts, but these facts do not in any way imply I only know x86. These facts imply there is a large chance I am one of those people; they do not imply I am one of those people. I am sure you understand the difference.
        Based on those facts and that method of deduction one could conclude the same about you.

        Hence it was right for me to assume you are arrogant based on that.

      • Scali says:

        Yes these are facts, but these facts do not in any way imply I only know x86

        Then you are ignoring all the evidence in the posts you made here so far. These posts clearly show a lack of understanding, even of x86 architectures such as the P6, and SMT in general.
        Besides, you are pulling this quote out of context now. I only responded to the fact that I can claim I am part of a small minority of engineers who have hands-on experience with non-x86 CPUs.

        Based on those facts and that method of deduction one could conclude the same about you.

        Given the fact that this blog contains tons of articles about non-x86 processors, I don’t have anything to prove in this respect.
        You’re starting to get quite annoying. As I always say, never take a knife to a gun fight.

      • Dawis says:

        ” For most people, the Pentium 4 was the first and only SMT-capable CPU they knew.
        Their ignorance is what led them to claim it wouldn’t work on high-IPC architectures. Despite the fact that various other architectures implemented various kinds of SMT very successfully in the non-x86 world.”

        Well, I am not most people, and nobody here made the claim that SMT would work worse for architectures with high IPC. I was simply saying it is a bit arrogant to assume you were among the first ones in the world to deduce that SMT will have even better results for architectures with high IPC.

        I mean, I guess compared to the rest of the world, yes, you were one of the first ones, because 99% of the world does not even know what a CPU is, 99% of the remaining 1% does not know what SMT is, and I suppose 99% of the rest probably had no idea.

        You figured that out and that is good, but that still does not make you a visionary. I am sure it was no surprise to CPU chip designers for example and practically everyone who understands these concepts at a fair level.

        You make it sound as if AMD and Intel should hire you to lead their chip design projects.

        Peace.

      • Scali says:

        nobody here made the claim that SMT would work worse for architectures with high IPC.

        You actually made the opposite claim: that SMT didn’t work well on the Pentium 4 because the Pentium 4 was a ‘narrow’ architecture.
        I just gave some historical perspective on that. The general consensus was that HT worked quite well on the Pentium 4, and they didn’t expect it to work as well on a CPU with higher IPC. The reason for this is that the Pentium 4 had relatively many idle units on average (the opposite of what you claimed basically).

        I thought your claim was rather funny, and clearly demonstrates that you weren’t around in the Pentium 4 age. You try to project your understanding of current CPUs to that era, and fail to understand the state of the world back then. You can’t hide ignorance.

        I was simply saying it is a bit arrogant to assume you were among the first ones in the world to deduce that SMT will have even better results for architectures with high IPC.

        Again, pulling things out of context/misrepresenting my statements. I never said I was ‘among the first’, I said I was one of the few who didn’t believe HT only worked on the P4 because it had such low IPC.

        You make it sound so that AMD and Intel should hire you to lead their chip design projects.

        Perhaps they did.

      • Dawis says:

        “I thought your claim was rather funny, and clearly demonstrates that you weren’t around in the Pentium 4 age. You try to project your understanding of current CPUs to that era, and fail to understand the state of the world back then. You can’t hide ignorance.”

        I was around in Pentium 4 age, but depends on what you mean by being around – I was never part of some Pentium 4 HT discussion group. I simply read things that I could find on web and books, and I remember thinking back then that HT would work better with CPUs that are able to do more per clock cycle. I was not then and am not now a chip architect or designer – I was simply a computer science student who found that subject fascinating and did a lot of free time digging.

        I was not part of Intel engineering nor any other SMT discussion community, so I was never aware that HT was introduced for Netburst because real life IPC left a lot of unused execution potential.

        “Perhaps they did.”
        I take it that was Intel, given you are such a fan.
        Not possible to be objective then.
        Must say that Intel is great company, but I always had some room for AMD given how much they achieved from underdog position in all areas (available funds, available technology and process node etc.) and how much they actually influenced later Intel decisions: Abandoning RIMM, abandoning Itanium (ok it was not AMD – their own x86 was too strong for Itanic), abandoning Netburst, putting memory controller on die and such.

        I also did not like how Intel behaved in market when AMD 286 or K7 had a chance to shine. When you read pages like these Intel also does not have the best appeal:
        http://www.intel4004.com/

        It must be a great company to work for these days though.

      • Scali says:

        depends on what you mean by being around

        I mean actually studying the microarchitecture in detail, and hand-optimizing code for the microarchitecture. The only way to really study what a microarchitecture is really capable of.

  20. Dawis says:

    BTW comparing two architecture diagrams like P6 vs Netburst alone does not entirely prove that one is wider than another. My definition of wider does not mean more execution units, but rather ability to use them in parallel in one clock cycle.
    I have to say I am not an expert in P6, but Core architecture was largely based on P6 rather than Netburst – I would not be surprised by P6 outperforming Netburst clock per clock.

    Netburst was not really known for its good IPC, granted it was also due to its long pipeline having effect on branch misprediction penalty and many other effects.

    Do you have data on how many instructions Netburst was able to retire per cycle and how many instructions P6 was able to retire per cycle? Of course we have to take into account that theoretical maximum is different than realistic scenario.

    • Scali says:

      My definition of wider does not mean more execution units, but rather ability to use them in parallel in one clock cycle.

      Given the fact that both the PIII and P4 use an out-of-order architecture where micro-ops can be sent to any execution port in the CPU, how is there a difference between the two?

      Do you have data on how many instructions Netburst was able to retire per cycle and how many instructions P6 was able to retire per cycle?

      As I already said, the P4 does 6 uOps per 2 clks. P6 does 3 uOps per clk. Effectively it’s the same.
      You know, you could just look this info up yourself. I see no point in you asking me about this.
      In fact, I don’t even understand why you’re arguing here, since you say you don’t know much about P6 anyway. So by your own admission you don’t know what you’re talking about. How can you judge the P4, if you don’t know what its predecessor was capable of?
      Your statements are null and void.
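
The equivalence of the two quoted rates can be checked with exact arithmetic. The snippet below is only a sketch of that comparison: it takes the "6 uOps per 2 clks" and "3 uOps per clk" figures as stated in the comment above, and the helper name `average_rate` is made up for illustration. It says nothing about how the micro-ops are distributed over the individual clocks, only about the average.

```python
from fractions import Fraction


def average_rate(uops, clocks):
    """Average micro-ops per clock as an exact fraction (hypothetical helper)."""
    return Fraction(uops, clocks)


# Figures as quoted in the comment thread, not independently measured.
p4_rate = average_rate(6, 2)  # "6 uOps per 2 clks"
p6_rate = average_rate(3, 1)  # "3 uOps per clk"

print(p4_rate, p6_rate, p4_rate == p6_rate)
```

Both rates reduce to 3 micro-ops per clock on average, which is the sense in which the two are "effectively the same".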

      • Dawis says:

        I have not read any good articles about P6. I have studied Netburst (various generations of it) as well as AMD Hammer architecture much much more – in Intel books and some really good articles on the web.

        I found these articles to go really in depth at a time:
        http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
        http://www.chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html

        Saying both CAN execute 3 micro operations per clk does not mean they perform the same clock for clock. It has much more depth in it as they are very different architectures.

        We tend to simplify architecture by drawing blocks of ALU’s, FPU’S, AGU’s etc., but just because certain architecture can retire 3 operations per cycle given certain circumstances does not mean it always retires 3 – in fact most real life scenarios will combine instructions in way that CPU will not be able to retire 3 in cycle, but much less on average.

      • Scali says:

        I have not read any good articles about P6.

        You can start with the Intel Optimization Manuals, and then read Agner Fog’s documentation on x86 microarchitectures.

        Saying both CAN execute 3 micro operations per clk does not mean they perform the same clock for clock.

        Nobody made such a claim. You claimed however that:
        “NetBurst was quite narrow core with lack of parallel execution units so it gained little from HT.”

        That is what I responded to. Nothing to do with clock-for-clock comparisons. Just about ILP.
        You were also the one who came up with this metric of how many instructions a CPU can retire per cycle. I merely answered your question. Now you want to move the goalposts again when the answer apparently is not what you expected (which by itself proves that you don’t know enough about the subject, so why are you even continuing the discussion?)

        We tend to simplify architecture by drawing blocks of ALU’s, FPU’S, AGU’s etc., but just because certain architecture can retire 3 operations per cycle given certain circumstances does not mean it always retires 3 – in fact most real life scenarios will combine instructions in way that CPU will not be able to retire 3 in cycle, but much less on average.

        Once again, thank you Captain Obvious.
        None of what you say has anything to do with your original comment and my response to that. It’s just some random facts that you certainly don’t have to explain to me. I mean, you’ve read my blog, you’ve seen some of the more in-depth architectural discussion and assembly-level optimizations I’ve discussed. It is obvious that I know about these things.

      • Dawis says:

        “Nobody made such a claim. You claimed however that:
        “NetBurst was quite narrow core with lack of parallel execution units so it gained little from HT.” ”

        What I wanted to say is that IPC was weakspot for Netburst and that is why it would not gain from HT as much as other architectures with higher theoretical IPC would.

        Well I get your point – Netburst had low real life IPC, but quite a lot of parallel execution units, therefore it should be logical that a lot of those execution units are idle and can therefore benefit from HT.

        I actually agree that it is a valid point as well – and I must say I did not really look at it from that angle before.

        What I considered was that since it can retire “only” 3 micro operations per cycle and there are further limitations to micro operations per cycle in real life scenario, that just puts its limits to gains you can have from HT.

        My thought was – given Intel had architecture which can retire more micro operations per cycle and which has potential to execute more mops per cycle in more real life scenarios and has less such limits than Netburst – potential benefits from HT would be far greater at the time.

        Do you disagree?

      • Scali says:

        What I wanted to say is that IPC was weakspot for Netburst and that is why it would not gain from HT as much as other architectures with higher theoretical IPC would.

        But that doesn’t make sense. You don’t know WHY IPC is a weak spot. There can be a number of reasons why IPC is low. That was my point. A CPU with low IPC does not benefit from HT by definition, nor would a CPU with high IPC not benefit from HT by definition.

        What I considered was that since it can retire “only” 3 micro operations per cycle and there are further limitations to micro operations per cycle in real life scenario, that just puts its limits to gains you can have from HT.

        Then you are only looking at how many instructions can run in parallel. Another factor is how long these instructions take to execute (latency/throughput).
        The P4 suffered mostly from the fact that a lot of instructions had high latency and/or low throughput. You could have a lot of instructions ‘in flight’. This explains why HT worked well on that particular architecture.

        Do you disagree?

        I don’t think you can take the retirement-rate as a single bottleneck, and think that HT performance would improve much by retiring more uOps per cycle. I think that given the rest of the architecture, it was a decent balance, and not much would be gained by just changing this one part.

      • Dawis says:

        “But that doesn’t make sense. You don’t know WHY IPC is a weak spot. There can be a number of reasons why IPC is low. That was my point. A CPU with low IPC does not benefit from HT by definition, nor would a CPU with high IPC not benefit from HT by definition.”

        I would certainly agree that CPU with low IPC would not automatically benefit from HT by definition. In Netburst case it might have worked due to super long pipeline and high latency on many instructions – you make good point there.

        On the other hand I think architecture with high theoretical IPC ( ability to execute many instructions per cycle if needed ) would certainly benefit from SMT. This is simply due to real life workloads not being able to take advantage over that high potential IPC in single thread. You must of course know, that given any sequence of instructions you will have certain maximum theoretically achievable IPC, which unfortunately is not that high in real life. Which is why you don’t generally see ridiculously wide cores without SMT support.

        None the less CPU architecture that is capable on executing many instructions in single cycle in parallel, should greatly benefit from SMT feature.
        Thus what I am saying is if you design architecture from scratch with SMT as one of core features you can design a wider core than you otherwise would, knowing that SMT will make sure these resources are utilized.
        Netburst was not designed though, with SMT at its core, but rather acquired it along the way.

        “I don’t think you can take the retirement-rate as a single bottleneck, and think that HT performance would improve much by retiring more uOps per cycle. I think that given the rest of the architecture, it was a decent balance, and not much would be gained by just changing this one part.”

        You are probably right if you are talking about Netburst. Yet my original statement was comparing it to other – real life or hypothetical architectures.
        As for Netburst – I am not a hater. I think it was an exciting architecture. Even though I favored AMD at the time, I almost wanted intel to continue with Netburst direction to see what they would come up with in next generations – they had several more core versions planned.

        Peace.

      • Scali says:

        This is simply due to real life workloads not being able to take advantage over that high potential IPC in single thread.

        This is flawed reasoning.
        IPC is:
        1) The number of instructions executed in actual code, not theoretical maximum
        2) Measured on a per-thread basis

        Which is why people thought a CPU with high IPC wouldn’t benefit from HT: they thought that because the IPC is relatively high, there wouldn’t be a lot of idle units for a second thread.
        The flaw of reasoning here is that ‘high IPC’ is relative to x86 code, and x86 code is very inefficient by definition. Modern backends are far more advanced than the instruction set, and you simply can’t write the code to use a modern backend to the fullest. For x86 we consider an average of ~2 instructions per clk ‘high IPC’. But this still leaves a lot of units idle in a modern x86 backend.

        Netburst was not designed though, with SMT at its core, but rather acquired it along the way.

        I don’t agree with that.
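
The point about measured per-thread IPC versus backend width can be illustrated with a toy model. This is a rough sketch under loud assumptions: the issue width of 4 is a hypothetical number rather than a figure for any real CPU, the second thread's IPC of 1.5 is invented, and the bound ignores contention for caches, decoders and other shared resources. Only the "~2 instructions per clk" value comes from the comment above.

```python
def idle_slots(issue_width, measured_ipc):
    """Average issue slots per clock that a single thread leaves unused."""
    return issue_width - measured_ipc


def smt_throughput_bound(issue_width, per_thread_ipc):
    """Best-case combined IPC: the threads together cannot exceed the core width."""
    return min(issue_width, sum(per_thread_ipc))


HYPOTHETICAL_WIDTH = 4  # assumed backend width, for illustration only
HIGH_X86_IPC = 2.0      # the "~2 instructions per clk" quoted in the thread

# Even a 'high IPC' thread leaves slots free in this model.
print(idle_slots(HYPOTHETICAL_WIDTH, HIGH_X86_IPC))

# A second thread (invented IPC of 1.5) can fill part of that gap.
print(smt_throughput_bound(HYPOTHETICAL_WIDTH, [HIGH_X86_IPC, 1.5]))
```

In this model the amount a second thread can gain depends on the gap between width and measured IPC, not on whether the measured IPC counts as 'high' or 'low' in absolute terms.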

      • Dawis says:

        “This is simply due to real life workloads not being able to take advantage over that high potential IPC in single thread.

        This is flawed reasoning.
        IPC is:
        1) The number of instructions executed in actual code, not theoretical maximum
        2) Measured on a per-thread basis”

        Yes – I get now I should have not used IPC term here. I think I tried to use wider core term in beginning meaning by that core that has more execution resources and has less limits on how many of them can be utilized in parallel in a given cycle.

      • Dawis says:

        “Which is why people thought a CPU with high IPC wouldn’t benefit from HT: they thought that because the IPC is relatively high, there wouldn’t be a lot of idle units for a second thread.
        The flaw of reasoning here is that ‘high IPC’ is relative to x86 code, and x86 code is very inefficient by definition. Modern backends are far more advanced than the instruction set, and you simply can’t write the code to use a modern backend to the fullest. For x86 we consider an average of ~2 instructions per clk ‘high IPC’. But this still leaves a lot of units idle in a modern x86 backend.”

        Agreed.
        From that point of view that statement makes a lot of sense – higher IPC must mean that fewer execution units are unused. It seems a logical assumption: why would someone design a core that has a lot of unused execution units on average?

        Somehow, though, I never looked at it like that.
        My original claim was towards that core that has more execution units (wider by a very simple definition) must have more of them idle and available for SMT to take advantage on. I think I also touched subject that modern execution blocks looks more RISC than x86 anyway.

        Are you a fan of VLIW and Itanium architecture?

      • Dawis says:

        Since you only post which statements you disagree with it is hard to understand which statements if any you agree with.
        It is interesting though you claim Netburst was designed with SMT at its core, given that first generations of Netburst cores did not even have HT and it only came once Netburst was quite mature in second half of its life cycle.

      • Scali says:

        given that first generations of Netburst cores did not even have HT and it only came once Netburst was quite mature in second half of its life cycle.

        That is incorrect. HT was already available in Northwood, which was only the second generation of Pentium 4, and Northwood was released in Jan 2002, little over a year after Willamette in Nov 2000 (at first, only Xeon-branded Northwoods had HT enabled, regular consumer-oriented Pentium 4 CPUs did not, although they shared the same die, and as such, HT was physically present in these CPUs).

      • Dawis says:

        First Northwood with HT came more than two years after release of Willamette. Are you saying on drawing board of Willamette they had SMT planned already as one of principles of Netburst?

      • Scali says:

        First Northwood with HT came more than two years after release of Willamette.

        Get your maths right!
        As I said:
        Willamette – Nov 2000
        Northwood – Jan 2002
        That is 13-14 months, not “more than two years”.
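
The gap between the two release months can be verified with a few lines of date arithmetic. This is a minimal sketch using only the month and year given above; the day of the month is not stated in the comment, so the first of each month is used as an assumption, and `months_between` is a hypothetical helper name.

```python
from datetime import date


def months_between(start, end):
    """Whole calendar months from start's month to end's month (hypothetical helper)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


# Month and year as stated in the comment; day of month is an assumption.
willamette = date(2000, 11, 1)
northwood = date(2002, 1, 1)

print(months_between(willamette, northwood))
```

Counted month to month this gives 14, consistent with the "13-14 months" stated above and well short of two years.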

      • Dawis says:

        OK, I was measuring the desktop release of 3.06 GHz P4 which happened May 2003. As you said Xeon came earlier so you are right.

  21. Dawis says:

    “You can start with the Intel Optimization Manuals, and then read Agner Fog’s documentation on x86 microarchitectures.”
    Intel optimization manuals were what I actually studied for Netburst. I do not think I ordered anything from them regarding P6, because it was pretty much obsolete at the time.

  22. Thomas says:

    “Non-x86 systems only have a very small marketshare”

    What architecture is used by smartphones?

  23. Pingback: AMD FX Series Making a Comeback Within Two Years – APU 14 Conference Reveals Future Roadmaps – Welcome to Info-Pc

  24. Pingback: AMD Zen: a bit of a deja-vu? | Scali's OpenBlog™
