AMD Steamroller

Perhaps some of you expected more coverage on upcoming CPUs on this blog, given the history regarding AMD’s Bulldozer. But well, there hasn’t been too much to report on, really. I did a short thing on Trinity/Piledriver, but there was not too much to tell. Small improvements, not very spectacular.

The same goes for Steamroller. To be honest, it just doesn’t interest me a whole lot at this point. Anandtech covered Steamroller a while ago, but there was not much in the way of spectacular news. As Anandtech says:

Steamroller addresses this by duplicating the decode hardware in each module. Now each core has its own 4-wide instruction decoder, and both decoders can operate in parallel rather than alternating every other cycle. Don’t expect a doubling of performance since it’s rare that a 4-issue front end sees anywhere near full utilization, but this is easily the single largest performance improvement from all of the changes in Steamroller.

That proves my point already: if the decoder is the largest performance improvement, it’s not going to be very spectacular. It is an interesting fact in the sense that it implies that AMD is moving away from their ‘Clustered MultiThreading’ approach. Apparently AMD concluded that sharing resources the way they have been doing just doesn’t work. So instead of trying to share resources and reduce transistor count, they are going to put two dedicated decoders in each module again. A move back towards conventional multicore technology, like in their earlier Athlon and Phenom architectures.

I am not sure how this is going to make AMD more competitive with Intel though, since Intel has clearly been moving towards better performance-per-watt with the last few generations of Core architectures, and Haswell will only improve on that further.

Another thing is that AMD is going for larger caches in Steamroller. What is interesting is that Bulldozer already had 64k of shared L1 instruction cache per module, which is a lot more than the 32k of L1 instruction cache per core in Sandy/Ivy Bridge. That 32k is shared by two logical cores when using HyperThreading. AMD’s problem was not really that the cache was too small. Its problem was that the cache was not efficient enough. A small L1 cache with low latency is generally better than a large L1 cache with high latency (the L2 is still there to reduce the penalty of cache misses from L1). It seems that AMD’s problem is in getting the latency of a shared cache down. So when I read that they are going for a larger L1 cache, that implies that they couldn’t get the latency down, and are instead trying to increase the hit ratio by making it larger.
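To put some numbers on that latency-versus-hit-ratio trade-off, here is a rough sketch using the textbook average-memory-access-time formula. The function name `amat` and all latencies and hit rates below are made up for illustration. They are not measurements of Bulldozer, Steamroller or Sandy Bridge, and the model assumes every L1 miss hits in L2.

```python
def amat(l1_latency, l1_hit_rate, l2_latency):
    """Average memory access time in cycles.

    Simplified model: every access pays the L1 latency, and every
    L1 miss additionally pays the L2 latency (L2 is assumed to hit).
    """
    return l1_latency + (1.0 - l1_hit_rate) * l2_latency


# Hypothetical numbers, for illustration only.
small_fast = amat(l1_latency=3, l1_hit_rate=0.95, l2_latency=12)
large_slow = amat(l1_latency=4, l1_hit_rate=0.97, l2_latency=12)

print(f"small, fast L1: {small_fast:.2f} cycles")   # 3.60
print(f"large, slow L1: {large_slow:.2f} cycles")   # 4.36
```

With these assumed numbers the extra cycle of L1 latency costs more than the better hit ratio wins back. The balance only tips the other way when the miss penalty is large, which is roughly the bet a bigger L1 makes.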

This again goes in the opposite direction of Intel: a bruteforce approach, more transistors, more power consumption, more die-size. So again, I don’t see this as a good sign in terms of competition with Intel. Bulldozer/Piledriver are already ridiculously large and powerhungry compared to Intel CPUs of the same performance level. And I am not sure how this ties in with the mission statement from AMD that they are no longer going for the high-end market. The Steamroller improvements seem to focus mostly on more performance at the cost of larger and more powerhungry chips, rather than becoming smaller, more efficient mainstream chips.

At least there’s no John Fruehe this time, claiming ridiculous performance estimates. Because clearly the improvements in the Steamroller architecture are not going to be groundbreaking.

Meanwhile, there has also been some news on Haswell. It has some new instructions (TSX) which will make locking in multithreaded situations more efficient. Sadly Steamroller won’t have these instructions yet. Anyway, the interesting part is that these new instructions basically let the hardware elide a mutex-style lock, so that it behaves like something of a single writer/multiple reader style lock automatically. That is, instead of each thread waiting on the lock, they all start executing the critical section speculatively, without actually taking the lock. When threads conflict on the same data (one writes something that another has read or written), the ‘transaction’ is aborted, and the whole thing is tried again with proper locking. It is a simple, yet elegant solution which turns coarse locking situations into nearly fine-grained locking performance. The best thing is that support for these instructions only has to be inside the locking objects, provided by the OS. So a simple update of the OS locking libraries will make all applications use the new instructions automatically.
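For those who want to see the general idea in code, here is a toy software model of "run optimistically, retry under the real lock on a conflict". It is only a sketch: it does not use the Haswell instructions at all, the class names `ElidedLock` and `_Txn` are made up for this example, and unlike real hardware it still takes the lock briefly to validate and commit.

```python
import threading


class _Txn:
    """One attempt at a critical section: tracks reads, buffers writes."""

    def __init__(self, owner):
        self._owner = owner
        self.reads = {}    # key -> version seen at first read
        self.writes = {}   # key -> privately buffered new value

    def read(self, key, default=None):
        if key in self.writes:
            return self.writes[key]
        # Record the version before reading the value, so a racing
        # commit can only make validation fail, never pass wrongly.
        self.reads.setdefault(key, self._owner.version.get(key, 0))
        return self._owner.data.get(key, default)

    def write(self, key, value):
        self.writes[key] = value


class ElidedLock:
    """Toy model of lock elision (hypothetical, not a real OS lock)."""

    def __init__(self):
        self._lock = threading.Lock()  # the 'real' coarse lock
        self.data = {}                 # shared state
        self.version = {}              # key -> number of committed writes
        self.fallbacks = 0             # times we had to lock for real

    def run(self, critical_section):
        # Speculative attempt, executed without holding the lock.
        txn = _Txn(self)
        result = critical_section(txn)
        with self._lock:
            unchanged = all(self.version.get(k, 0) == v
                            for k, v in txn.reads.items())
            if not unchanged:
                # Conflict: redo the whole section while holding the lock.
                self.fallbacks += 1
                txn = _Txn(self)
                result = critical_section(txn)
            self._commit(txn)
        return result

    def _commit(self, txn):
        for key, value in txn.writes.items():
            self.data[key] = value
            self.version[key] = self.version.get(key, 0) + 1


lock = ElidedLock()


def increment(txn):
    txn.write("counter", txn.read("counter", 0) + 1)


def worker():
    for _ in range(1000):
        lock.run(increment)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(lock.data["counter"], "increments,", lock.fallbacks, "fallbacks")
```

All four threads hammer the same counter here, which is the worst case for this scheme. The number of fallbacks varies from run to run, but the final count is always 4000. Critical sections that touch different keys would commit without ever redoing their work, which is where the near fine-grained behaviour comes from.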

This entry was posted in Hardware news.

17 Responses to AMD Steamroller

  1. T. says:

    I think that AMD gave up not only the performance race but the efficiency race too, as they are stuck for the next few years with this “modular” approach that is anything but efficient, so I agree with you that they are going to go wild on die size and high clocks.

    I don’t think that Steamroller has anything to do with desktops. They are probably aiming the thing at cheap servers and workstations, as those don’t have the strict cooling requirements that data centers have. Plus, they must keep a small presence in the market to sell something, or else years of relationships in the supply chain will be lost.

    The FX on the desktop is only there to sell defective dies to AMD fanboys and 20th tier OEMs. Who would pay for a dual “module”, GPU-less, 125W CPU if not an AMD fanboy? It is better than sending them to the trash can.

    In any case, I think that Steamroller will fail badly even in this restricted market. Intel is reserving its cutting edge nodes for desktops and notebooks, and one year later they move their server chips to the new node, which is by then pretty mature. What happens is that Intel is able to field an entire range of processors at all relevant price points and with excellent availability.

    SNB-EP, which is the true power of the *bridge architecture, pretty much wiped out AMD’s server business, to the point that the CEO had to admit in public that the server business ground to a halt. Now they sell server chips for the price of desktop chips. Haswell-EP will arrive a few months after Steamroller, on a mature node, covering all the relevant price points and with an even bigger performance advantage. I wouldn’t be surprised if AMD quit the server market after Steamroller.

  2. Positron says:

    With my limited understanding of CPU design I don’t think AMD is leaving CMT. The module idea is partly about keeping the FPU performance steady while increasing the number of integer cores. AMD hopes that FPU heavy code moves to the GPU.

    I don’t think AMD was trying to reduce the L1 latency but rather maintain/increase clock speeds. According to Intel (via Anandtech), going from a 3-cycle latency cache to 4 reduces IPC by 2-3%, but the clock speed gains are far greater than the IPC losses. Also, AMD’s data cache is 16KB, 75% less than K10, so increases here can bring decent performance gains.

    • Scali says:

      Well, I didn’t say they were leaving CMT entirely, merely that instead of taking the idea of sharing functional units even more (moving towards proper SMT, where all resources are shared between 2 or more threads), they are going back to more dedicated units per thread.

      Also, yes, I am saying that AMD *can’t* reduce the L1 latency. But I don’t agree with you that they do this on purpose in order to maintain/increase clockspeeds.
      After all, making the L1 cache larger is not exactly a good idea either, in terms of clockspeed scaling.
      You’re right, the data cache on Bulldozer is only 16kb, the instruction cache is 64kb.
      The problem that AMD has is that Intel is very far ahead in terms of cache design. Intel can get larger cache sizes with lower latency AND with better hit-rate due to higher associativity. All that while Intel still shares the caches between 2 threads, and while their CPUs still scale beyond 3 GHz easily.
      As we know, Intel is not currently pursuing high clockspeeds. Instead, they are going for lower TDP. If they were to use a 125-130W TDP envelope rather than the 77W they currently use on the Core i7 3770 for example, they could probably exceed AMD’s clocks easily (judging by overclocking results).

    • Scali says:

      I guess the short answer is that AMD hopes to move all heavy FP/SIMD processing to the GPU with OpenCL, so integer is all the CPU has to be good at.

      • k1net1cs says:

        Good if that’s really the case, but AMD can’t even be bothered to lend a helping hand to Folding@Home to make a decent OpenCL-based GPU client.
        All F@H has now is an inefficient OpenCL AMD GPU client, which gobbles up one CPU core no matter how fast the CPU is, and can’t even support GCN properly.
        nVidia may have a proprietary standard (CUDA), but they do help devs whenever they can to implement it, while AMD with its open standard (OpenCL) rarely gives a damn about devs unless it’s something profitable.

        Both companies are sponsors for F@H, so I don’t get why there’s a big performance difference between the two implementations (a.k.a. GPU clients), and I don’t believe OpenCL is in any way far inferior to CUDA if it’s implemented properly.

      • Scali says:

        AMD/ATi have always had more of an “If you build it, they will come”-attitude.
        When you’re making x86-clones, it’s obvious from the start that you’re just lifting along on your competitor’s hard work.
        And ATi has never had a very strong software department either. They never did as much as nVidia when it comes to developer support with rich SDKs, tools, presentations and all that. ATi also had notoriously bad drivers, especially in the OpenGL department (where nVidia excelled in OpenGL, both in terms of driver quality, and support for developers with all sorts of extensions, and good examples on how to use them etc).
        With Direct3D, ATi had the advantage that Microsoft already supplied their own rich SDK, so they could just take advantage of that.

        AMD/ATi used to have their own proprietary Stream SDK for GPGPU. But it was never very successful, so a few years ago they decided to focus entirely on OpenCL. And again, it seems that’s more or less all that they’re doing. They hope that other developers will start using OpenCL, and provide tools and libraries, simply because OpenCL is supported by multiple vendors. So far that hasn’t really worked very well.

        OpenCL is inferior to CUDA in the sense that OpenCL is a lowest-common-denominator solution. nVidia can add all the latest hardware features to CUDA immediately. With OpenCL, they are limited to the basic API and its extension mechanism, and since OpenCL is also meant to run on non-nVidia hardware, developers are less likely to support nVidia-only extensions (same story as with OpenGL… and we all know how well that went).
        So in theory CUDA will always be the superior solution for nVidia GPGPU. In practice, it depends on what you want to do, and how well OpenCL maps to the nVidia hardware for those particular tasks. In some cases there may be little or no difference. In other cases it might be quite significant.

  3. Pingback: Intel Haswell: How low can you go? | Scali's OpenBlog™

  4. T. says:

    Do you think that HSA may become the lowest common denominator between ARM and x86, a “me too” feature?

    • Scali says:

      I’m not sure what it is exactly that the HSA aims to do… but it looks like some major players, such as nVidia, Apple and Intel are not members…

  5. nickysn says:

    Hm, actually, this change is very much in line with Agner Fog’s analysis of Bulldozer:
    He concluded that the shared decoders of Bulldozer are a serious bottleneck and he states that he believes adding one set of decoders per thread would greatly improve throughput.

    • Scali says:

      Well, instead of adding more decoders, AMD could also have improved the efficiency when sharing the decoders. Currently each thread can decode instructions every other cycle. Intel’s CPUs with HT also have 4 decoders which they share between 2 threads, but it happens in a more flexible way, so those 4 decoders can effectively decode many more instructions (otherwise Intel could never have that much higher IPC).

      I am not sure if I agree with Agner’s remarks on trace cache though. The Pentium 4 only had one decoder, so without trace cache it would probably be very bad indeed. Aside from that, Intel has since reintroduced trace cache with Sandy Bridge (they call it ‘decoded Icache’ now, but the idea is very similar), and AMD Steamroller will also get a form of trace cache (decoded micro-op queue).

      Adding more decoders and trace cache may alleviate Bulldozer’s decoding bottleneck, but it is not the most efficient solution, I think. It will only add extra die-area, which was another problem in the Bulldozer design to begin with.
      It looks like AMD tries to ‘fix’ a bruteforce design by adding more bruteforce, while its competitors (both Intel and ARM) go for smaller, more efficient chips instead.

  6. Haswell says:

    In classic AMD fashion, Steamroller is delayed until 2014, so it will have to compete against Intel Broadwell now.

    As said before, AMD will be steamrolled by Haswell and Broadwell.

  7. Dano says:

    AMD will survive by selling their chips at low margins… (significantly cheaper than Haswell and Broadwell)

  8. Broadwell says:

    “The reality is quite clear by now: AMD isn’t going to solve its CPU performance issues with anything from the Bulldozer family. What we need is a replacement architecture, one that I suspect we’ll get after Excavator concludes the line in 2015.”

    Steamroller is another AMD engineering epic failure.
