AMD Zen: a bit of a deja-vu?

AMD has released the first proper information on their new Zen architecture. AnandTech seems to have done some of the most in-depth coverage, as usual. My first impression is one of deja-vu… in more than one way.

Firstly, it reminds me of what AMD did a few years ago on the GPU front: they ditched their VLIW-based architecture and moved to a SIMD-based architecture, which was remarkably similar to nVidia’s architecture (nVidia had been using SIMD-based architectures since their 8800GTX). In this case, Zen seems to follow Intel’s Core i7 architecture quite closely. They are moving back to high-IPC cores, just as in their K7/K8 heyday (which at the time followed Intel’s P6 architecture closely), and they seem to target lower clockspeeds, in the 3-4 GHz range where Intel also operates. They are also adopting a micro-op cache, something that Intel has been doing for a long time.

Secondly, AMD is abandoning their CMT approach and going for a more conventional SMT approach. This is another one of those “I told you so” moments. Even before Bulldozer was launched, I already said that having 2 ALUs hardwired per core is not going to work well. Zen is now using 4 ALUs per two logical cores. So technically they still have the same number of ALUs per ‘module’. However, like the Core i7, they can now use all 4 ALUs with each thread, so you get much better IPC for single threads. This again is something I said a few years ago already. AMD apparently agrees with that. Their fanbase did not, sadly.
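
To make the difference concrete, here is a toy model in Python. It is only a sketch under loud assumptions: the trace format, the fixed workload pattern and the function names (cycles_partitioned, cycles_shared) are made up for illustration, and it models neither Bulldozer nor Zen in any detail. Each trace entry is the number of independent ALU operations a thread has ready in one dependency step, with 0 standing in for a stall. A thread with 2 hardwired ALUs needs two cycles for a 4-wide burst, while a thread that can reach all 4 ALUs finishes it in one.

```python
import math

# Hypothetical workload: each entry is the number of independent ALU ops a
# thread has ready in one dependency step; 0 stands in for a stall cycle.
PATTERN = [4, 0, 1, 2, 0, 4, 1, 0]


def cycles_partitioned(trace, alus=2):
    """CMT-like toy: the thread can only ever use its own `alus` units."""
    return sum(max(1, math.ceil(ops / alus)) for ops in trace)


def cycles_shared(traces, alus=4):
    """SMT-like toy: all threads draw from one pool of `alus` units per cycle."""
    n = len(traces)
    idx = [0] * n                                   # current step per thread
    remaining = [t[0] if t else 0 for t in traces]  # ops left in that step
    cycle = 0
    while any(idx[i] < len(traces[i]) for i in range(n)):
        free = alus
        # Rotate priority every cycle so no thread is starved.
        for i in [(cycle + k) % n for k in range(n)]:
            if idx[i] >= len(traces[i]):
                continue
            if traces[i][idx[i]] > 0:
                issued = min(remaining[i], free)
                remaining[i] -= issued
                free -= issued
            # A stall step (0 ops) simply costs the thread this one cycle.
            if remaining[i] == 0:
                idx[i] += 1
                if idx[i] < len(traces[i]):
                    remaining[i] = traces[i][idx[i]]
        cycle += 1
    return cycle


thread_a = PATTERN * 100
thread_b = (PATTERN[2:] + PATTERN[:2]) * 100  # same mix, shifted in time

print("1 thread, 2 hardwired ALUs:", cycles_partitioned(thread_a))
print("1 thread, 4 shared ALUs:   ", cycles_shared([thread_a]))
print("2 threads, 4 shared ALUs:  ", cycles_shared([thread_a, thread_b]))
```

With this particular pattern, a lone thread takes 10 cycles per 8 steps on 2 hardwired ALUs and 8 cycles when it can reach all 4. The numbers mean nothing for real hardware; the point is only that the bursts are where the extra width pays off.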

We can only wonder why AMD did not go for SMT right away with Bulldozer. I personally think that AMD knew all along that SMT was the better option. However, their CMT was effectively a ‘lightweight’ SMT, where only the FPU portion did proper SMT. I think it may be a combination of two factors here:

  1. SMT was originally developed by IBM, and Intel has been using their HyperThreading variation for many years. Both companies have collected various patents on the technology over the years. Perhaps for AMD it was not worthwhile to use fullblown SMT, because it would touch on too many patents and the licensing costs would be prohibitive. It could be that some of these patents have now expired, so the equation has changed to AMD’s favour. It could also be that AMD is now willing to take a bigger risk, because they have to get back in the CPU race at all costs.
  2. Doing a fullblown SMT implementation for the entire CPU may have been too much of a step for AMD in a single generation. AMD only has a limited R&D budget, so they may have had to spread SMT out over two generations. We don’t know how long it took Intel to develop HyperThreading, but we do know that even though their first implementation in Pentium 4 worked well enough in practice, there were still various small bugs and glitches in their implementations. Not only stability-wise, but also security-wise. The concept of SMT is not that complicated, but shoehorning it into the massively complex x86 architecture, which has tons of legacy software which needs to continue working flawlessly, is an entirely different matter. This is quite a risky undertaking, and proper validation can take a long time.

At any rate, Zen looks more promising than Bulldozer ever did. I think AMD made a wise choice in going back to ‘follow the leader’-mode. Not necessarily because Intel’s architecture is the right one, but because Intel’s architecture is the most widespread one. I have said the same thing about Pentium 4 in the past: the architecture itself was not necessarily as bad as people think. Its biggest disadvantage was that it did not handle code optimized for the P6-architecture very well, and most applications had been developed for P6. If all applications would be recompiled with Pentium 4 optimizations, it would already have made quite a different impression. Let alone if developers actually optimized their code specifically for Pentium 4’s strengths (something we mainly saw with video encoding/decoding and 3D rendering).

Bulldozer was facing a similar problem: it required a different type of software. If Intel couldn’t pull off a big change in software optimization with the Pentium 4, then a smaller player like AMD certainly wouldn’t either. That is the main reason why I never understood Bulldozer.


26 Responses to AMD Zen: a bit of a deja-vu?

  1. Thomas says:

    In this 2005 interview with Fred Weber, it sounds like they weren’t in favour of “Doing a fullblown SMT implementation”

    Fred’s response to this question was thankfully straightforward; he isn’t a fan of Intel’s Hyper Threading in the sense that the entire pipeline is shared between multiple threads. In Fred’s words, “it’s a misuse of resources.” However, Weber did mention that there’s interest in sharing parts of multiple cores, such as two cores sharing a FPU to improve efficiency and reduce design complexity. But things like sharing simple units just didn’t make sense in Weber’s world, and given the architecture with which he’s working, we tend to agree.

    And this was back in 2005, when I think AMD had the performance lead. Although I don’t think highly of AMD releasing Bulldozer in 2011 based on a single idea from 5+ years prior!

    • Scali says:

      Such statements make me wonder: Are they just saying that because they want to downplay one of their competitor’s unique advantages? Or did they really think the technology wasn’t useful? The former I can understand. The latter would be a sad showing of incompetence.

  2. Ron says:

    Well, I for one am looking forward to their Zen-based APUs. My computer doesn’t need the fastest CPU available to be useful to me, just like the majority of users, and AMD always give the best bang for the buck. Getting a CPU+GPU in a single well-priced package is a sweet deal for a lot of us.

    • jdwii says:

      They give you the best bang for buck cause that is all they could make. It’s not like a company is going to be nice and make things cheaper cause they are moral or something; they have to make money and they need the board to be supportive.

      Currently I’d never get an 8-core Piledriver CPU over a modern i3. Never. I’d argue that it offers less CPU performance while also being on an outdated platform.

      As for their APUs, they are only worth it for mom-and-dad PCs, and even then my dad’s PC is rocking an i3 6100 that barely draws any power, is super freaking fast and can even play 4K video.

      Really, currently the only CPU I can recommend from AMD is if you need to build something and use the AM1 platform, since you can build a whole PC for like $180 or so. Maybe, just maybe, the 860K is decent for $70, but it would bottleneck even a 950 IMO, and a user would probably see GPU usage around 70% at times.

      But I do hope Zen improves things and at least helps in the lower-end and mid-range market, cause currently one expects to spend $125 for a CPU that will not bottleneck any game with a mid-range GPU (1060-480).

    • mh says:

      The old “AMD give best price/performance” is another old dragon that’s due a good slaying; they actually don’t. https://www.cpubenchmark.net/cpu_value_available.html

  3. jdwii says:

    This was nicely written and I’ve been waiting for this article from you for a while. Guessing you don’t follow CPU architectures as much as you used to, maybe? Either way, I actually think I agree with both 1 and 2, but probably sway a bit more towards 2.

    Still, I expect lower than Skylake performance in IPC for many different reasons. I’m just happy AMD isn’t as blind as their die-hard fanbase, which I still remember I was part of years ago before Bulldozer released. So happy to be with Intel now (4790K owner), but I wish Intel had real competition. I mean, $2000 for a 10-core CPU? Charging $100 more for just HT and a tiny bit more cache?

    Crazy. You can easily tell they have no competition.

    • Scali says:

      Well, until now, AMD simply hasn’t released any information on Zen. The talk about Zen was just wild speculation, and I didn’t see any point in doing an article on that.

  4. Luciano @ Tech 4 Freelancers says:

    Thank you for this informative post. As I’ve stated in a previous post, I’m interested in seeing what Zen will have to offer once it’s out. With AMD being… Well, AMD, I was starting to lose a lot of hope after the release of their new graphics card.
    As much as I dislike the turn for the worse AMD has taken, I kind of don’t want them to go away: I feel like a more intense competition between Intel and AMD could benefit customers.

  5. Abraka Dabra says:

    I disagree with most of this blogpost.

    AMD’s Bulldozer was a perfectly good microarchitecture, but its adoption was seriously hamstrung by the first-generation TLB flaw, and optimized multi-threaded runtimes to favour neighbouring caches didn’t turn up until after Intel had made this worthwhile over two generations of SMT.

    As for four ALU ops per cycle — that means 16 ops under a L1d hit. Most application code, index-bound as it usually is, never reaches that much; the excess width is what made SMT worthwhile on Intel’s offerings. AMD’s failing with the separate integer cores was not a lack of width, but the shared decoders in the front end — and no µop cache or loop buffer to get past it. Exploiting only two independent dependency chains per thread, instead of four, hardly counts for anything but bragging rights and special-casing; and arguably AMD got the latter down with CPU turbo shenanigans.

    It is not as sexy a microarchitecture as Sandy Bridge, admittedly, but that (and the L3 cache arrangement) is pretty well the lot of it.

    • Scali says:

      > but its adoption was seriously hamstrung by the first-generation TLB flaw

      Aren’t you confusing AMD failures here? Barcelona had the TLB flaw. Bulldozer had the BSOD problem.

      > and optimized multi-threaded runtimes to favour neighbouring caches didn’t turn up until after Intel had made this worthwhile over two generations of SMT.

      This would explain why you would disagree with my article. My article states that the “if you build it, they will come”-approach doesn’t work, at least not in the x86-world. People expect x86 CPUs to perform properly out-of-the-box, rather than requiring all their applications to be rewritten specifically for a new CPU. Zen is likely going to be a bigger success than Bulldozer for the simple reason that it’s closer to the Intel Core-architecture, and therefore can run code optimized for Core efficiently. Most applications out there are optimized for Core.

      > As for four ALU ops per cycle — that means 16 ops under a L1d hit. Most application code, index-bound as it usually is, never reaches that much; the excess width is what made SMT worthwhile on Intel’s offerings. AMD’s failing with the separate integer cores was not a lack of width, but the shared decoders in the front end — and no µop cache or loop buffer to get past it. Exploiting only two independent dependency chains per thread, instead of four, hardly counts for anything but bragging rights and special-casing; and arguably AMD got the latter down with CPU turbo shenanigans.

      I think you are overlooking a key aspect here:
      You are looking at averages, rather than looking at performance per-cycle. Yes, on average you may have ‘excess width’. But it ‘comes and goes’. And *that* is why SMT works. Not because you *always* have spare resources, but because you have some bursts where you can use all your resources for a single thread, and then there’s some bubbles in the pipeline for that thread, so you can interleave work from the second thread.
      If you don’t have the ‘excess width’ in the first place, you will lack these bursts. That’s one of the main problems of Bulldozer: in single-threaded workloads it performs very poorly (especially with games using DX11 or lower, or OpenGL, it is quite obvious. One thread has to feed the GPU instructions, and it is getting swamped).

  6. Clemens Eisserer says:

    > If all applications would be recompiled with Pentium 4 optimizations,
    > it would already have made quite a different impression.

    Having written both hand-optimized assembly and C code containing x86 intrinsics for the P4, I can’t agree with that. The sole focus of P4’s design is to push clock speeds high. Besides the limited decode and execution resources, no compiler optimization can make the 20-30 cycle branch mispredict latency disappear.

    Regarding Bulldozer: There was a really nice article regarding Bulldozer’s performance issues on AnandTech: http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper

    • Scali says:

      Have you hand-optimized assembly for P4? It doesn’t sound like it, if all you can mention is branch misprediction.
      Firstly, when the P4 came out, its branch predictor was much more advanced than that of any other CPU, so the number of branch mispredictions was considerably lower than on other CPUs. Secondly, the P4 also introduced static branch hints: you could prefix a conditional jump instruction with an extra byte to indicate whether to treat it as taken or not taken. Compilers can certainly make use of this.

      Thirdly, if you actually hand-optimized assembly for P4, you would know that its ALU is quite different from the x86 CPUs before and after it.
      One big difference is that its ALU is double-pumped, but the shifter is not. A shift takes 2-4 cycles, which means that small shift-left operations are better performed with adds. You can do add eax, eax ; add eax, eax in a single cycle, so even a shift by two done this way is still at least twice as fast as shl eax, 1.
      So any code that ‘optimizes’ multiply operations with series of shifts is likely to be very slow on P4. Recompile it with optimizations targeting P4 specifically, and you can easily make it 2-3 times faster.
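
      The shift-to-add rewrite can be sketched in a few lines of Python. This only demonstrates that the two forms compute the same result; it says nothing about timing, which is the part that mattered on P4. The names shl_with_adds and times_ten are made up for the illustration, and the 32-bit mask is there to mimic a register like eax.

```python
MASK32 = 0xFFFFFFFF  # emulate a 32-bit register such as eax


def shl_with_adds(value, count):
    """Shift left by `count` using only additions: each x + x doubles x,
    which is exactly what a 1-bit left shift does."""
    value &= MASK32
    for _ in range(count):
        value = (value + value) & MASK32  # the `add eax, eax` step
    return value


def times_ten(value):
    """x * 10 = x * 8 + x * 2, built from additions only."""
    x2 = shl_with_adds(value, 1)
    x8 = shl_with_adds(value, 3)
    return (x8 + x2) & MASK32


# The add-based forms agree with the shift and multiply they replace.
for x in (0, 1, 5, 12345, 0x7FFFFFFF, 0xFFFFFFFF):
    for n in range(4):
        assert shl_with_adds(x, n) == (x << n) & MASK32
    assert times_ten(x) == (x * 10) & MASK32
```

      A compiler targeting P4 would pick between such forms based on the instruction latencies of the target, which is the kind of gain a plain recompile can deliver.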

      Likewise, replacing MMX or x87 code with SSE2 can greatly improve performance. Again, with a simple recompile.
      Just a few examples.

      I’d love to hear any counter-arguments, but I’m not holding my breath. I’m pretty sure you’re just bluffing about the assembly part. Else you would have known the above things, and various other examples. Your current statements didn’t reach beyond the superficial “But P4 has a long pipeline and high clockspeed!”

      • Veda says:

        Okay, I’ll bite. The P4 dropped the barrel shifter design because of complexity.
        On average the pipeline length was something like 28 stages, IIRC 32 stages on the Prescott. (Don’t pin me down on my numbers, it’s been ages.)

        Then there are 2 things: old code binaries from 386 till P3 used the barrel shifter for small MULs.
        And on a branch miss you lose all instructions in the pipeline, and you have to wait a few cycles on L1 cache, 6~8 cycles depending on the operating frequency.
        Solution: Loop-unrolling. 😉

        What does this mean to an end user?
        Absolutely nothing!

        You cannot directly compare competing systems on the basis of IPC, because of pipelining, micro-code optimization, cache latency, and again the operating frequency.
        Nor can we do it on the basis of GHz, for the aforementioned reasons.

        All we devs can do is compile a ton of times and make special binaries for each platform (something that cannot be managed realistically).
        Intel releases its own compiler, which it maintains and updates; AMD is depending on 3rd-party software.
        No matter how we look at it, our view is both biased and skewed by our own experiences.

      • Scali says:

        So you agree with me, but somehow because of your tone and inclusion of random unrelated things (talking about IPC, loop-unrolling, end-users, bias/skew etc) it sounds like you don’t.
        Weird response.

        Also, lol@anyone mentioning the Intel compiler. Hardly any software uses that. Virtually all open source software (linux, OS X, FreeBSD etc) is compiled with gcc or LLVM.
        Most Windows stuff is compiled with MSVC.

      • Clemens Eisserer says:

        1. Branch prediction only helps with predictable branches. If your branches are *not* predictable, even the best predictor won’t help. In that case taken/not-taken hints encoded into branch instructions are worth nothing. Sure, there are tricks like cmov et al., but their field of use is limited.

        2. Of course there are, as with any architecture, micro-optimizations that can help P4.
        However, have you ever measured how much e.g. GCC’s “mtune” flag helps P4 for real-world integer code? Most of the time the gains are negligible compared to optimizing / instruction scheduling for P6.

        3. Why does replacing MMX with SSE2 speed things up on P4, except for more registers?

      • Scali says:

        1. The improved branch predictor in the P4 makes more branches ‘predictable’ more often (for example, by having a deeper history, you can detect longer repetitive branch patterns, allowing you to predict cases that couldn’t be predicted before). The use of trace cache makes pipeline flushes within cached code less expensive (decoding stages can be skipped). Only looking at the cost of a misprediction is a fallacy. Not to mention that you do not acknowledge that there are cases where branch hints can indeed help. You only mention cases where they can’t. Yes, obviously they exist, but you try hard to make it sound like there is no merit to them whatsoever.
        Some numbers in this article:
        http://www.tomshardware.com/reviews/intel,264-8.html
        a) Pentium 4’s target buffer is 8 times(!) as large as the Pentium 3
        b) Pentium 4 can eliminate 33% of the mispredictions of the Pentium 3.

        2. I don’t see why we would take GCC as a metric. Obviously the Intel compiler is the one to measure, since that one has the best optimizer for the P4 architecture. And the results certainly aren’t negligible.

        3. EMMS is a relatively costly operation on P4. SSE2 does not require EMMS, and can be used in parallel with x87 or MMX code. Aside from that, you are again guilty of a fallacy, since you ignore the replacement of x87 code with SSE2 that I mentioned. Which is the more interesting one of the two options. The x87 implementation on P4 is relatively poor, since it is basically a microcode emulation of x87 on top of the SSE2 backend. Using SSE2 directly is far more efficient than x87, even if you only use it for scalar operations. A P4-optimizing compiler will use SSE2 intrinsics and SSE2-based implementations of all common math.h routines (as do 64-bit compilers by the way).
        This illustrates my point perfectly: Intel designed the P4 assuming that software would be recompiled for SSE2, and native x87 performance would not be a big issue. In practice, too many legacy applications used x87, and P4 only got to showcase its SSE2-strength in a small selection of applications. AMD made the same basic mistake with Bulldozer.
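
        To go back to point 1: the effect of a deeper history can be illustrated with a toy predictor in Python. This is a sketch, not the P4’s actual predictor: it assumes a single branch, a table of 2-bit saturating counters indexed by the last few outcomes of that branch, and a loop branch that is taken five times and then falls through. All names are hypothetical.

```python
from collections import defaultdict


def mispredictions(outcomes, history_bits):
    """Count misses of a toy predictor: 2-bit saturating counters indexed by
    the last `history_bits` outcomes of the branch (history_bits >= 1)."""
    counters = defaultdict(lambda: 2)  # 0..3, every entry starts "weakly taken"
    history = ()
    misses = 0
    for taken in outcomes:
        predicted_taken = counters[history] >= 2
        if predicted_taken != taken:
            misses += 1
        # Train the counter for this history towards the real outcome.
        c = counters[history]
        counters[history] = min(3, c + 1) if taken else max(0, c - 1)
        # Slide the history window forward by one outcome.
        history = (history + (taken,))[-history_bits:]
    return misses


# Assumed workload: a loop branch taken 5 times, then not taken, repeated.
loop = ([True] * 5 + [False]) * 1000

shallow = mispredictions(loop, 2)
deep = mispredictions(loop, 8)
print("2 bits of history:", shallow, "misses")
print("8 bits of history:", deep, "misses")
```

        With 2 bits of history the loop exit looks the same as the iterations before it and is mispredicted every time; with 8 bits the exit gets its own table entry and is learned after a short warm-up.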

        You haven’t convinced me yet. You demonstrate a combination of bias and very limited/selective knowledge.
        I can picture you googling ‘EMMS’ right now.
        You went from “You can’t get performance improvements from P4 by recompiling” to just pulling things out of context, quoting selectively, and ignoring entire arguments.
        I’ve seen it all many times before. Go play somewhere else.

  7. Veda says:

    Scali, the compiler part was a red herring on my part; I am on the fence.

    I want to like AMD, I really do, because I dislike both the green and the blue front.
    But I can no longer justify my notions; they date back to the time of the K6 and Athlon 64.
    Even before then I got burned with the fall of Commodore.

    IPC tells us just as much as MIPS tells us: a whole lot of nothing on its own.
    IPC is nothing more than a computed value for the best-case number of instructions executed per cycle (current average, IIRC: 1.8~2.1 IPC per core).

    With that out of the way, let’s move forward. IPC is a fun number: you get to multiply it by the operating frequency.
    That is the theoretical maximum number of BogoMIPS you can attain.
    Of course this number is off, because again IPC is talking about the best-case number of instructions executed per cycle.
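
    As a quick illustration of that multiplication, here is a minimal Python sketch. The function name peak_mips is made up, the IPC range is the one mentioned above, and the 3.5 GHz clock is just an assumed example value.

```python
def peak_mips(ipc, clock_ghz):
    """Upper bound on millions of instructions per second:
    instructions per cycle times cycles per second."""
    return ipc * clock_ghz * 1000.0  # 1 GHz = 1000 million cycles per second


# IPC range from the comment above; the 3.5 GHz clock is an assumed example.
for ipc in (1.8, 2.1):
    print(f"IPC {ipc}: up to {peak_mips(ipc, 3.5):.0f} MIPS at 3.5 GHz")
```

    The result is only a ceiling; real code lands below it whenever the pipeline stalls.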

    I am of the opinion that the worst case is far more interesting: either by reducing the chance of hitting it, or by improving its time, you reduce a niche condition.
    The same is true for code and hardware, but that you knew already. (Keeping it real.)

    Cache latency, another fun subject. I mean this with the most sarcasm I can add in text.
    Your L1 cache has a 6~8 cycle delay depending on your operating frequency.
    L2 is worse with a 28~37 cycle delay, and let’s not start about L3 cache latency or main memory.
    These are the things we hit with a cache miss during a branch misprediction, with a TLB performance hit later, and we are filling our pipeline with NOPs till our fetch is done.

    We have reached this point a decade ago, and oddly enough the current solution is to loop-unroll and parallelize the problem where we can.
    Sure it works, what we truly need is a paradigm shift, and i don’t see this happening in the foreseeable future.
    This is what makes me feel sad: nothing truly new has happened.
    SIMD is a technical dead end, because we only attain higher bandwidth by increasing latency.
    With every increment of DDRx SDRAM we are moving further away from Random Access Memory and towards a serialized buffer.

    In my opinion, as it currently stands, the AMD64 and x86 architectures should be killed, shot and buried (not necessarily in that order).
    They have been going in a lot of directions, but an accumulator CPU design is so 1971.
    And if you do not agree with me, then here is my proof: http://ref.x86asm.net/coder.html .
    I am quite certain you’re one of the few that would understand that table.

    • Scali says:

      > We have reached this point a decade ago, and oddly enough the current solution is to loop-unroll and parallelize the problem where we can.
      > Sure it works, what we truly need is a paradigm shift, and i don’t see this happening in the foreseeable future.

      Isn’t SMT the ‘current solution’ to that? If you can’t fill the pipeline bubbles from one thread, add another.

      • Veda says:

        Well Scali it’s a bit of both.
        SMT requires a context switch, which if I’ve been told correctly means you swap out your register file.
        This does very little to solve the problem, as it only attempts to hide it.
        It invalidates your current instruction cache line in flight, and you’re looking at the L1 cache penalty.

        So instead of looking at the full pipeline length you’re looking at cache latency (admittedly a much lower number for most: a 16-stage pipeline vs 4 cycles for L1 cache on Haswell).

        Let’s get a little back to the article you wrote.
        SMT != CMT, CMT = SMT

        I am not fond of CMT nor of SMT; I find them to be hacks to evade some of the laws of physics.
        CMT’s merit over SMT is that it does not matter which pipe-line on the cluster has the bubble occur it can be filled.
        The cost of doing this, however, is a much larger register file, and each core needs either a port to it (slower, but cheaper for the hardware) or a duplication of the hardware at the cost of a more complex control circuit (faster, but much more expensive).
        This larger register file makes context switching much more costly.
        Besides that, the tech is mostly comparable.

      • Scali says:

        > SMT requires a context switch, which if I’ve been told correctly means you swap out your register file.

        No, it doesn’t.
        SMT is where you have multiple register files and other logic so you can have multiple thread contexts active simultaneously (hence the name).
        Basically a core with HyperThreading is a sort of ‘Siamese twin’ of two cores sharing the decoders, OOOe logic and a single execution backend.
        This invalidates the rest of your post, sadly, because your assumptions of making a context switch and invalidating a cacheline are both false.

        > CMT’s merit over SMT is that it does not matter which pipe-line on the cluster has the bubble occur it can be filled.

        Again, completely false.
        CMT is a different kind of ‘Siamese twin’, where you have two cores, where they only share the instruction decoder and a single FPU backend, but the integer ALUs are still dedicated to each core.
        This means that:
        A) CMT requires more dedicated hardware, because it can share less between threads (it’s closer to conventional dual-core).
        B) CMT cannot fill bubbles in the integer pipeline, because each integer pipeline only has a single thread context active.

        As a result, AMD had to include fewer ALUs per core, to be able to include support for 8 threads on a single chip. The resulting chip is still much larger than Intel’s CPUs using HT to run 8 threads. It also runs single threads much slower, mostly because each thread has fewer execution resources at its disposal.

  8. Veda says:

    Scali, in my book a context switch used to mean that you jumped to a subroutine, stored all relevant registers and restored them before jumping back.
    On these modern CPUs a context switch is switching a register file and to some extent the state machine.

    I did not say invalidating a cache line, I said invalidating instruction cache line in flight.
    Which I must admit could have been formulated a whole lot better.
    As I intended to say: invalidating instructions in flight from the cache line.
    It doesn’t really matter; SMT and CMT are just as bad, Intel is just doing it better than AMD at this moment.

    • Scali says:

      As I say, you don’t switch contexts with SMT, you have two (or more) contexts active at the same time.

      I don’t see why you’d call SMT bad. They need to add about 5% extra transistors over a single core to implement HT, and they get more than 5% extra performance from the second thread (20-50% is possible, depending on the type of code). So they improve efficiency.
      CMT is barely cheaper to implement than a ‘full’ core, and performance is about 80% as good as a ‘full’ core. I don’t think they save more than 20% of transistors over a ‘full’ core, so efficiency isn’t boosted much, if at all. The worst thing is that they made the single cores slower as I already said.
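
      The arithmetic behind that comparison, as a small Python sketch. The function name is made up, and the CMT figures are one assumed reading of the numbers above: the second core costs about 80% of a full core and delivers about 80% of a full core’s throughput.

```python
def perf_per_transistor(extra_perf, extra_transistors):
    """Throughput per transistor of a two-thread design, relative to one
    plain single-threaded core (which scores 1.0 by definition)."""
    return (1.0 + extra_perf) / (1.0 + extra_transistors)


# Figures quoted above: ~5% more transistors for 20-50% more throughput.
smt_low = perf_per_transistor(0.20, 0.05)
smt_high = perf_per_transistor(0.50, 0.05)

# Assumed reading of the CMT figures: +80% transistors for +80% throughput.
cmt = perf_per_transistor(0.80, 0.80)

print(f"SMT: {smt_low:.2f}x to {smt_high:.2f}x, CMT: {cmt:.2f}x")
```

      Under these assumptions SMT comes out at roughly 1.14x to 1.43x the efficiency of a plain core, while CMT stays at 1.00x, which is the ‘not boosted much, if at all’ case.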

      Anyway, you seem to be confusing various things. Context switching in the case of multithreading is not done on subroutine level. With a subroutine, the programmer (or compiler) is responsible for saving and restoring registers. A context switch is done by the OS scheduler, where the OS handles saving and restoring of registers (and, since it has no knowledge of which registers are in use, it has to save all of them).

      Also, a ‘register file’ is confusing. There are CPUs with multiple register files/windows, which they can ‘shift’ with a single instruction. x86 cannot do this.

  9. Rebrandeon says:

    http://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700

    “However there are a few edge cases where AMD is lacking behind 10-20% still, even to Broadwell.”

    Bulldozer 2.0 confirmed.
    Underwhelming, garbage performance. But hey, let’s cherry-pick the benchmarks we win against Intel at our event.
