Intel Haswell: How low can you go?

At the recent Intel Developer Forum, Intel disclosed new details about the upcoming Haswell architecture. For the full story, you can read the coverage from various tech sites, such as Anandtech. As usual, I will just pick out a few things that I think are worth commenting on.

Efficiency

I think the most relevant bit of news is that Intel continues to aim for lower power envelopes. From Nehalem on, there was already a trend of Intel bringing the TDP down, while maintaining or even increasing performance. Lynnfield would take the TDP from 130W to 95W with virtually the same performance. Sandy Bridge then held on to the 95W TDP while improving performance. Ivy Bridge dropped the TDP down to 77W while improving performance slightly once again. And now Haswell is aiming to reduce TDP yet again. In fact, Anandtech suspects that Intel wants to use Haswell-derivatives in tablets and similar devices, rather than the Atoms they’ve used for this purpose so far. Anand is even speculating that Intel may drop Atom altogether and use a single architecture for anything from smartphones up to desktops. Not yet with Haswell, but perhaps with its successor (as always, the high-end architectures of today are the low-end architectures of tomorrow). This would mean a full head-on attack on ARM.

Yes, I’ve said it: ARM. That seems to be Intel’s target for the future, not AMD. As I discussed earlier, AMD is still pursuing performance more than power consumption and efficiency. Since Haswell clearly focuses on efficiency, AMD will have its work cut out with Steamroller.

Performance

Does this mean that Haswell is only about lower power consumption? No, not at all. Intel is also introducing some instruction set extensions, such as the Transactional Synchronization Extensions (TSX), which I already mentioned in the earlier Steamroller article. They are also adding the new Advanced Vector Extensions 2 (AVX2), which should improve both floating point and integer performance.
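To give an idea of what these extensions look like to a programmer, here is a minimal C sketch using the AVX2 and TSX/RTM intrinsics from immintrin.h (compile with -mavx2 -mrtm on GCC or Clang). The array addition and the counter with its spinlock fallback are made-up examples of mine, not anything from Intel's material; they just illustrate 256-bit integer SIMD and a hardware transaction with a conventional fallback path.

    #include <immintrin.h>   /* AVX2 and TSX/RTM intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    /* AVX2 widens integer SIMD to 256 bits: one instruction adds eight 32-bit integers. */
    static void add_arrays_avx2(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
    {
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi32(va, vb));
        }
        for (; i < n; i++)               /* scalar tail for the leftover elements */
            out[i] = a[i] + b[i];
    }

    /* TSX/RTM: try the critical section as a hardware transaction first;
       if it aborts, fall back to an ordinary (spin)lock. */
    static long counter;
    static volatile int fallback_lock;   /* 0 = free, 1 = held */

    static void increment_counter(void)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            if (fallback_lock)           /* read the lock, so a fallback-path writer aborts us */
                _xabort(0xff);
            counter++;                   /* speculative update, committed atomically by _xend() */
            _xend();
        } else {
            while (__sync_lock_test_and_set(&fallback_lock, 1))
                ;                        /* plain spinlock as the non-transactional fallback */
            counter++;
            __sync_lock_release(&fallback_lock);
        }
    }

Whether compilers and libraries actually pick these up quickly is another matter, of course; historically it has taken a few years after each new extension before mainstream software made use of it.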

Another thing I found remarkable is that Intel is adding two new execution ports to the architecture, for the first time in years (it started with 5 ports in the Pentium Pro; Conroe added a 6th port, and Haswell will take it up to 8). For years I’ve been wondering just how much extra instruction level parallelism (ILP) could be extracted from x86 code. And every time Intel managed to surprise me by improving its instructions-per-cycle (IPC), which was at least partly the result of better ILP (whereas AMD has been stuck on more or less the same level of IPC since the Athlon64, and actually reduced IPC in their Bulldozer architecture). However, since the introduction of HyperThreading, the question has also lost some of its relevance. Even if the extra execution ports do little for a single thread, they mean there are more resources to share between two threads, so multithreaded performance may well improve.

Intel has also steadily increased its Out-of-Order-execution (OoO) instruction window over the years, and in Haswell they are increasing it yet again. Like with the extra execution ports, I am not sure how much performance they can get from a single thread, if any, but again the increased OoO-window will clearly benefit HyperThreading scenarios, since the OoO-window is partitioned equally between the two threads. In Haswell the OoO-window will be so large that each logical core will have roughly the same number of instructions in flight as a single core in the Conroe architecture.

At any rate, it looks like Haswell has some interesting improvements in store. Lower power consumption and higher performance (by the looks of it, especially in multithreaded scenarios) is a great combination.

This entry was posted in Hardware news. Bookmark the permalink.

15 Responses to Intel Haswell: How low can you go?

  1. Brendon says:

    I enjoy reading your blog and always learn a lot about CPU architecture so thank you. I am just curious does OoO = out-of-order?

  2. alex says:

    hello Scali!
    I’d like to know what you think about AVX2.
Is GPGPU for home users going to die?

    • Scali says:

I don’t think AVX2 will make much of a difference in that respect. It makes the CPU somewhat faster, but GPGPUs get faster with every generation as well, so this is just what Intel will need to do to keep up. I don’t think AVX2 will have a dramatic impact… none of the extensions ever did (despite MMX, various versions of SSE, and now AVX, we still use GPUs, not CPUs, for graphics. SSE was not even good enough to compete with hardware T&L videocards).
      Aside from that, I don’t consider GPGPU very ‘alive’ anyway, for home users. There’s still only a handful of programs that make use of it.

      • alex says:

I hope that the Havok physics engine will benefit from the AVX2 instructions … I think that would provide a suitable alternative to nvidia GPU PhysX … maybe I’m just dreaming XD

      • T. says:

Since 2010 we have been seeing lower unit sales of dGPUs and smaller revenues from both major players, but at the same time the number of PCs is going up, or at least stagnating.

This is a direct consequence of IGPs getting more powerful, and it is a trend that is here to stay. In the long term, dGPUs will become a niche market.

Don’t you think that the same may happen to GPGPU? AVX2 and IGPs taking a bigger share of the processing workload, to the point where the performance is reasonable for almost everyone, effectively killing the GPGPU market?


      • Scali says:

        Discrete GPUs are being replaced by integrated GPUs, which are still GPUs. AVX2 replacing GPUs is an entirely different story. As integrated GPUs get more powerful, their GPGPU abilities also improve.
Nevertheless, integrated GPUs only compete with the absolute low end of discrete GPUs. High-end GPUs are far more powerful, and it is unlikely that CPUs will catch up on that in the short term, if ever. A GPU is simply a different type of processor than a CPU, and both have their pros and cons.

  3. NewimprovedJDWII says:

Hey, unless I missed it in the article, what kind of iGPU improvements do you expect with Haswell? Loads of people are saying it’s going to beat Trinity on this front.

Also, I’m surprised you didn’t do an article on AMD lately, since they are not doing too well. This is a really childish question, but what would you do if you were the CEO of AMD right now?

    • Scali says:

      The Anandtech article I linked to covers the GPU side of Haswell as well: http://www.anandtech.com/show/6355/intels-haswell-architecture/12
      But I wanted to concentrate only on the CPU side this time.
      In their conclusion, they say that the high-end GPU versions will be up to twice as fast as Ivy Bridge’s HD4000. That should put the performance in the same ballpark as Trinity at least. But obviously Haswell will have a severe advantage in games that are heavier on the CPU, since Trinity gets CPU-limited quickly, as I covered in my earlier Trinity article: sometimes Trinity is not even faster than Llano, and even the HD4000 outperforms it in some cases.

      As for AMD… I think they’re already doing what they should be doing: they rehired Jim Keller, and the focus on future CPUs will be more on reducing power consumption and improving efficiency, making better mobile solutions and such.
The only problem is that they should have done this years ago. It takes a long time to turn a CPU company around. We’ve seen that with Intel: the Pentium 4 wasn’t the right direction, but it took them years to switch over to the Core architecture and the new tick-tock strategy.
      Likewise, there will be Bulldozer-based products in the AMD pipeline for years to come, so Steamroller is certainly not going to reflect any changes yet. It’s still going to be a ‘Pentium 4’. It will probably take 2-3 years at least until we see the first signs of the ‘new AMD’.

      The irony of it all is that Bulldozer is actually a remarkably easy chip to market, since it combines both the MHz myth and the multicore myth in a single chip: both the clockspeed and the core count are higher than similarly priced Intel chips, so on paper they look great, and apparently that’s part of the reason why they are selling relatively well.

      • T. says:

        AMD quarterly results were a bloodbath in the last two quarters, and Q4 should be even worse. Bulldozer is not selling at all on servers, and going very cheap on desktops.

  4. T. says:

Looks like AMD wasn’t the first CPU designer to share the FPU:
    http://en.wikipedia.org/wiki/Rock_%28processor%29

    The 16 cores in Rock are arranged in four core clusters. The cores in a cluster share a 32 KB instruction cache, two 32 KB data caches, and two floating point units. Sun designed the chip this way because server workloads usually have high re-utilization in data and instruction across processes and threads but low number of floating-point operations in general. Thus sharing hardware resources among the four cores in a cluster leads to significant savings in area and power but low impact to performance.

And the results were more or less the same, except that Sun was not foolish enough to take the project to market:

“On 5 April 2010, Dave Dice and Nir Shavit released a paper “TLRW: Return of the Read-Write Lock” to be presented at SPAA 2010. On 12 May 2010, Reuters reported that Oracle CEO Larry Ellison shut down the Rock project when Oracle acquired Sun, quoting him as saying, “This processor had two incredible virtues: It was incredibly slow and it consumed vast amounts of energy. It was so hot that they had to put about 12 inches of cooling fans on top of it to cool the processor. It was just madness to continue that project.””

    • Scali says:

      Sun has had CPUs with shared FPUs on the market for years, in their Niagara architecture, with one FPU for 8 cores, and 4-way SMT per core (although the later version had one FPU per core, and 8-way SMT, so effectively they went from one FPU per 32 threads to one FPU per 8 threads). Works fine for database/webservers (I’ve mentioned it a few times in blogposts/comments here). For AMD it’s a bit of a different story, since they sell their CPUs as general-purpose mainstream desktop/laptop parts. Then again, AMD hopes that the floating point workload can be shifted to GPGPU over the years, where AMD’s (integrated or discrete) GPUs deliver very good floating point performance.

  5. Pingback: Haswell Hasarrived | Scali's OpenBlog™

  6. Pingback: Intel disables TSX in Haswell | Scali's OpenBlog™
