Agner Fog recently wrote a blog post on how the Intel Compiler decides which codepath to run on a given CPU, and how this affects AMD. While I respect Agner Fog, and basically agree with what he says (I’ve known about this mechanism and its ‘side-effects’ for many years; that part is not up for debate as far as I’m concerned), I don’t agree with the way he presented it. The term ‘cripple AMD’ is not entirely accurate, to say the least. But more on that later. Perhaps Agner wrote it that way to draw attention to the issue, in which case he has succeeded very well indeed.
I’ll first comment on the benchmarks, as I think that is the REAL problem here. Clearly the Intel Compiler is not a vendor-neutral compiler. That much is common knowledge. It isn’t a bad thing in itself… but if you want to sell vendor-neutral benchmarking tools, then the LAST thing you should do is use the Intel Compiler. However, I don’t think that is Intel’s fault (unless someone can prove that Intel somehow bribed or otherwise ‘persuaded’ the developers to use their compiler). Do you convict the storekeeper who sold the rope that someone used to hang himself? While it is possible to hang yourself with a rope, that is not the main reason why rope exists. As such, there is no direct relation between selling rope and the death.
Of course that’s not how AMD fanboys see it. They think Intel should be convicted for developing a compiler that supports its own products (and obviously not the competition’s). They even think it’s a case of Intel exploiting its powerful market position. Now that is ridiculous… The Intel Compiler is not bundled with a CPU; it is sold separately. And it doesn’t have a strong market position at all. Microsoft and gcc dominate the compiler market. Intel is just a small player, mainly interesting for scientific computing, where maximum performance is required and code is only run on the user’s own (Intel) systems. Another thing that is ridiculous is that they somehow mistake the system requirements (as in, the machine required to run the Intel Compiler itself) for its optimization targets.
What’s worse, they don’t seem to understand exactly WHAT the Intel Compiler does in its CPU dispatcher. They don’t understand the basics of code optimization. While it would go a little too far to explain the microarchitectural differences in detail (I will refer to the Intel Optimization Manual for that), I will give a small practical example of a ‘small challenge’ we had on the asmcommunity forum.
Okay, so it’s just a simple coding problem, and the challenge was to create the fastest routine. Now, this was all done in assembly language (with some C++ routines as reference), so no Intel Compiler was used; all code was written directly for the x86 architecture and ran ‘as is’.
Various programmers of various skill levels contributed their solutions. We put all the different solutions into a single timing framework, and ran it on various PCs. The results were something like this:
1) lingo12/Scali SSSE3
2) Scali SSE2
3) Scali MMX+SSSE3
2) Scali2/C compiler
3) sysfce2/C compiler
1) Scali SSE2
3) Scali MMX
3) lingo12/Scali2/C compiler/drizz
1) Scali MMX
As you can see, I made a top 3 for each CPU type (with separate rankings for ‘plain x86’ routines and routines using extensions such as MMX or SSE)… and as you can see, the top 3 is different every time. That is exactly the point I’m making here. Which solution is the winner? In other words, which is the most optimal routine? You cannot answer that unless you specify which CPU. The keyword here is ‘microarchitecture’.
Now this is the problem that all compilers face: what may be optimal for one CPU may be disastrous for another. For example, lingo12’s routine is the fastest on a Core2 Duo, but it isn’t even in the top 3 on a Pentium 4. And those are both Intel CPUs, and they support pretty much the same instruction set features.
That’s why the Intel Compiler doesn’t just check for “GenuineIntel” and whether, say, SSE3 is supported. No, it also checks other CPUID info, such as the family number. This way it knows which microarchitecture the CPU has, and can pick the proper code for it. Agner Fog actually mentions this in his blog post, but it seems to be misunderstood and/or simply ignored by everyone commenting on it. He even points out that Intel names its CPUs in a rather peculiar way, just to keep this mechanism of microarchitecture selection working.
Namely, if we start at the beginning… CPUID was introduced with the Pentium, and also added to 80486 processors of that era. The Pentium was designated family 5 (‘Pentium’ meaning ‘fifth’, referring to the fact that it is technically the 80586; Intel dropped the use of numbers in favour of names because a number couldn’t be trademarked, and competitors kept using the same numbers as Intel). The 486 was family 4, and technically it would go down through 386, 286, 186 and 8086. So originally the family number was just the CPU model: every time Intel introduced a new microarchitecture, the family was increased by one.
This also held true for the Pentium Pro, which was family 6. Pentium II and Pentium III were also family 6, as they were little more than a Pentium Pro with MMX and SSE extensions added, from a microarchitectural point-of-view. As such, the optimization rules didn’t change. When Intel introduced the Pentium 4, the family number was increased again, this time to family 15, with an extra ‘extended family’ field added to CPUID (and the Pentium 4 required vastly different code to reach optimal performance). However, when Intel introduced the Core2, they went BACK to family 6. And Core i3/i5/i7 also still report family 6. The logic behind this move is that Core2 and Core i3/i5/i7, while technically new microarchitectures, have performance characteristics very similar to the PPro family, and as such require pretty much the same type of optimizations. So when older software (most notably code generated by the Intel Compiler) ‘sees’ family 6, it will select the most optimal code.
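To make the family/model mechanism concrete, here is a minimal sketch (the helper names are my own; the bit layout follows Intel’s CPUID documentation) of how the ‘display family’ and ‘display model’ are decoded from the EAX value returned by CPUID leaf 1:

```c
#include <stdint.h>

/* Sketch: decode the display family/model from CPUID leaf 1's EAX value.
 * The base family sits in bits 8-11; when it reads 15, the extended
 * family field (bits 20-27) is added on top. The extended model field
 * (bits 16-19) only counts for families 6 and 15. */
static unsigned display_family(uint32_t eax)
{
    unsigned family = (eax >> 8) & 0xF;
    if (family == 0xF)
        family += (eax >> 20) & 0xFF;
    return family;
}

static unsigned display_model(uint32_t eax)
{
    unsigned family = (eax >> 8) & 0xF;
    unsigned model  = (eax >> 4) & 0xF;
    if (family == 0x6 || family == 0xF)
        model += ((eax >> 16) & 0xF) << 4;
    return model;
}
```

A Core2-era signature such as 0x06FB decodes to family 6, while a Pentium 4 signature such as 0x0F29 decodes to family 15; a Core i7 signature such as 0x106A5 still reports family 6, with the extended model field distinguishing it from older family 6 parts.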
But therein lies the problem: if you don’t know anything about the microarchitecture, you can’t select the most optimal codepath, because you have no idea which one that should be. The above ‘small challenge’ illustrates that perfectly. The optimal code on one microarchitecture can be disastrous on another. Take for example the Athlon XP. The “Scali MMX” routine is the fastest of all routines there. Okay, so you run it on a Pentium 4… You detect MMX, so you figure you should select “Scali MMX”, as that is the fastest routine for MMX-supporting CPUs. Well, the joke’s on you: “Scali MMX” is second-to-last on the Pentium 4. And it’s not just the Pentium 4… on the Core2 Duo, “Scali MMX” ends up well down the list too.
So as you can see, instruction set features don’t say anything about performance; it’s a gamble. Currently Intel doesn’t take that gamble at all, and just plays it safe. If Intel were to take a gamble on a microarchitecture it doesn’t know, and it turned out to be a bad choice (such as the above “Scali MMX” example), people would still cry foul. My suggestion to Intel is that they ask their x86 licensees to pick the codepath that the Intel Compiler should run for their microarchitecture, and get it in writing. That way, Intel shows its good intentions, it will *probably* pan out well for end users in practice most of the time (the current Phenom microarchitecture is not that different from Core2/Core i7), and if it fails, it isn’t Intel’s fault, as they just did what was agreed on.
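To illustrate what ‘playing it safe’ looks like, here is a hypothetical dispatcher sketch (the names and structure are my own, not the Intel Compiler’s actual implementation): tuned paths are only selected for vendor/family combinations the dispatcher knows about, and anything else falls back to a conservative generic routine rather than gambling on feature flags:

```c
#include <string.h>

/* Hypothetical dispatcher sketch -- not the Intel Compiler's actual code.
 * The routine bodies are placeholders; imagine SSE2/SSSE3-tuned variants. */
typedef int (*routine_t)(int);

static int generic_x86(int x)    { return x * 2; }  /* safe everywhere  */
static int tuned_core2(int x)    { return x * 2; }  /* e.g. SSSE3-tuned */
static int tuned_pentium4(int x) { return x * 2; }  /* e.g. SSE2-tuned  */

struct cpu_id {
    char vendor[13];   /* CPUID leaf 0 vendor string, e.g. "GenuineIntel" */
    unsigned family;   /* display family from CPUID leaf 1 */
    unsigned model;    /* display model from CPUID leaf 1 */
};

static routine_t select_routine(const struct cpu_id *cpu)
{
    if (strcmp(cpu->vendor, "GenuineIntel") == 0) {
        if (cpu->family == 15) return tuned_pentium4;  /* NetBurst rules   */
        if (cpu->family == 6)  return tuned_core2;     /* PPro-style rules */
    }
    /* Unknown vendor or family: feature bits alone can't tell us which
     * tuned path would be fastest here (see the "Scali MMX" example),
     * so don't gamble -- take the conservative path. */
    return generic_x86;
}
```

The point of the sketch is the fallback: an unknown microarchitecture gets the generic path, because selecting a tuned path on feature bits alone is exactly the gamble described above.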
The irony of this is that the Pentium 4 itself suffered quite severely from this microarchitectural optimization problem. When it debuted, most code was optimized for the Pentium/PPro. The Pentium 4 didn’t handle such code very well; its x87 implementation in particular was relatively weak. When code was recompiled for the P4, avoiding certain performance hazards and making use of new features like SSE2 to work around shortcomings such as the weak x87, it could make quite significant leaps in performance. This is also partly the reason why it was always a strong performer in 3D rendering and video processing: these tasks benefited greatly from SSE2, and the developers in those particular markets were quick to adopt the new technology and recompile/optimize their applications for the new microarchitecture.
To conclude… there may be ‘tainted’ benchmarks, such as PCMark05, compiled with the Intel Compiler, which put non-Intel CPUs at a disadvantage… but I would like to point out that the shoe has been on the other foot as well. A few years ago, a benchmark called ScienceMark emerged in the review world. Historically it was one of the few benchmarks that Athlons performed well in… Look at these results, for example. An Athlon64 FX-62 about as fast as a Core2 Duo X6800? Amazing; no other benchmark shows results even remotely similar…
The plot thickens when you realize that some of ScienceMark’s developers are/were AMD employees.
(eg ‘redpriest’, as he himself says here: “Full disclosure: I am an engineer that works for AMD (in CPUs and not in graphics)”).
It appears that ScienceMark 2.0 happens to implement synthetic benchmark routines that seem to perform well on Athlon microarchitectures, and not that well on Intel microarchitectures (even the ones that are generally considered to be superior overall by quite a margin, such as Core2 vs Athlon64). Given the link to AMD, is that a coincidence?
PS: I don’t use the Intel Compiler myself. Partly because of issues like these, and partly because I don’t see why I should spend money on another compiler when Microsoft’s compiler and gcc do a fine job as well. However, I do think it’s Intel’s right to only support optimizations for its own CPUs. That is also what they advertise. I don’t like to see any government or organization forcing regulations on Intel in this case, because it would be incredibly arbitrary. It’s bad enough that anti-trust laws and regulations have such an arbitrary character (it seems that companies sue Microsoft simply because they know Microsoft can’t win anyway because of its size, and AMD did the same with Intel), but in this case Intel doesn’t even have a considerable market share, so it isn’t even an anti-trust case. It would mean that any company could sue any competitor, regardless of market share. Whoever has the best spin doctors will convince the judge and jury (who, after all, are just laymen).