John Fruehe: AMD’s latest and greatest liar

Posted on August 31, 2010 by Scali

AMD is starting the marketing offensive on Bulldozer and Bobcat… John Fruehe is joining some forums to enter discussions with his marketing spin.

It appears that AMD is getting really desperate now. They are actually downplaying their current architecture in order to make Bulldozer look better.

Take this post for example:

In AVX we will have 8 256-bit units
In non-AVX we will have 16 128-bit units.
Compared to everything I have seen on the server Sandybridge, they will have 8 256-bit AVX units, so we are generally tied on AVX code, but on non-AVX code they will only have 8 128-bit units, or half the FP capability.

This was in a discussion that was mostly focused on how their current architecture has three FPU execution ports per core (two of which are 128-bit SIMD units), whereas Bulldozer has only two FPU units (both 128-bit) per module, i.e. per two cores.

He is completely ignoring their own Magny Cours, which has 12 cores and therefore 24 128-bit units. That is a lot more than the 16 units in Interlagos, the highest-end Bulldozer server variant (8 modules, 16 cores). This comparison deflects attention away from the obvious: Magny Cours has more FP units than Interlagos. Interlagos’ units need to be QUITE a bit more efficient in order to break even… and what does breaking even with Magny Cours mean? Not much, really. Magny Cours is already struggling against current Nehalem derivatives with 8 cores and HyperThreading (so 16 logical cores). Sandy Bridge will only improve on Nehalem’s performance (floating point is going to be one of the strong points of the architecture, according to AnandTech’s preview). On the other hand, it would be surprising if AMD could actually make Bulldozer’s 16 FP units perform as well as Magny Cours’ 24 FP units. Bulldozer is not likely to close the gap with Nehalem, let alone Sandy Bridge.
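The unit-count arithmetic can be written out as a quick sanity check. This is only a sketch: it counts 128-bit FP units per socket from the figures quoted in this post, assumes equal clock speeds, and ignores FMA, scheduling and memory effects. The function and variable names are made up for illustration.

```python
from fractions import Fraction

def fp_units(cores_or_modules: int, units_each: int) -> int:
    """Total 128-bit FP units in one socket (simple multiplication)."""
    return cores_or_modules * units_each

# Figures as quoted in this post, not measurements.
magny_cours = fp_units(12, 2)   # 12 cores, two 128-bit units per core
interlagos = fp_units(8, 2)     # 8 modules, two 128-bit units per module

# Per-unit throughput Interlagos would need, relative to Magny Cours,
# to break even at the same clock speed (assumption: linear scaling).
break_even = Fraction(magny_cours, interlagos)

print(magny_cours, interlagos, break_even)  # 24 16 3/2
```

In other words, under these assumptions each Interlagos unit would have to do 1.5 times the work of a Magny Cours unit just to match it.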

Also, the current architecture has multiple 128-bit SIMD ports per FPU, so a single core can perform multiple FP operations in parallel. As you can read in Agner Fog’s microarchitecture document: http://www.agner.org/optimize/microarchitecture.pdf

12.7 Floating point execution pipes

The three execution units in the floating point pipes are named FADD, FMUL and FMISC.

FADD can handle floating point addition. FMUL can handle floating point multiplication and division. FMISC can handle memory writes and type conversions. All three units can handle memory reads. The floating point units have their own register file and their own 80-bits data bus.

The latency for floating point addition and multiplication is 4 clock cycles. The units are fully pipelined so that a new operation can start every clock cycle. Division takes 11 clock cycles and is not fully pipelined. The latency for move and compare operations is 2 clock cycles.

He also claims that the current generation can only do 1.5 ALU and 1.5 AGU instructions per cycle, since the units are shared: http://www.xtremesystems.org/forums/showthread.php?257927-AMD-s-Bobcat-and-Bulldozer&p=4523917&viewfull=1#post4523917

Today’s processors have 3 execution units that are shared between ALU/AGU. That is essentially 1.5 ALU and 1.5 AGU. With BD we get 2 AGU and 2 ALU. Much better.

Anyone who has ever optimized for AMD’s K7 or higher knows that this is wrong. It is irrelevant that the ALU/AGU units may be shared, since the architecture can only dispatch 3 ops per clk at most. However, all 3 ops can be ALU, or all 3 can be AGU. That is not possible with BD. You can also read about this in Agner Fog’s microarchitecture documentation: http://www.agner.org/optimize/microarchitecture.pdf

12.6 Integer execution pipes

Each of the three integer execution pipes has its own ALU (Arithmetic Logic Unit) and its own AGU (Address Generation Unit). Each of the three ALU’s can handle any integer operation except multiplication. This means that it is possible to do three single integer instructions in the same clock cycle if they are independent. The three AGU’s are used for memory read, write and complex versions of the LEA instruction. It is possible to do two memory operations and one LEA in the same clock cycle. It is not possible to do three memory operations because there are only two ports to the data cache.
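To make the difference with the “1.5 ALU and 1.5 AGU” claim concrete, here is a small sketch of which ALU/AGU mixes can issue in a single cycle. It is a deliberately simplified model based only on the descriptions in this post: three pipes for the current cores with at most 3 ops per cycle in any ALU/AGU mix, and 2 ALUs plus 2 AGUs per Bulldozer core. It ignores the data-cache port limit, multiplication and everything else; the function names are hypothetical.

```python
def k10_can_issue(alu_ops: int, agu_ops: int) -> bool:
    """Current cores: three pipes, each with its own ALU and AGU.

    Assumption from the post: at most 3 ops per cycle, in any mix.
    """
    return alu_ops >= 0 and agu_ops >= 0 and alu_ops + agu_ops <= 3

def bulldozer_can_issue(alu_ops: int, agu_ops: int) -> bool:
    """Bulldozer core as described: 2 dedicated ALUs and 2 dedicated AGUs."""
    return 0 <= alu_ops <= 2 and 0 <= agu_ops <= 2

# Print (mix, fits on current core, fits on Bulldozer core).
for mix in [(3, 0), (0, 3), (2, 1), (2, 2)]:
    print(mix, k10_can_issue(*mix), bulldozer_can_issue(*mix))
```

In this model a burst of three ALU ops or three AGU ops fits in one cycle on the current cores but not on Bulldozer; the only mix that favours Bulldozer is 2 ALU plus 2 AGU.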

But what bothers me most is that so many people just seem to buy whatever AMD says. They just want to believe that Bulldozer is going to be this super-fast CPU, and a total Intel-killer. They want to believe it so badly. They no longer use common sense.

I get a deja-vu feeling of the days before the Barcelona launch… with Randy Allen claiming that Barcelona would be 40% faster than Intel’s fastest offerings. People just ate it up… and it was a huge disappointment to them when they found out that Barcelona was in fact not faster than Intel’s “multi chip modules” (you know, “they glued two CPUs together, it’s not even a REAL quadcore, like AMD’s native solution!!!11oneone”). I think it will be a similar disappointment this time around… especially for those who aren’t interested in the server market. As a desktop CPU, Bulldozer will probably not be a very interesting option, with its reduced floating point and single-threaded performance. It is probably not going to be that great for gaming and photo/video encoding. Mark my words.

In fact, John even goes as far as claiming that Bulldozer will have at least 17% higher IPC than Deneb. Aside from the fact that it’s pretty much impossible to boost IPC by 17% with a micro-architecture update, especially in the case of Bulldozer, where you will have fewer execution units per core than Deneb has… AMD’s own slides even contradict him: “Throughput advantages for multi-threaded workloads without significant loss on serial single-threaded workload components”. Well, that pretty much warns us already that serial single-threaded performance (in other words: IPC) may take a small hit… albeit not a ‘significant’ one. Various people on comp.arch also discussed this, including a former AMD chip designer, who also figured that IPC would go down a bit, rather than up… a sacrifice made for better multi-threading scalability (and possibly slightly higher clocks). Especially “more than 17%” just sounds like a fairytale. At least it’s not as bad as Randy Allen’s 40% figure… but it’s still a bald-faced lie. If, given Bulldozer’s architectural tradeoffs, they would break even with Deneb, that would already be quite a feat of engineering. I don’t rule out the chance that it may actually turn out to be a smidge faster… but 17%? That is a LOT, a WHOLE lot. They would need some kind of ‘secret sauce’ for that… which they don’t have.

So I will just repeat what I said elsewhere:

What we know so far is this:
– They are removing one ALU and one AGU per core.
– They are removing one decoder per core (effectively, as you get one 4-wide decoder per module, so effectively 2-wide decoders per core, as opposed to three decoders).
– They are sharing one FPU/SIMD unit per 2 cores.
– They are stretching the pipeline to more stages.

These are all actions that DECREASE the execution resources and efficiency in some way.
We have not heard about them even COMPENSATING for the removal of these resources yet… so even being on par with previous generation IPC in single threads would already be quite an improvement in efficiency (roughly 33%, which is arguably more than what AMD or Intel ever achieved in a microarchitecture update, with the obvious exception of Netburst->Conroe, although this was skewed by the drop of about 1 GHz in clockspeed).
So basically, with this level of reduction and sharing of resources, and STILL increasing IPC over the previous gen, that would be one INCREDIBLE feat of engineering… And they’d need some pretty nifty ‘secret sauce’ to make these ‘anemic’ cores run that fast. Have you seen it? I haven’t.
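The relationship between removing resources and the efficiency needed to compensate can be sketched numerically. This assumes, purely for illustration, that throughput scales with the number of units times a per-unit efficiency, and it uses the three-to-two ALU reduction per core from the list above. Under that assumption, one third is the share of units removed, and the per-unit gain needed to break even comes out even higher. The helper names are hypothetical.

```python
from fractions import Fraction

def resource_cut(before: int, after: int) -> Fraction:
    """Fraction of units removed when going from `before` to `after`."""
    return Fraction(before - after, before)

def gain_to_break_even(before: int, after: int) -> Fraction:
    """Extra per-unit efficiency needed so that after * (1 + gain) == before.

    Assumption: throughput = units * per-unit efficiency.
    """
    return Fraction(before, after) - 1

cut = resource_cut(3, 2)          # 1/3 of the ALUs removed
gain = gain_to_break_even(3, 2)   # 1/2 more work needed per remaining ALU

print(f"{float(cut):.0%} fewer units, {float(gain):.0%} more work per unit")
```

Real IPC does not scale linearly with unit count, so these numbers only illustrate the direction and rough size of the gap that would have to be closed.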


21 Responses to John Fruehe: AMD’s latest and greatest liar

Torquemada says:

August 22, 2011 at 3:37 pm

Scali says: “He is completely ignoring their own Magny Cours, which has 12 cores and therefore 24 128-bit units. A lot more than the 16 units in Interlagos, the most high-end Bulldozer server variation (8 modules, 16 cores). This comparison deflects the attention away from the obvious: Magny Cours has more FP units than Interlagos. Interlagos’ units need to be QUITE a bit more efficient in order to break even…”

Sometimes IL’s units are more efficient: they can each do FMA. Remember that MC’s two units are separate Mul and Add units. When you can employ FMA — and this ought to be quite often in the fast fourierous workloads that quite matter here — IL can do 16 Mul & 16 Add whereas MC can do 12 Mul + 12 Add.

(As an aside, I presume IL is a bit easier to multi-thread because you get the full FP broadside with 8 threads instead of MC’s 12; but I don’t know if this has any practical meaning, you are in the land of massively parallel anyway.)

Scali: “and what does breaking even with Magny Cours mean? Not much, really. Magny Cours is already struggling against current Nehalem derivatives with 8 cores and HyperThreading (so 16 logical cores).”

Agreed. I too don’t expect the best performance/socket from BD, just good performance/watt (good for AMD) and good performance/price (not so very good for AMD).

On the whole I’d call JF unfortunately vague but not an outright liar. I’d also call him a typical Marketing person. 😛

- Scali says:
  
  August 22, 2011 at 4:18 pm
  
  Well, the point of 1.5 ALUs and 1.5 AGUs in current AMD architectures is not vague. He literally says that, and it is patently false.
  
  Other than that, the problem with FMA is that all code needs to be recompiled, and then you’ll just have to see where you land (as you say, the operative word is ‘sometimes’). Perhaps they can break even then, or even pull ahead, compared to MC.
  I was talking about existing code, where BD doesn’t look like it’s going to be all that good.
  
  Also, if what I saw the other day is correct, the die of BD will be 316 mm^2, which is a lot bigger than Intel’s SB quadcore or even their Gulftown hexacore. So that is not a very good sign for good performance/watt… Large chips are power-hungry or clocked low (or both), not to mention the production costs/yields…
  Signs point to AMD once again selling larger chips with less performance than Intel’s competing chips, so AMD needs to cut into their profit margins to remain competitive. Not good for the long-term future of the company.
  
Torquemada says:

August 22, 2011 at 3:52 pm

One example of JF’s vagueness (I nearly called it smugness) that peeves me a bit:

JF in that AnandTech forum post: “Compared to everything I have seen on the server Sandybridge, they will have 8 256-bit AVX units, so we are generally tied on AVX code, but on non-AVX code they will only have 8 128-bit units, or half the FP capability.”

All fine and dandy there, but who in reality won’t use AVX if they vectorize anyway? All these 128-bit and 256-bit units crunch vectors of FP32 or FP64 operands. FP128 use is vanishingly rare, FP256 more so; we are definitely talking vectors here. Unless 8-operand (in the FP32 case) vectors are much harder to use than 4-operand vectors, and I don’t think so, JF is erecting an artificial wall here, an unnecessarily strong distinction between “256-bit” and “128-bit”. I might be reading him wrong though, this is all open to interpretation and different readings. But I wish he were more specific at times…

- Scali says:
  
  August 22, 2011 at 4:22 pm
  
  Well, that’s marketing for you, I suppose. He may have a point that their AVX unit can ‘split in half’ on 128-bit code… but that’s not the whole story.
  As you say, you need to be running 128-bit code in the first place… which is not that likely.
  Also, it’s not as simple as just counting the number of units you have. What also matters are things like how these units are ordered into the various execution ports, how their operations are pipelined etc.
  And then there are other things, such as memory access… AMD’s architecture has long suffered from not being able to reorder memory loads and stores, limiting the effectiveness of OoO.
  They also had problems with limited cache bandwidth… Having high-performance vector units is one thing… keeping them fed is another.
  