AMD is starting the marketing offensive on Bulldozer and Bobcat… John Fruehe is joining some forums to enter discussions with his marketing spin.
It appears that AMD is getting really desperate now. They are actually downplaying their current architecture in order to make Bulldozer look better.
Take this post for example:
In AVX we will have 8 256-bit units
In non-AVX we will have 16 128-bit units.
Compared to everything I have seen on the server Sandybridge, they will have 8 256-bit AVX units, so we are generally tied on AVX code, but on non-AVX code they will only have 8 128-bit units, or half the FP capability.
This was in a discussion that was mostly focused on how their current architecture has three FPU execution ports per core (two of which are 128-bit SIMD units), while Bulldozer has only two FPU units (both 128-bit) per module, i.e. per two cores.
He is completely ignoring their own Magny Cours, which has 12 cores and therefore 24 128-bit units. That is a lot more than the 16 units in Interlagos, the most high-end Bulldozer server variation (8 modules, 16 cores). His comparison deflects attention away from the obvious: Magny Cours has more FP units than Interlagos. Interlagos’ units need to be QUITE a bit more efficient just to break even… and what does breaking even with Magny Cours mean? Not much, really. Magny Cours is already struggling against current Nehalem derivatives with 8 cores and HyperThreading (so 16 logical cores). Sandy Bridge will only improve on Nehalem’s performance (floating point is going to be one of the strong points of the architecture, according to AnandTech’s preview). On the other hand, it would be surprising if AMD could actually make Bulldozer’s 16 FP units perform as well as Magny Cours’ 24 FP units. Bulldozer is not likely to close the gap with Nehalem, let alone Sandy Bridge.
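To put those counts side by side, here is a back-of-envelope sketch in Python. It counts raw 128-bit units only, using the figures from the paragraph above, and deliberately ignores clock speed and per-unit efficiency:

```python
# Back-of-envelope comparison of 128-bit FP/SIMD unit counts per socket,
# using the figures quoted in the text above.

K10_FP_UNITS_PER_CORE = 2        # FADD + FMUL, both 128-bit SIMD capable
magny_cours_cores = 12
magny_cours_units = magny_cours_cores * K10_FP_UNITS_PER_CORE   # 24 units

BD_FP_UNITS_PER_MODULE = 2       # two 128-bit units, shared by 2 cores
interlagos_modules = 8
interlagos_units = interlagos_modules * BD_FP_UNITS_PER_MODULE  # 16 units

# To merely break even, each Interlagos unit would have to deliver
# 24/16 = 1.5x the per-unit throughput of a Magny Cours unit.
required_uplift = magny_cours_units / interlagos_units

print(magny_cours_units, interlagos_units, required_uplift)
```

Even this crude count makes the point: before clocks or efficiency enter the picture, each Bulldozer FP unit has to do 50% more work just to stand still.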
Also, the current architecture has multiple 128-bit SIMD ports per FPU, so a single core can perform multiple FP operations in parallel. As you can read in Agner Fog’s microarchitecture document: http://www.agner.org/optimize/microarchitecture.pdf
12.7 Floating point execution pipes
The three execution units in the floating point pipes are named FADD, FMUL and FMISC.
FADD can handle floating point addition. FMUL can handle floating point multiplication and division. FMISC can handle memory writes and type conversions. All three units can handle memory reads. The floating point units have their own register file and their own 80-bits data bus.
The latency for floating point addition and multiplication is 4 clock cycles. The units are fully pipelined so that a new operation can start every clock cycle. Division takes 11 clock cycles and is not fully pipelined. The latency for move and compare operations is 2 clock cycles.
He also claims that the current generation can only do 1.5 ALU and 1.5 AGU instructions per cycle, since the units are shared: http://www.xtremesystems.org/forums/showthread.php?257927-AMD-s-Bobcat-and-Bulldozer&p=4523917&viewfull=1#post4523917
Today’s processors have 3 execution units that are shared between ALU/AGU. That is essentially 1.5 ALU and 1.5 AGU. With BD we get 2 AGU and 2 ALU. Much better.
Anyone who has ever optimized for AMD’s K7 or later knows that this is wrong. It is irrelevant that the ALU/AGU units may be shared, since the architecture can only dispatch 3 ops per clock at most anyway. The point is that all 3 of those ops can be ALU, or all 3 can be AGU. That is not possible with BD. You can also read about this in Agner Fog’s microarchitecture documentation: http://www.agner.org/optimize/microarchitecture.pdf
12.6 Integer execution pipes
Each of the three integer execution pipes has its own ALU (Arithmetic Logic Unit) and its own AGU (Address Generation Unit). Each of the three ALU’s can handle any integer operation except multiplication. This means that it is possible to do three single integer instructions in the same clock cycle if they are independent. The three AGU’s are used for memory read, write and complex versions of the LEA instruction. It is possible to do two memory operations and one LEA in the same clock cycle. It is not possible to do three memory operations because there are only two ports to the data cache.
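The difference between K10’s three flexible ALU+AGU pipes and a fixed 2-ALU/2-AGU layout can be sketched with a toy issue model. This is a hypothetical simplification: it ignores decode limits, dependencies, and the 2-port data cache restriction Agner mentions, and just counts issue slots:

```python
# Toy issue-throughput model: K10's three flexible ALU+AGU pipes vs. a
# fixed 2-ALU + 2-AGU layout (Bulldozer, per the quoted claim).
import math

def cycles_k10(alu_ops, agu_ops):
    # Any mix of ALU and AGU ops, up to 3 total per cycle.
    return math.ceil((alu_ops + agu_ops) / 3)

def cycles_bd(alu_ops, agu_ops):
    # At most 2 ALU and 2 AGU per cycle; the longer stream dominates.
    return max(math.ceil(alu_ops / 2), math.ceil(agu_ops / 2))

# An all-ALU burst of 6 ops:
print(cycles_k10(6, 0))  # 2 cycles (3 ALU ops per clock)
print(cycles_bd(6, 0))   # 3 cycles (only 2 ALUs available)

# A balanced 3 ALU + 3 AGU mix:
print(cycles_k10(3, 3))  # 2 cycles
print(cycles_bd(3, 3))   # 2 cycles
```

On a balanced mix the two layouts tie, which is presumably the case John’s “1.5 ALU and 1.5 AGU” framing has in mind; on an ALU-heavy burst the flexible pipes pull ahead, which is exactly the case his framing hides.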
But what bothers me most is that so many people just seem to buy whatever AMD says. They just want to believe that Bulldozer is going to be this super-fast CPU, and a total Intel-killer. They want to believe it so badly. They no longer use common sense.
I get a deja-vu feeling of the days before the Barcelona launch… with Randy Allen claiming that Barcelona would be 40% faster than Intel’s fastest offerings. People just ate it up… and it was a huge disappointment to them when they found out that Barcelona was in fact not faster than Intel’s “multi chip modules” (you know, “they glued two CPUs together, it’s not even a REAL quadcore, like AMD’s native solution!!!11oneone”). I think it will be a similar disappointment this time around… especially for those who aren’t interested in the server market. As a desktop CPU, Bulldozer will probably not be a very interesting option, with its reduced floating point and single-threaded performance. It is probably not going to be that great for gaming and photo/video encoding. Mark my words.
In fact, John even goes as far as claiming that Bulldozer will have at least 17% higher IPC than Deneb. Aside from the fact that it’s pretty much impossible to boost IPC by 17% with a micro-architecture update, especially in the case of Bulldozer, where you will have fewer execution units per core than Deneb has… AMD’s own slides even contradict him: “Throughput advantages for multi-threaded workloads without significant loss on serial single-threaded workload components”. Well, that pretty much warns us already that serial single-threaded performance (in other words: IPC) may take a small hit… albeit not a ‘significant’ one. Various people on comp.arch also discussed this, including a former AMD chip designer, who likewise figured that IPC would go down a bit, rather than up… a sacrifice made for better multi-threading scalability (and possibly slightly higher clocks). Especially “more than 17%” just sounds like a fairytale. At least it’s not as bad as Randy Allen’s 40% figure… but it’s still a bald-faced lie. If, given Bulldozer’s architectural tradeoffs, they were to break even with Deneb, that would already be quite a feat of engineering. I don’t rule out the chance that it may actually turn out to be a smidge faster… but 17%? That is a LOT, a WHOLE lot. They would need some kind of ‘secret sauce’ for that… which they don’t have.
So I will just repeat what I said elsewhere:
What we know so far is this:
– They are removing one ALU and one AGU per core.
– They are effectively removing one decoder per core (each module gets a single 4-wide decoder shared by two cores, so the equivalent of 2-wide decode per core, as opposed to three decoders per core today).
– They are sharing one FPU/SIMD unit per 2 cores.
– They are stretching the pipeline to more stages.
These are all actions that DECREASE the execution resources and efficiency in some way.
We have not heard about them even COMPENSATING for the removal of these resources yet… so even being on par with previous-generation IPC in single threads would already be quite an improvement in efficiency (roughly 33%), which is arguably more than AMD or Intel ever achieved in a microarchitecture update. The obvious exception is Netburst->Conroe, and even that comparison is skewed by the drop of about 1 GHz in clockspeed.
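The arithmetic behind that rough 33% figure, using only the per-core counts listed above:

```python
# Per-core resource counts implied by the list above: K10 (current) vs.
# Bulldozer (BD). Decode width per BD core assumes the module's 4-wide
# decoder is shared evenly between its two cores.
k10 = {"alu": 3, "agu": 3, "decode_width": 3}
bd  = {"alu": 2, "agu": 2, "decode_width": 2}

cuts = {res: 1 - bd[res] / k10[res] for res in k10}
for res, cut in cuts.items():
    print(f"{res}: {cut:.0%} fewer per core")   # 33% fewer in each case

# The FPU cut is even steeper: two 128-bit SIMD units per core today
# versus two units shared per module, i.e. one per core: a 50% cut.
fp_cut = 1 - 1 / 2
print(f"fp_units: {fp_cut:.0%} fewer per core")
```

Holding IPC steady while shedding a third of the integer and decode resources per core (and half the FP units) is the “incredible feat of engineering” the next paragraph refers to.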
So basically, with this level of reduction and sharing of resources, and STILL increasing IPC over the previous gen, that would be one INCREDIBLE feat of engineering… And they’d need some pretty nifty ‘secret sauce’ to make these ‘anemic’ cores run that fast. Have you seen it? I haven’t.