Perhaps some of you expected more coverage of upcoming CPUs on this blog, given the history regarding AMD’s Bulldozer. But really, there hasn’t been much to report on. I did a short piece on Trinity/Piledriver, but there was not much to tell: small improvements, nothing very spectacular.
The same goes for Steamroller. To be honest, it just doesn’t interest me a whole lot at this point. Anandtech covered Steamroller a while ago, but there was not much in the way of spectacular news. As Anandtech says:
Steamroller addresses this by duplicating the decode hardware in each module. Now each core has its own 4-wide instruction decoder, and both decoders can operate in parallel rather than alternating every other cycle. Don’t expect a doubling of performance since it’s rare that a 4-issue front end sees anywhere near full utilization, but this is easily the single largest performance improvement from all of the changes in Steamroller.
That proves my point already: if the decoder is the largest performance improvement, things are not going to be very spectacular. It is interesting in the sense that it implies that AMD is leaving their ‘Clustered Multithreading’ approach behind. Apparently AMD concluded that sharing resources the way they have been doing just doesn’t work. So instead of trying to share resources and reduce transistor count, they are going to put in two dedicated decoders per module again. A move back towards conventional multicore technology, as in their earlier Athlon and Phenom architectures.
I am not sure how this is going to make AMD more competitive with Intel though, since Intel has clearly been moving towards better performance-per-watt with the last few generations of Core architectures, and Haswell will only improve on that further.
Another thing is that AMD is going for larger caches in Steamroller. What is interesting is that Bulldozer already had 64k of shared L1 instruction cache per module. That is a lot more than the 32k of L1 instruction cache per core in Sandy/Ivy Bridge, where each cache is shared by two logical cores when HyperThreading is used. AMD’s problem was not really that the cache was too small. Its problem was that the cache was not efficient enough. A small L1 cache with low latency is generally better than a large L1 cache with high latency (the L2 is still there to reduce the penalty of cache misses from L1). It seems that AMD’s problem is in getting the latency of a shared cache down. So when I read that they are going for a larger L1 cache, that implies that they couldn’t get the latency down, and are instead trying to increase the hit ratio by making it larger.
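To see why latency usually trumps size, consider the textbook average-memory-access-time formula: AMAT = hit latency + miss rate × miss penalty. Here is a minimal back-of-the-envelope sketch in C, with completely made-up numbers (these are not measured figures for Bulldozer or Sandy Bridge, just an illustration of the tradeoff):

#include <stdio.h>

/* Average memory access time: every access pays the hit latency,
   and the fraction that misses additionally pays the L2 penalty.
   All inputs are hypothetical cycle counts, for illustration only. */
static double amat(double hit_latency, double miss_rate, double miss_penalty)
{
    return hit_latency + miss_rate * miss_penalty;
}

int main(void)
{
    /* Hypothetical small, fast L1: 3-cycle hits, 5% miss rate */
    printf("small/fast L1: %.2f cycles\n", amat(3.0, 0.05, 12.0));
    /* Hypothetical large, slow L1: 4-cycle hits, 4% miss rate */
    printf("large/slow L1: %.2f cycles\n", amat(4.0, 0.04, 12.0));
    return 0;
}

With these numbers the small cache comes out at 3.6 cycles per access against 4.48 for the large one. In fact, even a 0% miss rate would not let the larger cache break even here, because the extra cycle of hit latency is paid on every access, while the miss penalty is only paid on a few percent of them.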
This again goes in the opposite direction of Intel: a brute-force approach, with more transistors, more power consumption and a larger die. So again, I don’t see this as a good sign in terms of competition with Intel. Bulldozer/Piledriver are already ridiculously large and power-hungry compared to Intel CPUs of the same performance level. And I am not sure how this ties in with AMD’s mission statement that they are no longer going for the high-end market. The Steamroller improvements seem to focus mostly on more performance at the cost of larger and more power-hungry chips, rather than on becoming smaller, more efficient mainstream chips.
At least there’s no John Fruehe this time, claiming ridiculous performance estimates, because clearly the improvements in the Steamroller architecture are not going to be groundbreaking.
Meanwhile, there has also been some news on Haswell. It will have some new instructions, the Transactional Synchronization Extensions (TSX), which make locking in multithreaded situations more efficient. Sadly, Steamroller won’t have these instructions yet. Anyway, the interesting part is that these new instructions basically convert a mutex-style lock into something of a single writer/multiple reader style lock automatically. That is, instead of each thread waiting on the lock, they will start executing speculatively, as if they only had read access. When threads perform conflicting writes, the transaction aborts, and the whole ‘transaction’ is tried again with proper locking. It is a simple yet elegant solution which turns coarse locking situations into nearly fine-grained locking performance. The best thing is that support for these instructions only has to be inside the locking objects provided by the OS; the new prefixes are even encoded so that older CPUs simply ignore them. So a simple update of the OS locking libraries will make all applications use the new instructions automatically.
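TSX comes in two variants, and the lock-elision variant (HLE) is the one that maps directly onto existing locking code. Just to illustrate the idea, here is a minimal sketch of a spinlock using HLE through GCC’s __ATOMIC_HLE_ACQUIRE/__ATOMIC_HLE_RELEASE extensions (available from GCC 4.8, compiled with -mhle). The lock variable and spin loop are purely illustrative, not how any particular OS implements its locks:

#include <immintrin.h> /* _mm_pause */

static int lock = 0;

static void hle_lock(void)
{
    /* XACQUIRE-prefixed exchange: the CPU elides the lock write and
       starts a transaction, so other threads still see the lock as
       free and can enter the critical section speculatively. */
    while (__atomic_exchange_n(&lock, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause(); /* lock genuinely held (fallback path): spin */
}

static void hle_unlock(void)
{
    /* XRELEASE-prefixed store: commits the transaction. On a data
       conflict the hardware rolls back and re-executes the critical
       section, this time actually taking the lock. */
    __atomic_store_n(&lock, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}

Application code would call hle_lock()/hle_unlock() exactly as it would a normal spinlock; the elision is invisible to it, which is precisely why updating the locking library is enough.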