With the news of Windows 8 supporting ARM architectures, the RISC vs CISC discussions are back on the internet. Funny, as the newer generation doesn’t appear to have much of a clue about anything. As I said before, it’s all about knowing the history. It’s amusing to see that x86 actually appears to have ‘fanboys’ who will try to defend ‘their’ x86 architecture against the new competition from ARM.
For example, they try to name advantages of CISC. That is pretty ironic. For starters, CISC was not a conscious design philosophy. The acronym ‘CISC’ did not even exist until ‘RISC’ was coined (it is in fact a retronym). RISC was a conscious design philosophy, trying to reduce the complexity of the instruction set in order to make the architecture simpler and more efficient. This was a response to developments in both software (programming languages and compilers) and hardware (ever higher transistor density and propagation speeds). As a result of this new philosophy, everything that went before it was referred to as CISC from then on. The family of CISC architectures is therefore far less coherent than the family of RISC architectures. After all, RISC follows a specific design philosophy, whereas CISC does not.
What people appear to be missing is that any advantages a certain architecture may have are very much tied to the era in which that architecture was developed. The RISC philosophy responded to the demands of its era, just as the philosophies behind the various CISC architectures responded to the demands of their era (and in some cases also their intended purpose). While the x86 architecture may have had some advantages at the time, that does not mean that these advantages are still valid today.
For example, many early CISC architectures, such as the x86, were designed with (semi-)variable instruction length. The advantage was that the most common instructions could be encoded with the shortest machine code sequences, which led to smaller code (a variation of entropy encoding/compression, if you will). This was a very valid consideration at the time, since memory was still a very limited resource. We are talking about computer systems that may have had somewhere in the range of 1KB to 256KB of memory. So every byte that you could save mattered.
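To make that concrete, here are a few 8086-era encodings as I remember them (the choice of instructions and the exact bytes are just an illustration of the principle, not a complete picture): frequently used operations get one-byte encodings, while less common ones take several times that.

```c
/* Illustrative 8086 encodings: common operations get the shortest sequences. */
unsigned char inc_ax[]      = { 0x40 };                               /* inc ax                    -> 1 byte  */
unsigned char push_ax[]     = { 0x50 };                               /* push ax                   -> 1 byte  */
unsigned char mov_ax_imm[]  = { 0xB8, 0x34, 0x12 };                   /* mov ax, 0x1234            -> 3 bytes */
unsigned char add_mem_imm[] = { 0x81, 0x06, 0x00, 0x20, 0x34, 0x12 }; /* add word [0x2000], 0x1234 -> 6 bytes */
```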
However, fast-forward to the RISC era, and memory was not such a limitation anymore. Besides, with developments in software, such as the GUI, memory demands changed. As programs became ever more graphical, memory usage became less dependent on code and more dependent on data (bitmaps, widgets, controls, that sort of thing). The instructionset of a CPU has little or no effect on the size of this data, so the savings offered by a more compact instruction encoding became less and less significant.
At the same time, memory buses became wider. A side effect of these wider buses was that they could generally only address words on word-alignment (and I mean words in the proper sense, the native word size of the architecture… not the x86 definition of a word, which is frozen in time at 16-bit because that was the word size of the early x86 processors, long since superseded by 32-bit and 64-bit successors). 32-bit words were common in those days, so as an example: if you have a 32-bit word in memory, it has to be aligned on a 32-bit boundary. If you want to read a 32-bit word that is not aligned, the memory controller has to read two 32-bit aligned words and re-assemble the requested word from those two. A lot less efficient, obviously. Suddenly, having variable-length instructions seemed a lot less attractive, as you would constantly run into unaligned data. So what used to be an advantage of a CISC architecture has turned into a disadvantage, because time (and technology) has caught up with the original idea behind it. RISC addressed this by forcing word-sized instructions and forcing every instruction to be aligned, which made for a much simpler and more efficient instruction fetcher and decoder.
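To illustrate what the memory controller effectively has to do for such an unaligned read, here is a small sketch in C. The function and the little-endian assumption are mine, purely for illustration; it is not how any particular chip implements it:

```c
#include <stdint.h>

/* Sketch of what a word-aligned bus has to do for an unaligned 32-bit read:
 * fetch the two aligned words that straddle the address and splice the
 * requested bytes back together. 'mem' is a hypothetical array of aligned
 * 32-bit words; little-endian byte order is assumed. */
static uint32_t read32(const uint32_t *mem, uint32_t byte_addr)
{
    uint32_t index  = byte_addr >> 2;   /* which aligned word            */
    uint32_t offset = byte_addr & 3;    /* misalignment within that word */

    if (offset == 0)
        return mem[index];              /* aligned: a single bus access  */

    uint32_t lo = mem[index];           /* first bus access              */
    uint32_t hi = mem[index + 1];       /* second bus access             */

    /* Re-assemble: upper bytes of 'lo' plus lower bytes of 'hi'. */
    return (lo >> (offset * 8)) | (hi << ((4 - offset) * 8));
}
```

Doing this extra work for every misaligned instruction fetch adds up quickly, which is exactly why RISC designers banned it.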
Some more irony is in the fact that x86 processors have effectively been RISC processors as well since the Pentium Pro/AMD K6 era (and the last iteration of that other legendary CISC architecture, the 68060, also used a RISC-like backend). Namely, as x86 evolved generation after generation, it became too complex to implement every instruction directly in hardware. Instead, the decoder would first break complex x86 instructions down into a series of simpler internal instructions, which were then executed individually (and could be reordered for better efficiency and instruction-level parallelism, a technique known as out-of-order execution). Intel named these internal instructions micro-ops. They were effectively an internal RISC instructionset. Since this internal instructionset was simple and efficient, it allowed the Pentium Pro architecture to reach much higher clock speeds than before, while also reaching higher instruction throughput than ever before.
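As a rough sketch of what that translation amounts to (the real micro-op formats are undocumented, so the names and fields below are invented for illustration): a read-modify-write instruction like add [ebx], eax cannot execute as one simple operation, so the decoder splits it into a load, an add and a store.

```c
/* Invented micro-op representation, purely for illustration. */
typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } uop_kind;

typedef struct {
    uop_kind kind;
    int dst;     /* destination: internal (renamed) register    */
    int src1;    /* first source register                       */
    int src2;    /* second source register / address base       */
} uop;

enum { REG_EAX = 0, REG_EBX = 3, TMP0 = 100 };  /* invented numbering */

/* add [ebx], eax  ->  three simpler, RISC-like operations */
static const uop add_mem_reg[] = {
    { UOP_LOAD,  TMP0,    REG_EBX, 0       },  /* tmp0  <- load [ebx]   */
    { UOP_ADD,   TMP0,    TMP0,    REG_EAX },  /* tmp0  <- tmp0 + eax   */
    { UOP_STORE, REG_EBX, TMP0,    0       },  /* [ebx] <- store tmp0   */
};
```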
The trick was in the old 90-10 rule: 90% of the time is spent in 10% of the code. Or in other words: it’s mostly about loops in the code. While the Pentium Pro still had to fetch, decode and translate the complex, unaligned x86 code first, most of the time you would be running the same code over and over. Since technology had now developed far enough to allow reasonably large caches on the CPU itself, there was an opportunity here to make the x86 decoding less of a problem. The CPU would decode the instructions in two stages:
- Determine the instruction boundaries (in other words, determine the start and the length of each instruction).
- Decode the instructions into micro-ops and store them in an internal buffer.
The instruction boundaries could easily be stored alongside the code in the code cache. This meant that the first step was only required the first time a piece of code was brought into the cache. Because 90% of the time you are running a loop, you can skip this step most of the time.
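In pseudo-C, that first stage boils down to something like this. The x86_length() helper here is a hypothetical, deliberately dumbed-down stand-in for a real length decoder, which is the genuinely hard part thanks to prefixes, ModRM/SIB bytes and variable-size immediates:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a real length decoder. This stub only knows a
 * couple of opcodes and is here just to make the sketch self-contained;
 * a real one has to handle prefixes, ModRM/SIB and immediates. */
static size_t x86_length(const uint8_t *insn)
{
    switch (insn[0]) {
    case 0x40: case 0x50: return 1;   /* inc ax / push ax              */
    case 0xB8:            return 3;   /* mov ax, imm16 (16-bit mode)   */
    default:              return 1;   /* wrong in general, sketch only */
    }
}

/* Toy version of the first decode stage: walk a block of code once,
 * record where every instruction starts, and keep that information with
 * the cached code so it never has to be recomputed while the code stays
 * in the cache. */
static size_t mark_boundaries(const uint8_t *code, size_t size,
                              uint16_t *starts, size_t max_starts)
{
    size_t n = 0, offset = 0;
    while (offset < size && n < max_starts) {
        starts[n++] = (uint16_t)offset;
        offset += x86_length(code + offset);
    }
    return n;  /* number of instruction boundaries found */
}
```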
Since the decoded instructions could be buffered internally, the decoder could work asynchronously: it could just keep decoding as fast as it could, until the buffer was full, without having to wait for the instructions to actually be executed.
As a result, x86 had now achieved near-RISC performance, without having to sacrifice compatibility with the original instructionset. A side-effect was that it also became more like a RISC CPU to program for. Optimizing for x86 architectures now mostly entailed tweaking the code so that it could be decoded as quickly as possible, using as few micro-ops as possible internally. A large set of archaic complex instructions was still supported by the latest x86 processors, but carried a considerable performance penalty, because they took a long time to decode and/or generated a large sequence of micro-ops, which took a while to execute.
Which brings me to another common misconception about CISC vs RISC: people seem to think that complexity is the same as the number of instructions, and that CISC CPUs having more instructions means they are more flexible and more powerful to program for. The complexity is not about the number of instructions, but about how they are encoded. RISC instructionsets tend to have a single layout for all instructions, which makes them easy to decode with just a simple table lookup. RISC processors don’t necessarily need to have fewer instructions than CISC processors, and as such they are not necessarily less flexible or less powerful. A good example was the PowerPC G4 processor. It introduced a very powerful SIMD instructionset (AltiVec), better than Intel’s MMX/SSE and AMD’s 3DNow! attempts at SIMD. The G4 was a good example of how RISC was superior to CISC at the time: it could outperform Pentium III processors that were clocked several hundred MHz higher.
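To come back to the encoding point for a moment, here is roughly what decoding a fixed-width instruction looks like. The field layout below is invented for the illustration (loosely PowerPC-flavoured, but not any real ISA); the point stands either way: decoding is nothing more than a handful of shifts and masks, with no length-finding pass required.

```c
#include <stdint.h>

/* Every instruction is one aligned 32-bit word, and the fields always sit
 * in the same bit positions, so a table lookup on the opcode plus a few
 * shifts and masks is all the decoder needs. Layout invented for the sketch. */
typedef struct {
    unsigned opcode;  /* bits 26..31 : primary opcode       */
    unsigned rd;      /* bits 21..25 : destination register */
    unsigned ra;      /* bits 16..20 : first source         */
    unsigned rb;      /* bits 11..15 : second source        */
} decoded_insn;

static decoded_insn decode(uint32_t insn)
{
    decoded_insn d;
    d.opcode = (insn >> 26) & 0x3F;
    d.rd     = (insn >> 21) & 0x1F;
    d.ra     = (insn >> 16) & 0x1F;
    d.rb     = (insn >> 11) & 0x1F;
    return d;
}
```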
But, as this was all still in the mid-to-late 90s, you may have already guessed that history has repeated itself since: time and technology have also caught up with the RISC philosophy to a certain extent. As a result, we see RISC going through a similar evolution as x86 (the only surviving CISC architecture). Modern RISC architectures have had more instructions and complexity added to them, and they do not necessarily decode and execute instructions directly anymore either. They may also break certain instructions up into smaller pieces and buffer (and reorder) them first, or sometimes even resort to software emulation of certain legacy portions of the instructionset. This is often referred to as post-RISC. And some more irony in my example from the previous paragraph: Apple abandoned the PowerPC architecture a few years ago in favour of Intel’s x86, since Intel now offered more performance and lower power consumption.
So if CISC isn’t truly CISC, and RISC isn’t truly RISC, the whole debate is rather silly anyway, isn’t it? Well, in a way it is… However, there is still some difference between them. Namely, the instructionset still dictates what a processor can do, and although the instructionset is translated to an internal RISC-like instructionset anyway, not all translation is equal. So let’s get back to the original ARM vs x86 debate that kicked this off in the first place. x86 has evolved into an architecture that is aimed at PCs, workstations and servers. ARM on the other hand has been adopted by the embedded market, and has evolved mainly into a compact and energy-efficient architecture for mobile devices.
As a result, ARM CPUs tend to be small and low-power. Intel’s attempts at entering the mobile and embedded market with their x86-based Atom show the gap quite well. ARM processors deliver much better battery life and can be used in small devices such as phones and mp3 players. Atom is mainly interesting for netbooks, but smaller devices are a bridge too far. Building an Atom-based competitor to the iPad is also going to prove difficult, as there is no way you can match the iPad’s battery life in the same form factor. An important factor here is that the decoding logic for the x86 instructionset is considerably larger than that for the ARM instructionset.
On the other hand, x86 processors deliver a lot more performance than these ARM processors. So for high-end PCs, workstations and servers, we will likely continue to use x86 for a while. With larger, more powerful CPU designs, the x86 decoding logic becomes relatively less of a factor. More execution units, larger caches, more powerful memory controllers and such will also take up a lot of die space on a high-performance processor, and these components are independent of the instructionset used.
However, there may be some exceptions… Namely, the rise of two technologies in recent years may put ARM in a more competitive position in terms of performance. Firstly, we have multicore processing. Since ARM CPUs tend to be very small, you can fit more cores into the same die area than with an x86-based architecture. We have only recently seen the first dualcore ARM processors (after all, size and power consumption are more important for the mobile devices that ARM is mainly aimed at), but nVidia has already demonstrated a quadcore ARM design for its upcoming Tegra 3. So the performance gap is rapidly being closed here.
And speaking of nVidia, we also arrive at the second upcoming technology: GPGPU processing. The mobile-oriented ARM architectures we know today may not be all that powerful when it comes to floating-point and SIMD processing… But that is exactly the area in which GPGPU excels! So the GPU can compensate for this weakness in the CPU architecture.
And this is just short-term… Both technologies are an example of how existing ARM cores can be used in more powerful configurations. Just like x86 proves over and over again that you can make a 70s CPU instructionset evolve into pretty much anything, future ARM architectures might also be aimed more at desktop/high-performance computing, and have more powerful floating-point/SIMD processing units, just like their x86 cousins (even for x86, the FPU was an optional co-processor until Intel integrated it in the 486DX, back in 1989. And even then the x87 FPU wasn’t such a great performer compared to other architectures. SIMD was first added with the Pentium MMX in 1997, and later refined for the Pentium III with SSE, which is still being extended with every new generation). There is nothing that would prevent CPU designers from making a high-performance ARM variation with a featureset that rivals, or even exceeds, the x86.
The last bit of irony comes from the fact that Intel had an ARM division itself (bought from Digital). In the early days of pocket PCs and Windows CE (Compaq’s iPaq and such), a lot of units were powered by Intel’s StrongARM processor line, and later by its successor, the XScale. Funnily enough, Intel sold this division to Marvell in 2006, only shortly before mobile devices really started taking off. And now we find Intel trying to re-enter that market with the x86-based Atom. Intel actually still owns an ARM license too.
At any rate, it would be interesting to see ARM and x86 go head-to-head on the Windows platform. I don’t see it as a CISC vs RISC battle myself, as I said. I mostly see it as two instructionsets battling it out. Which could be interesting. A lot more interesting than AMD and Intel, both using the same x86. It may bring new insights to the table, new technologies, more pronounced strong and weak points for each competitor. And also: more choice. I find it strange that Linux/open-source advocates always talk about openness and freedom, while in reality you are mostly limited to the same x86-based hardware. Personally I find it far more interesting to be able to run the same OS on a variety of different CPUs than to run a different OS on the same hardware.