After the tutorial on C that I re-published recently, it is now time for another old article that is still remarkably educational and useful today.
It describes the internals of the Pentium and Pentium Pro CPUs. Ironically enough, today’s Intel CPUs are still remarkably similar to these oldies. The Atom is a superscalar two-pipeline in-order CPU, closely related to the classic Pentium. The Core architectures are out-of-order execution CPUs, which still work very similar to the Pentium Pro I described. Likewise, AMD’s x86 CPUs are all out-of-order execution CPUs as well, and work according to the same principles.
An execution pipeline is basically a conveyor belt for instructions. Instructions pass various stages, at each stage, a part of the execution of the instruction is done. Normally each stage will take 1 clock cycle (clk), that is, an instruction will stay in a stage for 1 clk, and then proceed on to the next. Basically there are 3 main operations:
- Fetch: Load instruction from memory.
- Decode: Set up operands (load from memory if necessary), and find out what the instruction is supposed to do. Basically prepare to dispatch operands to the correct logic unit, for execution.
- Execute: Take the operands, dispatch to the correct logic unit, perform the desired operation, and store the result.
A simplified pipeline could have these three stages: fetch – decode – execute. The instruction passes through them in that sequence:
Fetch -> Decode -> Execute +--------+--------+--------+ clk 0: | INSTR | | | +--------+--------+--------+ +--------+--------+--------+ clk 1: | | INSTR | | +--------+--------+--------+ +--------+--------+--------+ clk 2: | | | INSTR | +--------+--------+--------+
Now if the instruction is being fetched, it cannot be decoded or executed yet. Does this mean that only 1 stage of the pipeline is active, while the rest is idle? The answer is no. Why not? Because while you execute the first instruction, you can decode the second, and fetch the third at the same time. The different stages do not share any hardware resources, so each stage can work independently of the others, at the same time:
Fetch -> Decode -> Execute +--------+--------+--------+ clk 0: | INSTR0 | | | +--------+--------+--------+ +--------+--------+--------+ clk 1: | INSTR1 | INSTR0 | | +--------+--------+--------+ +--------+--------+--------+ clk 2: | INSTR2 | INSTR1 | INSTR0 | +--------+--------+--------+
So the stages can work in parallel on sequential instructions. This in effect reduces the execution time of a sequence of instructions. If you have for example a piece of code of 10 instructions, it doesn’t take 10 x 3 = 30 clks, but instead, it takes 2 clks for the first instruction to reach the execute stage, and from there on, at every clk a new instruction arrives at the execute stage, so you execute the 10 instructions in 10 clks. That’s a total of only 2 + 10 = 12 clks. There is a sort of ‘telescoping’, or ‘waterfall’ effect.
The different instructions in the pipeline can be dependent however. It could be that the instruction that is being decoded, requires an operand that is the result of the instruction being executed. To solve this problem, operand forwarding was introduced. This means that the decode stage does not have to wait for the result to be available. The result is forwarded back into the pipeline as a new operand, so execution can continue without “stalling” (waiting):
Consider an add instruction of the form: add destination, source1, source2
INSTR0: add r0, r1, r2 INSTR1: add r4, r0, r3 Fetch -> Decode -> Execute +--------+--------+--------+ clk 0: | INSTR2 | INSTR1 | INSTR0 |->-+ +--------+--------+--------+ | | +------<-----+ r0 | +--------+--------+--------+ clk 1: | INSTR3 | INSTR2 | INSTR1 | +--------+--------+--------+
Another problem arises when a jump is made. When you jump to a different location, the instructions that were already in the pipeline, are now invalid. So the pipeline needs to be flushed, and start over at the right location. With unconditional jumps, modern CPUs will spot the jumps in time, and continue fetching instructions at the new location right away, without stalling. With conditional jumps however, this is not possible. It is not clear whether the jump should be taken or not, until the condition has been tested, which at worst happens by the last instruction before the conditional jump in the pipeline. So instead of waiting, a modern CPU will try to predict whether the jump will be taken or not, judging from a hint in the code (the usual rule is: if the jump is to a lower address, it is probably a loop, so take the jump. Else, do not take it. Some CPUs even have hint bits inside the opcode, to indicate the behaviour), and from previous iterations. They store whether the jump was taken or not, on the last few occasions. With this strategy, the code can usually continue without stalling. Only when the prediction turns out to be wrong, the pipeline will have to be flushed, and execution will stall for a few clks until the first instruction has reached the execution stage again.
A superscalar CPU is a CPU which has more than one execution pipeline, working in parallel. So you can execute more than one instruction per clk. Operand forwarding can not solve all dependency problems anymore however. It could be that there are 2 instructions being executed in parallel, but the second one requires the result of the first one. The CPU will then have to put the second pipeline on hold. It can only execute the first instruction, then forward the result into the second pipeline. Then the second pipeline can continue executing on the next clk. We refer to this as a stall in the second pipeline. The penalty was 1 clk. It is the responsibility of the programmer to avoid these stalls.
What happens when the clockspeed goes up? All stages in the pipeline consist of sequences of logic gates. A gate has a certain propagation time. The time it takes to settle to a consistent state, after a change of input signals. So the sequence of gates in a pipeline stage can have at most a propagation time of 1 clk. As clockspeed goes up, the time that 1 clk takes goes down, so the maximum length of the logic sequence in a pipeline stage decreases as well. There are 2 solutions to this problem:
- Reduce the logic required to fetch, decode and execute instructions
- Split the pipeline up in more stages
In order to reduce the logic required to fetch and decode instructions, the instructions should be encoded in a less complex fashion. Reduced Instruction Set Computing (RISC) is a CPU design philosophy which aims for small and simple instructionsets. Complex instructions which can be replaced by a sequence of 1 or more simple instructions without significantly increasing total execution time, will not be implemented in hardware. Instructions should also be encoded in an orthogonal (uniform) manner, to reduce decoding complexity. They should all be the same size, and layout. The old paradox of “less is more” is a nice description of RISC. To give a name to the previous, more complex designs, the name Complex Instruction Set Computing (CISC) was chosen. These more complex designs date from an earlier age, when memory was small and expensive, and clock speeds were low. There was no pipelining yet, let alone superscalar designs. Code was mostly written by hand, rather than by compilers. So the designs were made to minimize the code footprint, and maximize the operations per cycle. This meant a lot of instructions with implicit operands, and performing common sequences of operations.
A few results of this philosophy are that virtually all instructions can only operate on registers, and virtually all instructions have the same number of operands. Loading registers from memory, or storing to memory is done with a few special instructions. At the time of RISC implementation, manufacturers were able to put millions of gates on a chip, but because of RISC, way less were needed than ever before. The extra gates could now be used for more registers, extra cache or other features.
Splitting the pipeline up in order to increase clockspeed is a poor mans solution. Namely, if there are interlocks because of dependent instructions for example, the penalties can increase from 1 to several stages. But, what if you cannot redesign your instructionset, because you want to keep supporting legacy code?
There are again 2 solutions:
- Design a new RISC CPU, and support legacy code with a software emulation layer.
- Design a new RISC CPU, and support legacy code with a hardware emulation layer.
It is interesting to see that Motorola chose the first option, when going from 68k to PowerPC. The Apple computers were powered by 68k CPUs. The new PowerPC-powered CPUs had 68k emulation built into the OS. At that time, the choice was not such a bad one. The new RISC CPU could be manufactured at much higher clockspeeds, and with more cache and more registers. This gave a leap in performance, which could make emulated legacy code perform acceptably, while native code would run at the highest possible speed.
Motorola did however release a 68060, which was designed with the second option. Which was the same choice that Intel made for their x86 family. The instructionset was not redesigned totally. They did use some ideas taken from the RISC philosophy however. In short, their CPUs now translate the complex x86 instructions internally, and then issue them to an execution pipeline which could be seen as a sort of RISC CPU. It can execute what Intel calls “micro-operations” or µOps, since the Pentium PRO (Pentium PRO, Celeron, PII and PIII all have the same P6 core). And surely, some of the instructions which would have been removed in a RISC instructionset, are now constructed from a sequence of 2 or more µOps. The Pentium has no names for the ‘atomic operations’ that it builds the instructions from… But it clearly does just that. Their pipeline is still longer than that of a true RISC CPU, but it is shorter than it would be when all instructions were to be decoded and executed directly, rather than translated to µOps. Intel managed to keep the implementation of a complex instructionset within reasonable bounds, and managed to get the clockspeed up to impressive rates. At the time of writing they have a Pentium 4 CPU out, which runs at 1.5 GHz. This makes the Pentium 4 the highest clocked CPU currently available. The high clockspeed comes at a cost however. The instruction throughput is relatively low, because the pipeline has a lot of stages, because at 1.5 GHz, the stages have to be very short. On top of that, its decoding scheme is still more complex than that of CPUs with a RISC instructionset.
But let’s look at the x86 family more closely. Because at the backend there is the RISC-like µOp-execution unit, the CPU also gets RISC-like traits. What does this mean? The translation from x86 instructions to µOps is not always an easy one. Only a part of the x86 instructionset can be considered orthogonal. But the x86 instructions have complex addressing modes. Intel decided to map register-only operations 1:1 to µOps, and add additional µOps for the complex addressing modes. Pentium PRO can handle simple addressing modes ([reg]) in 1 µOp forms as well.
A few Pentium examples:
add eax, edx
is 1 clk.
add eax, [edx]
will translate to:
mov tmp, [edx] add eax, tmp
where tmp is a temporary internal register. Takes 2 clks.
add [eax], edx
is three operations on Pentium:
mov tmp, [eax] add tmp, edx mov [eax], tmp
This is exactly the style in which RISC code is written, since add would only be able to operate on registers, not on memory directly. So in short, the CPU seems to translate ordinary x86 code to RISC code. The downside to having the CPU translate your code, is that the pipeline will have to wait for the sequence of instructions to finish, before issuing the next pair of instructions. When you write the RISC-like code yourself, you are working with 1 clk operations only, and you can get 2 independent instructions through per clk. A few instructions are even more interesting on Pentium, because translating them by hand will actually make them faster.
takes 3 clks.
mov edx, eax sar edx, 31
takes 2 clks.
takes 2 clks.
mov [esi], eax add esi, 4
takes 1 clk.
takes 5 clks.
dec ecx jnz label
takes 1 clk.
Basically there are a few extra rules for instructions to execute in 1 clk on Pentium. The Pentium has 2 pipelines, but they are not identical, the primary pipeline is called U, the secondary pipeline is called V. Some instructions can only be executed in U (eg. shift/rotate instructions), others only in V (eg. call/jump instructions). There are also some instructions which will not pair at all. They will only execute in U, and V will stall. The decoders also have limits. Each pipeline can decode instructions up to 7 bytes per clk. If a longer instruction is encountered, there will be a stall. Finally, the Address Generation Unit (AGU) is a stage earlier in the pipeline. What this means is that if you have an instruction that has to address memory with the result of a previous instruction, then it will have to stall for 1 clk. Namely, the operand has to be forwarded not one, but 2 stages back. This is known as the Address Generation Interlock (AGI) stall:
add eax, ebx mov edx, [eax] Decode -> AGU -> Execute +--------+--------+--------+ clk 0: | INSTR2 | INSTR1 | INSTR0 |->-+ +--------+--------+--------+ | | +----------<----------+ eax | +--------+--------+--------+ clk 1: | INSTR2 | INSTR1 | | +--------+--------+--------+
We say the latency of the instruction is 2 clks. The result has to be available 2 clks ahead. Most instructions have a latency of 1 clk. Operand forwarding can avoid stalls on 1 clk latency instructions, as long as there are no 2 dependent instructions in the pipeline at the same time, as we’ve seen earlier.
The Pentium PRO (and other P6 core CPUs) is a completely different beast however. Instead of translating in-place, as the Pentium does, the Pentium PRO decodes the x86 instructions into µOps and stores them in a buffer. A scheduling unit (Reservation Station) will then dispatch µOps to the units as the operands and the unit become available. This means it can execute instructions out-of-order (ooo). The results are then stored in a reorder buffer (ROB), so the result is guaranteed to be equivalent to the original sequence of instructions. In other words, the out-of-order execution is transparent to the program. Why is this interesting? A stall occuring in the early part of the pipeline will not affect the backend. The decoder might not issue new µOps during the stall, but while there are still µOps in the ooo buffer, the backend can continue execution. The ooo execution can also minimize the cost of latencies/stalls. While 1 µOp has to wait on a previous one to complete, the CPU can still dispatch other µOps which are not dependent, even if they were originally sequenced after the dependent instruction.
The Pentium PRO has 3 decoders, d0, d1 and d2. Decoder d0 can decode an instruction with a maximum length of 7 bytes, and a maximum complexity of 4 µOps per clk. Decoders d1 and d2 can decode an additional instruction of 1 µOp per clk each. So the ooo buffer can be filled with up to 6 µOps per clk.
The ooo execution pipeline has 5 different ports, port 0 through 4. Each port handles a specific class of µOps. Port 0 handles integer and floating point arithmetic, and address generation operations. Port 1 handles simple integer arithmetic instructions (not shift, multiply or divide). Port 2 handles memory load operations. Port 3 handles the calculation of memory write addresses. Port 4 handles memory write operations.
Each port can take 1 µOp per clk. This means that the CPU can dispatch a maximum of 5 (independent) µOps per clk. Since the decoders decode into the ooo buffer, they are now no longer dependent on the execution backend. This means that the decoders no longer have to stall when the execution takes more than 1 clk. In fact, the ports on a PPRO themselves are pipelined as well. Which means that even multiple-clk instructions such as multiply can be issued every clk, and they pass through the stages subsequently. With PPRO, you don’t specify execution speed of instructions by simply counting the clks, instead, you specify latency and throughput. For example, mul is specified as: latency 4 clks, throughput 1 per clk. This means that when we issue a mul, it takes 4 clks to complete in total. But you can issue 1 mul per clk. If you issue 2 subsequent muls, then the second one will follow one clk after the first. So we see the same ‘telescoping’ or ‘waterfall’ effect…
1 mul takes 4 clks 2 muls take 5 clks 3 muls take 6 clks etc.
After execution, the µOps wait in the ROB for “retirement”. The retire stage will update all operands (register or memory) in order, at a maximum of 3 µOps per clk. It will also take care of the operand forwarding.
This rather complex and very interesting execution scheme shifts the focus from instruction scheduling to issuing µOps to the ooo backend. Scheduling instructions for dependency is now less important, because the CPU can schedule the µOps itself. There should not be so many dependencies that the ooo-buffer fills up faster than the backend can handle, but not each and every dependency results directly in a stall anymore. The ooo buffer acts like a cushion for dependencies and latencies.
You could see the CPU as a funnel for instructions… You can add up to 6 µOps per clk, then it can dispatch up to 5 µOps, and finally retire up to 3 µOps per clk. When writing code for PPRO CPUs, the main focus should be on using as little µOps as possible, rather than getting as many µOps per clk into the ooo buffer. Decoding them efficiently is important as well. Instructions that consist of more than 1 µOp should be scheduled to go into d0. Try to get both d1 and d2 to decode an instruction as well.
One part we’ve not touched yet, is the PPRO’s register renaming scheme. While on the outside it appears that the PPRO has only 8 integer registers, it has in fact 40 temporary registers inside. It uses these temporary registers internally, for a special kind of dependencies such as this one:
mov eax, [a] mul [b] mov [c], eax mov eax, [d]
We saw earlier that a mul takes 4 clks. If the CPU would use the physical registers directly, then the last mov would have to wait until the previous store operation was dispatched. But instead the first three instructions get one temporary register assigned for eax, and the last instruction gets a new temporary register assigned again. The eax register has been ‘renamed’ to a temporary register. Now the load operation can be dispatched at any time, it does not have to wait for the previous instructions to finish. The CPU can rename 3 registers per clk. Each temporary register has its own dependency chain. Try to keep these chains short, so that the CPU can allocate new temporary registers early, which will reduce dependencies, and allow the CPU to dispatch more indepentent µOps. (note: the old tricks of ‘xor reg, reg’ or ‘sub reg, reg’ to set a register to 0 will not be recognized as independent. A new independent chain will not be created.)