Very simple out-of-order execution simulator

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Over the summer, I put together a simple demonstration of out-of-order execution in JavaScript. I thought some of you might find it interesting... you can find it here.

The instruction set:
[*]ldi - load immediate, register is set to the value in the instruction
[*]addr - register add, destination register is set to sum of source registers
[*]mul - register multiply, destination register is set to product of source registers
[*]bne - branch if not equal, jumps to location specified in the instruction if the source registers are not equal
[*]halt - stops execution

The boxes you see are:
[*]Program: the instructions in the program
[*]PC: the program counter (i.e. the location of the instruction that will be fetched)
[*]The decoder: shows instruction to decode, the PC of that instruction, and the branch prediction (right now, the branch predictor always predicts branches as taken); the "map" table which is used for register renaming; the decoder's results
[*]PRF: physical register file - speculative registers
[*]ARF: architectural register file - commited (non-speculative) register values, Note that its values are pointers into the physical register file.
[*]Free list: indicates which physical registers are not in use
[*]Issue Queue: holds instructions waiting to execute. Instructions are "Ready" when all source values are valid (i.e. the instructions that produced the source values have finished)
[*]ALUs: the execution units. there are 2.
[*]Re-Order Buffer: ensures the instructions update the architectural register file in order, and allows for recovery when a branch is mispredicted

Description of the program it executes:
register 0 = 5
register 1 = 1
register 2 = 2
register 3 = 2
register 2 = register 1 + register 2
register 3 = register 0 * register 3
register 0 = 5
register 0 = 5
if register2 differs from register 0, go to instruction 4
halt
Note that instruction 4 is the fifth instruction, since the count starts at 0. Note also that setting register 0 to 5 often doesn't actually do anything - the extra instructions are there just so you can see them finish before the multiply does.

What the program does:
register 1 gets incremented repeatedly
register 3 gets multiplied by 5 repeatedly, until register 1 has reached 5

How to run it
Click the "step" button to execute 1 cycle.

I hope at least a few people find it interesting and understandable. It's hard to show all of the pieces on the screen in a readable way.

If anyone has questions, I'd be happy to answer them. If I need another break from studying, I might write something more about how out-of-order execution works.

Notes to people familiar with out-of-order execution
I cheat for precise interrupts. The "right" way of doing it would be to take snapshots of the free list at each branch (or, I guess, instruction that could fault), and restore them when a misprediction or exception occurs. Since I already have too many tables on the screen, I just reconstruct the free list myself when an exception occurs - I look at the ARF to see what physical registers are in use and rebuild the freelist from there.

Also, by leaving out memory support, I eliminate one of the most complicated parts of a modern processor.

My issue queue is obviously way too big for the sample programs I use. I sized everything to be large enough that I never have stalls due to full ROB/issue queue/physical register file.

Related project
I'm currently writing another version in C, which presently has about the same level of functionality, but I'm hoping to eventually make it support a complete, real instruction set and integrating it with a full-system simulator. It would be pretty cool if that worked, since it would allow for performance modelling of unmodified binaries.
 
 

xtknight

Elite Member
Oct 15, 2004
12,974
0
71
Neat...

Can someone explain to me the difference between in-order and out-of-order execution?
What are the advantages of each?
Is x86 in-order or out-of-order?

Is your program like a primitive VMware, sort of, kind of?
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Can someone explain to me the difference between in-order and out-of-order execution?
Consider a set of instructions:
r0 = r0*2
r1 = r1+1
r2 = r2+1
Now, multiplication is pretty complicated, so on reasonable* CPUs it takes about 3-5 cycles. In an in-order CPU, you'd start executing the multiply, and not execute other instructions for the next 2 cycles, and then resume executing the two adds. However, you can see that the two adds don't need the result of the multiply to execute - there is no true dependence between the instructions. That means that if we were clever, we could actually execute all 3 instructions at the same time. Because the adds only take 1 cycle, they'd both finish in one cycle, before the multiply finishes.

*"reasonable" basically means "not a Pentium 4" - I think a multiply on a P4 takes over 20 cycles

Consider this even better example:
r0=r0*2
r1=r0+1
r2=r2+1
r3=r3+1
The second instruction depends on the first, but the third and fourth instructions don't. With an in-order CPU, we'd execute the multiply, do nothing for a couple cycles, and resume. This sequence of instructions would take 6 cycles to execute (3 mul, 3 add). If we allow for out-of-order execution, however, while the multiply is being computed, we can execute the third and fourth instructions. Once the multiply finishes, we can execute the second instruction. An in-order CPU with 2 execution units would take 6 cycles to execute this sequence (3 mul, 1 for 2 adds, 1 for the last add). An out-of-order CPU, however, would take 4 cycles (3 for the multiply, during which the second add would happen, 1 for the 2nd and 4th instructions).

If I remember correctly, out-of-order execution gives you a performance boost of around 40% on average.

There are some caveats, though. Sometimes you hit a branch instruction, but you don't know until you actually execute it whether it's taken or not taken. You have to guess. Branch predictors can get it right about 95% of the time, but when they guess wrong, you have to throw away the incorrect work. Now, you don't find out whether you guessed correctly or not until very near the end of the pipeline. In an in-order CPU, when you discover a prediction was wrong, you can just throw away everything in the front part of the pipeline (since they're junk instructions, and haven't yet written values to your register file). In an out-of-order CPU, though, instructions after the branch could have already written values to your register file. There are a few ways to handle this; my simulator uses one of them.

Also, some instructions can cause exceptions (for example, if you divide by 0 or access an invalid memory address). When an exception occurs, you want to make sure that every instruction up to the exception has finished executing, and none after it have started - this makes it possible for programmers to understand the execution of their programs without going insane ;). It also allows exception handlers to know exactly how much of the program has executed. Making sure you get this right requires some extra hardware compared to an in-order processor. Also, figuring out which instructions are independent of each other (so you can actually execute them out of order and still get the right answer) requires some fancy hardware.

What are the advantages of each?
As I mentioned above, out-of-order execution requires quite a bit of extra hardware to find independent instructions and ensure that instructions appear to execute in order. In-order CPUs are significantly simpler.

Is x86 in-order or out-of-order?
Instruction sets don't really determine whether the CPU is in-order or out-of-order. An out-of-order CPU actually looks like an in-order CPU to the programmer, and acts exactly like it (but it will complete more instructions in a given number of cycles). You can build an in-order x86 CPU (like all Intel x86 CPUs up to the Pentium), or an out-of-order x86 CPU (like all Intel processors since the Pentium Pro / Pentium II). An out-of-order machine will only reorder instructions that don't communicate with each other, so that it gets the answer right. Now, some instruction sets are easier to execute out of order than others, and x86 is definitely among the hardest (all out-of-order x86 CPUs I'm familiar with actually translate x86 instructions into more RISC- or VLIW-like instructions internally and execute those out of order; Intel calls these instructions "micro-ops" or "u-ops" - a single x86 instruction could map to one uop or a long sequence of uops).

Is your program like a primitive VMware, sort of, kind of?
VMware is aimed at running x86 programs in a virtual x86 environment, on x86 machines, and to do that very fast. To reach that goal, it uses an entirely different method from what my program does. However, there are some similarities - both VMware and my program offer virtual hardware that can run programs as if they were run on real hardware.
 

bendixG15

Diamond Member
Mar 9, 2001
3,483
0
0
Very interesting....

I'll come back in a couple of days to see if
any intelligent replies have been posted...I hope so..
 

xtknight

Elite Member
Oct 15, 2004
12,974
0
71
Wow, thanks. I really understand it now. Very neat. So there is a lot of balancing between optimization and "doing it right the first time" with out-of-order? Pentium 4 and Athlon 64 are out-of-order? What happens when you step in a debugger one instruction at a time on an out-of-order? Is it still transparent to you?
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
So there is a lot of balancing between optimization and "doing it right the first time" with out-of-order?
I don't think I understand what you're asking here.

Pentium 4 and Athlon 64 are out-of-order?
Yes. All (main-stream x86) AMD processors since the K5 have been out-of-order. (AMD does make the Geode line of x86 chips which are in-order.)

What happens when you step in a debugger one instruction at a time on an out-of-order? Is it still transparent to you?
Yup, it's still transparent. As I understand it, the debugger causes every instruction to fault and traps the faults. The exception handling methods I mentiond in my earlier post make sure that only one instruction executes, and it executes completely.
 

iotone

Senior member
Dec 1, 2000
946
0
0
holy crap that is awesome... wish that was around when i was taking architecture! good job OP :)
 

dwcal

Senior member
Jul 21, 2004
765
0
0
Originally posted by: CTho9305
So there is a lot of balancing between optimization and "doing it right the first time" with out-of-order?
I don't think I understand what you're asking here.
That might have been unclear, but maybe he means ordering instructions when you're hand tuning assembly code. It's not as simple as choosing the instructions that take the fewest number of cycles total. You'd probably want to balance add and multiply instructions to keep all the execution units busy, and the proportion of each instruction would depend on the processor.
 

xtknight

Elite Member
Oct 15, 2004
12,974
0
71
Never mind, I don't understand what I was asking either, lol. Something about branch predictions and how you need to make them accurate to get the best performance. In-order CPUs don't have to do branch prediction right?
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Something about branch predictions and how you need to make them accurate to get the best performance. In-order CPUs don't have to do branch prediction right?
You still need to predict branches on in-order CPUs if they're pipelined (which pretty much all CPUs are), but in an in-order CPU, instructions that you executed "speculatively" (i.e. instructions after the branch that you've executed before figuring out which way the branch goes) cannot write to your registers. Since the instructions after the branch can't write to registers before the branch is "resolved" (you know which way it went), you can just kill them and start executing from the right place. In an out-of-order CPU, later instructions could have executed and written their results to registers already; you need to make sure that if you predicted the branch wrong, you haven't overwritten data you need to recover, and you haven't allocated physical registers that don't get freed.