How are instructions delivered to the CPU?

chrstrbrts

Senior member
Aug 12, 2014
Hello everyone,

I used to think that when a 64-bit CPU fetched an instruction, after translating the virtual address to a physical address in the MMU, it just grabbed 8 consecutive bytes from RAM starting at some base address and that one big 64-bit instruction was sent down an instruction bus.

However, after reading the first chapters of Intel's 64 and IA-32 Architectures Software Developer’s Manual Volume 2, I realize that instructions can be of variable length.

Further, some instructions with prefixes, etc. can be much longer than 64 bits.
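For what it's worth, here are a few encodings I pulled while reading (byte values as I understand the opcode tables; the manual says the architectural maximum length is 15 bytes):

Code:
#include <stdio.h>

/* A few hand-assembled x86-64 instructions, just to show how much the
   length varies. Byte values are my reading of the SDM opcode tables;
   the architectural maximum instruction length is 15 bytes. */
static const unsigned char nop_i[] = {0x90};                          /* nop                 (1 byte)   */
static const unsigned char mov32[] = {0xB8,0x78,0x56,0x34,0x12};      /* mov eax, 0x12345678 (5 bytes)  */
static const unsigned char add64[] = {0x48,0x05,0x78,0x56,0x34,0x12}; /* add rax, 0x12345678 (6 bytes)  */
static const unsigned char mov64[] = {0x48,0xB8,0xEF,0xCD,0xAB,0x89,
                                      0x67,0x45,0x23,0x01};           /* movabs rax, imm64   (10 bytes) */

int main(void) {
    printf("nop: %zu, mov r32: %zu, add r64: %zu, mov r64: %zu bytes\n",
           sizeof nop_i, sizeof mov32, sizeof add64, sizeof mov64);
    return 0;
}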

So, how exactly does an instruction make its way from RAM to the CPU?

Thanks.
 

Ken g6

Programming Moderator, Elite Member
Dec 11, 1999
I used to think that when a 64-bit CPU fetched an instruction, after translating the virtual address to a physical address in the MMU, it just grabbed 8 consecutive bytes from RAM starting at some base address and that one big 64-bit instruction was sent down an instruction bus.
That, or some variation of that, would be true of a RISC processor. But Intel processors are more Complex.

You seem to be asking about the Fetch unit. Here's an old PDF with an overview of the Intel fetch cycle on a 32-bit Pentium: http://users.utcluj.ro/~baruch/book_ssce/SSCE-Intel-Pipeline.pdf

Modern processors aren't that much more complicated.
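To see why the fixed-width RISC case is so simple, here's a minimal sketch of a RISC-style fetch loop (the 4-byte width is real for most classic RISCs, but the instruction words and the 6-bit opcode field below are made up for illustration, not any real ISA):

Code:
#include <stdint.h>
#include <stdio.h>

/* Minimal fixed-width fetch sketch: every instruction is exactly one
   aligned 4-byte word, so fetch is "read memory[pc], add 4 to pc" and
   decode can slice fields at fixed bit positions. The instruction
   words and field layout here are invented for illustration. */
static const uint32_t imem[] = {0x11223344, 0x55667788, 0x99AABBCC};

int main(void) {
    for (size_t i = 0; i < sizeof imem / sizeof imem[0]; i++) {
        uint32_t insn = imem[i];   /* fetch: one aligned word */
        printf("pc=%zu insn=%08X opcode=%u\n",
               i * 4, (unsigned)insn,
               (unsigned)(insn >> 26));  /* decode: fixed field slice */
    }
    return 0;
}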
 

Merad

Platinum Member
May 31, 2010
Modern Intel CPUs are quite a bit more advanced than the old Pentiums. Here's the fetch/decode process for Haswell:

  1. The branch predictor feeds an instruction pointer to the fetch unit.
  2. The fetch unit reads 16 bytes from the 32KB L1 instruction cache.
  3. Pre-decoders split the 16-byte buffer up into instructions. Even though in theory the buffer could contain up to 16 instructions, IIRC the pre-decoder outputs a maximum of 5 instructions per clock.
  4. Instructions are fed in-order into a 20-entry instruction queue.
  5. Instructions are pulled from the queue into decoders. Haswell has 3 simple decoders (single-uop instructions) and 1 complex decoder (1-4 uop instructions). Microcoded instructions (> 4 uops) are handled by a separate microcode engine that outputs 4 uops per clock, but the decoders are blocked while it is in use.
  6. The decoders all work with a uop cache, so that once an instruction is cached, its uops are pulled from the cache and decoding is skipped to save power and increase throughput.
  7. uops go into a 56-entry queue, from which they are fed to the out-of-order execution engine.
I can't recall exactly how that is split up in terms of pipeline stages, but it's spread across something like 5-6 stages.
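Step 3 is the tricky part for x86: you can't know where instruction N+1 starts until you've sized instruction N. Here's a toy sketch of that boundary-marking, with a made-up two-format encoding standing in for the real rules (actual x86 lengths depend on prefixes, ModRM, SIB, displacement, and immediates, which is why pre-decode needs dedicated hardware):

Code:
#include <stdio.h>

#define FETCH_WIDTH 16
#define MAX_OUT      5   /* matches the ~5 instructions/clock above */

/* Toy length rule, NOT x86: opcodes >= 0x80 carry a 2-byte immediate. */
static int insn_len(unsigned char opcode) {
    return (opcode >= 0x80) ? 3 : 1;
}

int main(void) {
    /* One 16-byte fetch buffer, arbitrary contents. */
    unsigned char buf[FETCH_WIDTH] =
        {0x01,0x90,0x83,0x10,0x20,0x02,0xA0,0x33,
         0x44,0x05,0x06,0xC1,0x55,0x66,0x07,0x08};
    int pos = 0, out = 0;
    while (pos < FETCH_WIDTH && out < MAX_OUT) {
        int len = insn_len(buf[pos]);
        if (pos + len > FETCH_WIDTH)
            break;   /* straddles the buffer; finish after the next fetch */
        printf("insn %d at offset %2d, %d byte(s)\n", out++, pos, len);
        pos += len;
    }
    return 0;
}

Note the serial dependence: each iteration's starting offset depends on the previous instruction's length, which is exactly what makes wide parallel pre-decode of x86 hard.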
 

chrstrbrts

Senior member
Aug 12, 2014
Modern Intel CPUs are quite a bit more advanced than the old Pentiums. Here's the fetch/decode process for Haswell:

  1. The branch predictor feeds an instruction pointer to the fetch unit.
  2. The fetch unit reads 16 bytes from the 32KB L1 instruction cache.
  3. Pre-decoders split the 16-byte buffer up into instructions. Even though in theory the buffer could contain up to 16 instructions, IIRC the pre-decoder outputs a maximum of 5 instructions per clock.
  4. Instructions are fed in-order into a 20-entry instruction queue.
  5. Instructions are pulled from the queue into decoders. Haswell has 3 simple decoders (single-uop instructions) and 1 complex decoder (1-4 uop instructions). Microcoded instructions (> 4 uops) are handled by a separate microcode engine that outputs 4 uops per clock, but the decoders are blocked while it is in use.
  6. The decoders all work with a uop cache, so that once an instruction is cached, its uops are pulled from the cache and decoding is skipped to save power and increase throughput.
  7. uops go into a 56-entry queue, from which they are fed to the out-of-order execution engine.
I can't recall exactly how that is split up in terms of pipeline stages, but it's spread across something like 5-6 stages.

How in God's name did we ever manage to build something so complex?
 

Ken g6

Programming Moderator, Elite Member
Dec 11, 1999
How in God's name did we ever manage to build something so complex?
Well, x86 CPUs started as CISC. They had complex instructions. Then people figured out that RISC is generally better. They couldn't change the x86 instruction set. So they added a decoder to convert from external CISC instructions to internal (mostly) RISC instructions. Then they added optimizations for improved performance. Eventually they wound up storing instructions in a low-level cache in "decoded" (meaning mostly-RISC) form.
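As a toy illustration of what that CISC-to-RISC conversion does (the uop names and format below are invented; Intel's real internal uop encoding isn't public): a read-modify-write instruction like add [rax], rbx comes apart into separate load, add, and store uops.

Code:
#include <stdio.h>

/* Toy model of cracking one CISC instruction into RISC-like uops.
   The uop kinds and text are invented for illustration; real internal
   uop formats are not documented. */
enum uop_kind { UOP_LOAD, UOP_ADD, UOP_STORE };

struct uop {
    enum uop_kind kind;
    const char   *desc;   /* human-readable, for the printout only */
};

int main(void) {
    /* Decoding "add [rax], rbx" might emit something like: */
    struct uop uops[] = {
        { UOP_LOAD,  "tmp   <- [rax]"     },  /* memory read  */
        { UOP_ADD,   "tmp   <- tmp + rbx" },  /* ALU op       */
        { UOP_STORE, "[rax] <- tmp"       },  /* memory write */
    };
    for (unsigned i = 0; i < sizeof uops / sizeof uops[0]; i++)
        printf("uop %u: %s\n", i, uops[i].desc);
    return 0;
}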
 

chrstrbrts

Senior member
Aug 12, 2014
Well, x86 CPUs started as CISC. They had complex instructions. Then people figured out that RISC is generally better. They couldn't change the x86 instruction set. So they added a decoder to convert from external CISC instructions to internal (mostly) RISC instructions. Then they added optimizations for improved performance. Eventually they wound up storing instructions in a low-level cache in "decoded" (meaning mostly-RISC) form.

I'm just amazed that we can take a slab of silicon and turn it into a processor.
 

Cogman

Lifer
Sep 19, 2000
Well, x86 CPUs started as CISC. They had complex instructions. Then people figured out that RISC is generally better. They couldn't change the x86 instruction set. So they added a decoder to convert from external CISC instructions to internal (mostly) RISC instructions. Then they added optimizations for improved performance. Eventually they wound up storing instructions in a low-level cache in "decoded" (meaning mostly-RISC) form.

At this point, there really isn't much difference between the main CISC and RISC architectures. Most RISC architectures end up having just as many instructions as their CISC counterparts. The main difference is really just the fact that RISC has a (mostly) fixed instruction width while the old CISC instruction sets do not. This all comes down to the fact that in the early days of computing, storage was expensive. Extra bytes really mattered.
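For a back-of-the-envelope feel for the density difference: the x86 byte counts below are real encodings (push rbx is the single byte 0x53, mov eax, imm32 is 5 bytes, ret is 1), while the fixed-width column assumes a generic 4-bytes-per-instruction RISC:

Code:
#include <stdio.h>

/* Rough code-density comparison. The x86 lengths are real encodings;
   the fixed-width column assumes 4 bytes per instruction, as on most
   classic RISCs. */
int main(void) {
    const char *insn[]    = {"push rbx", "mov eax, 1", "ret"};
    const int   x86_len[] = {1, 5, 1};
    int x86 = 0, risc = 0;
    for (int i = 0; i < 3; i++) {
        printf("%-12s x86: %d byte(s), fixed-width: 4 bytes\n",
               insn[i], x86_len[i]);
        x86  += x86_len[i];
        risc += 4;
    }
    printf("total: %d bytes (x86) vs %d bytes (fixed-width)\n", x86, risc);
    return 0;
}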

All in all, the amount of extra power and complexity needed to decode CISC instructions is mostly a non-issue. It isn't the place where most CPUs are spending their power budget.

Now, if you are interested in seeing what state-of-the-art CPU design looks like (even if it probably will never see the light of day), I would suggest looking into the Mill architecture. It is really fascinating. It is a next-generation CPU design that rethinks just about every way you think about a CPU. Current CPUs were mostly designed for hand-written assembly. Mill was designed to work well with modern compilers.
 

Merad

Platinum Member
May 31, 2010
How in God's name did we ever manage to build something so complex?

Honestly, that's barely the tip of the iceberg. After you decode the instructions you have the entire out-of-order execution engine, the reordering system that commits the execution results, the branch prediction engine, the entire system that maintains the caches... probably half a dozen other major components that I'm forgetting. I've wondered before how many people you'd actually have to bring together to have a complete understanding of a modern CPU in the room. It's far, far too much for any one person.
 

chrstrbrts

Senior member
Aug 12, 2014
Would you agree that modern processors constitute the most advanced technology created by mankind?
 

Merad

Platinum Member
May 31, 2010
Depends on exactly how you define "advanced technology." They're certainly a strong contender.