How are instructions delivered to the CPU?

chrstrbrts

Senior member
Aug 12, 2014
Hello everyone,

I used to think that when a 64-bit CPU fetched an instruction, after translating the virtual address to a physical address in the MMU, it just grabbed 8 consecutive bytes from RAM starting at some base address and that one big 64-bit instruction was sent down an instruction bus.

However, after reading the first chapters of Intel's 64 and IA-32 Architectures Software Developer’s Manual Volume 2, I realize that instructions can be of variable length.

Further, some instructions with prefixes, etc. can be much longer than 64 bits.
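For what it's worth, here are a few encodings I pulled while reading (byte values as I understand the opcode tables; the manual says the architectural maximum length is 15 bytes):

Code:
#include <stdio.h>

/* A few hand-assembled x86-64 instructions, just to show how much the
   length varies. Byte values are my reading of the SDM opcode tables;
   the architectural maximum instruction length is 15 bytes. */
static const unsigned char nop_i[] = {0x90};                          /* nop                 (1 byte)   */
static const unsigned char mov32[] = {0xB8,0x78,0x56,0x34,0x12};      /* mov eax, 0x12345678 (5 bytes)  */
static const unsigned char add64[] = {0x48,0x05,0x78,0x56,0x34,0x12}; /* add rax, 0x12345678 (6 bytes)  */
static const unsigned char mov64[] = {0x48,0xB8,0xEF,0xCD,0xAB,0x89,
                                      0x67,0x45,0x23,0x01};           /* movabs rax, imm64   (10 bytes) */

int main(void) {
    printf("nop: %zu, mov r32: %zu, add r64: %zu, mov r64: %zu bytes\n",
           sizeof nop_i, sizeof mov32, sizeof add64, sizeof mov64);
    return 0;
}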

So, how exactly does an instruction make its way from RAM to the CPU?

Thanks.
 

Ken g6

Programming Moderator, Elite Member
Dec 11, 1999
I used to think that when a 64-bit CPU fetched an instruction, after translating the virtual address to a physical address in the MMU, it just grabbed 8 consecutive bytes from RAM starting at some base address and that one big 64-bit instruction was sent down an instruction bus.
That, or some variation of that, would be true of a RISC processor. But Intel processors are more Complex.

You seem to be asking about the Fetch unit. Here's an old PDF with an overview of the Intel fetch cycle on a 32-bit Pentium: http://users.utcluj.ro/~baruch/book_ssce/SSCE-Intel-Pipeline.pdf

Modern processors aren't that much more complicated.
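To see why the fixed-width RISC case is so simple, here's a minimal sketch of a RISC-style fetch loop (the 4-byte width is real for most classic RISCs, but the instruction words and the 6-bit opcode field below are made up for illustration, not any real ISA):

Code:
#include <stdint.h>
#include <stdio.h>

/* Minimal fixed-width fetch sketch: every instruction is exactly one
   aligned 4-byte word, so fetch is "read memory[pc], add 4 to pc" and
   decode can slice fields at fixed bit positions. The instruction
   words and field layout here are invented for illustration. */
static const uint32_t imem[] = {0x11223344, 0x55667788, 0x99AABBCC};

int main(void) {
    for (size_t i = 0; i < sizeof imem / sizeof imem[0]; i++) {
        uint32_t insn = imem[i];   /* fetch: one aligned word */
        printf("pc=%zu insn=%08X opcode=%u\n",
               i * 4, (unsigned)insn,
               (unsigned)(insn >> 26));  /* decode: fixed field slice */
    }
    return 0;
}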
 

Merad

Platinum Member
May 31, 2010
Modern Intel CPUs are quite a bit more advanced than the old Pentiums. Here's the fetch/decode process for Haswell:

  1. The branch predictor feeds an instruction pointer to the fetch unit.
  2. The fetch unit reads 16 bytes from the 32KB L1 instruction cache.
  3. Pre-decoders split the 16-byte buffer up into instructions. Even though in theory the buffer could contain up to 16 instructions, IIRC the pre-decoder outputs a maximum of 5 instructions per clock.
  4. Instructions are fed in-order into a 20-entry instruction queue.
  5. Instructions are pulled from the queue into decoders. Haswell has 3 simple decoders (single-uop instructions) and 1 complex decoder (1-4 uop instructions). Microcoded instructions (> 4 uops) are handled by a separate microcode engine that outputs 4 uops per clock, but the decoders are blocked while it is in use.
  6. The decoders all work with a uop cache, so that once an instruction is cached, its uops are pulled from the cache and decoding is skipped to save power and increase throughput.
  7. uops go into a 56-entry queue, from which they are fed to the out-of-order execution engine.
I can't recall exactly how that is split up in terms of pipeline stages, but it's spread across something like 5-6 stages.
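Step 3 is the tricky part for x86: you can't know where instruction N+1 starts until you've sized instruction N. Here's a toy sketch of that boundary-marking, with a made-up two-format encoding standing in for the real rules (actual x86 lengths depend on prefixes, ModRM, SIB, displacement, and immediates, which is why pre-decode needs dedicated hardware):

Code:
#include <stdio.h>

#define FETCH_WIDTH 16
#define MAX_OUT      5   /* matches the ~5 instructions/clock above */

/* Toy length rule, NOT x86: opcodes >= 0x80 carry a 2-byte immediate. */
static int insn_len(unsigned char opcode) {
    return (opcode >= 0x80) ? 3 : 1;
}

int main(void) {
    /* One 16-byte fetch buffer, arbitrary contents. */
    unsigned char buf[FETCH_WIDTH] =
        {0x01,0x90,0x83,0x10,0x20,0x02,0xA0,0x33,
         0x44,0x05,0x06,0xC1,0x55,0x66,0x07,0x08};
    int pos = 0, out = 0;
    while (pos < FETCH_WIDTH && out < MAX_OUT) {
        int len = insn_len(buf[pos]);
        if (pos + len > FETCH_WIDTH)
            break;   /* straddles the buffer; finish after the next fetch */
        printf("insn %d at offset %2d, %d byte(s)\n", out++, pos, len);
        pos += len;
    }
    return 0;
}

Note the serial dependence: each iteration's starting offset depends on the previous instruction's length, which is exactly what makes wide parallel pre-decode of x86 hard.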
 

chrstrbrts

Senior member
Aug 12, 2014
Modern Intel CPUs are quite a bit more advanced than the old Pentiums. Here's the fetch/decode process for Haswell:

  1. The branch predictor feeds an instruction pointer to the fetch unit.
  2. The fetch unit reads 16 bytes from the 32KB L1 instruction cache.
  3. Pre-decoders split the 16-byte buffer up into instructions. Even though in theory the buffer could contain up to 16 instructions, IIRC the pre-decoder outputs a maximum of 5 instructions per clock.
  4. Instructions are fed in-order into a 20-entry instruction queue.
  5. Instructions are pulled from the queue into decoders. Haswell has 3 simple decoders (single-uop instructions) and 1 complex decoder (1-4 uop instructions). Microcoded instructions (> 4 uops) are handled by a separate microcode engine that outputs 4 uops per clock, but the decoders are blocked while it is in use.
  6. The decoders all work with a uop cache, so that once an instruction is cached, its uops are pulled from the cache and decoding is skipped to save power and increase throughput.
  7. uops go into a 56-entry queue, from which they are fed to the out-of-order execution engine.
I can't recall exactly how that is split up in terms of pipeline stages, but it's spread across something like 5-6 stages.

How in God's name did we ever manage to build something so complex?
 

Ken g6

Programming Moderator, Elite Member
Dec 11, 1999
How in God's name did we ever manage to build something so complex?
Well, x86 CPUs started as CISC. They had complex instructions. Then people figured out that RISC is generally better. They couldn't change the x86 instruction set. So they added a decoder to convert from external CISC instructions to internal (mostly) RISC instructions. Then they added optimizations for improved performance. Eventually they wound up storing instructions in a low-level cache in "decoded" (meaning mostly-RISC) form.
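As a toy illustration of what that CISC-to-RISC conversion does (the uop names and format below are invented; Intel's real internal uop encoding isn't public): a read-modify-write instruction like add [rax], rbx comes apart into separate load, add, and store uops.

Code:
#include <stdio.h>

/* Toy model of cracking one CISC instruction into RISC-like uops.
   The uop kinds and text are invented for illustration; real internal
   uop formats are not documented. */
enum uop_kind { UOP_LOAD, UOP_ADD, UOP_STORE };

struct uop {
    enum uop_kind kind;
    const char   *desc;   /* human-readable, for the printout only */
};

int main(void) {
    /* Decoding "add [rax], rbx" might emit something like: */
    struct uop uops[] = {
        { UOP_LOAD,  "tmp   <- [rax]"     },  /* memory read  */
        { UOP_ADD,   "tmp   <- tmp + rbx" },  /* ALU op       */
        { UOP_STORE, "[rax] <- tmp"       },  /* memory write */
    };
    for (unsigned i = 0; i < sizeof uops / sizeof uops[0]; i++)
        printf("uop %u: %s\n", i, uops[i].desc);
    return 0;
}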
 

chrstrbrts

Senior member
Aug 12, 2014
Well, x86 CPUs started as CISC. They had complex instructions. Then people figured out that RISC is generally better. They couldn't change the x86 instruction set. So they added a decoder to convert from external CISC instructions to internal (mostly) RISC instructions. Then they added optimizations for improved performance. Eventually they wound up storing instructions in a low-level cache in "decoded" (meaning mostly-RISC) form.

I'm just amazed that we can take a slab of silicon and turn it into a processor.
 

Cogman

Lifer
Sep 19, 2000
Well, x86 CPUs started as CISC. They had complex instructions. Then people figured out that RISC is generally better. They couldn't change the x86 instruction set. So they added a decoder to convert from external CISC instructions to internal (mostly) RISC instructions. Then they added optimizations for improved performance. Eventually they wound up storing instructions in a low-level cache in "decoded" (meaning mostly-RISC) form.

At this point, there really isn't much difference between the main CISC and RISC architectures. Most RISC architectures end up having just as many instructions as their CISC counterparts. The main difference is really just the fact that RISC has a (mostly) fixed instruction width while the old CISC instruction sets do not. This all comes down to the fact that in the early days of computing, storage was expensive. Extra bytes really mattered.
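For a back-of-the-envelope feel for the density difference: the x86 byte counts below are real encodings (push rbx is the single byte 0x53, mov eax, imm32 is 5 bytes, ret is 1), while the fixed-width column assumes a generic 4-bytes-per-instruction RISC:

Code:
#include <stdio.h>

/* Rough code-density comparison. The x86 lengths are real encodings;
   the fixed-width column assumes 4 bytes per instruction, as on most
   classic RISCs. */
int main(void) {
    const char *insn[]    = {"push rbx", "mov eax, 1", "ret"};
    const int   x86_len[] = {1, 5, 1};
    int x86 = 0, risc = 0;
    for (int i = 0; i < 3; i++) {
        printf("%-12s x86: %d byte(s), fixed-width: 4 bytes\n",
               insn[i], x86_len[i]);
        x86  += x86_len[i];
        risc += 4;
    }
    printf("total: %d bytes (x86) vs %d bytes (fixed-width)\n", x86, risc);
    return 0;
}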

All in all, the amount of extra power and complexity needed to decode CISC instructions is mostly a non-issue. It isn't the place where most CPUs are spending their power budget.

Now, if you are interested in seeing what state-of-the-art CPU design looks like (even if it probably will never see the light of day), I would suggest looking into the Mill architecture. It is really fascinating. It is a next-generation CPU design that rethinks just about every way you think about a CPU. Current CPUs were mostly designed for hand-written assembly. Mill was designed to work well with modern compilers.
 

Merad

Platinum Member
May 31, 2010
How in God's name did we ever manage to build something so complex?

Honestly, that's barely the tip of the iceberg. After you decode the instructions you have the entire out-of-order execution engine, the reordering system that commits the execution results, the branch prediction engine, the entire system that maintains the caches... probably half a dozen other major components that I'm forgetting. I've wondered before how many people you'd actually have to bring together to have a complete understanding of a modern CPU in the room. It's far, far too much for any one person.
 

chrstrbrts

Senior member
Aug 12, 2014
Would you agree that modern processors constitute the most advanced technology created by mankind?
 

Merad

Platinum Member
May 31, 2010
Depends on exactly how you define "advanced technology." They're certainly a strong contender.