<< what is Superscalar??? they can fetch 3 micro-ops per cycle in to their reorder buffers? ok, what is a reorder buffer 🙂 >>
Superscalar is just the method of exploiting instruction-level parallelism in code by executing multiple instructions at a time. Modern out-of-order superscalar processors can execute instructions out of the order in which they appear in the code...after instructions are fetched and decoded, they are put into reorder buffers, or "queues", out of which they can be issued out of order to the appropriate execution unit. There are three types of dependencies which prevent instructions from being executed out-of-order; when one is present, the instructions must be executed in the order in which they appear in the code:
Read-after-write:
a = b + c
d = e + a
Write-after-read:
d = e + a
a = b + c
Write-after-write:
c = a + b
c = d + e
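To make the three hazard types concrete, here's a minimal sketch (illustrative Python, with hypothetical instruction tuples, not any real scheduler's logic) that classifies the dependency between two sequential instructions by comparing their destination and source registers:

```python
def classify_hazard(first, second):
    """Classify the dependency between two sequential instructions.

    Each instruction is a (dest, sources) tuple, e.g. ("a", ["b", "c"])
    for a = b + c. Returns the hazard name, or None if the pair is
    independent and can be freely reordered.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    if dest1 in srcs2:
        return "read-after-write"    # true dependency: 2nd reads 1st's result
    if dest2 in srcs1:
        return "write-after-read"    # anti-dependency: 2nd clobbers a source of the 1st
    if dest1 == dest2:
        return "write-after-write"   # output dependency: both write the same register
    return None

# The three examples from above:
print(classify_hazard(("a", ["b", "c"]), ("d", ["e", "a"])))  # read-after-write
print(classify_hazard(("d", ["e", "a"]), ("a", ["b", "c"])))  # write-after-read
print(classify_hazard(("c", ["a", "b"]), ("c", ["d", "e"])))  # write-after-write
```

Only the first of those (read-after-write) is a "true" dependency on a computed value; the other two are just name conflicts, which is what register renaming (below) exploits.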
As you may or may not know, CPUs most commonly perform arithmetic on registers, which are temporary locations of fast storage on the CPU, located in the register file and whose number is specified by the instruction set. RISC CPUs typically have 32 logical registers, vs. 8 for x86. The RISC philosophy states that memory contents must be loaded into registers before the arithmetic expression is evaluated, while x86 can perform arithmetic using registers and/or memory locations (not necessarily a good thing). With register renaming, on the other hand, there exists a larger number of physical registers, and the logical registers are mapped onto these (each left-hand side use of a logical register in an arithmetic expression gets a new physical register mapping). Renaming thus nullifies write-after-read and write-after-write dependencies, but read-after-write dependencies must still be followed.
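A rough sketch of how renaming kills the false dependencies: every write to a logical register gets a fresh physical register, while reads go through the current mapping. (Illustrative Python with an assumed-unbounded physical register file, not any real renamer's implementation.)

```python
import itertools

class Renamer:
    """Map logical registers onto an (assumed unbounded) physical register file."""
    def __init__(self):
        self.fresh = (f"p{i}" for i in itertools.count())
        self.map = {}  # logical register -> current physical register

    def rename(self, dest, sources):
        # Reads use the current mappings (a read of a never-written logical
        # register just gets assigned an initial physical register here).
        srcs = [self.map.setdefault(s, next(self.fresh)) for s in sources]
        # Each write gets a brand-new physical register, so a later write
        # never clobbers a value an earlier instruction still needs.
        self.map[dest] = next(self.fresh)
        return (self.map[dest], srcs)

r = Renamer()
# The write-after-read pair from above: d = e + a, then a = b + c
print(r.rename("d", ["e", "a"]))  # ('p2', ['p0', 'p1'])
print(r.rename("a", ["b", "c"]))  # ('p5', ['p3', 'p4'])
```

Note that the second instruction's write to "a" lands in p5 while the first instruction still reads "a" out of p1, so the two can now execute in either order...only the true read-after-write chains remain.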
<< I guess this is where a branch prediction unit comes into play? >>
Depends on the architecture....branch prediction and resolution is typically decoupled from the normal execution logic. On the P4, it occurs during a trace segment build.
<< so basically it can take 3 micro-ops (does that mean it can decode x86 as fast?) >>
This depends on the microarchitecture. Modern x86 processors with decoupled execution decode the more complex x86 instructions into RISC-like micro-ops. In the case where an arithmetic instruction uses both registers and memory locations, it gets decoded into a load uop and an arithmetic uop using only registers. As a result, I believe the average x86 instruction gets decoded into 1.5 uops. The Athlon, with a traditional instruction cache, has 3 parallel decoders (taking two pipeline stages to complete); thus it can fetch and decode 3 x86 instructions/cycle. The P4 has a single decoder, but it is only used when it builds a trace cache segment (decoded uops are reused in the trace cache)...in normal operation, the P4 can fetch 3 uops out of the trace cache per cycle. The problem is that x86 decoding is very complex....it has to do the x86 -> uop translation, and x86 instructions are variable length from 1 to 15 bytes (vs. fixed at 4 bytes for most RISC ISAs), which makes updating the program counter more complex. As a result, x86 decoders are large, hot, and slow...the large amount of logic probably has very high fan-in and fan-out, as well as long wire lengths, and is likely a major contributor to the critical path. Thus, while the Athlon's 3 decoders are great for maintaining throughput, they can't be great for heat, die area, and clock speed.
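As an illustration of that reg-mem split, a register/memory instruction like `add eax, [ebx]` could be cracked roughly as follows (hypothetical uop names and a made-up temporary register, not Intel's or AMD's actual encoding):

```python
def decode(instr):
    """Crack one simplified x86 instruction into RISC-like uops.

    A register-only instruction is already a single uop; a reg-mem form
    is split into a load uop plus a register-only arithmetic uop, linked
    through a temporary register (here called "tmp0").
    """
    op, dest, src = instr
    if src.startswith("["):                 # memory operand
        return [("load", "tmp0", src),      # tmp0 <- memory
                (op, dest, "tmp0")]         # dest <- dest op tmp0
    return [(op, dest, src)]

print(decode(("add", "eax", "[ebx]")))
# [('load', 'tmp0', '[ebx]'), ('add', 'eax', 'tmp0')]
print(decode(("add", "eax", "ecx")))
# [('add', 'eax', 'ecx')]
```

With a mix of reg-reg instructions (1 uop) and reg-mem instructions (2 uops), you can see how an average of roughly 1.5 uops per x86 instruction falls out.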
<< anywho, so basically the CPU can feed 3 out of 9 pipelines, yet without SMT it can only feed 1 out of 9 pipes per clock cycle >>
No, the point is that with SMT, all the execution units can be shared among multiple threads. How this is accomplished again depends on the microarchitecture. At the basic level, the necessary hardware changes are multiple program counters (to keep track of the instruction address of each thread) and some way of tagging each instruction with its thread. The beauty of register renaming is that multiple physical register sets aren't necessary; the same physical register set can map logical registers for multiple threads.
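That last point can be sketched by extending the renaming idea: one shared physical register pool, but one rename map per thread (again illustrative Python under simplifying assumptions, not real hardware logic):

```python
import itertools

class SMTRenamer:
    """One shared physical register file, one rename map per thread."""
    def __init__(self, num_threads):
        self.fresh = (f"p{i}" for i in itertools.count())  # shared physical pool
        self.maps = [{} for _ in range(num_threads)]       # per-thread logical -> physical

    def rename(self, thread, dest, sources):
        m = self.maps[thread]
        srcs = [m.setdefault(s, next(self.fresh)) for s in sources]
        m[dest] = next(self.fresh)
        return (thread, m[dest], srcs)

smt = SMTRenamer(2)
# Both threads write logical register "a", but each lands in a different
# physical register, so the two threads share one register file safely.
print(smt.rename(0, "a", ["b", "c"]))  # (0, 'p2', ['p0', 'p1'])
print(smt.rename(1, "a", ["b", "c"]))  # (1, 'p5', ['p3', 'p4'])
```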
How the CPU maintains issue and retire throughput for SMT is up to the architecture. The now-defunct Alpha EV8 was going to be a 4-thread SMT, 8-way fetch/issue superscalar CPU. It was designed from the ground up to support all four threads at once...compared to the EV7, the number of integer units was to be increased from 4 to 6, and the number of floating-point units from 2 to 4. The register files and caches had enough ports to handle the fetching and retiring of 4 active threads; it was estimated that for a single thread, the 8-way superscalar EV8 would have an average IPC (instructions/cycle) of between 2 - 2.5, so the 8-way CPU could just about accommodate the 8-10 instructions/cycle being fetched, executed, and retired for 4 threads.
The P4's SMT is initially going to support 2 threads, which is probably ideal. A study once found that for P3-era code, x86 CPUs typically achieve an IPC of around 1 - 1.2....since there are around 1.5 uops/x86 instruction, that's 1.5 - 1.8 uops/cycle on average. Typical integer code consists of roughly 50% load/store instructions, 35 - 40% arithmetic instructions, and 10 - 15% branch instructions. With the P4's two load/store units and two fast integer units, resource conflicts between the two active threads shouldn't be a problem. Floating-point code should cause more difficulty, since there's only one FP add/mult unit and one FP store unit that must be shared between two threads.
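The back-of-envelope arithmetic behind those figures, spelled out (using the estimates quoted above, which are rough averages, not measured numbers):

```python
# Back-of-envelope math from the P3-era estimates above.
uops_per_instr = 1.5                       # average uops per x86 instruction
for ipc in (1.0, 1.2):                     # typical single-thread x86 IPC range
    print(round(ipc * uops_per_instr, 2))  # 1.5, then 1.8 uops/cycle

# Two such threads together would want roughly 3.0 - 3.6 uops/cycle,
# which sits right around the P4's trace-cache fetch width of 3
# uops/cycle -- hence 2-way SMT being a reasonable fit.
print(round(2 * 1.2 * uops_per_instr, 2))  # 3.6
```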