Shall we recap what the Pentium 4 was?
-8KB L1 data cache, no L1 instruction cache. Pentium III had 16KB, a size that goes back to Pentium MMX.
-Went from Pentium III's 3 decoders to 1. Pentium MMX actually had 2 decoders, so Pentium 4 had the same number of decoders as the 486 and its predecessors, from before superscalar execution was a thing.
--Trace Cache replaced the L1i cache.
-Fast ALU clocked at double the rest of the core, plus a slow ALU for everything else. The fast ALU could only execute some instructions.
-20+ stage pipeline. I believe it was 20 stages AFTER a Trace Cache hit.
So they tried to up the clock in two ways: one, by adding pipeline stages, and two, by "simplifying" the core. I put simplifying in quotes because the Trace Cache was almost 100KB in size, which is 4x as large as Sandy Bridge's uop cache with the same hit rate. Of course people learn, which is why SNB was much better.
The Netburst chips, as we found out later, went much further than enhanced branch prediction and a novel Trace Cache to make up for the "Hyper Pipeline". They had the replay feature, which replicated significant parts of the entire pipeline so it could "replay". I heard that was the primary reason for the big loss in performance in single-threaded applications when Hyperthreading was enabled, and no doubt it ballooned die size and power consumption.
- The data cache was renamed to L0. => No L1 buffer for data
- Yep.
- The trace cache was after decode, whereas an L1i, if there were one, would sit before decode. => No L1 buffer for instructions
- The fast ALUs don't do these, which run at 1x core clock:
-- Some examples of integer execution hardware put elsewhere are the multiplier, shifts, flag logic, and branch processing.
-- Most integer shift or rotate operations go to the complex integer dispatch port. These shift operations have a latency of four clocks. Integer multiply and divide operations also have a long latency. Typical forms of multiply and divide have a latency of about 14 and 60 clocks, respectively.
- Yep
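
To put the latencies quoted above in wall-clock terms, here is a rough Python sketch. The shift, multiply, and divide cycle counts are the ones quoted above. The half-clock figure for a fast ALU op just follows from that ALU being clocked at double the core. The 3.0 GHz core clock is an assumed value picked for illustration, not a number from this thread.

```python
# Latencies in core clocks. Shift/multiply/divide are the figures quoted above;
# 0.5 for the fast ALU follows from it running at 2x the core clock.
LATENCY_CYCLES = {
    "fast ALU op": 0.5,
    "shift/rotate": 4,
    "multiply": 14,
    "divide": 60,
}

ASSUMED_CORE_GHZ = 3.0  # assumption, for illustration only


def latency_ns(cycles, core_ghz):
    """Convert a latency in core clocks to nanoseconds."""
    return cycles / core_ghz


for op, cycles in LATENCY_CYCLES.items():
    # How many fast ALU ops fit in the time this one op takes.
    ratio = cycles / LATENCY_CYCLES["fast ALU op"]
    ns = latency_ns(cycles, ASSUMED_CORE_GHZ)
    print(f"{op:13s} {cycles:>5} clk  {ns:6.2f} ns  {ratio:4.0f}x fast ALU op")
```

By these numbers a single shift costs as much as 8 back-to-back fast ALU ops, which is why code that leaned on shifts could run so poorly here.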
The hottest part of any Pentium 4 was everything around the integer core, since whenever something missed it went L2 -> Decode -> Trace -> Core -> L0 -> L2. Overall, the coolest part of the chip was the integer core, while everything else was a nuclear apocalypse.
Modern high-frequency OoO non-x86 cores that do >5 GHz all have L1 instruction and data caches, specifically to keep L2 usage from ballooning power at high frequency.
The double-pumping is actually the only good thing about Pentium 4. If they had kept Netburst close to the patented version, it probably wouldn't have been bad, since that version had an L1 cache while none of the production models had one. Based on the patents, the L1 caches would have been near or inside the memory execution unit, next to the L0 caches and branch prediction unit.
Gen1:
19-cycle L2 -> Decode
19-cycle L2 -> 2-cycle L0d
No buffer for that.
Gen2:
23/27-cycle L2 -> Decode
23/27-cycle L2 -> 4-cycle L0d
Where is the buffer?!
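
A quick Python sketch of how big that gap is, using only the cycle counts listed above. The `miss_penalty` helper is just a name made up for this illustration, and it only compares the two latencies; it makes no claim about how the real pipeline overlaps them.

```python
# (L2 latency, L0d latency) in core clocks, as listed above.
# Gen2 is listed as 23/27, so both L2 figures are kept.
LATENCIES = {
    "Gen1": [(19, 2)],
    "Gen2": [(23, 4), (27, 4)],
}


def miss_penalty(l2_cycles, l0d_cycles):
    """Return (extra clocks, slowdown factor) for a load served by L2 instead of L0d."""
    return l2_cycles - l0d_cycles, l2_cycles / l0d_cycles


for gen, pairs in LATENCIES.items():
    for l2, l0d in pairs:
        extra, factor = miss_penalty(l2, l0d)
        print(f"{gen}: L0d {l0d} clk, L2 {l2} clk -> +{extra} clk, {factor:.2f}x slower")
```

For Gen1 that is 17 extra clocks, or 9.5x the L0d latency, with nothing in between to soften it.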
Also, the trace cache is as big as the L0 data cache. Definitely not 100KB in size.
