I think you'll find that just about any technology the marketing departments like to run with does have real technical merit behind it; the merit just isn't apparent from the marketing propaganda.
For example, take the P4's "double pumped" ALUs. Intel's marketing loves to say that the latest P4 has 4GHz ALUs.
The truth of the matter is that the ALUs are pipelined with a fast clock (FCLK) signal that indeed runs at twice the global clock. It takes two FCLK periods, equal to one global clock period, to complete an arithmetic operation: in the first FCLK cycle, the lower 16 bits of the result are calculated, and in the second FCLK cycle, the upper 16 bits are calculated. But the ALU bypass network also runs on the FCLK signal, and it takes care of fetching operands from the register file and issuing arithmetic operations twice per global clock cycle. This has a few distinct advantages over normal ALUs.
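To make the two-step add concrete, here is a minimal Python sketch of a 32-bit addition split into a low half and a high half, one per FCLK tick. The function name and the way the carry is handed from one half to the other are my own illustration, not a description of Intel's actual circuit.

```python
MASK16 = 0xFFFF

def staggered_add(a, b):
    """Add two 32-bit values in two 16-bit steps (hypothetical model)."""
    # First FCLK tick: add the low halves and keep the carry out.
    low = (a & MASK16) + (b & MASK16)
    carry = low >> 16
    low &= MASK16
    # Second FCLK tick: add the high halves plus the carry from tick one.
    high = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    high &= MASK16  # drop the carry out of bit 31, like a 32-bit register
    return (high << 16) | low

result = staggered_add(0x0001FFFF, 0x00000001)
print(hex(result))
```

The point of the split is that the low 16 bits are final after the first tick, before the high half has even started.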
First, a little background (I don't know your background, so disregard this if you know it all already). With superscalar cores, the goal is to issue as many operations as possible per clock cycle. To accommodate the maximum number of instructions of a certain type that may burst through in a particular clock cycle, you're going to want multiple copies of the same unit, such as 3 or 4 integer units (ALUs). But there are limits to the number of instructions that you can issue in a cycle: aside from memory stalls, data dependencies between instructions may prevent you from issuing multiple instructions in a cycle. There are three types of data dependencies:
Read-after-write:
a = b + c
d = a + e
The result of the first operation is used in the second, so they have to be issued in order.
Write-after-read:
a = b + c
b = e + f
Issuing these instructions out of order or at the same time may change the result of the first, since the second overwrites b, which the first still needs to read.
Write-after-write:
b = a + c
b = d + e
Issuing these instructions out of order or at the same time could leave b holding the first result instead of the second, which changes the meaning of the program.
Write-after-read and write-after-write dependencies can be solved with register renaming, but traditionally there is no way to issue two instructions with a read-after-write dependency at the same time or out of order.
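The three cases can be sketched as a small Python check. The (destination, sources) tuple encoding and the renamed register "b2" are assumptions of mine for illustration; real hardware does this with physical register tags, not strings.

```python
def hazards(first, second):
    """Return the dependency types between `first` and a later `second`.

    Each instruction is a (destination, sources) pair, e.g. ("a", ("b", "c"))
    for a = b + c. This encoding is hypothetical, chosen for illustration.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = set()
    if dest1 in srcs2:
        found.add("RAW")  # second reads what first writes
    if dest2 in srcs1:
        found.add("WAR")  # second overwrites what first reads
    if dest1 == dest2:
        found.add("WAW")  # both write the same register
    return found

raw = hazards(("a", ("b", "c")), ("d", ("a", "e")))
war = hazards(("a", ("b", "c")), ("b", ("e", "f")))
waw = hazards(("b", ("a", "c")), ("b", ("d", "e")))

# Renaming: give the second write to b a fresh register, here called "b2".
war_renamed = hazards(("a", ("b", "c")), ("b2", ("e", "f")))
waw_renamed = hazards(("b", ("a", "c")), ("b2", ("d", "e")))
print(raw, war, waw, war_renamed, waw_renamed)
```

After renaming, the write-after-read and write-after-write pairs report no dependency at all, while no amount of renaming helps the read-after-write pair, because the second instruction genuinely needs the first one's value.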
But with the P4's two-stage pipelined ALUs, running at 2X the global clock rate, the bypass network can effectively issue two instructions with a read-after-write dependency in the same global clock cycle. From the perspective of the global clock, the first instruction will be issued at cycle N, and the second instruction will be issued at cycle N + 1/2. At that point, the lower 16 bits of the first instruction's result are already calculated and available to the second instruction, so in a way this solves the read-after-write dependency problem. This doesn't necessarily have a *huge* impact on performance (probably not noticeable for most programs), but it does affect the way the processor can handle bursts of integer operations.
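Here is a rough Python sketch of that overlap for the read-after-write example from earlier (a = b + c, then d = a + e), stepped through in FCLK ticks. The helper names and the tick-by-tick layout are my own simplification of the idea, not the actual bypass network.

```python
MASK16 = 0xFFFF

def low16(value):
    return value & MASK16

def high16(value):
    return (value >> 16) & MASK16

def half_add(x, y, carry_in):
    """Add two 16-bit halves; return (16-bit sum, carry out)."""
    total = x + y + carry_in
    return total & MASK16, total >> 16

def dependent_pair(b, c, e):
    """Compute a = b + c and d = a + e in three FCLK ticks (hypothetical model)."""
    # Tick 0 (global cycle N): low half of the first add.
    a_lo, a_carry = half_add(low16(b), low16(c), 0)
    # Tick 1 (global cycle N + 1/2): high half of the first add, and the
    # low half of the second add, which only needs a_lo from tick 0.
    a_hi, _ = half_add(high16(b), high16(c), a_carry)
    d_lo, d_carry = half_add(a_lo, low16(e), 0)
    # Tick 2 (global cycle N + 1): high half of the second add.
    d_hi, _ = half_add(a_hi, high16(e), d_carry)
    return (a_hi << 16) | a_lo, (d_hi << 16) | d_lo

a, d = dependent_pair(0x0000FFFF, 0x00000001, 0x0001FFFF)
print(hex(a), hex(d))
```

Both results are done after three fast ticks, i.e. one and a half global cycles, instead of the two full cycles a dependent pair would need if the second add had to wait for the complete first result.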
Secondly, two fast ALUs can handle the same number of instructions per global clock cycle as four normal ALUs, but with less die area used and less heat produced, which could conceivably be very useful for the MPU designers.
So when the marketing department gets hold of this concept, they're obviously not going to understand the issues and benefits behind it, so they'll say "Wow, 4GHz ALUs! That must mean the P4 has twice the performance!" I really don't think the engineers should be blamed when they have a good idea and marketing doesn't understand it.