For some problems, we need a GHz war; supercomputers won't help. As Amdahl knew many years ago.
And one more thing: I need to check whether my understanding is correct.
If a certain architecture introduces an instruction that can do a multiplication and an addition simultaneously (is that the so-called FMA in Haswell?), then 2+3*7 is just one operation, the same as 3+5 is a single operation. But if the problem is 2+5-9+21, the new instruction won't help, and the solution will be a three-step operation (two additions, one subtraction), unless there is also an instruction that can treat a three-fold addition-subtraction problem as one operation. So in that situation an older platform should perform similarly to a newer one at the same operating frequency.
Well, yes and no. Yes, FMA does a multiply and an add at once, though most of the time when you see "FMA", including for Haswell, it refers to floating-point, not integers. And yes, 2+5-9+21 requires three operations. However, some platforms will do this faster than others, because of varying levels of instruction-level parallelism.
Very old processors just worked on one instruction at a time. Add, add, add, three cycles. Very simple. I think this lasted up to about the 286 or 386 for Intel.
Around the 486, Intel started using a pipeline. Think of separate units to decode an instruction, load data, add, and store results. This led to faster clock cycles, but required more cycles - four in my hypothetical example - to complete a single instruction. Leave those instructions in order, and each depends on the previous one, so that takes 10 cycles (4 + 3 + 3: with no branch instructions the decodes can proceed without delay, so each dependent instruction adds only 3 cycles). But write your code like this:
a = 2+5
b = 9+21
c = a-b
and the first add and the second add run almost in parallel, so that takes 4+1+3 or 8 cycles. (The subtract needs to wait for both adds to be done.)
Then along came the Pentium with another wrinkle. It could decode and run two simple instructions at the same time, as long as one wasn't dependent on the other. Original code with my 4-stage pipeline: still 10 cycles. Modified code: now only 7 cycles.
Modern processors can, at least in theory, run up to about half a dozen independent instructions of specific types in one cycle. And latency has also been reduced so that, unless there's a conditional, results from one step are ready much more quickly for the next step. Anything from about Core 2 up would run the original code in 3 cycles (one per operation, since each depends on the previous result), and the modified code in just 2 cycles!
Modern processors can also search for independent instructions over a wide range - getting wider with each CPU version. So in your original code, a hypothetical modern processor might have 3*7 and 2+5 run at the same time for cycle 1, more of 3*7 (multiplies almost always take longer than one cycle) and (the-result-of-2+5)-9 in cycle 2, and (the-result-of-3*7)+2 and (the-result-of-2+5-9)+21 in cycle 3.