Originally posted by: IntelUser2000
Looks like it's not really Hyperthreading that drives the multi-thread performance increase but rather the core itself. The multi-core performance scaling must be much better on Core i7 than Core 2. Increased bandwidth, better thread synchronization, L3 acting as a buffer for multi-threading.
Actually if you crunch the numbers to cast them as true "thread scaling" numbers as most folks in the field would do then you get this:
Threads    i7-965 w/o turbo    QX9770    Amdahl for 81% parallelized code
1          1.00                1.00      1.00
2          1.68                1.68      1.68
4          2.44                2.48      2.55
8          3.57                2.44      3.43
So what is really amazing to me is that QX9770 scales 1 to 4 threads in this application no better or worse than Nehalem scales.
4 threads on QX9770 gives you 2.48x speedup while 4 threads on the i7 965 gives you 2.44x speedup.
The extra 4 logical cores on Nehalem are clearly quite functional, though: going from 4 to 8 threads raises the speedup from 2.44x to 3.57x, an additional 1.13x relative to single-thread performance (i.e. 46% faster with 8 threads than with 4 threads) when all 4 SMT cores are put to full use.
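For anyone who wants to check that arithmetic, here is a minimal sketch. The dictionary simply copies the i7-965 column from the table; the names `i7_965` and `relative_gain` are made up for illustration.

```python
# Speedups relative to one thread, copied from the i7-965 column of the table.
i7_965 = {1: 1.0, 2: 1.68, 4: 2.44, 8: 3.57}

def relative_gain(speedups, lo, hi):
    """Return (absolute difference, ratio) between two thread counts."""
    return speedups[hi] - speedups[lo], speedups[hi] / speedups[lo]

diff, ratio = relative_gain(i7_965, 4, 8)
print(f"extra speedup over single-thread baseline: {diff:.2f}x")
print(f"8 threads vs 4 threads: {100 * (ratio - 1):.0f}% faster")
```

The first number is the difference in speedup (both measured against one thread), the second is the 8-thread result divided by the 4-thread result.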
So it is quite the opposite conclusion IMO... SMT is clearly value-add in this application, but the migration from FSB to QPI + IMC + 3-channel + monolithic yielded zero improvement (relative to the 1600MHz FSB and dual-channel memory of the QX9770) to the interprocessor communication speeds as far as this application is concerned.
edit: I meant to add this nugget: this manner of scaling, where both processors show the same scaling despite radically different methods of core-to-core communication, is typical of serial-code-limited coarse-grained applications.
The scaling limitation is not interprocessor communication speed but rather that only 81% of the code in this application is parallelizable (yes, that's a word). I added the Amdahl scaling results to the table above.
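A quick sketch of the Amdahl numbers, for reference. Amdahl's law gives speedup = 1 / ((1 - p) + p / n) for parallel fraction p and n threads. Backing p out of the measured 2-thread speedup is just one assumed way of arriving at 81%, and the function names are made up for illustration.

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n threads when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

def parallel_fraction(speedup, n):
    """Invert Amdahl's law: the p implied by a measured speedup on n threads (n > 1)."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)

# Reproduce the Amdahl column of the table for p = 0.81.
amdahl_column = {n: round(amdahl_speedup(0.81, n), 2) for n in (1, 2, 4, 8)}
print(amdahl_column)

# Assumption: fit p to the measured 2-thread speedup of 1.68.
p_fit = parallel_fraction(1.68, 2)
print(round(p_fit, 2))
```

With p = 0.81 this gives 1.0, 1.68, 2.55 and 3.43 for 1, 2, 4 and 8 threads, matching the table, and the 2-thread measurement implies p of about 0.81.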
The remaining speedup scaling discrepancy can be attributed to interprocessor communication not being infinitely fast... and accounting for it would require adding Almasi and Gottlieb's modification to Amdahl's law to the scaling equation.