I know about Amdahl's Law. But think about it. Do you think the efficiency of current multi-core implementations reaches anywhere NEAR 100%? How much can they pull from that before Amdahl's Law (which says the amount of single-threaded code limits multi-threaded scaling) becomes a significant hindrance?
Compare the E8400 to the Q9650, the closest dual-core to quad-core comparison you can EVER get from Anand's beta Bench test above. The average improvement turns out to be 45%. If they can get heavily multi-threaded apps from 70% scaling to near 100% and lightly multi-threaded apps from 25% to 70%, I can't say that'll be bad at all.
The subject of heterogeneous processing resources and speedup was the topic of a chapter in my dissertation; I would argue that I have put a considerable amount of thought into it, and my posts here are an attempt to share some of the fruits of all that thinking.
The infrastructure for characterizing the contributions of various hardware and software limitations to overall speedup exists; you just have to use it correctly if you intend to extract meaningful interpretations from the results.
To determine the hardware-based limitations to thread scaling, you must first deconvolve the scaling data and remove the portion of imperfect thread scaling that is application/software dependent. That is what Amdahl's law helps us characterize.
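For reference, Amdahl's law gives the ideal speedup S(n) on n cores for a workload whose parallelizable fraction is p:

S(n) = 1 / ((1 - p) + p/n)

The (1 - p) serial fraction is the piece that never shrinks no matter how many cores you throw at it, and it is the software-dependent portion I am talking about.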
Take the Euler3D example above. Cursory analysis of the scaling data available so far indicates this code is semi-coarse-grained: roughly 96% of the computational effort can be performed in parallel, while roughly 4% of the computations are serial in nature.
That sets the upper limit of what we would call "perfect scaling" for a hardware solution at the Amdahl limit (the thick red line in my graph), which is itself a function of the number of threads.
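To put numbers on that: plugging p = 0.96 into the formula above gives best-case speedups of S(2) ≈ 1.92, S(4) ≈ 3.57, and S(8) = 6.25, and even with infinitely many cores the speedup can never exceed 1/(1 - p) = 25x.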
Regardless of your hardware efficiency, you simply cannot exceed the scaling limits imposed by Amdahl's law. And owing to hardware inefficiencies, we only lose scaling from there.
I take the time to belabor this point because it speaks to the crux of the issue when people refer to the "efficiency of current multi-core implementations" by way of absolute scaling numbers, numbers which are convolved with the ramifications of Amdahl's law.
With the Euler3D code you will never see 100% thread scaling, never, no matter the microarchitecture involved.
In the absolute best-case scenario, in which the interprocessor communication topology is infinitely fast (zero latency) and infinitely wide (infinite bandwidth) and the cores are absolutely identical and share no resources (no fetch/decode/cache contention), the best thread scaling you will ever see with Euler3D is:

1 core to 2 cores: 92%
2 cores to 4 cores: 85%
4 cores to 8 cores: 75%
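If you want to check those numbers yourself, here is a minimal Python sketch (the 0.96 parallel fraction is my estimate from the scaling data, as noted above) that applies Amdahl's law and reports the best-case gain from each doubling of cores:

```python
def amdahl_speedup(p, n):
    """Ideal speedup on n cores for a workload with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.96  # estimated parallel fraction of Euler3D (from the scaling data above)

for n in (1, 2, 4):
    # Best-case gain from doubling the core count, as a percentage
    gain = amdahl_speedup(p, 2 * n) / amdahl_speedup(p, n) - 1.0
    print(f"{n} -> {2 * n} cores: {gain:.1%}")
```

That prints 92.3%, 85.7%, and 75.0%, matching the figures above.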
Thread scaling efficiency only goes down from there, solely owing to the ratio of time spent on parallel computations versus those that must be done serially.
Further still, thread scaling efficiency declines because interprocessor communications are not infinitely fast and wide, and with the advent of hyperthreading and Bulldozer we see a further reduction in thread scaling because resource contention adds idle cycles to any given thread.
What we see in the Euler3D data is that thread scaling certainly suffers from resource contention on Nehalem with hyperthreading, no one is arguing any differently, but the results also show how much of that thread scaling inefficiency on Bloomfield can be eliminated by improving the hardware, be it by adopting an architecture like that of Opteron or by disabling HT.
The relevance of that statement is that we are here discussing the ramifications of the assured additional deterioration in thread scaling that Bulldozer modules will incur (JFAMD says the penalty is 20%) relative to a more efficient (for scaling purposes) architecture as seen in current Shanghai/Istanbul systems.
Provided the right kinds of data are generated (thread scaling data), we have the tools to produce the sorts of analyses that isolate the thread scaling inefficiencies attributable to software from those attributable to hardware.
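As a rough sketch of what that isolation can look like in practice (this is my own illustration with hypothetical names and numbers, not any specific tool): fit Amdahl's law to the measured speedups to estimate p, then divide each measured speedup by the Amdahl limit at the same thread count. Whatever falls short of 1.0 is the hardware's share of the loss.

```python
def amdahl_speedup(p, n):
    """Ideal speedup on n cores for a workload with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

def hardware_efficiency(measured_speedup, p, n):
    """Fraction of the Amdahl-limited speedup the hardware actually delivers.

    1.0 means a perfect interconnect and no resource contention, so all
    remaining scaling loss is software (the serial fraction); anything
    below 1.0 is attributable to the hardware.
    """
    return measured_speedup / amdahl_speedup(p, n)

# Hypothetical example: a measured 3.1x speedup on 4 cores with p = 0.96
print(f"{hardware_efficiency(3.1, 0.96, 4):.0%}")  # ~87% of the Amdahl limit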