Yeah, but since when do multicore CPUs offer perfect scaling, even with the best multithreaded software out there?
Although that point is probably moot since I believe AMD said that performance number is relative to existing cores.
Finally, there is one very important thing missing from these equations: clock speed. This will be on the new 32nm SOI high-k gate-first process, which appears to have very good characteristics. We should know more about that early next year. Maybe this thing really sings at a high clock.
The reduction in thread scaling (2x -> 1.8x) is intended to reflect that thread scaling is that much lower than it would otherwise have been if there were no shared resources at the core/module level (i.e., pure CMP).
So an application like LinX, which scales to ~3.5x with four threads on four CMP cores, would be expected to scale to roughly 3.5 x 0.9 = 3.15 on a CMT-styled architecture like that of BD.
It's not that the BD architecture magically makes all apps scale to 90% while CMP-based architectures perform even worse; it's that it scales somewhat less well than CMP, but the idea is that you can potentially have more cores (because they are each smaller) to throw at the problem.
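To make the arithmetic concrete, here's a minimal sketch of that estimate (the 0.9 factor and the LinX ~3.5x number are just the figures from above, and it assumes every module is running two threads):

```python
# Rough model: expected multi-threaded scaling on a CMT (Bulldozer-style)
# design, derived from observed scaling on an equivalent CMP design.
# The 0.9 factor is the assumed per-module penalty when both cores of a
# module are loaded (2.0x -> 1.8x), per the discussion above.

def cmt_scaling(cmp_scaling: float, module_penalty: float = 0.9) -> float:
    """Scale factor expected on CMT, given the scale factor measured on CMP
    cores, assuming all modules run two threads and share front-end/FPU."""
    return cmp_scaling * module_penalty

# Example from the post: an app like LinX scaling ~3.5x over 4 CMP cores
print(round(cmt_scaling(3.5), 2))   # ~3.15x expected on 2 BD modules (4 cores)
```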
Regarding clockspeed... we have been explicit that we are talking about IPC here. Yes, a clockspeed advantage would create an even larger performance advantage for BD, but we aren't talking about that yet. We are just trying to ascertain whether we should really expect a BD module (2 cores) to have greater dual-threaded throughput than an Athlon II X2 or Phenom II X2 (Deneb-based) dual-core chip.
I don't say that to dismiss your points about clockspeed; I think your points are spot-on.
But what I want to know is how the new Bulldozer design is going to affect single-threaded performance. Will a module utilize only half of its resources with a single thread, or will it be able to take advantage of resources that might otherwise be idle?
That is the advantage of CMT with Bulldozer... if a module gets loaded with only one thread, then the aggregate of all the otherwise-shared resources in the module is solely at the disposal of enhancing the IPC of that one thread.
How much benefit? Could be 5-10%. Basically the opposite of the thread-scaling penalty experienced when the shared resources are actually being shared.
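If it helps, here's a toy model of a single module's throughput under those assumptions (the ~7% single-thread boost is just a midpoint of the 5-10% guess, and the 0.9 factor is the fully-loaded scaling penalty discussed earlier; neither is a measured number):

```python
# Toy model of one BD module's throughput relative to one baseline
# (Deneb-class) core at the same clock. All numbers are assumptions
# taken from this thread.

SINGLE_THREAD_BOOST = 1.07   # assumed midpoint of the 5-10% range
MODULE_PENALTY = 0.9         # 2.0x -> 1.8x when both cores are busy

def module_throughput(threads: int) -> float:
    if threads == 1:
        return 1.0 * SINGLE_THREAD_BOOST   # one thread gets the whole module
    if threads == 2:
        return 2.0 * MODULE_PENALTY        # two threads share the module
    raise ValueError("a module runs at most two threads")

print(module_throughput(1))  # ~1.07x a baseline core
print(module_throughput(2))  # ~1.8x a baseline core
```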
Probably they are unrealistic.
Sometimes we get surprises, though.
But do you reckon a 4-core Bulldozer (2 modules) will be slower than a Phenom II X4?
And so 16 Bulldozer cores would be 12.96 cores vs. Interlagos' 12?
At the same clockspeed? Yes. But not by much, and we must consider that those 4 BD cores could be as much as 33% smaller in aggregate die area compared to the 4 Deneb cores we are comparing them to.
Remember, with CMT your throughput (performance) goes down some (AMD says ~20% when the modules are loaded with two threads versus what it would be if the modules were full CMP cores with no shared resources), but your cost (die area) goes down even more.
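A quick perf-per-area back-of-envelope using the figures thrown around in this thread (both the ~20% throughput hit and the ~33% area saving are approximate claims, not measurements):

```python
# Perf-per-area comparison at the same clockspeed and core count.
# Both inputs are rough figures from this thread, not measured numbers.

cmp_perf, cmp_area = 1.00, 1.00        # 4 Deneb-class CMP cores (baseline)
cmt_perf = 0.80                        # AMD's ~20% figure for fully loaded modules
cmt_area = 1.00 - 0.33                 # ~33% smaller aggregate die area

print(cmp_perf / cmp_area)             # 1.00 perf per unit area (baseline)
print(round(cmt_perf / cmt_area, 2))   # ~1.19 -> more throughput per unit area
```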
Also remember that Interlagos is BD-based. Are you thinking of Magny-Cours (the 12-core part based on 45nm Lisbon cores, like Deneb)? If that is the case, then yes, I think at the same clockspeed 16-core Interlagos will be comparable to 12-core Magny-Cours (in 16+ threaded apps, obviously) but will occupy a vastly smaller die area (it would even if Interlagos were 45nm, but it is 32nm, so it will be smaller still).
And because Interlagos will be 32nm, I expect clockspeeds to be upwards of 30-40% higher than Magny-Cours, so for the reasons piesquared mentioned, the actual performance advantage of Interlagos versus Magny-Cours could be quite substantial (30-40%).
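As a sanity check on that arithmetic, here's a minimal sketch assuming same-clock throughput is roughly a wash (as described above) and the 30-40% clock uplift holds:

```python
# Back-of-envelope: 16-core Interlagos vs 12-core Magny-Cours in a 16+
# threaded workload. Per the post, throughput at equal clockspeed is
# roughly comparable, so the assumed 30-40% clock advantage from 32nm
# is the dominant term in the overall gap.

same_clock_ratio = 1.0                    # assumption: "comparable" at equal clocks
for clock_advantage in (1.30, 1.40):      # assumed clock uplift range
    overall = same_clock_ratio * clock_advantage
    print(f"{clock_advantage:.2f}x clock -> ~{overall:.2f}x overall")
```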