IntelUser2000
Elite Member
I don't get the first sentence IDC. Stronger compared to what?
Update: I get what you mean. Yea that seems right
So with Bulldozer one thread is 25% stronger than the other right?
But since the second thread is "close enough" in power we just call the Bulldozer module "dual core" instead of single core with super strong hyperthreading.
-snip-
Imagine you had your PC and it had to do these instructions, "A+B+C+D", and would take 2 seconds.
Now you have 2 of those lists "A+B+C+D" and "E+F+G+H". A single core would take 4 seconds.
A dual core of the first CPU will do both in 2 seconds for a total CPU time of 4 seconds.
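The arithmetic in that toy example can be sketched out directly (the 2-second timings are the example's illustrative figures, not measurements):

```python
# Toy model of the "A+B+C+D" example above. Each list of four adds
# takes 2 seconds on one core; the numbers are illustrative only.
per_list_seconds = 2.0
lists = ["A+B+C+D", "E+F+G+H"]

# Single core: the lists run back to back.
single_core_wall = per_list_seconds * len(lists)

# Dual core: one list per core, running in parallel. Wall-clock time
# halves, but total CPU time (summed across both cores) stays the same.
dual_core_wall = per_list_seconds
dual_core_cpu = per_list_seconds * len(lists)

print(single_core_wall, dual_core_wall, dual_core_cpu)  # 4.0 2.0 4.0
```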
-snip-
See, it's that semantics thing again.
Since they're positioning one module against Intel's dual core, and one module scales at about 80%, they're positioning 1.8 AMD cores against 2.0 Intel cores. The real question is, how's the $/mm^2 gonna be?
The whole idea here is to intentionally create lower-cost cores while minimizing the reduction in throughput. It is a tradeoff, and it's not a tradeoff made so that a dual-core Bulldozer (a single Bulldozer module) will suddenly have higher throughput (higher performance) than a CMP-based quad-core (or dual-core, for that matter).
20% higher IPC though?
Think about it, we are trying to compare a 4C/4T bulldozer to a 4C/4T deneb and we want to convince ourselves that the bulldozer will have higher performance at the same clock?
It has already been said that "at best" the four threads on bulldozer will give a thread scaling of 1.8^2 = 3.24x owing to CMT-based limitations whereas a four threads on deneb (being CMP-based architecture) can still deliver 4x scaling.
So at the same clockspeed we want to believe that bulldozer architecture will be improved enough over that of Deneb that 3.24x thread scaling will still yield higher performance than 4x thread scaling...that requires the bulldozer cores (each core, not each module) to have 23.5% higher throughput above and beyond that of Deneb's architecture.
And that is a minimum throughput improvement criterion since we used the best-case scenario number of 1.8x thread scaling per module.
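The scaling argument above reduces to a one-line calculation; here is a quick sketch using the post's own 1.8x-per-module assumption:

```python
# Assumptions from the post: two CMT modules compound to 1.8^2 = 3.24x
# for four threads, while a CMP quad (Deneb) scales a full 4x.
bd_scaling = 1.8 ** 2     # four threads on two Bulldozer modules
cmp_scaling = 4.0         # four threads on four Deneb cores

# Per-core throughput uplift Bulldozer needs just to break even
# with Deneb at the same clock.
required_uplift = cmp_scaling / bd_scaling - 1
print(f"{required_uplift:.1%}")  # 23.5%
```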
You guys have really really tall expectations for the prospects of IPC improvements in sequential microarchitectures. Not that it can't be done, but those kinds of IPC improvements just seem unrealistic.
I see that. But the problem I'm seeing is that AMD is reducing the performance of 2 cores ("we'll give you 2 cores with 1.8 cores' worth of performance"), while Intel is improving the performance of each core ("here's a quad core with 4.6x performance").
All comes down to cost. AMD's cheap Athlon II X4 cores are significantly slower than Intel E8400 cores, but it's still a good processor because the price is right. If these newer and even more crippled AMD processors are anything like the Athlon II X4, I'll be first in line to get one.
Unless the smallest Bulldozer they'll ever sell is a 2-module "quad" part, in which case I'd call it a dual core with 3.8x performance, and AMD would do pretty well.
Which may or may not be a bad thing. This would be comparable to an Intel Q8200 and E8400 having a similar number of transistors but a different number of cores. Both have their advantages and disadvantages. Really, it sounds to me like what AMD is doing is less like hyperthreading and more like doing a tug-o-borrow between the cores.
Yeah, but since when do multicore CPUs offer perfect scaling, even with the best multithreaded software out there?
Although that point is probably moot since I believe AMD said that performance number is relative to existing cores.
Finally there is one very important thing missing from these equations: clock speed. This will be on the new 32nm SOI HiK gate first process, which appears to have very good characteristics. Should know more on that early next year. Maybe this thing really sings with a high clock.
But what I want to know is how the new bulldozer design is going to affect single threaded performance. Will a module utilize only half of its resources with a single thread, or will it be able to take advantage of resources that might otherwise be idle?
Probably they are unrealistic.
Sometimes we get surprises, though.
But do you reckon a 4-core Bulldozer (2 modules) will be slower than a Phenom II X4?
And so, 16 Bulldozer cores is 12.96 cores vs Interlagos' 12?
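The 12.96 figure follows from the same 1.8x-per-module convention used earlier in the thread, if each 4-core (2-module) group is treated as 1.8^2 = 3.24 core-equivalents:

```python
# 16 Bulldozer cores = four groups of 2 modules (4 cores each).
# Using the thread's 1.8^2 = 3.24x scaling per 2-module group:
groups = 16 // 4
effective_cores = groups * 1.8 ** 2
print(round(effective_cores, 2))  # 12.96 "core equivalents" vs 12 full CMP cores
```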
I don't understand why we are saying these new processors are crippled.
Bulldozer isn't a Deneb core.
The number of instructions for 1 integer core is going to be increased.
AMD already increased IPC from Athlon X2 to Phenom X2 and then again to Athlon II X2, for example. Athlon II X2 is like 10-30% faster at the same clocks than Athlon X2, depending on the application.
I don't see why a Bulldozer can't be 20% faster than a Deneb (although AMD didn't say the Magny-Cours vs Interlagos comparison was at the same clock speed, so it could be at the same power instead).
EDIT: Yes I meant Magny-Cours (12 cores) vs Interlagos (16 cores).
It is a new architecture. For example each Bulldozer core is 4-way INT and 2 to 4-way FP (depending on sharing), up from the 3-way each in K8/K10 today.
So I don't really know if we can assume a module is worse than a dual-core phenom II.
I don't think it's crippled at all. Just putting 2 of these together 'on die'* will be a threaded titan, and it seems to be implemented well so far from what little we can see. The int pipes are where the data spends most of its time, it seems. Doubling those up at the expense of a much-underutilized FP unit seems to make a lot of sense, while beefing up the other one. And if the design scales well up to 8 modules, it could be a pretty big leap in performance. Should be interesting.
* IIRC, JF has described Bulldozer variations as a number of Bulldozer modules on die.
I thought that benchmark was a purely integer-calculation-driven benchmark. Wouldn't that make the "remove 20% improvement because the two cores are sharing a single FPU" arrangement less meaningful? You have to look at what is being measured here, not just blindly say that each additional core is at best 80% as efficient as the previous core. Plus, the 80% figure was never meant to be a best-case scenario; it was just a guess based on the above statement. The integer pipeline was improved immensely, as stated previously by a few posters. To say that it improved by 20% doesn't sound like a huge stretch to me, at least, especially for a brand-new architecture with years of R&D focused on that area. Especially when Intel has already proven that this kind of integer performance can be had with their architecture.
I have a headache from the flu and some dental work, and it is difficult to think straight, so maybe I am not understanding your logic here. I think I will need to PM you to figure out why you feel that a 20% improvement in integer calculations is unreasonable, when the improvements seem to point to a much higher theoretical improvement in that area. I probably just don't have a good understanding of that benchmark and what it is measuring. I also don't have a great understanding of the BD architecture, but I swear I remember reading that the IPUs and FPUs both have double the bandwidth of the Shanghai architecture.
Ok, I think I see where the confusion came from.
The sole reason for multi-threading technologies is to increase the utilization of the functional units.
A single core Core 2 processor might be 20% faster per clock in single thread apps than a dual core Celeron processor. But because a Celeron is a dual core, it'll be faster than Core 2 in well multi-threaded apps.
Well, the Bulldozer module can switch between a single core Core 2 and a dual core Celeron depending on the app.
You are saying this because the second smaller "core" in the Bulldozer module becomes the bottleneck right?
If only they could "Turbo" that second core? Or "Turbo" the primary core when the second core isn't needed?
Except I don't think it can quite switch like that... it's just that if you're only running one "core", you also get access to the whole FPU, whereas if you're using both cores you still only have access to one FPU (because that's all there is to have).
It was estimated that the multithreading performance improvement would have been limited to 20% without these extensions.