Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.


hamunaptra

Senior member
May 24, 2005
929
0
71
Yeah, and not to mention: having more than one program open easily takes advantage of your multicore CPU. Same with Windows; even it is multithreaded.
Heavily threaded software is just going to become more and more common in the future; it has happened in recent years and nothing is gonna stop it.

It's easier to tack on more cores than to build a single core with insane IPC at a high clock speed; the former traditionally uses less power if designed right, and the latter... takes a crapload of power at this point in time.
 

drizek

Golden Member
Jul 7, 2005
1,410
0
71
price, performance, and watts are the only things that affect a consumer. that would be... price, 0-60 and fuel mileage. so :p

That's not true. Large die sizes mean greater manufacturing costs, which means greater environmental impact from building the chip. Consumers should penalize companies (and I think Nvidia is worse at this than Intel or AMD) for building excessively sized dies. An Intel example would be the wasted die space on SB for the IGP and HT that they disable. For AMD it would be the die space for the disabled cores in Phenom II (although they make it easy enough to re-enable them).
 

bryanW1995

Lifer
May 22, 2007
11,144
32
91
There won't be a difference between the 2 cores in a module... so it's not like 1 core will be different from the one next to it. And they're both "real" cores; they simply share some stuff (which doesn't impact performance when you're only using 1 thread in a module, and when you're using all your threads... well... does Intel make 16-core/16-thread CPUs? Any program that can use that many threads will be extremely fast on these processors.)

Basically this means Bulldozer will be really fast both in lightly threaded applications and in ones that can make use of many threads.

So what happens when a 2nd or 3rd core is needed? Will the OS just automatically assign it to the next core in line (i.e., the 2nd core in the first module), or assign it to the first core in the next module? Will it be software dependent, OS dependent, etc.? I can see this working out really well if the threads can be prioritized properly, like this: 1-5, 2-6, 3-7, 4-8, instead of 1-2, 3-4, 5-6, 7-8. Going from JFAMD's comment that the 2nd core, when fully utilized, will only act as 80% of a theoretical standalone core, it seems that this software optimization could make or break BD.
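(For the curious: on Linux you can already force either policy by hand with CPU affinity. A minimal sketch, assuming cores 0/1 share module 0, cores 2/3 share module 1, and so on; the actual core-to-module numbering is up to the BIOS/OS and may differ.)

Code:
import os
from multiprocessing import Process

SPREAD_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]  # one core per module first
PACK_ORDER = [0, 1, 2, 3, 4, 5, 6, 7]    # fill both cores of a module first

def worker(core):
    os.sched_setaffinity(0, {core})  # bind this worker to a single core (Linux-only)
    # ... real work would go here ...

if __name__ == "__main__":
    procs = [Process(target=worker, args=(SPREAD_ORDER[i],)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()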
 

bryanW1995

Lifer
May 22, 2007
11,144
32
91
That's not true. Large die sizes mean greater manufacturing costs, which means greater environmental impact from building the chip. Consumers should penalize companies (and I think Nvidia is worse at this than Intel or AMD) for building excessively sized dies. An Intel example would be the wasted die space on SB for the IGP and HT that they disable. For AMD it would be the die space for the disabled cores in Phenom II (although they make it easy enough to re-enable them).

but if you re-enable them then you're using more power, thus causing more pollution in the atmosphere, leading to global warming and the destruction of the polar bear habitat. thus:

YOU ARE A POLAR BEAR KILLER!! /sarcasm
 

hamunaptra

Senior member
May 24, 2005
929
0
71
So what happens when a 2nd or 3rd core is needed? Will the OS just automatically assign it to the next core in line (i.e., the 2nd core in the first module), or assign it to the first core in the next module? Will it be software dependent, OS dependent, etc.? I can see this working out really well if the threads can be prioritized properly, like this: 1-5, 2-6, 3-7, 4-8, instead of 1-2, 3-4, 5-6, 7-8. Going from JFAMD's comment that the 2nd core, when fully utilized, will only act as 80% of a theoretical standalone core, it seems that this software optimization could make or break BD.

Try not to think of it as 100% on one core and 80% on the other.
Think of it in terms of modules. If one core is being used in a module, it gets 100% of its performance.
If both cores in a module are being used equally, then it's more like each core is working at 90% of its 100%.

That works out to 180% of 200% for the module (200% being what two fully independent cores would give), which equals 90% per module =)

So basically, say your entire CPU is composed of 4 modules, or 8 cores (800% in these terms). If you have a full 8-thread workload of identical threads, your output would be 720% of that 800%; in terms of the entire CPU, you're getting 90% of its theoretical performance.
If you're just running 4 threads, 1 on each module, you are getting 400% of that 800%, so an even 50%.
If you are running 1 thread, it gets its full 100%, which is 100% of the 800%, or 12.5% of the whole CPU.
If there happen to be 4 threads packed as 2 pairs on 2 modules, then you are getting 360% of the 800% performance, or in other words 45%. hahah...

Anyways, that's a basic rundown.
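If you want to play with the numbers yourself, here's a tiny sketch of that model (the 90% pairing factor is AMD's claim; everything else is just arithmetic):

Code:
MODULES = 4
PAIR_SCALING = 0.9  # per-core speed when both cores in a module are busy (AMD's claim)

def throughput(threads, pack=False):
    """Total throughput in single-core units for identical threads."""
    if pack:  # fill both cores of a module before moving on
        pairs, singles = divmod(threads, 2)
    elif threads <= MODULES:  # spread: one thread per module first
        pairs, singles = 0, threads
    else:
        pairs = threads - MODULES  # these modules run two threads
        singles = MODULES - pairs  # these still run one
    return singles * 1.0 + pairs * 2 * PAIR_SCALING

for label, t, pack in [("8 threads", 8, False), ("4 spread", 4, False),
                       ("4 packed", 4, True), ("1 thread", 1, False)]:
    print(label, throughput(t, pack) / (2 * MODULES))  # fraction of the 800%
# prints 0.9, 0.5, 0.45 and 0.125 -- the 90%/50%/45%/12.5% cases above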
 
Last edited:

busydude

Diamond Member
Feb 5, 2010
8,793
5
76
If both cores in a module are being used equally, then it's more like each core is working at 90% of its 100%.

How can you quantify the amount of work each core is performing if both are being used? A core working at 90% of its theoretical max... seems like a number pulled out of thin air. Don't get me wrong, I am not trying to question your intelligence... just trying to clear things up.
 

hamunaptra

Senior member
May 24, 2005
929
0
71
Well, that's what 180% of 200% basically comes down to. If both cores were theoretically fully saturated and working on the same type of data at the same time, they would each be able to do 90% of their individual performance: 180% module performance out of the module's theoretical 200%.
 

drizek

Golden Member
Jul 7, 2005
1,410
0
71
but if you re-enable them then you're using more power, thus causing more pollution in the atmosphere, leading to global warming and the destruction of the polar bear habitat. thus:

YOU ARE A POLAR BEAR KILLER!! /sarcasm

Not really. They don't use too much extra power when you re-enable the cores. I was hoping for some 45W triple cores, but they never came out (except for maybe some Athlons).
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
How can you quantify the amount of work each core is performing if both are being used? A core working at 90% of its theoretical max... seems like a number pulled out of thin air. Don't get me wrong, I am not trying to question your intelligence... just trying to clear things up.


The idea is that for 12% extra die space (you add another core that shares resources), you get 90% performance. The ~90% number is given by AMD, based on information they have from testing things out (I'm assuming).

so it'll look like this:

1-4 threads needed for a program = 100% utilisation
(the cores in each module won't have to share anything with the cores next to them; in fact, each now has all the resources that are enough to power 2 cores at 90%+ results)

so you get 4 cores, each with almost enough resources to power 2 cores fully.
(so 1-4 threads will be fast)


5-8 threads needed for a program = 5-8 threads running at 90% of what they're capable of.

so you get 8 cores, all sharing... not running as fast per core as above where they don't have to share, but due to the doubled-up numbers, using 8 threads instead of 4 will still provide a BIG increase in throughput (compared to only using 4).


So basically... this design of the CPU means it'll work really well both with lightly threaded stuff and with very highly threaded stuff. But that's not the only reason for it; another brilliant side effect of sharing is that they use much less power this way.
 
Last edited:

Fox5

Diamond Member
Jan 31, 2005
5,957
7
81
That quote was from 2006, and Intel's policy since the P4 has always been at least a 1% performance gain per 1% more power used. If it were inefficient, Atom would surely not use it.

And in certain cases the benefit can be way more than 20%. But it depends on the workload.

If BD does scale so well, I just hope it also has decent single-threaded performance, meaning a lot more than Phenom II. Phenom II is probably even slower than Core 2 Duo clock for clock, not to mention SB.

Hyper-threading is just a way of maximizing resource utilization. Atom uses SMT for that purpose, as does the Xbox 360's CPU. Out-of-order execution also attempts to maximize resource utilization. There is some manner of diminishing returns from using both, but either one alone can approach a 2x increase in performance.
 

GaiaHunter

Diamond Member
Jul 13, 2008
3,731
428
126
So basically... this design of the CPU means it'll work really well both with lightly threaded stuff and with very highly threaded stuff. But that's not the only reason for it; another brilliant side effect of sharing is that they use much less power this way.

AMD's design assigns resources according to their estimate of what future workloads will look like, allowing them to pack more relevant resources/performance into a smaller die.

The most obvious example is the integer-to-FPU ratio - instead of going 1 integer core + 1 256-bit FPU, AMD goes 2 integer cores + 1 2x128-bit FPU, since AMD expects 256-bit FP operations to be much less common than 128-bit ones and integer operations.

This will allow AMD to have 8 integer cores + 4 2x128-bit FPUs in a die only slightly larger than a traditional 4 integer cores + 4 256-bit FPUs.

The downsides are:

a) AMD might be wrong in their estimates, and so the 4 extra integer cores will be useless since the CPU will be bottlenecked by the shared resources;

b) shared resources have a performance toll even if the estimates are correct (180% instead of 200% performance, according to AMD);

c) on the rare workloads where the shared resources are the bottleneck, performance will suffer compared to those where the shared resources aren't the bottleneck (that is, instead of performing as an octo-core product it will perform as a quad-core or worse).
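A toy way to picture (c): treat each shared resource as a cap on the module's combined throughput. The numbers here are invented for illustration; AMD hasn't published per-unit limits.

Code:
def module_throughput(per_core_demand, frontend_cap=1.8):
    """Two cores each wanting per_core_demand (1.0 = one full core's worth),
    fed by a shared front end that can sustain frontend_cap cores' worth."""
    return min(2 * per_core_demand, frontend_cap)

print(module_throughput(1.0))                    # 1.8 -> the usual "180%" case
print(module_throughput(1.0, frontend_cap=1.0))  # 1.0 -> module degenerates to one core's worth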
 

JFAMD

Senior member
May 16, 2009
565
0
0
We have done enough profiling that you shouldn't see any bottlenecks. If I had to choose between implementations, sharing some front end (that is larger and wider) is a better bet than sharing execution pipelines.

The challenge on HT is that there is a limited amount of bandwidth available in the pipelines. Looking at some of the SPEC int rate numbers, it looks like a ~14% increase from HT, which means that workload is ~85% efficient at best. (Keep in mind there is some overhead...)

The real challenge is that software developers continue to try to squeeze more efficiency out of their products, so the benefit from HT drops. If you get to 90% efficient, your HT benefit drops to ~10% or less.

Now, you could easily boost the efficiency of HT by adding more pipelines. But that is no different from adding more cores. Would you rather have two cores with 4 pipelines or one core with 8 pipelines and HT?

Some of this obviously becomes a math problem at that point, but having more cores is the better choice, because an 8-pipeline core is going to suck up a huge amount of power even when it is not busy, while with 2 4-pipeline cores you could shut one down and cut the power in half during periods of low utilization.

Everything in life is a tradeoff.
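(The arithmetic behind those numbers, as a sketch: SMT can only fill the pipeline slots the first thread leaves empty, minus some sharing overhead. The 1% overhead figure is just a placeholder.)

Code:
def smt_gain(single_thread_efficiency, overhead=0.01):
    # whatever fraction of pipeline slots the first thread wastes,
    # a second SMT thread can reclaim, less the cost of sharing
    return max(0.0, (1.0 - single_thread_efficiency) - overhead)

for eff in (0.85, 0.90, 0.95):
    print(f"{eff:.0%} efficient code -> ~{smt_gain(eff):.0%} from SMT")
# 85% -> ~14%, 90% -> ~9%, 95% -> ~4%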
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
AMD's design assigns resources according to their estimate of what future workloads will look like, allowing them to pack more relevant resources/performance into a smaller die.

The most obvious example is the integer-to-FPU ratio - instead of going 1 integer core + 1 256-bit FPU, AMD goes 2 integer cores + 1 2x128-bit FPU, since AMD expects 256-bit FP operations to be much less common than 128-bit ones and integer operations.

This will allow AMD to have 8 integer cores + 4 2x128-bit FPUs in a die only slightly larger than a traditional 4 integer cores + 4 256-bit FPUs.

The downsides are:

a) AMD might be wrong in their estimates, and so the 4 extra integer cores will be useless since the CPU will be bottlenecked by the shared resources;

b) shared resources have a performance toll even if the estimates are correct (180% instead of 200% performance, according to AMD);

c) on the rare workloads where the shared resources are the bottleneck, performance will suffer compared to those where the shared resources aren't the bottleneck (that is, instead of performing as an octo-core product it will perform as a quad-core or worse).

First, someone who knows the BD architecture in full detail, and not only as presented at Hot Chips 22, once hinted that AMD still hasn't disclosed some important aspects of the architecture. So it's actually an even more difficult task to predict BD's behaviour and performance than the known architecture suggests. :)

a) What non-synthetic code exists that might push the front end to its limits without leaving room for code to run on the second core? Do we even know for sure that the front end and the int cores run at the same clock frequency? Further, what could be done with the power saved by not using some resources, if such a case even exists?

b) Is such a performance toll worse than using one big core instead of 2 smaller ones? Pollack's Rule suggests diminishing returns for throwing more transistors, and thus power, at a core. A different question is: how much capacity/throughput of any unit is left unused if it's not shared?
OTOH, SMT shares all resources, and if this results in 130% throughput we actually have 2 threads each executing at 65% of single-thread performance. Throughput increases, but the latency of a single thread (the time it takes to finish its calculations) increases by ~50%.

c) Workloads with such behaviour would have to live almost entirely in the low-level caches, since every stall on a missed memory request would leave room for another thread.

But in general this is not a black/white discussion of workloads with bottlenecked shared resources at only 100% (single-core performance levels) and workloads without a bottleneck at 200%. There will be 150%, 170%, 186%, 192%, 160% cases, etc., and each case itself will have phases with different levels. Just look at IPC diagrams of different codes.
 

GaiaHunter

Diamond Member
Jul 13, 2008
3,731
428
126
But in general this is not a black/white discussion of workloads with bottlenecked shared resources at only 100% (single-core performance levels) and workloads without a bottleneck at 200%. There will be 150%, 170%, 186%, 192%, 160% cases, etc., and each case itself will have phases with different levels. Just look at IPC diagrams of different codes.

No doubt about that.

I was just drawing boundaries for the worst-case scenarios (which might not even exist in practice due to other factors).

If you read some previous posts in this thread, I think people are trying to fit BD into the molds they are familiar with, dual core and single core with HT, while it probably fits neither.

I also believe, and hope, that the success of BD won't depend on software optimization / the OS evenly assigning workloads to cores in different modules as opposed to assigning workloads to different cores in the same module.

Maybe I should have said "the possible downsides are".
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
From what I've read about Bulldozer, the designs and the logic behind them seem very straightforward (these engineers are smart dudes).

I admit I'm kinda excited about BD; I more or less only see "upsides".

It's one hell of a big jump compared to the Phenom CPUs they've been making, and from reading JFAMD and the questions people asked him / got answered, it all makes sense (the design reasoning).



We know it'll kick Opteron's butt.

"Bulldozer will be faster in single thread performance and have more IPC than current offerings." (JF-AMD)

Since 1-8 threads won't share (on 16-core Bulldozers), we know these Bulldozers will be faster in any software that uses 1-8 threads.

We also know that the 16-core Interlagos will have more than 50% more throughput than the 12-core Opteron 6100 series. Again from quotes of JF-AMD.

The Opteron 6176 SE is by no means a "weak" processor, and if the new Interlagos has more than 50% on that... these Bulldozers will be some good performers, and will probably steal the limelight from the top Xeons in performance numbers.
 

hamunaptra

Senior member
May 24, 2005
929
0
71
Yeah, the key thing we don't know is at what clock speed they are comparing this performance increase.
I mean, that really doesn't matter all that much; if higher clocks are needed to pull off this performance increase, fine. The uarch is a high-speed design anyway, so if it can scale extremely well clock-wise, that will not be an issue.
Except for maybe marketing lol!
It's good to know its single-threaded performance is better than current offerings... let's hope he was also saying it's better clock for clock.
I believe, though, that it's probably not as good as Intel's offerings clock for clock, which is where the high-speed design of BD comes into play; in MT apps it should blow all competition out of the water.
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
Yeah, the key thing we don't know is at what clock speed they are comparing this performance increase.
I mean, that really doesn't matter all that much; if higher clocks are needed to pull off this performance increase, fine. The uarch is a high-speed design anyway, so if it can scale extremely well clock-wise, that will not be an issue.
Except for maybe marketing lol!
It's good to know its single-threaded performance is better than current offerings... let's hope he was also saying it's better clock for clock.
I believe, though, that it's probably not as good as Intel's offerings clock for clock, which is where the high-speed design of BD comes into play; in MT apps it should blow all competition out of the water.

We do know quite a few things that were updated in the Bulldozer architecture over the STARS architecture, based on what was stated during Hot Chips. These changes suggest to me that the IPC will indeed improve in Bulldozer over STARS. Since I am lazy, I will just quote what I had posted in another thread here.

I'm not sure what you mean by initial numbers, since I haven't seen a single benchmark from the Bulldozer architecture. I do know from the things that were released during Hot Chips about the architecture that the vast majority of the architecture has been improved from K10.

The front end has been completely overhauled, including the branch prediction, which is probably the most improved part of this architecture (it was a weakness for the STARS architecture, so how improved it is will have a big impact on Bulldozer's performance, since the new architecture has deeper pipelines). The branch target buffer now uses a two-level hierarchy, just like Intel does on Nehalem and Sandy Bridge. Plus, a mispredicted branch will no longer corrupt the entire stack, which means that the penalties for a misprediction are far less than in the STARS architecture. (Nehalem also has this feature, so it brings Bulldozer to parity with Nehalem wrt branch mispredictions.)

Decoding has improved, but not nearly as much as the fetching on the processor. Bulldozer can now decode up to four (4) instructions per cycle (vs. 3 for Istanbul). This brings Bulldozer to parity with Nehalem, which can also decode four (4) instructions per cycle. Bulldozer also brings branch fusion to AMD, which is a feature that Intel introduced with C2D. This allows for some instructions to be decoded together, saving clock cycles. Again, this seems to bring Bulldozer into parity with Nehalem (although this is more cloudy, as there are restrictions for both architectures, and since Intel has more experience with this feature they are likely to have a more robust version of branch fusion.)

Bulldozer can now retire up to 4 Macro-ops per cycle, up from 3 in the STARS architecture. It is difficult for me to compare the out-of-order engine between STARS and Bulldozer, as they seem so dissimilar. I can say that it seems a lot more changed than just being able to retire 33% more instructions per cycle. Mostly the difference seems to be moving from dedicated lanes using dedicated ALUs and AGUs, to a shared approach.

Another major change is in the memory subsystem. AMD went away from the two-level load-store queue (where different functions were performed in each level) and adopted a simple 40-entry load queue with a 24-entry store queue. This actually increases memory operations by 33% over STARS, but still keeps them ~20% below Nehalem. The new memory subsystem also has an out-of-order pipeline, with a predictor that determines which loads can pass stores (STARS had a *mostly* in-order memory pipeline). This brings Bulldozer to parity with Nehalem, as Intel has used this technique since C2D. Another change is that the L1 cache is now duplicated in the L2 cache (which Intel has been doing for as long as I can remember), although the L3 cache is still exclusive.

Bulldozer now implements true power gating, although unlike Intel, who gate at each core, they power gate at the module level. This shouldn't really affect IPC, but it might affect the max frequency, so it is a point to bring up when discussing changes to performance. The ability to completely shut off modules should allow higher turbo frequencies than we saw in Thuban, but we won't know what they are until we see some reviews.

Well, those are the main differences that I know of. Add that to the fact that this processor was actually designed for a 32nm process, versus a 130nm process like STARS, and you should see additional efficiencies. I expect a good IPC improvement, along with a large clockspeed boost, although I can't say how much; I really am looking more for parity with Nehalem-based processors than with Sandy Bridge-based ones.

References:
Butler, Mike. "Bulldozer" A new approach to multithreaded compute performance. Hot Chips XXII, August 2010.

http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=1

I made a couple corrections, based on the fact that I had a typo and an error in my original post, but if you want to see the original discussion, you can read it here: http://forums.anandtech.com/showthread.php?p=31118035&highlight=#post31118035
 

bryanW1995

Lifer
May 22, 2007
11,144
32
91
We have done enough profiling that you shouldn't see any bottlenecks. If I had to choose between implementations, sharing some front end (that is larger and wider) is a better bet than sharing execution pipelines.

The challenge on HT is that there is a limited amount of bandwidth available in the pipelines. Looking at some of the SPEC int rate numbers, it looks like a ~14% increase from HT, which means that workload is ~85% efficient at best. (Keep in mind there is some overhead...)

The real challenge is that software developers continue to try to squeeze more efficiency out of their products, so the benefit from HT drops. If you get to 90% efficient, your HT benefit drops to ~10% or less.

Now, you could easily boost the efficiency of HT by adding more pipelines. But that is no different from adding more cores. Would you rather have two cores with 4 pipelines or one core with 8 pipelines and HT?

Some of this obviously becomes a math problem at that point, but having more cores is the better choice, because an 8-pipeline core is going to suck up a huge amount of power even when it is not busy, while with 2 4-pipeline cores you could shut one down and cut the power in half during periods of low utilization.

Everything in life is a tradeoff.

Any comment on how additional threads will be prioritized across different modules, if at all? I'm not sure, but I seem to remember you mentioning last year that you guys were working with software companies to push threads 2/3/4 onto other modules. Is that plan working, is it moot b/c the computer will automatically do it anyway (like it does with HT), etc.?
 

drizek

Golden Member
Jul 7, 2005
1,410
0
71
So if that works as planned, will running 4 simultaneous threads mean that you will have all 8 cores running at full voltage/frequency? So is there a performance/watt hit if you are running 1-4 threads compared to 5-8 threads?
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
Any comment on how additional threads will be prioritized across different modules, if at all? I'm not sure, but I seem to remember you mentioning last year that you guys were working with software companies to push threads 2/3/4 onto other modules. Is that plan working, is it moot b/c the computer will automatically do it anyway (like it does with HT), etc.?

One other thing you have to realize is that since the processor is power gated at the module level, it may actually be preferable to have threads bunched together on the same module, even with the minor penalties they get from sharing resources. The reason is that if they are grouped in the same module, the other modules can be disabled, and the active module gets a higher Turbo Core boost, which may make up for the penalties from sharing resources.
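(To see why, put rough numbers on it. A minimal sketch with invented figures; the real TDP split and boost curves aren't public.)

Code:
TDP = 95.0   # package power budget in watts (illustrative)
MODULES = 4

def watts_per_active_module(active):
    # power gating returns the idle modules' share of the budget
    # to whichever modules are still running
    return TDP / active

print(watts_per_active_module(4))  # 23.75 W each: all modules busy, no headroom
print(watts_per_active_module(1))  # 95.0 W: one busy module gets the whole budget to boost with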
 

JFAMD

Senior member
May 16, 2009
565
0
0
Any comment on how additional threads will be prioritized across different modules, if at all? I'm not sure, but I seem to remember you mentioning last year that you guys were working with software companies to push threads 2/3/4 onto other modules. Is that plan working, is it moot b/c the computer will automatically do it anyway (like it does with HT), etc.?

No comment on work we are doing with other companies, never allowed to comment on that.

So if that works as planned, will running 4 simultaneous threads mean that you will have all 8 cores running at full voltage/frequency? So is there a performance/watt hit if you are running 1-4 threads compared to 5-8 threads?

see next comment.

One other thing you have to realize is that since the processor is power gated at the module level, it may actually be preferable to have threads bunched together on the same module, even with the minor penalties they get from sharing resources. The reason is that if they are grouped in the same module, the other modules can be disabled, and the active module gets a higher Turbo Core boost, which may make up for the penalties from sharing resources.

Yes, people are starting to come around. Everyone was getting all wrapped around "how do I spread my threads out across modules so that I have one thread per module?" Yes, you get a performance increase with that, but it is marginal. However, running threads on the same module allows for a) sharing of the L2 cache for apps that are utilizing the same data set, and b) the other modules to be shut down, reducing power and increasing the ability to boost.

Lots of people don't get it. You have a maximum amount of power that the processor can consume. You may be better off concentrating the power on fewer modules to achieve higher clocks than trying to spread threads out to get 100% of the L2 resources.

Ultimately all of this becomes really academic, because threads start and finish at different times. Fire up a program and it might instantly utilize all of the threads, but once it starts running, each thread is going to start and stop at a different time. Take a look at an F1 race: every car starts out at the same place at the same time, then some win by multiple laps and they never finish in order.

Too many people focus on the theoretical and orderly, and not on the reality of how things are processed.
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
No comment on work we are doing with other companies, never allowed to comment on that.


Will the OS be module-aware at all? I know you've said no a thousand times, but I was under the impression (I could be wrong) that Windows can distinguish between physical cores and HT 'cores'.

Or should we really just drop this and stop caring, since anything using 4 threads will probably scale up to 8 anyway, and with Turbo CORE the true difference will be negligible (or packing threads into modules may even be preferable)?


Of course, this raises the question: if it really IS preferable to pack threads onto the smallest number of modules possible, will the OS be able to do that?
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

1. Core i7 refers to Nehalem, Westmere, and Sandy Bridge, which vary in performance.
2. Core i7 is more than 50% faster than Phenom II... so how can it be 50% faster than both?
3. I will believe it when I see it.
 

bryanW1995

Lifer
May 22, 2007
11,144
32
91
Quote:
Originally Posted by bryanW1995
Any comment on how additional threads will be prioritized across different modules, if at all? I'm not sure, but I seem to remember you mentioning last year that you guys were working with software companies to push threads 2/3/4 onto other modules. Is that plan working, is it moot b/c the computer will automatically do it anyway (like it does with HT), etc.?

No comment on work we are doing with other companies, never allowed to comment on that.


Quote:
Originally Posted by drizek
So if that works as planned, will running 4 simultaneous threads mean that you will have all 8 cores running at full voltage/frequency? So is there a performance/watt hit if you are running 1-4 threads compared to 5-8 threads?

see next comment.


Quote:
Originally Posted by Martimus
One other thing you have to realize is that since the processor is power gated at the module level, it may actually be preferable to have threads bunched together on the same module, even with the minor penalties they get from sharing resources. The reason is that if they are grouped in the same module, the other modules can be disabled, and the active module gets a higher Turbo Core boost, which may make up for the penalties from sharing resources.

Yes, people are starting to come around. Everyone was getting all wrapped around "how do I spread my threads out across modules so that I have one thread per module?" Yes, you get a performance increase with that, but it is marginal. However, running threads on the same module allows for a) sharing of the L2 cache for apps that are utilizing the same data set, and b) the other modules to be shut down, reducing power and increasing the ability to boost.

Lots of people don't get it. You have a maximum amount of power that the processor can consume. You may be better off concentrating the power on fewer modules to achieve higher clocks than trying to spread threads out to get 100% of the L2 resources.

Ultimately all of this becomes really academic, because threads start and finish at different times. Fire up a program and it might instantly utilize all of the threads, but once it starts running, each thread is going to start and stop at a different time. Take a look at an F1 race: every car starts out at the same place at the same time, then some win by multiple laps and they never finish in order.

Too many people focus on the theoretical and orderly, and not on the reality of how things are processed.

OK, that makes sense for most users, but many/most of us turn off power-saving features when going for max overclocks. So in your theoretical example, a 2-thread process that would get, say, 90% of the performance on one module that it would get on 2 separate ones could get +5% from sharing the L2 and another +10% from a higher turbo, leading to ~105% of the performance of using 2 different modules. However, in a highly OC'd system without turbo enabled it would only get ~95% of the performance, right? On a server this clearly doesn't matter in most cases, but on the desktop most apps are designed to take advantage of 2-4 cores, so it COULD matter. Maybe I'll leave turbo on when OC'ing... hmmm...
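(Checking that arithmetic with the hypothetical percentages above - compounded rather than added, it lands in the same ballpark:)

Code:
base = 0.90   # packed on one module vs. spread across two
l2 = 1.05     # hypothetical gain from sharing the L2
turbo = 1.10  # hypothetical gain from boosting the lone active module

print(base * l2 * turbo)  # ~1.04, i.e. ~104% of the spread-out case with turbo on
print(base * l2)          # ~0.945, i.e. ~95% with turbo disabled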


1. Core i7 refers to Nehalem, Westmere, and Sandy Bridge, which vary in performance.
2. Core i7 is more than 50% faster than Phenom II... so how can it be 50% faster than both?
3. I will believe it when I see it.

1. The Core i7 in that quote is generally considered to refer to Nehalem.
2. In highly multithreaded apps a Phenom II X6 is competitive with a Nehalem i7 quad. JFAMD is a server guy, so he's only talking about servers.
3. All AMD needs is to make some small iterative improvements and they'll have a winner: say +10% clock for clock and +20% clocks, both of which seem to be conservative estimates. A stock 8-core BD @ 4.0 GHz would run circles around a 2600K. Intel will certainly have a response, but as long as AMD executes we should at least have an interesting summer. :)
 
Last edited: