AMD K10.5 is 10-20 percent faster than K10


CTho9305

Elite Member
Jul 26, 2000
Somewhere along the road in the last few years AMD also switched from an 'inclusive' cache to an 'exclusive' cache whereby each level contains unique data.
AMD has been exclusive since K7. The original Durons had something like 64KB L2... the L1's alone were 128KB combined, so obviously an inclusive system couldn't be used.
 

Idontcare

Elite Member
Oct 10, 1999
Originally posted by: CTho9305
Somewhere along the road in the last few years AMD also switched from an 'inclusive' cache to an 'exclusive' cache whereby each level contains unique data.
AMD has been exclusive since K7. The original Durons had something like 64KB L2... the L1's alone were 128KB combined, so obviously an inclusive system couldn't be used.

What's the 50k-ft strategy there? Is one form always superior to the other, or is there a trade-off that makes one form superior in certain scenarios and the other superior in other situations?
 

Kuzi

Senior member
Sep 16, 2007
Originally posted by: Extelleron
So given that assumption, a Barcelona chip with 8MB of cache (1MB*4 + 4MB L3) would be around 355mm^2, a far cry from 570mm^2.

What about the size at 45nm with 8MB of cache? Any guess?
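For a very rough anchor, a naive optical-shrink estimate (Python, illustrative only; it assumes ideal area scaling, which SRAM, logic, and I/O never achieve equally):

```python
# Naive optical-shrink estimate, nothing more: real SRAM and logic scale
# differently and design changes move the number around.

area_65nm_mm2 = 355          # the 8MB-cache Barcelona estimate quoted above
shrink = (45 / 65) ** 2      # ideal full-node area scaling, ~0.48
print(f"{area_65nm_mm2 * shrink:.0f} mm^2 at 45nm if everything shrank perfectly")
# -> ~170 mm^2; in practice caches shrink better than logic and I/O barely
#    shrinks at all, so a real die would land somewhere above a perfect shrink.
```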
 

GFORCE100

Golden Member
Oct 9, 1999
Originally posted by: Extelleron
So given that assumption, a Barcelona chip with 8MB of cache (1MB*4 + 4MB L3) would be around 355mm^2, a far cry from 570mm^2.

Right, I forgot to subtract the transistors used by the core logic. The 570mm^2 would have been for an eight-core Phenom; subtract however many transistors the IMC uses, as we wouldn't need two of those on one die. Around 500mm^2 either way.

It begs the question of why so many transistors are core logic in AMD's K10, when it has been the other way around for Intel ever since Coppermine, or even the P6 Pentium Pro, especially in the 512KB and 1MB L2 cache variants.
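For a rough sense of where the transistors go, a back-of-the-envelope sketch (Python, purely illustrative): it assumes plain 6T SRAM cells and ignores tag arrays, ECC, and redundancy, so real cache arrays are somewhat larger than this.

```python
# Back-of-the-envelope transistor count for SRAM data arrays (illustrative only).
# Assumes plain 6T cells; tags, ECC and redundancy are ignored, so real arrays
# are noticeably bigger.

def sram_transistors(cache_mb, transistors_per_bit=6):
    bits = cache_mb * 1024 * 1024 * 8
    return bits * transistors_per_bit

for mb in (2, 4, 8):
    print(f"{mb}MB cache ~ {sram_transistors(mb) / 1e6:.0f}M transistors (data arrays only)")

# Against Barcelona's commonly quoted ~463M total transistors, its 4MB of
# on-die cache (4x512KB L2 + 2MB L3) accounts for roughly 200M of them,
# which is why the remaining core-logic share looks so large next to
# Intel's cache-heavy dies.
```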



 

CTho9305

Elite Member
Jul 26, 2000
Originally posted by: Idontcare
Originally posted by: CTho9305
Somewhere along the road in the last few years AMD also switched from an 'inclusive' cache to an 'exclusive' cache whereby each level contains unique data.
AMD has been exclusive since K7. The original Durons had something like 64KB L2... the L1's alone were 128KB combined, so obviously an inclusive system couldn't be used.

What's the 50k-ft strategy there? Is one form always superior to the other, or is there a trade-off that makes one form superior in certain scenarios and the other superior in other situations?

There's no clear winner. If your caches are inclusive, you only have to check probes in your farthest-out level of cache. The benefit there is that you can significantly reduce the number of accesses to your L1 tag, which will either gain you a little performance or save you a port on the tag (area, power, complexity reduction). You generally won't need an extra port on the L2 tag if you're inclusive because the L2 isn't accessed very often (so it's less likely that a probe and a request from a local core would collide, delaying one of them). Note that probe traffic really becomes significant only for multiprocessor systems (dual core may count here). Before the multicore days, having probes go to the L1 and not using a dedicated port may have had minimal performance penalty (I don't know for sure - that's an educated guess).
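A toy illustration of that probe-filtering point (Python, illustrative only; the sets below just stand in for tag arrays that real hardware checks in parallel, and the addresses are made up):

```python
# Toy model of why inclusion filters probe (snoop) traffic.

l1_tags = {0x1000, 0x1040}                       # lines currently in the L1
l2_tags_inclusive = l1_tags | {0x2000, 0x2040}   # inclusive L2: superset of the L1
l2_tags_exclusive = {0x2000, 0x2040}             # exclusive L2: disjoint from the L1

def probe_inclusive(addr):
    """With inclusion, an L2 tag miss guarantees an L1 miss."""
    l1_lookups = 0
    hit = addr in l2_tags_inclusive
    if hit:
        l1_lookups = 1        # only a potential hit has to disturb the L1 tag port
    return hit, l1_lookups

def probe_exclusive(addr):
    """Without inclusion, every probe has to consult the L1 tags too."""
    l1_lookups = 1
    hit = addr in l1_tags or addr in l2_tags_exclusive
    return hit, l1_lookups

print(probe_inclusive(0x3000))   # (False, 0) -> L1 left alone
print(probe_exclusive(0x3000))   # (False, 1) -> L1 tag port consumed anyway
```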

The tradeoff is that your L2 effectively just lost a big chunk of its capacity - if you consider a 64KB+64KB L1 CPU (e.g. an AMD one), even with a 256KB L2 you're wasting fully half of the transistors in the L2 (and a lot of area). With Intel's small L1 caches (something like 8KB+12kuops for P4, something like 32+32 for most of their other chips?) the amount of L2 space that's wasted is smaller. Of course, it also means you have reduced L1 capacity, and if somebody's working set happens to take between 32 and 64KB, the Intel CPU is going to get trounced.
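And the capacity side of the trade-off in numbers, using the sizes mentioned above (a minimal sketch, nothing more):

```python
# Effective (unique) capacity under inclusive vs exclusive policies,
# with the L1/L2 sizes discussed above.

def effective_capacity_kb(l1_kb, l2_kb, inclusive):
    if inclusive:
        # Every L1 line is duplicated in the L2, so the duplicates add nothing.
        return l2_kb
    # Exclusive: the two levels hold different lines.
    return l1_kb + l2_kb

amd_l1, small_l2 = 64 + 64, 256
intel_l1 = 32 + 32

print(effective_capacity_kb(amd_l1, small_l2, inclusive=True))    # 256 -> half the L2 holds copies
print(effective_capacity_kb(amd_l1, small_l2, inclusive=False))   # 384
print(effective_capacity_kb(intel_l1, small_l2, inclusive=True))  # 256 -> only 64KB duplicated
```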
 

Idontcare

Elite Member
Oct 10, 1999
Originally posted by: CTho9305
The tradeoff is that your L2 effectively just lost a big chunk of its capacity - if you consider a 64KB+64KB L1 CPU (e.g. an AMD one), even with a 256KB L2 you're wasting fully half of the transistors in the L2 (and a lot of area). With Intel's small L1 caches (something like 8KB+12kuops for P4, something like 32+32 for most of their other chips?) the amount of L2 space that's wasted is smaller. Of course, it also means you have reduced L1 capacity, and if somebody's working set happens to take between 32 and 64KB, the Intel CPU is going to get trounced.

Does this trade-off become more problematic when adding another level of cache (L3)?

What about the very relevant situation where you have shared caches: in what ways does it hinder or help to have an inclusive or exclusive cache when dealing with a non-shared L2 but an L3 shared across the cores?
 

KingstonU

Golden Member
Dec 26, 2006
Originally posted by: SlowSpyder
Maybe they're comparing the K10.5 to the current Phenom with the TLB bug fix enabled. ;) There's 10-20% right there just from it working correctly without needing the TLB workaround.

^ I was wondering the same ^
 

aussiestilgar

Senior member
Dec 2, 2007
Let's hope AMD can deliver, or at least bring the competition back to a level where one could seriously consider buying either Intel or AMD.
 

CTho9305

Elite Member
Jul 26, 2000
Originally posted by: Idontcare
Originally posted by: CTho9305
The tradeoff is that your L2 effectively just lost a big chunk of its capacity - if you consider a 64KB+64KB L1 CPU (e.g. an AMD one), even with a 256KB L2 you're wasting fully half of the transistors in the L2 (and a lot of area). With Intel's small L1 caches (something like 8KB+12kuops for P4, something like 32+32 for most of their other chips?) the amount of L2 space that's wasted is smaller. Of course, it also means you have reduced L1 capacity, and if somebody's working set happens to take between 32 and 64KB, the Intel CPU is going to get trounced.

Does this trade-off become more problematic when adding another level of cache (L3)?

It's pretty much the same. The only difference is that traffic to an L2 is already going to be low enough that there's probably not a significant savings from reduced probe traffic with an inclusive scheme (maybe if the L1 is write-through...but consumer CPUs don't currently use write-through L1s as far as I know). That combined with the large size of L2s means you probably don't want the L3 to be inclusive of the L2. I haven't really thought about L3s though.
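Extending the same capacity arithmetic to a three-level hierarchy with Barcelona-like sizes (64KB+64KB L1, 512KB L2, 2MB shared L3; illustrative only, real victim/exclusive policies are messier than a set union):

```python
# Unique capacity one core can exploit, inclusive vs exclusive L3,
# using Barcelona-like sizes. Illustrative only.

L1, L2, L3 = 128, 512, 2048   # KB

# L3 inclusive of everything below it: all lower-level lines are duplicated in the L3.
inclusive_unique = L3
# Fully exclusive (victim-style) hierarchy: every level holds distinct lines.
exclusive_unique = L1 + L2 + L3

print(inclusive_unique, "KB unique if the L3 must duplicate everything below it")
print(exclusive_unique, "KB unique if the levels stay exclusive")
print(f"inclusion duplicates {L1 + L2} of {L3} KB -> {(L1 + L2) / L3:.0%} of the L3")
```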

What about the very relevant situation where you have shared caches: in what ways does it hinder or help to have an inclusive or exclusive cache when dealing with a non-shared L2 but an L3 shared across the cores?

I'm not sure that really changes anything. I haven't thought about it much though.
 

Kuzi

Senior member
Sep 16, 2007
Even though Shanghai will have more cache than Phenom, at 45nm it should be smaller, and I'm guessing it will be similar in size to Intel's Yorkfield (~214mm^2).

So with higher performance and a smaller die than Phenom, if AMD can clock Shanghai above 3GHz (I'm sure they can), it should be good competition for Yorkfield.

Nehalem will be faster, but probably bigger and more expensive. I haven't seen any mention of Nehalem clock speeds yet, so that is another thing to keep in mind right now.
 

Idontcare

Elite Member
Oct 10, 1999
Originally posted by: Kuzi
So with higher performance and a smaller die than Phenom, if AMD can clock Shanghai above 3GHz (I'm sure they can), it should be good competition for Yorkfield.

Is there a consensus on the interweb as to what is limiting Phenom clocks at 65nm?

Is it TDP limited? xtor clocking limited (Vcore)? clock-skew limited (die-size)? or speed-path limited (layout)?
 

Martimus

Diamond Member
Apr 24, 2007
Originally posted by: Idontcare
Originally posted by: Kuzi
So with higher performance and a smaller die than Phenom, if AMD can clock Shanghai above 3GHz (I'm sure they can), it should be good competition for Yorkfield.

Is there a consensus on the interweb as to what is limiting Phenom clocks at 65nm?

Is it TDP limited? xtor clocking limited (Vcore)? clock-skew limited (die-size)? or speed-path limited (layout)?

From what I have gathered, the biggest issue is the IMC clock speed. Once the core speed gets farther and farther away from the memory speed, the chip gets errors when multiple cores go to access the memory. (Because two cores could be accessing the same memory location at different clock cycles, but since the IMC is running slower, the memory hasn't caught up yet giving the second access incorrect data.) Although there is so little out there about the technical limitations (at least that I have read) that this is the only thing I can think of. This is pure conjecture though, so I am probably wrong.
 

CTho9305

Elite Member
Jul 26, 2000
Originally posted by: Martimus
Originally posted by: Idontcare
Originally posted by: Kuzi
So with higher performance and a smaller die than Phenom, if AMD can clock Shanghai above 3GHz (I'm sure they can), it should be good competition for Yorkfield.

Is there a consensus on the interweb as to what is limiting Phenom clocks at 65nm?

Is it TDP limited? xtor clocking limited (Vcore)? clock-skew limited (die-size)? or speed-path limited (layout)?

From what I have gathered, the biggest issue is the IMC clock speed. Once the core speed gets farther and farther away from the memory speed, the chip gets errors when multiple cores go to access the memory. (Because two cores could be accessing the same memory location at different clock cycles, but since the IMC is running slower, the memory hasn't caught up yet giving the second access incorrect data.) Although there is so little out there about the technical limitations (at least that I have read) that this is the only thing I can think of. This is pure conjecture though, so I am probably wrong.

Has anyone tried underclocking the northbridge/memory controller to verify this?
 

Idontcare

Elite Member
Oct 10, 1999
Originally posted by: Martimus
Originally posted by: Idontcare
Originally posted by: Kuzi
So with higher performance and a smaller die than Phenom, if AMD can clock Shanghai above 3GHz (I'm sure they can), it should be good competition for Yorkfield.

Is there a consensus on the interweb as to what is limiting Phenom clocks at 65nm?

Is it TDP limited? xtor clocking limited (Vcore)? clock-skew limited (die-size)? or speed-path limited (layout)?

From what I have gathered, the biggest issue is the IMC clock speed. Once the core speed gets farther and farther away from the memory speed, the chip gets errors when multiple cores go to access the memory. (Because two cores could be accessing the same memory location at different clock cycles, but since the IMC is running slower, the memory hasn't caught up yet giving the second access incorrect data.) Although there is so little out there about the technical limitations (at least that I have read) that this is the only thing I can think of. This is pure conjecture though, so I am probably wrong.

What's the root-cause of the clockspeed delta between the IMC and the logic cores?

Are AMD's 65nm node transistors just so weak (Ion/Ioff) that the switching speed effectively caps the maximum clockspeed of the Phenom to the current ~2.8GHz max?

What's the fastest clocked 65nm part AMD has ever shipped? What about 90nm?
 

heyheybooboo

Diamond Member
Jun 29, 2007
Originally posted by: CTho9305
Originally posted by: Martimus
Originally posted by: Idontcare
Originally posted by: Kuzi
So with higher performance and a smaller die than Phenom, if AMD can clock Shanghai above 3GHz (I'm sure they can), it should be good competition for Yorkfield.

Is there a consensus on the interweb as to what is limiting Phenom clocks at 65nm?

Is it TDP limited? xtor clocking limited (Vcore)? clock-skew limited (die-size)? or speed-path limited (layout)?

From what I have gathered, the biggest issue is the IMC clock speed. Once the core speed gets farther and farther away from the memory speed, the chip gets errors when multiple cores go to access the memory. (Because two cores could be accessing the same memory location at different clock cycles, but since the IMC is running slower, the memory hasn't caught up yet giving the second access incorrect data.) Although there is so little out there about the technical limitations (at least that I have read) that this is the only thing I can think of. This is pure conjecture though, so I am probably wrong.

Has anyone tried underclocking the northbridge/memory controller to verify this?

I stumbled across a forum post (which I can't locate now) whereby someone disabled 'core2' of the Phenom in AMD Overdrive and succeeded in a generous overclock and control of the IMC/northbridge. It was difficult to determine if this was 'factual or FUD' because of the tone of the post.

And I was under the general impression that erratum 298 was partially related to the 'flush' (or lack thereof) of the L2 in one core, which effectively created a memory-addressing conflict within the cache structure of the CPU.

Someone above my paygrade will have to determine the connection (if there is one) with the nb/imc speed limitation of 1.8GHz(?)

AMD could gain some traction with an upfront explanation of the problem and their success in addressing it with B3 ... but I guess I'm asking for too much :)

Originally posted by: Idontcare


Are AMD's 65nm node transistors just so weak (Ion/Ioff) that the switching speed effectively caps the maximum clockspeed of the Phenom to the current ~2.8GHz max?

What's the fastest clocked 65nm part AMD has ever shipped? What about 90nm?

90nm - X2 6400+ @ 3.2GHz "Windsor"
65nm - X2 5400+ @ 2.8GHz "Brisbane"


I don't know how to characterize the clockspeed issues with the Phenom, but I have always assumed they are related to the 1.8GHz limitation of the IMC/NB.

 

Roy2001

Senior member
Jun 21, 2001
I still remember AMD executives claiming K10 would be 40% faster than Kentsfield.

Originally posted by: Kuzi
According to Fudzilla:

Shanghai K10.5 is about 10 to 20 percent faster

I don't know but it seems hard to believe, since from what I read K10.5 is just a die shrink to 45nm with 6MB L3 cache (up from 2MB).

AMD will need every ounce of performance they can get, especially since Nehalem will be released around the same time as K10.5, if not earlier.

 

RaulF

Senior member
Jan 18, 2008
Originally posted by: SlowSpyder
Maybe they're comparing the K10.5 to the current Phenom with the TLB bug fix enabled. ;) There's 10-20% right there just from it working correctly without needing the TLB workaround.

Ah no, you lose that if you enable the fix in the BIOS. But most benchmarks are run with the TLB problem still there.

I thought cache was not too big of a deal for AMD since they have the IMC? So now cache does help AMD that much?
 

Idontcare

Elite Member
Oct 10, 1999
Originally posted by: heyheybooboo
Originally posted by: Idontcare
What's the fastest clocked 65nm part AMD has ever shipped? What about 90nm?
90nm - X2 6400+ @ 3.2GHz "Windsor"
65nm - X2 5400+ @ 2.8GHz "Brisbane"

I don't know how to characterize the clockspeed issues with the Phenom, but I have always assumed they are related to the 1.8GHz limitation of the IMC/NB.

Thanks for the info heyheybooboo.

So however AMD might explain the Phenom clockspeed results, the bottom line is that neither Phenom nor the X2 at 65nm is getting above 2.8GHz.

To me that would rule out layout (speedpath) limitations as gating the clockspeed for these 65nm parts.

So are they too hot (like the old Prescott heat limitation on clocks), or does AMD's 65nm node just have too-wimpy xtors? I don't think the X2 is too hot, so it must just be wimpy xtors.

IBM can no doubt help here; they got their Power6 chips clocked up to 4.4GHz, which requires fast-switching transistors. Hopefully the technology floats over to AMD.
 

taltamir

Lifer
Mar 21, 2004
I see no reason why it WOULDN'T improve performance.
Phenom is a piece of crap, no ifs, ands, or buts about it. But it's not like AMD is suddenly a failure as a company. They simply released an unfinished product. They have since had time to finish it, fixing bugs and improving performance...

Although they COULD be referring to the 10-20% performance decrease caused by the TLB workaround being eliminated (that is the same as a 10-20% performance increase, right? :p)
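Quick arithmetic on that parenthetical (a trivial sketch): removing an X% slowdown is worth slightly more than an X% speedup.

```python
# Removing an X% slowdown is a bit more than an X% speedup.
# E.g. a score of 100 dragged down 15% by the TLB workaround runs at 85;
# getting back to 100 is 100/85 - 1 ~ 17.6%, not 15%.

for slowdown in (0.10, 0.15, 0.20):
    gain = 1 / (1 - slowdown) - 1
    print(f"{slowdown:.0%} workaround penalty removed -> {gain:.1%} apparent speedup")
```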

I could have made the same guess as theinq... theinq isn't journalism, it's guesswork that rarely gets anything right.
 

Idontcare

Elite Member
Oct 10, 1999
Originally posted by: taltamir
I see no reason why it WOULDN'T improve performance.
Phenom is a piece of crap, no ifs, ands, or buts about it. But it's not like AMD is suddenly a failure as a company. They simply released an unfinished product. They have since had time to finish it, fixing bugs and improving performance...

Although they COULD be referring to the 10-20% performance decrease caused by the TLB workaround being eliminated (that is the same as a 10-20% performance increase, right? :p)

I could have made the same guess as theinq... theinq isn't journalism, it's guesswork that rarely gets anything right.

To be fair, wasn't there a considerable lag between the top speedbin of the X2 and the top speedbin of the single-core FX? There is a thermal and die-size logistics issue when doubling the number of cores; no amount of engineering makes the physics go away.

I have a feeling that, had the Phenom been released in the absence of Conroe/Penryn, most computer users would feel things were progressing quite normally given the history of the single-core to dual-core transition.

Intel's product line is certainly superior; they should be ashamed if their 4X larger R&D budget didn't ensure this for them. But just because something is super great doesn't mean something else is suddenly garbage. Via C7s sell for a good reason, and the people who are happy to be able to buy them aren't likely to consider them crap.
 

Kuzi

Senior member
Sep 16, 2007
Originally posted by: Idontcare
Originally posted by: heyheybooboo
Originally posted by: Idontcare
What's the fastest clocked 65nm part AMD has ever shipped? What about 90nm?
90nm - X2 6400+ @ 3.2GHz "Windsor"
65nm - X2 5400+ @ 2.8GHz "Brisbane"

I don't know how to characterize the clockspeed issues with the Phenom, but I have always assumed they are related to the 1.8GHz limitation of the IMC/NB.

Thanks for the info heyheybooboo.

So however AMD might explain the Phenom clockspeed results, the bottom line is that neither Phenom nor the X2 at 65nm is getting above 2.8GHz.

To me that would rule out layout (speedpath) limitations as gating the clockspeed for these 65nm parts.

So are they too hot (like the old Prescott heat limitation on clocks), or does AMD's 65nm node just have too-wimpy xtors? I don't think the X2 is too hot, so it must just be wimpy xtors.

IBM can no doubt help here; they got their Power6 chips clocked up to 4.4GHz, which requires fast-switching transistors. Hopefully the technology floats over to AMD.

AMD milked their 90nm process (it was great) for far too long and didn't spend as much time or resources on their 65nm process; it is way too late to fix it now. If the dual-core Brisbane hardly gets to 2.8GHz, what chance does a quad-core Phenom have?

That's why I don't think we will see the 65nm Phenom sold at higher than 2.3 or 2.4GHz. It all comes down to their 45nm process now.
 

CTho9305

Elite Member
Jul 26, 2000
Originally posted by: heyheybooboo
Originally posted by: CTho9305
Originally posted by: Martimus
Originally posted by: Idontcare
Originally posted by: Kuzi
So with higher performance and a smaller die than Phenom, if AMD can clock Shanghai above 3GHz (I'm sure they can), it should be good competition for Yorkfield.

Is there a consensus on the interweb as to what is limiting Phenom clocks at 65nm?

Is it TDP limited? xtor clocking limited (Vcore)? clock-skew limited (die-size)? or speed-path limited (layout)?

From what I have gathered, the biggest issue is the IMC clock speed. Once the core speed gets farther and farther away from the memory speed, the chip gets errors when multiple cores go to access the memory. (Because two cores could be accessing the same memory location at different clock cycles, but since the IMC is running slower, the memory hasn't caught up yet giving the second access incorrect data.) Although there is so little out there about the technical limitations (at least that I have read) that this is the only thing I can think of. This is pure conjecture though, so I am probably wrong.

Has anyone tried underclocking the northbridge/memory controller to verify this?

I stumbled across a forum post (which I can't locate now) whereby someone disabled 'core2' of the Phenom in AMD Overdrive and succeeded in a generous overclock and control of the IMC/northbridge. It was difficult to determine if this was 'factual or FUD' because of the tone of the post.

Without a reasonable number of people doing controlled tests you can't really tell if it wasn't just dumb luck that that guy ended up with one slow core and 3 fast ones. You'll never get controlled tests out of overclockers though - everybody will have some other tweak set slightly differently or be using different cooling/voltage/definition of stability/etc.

How do you disable Phenom cores?