Dual core caches @ X-bit labs

OCedHrt

Senior member
Oct 4, 2002
613
0
0
Seems like I don't know how to use the search function here since I'm obviously looking at a conroe thread with a post from today (6/4) and I'm not seeing it in the search results. How the heck are they ranked anyways?

Anyways, I don't think it's a repost since I searched for conroe, xbit, etc and didn't see anything, but then again, doesn't seem like I know how to use that search function.

X-bit labs has an interesting article on dual-core caches, covering everything from AMD's X2 through all of Intel's (even the HT ones). Conroe has a surprisingly low cache/memory access latency, lower than even AMD's on-die memory controller (surprising to me anyhow, since I haven't cared to keep up with Conroe). Anyways, this may be the explaining factor for Conroe's performance lead over the K8 architecture. Sorry if you guys already figured that out :p
 

FelixDeCat

Lifer
Aug 4, 2000
30,957
2,670
126
The biggest advantage Conroe has is the 4MB L2 cache, among other things. AT has a good article on it.
 

SexyK

Golden Member
Jul 30, 2001
1,343
4
76
Here's the original thread (for the search deficient ;-)... the article is definitely an interesting read

Original
 

BassBomb

Diamond Member
Nov 25, 2005
8,390
1
81
Caches are always faster than RAM or RAM controllers...

You gotta compare AMD's cache to Intel's, not Intel's cache to AMD's memory, because even their NetBurst caches are faster than AMD's memory controller.

What makes Conroe's 4MB cache good is that programs small enough to fit in it won't even need to go to memory... (that's where Intel usually stinks, we all know)
 

dexvx

Diamond Member
Feb 2, 2000
3,899
0
0
Originally posted by: BassBomb
What makes Conroe's 4MB cache good is that programs small enough to fit in it won't even need to go to memory... (that's where Intel usually stinks, we all know)

Please tell me of any mainstream program nowadays that fits entirely into cache. Not even SuperPi 1M will fit into a 4MB cache. Having a large cache is beneficial because the CPU goes to system memory less, but saying it is ONLY as simple as that is just stupid.
 

Keysplayr

Elite Member
Jan 16, 2003
21,219
54
91
Originally posted by: dexvx
Originally posted by: BassBomb
What makes Conroe's 4MB cache good is that programs small enough to fit in it won't even need to go to memory... (that's where Intel usually stinks, we all know)

Please tell me of any mainstream program nowadays that fits entirely into cache. Not even SuperPi 1M will fit into a 4MB cache. Having a large cache is beneficial because the CPU goes to system memory less, but saying it is ONLY as simple as that is just stupid.

It can be said millions of times, but the myth has gotten more mileage than the fact, unfortunately.

 

Duvie

Elite Member
Feb 5, 2001
16,215
0
71
I believe Conroe's power has to do with more than cache alone....We have seen that 512KB to 1MB made a small difference, and 1MB to 2MB on a single core made even less....I don't think 4MB is going to account for the 20% lead in apps....If we can assume the Core Duo chip is quite comparable clock-for-clock to AMD, then this is a 20% gain over that...

I think dmens went into this a while back, with mention of buffers also being very beneficial...plus 3 execution units...I may have the last term wrong, but it is 1 more than AMD has now....I think it was an integer unit...AMD still does quite well in FPU apps versus the Conroe
 

OCedHrt

Senior member
Oct 4, 2002
613
0
0
Originally posted by: BassBomb
Caches are always faster than RAM or RAM controllers...

You gotta compare AMD's cache to Intel's, not Intel's cache to AMD's memory, because even their NetBurst caches are faster than AMD's memory controller.

What makes Conroe's 4MB cache good is that programs small enough to fit in it won't even need to go to memory... (that's where Intel usually stinks, we all know)

The article at x-bit is comparing Intel's memory latency to AMD's memory latency, and Intel is whooping AMD's on-die memory controller. This comes up when they try to determine whether or not Conroe supports data sharing between L1 caches (the answer is no, it doesn't). They then compare Conroe's memory latency to its L2 cache latency to determine whether or not the shared L2 cache is functioning (and it is: about 14-20 cycles vs 40-60, I think?). The cache is completely bypassed by flushing every line of it. AMD, by comparison, shows about 80 cycles of latency for memory.
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
Originally posted by: BassBomb
You gotta compare AMD's cache to Intel's, not Intel's cache to AMD's memory, because even their NetBurst caches are faster than AMD's memory controller.

What makes Conroe's 4MB cache good is that programs small enough to fit in it won't even need to go to memory... (that's where Intel usually stinks, we all know)

why should xbit deliberately exclude an intended feature from yonah/merom, especially when the stated purpose of the test was to determine the lowest latency for core-to-core data transfer. duh. and stop using the fit-in-L2 argument, it is soooo absurd.

to the op, I don't think the shared L2 is much of a factor in anything except for shared-memory programs, and the big L2 can be seen as a knockout for a specific set of programs but a drawback for others. also, conroe's load-to-use latency from main memory is still higher than K8, but that is offset somewhat by the cache size and the large LB.
 

lopri

Elite Member
Jul 27, 2002
13,314
690
126
Originally posted by: AkumaX
Hmmm... me ponders an X2 w/ 2MB L2...

Will only be (theoretically) possible when AMD moves to 65nm. Have you seen the die size of X2s? (Granted, the award may have to go to Tulsa)

By the time AMD moves to 65nm, they will have to deal with quad-core.. It'll be an uphill battle for AMD to fit a bigger L2 cache. It can/should eventually be done, but when.. who knows..

 

mamisano

Platinum Member
Mar 12, 2000
2,045
0
76
All, this was a response posted on AMDZone back on June 1st:

by abinstein on Jun 01, 2006 - 03:37 AM

According to my tests, the most recent copy of data is always read from system RAM. This must be a limitation of the MOESI protocol implementation. ... Why is there no direct transfer between the cores via the crossbar switch? Ask AMD's engineers about that!


Unfortunately xbit is wrong here.

If we look at AMD's MOESI protocol (Ch. 7 of the Programmer's Manual Vol. 2), there is NO read probing in the Invalid state. In fact, there is NO probing at all when a cache line is invalid. (How could there be, anyway, since the cache line isn't mapped to any physical address yet!)

So yes, with his toy app xbit won't observe inter-core communication on the Athlon64 X2, because core#1 doesn't (and shouldn't) care about what data core#0 is reading if core#1 wasn't previously working on that data. This is also true of the Pentium-D. On the other hand, Prescott and Yonah/Conroe manifest their inter-core communication with this toy app just fine, simply because in their case core#1's cache line is loaded (shared) with core#0's.

A more realistic test program should measure the sequence of {core#0.read, core#0.write, core#1.read, core#1.write} m times on the same address, divide the number of cycles by m, then measure the sequence on the next address, and so on. Such a producer-consumer scenario at least better reflects the behaviors of real multi-threaded applications.

What xbit really proved in the article is the following: simple artificial benchmark results are sometimes misleading and can lead to completely wrong conclusions (in this case, to the conclusion that there is no inter-core communication in the Athlon64 X2).
 

OCedHrt

Senior member
Oct 4, 2002
613
0
0
It seems like the only inter-core communication done is to check whether or not the other core has invalidated something in the current core's cache.

Basically, core 0 and 1 both read the same data into cache. If core 0 alters the data, that data is now invalid in core 1's cache as well. Thus, the only inter-core communication happens when a core has some line in cache and that line's validity is checked against the other core.

What I think x-bit was curious about was whether or not cache data could be shared across cores. Obviously this isn't implemented; otherwise there would not be much point in having a shared L2 cache. What surprised me, though, is that an L1 cache miss went to memory instead of the L2 cache.