About the current Intel processor's cache hierarchy

hshen1

Member
May 5, 2013
70
0
66
Hi,

I have a question about the latest Intel processors' cache hierarchy, for a project I am working on.

In some older models, for example the Intel quad-core Xeon E5430, there are two last-level (in this case, L2) caches, each shared by two cores (four cores in total).

However, after some searching, I have found that this architecture no longer exists in current Intel processors (Xeon, i7, i5, ...). Each core always has dedicated L1 and L2 caches, and the last-level cache (L3 or L2) is big and shared by all the cores on one processor.

I think probably this is a better design. I just want to confirm this for my project.

Thanks,
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
http://www.realworldtech.com/nehalem/2/

Harpertown is a Core 2 Quad, basically. One half of it would be a Core 2 Duo, with only a single L2 cache. The Xeon E5430 is basically a server version of the Penryn-based Core 2 Quad Q9450.

Nehalem was the first one to use the current hierarchy, with the shared L3 between all cores, however many are on the chip.

Barcelona is basically a 1st-gen AMD Phenom.
 

pantsaregood

Senior member
Feb 13, 2011
993
37
91
The quad-core (and six-core) implementations of Conroe and Penryn were similar to the dual-core implementation of Netburst.

Netburst was designed as a single-core CPU. Intel created Smithfield by essentially taping two Prescott dies together. Core 2 Quad took the same approach, except using dual-core dies.

The multi-die approach, from my understanding, introduced latency into the communication between cores, and ultimately resulted in Athlon 64 X2 leading Pentium D more significantly than Athlon 64 outperformed Pentium 4.
 

hshen1

Hi, thanks for your explanation. Fair enough. So I think processors with the "current hierarchy" all use a shared last-level cache (using Smart Cache technology), right?

The reason I asked is that I am working on a cache contention project. If all the cores share the same last-level cache, then scheduling threads with different memory working sets onto different cores will have no effect (assuming all cache levels are very fast and only memory access is the bottleneck). With the old hierarchy, however, scheduling threads onto cores that have different last-level caches may have some benefit.
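For a project like this, it can help to check which cores actually share a last-level cache before deciding where to place threads. The following is a minimal sketch, assuming a Linux system that exposes cache topology under `/sys/devices/system/cpu` (the `level` and `shared_cpu_list` files); the function names `parse_cpu_list` and `last_level_cache_groups` are made up for this illustration.

```python
import os
import re


def parse_cpu_list(text):
    """Expand a Linux cpulist string such as '0-3,8' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def last_level_cache_groups(root="/sys/devices/system/cpu"):
    """Return the distinct groups of CPUs that share a highest-level cache.

    Assumption: `root` follows the Linux sysfs layout. On other systems the
    directory is missing and an empty list is returned.
    """
    if not os.path.isdir(root):
        return []
    groups = set()
    for entry in sorted(os.listdir(root)):
        if not re.fullmatch(r"cpu\d+", entry):
            continue
        cache_dir = os.path.join(root, entry, "cache")
        if not os.path.isdir(cache_dir):
            continue
        best_level, best_cpus = -1, None
        for index in os.listdir(cache_dir):
            if not index.startswith("index"):
                continue
            base = os.path.join(cache_dir, index)
            try:
                with open(os.path.join(base, "level")) as f:
                    level = int(f.read())
                with open(os.path.join(base, "shared_cpu_list")) as f:
                    shared = tuple(parse_cpu_list(f.read()))
            except (OSError, ValueError):
                continue
            # Keep the cache with the highest level number for this CPU.
            if level > best_level:
                best_level, best_cpus = level, shared
        if best_cpus is not None:
            groups.add(best_cpus)
    return sorted(groups)


if __name__ == "__main__":
    for group in last_level_cache_groups():
        print("CPUs sharing one last-level cache:", list(group))
```

On a chip with one shared L3 this should report a single group per socket. On an older part like the E5430 it would be expected to report one group per pair of cores, which is the case where thread placement could matter.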
 

hshen1

OK, I see. So I think processors with the current hierarchy all use a shared last-level cache. It's actually better in terms of performance, I think.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,678
2,564
136
Yes, it's better. Notably, the shared last-level L3 cache is inclusive: normally, every line present in the inner cache levels (L1 and L2) is also present in the L3. This is used for snooping. The CPU uses the MESIF cache coherency protocol. For example, when I request a cache line for writing (state: exclusive), the L3 marks that line as being in my cache. This way, when another core wants to read it, it doesn't need to scan every cache to figure out where the line is. It can just look in the L3, find out that I have it, then ask me for it.

This is a very good system for low-latency communication between CPU cores.
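The lookup described above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not how the hardware is implemented: the class `InclusiveL3`, its fields, and the rule "a read only asks a core that holds the line for writing" are all made up here, and the real MESIF states are collapsed into "shared" and "held for writing".

```python
class InclusiveL3:
    """Toy model: the L3 records, per line, which cores may hold a private copy."""

    def __init__(self):
        self.sharers = {}    # line address -> set of core ids with a copy
        self.exclusive = {}  # line address -> core id holding it for writing
        self.snoops = 0      # total targeted snoop messages sent

    def read(self, core, addr):
        """Return the list of cores that must be asked for the line."""
        owner = self.exclusive.get(addr)
        targets = []
        if owner is not None and owner != core:
            targets = [owner]         # ask only the recorded owner
            del self.exclusive[addr]  # the line becomes shared
        self.sharers.setdefault(addr, set()).add(core)
        self.snoops += len(targets)
        return targets

    def write(self, core, addr):
        """Return the list of cores whose copies must be invalidated."""
        targets = sorted(self.sharers.get(addr, set()) - {core})
        self.sharers[addr] = {core}
        self.exclusive[addr] = core
        self.snoops += len(targets)
        return targets


if __name__ == "__main__":
    l3 = InclusiveL3()
    print(l3.write(0, 0x40))  # nobody else has the line
    print(l3.read(1, 0x40))   # only core 0 is asked
    print(l3.read(2, 0x40))   # line is shared now, the L3 can supply it
    print(l3.write(3, 0x40))  # all recorded sharers are invalidated
    print(l3.read(2, 0x80))   # unrelated line, no snoop at all
    print("targeted snoops:", l3.snoops)
```

In this run the five requests cause four targeted snoops. A scheme that had to ask every other core on each request would send 15 messages on a four-core chip (5 requests times 3 other cores), which is the scanning that the L3 bookkeeping avoids.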
 

Cerb

pantsaregood said: "The quad-core (and six-core) implementations of Conroe and Penryn were similar to the dual-core implementation of Netburst."

Case in point:
http://www.anandtech.com/show/2743/5

You could find lots of situations where it didn't happen, but when it did happen, you were stuck.

The big shared cache also reduces traffic compared to separate caches. On a Core 2 Quad, a dual-core K8, or an L3-less AMD desktop/mobile CPU, if multiple threads either frequently needed exclusive access to the same few cache lines, or dirtied shared cache lines (forcing the next reader to re-read them from the core running the writing thread), a lot of cache bandwidth and a lot of clock cycles would be wasted just bouncing cache lines around. Current Intel and AMD shared caches don't entirely get rid of that problem, but they greatly reduce it: there is a single pool that a line must bounce down to, and the thread using it can keep going while the L3's copy (and then the copies for other cores) gets updated.

The only real catch is complexity: for any given number of cache lines, you will want nearly as much associativity as the separate caches combined, to avoid any significant risk of way-conflict issues (i.e., 16-way for 2 cores that would otherwise have two 8-way caches; less when more data is shared between threads). Nobody seems to have problems implementing that these days, though (do they usually do way-skewed slices, has Moore's Law just made it cheap to do, or what? I'm really ignorant on this part).
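The 16-way versus two 8-way point can be illustrated with a small simulation of a single cache set. This is a deliberately worst-case sketch under assumptions of my own: strict LRU replacement, two threads that each loop over exactly 8 lines mapping to the same set, no shared data, and perfectly interleaved accesses. The helper names are hypothetical.

```python
from collections import OrderedDict


def hit_rate(ways, accesses):
    """Simulate one LRU-managed cache set with `ways` entries; return the hit fraction."""
    lru = OrderedDict()
    hits = 0
    for tag in accesses:
        if tag in lru:
            hits += 1
            lru.move_to_end(tag)         # mark as most recently used
        else:
            if len(lru) >= ways:
                lru.popitem(last=False)  # evict the least recently used line
            lru[tag] = True
    return hits / len(accesses)


def looping_thread(name, lines, rounds):
    """Access pattern of a thread that walks `lines` distinct lines, `rounds` times."""
    return [(name, i) for _ in range(rounds) for i in range(lines)]


def interleave(a, b):
    """Alternate accesses from two threads, as if they ran in lockstep."""
    out = []
    for x, y in zip(a, b):
        out.extend((x, y))
    return out


if __name__ == "__main__":
    a = looping_thread("A", 8, 100)
    b = looping_thread("B", 8, 100)
    both = interleave(a, b)
    print("private 8-way, thread A alone:", hit_rate(8, a))
    print("shared 8-way, A and B:        ", hit_rate(8, both))
    print("shared 16-way, A and B:       ", hit_rate(16, both))
```

With these inputs, each thread alone in its own 8-way set hits on everything after the first pass, the shared 8-way set thrashes because 16 lines cycle through 8 ways, and the shared 16-way set recovers the private-cache hit rate. Real workloads are rarely this adversarial, and data shared between the threads would lower the associativity needed.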