Intel Cache Latency Question

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
After comparing a number of their high-end CPUs, I noticed that the cache latencies are very different across their brands. Below are a few examples:

Nehalem:
L1: 64KB / 4 cycles
L2: 256KB / 11 cycles
L3: 8MB / 39 cycles

Nehalem EX: (Beckton)
L1: 64KB / 4 cycles
L2: 256KB / 9 cycles
L3: 24MB / 63 cycles

Itanium 9300: (Tukwila)
L1: 64KB / 1 cycle
L2: 512KB (I) + 256KB (D) / 5-7 cycles
L3: 20MB-24MB / <39 cycles *
*I have not found the exact number here, but multiple references state lower than Nehalem.

So my question is, if Intel can produce such low latencies with current technology methods, why don't they do it for all brands?

At first I thought it might be a cost issue, with the fast caches reserved for the expensive chips. But Beckton is in the same price range as Itanium, so I tend not to think it's money related. And Gulftown is no cheap CPU either.

Is there some sort of limitation with x86 CPUs that prevents this?
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Latency alone does not provide the full picture of the performance capabilities of a given cache design. You also need to factor in associativity, ports, and so on, plus cell size (SRAM density) and operating voltage. It's complicated, and yes, the "yieldability" of the resultant SRAM/cache circuit is a factor, as well as development time and resources.

The "best" sram may simply be unfeasible to develop within a given time constraint and R&D resource constraint and that impacts what projects get the sram layout in time for inclusion into a cpu architecture.
 

aphorism

Member
Jun 26, 2010
41
0
0
a few things to notice:

the processors with the faster caches launched later.

itanium's caches are very different. there is no L1 data cache for floating point. L2 caches typically have 10-11 cycle access latency.

nehalem ex's L3 is probably slower because all 8 cores/ 16 threads share it.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
nehalem ex's L3 is probably slower because all 8 cores/ 16 threads share it.

And it's on that ring bus, broken up into 3MB domains. They're definitely pushing the topology to maintain coherency; you can see where they are going with their Light Peak research.
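For what it's worth, here is a crude back-of-the-envelope way to think about a sliced L3 on a ring; the per-hop and slice-access costs below are made-up placeholders, not Intel's numbers:

```python
# Toy model of a bidirectional ring with N cache slices: a request takes the
# shorter way around, so the average distance is roughly N/4 stops, and the
# observed L3 latency is a fixed slice-access cost plus the round-trip hops.

def avg_ring_hops(n_slices):
    dists = [min(d, n_slices - d) for d in range(n_slices)]
    return sum(dists) / n_slices

def l3_latency_estimate(n_slices, slice_cycles, hop_cycles):
    # slice_cycles and hop_cycles are illustrative, not measured values
    return slice_cycles + 2 * avg_ring_hops(n_slices) * hop_cycles

print(avg_ring_hops(8))               # ~2 hops on average for 8 slices
print(l3_latency_estimate(8, 30, 4))  # purely illustrative
```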
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
20,846
3,189
126
Westmere-EPs have lower latency than standard Nehalems.

They're the 32nm 12MB-cache quad-cores.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
Itanium cycle counts cannot be compared 1:1 with Nehalem's.
The average Nehalem runs in the 2.66-3.46 GHz range.

Itanium runs at around 1.6 GHz, so its cycles are approximately twice as long as those of the faster Nehalems.
Taking that into account, only the L1 appears to be particularly fast. But that probably has a lot to do with the instruction set as well (x86 has very complex addressing modes, making it more difficult to forward an address and predict memory accesses).
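To put that in wall-clock terms using the cycle counts from the first post, here is a quick sketch; the 3.2 GHz and 1.6 GHz figures are ballpark assumptions rather than exact SKUs:

```python
# Convert cache latencies from core cycles to nanoseconds at a given clock.

def cycles_to_ns(cycles, freq_ghz):
    return cycles / freq_ghz  # 1 cycle at 1 GHz = 1 ns

print(cycles_to_ns(4, 3.2))    # Nehalem L1:  4 cycles  -> 1.25 ns
print(cycles_to_ns(39, 3.2))   # Nehalem L3: 39 cycles  -> ~12.2 ns
print(cycles_to_ns(1, 1.6))    # Itanium L1:  1 cycle   -> 0.625 ns
print(cycles_to_ns(39, 1.6))   # a 39-cycle Itanium L3 would be ~24.4 ns
```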
 

khon

Golden Member
Jun 8, 2010
1,319
124
106
And it's on that ring bus, broken up into 3MB domains. They're definitely pushing the topology to maintain coherency; you can see where they are going with their Light Peak research.

As far as I know, Light Peak isn't designed for that, at least not according to the Intel presentation I heard a few months back. It's designed for medium-range transfers, not short-range ones like those between the CPU and cache.
 

aphorism

Member
Jun 26, 2010
41
0
0
And it's on that ring bus, broken up into 3MB domains. They're definitely pushing the topology to maintain coherency; you can see where they are going with their Light Peak research.

yeah, i'd also like to see the theoretical and net bandwidth of the L3 caches on nehalem-EX vs. magny-cours or power7.

i think with parallelism going full steam and global wire delays increasing we will see more focus on interconnects and bus topologies. optical interconnects and carbon nanotubes look promising imo.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Each of the four cores on the Itanium 9300 has its own 6MB L3 slice, and cache coherency for the L3 is managed by the on-die router using a directory. The L3 latency is 15 cycles.

There are two L2 caches on Tukwila. Just like the L1, which is split into instruction and data caches, the L2 is split as well: 256KB of data L2 and 512KB of instruction L2, with latencies of 5 and 7 cycles respectively.

So my question is, if Intel can produce such low latencies with current technology methods, why don't they do it for all brands?

CPU design, just like other architectural design, is a balancing act: you sacrifice one thing to gain in another. Larger capacities result in higher latency and lower bandwidth, faster caches may take more die area than slower ones, and even instruction set and architectural differences affect the outcome.

Plus, caches are all synchronized to the core clock, so the same absolute access time costs more cycles at a higher frequency: 1 cycle @ 100MHz = 10 cycles @ 1GHz.
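As a minimal sketch of that arithmetic (the 10 ns access time is a hypothetical figure, not any real part's spec):

```python
import math

# Count how many core cycles a fixed physical access time occupies at a clock.
def cycles_at(access_ns, freq_mhz):
    return math.ceil(access_ns * freq_mhz / 1000)  # ns * MHz / 1000 = cycles

print(cycles_at(10, 100))   # 1 cycle at 100 MHz
print(cycles_at(10, 1000))  # 10 cycles at 1 GHz
```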
 

NP Complete

Member
Jul 16, 2010
57
0
0
As others have said, cache design is a trade-off of density vs. features/size, primarily because of transistor capacitance, along with some resistance from wires/vias.

A basic transistor's gate looks like a capacitor (which stores charge), switching on or off once enough charge has been sourced or drained. In circuits, the gate of one transistor is connected to the output of another to create logic. A transistor of a given size only allows so much current to flow, so the more logic is added to a transistor's output, the more time it takes to source or sink enough current to switch all the gates connected to it. Larger transistors can drive more current and allow faster switching if needed, though there are diminishing returns when sizing up transistors or adding a bunch of large ones, since each transistor added to a logic circuit adds a delay between receiving input and driving output, and the larger the transistor, the more capacitance its gate has.

Making caches larger, adding more read/write ports, etc. all increase the amount of circuitry each SRAM cell has to drive, meaning the whole cache has to be clocked slower to operate correctly. The workaround is to add large buffers at the flops' (bits') outputs, which can drive more current and allow faster switching. Since overall CPU performance isn't solely determined by cache latency, it's quite a balancing act to choose the right cache size/latency/number of read & write ports.
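As a rough illustration of the fan-out and buffering argument above, here is a toy delay model; the drive strengths, load capacitances, and intrinsic delays are arbitrary units I made up, not real process numbers:

```python
# Gate delay grows with the capacitive load a driver must switch, so fanning
# one small driver out to many loads is slow; inserting a larger buffer splits
# the problem into stages, each driving a manageable load.

def stage_delay(drive_strength, load_caps, intrinsic=1.0):
    # delay ~ intrinsic + total load / drive strength, in arbitrary units
    return intrinsic + sum(load_caps) / drive_strength

# one small driver switching 16 identical gate loads directly
print(stage_delay(drive_strength=1.0, load_caps=[1.0] * 16))   # 17.0 units

# same driver -> one 4x buffer (4x input cap, 4x drive) -> the 16 loads
d1 = stage_delay(drive_strength=1.0, load_caps=[4.0])          # drive the buffer
d2 = stage_delay(drive_strength=4.0, load_caps=[1.0] * 16)     # buffer drives loads
print(d1 + d2)                                                 # 10.0 units total
```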