Intel Cache Latency Question

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
After comparing a number of their high-end CPUs, I noticed that the cache latencies are very different across their brands. Below are a few examples:

Nehalem:
L1: 64KB / 4 cycles
L2: 256KB / 11 cycles
L3: 8MB / 39 cycles

Nehalem EX: (Beckton)
L1: 64KB / 4 cycles
L2: 256KB / 9 cycles
L3: 24MB / 63 cycles

Itanium 9300: (Tukwila)
L1: 64KB / 1 cycle
L2: 512KB (I) + 256KB (D) / 5-7 cycles
L3: 20MB-24MB / <39 cycles *
*I have not found the exact number here, but multiple references state lower than Nehalem.

So my question is, if Intel can produce such low latencies with current technology methods, why don't they do it for all brands?

At first I thought it might be a cost issue, with the fast caches reserved for the expensive chips. But Beckton is in the same price range as Itanium, so I tend not to think it's money related. And Gulftown is no cheap CPU either.

Is there some sort of limitation with x86 CPUs that prevents this?
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Latency alone does not provide the full picture of the performance capabilities of a given cache design. You also need to factor in associativity, ports, and so on, plus cell size (SRAM density) and operating voltage. It's complicated, and yes, the "yieldability" of the resultant SRAM/cache circuit is a factor, as well as development time and resources.

The "best" sram may simply be unfeasible to develop within a given time constraint and R&D resource constraint and that impacts what projects get the sram layout in time for inclusion into a cpu architecture.
 

aphorism

Member
Jun 26, 2010
41
0
0
a few things to notice:

the processors with the faster caches launched later.

itanium's caches are very different. there is no L1 data cache for floating point. L2 caches typically have 10-11 cycle access latency.

nehalem ex's L3 is probably slower because all 8 cores/ 16 threads share it.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
nehalem ex's L3 is probably slower because all 8 cores/ 16 threads share it.

And it's on that ring bus, broken up into 3MB domains. They're definitely pushing the topology to maintain coherency; you can see where they are going with their Light Peak research.
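For what it's worth, here is a crude back-of-the-envelope way to think about a sliced L3 on a ring; the per-hop and slice-access costs below are made-up placeholders, not Intel's numbers:

```python
# Toy model of a bidirectional ring with N cache slices: a request takes the
# shorter way around, so the average distance is roughly N/4 stops, and the
# observed L3 latency is a fixed slice-access cost plus the round-trip hops.

def avg_ring_hops(n_slices):
    dists = [min(d, n_slices - d) for d in range(n_slices)]
    return sum(dists) / n_slices

def l3_latency_estimate(n_slices, slice_cycles, hop_cycles):
    # slice_cycles and hop_cycles are illustrative, not measured values
    return slice_cycles + 2 * avg_ring_hops(n_slices) * hop_cycles

print(avg_ring_hops(8))               # ~2 hops on average for 8 slices
print(l3_latency_estimate(8, 30, 4))  # purely illustrative
```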
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
20,846
3,189
126
Westmere-EPs have lower latency than standard Nehalems.

They're the 32nm 12MB-cache quad-cores.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
Itanium cycle counts cannot be compared 1:1 with Nehalem's.
The average Nehalem runs in the 2.66-3.46 GHz range.

Itanium runs at around 1.6 GHz, so its cycles are approximately twice as long as those of the faster Nehalems.
Taking that into account, only the L1 appears to be particularly fast. But that probably has a lot to do with the instruction set as well (x86 has very complex addressing modes, making it more difficult to forward an address and predict memory accesses).
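To put that in wall-clock terms using the cycle counts from the first post, here is a quick sketch; the 3.2 GHz and 1.6 GHz figures are ballpark assumptions rather than exact SKUs:

```python
# Convert cache latencies from core cycles to nanoseconds at a given clock.

def cycles_to_ns(cycles, freq_ghz):
    return cycles / freq_ghz  # 1 cycle at 1 GHz = 1 ns

print(cycles_to_ns(4, 3.2))    # Nehalem L1:  4 cycles  -> 1.25 ns
print(cycles_to_ns(39, 3.2))   # Nehalem L3: 39 cycles  -> ~12.2 ns
print(cycles_to_ns(1, 1.6))    # Itanium L1:  1 cycle   -> 0.625 ns
print(cycles_to_ns(39, 1.6))   # a 39-cycle Itanium L3 would be ~24.4 ns
```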
 

khon

Golden Member
Jun 8, 2010
1,319
124
106
And it's on that ring bus, broken up into 3MB domains. They're definitely pushing the topology to maintain coherency; you can see where they are going with their Light Peak research.

As far as I know, Light Peak isn't designed for that, at least not according to the Intel presentation I heard a few months back. It's designed for medium-range transfers, not short-range ones like those between the CPU and cache.
 

aphorism

Member
Jun 26, 2010
41
0
0
And it's on that ring bus, broken up into 3MB domains. They're definitely pushing the topology to maintain coherency; you can see where they are going with their Light Peak research.

yeah, i'd also like to see the theoretical and net bandwidth of the L3 caches on nehalem-EX vs. magny-cours or power7.

i think with parallelism going full steam and global wire delays increasing we will see more focus on interconnects and bus topologies. optical interconnects and carbon nanotubes look promising imo.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Each of the four cores on the Itanium 9300 has its own 6MB L3 slice, and cache coherency for the L3 is managed by the on-die router using a directory. The L3 latency is 15 cycles.

There are two L2 caches on Tukwila. Just like the L1, which is split into instruction and data caches, the L2 is split as well: 256KB of data L2 and 512KB of instruction L2, with latencies of 5 and 7 cycles respectively.

So my question is, if Intel can produce such low latencies with current technology methods, why don't they do it for all brands?

CPU design, just like other architectural design, is a balancing act: you sacrifice one thing to gain in another. Larger capacities result in higher latency and lower bandwidth, faster caches may take more die area than slower ones, and even instruction set and architectural differences affect the outcome.

Plus, caches are all synchronized to the core clock, so the same absolute access time costs more cycles at a higher frequency: 1 cycle @ 100MHz = 10 cycles @ 1GHz.
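As a minimal sketch of that arithmetic (the 10 ns access time is a hypothetical figure, not any real part's spec):

```python
import math

# Count how many core cycles a fixed physical access time occupies at a clock.
def cycles_at(access_ns, freq_mhz):
    return math.ceil(access_ns * freq_mhz / 1000)  # ns * MHz / 1000 = cycles

print(cycles_at(10, 100))   # 1 cycle at 100 MHz
print(cycles_at(10, 1000))  # 10 cycles at 1 GHz
```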
 

NP Complete

Member
Jul 16, 2010
57
0
0
As others have said, cache design is a trade-off of density vs. features/size, primarily because of transistor capacitance, along with some resistance from wires/vias.

A basic transistor's gate looks like a capacitor (which stores charge), switching on or off once enough charge has been sourced or drained. In circuits, the gate of one transistor is connected to the output of another to create logic. A transistor of a given size only allows so much current to flow, so the more logic is added to a transistor's output, the more time it takes to source or sink enough current to switch all the gates connected to it. Larger transistors can drive more current and allow faster switching if needed, though there are diminishing returns when sizing up transistors or adding a bunch of large ones, since each transistor added to a logic circuit adds a delay between receiving input and driving output, and the larger the transistor, the more capacitance its gate has.

Making caches larger, adding more read/write ports, etc. all increase the amount of circuitry each SRAM cell has to drive, meaning the whole cache has to be clocked slower to operate correctly. The workaround is to add large buffers at the flops' (bits') outputs, which can drive more current and allow faster switching. Since overall CPU performance isn't solely determined by cache latency, it's quite a balancing act to choose the right cache size/latency/number of read & write ports.
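As a rough illustration of the fan-out and buffering argument above, here is a toy delay model; the drive strengths, load capacitances, and intrinsic delays are arbitrary units I made up, not real process numbers:

```python
# Gate delay grows with the capacitive load a driver must switch, so fanning
# one small driver out to many loads is slow; inserting a larger buffer splits
# the problem into stages, each driving a manageable load.

def stage_delay(drive_strength, load_caps, intrinsic=1.0):
    # delay ~ intrinsic + total load / drive strength, in arbitrary units
    return intrinsic + sum(load_caps) / drive_strength

# one small driver switching 16 identical gate loads directly
print(stage_delay(drive_strength=1.0, load_caps=[1.0] * 16))   # 17.0 units

# same driver -> one 4x buffer (4x input cap, 4x drive) -> the 16 loads
d1 = stage_delay(drive_strength=1.0, load_caps=[4.0])          # drive the buffer
d2 = stage_delay(drive_strength=4.0, load_caps=[1.0] * 16)     # buffer drives loads
print(d1 + d2)                                                 # 10.0 units total
```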