Cache question

Xpage

Senior member
Jun 22, 2005
I am not able to find cache latencies for Piledriver, at least with a basic Google search, though I am sure I have seen them before.

Ivy Bridge has much better latency than AMD's latest offerings, and Intel has always had an advantage here:

L1 Cache: 4 cycles (Ivy Bridge)
L2: 12 cycles (Ivy Bridge)
L3: 24 cycles (Ivy Bridge)
RAM: 133 cycles (Ivy Bridge @ 3.4GHz)
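
(For a rough sense of scale: at 3.4 GHz a cycle is about 0.29 ns, so the 4-cycle L1 hit is roughly 1.2 ns and the 133-cycle trip to DRAM is roughly 39 ns.)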


I am curious why they have an advantage when all SRAM should be designed the same, unless one 32nm process is actually smaller or larger than the other.

Also, since node shrinks are continuing, you would think the distance between two points would decrease, and since clock speeds have capped out (think ~5.0 GHz) the cycle time isn't shrinking either, so you should be able to either A) have a lower-latency cache or B) add more cache for the same latency.

Of course, thinner wires = more noise, more leakage, or other technical stuff I don't fully understand (not an EE).

I recall reading somewhere that if you cut L1$ latency from 4 to 3 cycles you get roughly a 5% performance improvement. Since AMD has cache latency issues, I think AMD should look at spintronic MRAM. Maybe they could get a jump on Intel with a large MRAM L3: MRAM latency is worse than SRAM, but AMD has poor L3 latency at the moment anyway, so they might as well take the much higher density, plus the other benefits of MRAM such as lower power usage.

I am not sure how much density would increase, but with HSA they would need a larger on-die cache anyway, especially if they start using a unified memory architecture. I don't think MRAM will be fast enough for L1, but I assume you could design wider systems around slower cache, or drop the clock speed.

any thoughts?
 

TuxDave

Lifer
Oct 8, 2002
Also, since node shrinks are continuing, you would think the distance between two points would decrease, and since clock speeds have capped out (think ~5.0 GHz) the cycle time isn't shrinking either, so you should be able to either A) have a lower-latency cache or B) add more cache for the same latency.

Assuming you scale everything by a factor of 2, classical wire delay slows down by ~4x, so the wire delay gets worse even though things are closer together. Devices generally get faster, so whether or not you can get more DONE in a single cycle depends heavily on the ratio of wire and device delays.
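
To put rough numbers on that, here is a toy lumped-RC sketch (parallel-plate capacitance, made-up units, so only the ratios between scenarios mean anything; the ~4x case is a wire whose cross-section shrinks while its length stays fixed):

Code:
# Toy lumped-RC wire model: R = rho*L/(w*t), C = eps*w*L/h, delay ~ R*C.
# Units are arbitrary; only the ratios between the scenarios matter.
def wire_delay(length, width, thickness, spacing, rho=1.0, eps=1.0):
    r = rho * length / (width * thickness)      # resistance
    c = eps * width * length / spacing          # capacitance to the layer below
    return r * c

base = wire_delay(length=1.0, width=1.0, thickness=1.0, spacing=1.0)

# Cross-section and spacing shrink 2x but the wire still spans the same
# distance (a "global" wire): delay gets ~4x worse.
global_wire = wire_delay(length=1.0, width=0.5, thickness=0.5, spacing=0.5)

# Everything shrinks 2x, length included (a "local" wire): delay stays ~flat,
# which is still a relative loss because the transistors got faster.
local_wire = wire_delay(length=0.5, width=0.5, thickness=0.5, spacing=0.5)

print(global_wire / base, local_wire / base)    # -> 4.0, 1.0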

As for why various CPUs have various cache latencies, the latency isn't just the SRAM read timing. As far as I understand it, it's a whole datapath loop that covers getting the data out of the cache and returning it to its useful location, as well as checking for various conditions along the way. So there's a lot of control logic and decoding in those cycles as well. Different architectures may lend themselves to different latencies depending on how they handle those controls, what those controls do, AND, on top of that, process performance variations.

I'm sure everyone would love to just cut L1 latency to 3, but if they haven't done it, it's probably because they don't know how to get it to 3 or the scope of change is too great to get it to 3.

I haven't read up on MRAM, so I don't know the advantages/disadvantages. If timing, density, cost of integration, and validation are all on the "better" side, then yeah, MRAM replacing SRAM would probably be an obvious choice. Unfortunately I think at least one of those terms is on the "worse" side, but I don't really know.
 

Idontcare

Elite Member
Oct 10, 1999
I am curious why they have an advantage when all SRAM should be designed the same, unless one 32nm process is actually smaller or larger than the other.

I had the pleasure of being intimately involved in the design of sram cells for many years, and the bottom line is that just as modes of transportation (cars, trucks, mini-vans, SUVs, boats, ships, submarines, cruise liners, aircraft carriers, helicopters, airplanes, jets, biplanes, space planes, space shuttles, space stations, etc.) are diverse and varied owing to their targeted applications, so too is sram.

Even within a seemingly narrow product segment such as you are highlighting, comparing two products that compete in the exact same space (Intel x86 desktop processor and AMD x86 desktop processor), the trade-offs that have to be made are still dauntingly numerous.

For starters you have the project triangle in play.

How much money are you investing in your sram cell design? How soon do you want it to be production-worthy? How much power do you want it to consume? How fast do you want it to clock? What kind of yield target are you shooting for? How much array redundancy are you willing to allocate? How much die area per sram cell are you willing to allocate?

Just as there are design trade-offs that compel a project manager at some point to tell their team "folks, a helicopter design just isn't going to get the job done, it is too slow, we need to start thinking jet plane going forward", so too will there be a bevy of choices initially entertained and from there the program manager will start to whittle down the list.

No two srams are designed to accomplish the same thing, with the same resources, on the same schedule, with the same risk-acceptance profile.

For example, at Texas Instruments, on the same node we'd design sram cells that were slow in clockspeed, had a low-power footprint, and had an extremely small cell size for the mobile designs (cellphones, etc.). And on that same node we'd develop comparatively huge (2-3x the size) sram cells that operated at higher voltage, higher clockspeeds, and higher power consumption for use in Sun Microsystems' processors.

You'll see this same tiered hierarchy effect in play in both Intel's and AMD's sram. L1$ will be fast, low-latency, but large cell size. L2$ may use the same cells but run at higher latency to save power (or to increase yields) or they will use slightly smaller cells (to save die-size, or to increase total L2$ array size), and L3$ will have even smaller cells running at even lower clockspeed - increasing the amount that can be added to the chip for a given die-size target and power-usage profile.

I'm sure AMD would love for GloFo to offer sram arrays electrically and physically equivalent to Intel's, but in order for GloFo to do that they would have to invest the same resources into optimizing and characterizing their sram designs, and GloFo is probably not going to do that (ever).
 

pablo87

Senior member
Nov 5, 2012
I have a similar thought to the OP relating to cache/SRAM. BITD, a PC with 4MB of main memory was paired with 256KB of L2 cache (8 x 256Kbit chips). Cost was around $2-3 per chip. So why don't we have 256MB cache systems today??
 

sefsefsefsef

Senior member
Jun 21, 2007
As for why various CPUs have various cache latency, the latency isn't just the SRAM read timing. As far as I understand it, it's a datapath loop that starts from how to handle getting the data from the cache and returning the data to its useful location as well as checking for various conditions along the way. So there's a lot of control logic and decoding in those cycles as well. Different architectures may lend itself to different latencies depending on how they handle those controls, what those controls do AND on top of that, process performance variations.

This is right. The SRAM read is the least of your worries when you're talking about cache performance. Even traversing the bit-lines (wire delay) isn't necessarily the worst part. A data cache read hit involves (at least) the following:

1. Address generation (this might not be factored into the "4 cycle" numbers you always see, I'm not sure).
2. Select bits out of that address to choose which cache set to access, and then feed the address's tag bits to that set in the tag array for the next step.
3. Perform a content-addressable-memory (CAM) lookup on the tag array for the appropriate cache set, using the address's tag bits. (This can be done in parallel or sequentially, or sequentially+way-predicted, which is very common)
4. Identify which, if any, of the ways in that set are a hit for this address's tag.

All that is just to identify if and where the data you're looking for is in the cache's data array. Once you've made that determination, you have to actually get the data out.

(What follows is one possible implementation for how to do the next part, and I don't know if it's common or not)
5. ***Use the combination of set number + way number to light up the appropriate region of the data array, and perhaps bring the whole cache line into a buffer.***
6. Use the offset bits of your original address + the size of the load op to pick out which bytes you are really interested in. This may involve some kind of alignment or shifting operation.
7. Wire those bytes up to some output port so the cache can send the data where needed.

The part in the ***s is the main one that really relies on the performance of SRAM cells. Well, maybe that and the CAM lookup. But the point is, a lot goes into every single cache read, and speeding up SRAM cells probably isn't the solution to having faster caches. If you want to speed up a cache, find a way to eliminate or speed up some of those other steps.
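
To make those steps concrete, here's a toy software model of a read hit in a hypothetical 32 KB, 8-way, 64-byte-line cache (the parameters and the sequential flow are mine; real hardware overlaps the tag compare and the data-array access, and the TLB is ignored entirely):

Code:
# Toy model of steps 2-6 for a hypothetical 32 KB, 8-way, 64 B/line cache.
LINE_SIZE = 64                                     # bytes per line -> 6 offset bits
NUM_WAYS  = 8
NUM_SETS  = 32 * 1024 // (NUM_WAYS * LINE_SIZE)    # 64 sets -> 6 index bits

OFFSET_BITS = LINE_SIZE.bit_length() - 1           # 6
INDEX_BITS  = NUM_SETS.bit_length() - 1            # 6

tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]              # tag array
data = [[bytes(LINE_SIZE)] * NUM_WAYS for _ in range(NUM_SETS)]  # data array

def read(addr, size):
    # Step 2: slice the address into byte offset, set index, and tag.
    offset = addr & (LINE_SIZE - 1)
    index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)

    # Steps 3-4: compare the tag against every way of the selected set.
    for way in range(NUM_WAYS):
        if tags[index][way] == tag:
            line = data[index][way]              # step 5: read out the line
            return line[offset:offset + size]    # step 6: select the bytes
    return None                                  # miss: go ask L2

Every one of those slices, compares, and selects is real logic sitting in the load-to-use path, which is the point about the latency being much more than the SRAM read itself.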
 

Tuna-Fish

Golden Member
Mar 4, 2011
This is right. The SRAM read is the least of your worries when you're talking about cache performance. Even traversing the bit-lines (wire delay) isn't necessarily the worst part. A data cache read hit involves (at least) the following:

1. Address generation (this might not be factored into the "4 cycle" numbers you always see, I'm not sure).

In x86 it is. Note that address generation itself is complex enough to warrant a few steps.
First, you need to compute the virtual address based on the instruction and address mode (so, for example, VA = base + 4*offset).
Then, you need to convert that to a physical address by looking up the virtual address in the TLB. It does help that this phase can be done in parallel with your step 2.
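
A toy sketch of those two steps (hypothetical 4 KB pages, the TLB modeled as a plain dict, misses and permissions ignored):

Code:
PAGE_SIZE = 4096                      # hypothetical 4 KB pages -> 12 offset bits
tlb = {}                              # virtual page number -> physical page number

def load_address(base, index, scale=4, disp=0):
    va = base + scale * index + disp          # e.g. VA = base + 4*offset
    vpn, page_off = divmod(va, PAGE_SIZE)     # split into page number + offset
    ppn = tlb[vpn]                            # assume a TLB hit; a miss means a page walk
    return ppn * PAGE_SIZE + page_off         # physical address

Since the low page-offset bits are the same before and after translation, the cache can start picking its set from them while the TLB lookup is still in flight, which is why the translation can overlap step 2.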

It's also important to note that even once you know the piece of memory you want, actually accessing it requires log2(size_of_cache) transistor switches, as you need to get the signal to the cell.
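(As a rough worked example: a 32 KB array holds 262,144 bits, so just reaching one of them means something like log2(262144) = 18 levels of decode and fan-out, before you even count the wire in between.)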

Generally, the speed of the SRAM cache cells themselves is not particularly relevant to the total access time. Even if your memory cells took zero time to read, there probably isn't a full clock cycle worth of savings in the total cache access pipeline for high-end CPU L1 caches. L2 or L3 caches might be made a little faster if you can make the cells smaller -- propagation latencies begin to be relevant there. This is very relevant with eDRAM: the read latency of IBM's eDRAM cell is something like 100 times longer than that of their SRAM, but by building their L3 out of eDRAM, the access latency ends up smaller than it would be for a similarly sized pool of SRAM, because the denser array is physically much smaller.

This is a bit of a pet peeve of mine when we are talking about main RAM. The memory wall is a big problem for programmers, and when talking about it, a lot of them think we might be able to defeat it by moving to some new awesome kind of memory that's just around the corner (and probably uses memristors). Not really. The act of selecting one element out of a pool of a few gigabytes is inherently very slow compared to, say, adding two numbers together. It really doesn't matter *that* much how fast the memory itself is. Moving from DRAM to some faster kind of memory could perhaps buy us main RAM that is 2-3 times faster, but that's about it. It would still be slow enough to be a problem for programmers.
 

ShintaiDK

Lifer
Apr 22, 2012
I have a similar thought to the OP relating to cache/SRAM. BITD, a PC with 4MB of main memory was paired with 256KB of L2 cache (8 x 256Kbit chips). Cost was around $2-3 per chip. So why don't we have 256MB cache systems today??

256MB cache with the proper speed would be like...100W? ;)

Also it would take up a huge amount of die area. Just look at Itanium or the high-end Xeons to see how much die space the cache takes.

The old cache you talk about was also external and optional, and vastly slower. Smaller and faster on-die cache was a much better option.
 

pablo87

Senior member
Nov 5, 2012
256MB cache with the proper speed would be like...100W? ;)

Also it would take up a huge amount of die area. Just look at Itanium or the high-end Xeons to see how much die space the cache takes.

The old cache you talk about was also external and optional, and vastly slower. Smaller and faster on-die cache was a much better option.

That's what I'm trying to understand: why SRAM didn't keep up with DRAM in speed and density.
 

Idontcare

Elite Member
Oct 10, 1999
That's what I'm trying to understand: why SRAM didn't keep up with DRAM in speed and density.

Money, and DRAM.

It takes a lot of money to keep redesigning standalone sram ICs for each new node. If there isn't much of a market for them, there won't be much money to spend on shrinking them at successive nodes.

Dram far surpassed the speed of external sram, and there is a huge market for dram, so the revenue is there to support continued R&D investment into shrinking it for ever-higher densities and lower power consumption.

Same reason your smartphone isn't sporting a silly tiny CRT for the screen, instead it is an LCD. All the miniaturization R&D went into LCD tech, so CRT was never an option even if you wanted it.
 

Xpage

Senior member
Jun 22, 2005
Thank you all for the replies, this has been a good read. Since the speed of the SRAM cells themselves seems not to be the limiting factor, a slight delay in switching times from MRAM doesn't seem like too significant an issue, aside from the changes needed to produce the new type of RAM without contaminating the chip processing equipment.

Of course, this also assumes no issues crop up in shrinking MRAM to catch it up to the current advanced CMOS processes in use.

http://en.wikipedia.org/wiki/Magnetoresistive_random-access_memory
 

ShintaiDK

Lifer
Apr 22, 2012
SRAM is absolutely the fastest memory. I think you got confused with the external vs internal SRAM part. MRAM is another dream that goes nowhere. Looks good on paper, but products...

Hardly any MRAM news these days. It seems it peaked in the bubble years.

Latest news from this year:
Everspin is currently sampling a 64Mb MRAM device that promises bandwidth of 3.2GB/s in a 16-chip configuration.

To compare, an 8-chip 4GB DRAM DIMM gives you 12.8GB/sec for essentially no money, versus a 3.2GB/sec 128MB MRAM module with no price information or expectations. Not hard to pick what's best, is it?
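
(Sanity-checking the numbers: 16 chips x 64 Mb = 1024 Mb = 128 MB, so the entire MRAM module holds a quarter of one of the eight 512 MB DRAM chips on that DIMM, while delivering a quarter of the bandwidth.)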
 

pm

Elite Member Mobile Devices
Jan 25, 2000
I'm sure everyone would love to just cut L1 latency to 3, but if they haven't done it, it's probably because they don't know how to get it to 3 or the scope of change is too great to get it to 3.
I worked on a single-cycle (latency=1) 16kB first-level cache design years ago (http://www.ptlsim.org/papers/Research/isscc_2002_3s.pdf; pm => Patrick Mahoney). The clock rate for the CPU was on the low side, the lookup used a bit of a trick ("prevalidated tags") that isn't easily scalable, and some of the circuits we used were a bit finicky. :) But it worked. My chunk of the design was the big-endian/little-endian double-pumped rotator circuit (slide 18).

One might ask why people don't use this method in more modern CPUs; it's because the circuitry was aggressive, and the method we used was specific to Itanium and not scalable... we used a trick. But you can get latencies down if you spend the time to design things out and really think them through... this goes back to IDC's design triangle in post #3 above. You can do amazing things if you give a decent-sized team time to optimize the design for a limited corner case (limited scope, large resources, long schedule).
 