Cache memory

uribag

Member
Nov 15, 2007
Hi guys.

I'm just a computer user who wants to learn how things work in this fascinating world.
So don't be hard on me.

I was just thinking about quad cores, cache memory and its levels: L1, L2 and L3.

Is it possible to make a structure with only one big cache that is internally divided to work with the same hierarchy, but with one big difference: if only one core is being used, it could access the whole cache (4x L1, 4x L2 and the L3); if two cores are being used, they could share the cache equally (2x L1 and 2x L2 each, with the L3 for communication). When three or four cores are being used, each would fall back to its individual cache (1x L1, 1x L2 and the L3).

I think this is important because nowadays the majority of programs don't use more than two cores.

What are the cons of this idea? Would it be a mess?

 

Idontcare

Elite Member
Oct 10, 1999
This is what "shared" cache means.

The fact that the L3 cache on K10 and Nehalem is shared (all cores have access to all the info contained in it) is not what makes it an L3 cache.

If you were to remove the L1 and L2 caches from Phenom, leaving only the shared L3 cache, then the L3 would become (by namesake only) the L1 cache, it would be shared (by architectural design), and you would have your exact proposal.

The trade-off then is one of cache size (a bigger cache tends to have higher latency), cache location on the die (farther from the critical FPU/ALU means more latency), and the number of ports needed (how many cores access the data).

So which is faster (as in lower latency and higher bandwidth): small, dedicated, core-specific L1/L2 caches backed by a shared L3 cache that is slower but much more massive, or a single unified shared cache that, by virtue of its size and necessary die placement, incurs significantly more latency than those dedicated L1 and L2 caches?

Current CPU designs tell you the answer.
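
If you want to see that answer for yourself, here's a rough pointer-chasing sketch in C (my own, so treat it as a sketch - the 64-byte line size and the iteration count are assumptions, adjust for your CPU). It walks a random cycle of cache lines so every load depends on the one before it, and prints nanoseconds per load as the working set grows. Build with something like gcc -O2 chase.c.

```c
/* Pointer-chase latency sweep: the printed latency climbs in steps as the
   working set outgrows each cache level (L1 -> L2 -> L3 -> main memory). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64                             /* assumed cache-line size in bytes */

int main(void)
{
    srand(1);
    for (size_t kb = 4; kb <= 32 * 1024; kb *= 2) {
        size_t n = kb * 1024 / LINE;        /* cache lines in the working set */
        char *buf = malloc(n * LINE);
        size_t *order = malloc(n * sizeof *order);
        if (!buf || !order) return 1;

        /* Fisher-Yates shuffle: a random permutation of the lines. */
        for (size_t i = 0; i < n; i++) order[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }

        /* Each line stores a pointer to the next line in the cycle, so every
           load depends on the previous one and the prefetchers can't help. */
        for (size_t i = 0; i < n; i++)
            *(void **)(buf + order[i] * LINE) = buf + order[(i + 1) % n] * LINE;

        void *p = buf + order[0] * LINE;
        const size_t iters = 20 * 1000 * 1000;
        clock_t t0 = clock();
        for (size_t i = 0; i < iters; i++)
            p = *(void **)p;                /* the dependent-load chain */
        double ns = (double)(clock() - t0) * 1e9 / CLOCKS_PER_SEC / iters;

        /* Printing p keeps the compiler from deleting the loop. */
        printf("%6zu KB : %5.1f ns/load (%p)\n", kb, ns, p);
        free(order);
        free(buf);
    }
    return 0;
}
```

Each step up in the printout is a cache level falling out of reach; past the L3 size you are looking at main-memory latency, which is part of why nobody builds the one giant cache.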

http://www.anandtech.com/cpuch...howdoc.aspx?i=3382&p=9

http://www.intel.com/technolog.../5-cache-heirarchy.htm

http://www.sun.com/blueprints/1102/817-0742.pdf

http://en.wikipedia.org/wiki/CPU_cache
 

pm

Elite Member Mobile Devices
Jan 25, 2000
While you can definitely do this with L3 caches - and in fact, most will do exactly what you are proposing automatically - it wouldn't be effective for lower-level caches. With caches, there's an issue of size versus latency. You can make big caches and you can make fast caches, but you can't have both. So the largest cache - the L3 - is the slowest, and the L1 is the fastest (and smallest). You want to put the caches close to where they will be used - so the L1 is usually right next to the register file and the integer execution unit - so that you can read the data out and use it right away. Otherwise you waste multiple clock cycles buffering (nanometer-sized wires are not really great wires) and pipelining the signal to get it from where it is to where it needs to go.

The problem with what you are proposing is that multi-core CPUs are designed by essentially rubber-stamping out one core four times, then surrounding the cores with L3 cache and connecting them to each other and to the outside world. Since you are rubber-stamping out the cores, the caches are not going to be close to where the data needs to be "consumed", and you'll almost certainly end up with it taking longer to get data from core 4's L1 to core 0's integer unit than it would take to read it out of core 0's L2 cache - or maybe even the L3. It's a good idea, but I don't think it would work without a greater investment in the complexity of multi-core CPU design - and the design teams are already pretty big.
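
A quick way to feel that cross-core cost is a false-sharing toy (again my own sketch, with the 64-byte pad assumed to match the cache-line size; build with something like gcc -O2 -pthread pingpong.c on Linux). Two threads hammer counters that land on the same cache line, which forces that line to bounce between the cores' private caches; the padded version keeps each counter on its own line.

```c
/* False-sharing demo: same work either way, but the shared-line case forces
   the cache line to ping-pong between the two cores' private caches. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL                   /* increments per thread */

struct shared { volatile long a, b; };      /* a and b (very likely) share a line */
struct padded { volatile long a; char pad[64]; volatile long b; };  /* separate lines */

static void *bump(void *arg)
{
    volatile long *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        (*c)++;                             /* each write drags the line over here */
    return NULL;
}

static double timed_pair(volatile long *x, volatile long *y)
{
    pthread_t t1, t2;
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    pthread_create(&t1, NULL, bump, (void *)x);
    pthread_create(&t2, NULL, bump, (void *)y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    static struct shared s;
    static struct padded p;
    printf("counters on one cache line : %.2f s\n", timed_pair(&s.a, &s.b));
    printf("counters on separate lines : %.2f s\n", timed_pair(&p.a, &p.b));
    return 0;
}
```

On a multi-core machine the first number usually comes out several times larger than the second, and that gap is essentially the cost of a line shuttling between cores instead of staying put in one private L1.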


Edit: IDK beat me to it. There were no replies when I started typing. At least our answers are more or less the same. :)
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: pm
Edit: IDK beat me to it. There were no replies when I started typing. At least our answers are more or less the same. :)

:)

On the subject of getting cache where it needs to be (interconnect trace-wise), do you have any thoughts on whether SRAM stacking (a la through-silicon vias) might actually help in this regard?

Obviously it is plausible, but by "help" I mean: would it provide enough of a latency benefit to outweigh the cost and manufacturing complexity involved, versus expending those resources on designing a better monolithic die with more cache-savviness from the outset?
 

pm

Elite Member Mobile Devices
Jan 25, 2000
I'm way out of touch with process technology nowadays - I've read your posts, and you know much more than I do about these things. The last circuit design I did was the last stage H-tree clock network on Montecito. For the last couple of years, I've been working on structural test.

Still, for all that, I think it would be a clear gain in terms of how many FETs you can hang off of the sense amp and still get it to evaluate quickly. The RC effects of those long bit-lines would effectively be cut in half. But to my way of thinking, Intel is a manufacturing company that happens to design microprocessors. :) So I can't imagine that Intel would adopt FET stacking any time soon, due to the increased mask steps (and they are the expensive masks, too) and the alignment and planarization issues. Cost and yield are king, and stacked FETs would be a hit to both. I've always wondered what it would do for power dissipation, too.

It would be neat to stack the pFETs on one layer and the nFETs on another and get rid of all the well-spacing DRC issues. If everything on a layer were all nFETs or all pFETs, spacing could be much tighter.