For a given system that never evolves, there is a cache-size sweet spot that you don't want to stray too far from.
What we have in real life is ballooning hardware and a ballooning library of instructions to run on it. On top of that, the market has more or less decided it will be increasingly unkind to 150+ watt CPUs, which pins us to roughly the 2.0-4.0 GHz operating range for now. With frequency held roughly constant, RAM latency is going to get worse at a faster rate than cache latency.
Now the increasing size and usefulness of the IGP, as well as the increasing instruction-level intimacy between x64 and GPGPU, will place greater demands on the cache system. If you think 30 cycles at 3 GHz is bad, then what about 30 cycles at 700 MHz? I don't even know what speed the IGP or the cache actually runs at; I'm just spitballing to illustrate.
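To put rough numbers on that spitball, here's the cycles-to-nanoseconds conversion; neither clock below is a real core, IGP, or cache frequency, just the illustrative guesses from above:

```python
# Rough latency arithmetic: the same cycle count costs very different
# amounts of wall-clock time depending on which clock it is counted in.
# 3 GHz and 700 MHz are just the illustrative guesses from the post,
# not real core/IGP/cache clocks.

def cycles_to_ns(cycles, freq_ghz):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles / freq_ghz

for freq_ghz in (3.0, 0.7):
    print(f"30 cycles at {freq_ghz:.1f} GHz = {cycles_to_ns(30, freq_ghz):.1f} ns")

# Output:
# 30 cycles at 3.0 GHz = 10.0 ns
# 30 cycles at 0.7 GHz = 42.9 ns
```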
Intel hasn't had to make any production decisions on this yet, but IBM has. What would you do in 2005 if you had to share a nice CPU and GPU with a horrible memory system? Would you rather give the GPU a 64-bit GDDR3 sideport, or would you wedge in the largest eDRAM you could possibly fit?
Consumer Haswell systems will probably continue to use 128-bit DDR3-1333/1600 and so on, but some of the higher-end parts will be 8-threaded CPUs with a larger, more evolved version of the HD 4000. So Intel is definitely between a rock and a hard place here, and the only elbow room left is the presumed maturity and density of its 22 nm process, which could enable 24 MB or larger last-level caches. Not a bad stopgap until faster, wider, and cheaper memory systems come to town.
Can someone compute the Romley cache area and extrapolate it to 22 or 16 nm? What about higher-density 1T-SRAM-Q or eDRAM?
edit: Romley's 15 MB L3 cache needs about 108 mm². 1T-SRAM-Q could fit 15 MB into 17 mm² (on 45 nm SOI).
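Taking a stab at my own question with naive ideal scaling: this is just a sketch, assuming area shrinks with the square of the feature size (which real SRAM arrays never quite achieve) and that the 108 mm² figure above is for the 32 nm node:

```python
# Back-of-the-envelope cache area scaling. Assumes ideal scaling
# (area shrinks with the square of the feature size), which real SRAM
# arrays never quite hit, and takes the ~108 mm^2 / 15 MB figure above
# for the 32 nm Romley L3 as the starting point.

base_area_mm2 = 108.0   # ~15 MB L3 at 32 nm
base_node_nm = 32.0
base_mb = 15.0

for node_nm in (22.0, 16.0):
    area = base_area_mm2 * (node_nm / base_node_nm) ** 2
    mm2_per_mb = area / base_mb
    print(f"{node_nm:.0f} nm: ~{area:.0f} mm^2 for 15 MB "
          f"(~{mm2_per_mb:.1f} mm^2/MB, so 24 MB would be ~{24 * mm2_per_mb:.0f} mm^2)")

# Output:
# 22 nm: ~51 mm^2 for 15 MB (~3.4 mm^2/MB, so 24 MB would be ~82 mm^2)
# 16 nm: ~27 mm^2 for 15 MB (~1.8 mm^2/MB, so 24 MB would be ~43 mm^2)
```

Even if the real shrink falls well short of ideal, that makes a 24 MB last-level cache look affordable on a big 22 nm die, which is basically the stopgap argument above.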