Haswell to include an L4 cache?

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,657
136
You would think that eventually you would run into a point where adding more cache levels would be pointless. I mean, each time you add another level, that increases the latency to system memory. As system memory gets faster and you start adding more channels, and the RAM is now talking directly to the CPU, is the next system performance increase really adding more latency to that?
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Eventually I imagine we will get to that point. But as it stands today, all 3 levels of cache are still much faster than DDR3 RAM. And it is possible to add a 4th layer which will still be faster than RAM (for now).
 

alyarb

Platinum Member
Jan 25, 2009
2,425
0
76
For a given system that never evolves, there is a sweet spot that you don't want to get too far from.

What we have in real life is a ballooning piece of hardware and a ballooning library of instructions to run on it. Not only that, but the market, more or less, has decided it will be increasingly unkind to 150+ watt CPUs, so this limits us to the 2.0-4.0 GHz operating range for now. Holding frequency somewhat constant, RAM latencies are going to get worse at a faster rate than cache latencies.

Now the increasing size and usefulness of the IGP as well as the increasing instruction-level intimacy between x64 and GPGPU will place greater demands on the cache system. If you think 30 cycles at 3 GHz is bad, then what about 30 cycles at 700 MHz? I don't even know what speed the IGP or cache runs at, I'm just spitballing to illustrate.
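
To put rough numbers on the cycles comparison above, here is a minimal Python sketch. The 30-cycle latency and both clock speeds are the illustrative values from the post, not measured figures for any real CPU or IGP, and `cycles_to_ns` is just a helper name made up for this example.

```python
def cycles_to_ns(cycles: int, clock_hz: float) -> float:
    """Convert a latency given in clock cycles to nanoseconds."""
    return cycles / clock_hz * 1e9

# Illustrative values taken from the post, not measurements.
cpu_ns = cycles_to_ns(30, 3.0e9)   # 30 cycles at 3 GHz
igp_ns = cycles_to_ns(30, 700e6)   # 30 cycles at 700 MHz

print(f"30 cycles at 3 GHz:   {cpu_ns:.1f} ns")
print(f"30 cycles at 700 MHz: {igp_ns:.1f} ns")
```

The same cycle count costs a bit over four times as much wall-clock time at the lower clock, which is the point being made.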

Intel hasn't had to make any production decisions on this yet, but IBM has. What would you do in 2005 if you had to share a nice CPU and GPU with a horrible memory system? Would you rather give the GPU a 64-bit GDDR5 sideport or would you wedge in the largest eDRAM that you could possibly fit?

Consumer Haswell systems will probably continue to use 128-bit DDR3-1333, 1600, etc, but some of the higher-end parts will be 8-threaded CPUs with some larger, more evolved version of the HD 4000. So they are definitely between a rock and a hard place here, and the only elbow room they have left is the presumed maturity and density of their 22nm tech which could enable 24MB or larger last level caches. Not a bad stopgap until faster/wider and cheaper memory systems come to town.


Can someone compute the Romley cache area and interpolate to 22 or 16nm? What about higher density 1T-SRAM-Q or eDRAM?

edit: Romley's 15MB L3 cache needs about 108 sq mm. 1T-SRAM-Q could fit 15 MB into 17 sq mm (on 45nm SOI).
 

Topweasel

Edrick said:
Eventually I imagine we will get to that point. But as it stands today, all 3 levels of cache are still much faster than DDR3 RAM. And it is possible to add a 4th layer which will still be faster than RAM (for now).

But it's not about being faster. It will always be faster. But the CPU, on pretty much every job, is constantly going back to system memory. Is it really smart to keep adding layers, for software that keeps getting more and more bloated, in front of the one part of the system that can in theory handle all of the information? Is there really going to be enough space savings, or allowance for the CPU cores to get clocked higher, to justify the increased latency to the CPU?

I guess the answer is yes, because memory speed, bandwidth, and latency seem to have very, very little effect on system performance of Intel-based systems. But I think eventually something's got to give.
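
One way to frame that trade-off is a break-even hit rate. The sketch below assumes the simplest possible model: every access that misses L3 first pays the full L4 lookup, and only then goes to DRAM if L4 also misses (no parallel or speculative DRAM access). The 30 ns and 70 ns latencies are hypothetical placeholders, not figures for Haswell or any other part.

```python
def miss_cost_with_l4(l4_ns: float, l4_hit_rate: float, mem_ns: float) -> float:
    """Average cost of an L3 miss when a serial L4 lookup sits in front of DRAM."""
    # Every L3 miss pays the L4 lookup; only L4 misses continue to DRAM.
    return l4_ns + (1.0 - l4_hit_rate) * mem_ns

def break_even_hit_rate(l4_ns: float, mem_ns: float) -> float:
    """L4 hit rate above which the extra level lowers the average miss cost.

    Derived from: l4_ns + (1 - h) * mem_ns < mem_ns  =>  h > l4_ns / mem_ns
    """
    return l4_ns / mem_ns

L4_NS = 30.0    # hypothetical L4 lookup latency
MEM_NS = 70.0   # hypothetical DRAM latency without an L4 in the way

threshold = break_even_hit_rate(L4_NS, MEM_NS)
print(f"L4 pays off above a hit rate of {threshold:.0%}")
for h in (0.2, 0.5, 0.8):
    print(f"hit rate {h:.0%}: {miss_cost_with_l4(L4_NS, h, MEM_NS):.1f} ns "
          f"vs {MEM_NS:.1f} ns without L4")
```

Under this model the extra level only hurts when its hit rate falls below the ratio of its own latency to DRAM latency, so the question becomes how often real workloads would hit in it.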
 

alyarb

It sounds like you just want instantaneously fast RAM. While we would love that too, no one is going to develop that without decades of intermediate development of cache and RAM technologies. Increasing the complexity of the cache hierarchy is merely one such intermediate development.

We can't just keep clocking higher. Been there and done that.
 

Edrick

Plus with all the new instructions being introduced with Haswell, and the huge FP performance increase that will come along with that, it makes sense that Intel would dedicate the L3$ to just the CPU cores instead of sharing it with the IGP, which will also be more bandwidth hungry.

I am trying to find information on Haswell's new cache structure. Wiki (yes, I know it's wiki) has for a long time had these values listed: 64KB+64KB L1 per core, 1MB L2 per core, and up to 32MB L3, which based on an 8-core design would mean 4MB L3 per core. Very interesting if true, but I will wait to see a preview before I believe anything.
 

Khato

Golden Member
Jul 15, 2001
1,240
306
136
I'm somewhat disappointed with the bit-tech rehash of the original vr-zone article - I don't even see it mentioning the fact that the L4 would be a separate die, which is quite clear in what appears to be a package drawing in the vr-zone article - http://vr-zone.com/articles/mystery...up-the-graphics-ante-further-again/15272.html

As for the purposes of it... I'm not certain why anyone would expect it to be for anything other than graphics. Larger caches for the CPU don't make much sense with typical consumer workloads, and it's the graphics that needs large amounts of high-bandwidth memory. Well, theoretically: the SNB iGPU couldn't care less about memory bandwidth, so it will be interesting to see how the IVB iGPU reacts to varying memory bandwidth.
 

gevorg

Diamond Member
Nov 3, 2004
5,070
1
0
With a shrinking and more mature process, why not just add on-chip RAM, like 4C/8T with 1GB RAM and no IGP? Too much of a niche? :) Might be great for high-end gaming and server/workstation-class machines.
 

exdeath

Lifer
Jan 29, 2004
13,679
10
81
WTB STT-MRAM for main memory and eliminate the HDD and the L2/L3 cache.

When your main memory is non-volatile and high capacity, you don't need an HDD/SSD anymore.

When that same memory is as fast as SRAM, you don't need CPU cache, and can instead utilize the die space previously used by cache for 16+ cores.

Win/win.
 

greenhawk

Platinum Member
Feb 23, 2011
2,007
0
71
Edrick said:
up to 32MB L3, which based on an 8-core design would mean 4MB L3 per core.

Probably like every other bit of advance news for CPUs, such as the S2011: the $1000 chip might have that much cache, but the ones that the average hardcore user can buy will have about half that.
 

GammaLaser

Member
May 31, 2011
173
0
0
exdeath said:
When that same memory is as fast as SRAM, you don't need CPU cache, and can instead utilize the die space previously used by cache for 16+ cores.

Even SRAM gets slower when you make the memory array larger (think increasing hit latencies going from L1->L2->L3, which goes from ~3-4 cycles in L1 to dozens of cycles at L3). The same can be said about other solid-state memory technologies. As long as this is true there will always be a need for very small but very fast caches.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,474
1,957
136
exdeath said:
WTB STT-MRAM for main memory and eliminate the HDD and the L2/L3 cache.

When that same memory is as fast as SRAM, you don't need CPU cache, and can instead utilize the die space previously used by cache for 16+ cores.

Wouldn't work. A large part of the time spent accessing memory goes to selecting the relevant line and getting a signal there and back. No matter how fast the memory cells themselves are, you are limited by the speed of light over the physical size and distance of the array, and by log2(array size) + 1 transistor switches.

Even if your memory cells themselves have zero latency (and MRAM sort of does for reads!), if your pool is gigabytes in size, you *will* need L2 and L3 cache.
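
Both limits can be sketched with a few lines of Python. This only illustrates the rule of thumb in the post: the 64-byte line size, the 32 KB and 4 GB array sizes, and the 100 mm distance are assumptions picked for the example, and real on-chip and board-level signals travel well below the speed of light, so the round-trip number is a floor rather than an estimate.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # physical constant

def decode_levels(num_lines: int) -> int:
    """Rule of thumb from the post: log2(array size) + 1 switch levels."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, integers only.
    return (num_lines - 1).bit_length() + 1

def round_trip_ns(distance_mm: float) -> float:
    """Best-case there-and-back signal time at the speed of light."""
    return 2 * (distance_mm / 1000.0) / SPEED_OF_LIGHT_M_PER_S * 1e9

LINE_BYTES = 64                               # assumed line size
small_lines = 32 * 1024 // LINE_BYTES         # hypothetical 32 KB array
large_lines = 4 * 1024**3 // LINE_BYTES       # hypothetical 4 GB pool

print("32 KB array:", decode_levels(small_lines), "switch levels")
print("4 GB pool:  ", decode_levels(large_lines), "switch levels")
print(f"100 mm round trip: at least {round_trip_ns(100.0):.2f} ns")
```

Even with zero-latency cells, the large pool needs far more selection levels and a much longer wire than the small array, which is why the small, close arrays stay useful.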
 

cytg111

Lifer
Mar 17, 2008
23,924
13,413
136
Topweasel said:
You would think that eventually you would run into a point where adding more cache levels would be pointless. I mean, each time you add another level, that increases the latency to system memory. As system memory gets faster and you start adding more channels, and the RAM is now talking directly to the CPU, is the next system performance increase really adding more latency to that?

I woulda thought it was something like a ratio, like mainmem:L3, L3:L2, L2:L1.. and at some point you get so much main memory that an L4 is warranted.

Who needs all this memory? Take a look at Windows Live Messenger in your systray .. 100M committed footprint, for an app that sends and receives UTF-8 characters over the pipe.

It is what it is.
 

jpiniero

Lifer
Oct 1, 2010
15,082
5,650
136
I was wondering why Intel jacked up the TDP on Haswell desktop. I think I have my answer.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
Edrick said:
Plus with all the new instructions being introduced with Haswell, and the huge FP performance increase that will come along with that, it makes sense that Intel would dedicate the L3$ to just the CPU cores instead of sharing it with the IGP, which will also be more bandwidth hungry.

Larger on-package memory makes a whole lot of sense on servers and iGPU systems. They can't ever be satisfied with enough bandwidth. 256MB-1GB system memory on package will significantly improve performance. Even on mainstream CPUs, it should benefit performance.

Imagine that today the 3-level cache subsystem satisfies 50% of total workloads without ever going out to system memory. That figure could rise to the high 90s at 512MB capacity. Even if that makes main memory somewhat slower, it wouldn't matter, as on-package memory would have more bandwidth and lower latency than any system memory.
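
A weighted-average sketch shows why a slightly slower main memory might not matter in that scenario. It reads the 50% and high-90s figures as fractions of memory accesses, which is an interpretation of the post, and every latency below (10, 40, 80, 90 ns) is a made-up placeholder rather than a figure for any real system.

```python
def average_latency(tiers):
    """tiers: list of (fraction_of_accesses, latency_ns); fractions sum to 1."""
    total = sum(fraction for fraction, _ in tiers)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * latency for fraction, latency in tiers)

# Baseline: half the accesses hit the existing caches, half go to DRAM.
baseline = average_latency([
    (0.50, 10.0),   # served by L1-L3 (hypothetical blended latency)
    (0.50, 80.0),   # served by system memory (hypothetical)
])

# With on-package memory: 95% served before system memory, and system
# memory assumed 10 ns slower because of the extra lookup in front of it.
with_on_package = average_latency([
    (0.50, 10.0),   # served by L1-L3
    (0.45, 40.0),   # served by on-package memory (hypothetical)
    (0.05, 90.0),   # served by system memory, now somewhat slower
])

print(f"baseline:        {baseline:.1f} ns")
print(f"with on-package: {with_on_package:.1f} ns")
```

With these placeholder numbers the average drops even though the slowest tier got slower, because so few accesses reach it.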

Edrick said:
I am trying to find information on Haswell's new cache structure. Wiki (yes, I know it's wiki) has for a long time had these values listed: 64KB+64KB L1 per core, 1MB L2 per core, and up to 32MB L3, which based on an 8-core design would mean 4MB L3 per core. Very interesting if true, but I will wait to see a preview before I believe anything.

I wouldn't believe those figures for two reasons. First, the leaked Haswell package shot shows exactly the same die width as Ivy Bridge, just longer, which implies a larger GPU. Although optimizations would still allow CPU architecture changes, they would not be able to fit larger L1 and L2 caches in the same space. You wouldn't see larger L1 and L2 caches on the server or higher-performing variants either. That would go against modularity, as it would require a layout change, which is significant.

Second, a larger L3 cache is possible, but 4MB/core is already beyond what the Westmere EX chips have at 3MB/core. I guess it's too early to throw that idea away. Of course only the largest chips would have it, if at all.
 

alyarb

http://www.anandtech.com/show/5078/...abooks-gt3-gpu-for-mobile-lga1150-for-desktop

This is a year old, but it suggests Haswell will bring hyperthreading to the low end and introduce integrated voltage regulators.

and
Haswell for Ultrabooks will be available in a 15W TDP, similar to where SNB based Ultrabooks are today. The big news here is Intel will move the PCH (Platform Controller Hub) onto the same package as the CPU, making the Ultrabook version of Haswell a single chip solution.


This was a pretty cool video:
http://www.anandtech.com/show/1770

Makes you wonder why it wasn't done earlier, if you can bring the PCH onto the CPU package and still see your TDP go from 17W to 15W.
 

IntelUser2000

alyarb said:
Can someone compute the Romley cache area and interpolate to 22 or 16nm? What about higher density 1T-SRAM-Q or eDRAM?

edit: Romley's 15MB L3 cache needs about 108 sq mm. 1T-SRAM-Q could fit 15 MB into 17 sq mm (on 45nm SOI).

Remember you are comparing area taken up by everything required in a CPU vs. data in a research paper.

(Also, Sandy Bridge EP's cache portion takes up ~110mm2 using 20MB L3, not 15MB)

In a CPU you need more than just the data storage portion, which makes the cache look larger per unit of capacity than research papers indicate, since those only show pure per-MB area.

On Nehalem, 1MB of L3 cache takes 5.7mm2. 15MB would take up 85mm2. While using the aforementioned 1T tech would save area, it's nowhere near as significant as you are putting it. The ratio of size then turns out to be 4:1 for 6T and 1T. It's similar for eDRAM.
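
The per-MB arithmetic can be checked with a short Python sketch. It uses only the figures quoted in this post (5.7mm2 per MB on Nehalem, and roughly 110mm2 for 20MB on Sandy Bridge EP) and assumes area scales linearly with capacity, which ignores any fixed overhead; the function name is made up for the example.

```python
NEHALEM_L3_MM2_PER_MB = 5.7   # figure quoted in the post

def l3_area_mm2(capacity_mb: float,
                mm2_per_mb: float = NEHALEM_L3_MM2_PER_MB) -> float:
    """Estimate L3 area assuming area scales linearly with capacity."""
    return capacity_mb * mm2_per_mb

area_15mb = l3_area_mm2(15)       # 15MB at the Nehalem density
snb_ep_mm2_per_mb = 110 / 20      # ~110mm2 for 20MB on Sandy Bridge EP

print(f"15MB at Nehalem density: {area_15mb:.1f} mm2")
print(f"Sandy Bridge EP density: {snb_ep_mm2_per_mb:.1f} mm2 per MB")
```

Both densities land in the same 5 to 6 mm2 per MB range, which is the basis for the 85mm2 figure above.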
 

alyarb

I was using the "overhead" figures, which were over double the bit cell figures, but you can still see that any undocumented overhead doesn't come near the Intel cache. And 1T-SRAM is an old technology.
 

Meaker10

Senior member
Apr 2, 2002
370
0
0
Topweasel said:
But it's not about being faster. It will always be faster. But the CPU, on pretty much every job, is constantly going back to system memory. Is it really smart to keep adding layers, for software that keeps getting more and more bloated, in front of the one part of the system that can in theory handle all of the information? Is there really going to be enough space savings, or allowance for the CPU cores to get clocked higher, to justify the increased latency to the CPU?

I guess the answer is yes, because memory speed, bandwidth, and latency seem to have very, very little effect on system performance of Intel-based systems. But I think eventually something's got to give.

Applications don't manage the cache, nor does the OS; this has no effect on programming at all.
 

Khato

I really do wonder about the origin of the image in the original vr-zone article. Because if it's at all representative of HSW, then the likely 37.5mm x 37.5mm dimensions of socket 1150 would yield a die size of roughly 285 mm2 for quad core GT3 and 89 mm2 for the L4. Assuming that it's eDRAM, then taking a conservative figure of 11Mbit/mm2 (roughly what IBM obtains on their 32nm eDRAM) would yield approximately 128MB.
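
That capacity estimate is easy to reproduce. The sketch below uses the two inputs from the post (89 mm2 and 11Mbit/mm2) and assumes the whole die is usable eDRAM array, which a real die would not be; snapping to the nearest power of two is just a guess at how a product capacity would be chosen.

```python
import math

def edram_capacity_mb(area_mm2: float, mbit_per_mm2: float) -> float:
    """Capacity in megabytes if the entire die area were eDRAM array."""
    return area_mm2 * mbit_per_mm2 / 8   # 8 bits per byte

raw_mb = edram_capacity_mb(89, 11)               # inputs from the post
nearest_pow2_mb = 2 ** round(math.log2(raw_mb))  # nearest power-of-two size

print(f"raw estimate: {raw_mb:.1f} MB")
print(f"nearest power of two: {nearest_pow2_mb} MB")
```

The raw figure comes out slightly under 128MB, so the round number in the post is a reasonable reading of it.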
 

Edrick

Is this just actually dedicated GPU memory and someone threw the L4 cache moniker in there for whatever reason? If that is the case, then VRAM on dedicated GPUs could also be considered L4 since it is so much faster than regular system RAM. (I am only half serious.)

I am really interested to see the term Intel decides to use for this memory and what type of memory it will actually be.
 

IntelUser2000

Khato said:
then the likely 37.5mm x 37.5mm dimensions of socket 1150 would yield a die size of roughly 285 mm2 for quad core GT3 and 89 mm2 for the L4.

If that's true, then the GT2 figure shown would be way too large. Assuming it's 37.5mm x 37.5mm, the GT2 die would be at 230mm2, and many earlier shots show it's closer to 180mm2.

So what if we take in relative terms with the known GT2 size of ~185mm2?

GT3 die = 230mm2
"L4" die = 72mm2

That's still big, but much more reasonable.
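
The rescaling works out as follows in a quick Python sketch. It assumes the drawing is off by one uniform area factor, so the ratio between the known GT2 size (~185mm2) and the GT2 size implied by the drawing (230mm2) can be applied to the other two dies; all four input numbers are from the posts above.

```python
KNOWN_GT2_MM2 = 185.0     # GT2 size from earlier die shots (approximate)
DRAWING_GT2_MM2 = 230.0   # GT2 size implied by the 37.5mm package assumption

# One uniform area correction factor, assumed to apply to every die drawn.
scale = KNOWN_GT2_MM2 / DRAWING_GT2_MM2

gt3_mm2 = 285.0 * scale   # quad core GT3 die from the drawing
l4_mm2 = 89.0 * scale     # "L4" die from the drawing

print(f"GT3 die:  {gt3_mm2:.0f} mm2")
print(f"'L4' die: {l4_mm2:.0f} mm2")
```

The results land within a square millimetre of the rounded 230mm2 and 72mm2 figures given above.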

On package DRAM could be smaller as well. 54nm 1Gbit(128MByte) DRAM takes 40mm2 die. 46nm DRAM takes 55mm2 with 2Gbit capacity. With 30nm-generation, even 8Gbit(1GByte) would be possible.
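
Turning those two DRAM data points into densities makes the comparison easier. This sketch takes 1Gbit as 1024Mbit and assumes density is uniform across the die, so the projected area for a 1Gbit part on the 46nm process is only a linear extrapolation, not a real product figure.

```python
def density_mbit_per_mm2(capacity_mbit: float, die_mm2: float) -> float:
    """Average density over the whole die, periphery included."""
    return capacity_mbit / die_mm2

d_54nm = density_mbit_per_mm2(1024, 40)   # 1Gbit in 40mm2 at 54nm
d_46nm = density_mbit_per_mm2(2048, 55)   # 2Gbit in 55mm2 at 46nm

# Linear extrapolation: area for 1Gbit (128MByte) at the 46nm density.
area_1gbit_46nm = 1024 / d_46nm

print(f"54nm: {d_54nm:.1f} Mbit/mm2")
print(f"46nm: {d_46nm:.1f} Mbit/mm2")
print(f"1Gbit at 46nm density: about {area_1gbit_46nm:.1f} mm2")
```

Either density is well above the 11Mbit/mm2 eDRAM figure used earlier in the thread, which is why commodity on-package DRAM could come out smaller for the same capacity.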

I doubt it's anything more than a block diagram. Trying to figure out die size from it might be as ridiculous as trying to do the same using a block diagram.

Edrick said:
Is this just actually dedicated GPU memory and someone threw the L4 cache moniker in there for whatever reason? If that is the case, then VRAM on dedicated GPUs could also be considered L4 since it is so much faster than regular system RAM. (I am only half serious.)

I am really interested to see the term Intel decides to use for this memory and what type of memory it will actually be.

Yes, you may well be on to something. A mere framebuffer for the GPU rather than a separate caching level, to simplify things. That would be just a start.
 