External Level 3 Cache

lexxmac

Member
Nov 25, 2003
85
0
0
Why not add extra L3 cache? Back in the old days there used to be cache on a module, similar to RAM. Is there any outstanding reason why high-end systems shouldn't have an expansion option for L3 cache? I know chips like the Itaniums have 6MB of L3 cache, but they also lack very large L2 caches. Putting cache on the CPU die is extremely expensive, and a small amount like 256k or 512k (even 1MB) sounds reasonable to build into the CPU itself. SRAM is a lot faster than DRAM, and having a module with a few megs of cache on it should be a performance booster. Anyone know what the deal is?
 

Varun

Golden Member
Aug 18, 2002
1,161
0
0
Well, latency would be a bit better than DRAM, but it would still have to go through the FSB, which restricts the access bandwidth.
 

wacki

Senior member
Oct 30, 2001
881
0
76
I think he means adding RAM that is not on the die itself, but built into the chip's packaging. This has been done before, and I do believe the FSB only affects the connection between the chip package and main memory, aka that 512MB stick of RAM.
 

Pudgygiant

Senior member
May 13, 2003
784
0
0
Back in the day the external cache (AFAIK) was not accessed through the northbridge; the CPU had a direct pipe to it... precisely so that it would not be limited by the FSB (though of course it had other limitations). Would putting it in the package but not on the die really be that much faster than having it external?
 

NickE

Senior member
Mar 18, 2000
201
0
0
The downside of taking it off-die, but still running it at high speed, is twofold: first of all, cost - if you need a few MB of (very) high-speed (static?) RAM, that is going to cost a lot of money. Probably more important, though, is the issue of electrical noise and signal integrity when traces go from microns to millimeters or even centimeters long.

It's for these reasons that the early Slot-A Athlons had off-die cache that ran at fractions (1/3, 1/5, 2/5, etc.) of the core speed, and why the Thunderbird moved it on-die; the same goes for the K6-2 vs. the K6-III.
 

Cashmoney995

Senior member
Jul 12, 2002
695
0
0
Because if you had some type of slot on the CPU or near it, you would have to have some RAM in it, otherwise you would get a humongous amount of interference. You would then need some type of terminator if you didn't want all that interference, and I'm thinking a terminator that could stop all that would be very expensive, because it would need micron-level shielding, meaning it would have to be manufactured to a higher standard than just a regular aluminum or lead shield. The best thing you can do is move the memory closer to the processor. I believe that will happen with our next form factors; BTX or CTX will solve these problems.
 

lexxmac

Member
Nov 25, 2003
85
0
0
I admit putting it on the CPU packaging, like on Slot A CPUs, would be better than a dedicated slot, but any insight into why it hasn't been done?
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
I think it would just be very, very hard to run such a cache at even close to today's processor speeds. I mean, they couldn't get all that close years ago, when the clocks were about 1/10 as high. Even a simple chip-to-chip interconnect is going to be hard-pressed to run at more than a few hundred MHz.

So you're looking at:

1) High cost, both in development and production, due to the increased complexity of the processor die and packaging.
2) Possibly lower maximum processor clockspeed, due to the extra electrical noise and capacitance from hanging an external memory chip off the die.
3) Relatively low performance compared to on-die cache. I find it unlikely that you could run this stuff more than 2-4x faster than today's high-speed DDR memory already runs (about 250MHz is the effective limit for mass-produced DRAM right now), while L2 cache is much more than 10x faster. Making it out of SRAM would be prohibitively expensive and still might not buy you enough speed. Those Itanium chips cost several thousand dollars for a reason.

Unless you could add a fairly HUGE L3 cache (say, 128MB) at a very good speed (an off-chip interconnect at even (CPU frequency / 8) would be extraordinarily fast with a CPU clock over 2Ghz), the performance gain probably just isn't worth the added expense. And it certainly wouldn't be cheap either way.
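
Just to put rough numbers on that (CPU frequency / 8) figure, here's a quick back-of-the-envelope sketch in C. The 2GHz core clock and the bus widths are assumptions for illustration only, with one channel of DDR400 as a reference point:

/* Back-of-the-envelope peak bandwidth of a hypothetical off-chip cache
 * bus clocked at (CPU frequency / 8), next to one channel of DDR400.
 * The core clock and bus widths are assumed, not real parts. */
#include <stdio.h>

int main(void)
{
    double cpu_hz = 2.0e9;          /* assumed 2GHz core clock */
    double bus_hz = cpu_hz / 8.0;   /* off-chip interconnect at 1/8 core = 250MHz */
    int widths[] = { 16, 32, 64 };  /* assumed bus widths in bits */

    /* DDR400 reference: 64-bit bus, 400 million transfers per second */
    printf("DDR400, one channel : %.1f GB/s\n", 400e6 * 64 / 8 / 1e9);

    for (size_t i = 0; i < sizeof widths / sizeof widths[0]; i++)
        printf("off-chip bus, %2d bits @ %.0f MHz : %.1f GB/s\n",
               widths[i], bus_hz / 1e6, bus_hz * widths[i] / 8 / 1e9);
    return 0;
}

Single-pumped, even a 64-bit bus at that clock comes in below plain DDR400 on raw bandwidth, so under these assumptions the attraction would mainly be latency.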
 

tinyabs

Member
Mar 8, 2003
158
0
0
Originally posted by: Matthias99
Unless you could add a fairly HUGE L3 cache (say, 128MB) at a very good speed (an off-chip interconnect at even (CPU frequency / 8) would be extraordinarily fast with a CPU clock over 2Ghz), the performance gain probably just isn't worth the added expense. And it certainly wouldn't be cheap either way.

In other words, does an L3 cache just act as a partial, intelligent shadow memory for the L2? An example is BIOS shadowing: since the BIOS ROM is slower than RAM, it is copied into RAM (shadowed in the upper memory area) for faster access.

I think a cache simply provides faster access to frequently used memory. The larger the cache, the better the performance, up to the point where the entire contents of RAM would fit into cache, which would give the best possible speed between DRAM and L2/L1.

How much a processor cache helps depends greatly on the algorithms being run, because the capacity of the cache is limited. Since processors spend so much of their time in loops, a cache really speeds up that kind of code. Branching has its own mechanisms, like trace caches and speculative execution, to help.

For example, when reading through the entire contents of 512MB, the cache simply fetches more than the processor asks for, because it always loads whole cache lines of 8 bytes or more; the cache obviously shows its benefit here. For truly random access, the data usually has to come from DRAM, since it's unlikely to already be in a cache that is far smaller than 512MB. Instead, the instructions can be decoded ahead of time to work out which memory they will read, so it can be fetched from RAM before they execute.
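
Here's a small C sketch of that cache-line effect; the 64-byte line size, the array size, and the timing method are all my assumptions, purely for illustration. Both loops do the same number of reads and additions, but the second one touches a new cache line on every read, so far more of its accesses go all the way out to DRAM:

/* Toy demo: the same number of reads, packed within cache lines vs.
 * one read per (assumed 64-byte) line. The strided version misses far
 * more often and is typically several times slower. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (16 * 1024 * 1024)  /* 16M ints = 64MB, far larger than any cache */
#define STRIDE 16                  /* 16 ints * 4 bytes = one assumed 64-byte line */

static long long sum_stride(const int *a, int step)
{
    long long s = 0;
    for (long i = 0; i < N / STRIDE; i++)  /* always N/STRIDE additions */
        s += a[i * step];
    return s;
}

int main(void)
{
    int *a = malloc((size_t)N * sizeof *a);
    if (!a)
        return 1;
    for (long i = 0; i < N; i++)
        a[i] = 1;

    clock_t t0 = clock();
    long long packed = sum_stride(a, 1);        /* contiguous: many reads per fetched line */
    clock_t t1 = clock();
    long long strided = sum_stride(a, STRIDE);  /* scattered: a fresh line for every read */
    clock_t t2 = clock();

    printf("packed  reads: sum=%lld, %.3f s\n", packed, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("strided reads: sum=%lld, %.3f s\n", strided, (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}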

The processor is just a Turing machine with a handful of standard constructs, so this kind of specific optimization is already aggressively implemented in modern processors.

 

lexxmac

Member
Nov 25, 2003
85
0
0
Matthias, I'd like to ask you for some proof that a chip-to-chip connection cannot run at 1 GHz or higher. Yes, external SRAM is relatively low performance compared to on-die cache, but as you said yourself, it's still 10 times faster than DRAM, even if it is more expensive to manufacture. Using high cost as an excuse is, well, wrong. The more on-die cache you have, the more die space you take up per core, driving up production costs due to low yields. By cutting the amount of on-die cache, you increase yields on the CPUs themselves. Then split the external cache between several chips, say four, with one, two, or even four megs each. Remember, low yields mean high cost, and the larger the die, the lower the yields. Cut it up and increase the amount to compensate for the minor speed loss of not having it all on the same piece of silicon, and you still have a faster and cheaper product. Itaniums cost so much because the people who need them have more money than they can shake a stick at, and Intel can get away with charging that much for them.
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
I'm looking at things like HyperTransport, which is currently limited to 800MHz at most. I haven't heard of much running faster than that, and certainly not in the sort of wide parallel configurations you'd want for cache. You can push single serial interconnects way past 1GHz, but you're just not going to find something with 10+ gigabytes/sec of bandwidth, at least not something you can integrate cheaply into a processor core.

What I said before was that cache is about 10x as fast as external DRAM. I estimated that onboard RAM might be 2-4x as fast as external RAM -- that is, still less than half the speed of cache. The fastest SRAM chips I can find info on have about a 5ns read/write cycle, or only 200MHz. Even if you had something twice that speed, it's just not going to be a whole lot faster than going out to external DDR400 memory, for a huge increase in cost.
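
Spelling out that cycle-time arithmetic (the 5ns figure is the one quoted above; DDR400's 200MHz clock and 400MT/s transfer rate are the standard spec):

/* Turn the quoted 5ns SRAM read/write cycle into a clock rate and put
 * it next to DDR400's transfer rate. */
#include <stdio.h>

int main(void)
{
    double sram_cycle_ns = 5.0;                  /* quoted SRAM cycle time */
    double sram_mhz = 1000.0 / sram_cycle_ns;    /* 1 / 5ns = 200MHz */

    double ddr400_clock_mhz = 200.0;             /* DDR400 base clock */
    double ddr400_mts = 2.0 * ddr400_clock_mhz;  /* double data rate -> 400 MT/s */

    printf("5ns SRAM : %.0f million cycles per second\n", sram_mhz);
    printf("DDR400   : %.0f MHz clock, %.0f MT/s\n", ddr400_clock_mhz, ddr400_mts);
    return 0;
}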

And while you may be increasing your processor core yields, you're now building a much more complex system, with probably several high-speed external SRAM chips and numerous traces and interconnects to talk to them. The overall package is going to be significantly larger and more expensive, and will almost certainly have a lower yield as well.

Could you do it? I'm sure you could. Would it improve performance? Probably, but at current memory chip speeds probably not by a huge amount. Is it worth it? Nobody's found it worthwhile enough to build one in a while.

 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
Originally posted by: Matthias99
Could you do it? I'm sure you could. Would it improve performance? Probably, but at current memory chip speeds probably not by a huge amount. Is it worth it? Nobody's found it worthwhile enough to build one in a while.

Xeon... what's a Xeon? ;)
 

Pudgygiant

Senior member
May 13, 2003
784
0
0
What would be the limitations prohibiting dual-pumping HyperTransport or similar at 800mhz per pipe?
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
Guesses:

Signal noise/crosstalk, especially in such a confined area.
Capacitance of all the wires if you're running in a highly parallel configuration.

And once you're already going faster than the memory chips, upping the bus speed just doesn't help you. If you had magic 2Ghz (or even 1Ghz) memory chips to interface with, this might be worthwhile. But at 200-400Mhz, I'm guessing it's too expensive for the limited benefit compared to just making the main memory controller as fast as possible.
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: lexxmac
Matthias, I'd like to ask you for some proof that a chip to chip connection cannot run at 1 GHz or higher.
It would be quite simple to implement a 1GHz chip interconnect with today's technology. However, the data path would be so narrow that it wouldn't be worth the cost of redesigning existing platforms to use it to interface with CPUs. Of course, pin count is always a major cost consideration, which is why we have RAMBUS.

Yes, external SRAM is relatively low performance compared to on-die cache, but as you said yourself, it's still 10 times faster than DRAM, even if it is more expensive to manufacture. Using high cost as an excuse is, well, wrong.
Oh, see, this is where you're wrong, lexx. Cost is the reason we don't see external SRAM modules anymore. In the old days, you could see tangible performance benefits because clock speeds varied little between the various components. Popping in off-chip SRAM caches cut the effective access latency to main memory, which back then was something like 70ns for a burst read, while today's memory manages around 40ns for a random read.
Today, for the performance level required to get any kind of benefit out of caching with the old SRAM-module mentality, we're talking extremely high cost. Even in systems that do incorporate more cache, the performance benefit is often measured in single-digit percentage points, while the costs run into four or five figures. You'd have to be seriously gung-ho about squeezing out the last drop of performance to buy off-die cache modules. At that point, though, you're probably more likely to just replace your main memory with SRAM.
Speaking of which, Cray machines traditionally used SRAM-only designs for the very reason that performance, however small the gain, was the main goal and cost was not a consideration.
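
To put those latencies in CPU terms, here's a tiny sketch; the 2GHz clock is just an assumed round number, and the ~5-cycle on-die L2 figure is the one mentioned further down in this post:

/* Convert the quoted memory latencies into CPU clock cycles at an
 * assumed 2GHz core clock, next to a ~5-cycle on-die L2 access. */
#include <stdio.h>

int main(void)
{
    double cpu_ghz = 2.0;         /* assumed core clock; also cycles per nanosecond */

    double old_burst_ns  = 70.0;  /* old main memory, burst read (quoted above) */
    double modern_rnd_ns = 40.0;  /* modern main memory, random read (quoted above) */
    double l2_cycles     = 5.0;   /* on-die L2 access, in core cycles */

    printf("~70ns burst read     = about %.0f cycles\n", old_burst_ns * cpu_ghz);
    printf("~40ns random read    = about %.0f cycles\n", modern_rnd_ns * cpu_ghz);
    printf("on-die L2 (5 cycles) = %.1f ns\n", l2_cycles / cpu_ghz);
    return 0;
}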

The more on-die cache you have, the more die space you take up per core, driving up production costs due to low yields. By cutting the amount of on-die cache, you increase yields on the CPUs themselves. Then split the external cache between several chips, say four, with one, two, or even four megs each. Remember, low yields mean high cost, and the larger the die, the lower the yields. Cut it up and increase the amount to compensate for the minor speed loss of not having it all on the same piece of silicon, and you still have a faster and cheaper product.

You want to move on-die cache onto separate modules, and you believe this will drive down costs. Your argument about die yield is on the right track, but it ignores everything else. Yes, smaller dies increase yield. However, too small a die also increases cost. The main reason cache is on-die is speed. There really is no easier way to decrease access latency than having the cache physically close to the CPU. With the cache on-die, L2 accesses of less than 5 cycles are possible, and that's somewhat conservative. If you move the cache onto a separate chip, latency increases to tens or even hundreds of cycles. Long latency absolutely kills CPU performance, although I believe it does run a lot cooler.
Along the same lines, cache on die means the cache itself runs at very high clock speeds.
There is a reason why Intel and AMD moved cache on-die as soon as it was technologically and financially feasible. For each process technology, a certain number of defects tends to show up on the wafer, and the average is fairly constant between process improvements. That means there is an ideal die size where the yield just reaches its peak; beyond that point the dies get smaller, but the yield stays fairly constant because the defects still affect about the same percentage of dies. What does this mean? It means that when your CPU logic takes up less space than that ideal die size, it is better to fill the extra space with cache than to build another CPU design. Not only do you get large performance benefits, you also bypass the need for Pentium Pro or cartridge-type packaging, which, by the way, is seriously expensive.
If, on the other hand, we were to follow your philosophy, it quickly becomes clear that individual CPUs would not be the right place to focus our efforts. Instead, we'd build massively parallel systems, because lots of CPUs would "...compensate for the minor speed loss of not having it all on the same piece of silicon, and then you still have a faster and cheaper product." I think we all know that's not entirely true. Massively parallel works sometimes, but not always; the same goes for massive amounts of cache versus small, fast cache.
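
The die-size/yield tradeoff here can be sketched with the textbook Poisson yield model, yield ~= exp(-defect density * die area). The defect density and die areas below are invented numbers, purely to show the shape of the curve, not any real process:

/* Poisson yield model sketch: yield = exp(-D0 * A). D0 and the die
 * areas are made-up illustrative numbers, not a real process. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double d0 = 0.005;                          /* assumed defects per mm^2 */
    double area_mm2[] = { 80, 120, 160, 200 };  /* logic-only up to logic + big cache */

    for (size_t i = 0; i < sizeof area_mm2 / sizeof area_mm2[0]; i++)
        printf("%3.0f mm^2 die -> %2.0f%% yield\n",
               area_mm2[i], 100.0 * exp(-d0 * area_mm2[i]));
    return 0;
}

It only shows the trend (bigger dies yield worse, gradually); where the sweet spot actually sits depends on the process, packaging costs, and everything else above.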

Itaniums cost so much because the people who need them have more money than they can shake a stick at, and Intel can get away with charging that much for them.
Xeons cost so much because the very same type of people don't want to scrap their existing x86 systems and invest in IA-64, but still want performance. Notice that Xeons have on-die L3 cache. If it were that much cheaper to use off-die L3, Intel would do it, and still charge the same amount they do now just to reap higher margins. Cache is on-die for a reason, and it's not to screw over the customer.

Originally posted by: Pudgygiant
What would be the limitations prohibiting dual-pumping HyperTransport or similar at 800mhz per pipe?

If I remember correctly, HyperTransport is already "double-pumped": it transfers data on both the rising and falling edge of each clock.

 

lexxmac

Member
Nov 25, 2003
85
0
0
To prove my own thought wrong, I point to the minor speed difference between the Athlon 64 3000+ (512KB L2 cache) and the 3200+ (1MB L2 cache). The performance difference is tiny for most purposes, and yet the cost is close to double, even with on-die cache. I've been proven wrong, but it was still an interesting discussion.
 

Pudgygiant

Senior member
May 13, 2003
784
0
0
I mean double-pumping as in two channels at 800MHz each, whether it's already double-pumped, quad-pumped, or whatnot.

EDIT
And about external L3 cache, would the memory controller be on-die or off? Which would give better performance?
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
The problem really isn't the speed of the bus, but the speed of the off-die L3 cache itself. To get off-die L3 cache, it would have to be manufactured and packaged separately. L3 caches made this way are extremely slow and, compared to modern DRAM, aren't that much faster (look at the eDRAM L3 on the IBM POWER series). So the benefits aren't worth the cost.
 

Pudgygiant

Senior member
May 13, 2003
784
0
0
Wouldn't an external cache BE DRAM? And wouldn't modern DRAM push the limits of a hypertransport bus?
 

Jeff7181

Lifer
Aug 21, 2002
18,368
11
81
Well, if it were an AMD system connected to a 1600MHz HT link, it wouldn't be THAT slow... although the HT bus being only 16 bits wide might be a problem.
 

lexxmac

Member
Nov 25, 2003
85
0
0
Originally posted by: Pudgygiant
Wouldn't an external cache BE DRAM? And wouldn't modern DRAM push the limits of a hypertransport bus?


No, external cache is not DRAM. DRAM is dynamic RAM and SRAM is static RAM. The difference between the two is that a DRAM cell is composed of a single transistor and a capacitor, while an SRAM cell uses up to six transistors. DRAM must be constantly refreshed to retain its data, and (to simplify things) time spent refreshing is time that can't be spent reading or writing, which slows it down. SRAM, once written, retains its contents for as long as power is applied. The other major difference is physical size: an SRAM cell is much, much larger than a DRAM cell, and that is why it's expensive to manufacture; it eats up space on the silicon.

As far as DRAM speed pushing the limits of HyperTransport goes (let me first say HT was never intended for communication with main memory and, as far as I know, has never been used that way): no, not really. Dual-channel DDR400 and a 16-bit HyperTransport bus @ 800MHz both come out to 6.4GB/s of bandwidth. Also, HyperTransport is a narrow, packet-based point-to-point link, while DDR RAM sits on a wide parallel bus. Considering that, they're in a dead heat for bandwidth, but they aren't really competing with each other, since they're intended for different uses. And I might add that because HyperTransport is packet-based, although you could probably use it for memory traffic, it wouldn't be a great idea, as you would need a controller to encode and decode packets for the HT bus.
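
Working out those two bandwidth figures (this takes the HT link as double-pumped, as noted earlier in the thread, and counts both directions of the link):

/* Peak bandwidth: dual-channel DDR400 vs. a 16-bit HyperTransport link
 * at 800MHz, double-pumped. Counts both directions of the HT link. */
#include <stdio.h>

int main(void)
{
    /* dual-channel DDR400: 2 channels x 64 bits x 400 million transfers/s */
    double ddr_gbs = 2 * (64 / 8) * 400e6 / 1e9;

    /* 16-bit HT @ 800MHz, 2 transfers per clock, per direction */
    double ht_one_way_gbs = (16 / 8) * 800e6 * 2 / 1e9;

    printf("dual-channel DDR400 : %.1f GB/s\n", ddr_gbs);
    printf("16-bit HT @ 800MHz  : %.1f GB/s each way, %.1f GB/s both ways\n",
           ht_one_way_gbs, 2 * ht_one_way_gbs);
    return 0;
}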
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
So how many pins does a 16-bit HT link need? Obviously it's more than just 16 individual pins.
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
The HT links on the Hammer cores use 76 pins, and AMD's tech docs are fouled up on this point.