PS3/Cell : What effect does cache have with XDR?

imported_philpoe

Junior Member
Jan 26, 2005
15
0
0
Under the high-level overview of the Cell article, it states that the PPE core has 64KB L1 and 512KB L2 cache.
On the other hand, under the on-die memory controller section, we see that the XDR memory gives bandwidth of 25.6GB/sec, and the integrated memory controller "significantly reduces memory latencies".

My question then is: what good are the L1 and L2 caches doing? Given the die area those transistors take up, wouldn't it be effective enough to skip the cache and use system RAM exclusively? The L2 cache takes up about the same area as an SPE (not that one more SPE would help all that much), but what effect would dropping the L2, or even the L1, have on performance with memory bandwidth that high?

any info/opinions appreciated
 

n yusef

Platinum Member
Feb 20, 2005
2,158
1
0
Even though the memory latency and bandwidth are better with an on-die memory controller, main memory is nowhere near as fast as cache, especially on latency.
 

imported_philpoe

Junior Member
Jan 26, 2005
15
0
0
While I can certainly understand a latency difference given the physical distances, how much bandwidth does L1 cache have?
I saw some read tests comparing K7 and K8 cache reads, and they were in the neighborhood of 10-15GB/sec, roughly half the XDR bandwidth in the Cell.
Given that the L2 cache takes up the majority of an x86 core's die, might it be worth dumping the cache (or a large portion of it) in favor of very fast memory?

Originally posted by: n yusef
Even though the memory latency and bandwidth are better with an on-die memory controller, main memory is nowhere near as fast as cache, especially on latency.

 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
For an x86 core, the problem isn't fast memory; it's the bus. If something like GDDR3 could be used, we'd have better bandwidth and latency. However, modularity and upgradability are very important, as is cost. So we must fit eight chips per bank, which then connects to a shared bus, rather than a few very dense chips wired straight to the memory controller. We mainly don't use XDR because of the reputations of Rambus and a few memory companies.

Also, for x86, over 95% of instructions are typically ready to go in cache. Better prediction helps, but so does simply having more cache to use. Having super-fast (super-expensive) main RAM would be good, but it would not give enough of a benefit to be worthwhile as a replacement for the typical on-die caches.
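The arithmetic behind that hit rate is worth sketching. Below is a back-of-envelope average-memory-access-time model; the cycle counts used with it are my own illustrative assumptions, not measurements of any real CPU:

```c
/* Average memory access time (AMAT): the standard hit/miss arithmetic
 * behind "95% of instructions are ready in cache".  All cycle counts
 * fed into this below are assumed, illustrative numbers. */
double amat(double hit_cycles, double miss_rate, double miss_penalty_cycles)
{
    return hit_cycles + miss_rate * miss_penalty_cycles;
}
```

With an assumed 3-cycle hit, 5% miss rate, and 200-cycle trip to RAM, the average is 3 + 0.05 * 200 = 13 cycles; with no cache, every access pays the full 200. Even halving the memory latency with faster RAM still leaves roughly 100 vs. 13.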

For the Cell, much of what makes cache work well for x86 workloads does not exist (like out-of-order execution and branch prediction). They had to cut corners everywhere they could so that it would be an economical chip to manufacture.

As for bandwidth vs. latency on the Cell, without branch prediction, software must do the prefetching. If it is done well, the main PPC core and the other cores may be able to mostly keep themselves filled, with latency not being much of an issue. If the data needed is sent with enough lead time before its execution, latency issues would be nullified. While that isn't likely to happen all the time, it could still work well, as long as it can either (a) occur often enough to avoid lots of long stalls, (b) be improved as the platform matures, allowing software built with newer compiler and SDK versions to get better performance out of the cores, or (c) mature as programmers get used to the paradigms of the design, which favor throughput over efficiency. For highly parallel work, like physics, that could work out perfectly (and games are only getting more and more of that).
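That software prefetching is usually structured as double buffering: compute on one local buffer while the next chunk is in flight. A minimal sketch in plain C, with made-up names, and memcpy standing in for the asynchronous DMA a real SPE would issue:

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 256  /* elements per transfer; the size is an assumption */

/* Double-buffered streaming: work on one local buffer while the next
 * chunk is being fetched.  On a real SPE the memcpy calls would be
 * asynchronous DMA requests; plain memcpy just makes the control flow
 * visible.  Processes n in whole chunks only. */
long process_stream(const int *src, size_t n)
{
    int buf[2][CHUNK];
    size_t chunks = n / CHUNK;
    long sum = 0;

    if (chunks == 0)
        return 0;

    /* Fetch the first chunk before entering the loop. */
    memcpy(buf[0], src, CHUNK * sizeof(int));

    for (size_t c = 0; c < chunks; c++) {
        size_t cur = c & 1;

        /* Start "fetching" the next chunk into the other buffer;
         * with real async DMA this overlaps the compute below. */
        if (c + 1 < chunks)
            memcpy(buf[1 - cur], src + (c + 1) * CHUNK,
                   CHUNK * sizeof(int));

        /* Compute on the chunk that is already local. */
        for (int i = 0; i < CHUNK; i++)
            sum += buf[cur][i];
    }
    return sum;
}
```

With genuinely asynchronous transfers, the fetch of chunk c+1 overlaps the summing of chunk c, which is exactly the "enough lead time" condition that hides the latency.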

As for the overall bandwidth, note that the Cell has several cores, and most of them are doing raw math. That takes a lot of bandwidth to keep fed. Even a dual-core x86, like the 4800+, will have no more than two pipelines filled with math, and on those same cores it will be doing much work that has to check conditions and choose a branch based on them. That kind of work is far more sensitive to the latency of fetching a small piece of data than to being able to push through many GB of it. x86 CPUs don't have anywhere near the power of a Cell when it comes to math.
 

imported_philpoe

Junior Member
Jan 26, 2005
15
0
0
Re: Bus, in the case of the K8, why not have it go directly to the memory controller?

re: Cache size, another question might be at what point does the cache begin to show diminishing returns?
Does a 16KB L1 and 128KB L2 offer 80% of the performance? What about a 128KB L1 and no L2? (just rhetorical questions, btw :) )
Where might the benefits of the faster system RAM take over for the smaller cache?

The goal of course being to reduce the die size that more cores can go in a single socket.

Besides the lag of gaming engines, won't DB and Web servers see the benefit of multiple cores with smaller caches?

Even if this XDR (or DDR3) is extra-expensive, we're already paying for it in the form of high-end video cards, with the additional expense of system memory. If the whole allotment of RAM was high-speed, turbo-cache/hyper-memory on high-end GPUs could work well.


Originally posted by: Cerb
For an x86 core, the problem isn't fast memory; it's the bus. If something like GDDR3 could be used, we'd have better bandwidth and latency. However, modularity and upgradability are very important, as is cost. So we must fit eight chips per bank, which then connects to a shared bus, rather than a few very dense chips wired straight to the memory controller. We mainly don't use XDR because of the reputations of Rambus and a few memory companies.

Also, for x86, over 95% of instructions are typically ready to go in cache. Better prediction helps, but so does simply having more cache to use. Having super-fast (super-expensive) main RAM would be good, but it would not give enough of a benefit to be worthwhile as a replacement for the typical on-die caches.

For the Cell, much of what makes cache work well for x86 workloads does not exist (like out-of-order execution and branch prediction). They had to cut corners everywhere they could so that it would be an economical chip to manufacture.

As for bandwidth vs. latency on the Cell, without branch prediction, software must do the prefetching. If it is done well, the main PPC core and the other cores may be able to mostly keep themselves filled, with latency not being much of an issue. If the data needed is sent with enough lead time before its execution, latency issues would be nullified. While that isn't likely to happen all the time, it could still work well, as long as it can either (a) occur often enough to avoid lots of long stalls, (b) be improved as the platform matures, allowing software built with newer compiler and SDK versions to get better performance out of the cores, or (c) mature as programmers get used to the paradigms of the design, which favor throughput over efficiency. For highly parallel work, like physics, that could work out perfectly (and games are only getting more and more of that).

As for the overall bandwidth, note that the Cell has several cores, and most of them are doing raw math. That takes a lot of bandwidth to keep fed. Even a dual-core x86, like the 4800+, will have no more than two pipelines filled with math, and on those same cores it will be doing much work that has to check conditions and choose a branch based on them. That kind of work is far more sensitive to the latency of fetching a small piece of data than to being able to push through many GB of it. x86 CPUs don't have anywhere near the power of a Cell when it comes to math.

 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Bus, in the K8: the question doesn't really make sense; the K8's memory controller is already on-die, so the RAM already connects to it as directly as a modular bus allows.

Where the caches show diminishing returns depends a lot on everything else. For the tasks most PC users need, high speed RAM can't replace cache. Even at a somewhat low bandwidth, cache is extremely low latency, being right there.

The thing about a lot of physics and similar work is that you have hundreds of items that are already in sequential order in memory, or at least a way to get them into an order like that. Then several operations need to be applied to them, again and again, across many, many items. If the chip can be kept fed, these tasks should be able to just keep scaling up.
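That sequential-items pattern can be sketched like this; the struct-of-arrays layout and all the names are hypothetical, chosen so that each pass walks contiguous memory the way a streaming core wants:

```c
#define N 1024

/* Hypothetical particle state in struct-of-arrays form: each array is
 * contiguous, so an integration pass is a pure sequential stream. */
typedef struct {
    float x[N];   /* positions  */
    float v[N];   /* velocities */
} Particles;

/* One integration step: the same operation applied across many items,
 * in the order they sit in memory. */
void step(Particles *p, float dt)
{
    for (int i = 0; i < N; i++)
        p->x[i] += p->v[i] * dt;
}
```

No branching on the data, no random access: as long as memory can keep the loop fed, adding more such cores keeps paying off.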

Look at Celerons. A 3GHz Celeron is almost as fast at several encoding tasks as the real P4. However, try using it for simple things like office apps, and compare it against the P4.

As for the server stuff, they really don't benefit right now. AT's reviews show the same kind of pattern found elsewhere:
Apache scales like nobody's business. It would probably do quite well with extra cores that have small caches. However, that performance starts to dwindle in real use, with PHP, and maybe a DB attached.
DB apps scale up well with CPU speed, but less so with extra cores. Since speed bumps are coming more slowly, extra cores are a must (and in AMD's case, a dual-core does add more than two separate CPUs would), but a faster single core would reap more performance. Also, with the overwhelming majority of data being in cache when it's needed, removing the cache would mean constantly going out to RAM. Even fast RAM, like that on video cards, is nowhere close to as fast as cache. It might move 30GB/s, but the time taken to move a handful of bytes from point A to point B is ludicrous.
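The "handful of bytes" point is plain arithmetic: in a dependent lookup chain (a DB index walk, say), you pay the full round-trip latency per cache line, so peak bandwidth never enters into it. A back-of-envelope sketch with assumed numbers:

```c
/* Effective bandwidth of a fully latency-bound access pattern: one
 * cache line per round trip, nothing overlapped.  bytes per ns is
 * numerically GB/s.  The 64-byte line and 100 ns latency used below
 * are illustrative assumptions, not measurements. */
double effective_gbps(double line_bytes, double latency_ns)
{
    return line_bytes / latency_ns;
}
```

At an assumed 100 ns per miss and 64 bytes per line, the effective rate is 0.64 GB/s, nowhere near a 30 GB/s streaming peak.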

Applications can be ported to specifically use a certain design well (that's part of how otherwise average dual G5s manage to make a cheap top-10 supercomputer), but it takes time and effort that most people, and companies, can't deal with. A well-balanced solution is generally better than a highly tweaked one. The balanced solution will offer good, predictable, performance across a range of uses, even with stock parts and software. Current x86 stuff fits the bill, with cache taking up plenty of space. The Cell, with so little in the way of cache, is anything but balanced.

Originally posted by: imported_philpoe
Even if this XDR (or DDR3) is extra-expensive, we're already paying for it in the form of high-end video cards, with the additional expense of system memory. If the whole allotment of RAM was high-speed, turbo-cache/hyper-memory on high-end GPUs could work well.
Additional expense? You can get 1GB of name brand DDR RAM for under $90. That's CHEAP.

The other problem is that this RAM won't run that fast as system RAM in the same scenario. As a signal travels through the traces, it gets bounced around and picks up interference. There's much more room for that to occur on the motherboard's memory bus(es) than on a video card.

Video card: rarely more than four chips per RAM channel, and only a few inches of traces to go through.
System: eight or sixteen chips per module, with one to four modules per channel, and each module is plugged into a shared bus.

How can you clean up the signal between chips, banks, and over the bus? A point-to-point design needs more pins and traces, or increases latency (FB-DIMM). A method with a repeater would add latency. RDRAM (again, I don't know about XDR) got some help from dummy continuity modules that kept the open bus from being much of a problem.

The system RAM has to deal with a very different situation than video RAM, and it must be very tolerant--you don't know what it will have to work with! On a video card, it's all set down: the BIOS is made specifically for the RAM chips it uses, and the other parts, like the PCB and power regulation, only have to work with a limited range of components. If they work, you're done (OK, that's hyperbole, but the point remains). A typical RAM stick may need to work with dozens of chips and chipsets, probably more, and hundreds of motherboards and BIOSes. To top that off, they may then need to run it at a different speed than the maximum it was sold at.
 
Kensai

Nov 11, 2004
10,855
0
0
You will notice that heavy duty Xeon CPUs have 8MB of L3 cache.
You will also notice that 2MB cache versions of the Xeon outperform their 512K counterparts.

128KB L1 cache would bring a very large performance increase in comparison to 16KB L1 cache.

I want 5PB of L1 cache on my next CPU. Unfortunately, that would make one very, very, very, very large die, and costs would go through the roof. That's why 8MB cache versions of the Xeon retail for $3000+ USD per CPU, and they work in pairs.
 

bunnyfubbles

Lifer
Sep 3, 2001
12,248
3
0
Originally posted by: Kensai
You will notice that heavy duty Xeon CPUs have 8MB of L3 cache.
You will also notice that 2MB cache versions of the Xeon outperform their 512K counterparts.

128KB L1 cache would bring a very large performance increase in comparison to 16KB L1 cache.

I want 5PB of L1 cache on my next CPU. Unfortunately, that would make one very, very, very, very large die, and costs would go through the roof. That's why 8MB cache versions of the Xeon retail for $3000+ USD per CPU, and they work in pairs.

What are you talking about? L1 cache sizes are small because if they're too large, they become slower. Behind it you have a second, larger level of cache to speed things up (it can hold more data but is slower than L1), and then an L3 in very few cases (slower than L2, yet larger still)... then there's the system memory, and eventually the hard drive.
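That hierarchy can be modeled as levels you fall through on a miss, each larger and slower than the last. A toy cost model; every latency number here is an invented assumption, and only the shape (bigger means slower) matters:

```c
/* Toy model of a cache-hierarchy lookup: probe L1, fall through to
 * L2, then L3, then RAM, paying every level checked on the way down.
 * The per-level cycle counts are made up for illustration. */
int access_cycles(int hit_level)  /* 1 = L1, 2 = L2, 3 = L3, 4 = RAM */
{
    const int lat[] = { 0, 3, 14, 40, 200 };  /* assumed cycles/level */
    int total = 0;

    for (int lvl = 1; lvl <= hit_level; lvl++)
        total += lat[lvl];
    return total;
}
```

In this model an L1 hit costs 3 cycles, while falling all the way to RAM costs 3 + 14 + 40 + 200 = 257, which is why a small-but-fast L1 backed by big-but-slower levels beats one huge, slow cache.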