What advantages does the new IBM 970 processor (in Apple's latest G5) have in its asymmetrical L1 cache arrangement...?

MadRat

Lifer
Oct 14, 1999
11,961
278
126
Originally posted by: Eug
New G5
Die size: 66 mm2
Transistors: 58 million
Process: 90 nm, Silicon on insulator
L1 instruction cache: 64 KB
L1 data cache: 32 KB
L2 cache: 512 KB

Why, on the new 90nm IBM 970 processors, do the i- and d-caches in the L1 not match in size?
What advantage does using non-symmetrical sizes like this have on performance, rather than 64k/64k?
What disadvantage does using non-symmetrical sizes like this have on performance, rather than 32k/32k?
Is a larger instruction cache better than a larger data cache?
 

uart

Member
May 26, 2000
174
0
0
The optimal division between instruction and data cache is clearly going to differ from one program to the next. In particular, where the critical loops of a program are code-compact but the data set is large, the optimal division for that program would normally favour more d-cache than i-cache. Other programs may only operate on a relatively small data set yet have large and complicated code loops; these would favour a large i-cache division.
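A rough sketch of the two extremes in C (toy loops invented purely for illustration; nothing here reflects IBM's actual profiling):

```c
#include <stddef.h>

/* Extreme A: a handful of instructions swept over a huge data set.
   The loop body fits in a few lines of i-cache; performance is
   dominated by how well the d-cache (or prefetch) feeds it. */
double dot(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Extreme B: a long chain of distinct routines applied to a small,
   resident data set. Here the code footprint, not the data, is what
   the L1 has to hold, so a bigger i-cache wins. */
double process_sample(double x)
{
    /* imagine dozens of branches and helper calls here */
    return (x < 0.0) ? -x * x : x * x;
}
```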

I assume that IBM would have profiled a number of programs, particularly those for which they wanted optimal performance (or most impressive benchmarks as the case may be) and then chosen the L1 partitioning based on that profiling.

BTW, most CPUs have a fixed division between i- and d-cache; however, some can do it dynamically to effectively get the best of both worlds. From memory, I think the humble old Cyrix 686MX had a 64k combined i/d cache with no fixed division.
 

MadRat

Lifer
Oct 14, 1999
11,961
278
126
Does a dynamic-combination L1 cache incur a lot of latency penalty in exchange for this type of utility?
 

lexxmac

Member
Nov 25, 2003
85
0
0
Originally posted by: MadRat
Originally posted by: Eug
New G5
Die size: 66 mm2
Transistors: 58 million
Process: 90 nm, Silicon on insulator

The current G5 chips are 130nm. The 90nm chips are still a ways off, or at least not yet in full-scale production. Anyway, the currently shipping machines use 130nm chips. Interestingly, the northbridge is made on the same 130nm process as the CPUs.
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
I would think that with today's data-intensive applications (graphics and sound take more space as they get better; code does not), the exact REVERSE of this would be a better design. That is, I would think the sweet spot on price/performance for data cache would be larger than that for instruction cache, especially on a CPU with a strong vector instruction set like AltiVec. You sure those numbers are not reversed?
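For concreteness, this is the kind of vector-heavy loop in question; a minimal sketch assuming a PowerPC compiler with -maltivec and 16-byte-aligned arrays:

```c
#include <altivec.h>

/* c[i] = a[i] + b[i], four floats per instruction. The arrays being
   streamed can be far larger than any L1 cache. */
void vadd(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, a + i);  /* load 16 aligned bytes */
        vector float vb = vec_ld(0, b + i);
        vec_st(vec_add(va, vb), 0, c + i);   /* store 4 sums */
    }
}
```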
 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
The point in making the I-cache larger comes from exactly what glugglug says: today, typical high-performance applications use a certain (small) codeset to work on streaming (HUGE) datasets. Trying to cache the datasets is rather pointless given their size; good prefetching and write buffering strategies are needed there, and they have long been in place. Being able to fit a larger chunk of code into L1 cache, in contrast, will improve performance quite impressively.
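A minimal illustration of the prefetching point, using GCC's __builtin_prefetch (the distance of 16 elements ahead is a made-up tuning value):

```c
#include <stddef.h>

/* Each element is touched exactly once, so no amount of d-cache will
   create reuse; prefetching ahead hides the memory latency instead. */
float stream_sum(const float *data, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 0); /* read, no temporal locality */
        s += data[i];
    }
    return s;
}
```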

And yes, uart, the superscalar Cyrixes from 5x86 to MIII had unified L1 cache.
 

lexxmac

Member
Nov 25, 2003
85
0
0
Originally posted by: Peter


And yes, uart, the superscalar Cyrixes from 5x86 to MIII had unified L1 cache.

Why isn't that same principle in use today? What are the pros and cons of having a unified L1 cache?
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: lexxmac
Originally posted by: Peter


And yes, uart, the superscalar Cyrixes from 5x86 to MIII had unified L1 cache.

Why isn't that same principle in use today? What are the pros and cons of having a unified L1 cache?

With a unified cache, you need enough bandwidth to provide the data AND instructions. With a Harvard architecture (separate I and D caches), each cache only needs the bandwidth for either instructions or data.

Also, in general, bigger caches are slower, so two smaller caches would be faster.

I can't remember other reasons right now.
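One more effect a toy model can show: in a unified cache, the instruction and data streams can evict each other. A deliberately tiny direct-mapped model (all sizes and the access trace are invented for illustration):

```c
#include <stdio.h>

/* One 8-set unified cache vs. split 4-set i- and d-caches of the same
   total size. Each set holds one line tag; -1 means empty. */
static int touch(int *set, int nsets, int line)
{
    int idx = line % nsets;
    if (set[idx] == line) return 1;  /* hit */
    set[idx] = line;                 /* miss: evict and fill */
    return 0;
}

int main(void)
{
    /* A fetch pattern where instruction lines (0,1) and data lines (8,9)
       map to the same sets of the unified cache and thrash each other. */
    int uni[8], ic[4], dc[4], hits_u = 0, hits_s = 0;
    for (int i = 0; i < 8; i++) uni[i] = -1;
    for (int i = 0; i < 4; i++) ic[i] = dc[i] = -1;

    for (int step = 0; step < 8; step++) {
        int iline = step % 2;        /* instruction fetch: lines 0,1 */
        int dline = 8 + step % 2;    /* data access: lines 8,9 */
        hits_u += touch(uni, 8, iline) + touch(uni, 8, dline);
        hits_s += touch(ic, 4, iline) + touch(dc, 4, dline);
    }
    printf("unified: %d/16 hits, split: %d/16 hits\n", hits_u, hits_s);
    return 0;
}
```

Run as-is, the unified model misses on every access while the split caches settle into steady hits after the first pass.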
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
A unified cache also can't be used to hold decoded microinstructions, rather than x86 machine instructions as they are found in memory, the way the P4 and Athlon64/Opteron do.
 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
Well, the Cyrix line from the 5x86 to the M-III was about the last of the native x86 engines. They didn't internally translate to some RISC micro-op set.
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
Only the most recent Intel processors cache micro-ops; they use the trace cache. No AMD processor stores micro-ops.
 

drag

Elite Member
Jul 4, 2002
8,708
0
0
The way I figure it, having an asymmetrical cache makes a processor more efficient. Or at least it is more useful on a more efficient processor.

My reasoning goes like this:

In the PlayStation 2 you have relatively small amounts of cache and memory spread throughout the system. This is because they don't want large amounts of cache; instead they concentrate on creating wide buses between everything. They move the graphics data fast, so less time is needed in the cache; the less cache you need, the faster everything gets, etc. So you could create a system with a 300MHz CPU that would run a word processor as fast as a 486 but handle graphics like an 800MHz+ Pentium.

Now you have the data cache and the instruction cache.

Most programming is done in loops, meaning that you have the same instructions used over and over again on different data.

You don't need a large data cache, because the data isn't going to be used more than once in a while, so it needs to be moved fast. Instructions are going to sit around for a long time, so you want to make the cache large enough to hold even complex instruction sequences; since instructions don't move around as much, latency doesn't matter as much as it does for the data cache.

.............

Now, on to systems that use symmetrical cache, like Pentiums.

The P4 is a very inefficient processor; it takes 2-3 times the clock cycles to get any one thing done. But it has very long pipelines, so it can work on bunches of stuff at the same time and do it fast.

You also have limitations like the x86 ISA. The ISA is the set of rules for how software is supposed to interact with the hardware. Originally it dictated how the hardware was designed; now it dictates an intermediate layer between the hardware and the software. You can create any god-awful or weird hardware imaginable, but as long as the software sees the x86 ISA, it will run.

For instance, the x86 ISA dictates that there will be 8 general-purpose registers available for processing inputs in the CPU.

Well, the P4 has 128 physical GPRs available.

So all the software can see is 8. What the CPU does is begin processing on 8 GPRs at a time, then shift the mapping to another unused 8 GPRs to start processing another chunk of data.
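A toy sketch of that remapping (register renaming); the bump allocator and sizes are illustrative only, not how the P4 actually manages its free list:

```c
#include <stdio.h>

#define ARCH_REGS 8    /* what the x86 ISA exposes */
#define PHYS_REGS 128  /* what the hardware actually has */

static int map[ARCH_REGS];          /* arch reg -> current phys reg */
static int next_free = ARCH_REGS;   /* toy free list: just bump */

/* Every write to an architectural register gets a fresh physical one,
   so independent writes to "the same" register no longer serialize. */
static int rename_dest(int arch)
{
    map[arch] = next_free++ % PHYS_REGS;
    return map[arch];
}

int main(void)
{
    for (int i = 0; i < ARCH_REGS; i++) map[i] = i;

    /* two back-to-back writes to EAX land in different phys regs */
    printf("eax -> p%d\n", rename_dest(0));  /* eax -> p8 */
    printf("eax -> p%d\n", rename_dest(0));  /* eax -> p9 */
    return 0;
}
```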

So the large data cache is important to keep the data ready as it waits for another section of the CPU to open up and process it.

Also, with long pipelines it's important to make sure that you have the scheduling down very, very well. It may take a long time to process info on a P4 (in terms of clock cycles), but as long as you time everything right, all the data will come out the other end ready to go and in order. If you get stuff out of order, then you have to waste time reprocessing things and bringing data back up out of memory and into cache, and so on and so forth.

Bigger data caches make this easier, I figure.

So symmetrical cache isn't important in the sense that it "has to be symmetrical", but it is a good rule of thumb and makes chips easier to design.

This is all, of course, pure guesswork. I have no idea if I am right or wrong.
 
Nov 22, 2003
36
0
0
Originally posted by: lexxmac
Originally posted by: MadRat
Originally posted by: Eug
New G5
Die size: 66 mm2
Transistors: 58 million
Process: 90 nm, Silicon on insulator

The current G5 chips are 130nm. The 90nm chips are still a ways off, or at least not yet in full-scale production. Anyway, the currently shipping machines use 130nm chips. Interestingly, the northbridge is made on the same 130nm process as the CPUs.

Nope, IBM is currently making some 90nm G5 chips. They power the Xserve servers from Apple. The bulk of Apple's sales, I assume, are the desktop machines, which use the older 130nm process. So the 90nm chips are only shipping in limited quantities, but enough for the Xserve systems.

BTW, the Pentium 4 basically has an asymmetrical L1 cache too.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: MadRat
Besides the 12k of uops, how much data does it hold?

None. It just caches decoded instruction traces. Getting its size in KB isn't particularly informative unless you know the size of the uops, which isn't really useful information by itself either.
edit: Look for a paper called "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching" by Rotenberg, Bennett, and Smith.
 

MadRat

Lifer
Oct 14, 1999
11,961
278
126
Does the x86 architecture make one size of L1 cache optimal, as opposed to what would be found in a RISC architecture? AMD appears to have stayed with the 64/64 arrangement through the jumps from 250nm to 180nm, from 180nm to 130nm, and from 130nm to 90nm. I have to assume that AMD sees no benefit in moving to a 96/96 or 128/128 configuration.

I'm impressed by trace cache and the claims of performance improvement:

17-34% integer boost
6-9% FPU boost

Does anyone believe AMD won't eventually move to this layout?