What advantages does the new IBM 970 processor (in Apple's latest G5) have in its asymmetrical L1 cache arrangement...?

MadRat

Lifer
Oct 14, 1999
11,961
278
126
Originally posted by: Eug
New G5
Die size: 66 mm2
Transistors: 58 million
Process: 90 nm, Silicon on insulator
L1 instruction cache: 64 KB
L1 data cache: 32 KB
L2 cache: 512 KB

Why, on the new 90nm IBM 970 processors, do the i- and d-caches in the L1 not match in size?
What advantage does using non-symmetrical sizes like this have on performance, rather than 64k/64k?
What disadvantage does using non-symmetrical sizes like this have on performance, rather than 32k/32k?
Is a larger instruction cache better than a larger data cache?
 

uart

Member
May 26, 2000
174
0
0
The optimal division between instruction and data cache is clearly going to differ from one program to the next. In particular, where the critical loops of a program are code-compact but the data set is large, the optimal division for that program would normally favour more d-cache than i-cache. Other programs may only operate on a relatively small data set yet have large and complicated code loops; these would favour a large i-cache division.
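A rough sketch of the two extremes in C (toy loops invented purely for illustration; nothing here reflects IBM's actual profiling):

```c
#include <stddef.h>

/* Extreme A: a handful of instructions swept over a huge data set.
   The loop body fits in a few lines of i-cache; performance is
   dominated by how well the d-cache (or prefetch) feeds it. */
double dot(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Extreme B: a long chain of distinct routines applied to a small,
   resident data set. Here the code footprint, not the data, is what
   the L1 has to hold, so a bigger i-cache wins. */
double process_sample(double x)
{
    /* imagine dozens of branches and helper calls here */
    return (x < 0.0) ? -x * x : x * x;
}
```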

I assume that IBM would have profiled a number of programs, particularly those for which they wanted optimal performance (or most impressive benchmarks as the case may be) and then chosen the L1 partitioning based on that profiling.

BTW, most CPUs have a fixed division between i- and d-cache; however, some can do it dynamically to effectively get the best of both worlds. From memory, I think the humble old Cyrix 686MX had a 64k combined i/d cache with no fixed division.
 

MadRat

Lifer
Oct 14, 1999
11,961
278
126
Does a dynamic-combination L1 cache incur a lot of latency penalty in exchange for this type of utility?
 

lexxmac

Member
Nov 25, 2003
85
0
0
Originally posted by: MadRat
Originally posted by: Eug
New G5
Die size: 66 mm2
Transistors: 58 million
Process: 90 nm, Silicon on insulator

The current G5 chips are 130nm. The 90nm chips are still a ways off, or at least not yet in full-scale production. Anyway, the currently shipping machines use 130nm chips. Interestingly, the northbridge is made on the same 130nm process as the CPUs.
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
I would think that with today's data-intensive applications (graphics and sound take more space as they get better; code does not), the exact REVERSE of this would be a better design. That is, I would think the sweet spot on price/performance for data cache would be larger than that for instruction cache, especially on a CPU with a strong vector instruction set like AltiVec. You sure those numbers are not reversed?
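For concreteness, this is the kind of vector-heavy loop in question; a minimal sketch assuming a PowerPC compiler with -maltivec and 16-byte-aligned arrays:

```c
#include <altivec.h>

/* c[i] = a[i] + b[i], four floats per instruction. The arrays being
   streamed can be far larger than any L1 cache. */
void vadd(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, a + i);  /* load 16 aligned bytes */
        vector float vb = vec_ld(0, b + i);
        vec_st(vec_add(va, vb), 0, c + i);   /* store 4 sums */
    }
}
```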
 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
The point in making the I-cache larger comes from exactly what glugglug says: today, typical high-performance applications use a certain (small) codeset to work on streaming (HUGE) datasets. Trying to cache the datasets is rather pointless given their size; good prefetching and write buffering strategies are needed there, and they have long been in place. Being able to fit a larger chunk of code into L1 cache, in contrast, will improve performance quite impressively.
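A minimal illustration of the prefetching point, using GCC's __builtin_prefetch (the distance of 16 elements ahead is a made-up tuning value):

```c
#include <stddef.h>

/* Each element is touched exactly once, so no amount of d-cache will
   create reuse; prefetching ahead hides the memory latency instead. */
float stream_sum(const float *data, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 0); /* read, no temporal locality */
        s += data[i];
    }
    return s;
}
```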

And yes, uart, the superscalar Cyrixes from 5x86 to MIII had unified L1 cache.
 

lexxmac

Member
Nov 25, 2003
85
0
0
Originally posted by: Peter


And yes, uart, the superscalar Cyrixes from 5x86 to MIII had unified L1 cache.

Why isn't that same principle in use today? What are the pros and cons of having a unified L1 cache?
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: lexxmac
Originally posted by: Peter


And yes, uart, the superscalar Cyrixes from 5x86 to MIII had unified L1 cache.

Why isn't that same principle in use today? What are the pros and cons of having a unified L1 cache?

With a unified cache, you need enough bandwidth to provide the data AND instructions. With a Harvard architecture (separate I and D caches), each cache only needs the bandwidth for either instructions or data.

Also, in general, bigger caches are slower, so two smaller caches would be faster.

I can't remember other reasons right now.
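One more effect a toy model can show: in a unified cache, the instruction and data streams can evict each other. A deliberately tiny direct-mapped model (all sizes and the access trace are invented for illustration):

```c
#include <stdio.h>

/* One 8-set unified cache vs. split 4-set i- and d-caches of the same
   total size. Each set holds one line tag; -1 means empty. */
static int touch(int *set, int nsets, int line)
{
    int idx = line % nsets;
    if (set[idx] == line) return 1;  /* hit */
    set[idx] = line;                 /* miss: evict and fill */
    return 0;
}

int main(void)
{
    /* A fetch pattern where instruction lines (0,1) and data lines (8,9)
       map to the same sets of the unified cache and thrash each other. */
    int uni[8], ic[4], dc[4], hits_u = 0, hits_s = 0;
    for (int i = 0; i < 8; i++) uni[i] = -1;
    for (int i = 0; i < 4; i++) ic[i] = dc[i] = -1;

    for (int step = 0; step < 8; step++) {
        int iline = step % 2;        /* instruction fetch: lines 0,1 */
        int dline = 8 + step % 2;    /* data access: lines 8,9 */
        hits_u += touch(uni, 8, iline) + touch(uni, 8, dline);
        hits_s += touch(ic, 4, iline) + touch(dc, 4, dline);
    }
    printf("unified: %d/16 hits, split: %d/16 hits\n", hits_u, hits_s);
    return 0;
}
```

Run as-is, the unified model misses on every access while the split caches settle into steady hits after the first pass.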
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
A unified cache also can't be used to hold decoded microinstructions, rather than x86 machine instructions as they are found in memory, the way the P4 and Athlon64/Opteron do.
 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
Well, the Cyrix line from the 5x86 to the M-III was about the last of the native x86 engines. They didn't internally translate to some RISC micro-op set.
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
Only the most recent Intel processors cache micro-ops; they use the trace cache. No AMD processor stores micro-ops.
 

drag

Elite Member
Jul 4, 2002
8,708
0
0
The way I figure it, having an asymmetrical cache makes a processor more efficient. Or at least it is more useful on a more efficient processor.

My reasoning goes like this:

In the PlayStation 2 you have relatively small amounts of cache and memory spread throughout the system. This is because they don't want large amounts of cache; instead they concentrate on creating wide buses between everything. They move the graphics data fast, so less time is needed in the cache; the less cache you need, the faster everything gets, etc. So you could create a system with a 300MHz CPU that would run a word processor as fast as a 486 but handle graphics like an 800MHz+ Pentium.

Now you have the data cache and the instruction cache.

Most programming is done in loops, meaning that you have the same instructions used over and over again on different data.

You don't need a large data cache, because the data isn't going to be used more than once in a while, so it needs to be moved fast. Instructions are going to sit around for a long time, so you want to make the cache large enough to hold even complex instruction sequences; since instructions don't move around as much, latency doesn't matter as much as it does for the data cache.

.............

Now, on to systems that use symmetrical cache, like Pentiums.

The P4 is a very inefficient processor; it takes 2-3 times the clock cycles to get any one thing done. But it has very long pipelines, so it can work on bunches of stuff at the same time and do it fast.

You also have limitations like the x86 ISA. The ISA is the set of rules for how software is supposed to interact with the hardware. Originally it dictated how the hardware was designed; now it dictates an intermediate layer between the hardware and the software. You can create any god-awful or weird hardware imaginable, but as long as the software sees the x86 ISA, it will run.

For instance, the x86 ISA dictates that there will be 8 general-purpose registers available for processing inputs in the CPU.

Well, the P4 has 128 physical GPRs available.

So all the software can see is 8. What the CPU does is begin processing on 8 GPRs at a time, then shift the mapping to another unused 8 GPRs to start processing another chunk of data.
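A toy sketch of that remapping (register renaming); the bump allocator and sizes are illustrative only, not how the P4 actually manages its free list:

```c
#include <stdio.h>

#define ARCH_REGS 8    /* what the x86 ISA exposes */
#define PHYS_REGS 128  /* what the hardware actually has */

static int map[ARCH_REGS];          /* arch reg -> current phys reg */
static int next_free = ARCH_REGS;   /* toy free list: just bump */

/* Every write to an architectural register gets a fresh physical one,
   so independent writes to "the same" register no longer serialize. */
static int rename_dest(int arch)
{
    map[arch] = next_free++ % PHYS_REGS;
    return map[arch];
}

int main(void)
{
    for (int i = 0; i < ARCH_REGS; i++) map[i] = i;

    /* two back-to-back writes to EAX land in different phys regs */
    printf("eax -> p%d\n", rename_dest(0));  /* eax -> p8 */
    printf("eax -> p%d\n", rename_dest(0));  /* eax -> p9 */
    return 0;
}
```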

So the large data cache is important to keep the data ready as it waits for another section of the CPU to open up and process it.

Also, with long pipelines it's important to make sure that you have the scheduling down very, very well. It may take a long time to process info on a P4 (in terms of clock cycles), but as long as you time everything right, all the data will come out the other end ready to go and in order. If you get stuff out of order, then you have to waste time reprocessing things and bringing data back up out of memory and into cache, and so on and so forth.

Bigger data caches make this easier, I figure.

So symmetrical cache isn't important in the sense that it "has to be symmetrical", but it is a good rule of thumb and makes chips easier to design.

This is all, of course, pure guesswork. I have no idea if I am right or wrong.
 
Nov 22, 2003
36
0
0
Originally posted by: lexxmac
Originally posted by: MadRat
Originally posted by: Eug
New G5
Die size: 66 mm2
Transistors: 58 million
Process: 90 nm, Silicon on insulator

The current G5 chips are 130nm. The 90nm chips are still a ways off, or at least not yet in full-scale production. Anyway, the currently shipping machines use 130nm chips. Interestingly, the northbridge is made on the same 130nm process as the CPUs.

Nope, IBM is currently making some 90nm G5 chips. They power the Xserve servers from Apple. The bulk of Apple's sales, I assume, are the desktop machines, which use the older 130nm process. So the 90nm chips are only shipping in limited quantities, but enough for the Xserve systems.

BTW, the Pentium 4 basically has an asymmetrical L1 cache too.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: MadRat
Besides the 12k of uops, how much data does it hold?

None. It just caches decoded instruction traces. Getting its size in KB isn't particularly informative unless you know the size of the uops, which isn't really useful information by itself either.
edit: Look for a paper called "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching" by Rotenberg, Bennett, and Smith.
 

MadRat

Lifer
Oct 14, 1999
11,961
278
126
Does the x86 architecture make one size of L1 cache optimal, as opposed to what would be found in a RISC architecture? AMD appears to have stayed with the 64/64 arrangement through the jumps from 250nm to 180nm, from 180nm to 130nm, and from 130nm to 90nm. I have to assume that AMD sees no benefit in moving to a 96/96 or 128/128 configuration.

I'm impressed by trace cache and the claims of performance improvement:

17-34% integer boost
6-9% FPU boost

Does anyone believe AMD won't eventually move to this layout?