The way I figure it, having an asymmetrical cache makes a processor more efficient. Or at least it is more useful on a more efficient processor.
My reasoning goes like this:
In the PlayStation 2 you have relatively small amounts of cache and memory spread throughout the system. This is because they don't want large amounts of cache; instead they concentrate on creating wide buses between everything. They move the graphics data fast, so it spends less time in the cache, so you need less cache, and everything gets faster, etc. etc. So you could create a system with a 300MHz CPU that would run a word processor about as fast as a 486, but handle graphics like an 800MHz+ Pentium.
Now you have the data cache and the instruction cache.
Most programming is done in loops, meaning that you have the same instructions used over and over again on different data.
You don't need a large data cache, because any given piece of data is only going to be used once in a while, so it needs to be moved fast. Instructions are going to sit around for a long time, so you want to make the instruction cache large enough to hold even complex sets of instructions. Since they don't move around as much, latency doesn't matter as much as it does for the data cache.
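To put some toy numbers on the loop argument, here is a rough Python sketch. Everything in it is made up for illustration: the loop body size, the trip count, the 16-entry cache, and the simplification that every address is its own cache line (real caches fetch whole lines, which helps streaming data quite a bit). It only shows the reuse pattern (instructions get re-fetched, streamed data does not); it says nothing about how big either cache ought to be.

```python
from collections import OrderedDict

def hit_rate(addresses, capacity):
    """Fraction of accesses that hit in a tiny fully associative LRU cache."""
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(addresses)

LOOP_BODY = 8      # hypothetical: instructions in the loop body
ITERATIONS = 1000  # hypothetical: loop trip count

# Instruction fetches: the same 8 addresses on every iteration.
instr_stream = [pc for _ in range(ITERATIONS) for pc in range(LOOP_BODY)]
# Data accesses: one new element per iteration, never revisited.
data_stream = list(range(ITERATIONS))

instr_hits = hit_rate(instr_stream, capacity=16)
data_hits = hit_rate(data_stream, capacity=16)
print(f"instruction hit rate: {instr_hits:.3f}")
print(f"data hit rate:        {data_hits:.3f}")
```

In this toy model the instruction stream only misses on the first pass through the loop, while the streamed data never hits at all, no matter how big you make the cache.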
.............
Now for systems that use symmetrical cache, like Pentiums.
The P4 is a very inefficient processor. It takes 2-3 times the clock cycles to get any one thing done. But it has very long pipelines, so it can work on a whole bunch of stuff at the same time and do it fast.
You also have limitations like the x86 ISA. The ISA (instruction set architecture) is the set of rules for how software is supposed to interact with the hardware. Originally it dictated how the hardware was designed; now it dictates an intermediate layer between the hardware and the software. You can create any god-awful or weird hardware imaginable, but as long as software sees the x86 ISA, it will run.
For instance, the x86 ISA dictates that there will be 8 general purpose registers (GPRs) available for processing inputs into the CPU.
Well, the P4 has 128 GPRs available internally.
But all the software can see is 8. So what I think it does is begin processing on 8 GPRs at a time, then shift the mapping to another unused 8 GPRs to start processing another chunk of data. (The general technique is called register renaming.)
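Here is a minimal Python sketch of the remapping idea, as a textbook-style illustration and not a description of what the P4 actually does inside. The `Renamer` class and its method names are made up for this example. One difference from the description above: in the usual scheme the mapping changes one register at a time, on every write, instead of in blocks of 8. Freeing registers once an instruction finishes is left out to keep it short.

```python
ARCH_REGS = 8    # registers the x86 ISA exposes to software
PHYS_REGS = 128  # physical register count the post attributes to the P4

class Renamer:
    """Hypothetical rename table: architectural register -> physical register."""

    def __init__(self):
        # At the start, architectural register i lives in physical register i.
        self.table = list(range(ARCH_REGS))
        self.free = list(range(ARCH_REGS, PHYS_REGS))

    def rename(self, dst, srcs):
        """Map one instruction's registers; returns (phys_dst, phys_srcs)."""
        phys_srcs = [self.table[s] for s in srcs]  # read the current mappings
        phys_dst = self.free.pop(0)                # fresh register for the result
        self.table[dst] = phys_dst
        return phys_dst, phys_srcs

r = Renamer()
first = r.rename(dst=0, srcs=[1, 2])   # writes "register 0"
second = r.rename(dst=0, srcs=[3, 4])  # also writes "register 0"
third = r.rename(dst=5, srcs=[0, 1])   # reads the newest "register 0"
print(first, second, third)
```

The first two instructions both write what software calls register 0, but they land in different physical registers, so neither has to wait for the other. The third one reads whichever copy is newest.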
So the large data cache is important to keep the data ready while it waits for another section of the CPU to open up and process it.
Also, with long pipelines it's important to make sure that you have the scheduling down very, very well. It may take a long time to process info on a P4 (in terms of clock cycles), but as long as you time everything right, all the data will come out the other end ready to go and in order. If you get stuff out of order, though, you have to waste time reprocessing things and bringing data back up out of memory and into cache, and so on and so forth.
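A back-of-the-envelope way to see why mistakes hurt more on a long pipeline, again in Python. The model is idealized and the numbers are invented: one instruction finishes per cycle once the pipeline is full, and every flush is assumed to cost a full refill of the pipeline. The function name and the depths of 10 and 20 stages are just for illustration.

```python
def total_cycles(instructions, depth, flushes):
    """Cycles for an idealized pipeline that finishes one instruction per cycle."""
    fill = depth - 1           # cycles before the first result comes out
    penalty = flushes * depth  # assumption: each flush costs a full refill
    return instructions + fill + penalty

SHALLOW, DEEP = 10, 20  # hypothetical pipeline depths

for flushes in (0, 50):
    shallow = total_cycles(1000, SHALLOW, flushes)
    deep = total_cycles(1000, DEEP, flushes)
    print(f"{flushes:2d} flushes: shallow={shallow} cycles, deep={deep} cycles")
```

With no flushes the two pipelines are nearly identical in cycle count, and the deep one would win if its shorter stages let it clock faster. With 50 flushes the deep pipeline pays twice the penalty, which is why getting the scheduling right matters so much more there.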
Bigger data caches make this easier, I figure.
So it isn't so much that the cache "has to be symmetrical"; it's more that symmetrical cache is a good rule of thumb and makes chips easier to design.
This is all, of course, pure guesswork. I have no idea if I am right or wrong.