First post! (for me, anyway)
Am I right in thinking that, speed-wise, L1 > L2 > L3?
To make things more complicated, how is Sandy Bridge's new micro-op cache different from traditional caches? Aren't they all just storing instructions?
Inteluser2000 already gave you a good answer, but perhaps I can provide a somewhat more nuanced answer.
I'm not sure what exactly you mean by 'speed'. Assuming you mean access latency, then it would be:
L1 < L2 < L3
If you mean the number of requests serviced per unit of time, then as I understand it, it would be:
L1 > L2 >= L3
In terms of the L0 I-cache in the front end, it likely stores partially decoded instructions. Current CISC microarchitectures all contain two main components in the decode stage of the pipeline (sometimes other functionality as well, such as detecting ops that can be fused). The first is x86 decode, which converts the compiler-visible x86 representation into the machine's native internal representation for the remainder of the pipeline. The second is format decode, which determines which bits of the instruction go to which part of the instruction control circuitry, or to the following datapath (for immediates); the format is generally determined by the number of register operands and immediate operands, as well as by the length of the opcode.
For most designs out there that contain an instruction $ holding a processor-internal representation of a CISC ISA, the representation in the L0 has usually gone through the first major component of decode, but not the second. Sandy Bridge probably has this implementation too.
Hope this helps.