
Microprocessor Level 1 and 2 Cache Architectures

AGodspeed

Diamond Member
I am wondering exactly what advantages there are to adding more on-die cache to a processor. For example, when Intel decided to add an extra 256KB of L2 cache to the recently released Northwood processor (for a total of 512KB L2), was this a purely performance-driven initiative, or was there something else behind it?

It would be interesting to read about the differences in cache architectures between the PIII, P4, Athlon, and other processors. I have a basic understanding that the P4 has a 256-bit wide L1-to-L2 data pathway (the Athlon's is 64 bits wide, 16-way associative), but I am just not satisfied with this very low level understanding (I know more, but nothing quite detailed enough, I'm afraid).

Any info you can contribute would be greatly appreciated; I'll add it to the large collection of info I've already received from the knowledgeable people of these forums. 🙂
 
Cache is just very fast memory actually built onto the processor die (on most modern CPUs anyway). Whenever a processor executes an instruction, it needs some information to be able to do it, whether this is the instruction code itself or data on which the instruction will operate.

If a processor cannot get information quickly enough, it has nothing to do and the performance of the system drops. Main memory, for example, is not fast enough to keep the processor constantly supplied with the information it needs, so a cache is placed on the processor to store the information it is currently working on. On most modern CPUs this cache runs at full processor speed.

Often, a set of instructions will reuse the same information many times over. As long as the currently relevant information can stay in the cache, the processor will not be starved. As CPUs get faster, they require more and more memory throughput to stay supplied with data. That was the reason Intel gave for including a larger cache on the P4: now that it is much more likely to be paired with SDR or DDR RAM rather than higher-bandwidth RAMBUS, the larger cache helps make up the difference.
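The effect of that reuse (temporal locality) can be sketched with a toy cache model. All the numbers here are made up for illustration: a small LRU cache gets a near-perfect hit rate once the working set fits, and none at all once it doesn't.

```python
from collections import OrderedDict

def hit_rate(accesses, cache_lines):
    """Simulate a tiny fully-associative LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(accesses)

# A loop that reuses 8 addresses over and over (working set fits in 16 lines)...
small_set = list(range(8)) * 100
# ...versus one that cycles through 64 addresses (working set does not fit).
large_set = list(range(64)) * 100

print(hit_rate(small_set, 16))   # 0.99: only the first pass misses
print(hit_rate(large_set, 16))   # 0.0: LRU thrashes, every access misses
```

The second case is exactly the "starved processor" scenario: once the working set outgrows the cache, every access goes to slower memory.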

With graphics cards, the data usage patterns are different. The textures, frame buffer, z-buffer, and so on of a 3D scene are far too large to fit in a cache. Also, the access pattern is more likely to be that every clock cycle the processor is working on different pixels, hence different data. This means a very fast path to the graphics memory is necessary, hence the memory used often being 250MHz DDR with far higher throughput than system memory. It also explains why increasing the clock speed of the graphics processor helps very little: it simply doesn't get information quickly enough and has to idle instead.
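The throughput gap described above is simple arithmetic. The 128-bit bus width for the graphics card and PC133 for system memory are illustrative, roughly period-correct assumptions, not figures from the post:

```python
def bandwidth_gbs(clock_mhz, transfers_per_clock, bus_bits):
    """Peak bandwidth in GB/s = clock rate x transfers/clock x bytes per transfer."""
    return clock_mhz * 1e6 * transfers_per_clock * (bus_bits / 8) / 1e9

# 250 MHz DDR graphics memory on a hypothetical 128-bit bus (DDR: 2 transfers/clock)
gfx = bandwidth_gbs(250, 2, 128)
# PC133 SDR system memory on the standard 64-bit bus
sys_mem = bandwidth_gbs(133, 1, 64)

print(f"graphics: {gfx:.1f} GB/s, system: {sys_mem:.2f} GB/s")
```

With these assumptions the graphics memory delivers 8.0 GB/s against roughly 1.1 GB/s for system memory, which is why streaming workloads lean on raw memory bandwidth rather than caches.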
 
At a very low level, with a larger cache you can obviously store more of the frequently accessed information. Since cache is significantly faster and closer to the processor, data and/or instructions reach the execution units much sooner than if they had to come all the way from RAM, let alone the hard drive.

Take SETI for example. From what I understand, it does extremely well on larger cache processors. Clearly the SETI program is able to fit more of itself, or more data in the L1 and L2 caches, thereby preventing a costly access (in terms of latency) to RAM.

Obviously, adding more cache has downsides. The most obvious is silicon real estate, along with a large increase in transistor count. Perhaps less obvious is that a larger cache also takes longer to access: more cycles are needed to decode the addresses. This appears to be why the Pentium 4's L1 cache is so small: at only 8KB, its access latency is tiny; it is built for sheer speed. If Intel were to double it to 16KB, it has been speculated that each access would take another 1-2 cycles.
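This size-versus-latency tradeoff is usually reasoned about with average memory access time (AMAT). The latencies and miss rates below are invented round numbers, not measured P4 figures, but they show how a smaller, faster L1 can come out ahead:

```python
def amat(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time: hit time plus the expected cost of a miss."""
    return hit_cycles + miss_rate * miss_penalty_cycles

# A tiny, fast L1: quick to hit, but it misses more often.
small_l1 = amat(hit_cycles=2, miss_rate=0.10, miss_penalty_cycles=10)
# A doubled L1: fewer misses, but every single access pays extra latency.
large_l1 = amat(hit_cycles=4, miss_rate=0.08, miss_penalty_cycles=10)

print(small_l1)  # 3.0 cycles on average
print(large_l1)  # 4.8 cycles on average: the smaller cache wins here
```

The intuition: the extra hit latency is paid on every access, while the improved miss rate only helps on the minority of accesses that would have missed.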

Another major reason why the P4 has such a small L1 cache is that there's a bit of a re-think on the typical cache architecture: the Execution Trace Cache. As I understand it, this is really an L1 instruction cache (caching 12K micro-ops, speculated to be around 96KB in size), but placed after the x86 decoders rather than before them. For regularly used instructions, the processor can fetch already-decoded micro-ops from the ETC and effectively skip the decode stages.

Let's compare a more conventional Coppermine to the Thunderbird/Palomino. The Coppermine uses an inclusive L1/L2 cache policy, meaning that the contents of the L1 cache are also present in the L2 cache. With 256KB of L2 and 32KB of L1, this means you have an effective total of 256KB of unique data in the cache. The TBird/Palomino uses an exclusive policy, where the L1 contents are NOT duplicated in the L2. Thus you have an effective total of 384KB (128KB L1 plus 256KB L2). This seems to make up for the fact that the pathway is only 64 bits wide, but you need to consider that an exclusive design costs extra L1/L2 bandwidth: lines evicted from the L1 must be copied into the L2 rather than simply discarded.
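The effective-capacity arithmetic can be written out directly, using the cache sizes from the comparison above:

```python
def effective_capacity_kb(l1_kb, l2_kb, inclusive):
    """Unique cacheable data: an inclusive L2 duplicates the L1, an exclusive one does not."""
    return l2_kb if inclusive else l1_kb + l2_kb

# Coppermine: 32KB L1 + 256KB L2, inclusive -> L1 contents also live in L2
coppermine = effective_capacity_kb(32, 256, inclusive=True)
# Thunderbird/Palomino: 128KB L1 + 256KB L2, exclusive -> no duplication
tbird = effective_capacity_kb(128, 256, inclusive=False)

print(coppermine)  # 256 KB of unique data
print(tbird)       # 384 KB of unique data
```

The exclusive design's advantage grows with the L1:L2 size ratio, which is why it pays off for the Athlon's unusually large 128KB L1.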

As usual, that's a very ham-fisted explanation from me.

For a much more coherent explanation, check out BurntKooshie's article, The Fundamentals of Cache.
 
pm or Sohcan once pointed out that the L1 cache is optimized for raw performance while the L2 cache is optimized for size. Different manufacturing processes seem to have optimal sizes for L2 caches: 128KB was the optimal on-die size on the .25 micron scale. The performance of a 512KB off-die cache was better overall, hence P2s with the tiny 128KB on-die L2 cache became "mere" Celerons. Then 256KB was the optimal L2 size on the .18 micron scale, and it appears that 512KB will be the optimal size on the .13 micron scale. It will be interesting to see if the optimal L2 size on the .09 micron scale will be 1024KB... err, 1MB.

Gone are the days of the $1000 consumer CPU, when we paid for "huge" off-die L2 caches. We could be using 2GHz Pentium 4s with 4MB off-die L2 caches running at 400MHz, but what would be the use? You wouldn't want to pay the price difference when the tiny 256KB L2 cache built on-die is much less pricey.
 
I assume you're talking about optimal cost/performance here. It's always true that a larger cache produces a better-performing processor; however, cache memory is very expensive in terms of transistor count, makes the CPU die bigger, and puts out more heat. Putting 512KB of L2 cache on a 0.18 micron processor, while certainly possible, means a larger die: fewer chips can be produced per wafer, and a greater proportion of them will fail. These two effects combine to greatly increase the cost.

Off-die caches have various advantages and disadvantages. You can run them asynchronously: with the 0.25 micron P3s, for example, the L2 cache ran at half the core clock speed. And a failed cache chip does not stop the CPU die from working; a different cache can be used instead, so the yields of cache and core no longer multiply together into one combined yield.
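The yield argument can be made concrete with a back-of-the-envelope model; the per-die yields below are invented purely for illustration:

```python
def on_die_yield(core_yield, cache_yield):
    """On a single die, both core and cache must work: the yields multiply."""
    return core_yield * cache_yield

def off_die_yield(core_yield):
    """Off-die, a dead cache chip is simply swapped out; only the core die matters."""
    return core_yield

core, cache = 0.80, 0.90  # hypothetical yields for the core and the cache array

print(round(on_die_yield(core, cache), 2))  # 0.72: cache defects kill whole CPUs
print(off_die_yield(core))                  # 0.8: cache failures only cost cache chips
```

With the cache on-die, every cache defect throws away a working core; with it off-die, defective cache chips and defective cores are discarded independently.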

However, running cache at lower than processor speed significantly reduces its advantage. No one would want a 400 MHz cache since that's barely faster than main memory.
 
<<No one would want a 400 MHz cache since that's barely faster than main memory. >>

On the contrary, we are talking a SYNCHRONOUS 400MHz. The "400MHz" FSB used on the Pentium 4 is really a 100MHz bus clock carrying four transfers per cycle (quad-pumped). A true 400MHz L2 cache would be four times faster than any single 100MHz transfer path.
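The quad-pumped arithmetic is straightforward. The 64-bit bus width below is an assumption about the P4's FSB; the clock and transfer figures come from the post:

```python
def fsb_bandwidth_mbs(clock_mhz, transfers_per_clock, bus_bits):
    """Peak front-side-bus bandwidth in MB/s."""
    return clock_mhz * transfers_per_clock * (bus_bits // 8)

# The P4's "400MHz" FSB: a 100MHz clock moving data four times per cycle
quad_pumped = fsb_bandwidth_mbs(100, 4, 64)
# The same bus without quad-pumping
plain_100 = fsb_bandwidth_mbs(100, 1, 64)

print(quad_pumped)  # 3200 MB/s
print(plain_100)    # 800 MB/s: quad-pumping gives four times the throughput
```

So the bus moves data at an effective 400 million transfers per second even though the underlying clock is only 100MHz.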
 