
P4 vs Athlon XP in terms of cache

AMDPwred

Diamond Member
Which one has more cache? Will we be seeing any true architecture upgrades with newer versions of chips this year, or are we just getting more GHz?
 
Duron (Morgan core/180 nm) has 192 KB of on-die cache

Athlon XP (Palomino core/180 nm) has 384 KB of on-die cache

P4 (Willamette core/180 nm) has 256 KB of on-die cache (L1 maps to L2 making it redundant)

P4 (Northwood core/130 nm) has 512 KB of on-die cache (L1 maps to L2 making it redundant)

edit: Forgot to answer the second part: no new cache hierarchies this year for current products. New P4-based (Willamette/180 nm) Celerons will have 128 KB of on-die cache, and the new Clawhammer (AMD's K8 64-bit desktop CPU) is purported to have 512 KB of on-die L2 cache with an unknown amount of L1 cache; it remains to be seen whether it is redundant like Intel's or additive like AMD's.
 
Not quite that simple.

The AMD Athlon has 128KB of L1 cache and 256KB of L2 cache in an exclusive arrangement: L1 and L2 contents are not duplicated. The data path is only 64 bits wide. While people like to say that the Athlon has 384KB of cache, this is not quite true, as the two cache levels have different latencies.

The same goes for the Duron, with 128KB of L1 and 64KB of L2.

The Pentium 4 features an 8KB L1 data cache, and 256KB of L2 for the Willamette or 512KB for the Northwood. The data path is 256 bits wide. The main difference lies in the L1 instruction cache. Intel has completely changed the way this cache works: it has become an execution trace cache, effectively placed after the instruction decoders so that it stores already-decoded micro-ops. This ETC is speculated to occupy around 96KB. The way it is positioned relative to the decoders means that its performance benefit increases as the processor frequency increases.
 


<< Duron (Morgan core/180 nm) has 192 KB of on-die cache

Athlon XP (Palomino core/180 nm) has 384 KB of on-die cache

P4 (Willamette core/180 nm) has 256 KB of on-die cache (L1 maps to L2 making it redundant)

P4 (Northwood core/130 nm) has 512 KB of on-die cache (L1 maps to L2 making it redundant)
>>

Exclusivity vs. inclusivity does not make the cache levels redundant or additive; the issues behind cache design are far more complex than that. Whether the cache hierarchy is inclusive or exclusive, you have to consider the cache hit-rates and latencies of each level separately to analyze the hierarchy. The basic equation to determine access time (for a two-level cache) is the same for both philosophies: average memory access time = (L1 access time) + (L1 miss-rate) * (L2 access time) + (L2 global miss-rate) * (main memory access time).

The last term, the L2 global miss-rate, is the miss-rate after L1 references. The general rule of thumb is that the global miss-rate is unaffected by the presence or size of the L1 (i.e., doubling the L1 size doesn't change the L2's global miss-rate), as long as the L2 is at least 4 to 8 times the size of the L1. This is precisely where exclusivity vs. inclusivity comes into play.
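The two-level equation above can be sketched with the P4 figures quoted further down in this thread; note the 200-cycle main-memory access time is a made-up number for illustration only, not something stated in the post:

```python
# Worked example of the two-level average memory access time (AMAT) equation.
# L1/L2 figures are the rough P4 numbers from this thread; the main-memory
# latency (200 cycles) is an assumed placeholder.
l1_access = 2            # cycles (L1 load-use latency)
l1_miss_rate = 0.039     # ~3.9% L1 miss-rate
l2_access = 5            # cycles (L2 latency)
l2_global_miss = 0.0074  # ~0.74% global miss-rate after L1 references
mem_access = 200         # cycles (assumed, not from the post)

amat = l1_access + l1_miss_rate * l2_access + l2_global_miss * mem_access
print(round(amat, 3))    # 3.675 cycles on average per memory reference
```

Most references hit in the L1, so even a large main-memory penalty contributes only about 1.5 cycles to the average here.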

Take the P4 for example. Its L1 execution trace cache stores 12,000 micro-ops; an x86 instruction on average decodes to 1.5 micro-ops, and x86 instructions are on average about 4 bytes long, so the data from the trace cache occupies around 32KB in the L2 (rough estimate). It has a combined L1 size of around 40KB (8KB data L1, 2-cycle load-use latency, ~3.9% average miss-rate) and a 512KB L2 (5-cycle latency, ~0.74% miss-rate). As it is inclusive (not exactly; truly inclusive caches are rare), it has effectively around 472KB of L2 at its disposal. If the cache hierarchy were exclusive like the Athlon's, it would use the full 512KB. Miss-rate scales roughly with the inverse square root of cache size, so the miss-rate would decrease from 0.74% to about 0.71%, hardly worth noting. Since maintaining cache exclusivity consumes bandwidth from the victim buffer, it is not worth it for the P4.
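The P4 arithmetic above checks out; here is the same estimate spelled out, using the rough figures from the post:

```python
from math import sqrt

# Rough footprint of the trace cache's contents in the L2 (figures from the post).
uops = 12_000        # trace cache capacity in micro-ops
uops_per_x86 = 1.5   # average micro-ops per x86 instruction
bytes_per_x86 = 4    # average x86 instruction length in bytes
footprint_kb = uops / uops_per_x86 * bytes_per_x86 / 1024
print(footprint_kb)  # 31.25, i.e. roughly 32KB duplicated in the L2

# Rule of thumb: miss-rate scales with the inverse square root of cache size.
# Effective inclusive L2: 512KB minus ~40KB of duplicated L1 contents ~= 472KB.
l2_miss_inclusive = 0.74  # percent, from the post
l2_miss_exclusive = l2_miss_inclusive * sqrt(472 / 512)
print(round(l2_miss_exclusive, 2))  # ~0.71%, a negligible improvement
```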

As for the Athlon, it has 128KB of L1 split between data and instruction caches (3-cycle load-use latency, ~1.8% miss-rate) and a 256KB L2 (11-cycle latency, ~1% miss-rate). If the L2 were not exclusive of the L1, only 128KB would be available for caching, and its miss-rate would increase from 1% to around 1.4%; that's significant enough to warrant the exclusive hierarchy.
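The same inverse-square-root rule of thumb gives the Athlon estimate, again using the post's rough figures:

```python
from math import sqrt

# Athlon: 256KB exclusive L2. If it were inclusive, the 128KB L1 contents
# would be duplicated, leaving only ~128KB of effective L2 capacity.
miss_exclusive = 1.0  # percent, from the post
miss_inclusive = miss_exclusive * sqrt(256 / 128)
print(round(miss_inclusive, 1))  # ~1.4%, a ~40% relative increase
```

Because the effective capacity would be halved, the penalty is far larger than in the P4 case, which is why exclusivity pays off for the Athlon.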

To summarize: if the Athlon's L2 were changed to inclusive, its miss-rate would suffer by around 40%; if the P4's L2 were changed to exclusive, its miss-rate would improve by only roughly 4%.

Also, keep in mind that cache size is not the most important factor; latency and hit-rate are. Many factors that increase hit-rate (larger size, higher associativity) also penalize latency. There is also a happy medium for cache block size that best exploits temporal and spatial locality, which must be balanced against bandwidth.
 