<< Duron (Morgan core/180 nm) has 192 KB of on-die cache
Athlon XP (Palomino core/180 nm) has 384 KB of on-die cache
P4 (Willamette core/180 nm) has 256 KB of on-die cache (L1 maps to L2 making it redundant)
P4 (Northwood core/130 nm) has 512 KB of on-die cache (L1 maps to L2 making it redundant) >>
Exclusivity vs. inclusivity does not make the cache levels redundant or additive; the issues behind cache design are far more complex than that. Whether the cache hierarchy is inclusive or exclusive, you have to consider the hit-rate and latency of each level separately to analyze it. The basic equation for access time (for a two-level cache) is the same under both philosophies: average memory access time = (L1 access time) + (L1 miss-rate) * (L2 access time) + (L2 global miss-rate) * (main memory access time).
The last miss-rate term in that equation, the L2 global miss-rate, is the miss-rate after L1 references: the fraction of all memory references that miss in both the L1 and the L2. The general rule of thumb is that the global miss-rate is unaffected by the presence or size of the L1 (i.e., doubling the L1 size doesn't change the L2's global miss-rate), as long as the L2 is at least 4 to 8 times the size of the L1. That size condition is precisely where exclusivity vs. inclusivity comes in.
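To make the equation concrete, here is a minimal Python sketch. The function names and all of the input numbers are made up for illustration (they are not measurements of any real CPU); the only relationships used are the access-time equation above and the standard definition that the global L2 miss-rate equals the L1 miss-rate times the L2's local miss-rate.

```python
def global_miss_rate(l1_miss_rate, l2_local_miss_rate):
    """Fraction of ALL references that miss in both L1 and L2."""
    return l1_miss_rate * l2_local_miss_rate


def average_access_time(l1_time, l1_miss_rate, l2_time,
                        l2_global_miss_rate, mem_time):
    """Average memory access time (cycles) for a two-level cache."""
    return (l1_time
            + l1_miss_rate * l2_time
            + l2_global_miss_rate * mem_time)


# Illustrative inputs only (assumed, not measured):
# 2-cycle L1, 5% L1 miss-rate, 10-cycle L2,
# 1% L2 global miss-rate, 200-cycle main memory.
amat = average_access_time(2, 0.05, 10, 0.01, 200)
print(f"average access time: {amat:.2f} cycles")

# A 20% local L2 miss-rate behind a 5% L1 miss-rate gives a 1% global rate.
print(f"global L2 miss-rate: {global_miss_rate(0.05, 0.20):.2%}")
```

Note that nothing in the function cares whether the hierarchy is inclusive or exclusive; that choice only shows up through the miss-rates you feed in.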
Take the P4 for example. Its L1 execution trace cache stores 12,000 micro-ops; an x86 instruction decodes to 1.5 micro-ops on average, and x86 instructions average about 4 bytes in length, so the contents of the trace cache occupy around 32KB in the L2 (rough estimate). That gives a combined L1 size of around 40KB (8KB data L1, 2 cycle load-use latency, ~3.9% average miss-rate) in front of a 512KB L2 (5 cycle latency, ~0.74% miss-rate). As it is inclusive (not exactly; truly inclusive caches are rare), it effectively has around 472KB of L2 at its disposal. If the cache hierarchy were exclusive like the Athlon's, it would use the full 512KB. The change in miss-rate is roughly proportional to the square root of the ratio of the sizes, so the miss-rate would drop from 0.74% to about 0.74% * sqrt(472/512) = 0.71%, which is hardly worth noting. Since maintaining cache exclusivity consumes bandwidth from the victim buffer, it is not worth it for the P4.
As for the Athlon, it has 128KB of L1 split between data and instruction caches (3 cycle load-use latency, ~1.8% miss-rate) and a 256KB L2 (11 cycle latency, ~1% miss-rate). If the L2 were not exclusive with the L1, only 128KB would be available for caching, and by the same square-root rule its miss-rate would increase from 1% to around 1% * sqrt(256/128) = 1.4%. That's significant enough to warrant the exclusive hierarchy.
To summarize, if the Athlon's L2 were changed to inclusive, its miss-rate would get worse by around 40%; if the P4's L2 were changed to exclusive, its miss-rate would get better by only about 4%.
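If you want to check the arithmetic, the estimates above can be reproduced with a short Python sketch. It assumes nothing beyond the figures quoted in this post and the square-root rule of thumb; the function name is my own, and the 40KB combined L1 figure is the rounded value used above rather than the exact sum.

```python
import math


def scaled_miss_rate(miss_rate, old_size_kb, new_size_kb):
    """Rule of thumb: miss-rate scales with sqrt(old_size / new_size)."""
    return miss_rate * math.sqrt(old_size_kb / new_size_kb)


# P4 trace cache footprint: 12,000 micro-ops, 1.5 micro-ops per
# instruction, about 4 bytes per instruction (figures from the post).
trace_cache_kb = 12000 / 1.5 * 4 / 1024  # roughly 32KB

# P4: inclusive L2 effectively loses the ~40KB already held in L1.
p4_effective_kb = 512 - 40
p4_exclusive = scaled_miss_rate(0.74, p4_effective_kb, 512)

# Athlon: an inclusive L2 would leave only 256 - 128 = 128KB of unique data.
athlon_inclusive = scaled_miss_rate(1.0, 256, 256 - 128)

print(f"trace cache footprint: {trace_cache_kb:.2f} KB")
print(f"P4 exclusive L2 miss-rate: {p4_exclusive:.2f}%")
print(f"Athlon inclusive L2 miss-rate: {athlon_inclusive:.2f}%")
print(f"P4 improvement: {(1 - p4_exclusive / 0.74):.1%}")
print(f"Athlon penalty: {(athlon_inclusive / 1.0 - 1):.1%}")
```

The rule is only a rough approximation, so treat the outputs as ballpark figures rather than predictions for any particular workload.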
Also, keep in mind that cache size is not the most important factor; latency and hit-rate are. Many of the things that increase hit-rate (larger size, higher associativity) also penalize latency. There is also a happy medium for the cache block size: it should be large enough to exploit spatial locality without crowding out the blocks that temporal locality depends on, and that choice has to be balanced against the bandwidth needed to fill each block.