Originally posted by: Lynx516
...Therefore Exclusive is ALWAYS better.
The answer isn't so cut-and-dried...as is often the case in computer architecture, the answer is the cop-out "it depends."
The natural behavior of a multi-level cache is somewhere in between fully inclusive and fully exclusive; both extremes require some effort to maintain. Maintaining full inclusivity typically means that when a cache line is evicted from a higher-level cache, it must also be invalidated in the lower levels of cache if present there. For full exclusivity, the higher levels of cache essentially act as a victim buffer for the lower levels of cache, generating a higher amount of traffic, as glugglug described.
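To make that difference concrete, here's a rough Python sketch of what each policy has to do on an eviction. The structure is mine and purely illustrative (real caches track sets, ways, and coherence state, not bare sets of addresses):

```python
# Toy model: each cache level is just a set of line addresses.
l1 = set()
l2 = set()

def inclusive_l2_evict(line):
    """Inclusive: evicting a line from the L2 forces a back-invalidation
    of any copy in the L1, so the 'L2 holds everything in L1' property holds."""
    l2.discard(line)
    l1.discard(line)      # back-invalidate the lower level

def exclusive_l1_evict(line):
    """Exclusive: the L2 acts as a victim buffer, so every L1 eviction
    writes the victim down into the L2 (extra traffic on each eviction)."""
    l1.discard(line)
    l2.add(line)          # victim moves down instead of being dropped

def exclusive_l2_hit(line):
    """Exclusive: on an L2 hit the line moves up to the L1 and is removed
    from the L2, so only one level ever holds a given line."""
    l2.discard(line)
    l1.add(line)
```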
Exclusivity is necessary if the L2 is less than 4 to 8 times the size of the L1 (just a general rule of thumb); below that ratio, the duplication of data under inclusion begins to impact the L2's local hit rate. If the L2 is at least around 8 times the size of the L1, then the extra effort to maintain exclusivity may not be worth it, given the very minimal increase in L2 hit rate.
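A quick toy calculation shows why the ratio matters (the sizes here are made up, and the worst case is assumed: every L1 line is duplicated in the L2):

```python
# Fraction of the L2 holding unique data under full inclusion,
# assuming every L1 line is duplicated in the L2 (worst case).
def effective_l2_fraction(l2_kb, l1_kb):
    return (l2_kb - l1_kb) / l2_kb

print(effective_l2_fraction(128, 64))   # 2x the L1: only 50% of the L2 is unique data
print(effective_l2_fraction(512, 64))   # 8x the L1: 87.5% unique -- duplication barely hurts
```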
The fact that in a fully inclusive hierarchy the L2 (or highest level of cache) contains all cache lines present at lower levels is actually a big advantage for snooping multiprocessors. Multiprocessors have to make sure that all data in caches are up-to-date (coherent)...if one processor writes to a value in its cache, all other copies of that value in other caches have to be invalidated. A snoopy multiprocessor system does this by broadcasting the event on the bus, on which other processors can "snoop" and invalidate data in their caches, if necessary.
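A minimal sketch of that snoop-and-invalidate idea, again purely illustrative (no real protocol states like MESI, just the broadcast-invalidate behavior described above):

```python
# Toy snoopy bus: a write by one CPU broadcasts an invalidate that every
# other CPU's cache applies to its own copy of the line.
caches = [set(), set(), set()]          # one set of line addresses per CPU

def write(cpu, line):
    caches[cpu].add(line)               # the writer keeps (or gains) the line
    for other, cache in enumerate(caches):
        if other != cpu:
            cache.discard(line)         # snooping CPUs invalidate their copies
```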
For a fully inclusive hierarchy, the processor only needs to check the tags of the highest level of cache (each line in a cache has a tag, which is the upper portion of the line's full address so that the cache can identify it). A fully exclusive hierarchy would normally need to check the tags of all caches in the hierarchy. This is bad for the L1...a snoop would stall the L1 pipeline, which could stall the processor since the L1 is so tightly coupled to the processor pipeline. The solution is to keep a duplicate copy of the lower-level caches' tags alongside those of the highest level, so that snoops can be checked without stalling the L1. Depending on the cache hierarchy, the extra overhead of the duplicated tags can be significant.
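Sketching that lookup difference (hypothetical structure, just to show where the duplicate tags come in):

```python
# Inclusive: a snoop only has to probe the L2 tags, because inclusion
# guarantees that anything in the L1 is also tagged in the L2.
def snoop_inclusive(line, l2_tags):
    return line in l2_tags

# Exclusive: a line may live only in the L1, so a snoop would have to probe
# the L1 tags too -- stalling the L1 -- unless a duplicate copy of the L1
# tag array is kept next to the L2 for snoops to check instead.
def snoop_exclusive(line, l2_tags, dup_l1_tags):
    return line in l2_tags or line in dup_l1_tags
```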