I was reading this article and one thing about it really bothered me:
This version of Penryn is dual-core, and the first quad-core Penryn chips will simply be two of these on a single package, although later on we may see a single-die solution. At 410 million transistors, we expect a dual-core Penryn to have a 6MB shared L2 cache (up from 4MB in Conroe). The logic part of the Penryn core will be mostly evolutionary from Conroe, but do expect additional functionality and performance from more than just a larger cache.
If we assume that 288M transistors (6T SRAM) will be used by the 6MB cache, that leaves 122M transistors for L1 cache and the rest of the core. Applying the same calculation to Conroe gives us 99M transistors left over, meaning that there are roughly 23% more core-logic, control and L1 transistors being used in Penryn than in Conroe.
The calculation here is awful and leads to a bad conclusion. A bit of cache costs much more than the 6 transistors of the storage cell - there are many sources of overhead: column muxes, hierarchical bitline logic, sense amps (possibly), bitline precharge devices, write drivers, redundant rows/columns, decoders, tags, and ECC data. If you account for the actual transistor count of a cache, you'll get a significantly higher number. I think you'll end up with about the same number of transistors left over for the core as Conroe had.
Tags are huge: with 64-byte cache lines and a 6MB cache, that's 98304 lines. 16-way associative => 6144 sets. That means ~12 bits of the address are already determined by which set a block ends up in. A 64-byte line means the low 6 bits are just the offset within the line, so the tag doesn't need those either. If you have a 32-bit physical address space, you need about 14 more bits of tag per line (18 bits if you have a 36-bit physical address space). For multiprocessor support, you need to know what state the cache line is in (I believe Intel's chips use 4 states - modified, exclusive, shared, invalid): 2 more bits. This alone gives you another 98304 * 16 = 1,572,864 bits, times 6 transistors per bit = 9.4M transistors.
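Here's a rough Python sketch of the tag arithmetic, in case anyone wants to plug in different numbers. The function name and parameters are mine, not anything from Intel or the article. It assumes the offset is log2 of the line size, that the index is floor(log2(sets)) since 6144 isn't a power of two, and that tag bits are stored in 6T cells like the data. The total moves with how the address bits are split, so treat the output as a ballpark.

```python
def tag_overhead(cache_bytes, line_bytes, ways, phys_addr_bits,
                 state_bits=2, transistors_per_bit=6):
    """Estimate tag + coherence-state storage for a set-associative cache.

    Hypothetical helper for illustration; all defaults are assumptions.
    """
    lines = cache_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1   # log2(64) = 6
    index_bits = sets.bit_length() - 1          # floor(log2(6144)) = 12
    tag_bits = phys_addr_bits - index_bits - offset_bits
    total_bits = lines * (tag_bits + state_bits)
    return {
        "lines": lines,
        "sets": sets,
        "tag_bits": tag_bits,
        "total_bits": total_bits,
        "transistors": total_bits * transistors_per_bit,
    }


six_mb = 6 * 1024 * 1024
for addr_bits in (32, 36):
    r = tag_overhead(six_mb, line_bytes=64, ways=16, phys_addr_bits=addr_bits)
    print(addr_bits, r["tag_bits"], r["total_bits"], r["transistors"])
# 32 14 1572864 9437184
# 36 18 1966080 11796480
```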
ECC data is worse: if the ECC is done 64 bits at a time, single-error-correct, double-error-detect requires 8 bits of overhead per 64 bits of data, i.e. one-eighth extra. So, 6MB = 6*1024*1024*8 = 50,331,648 data bits => 6,291,456 bits of overhead => 37,748,736 transistors.
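The ECC number is easy to check the same way. This is a minimal sketch, assuming 8 check bits per 64-bit word and 6T cells for the check bits; the function name is made up for illustration.

```python
def ecc_overhead(cache_bytes, word_bits=64, check_bits=8, transistors_per_bit=6):
    """Estimate SEC-DED check-bit storage for a cache's data array.

    Hypothetical helper; word size and check-bit count are assumptions.
    """
    data_bits = cache_bytes * 8
    words = data_bits // word_bits
    ecc_bits = words * check_bits
    return data_bits, ecc_bits, ecc_bits * transistors_per_bit


data_bits, ecc_bits, transistors = ecc_overhead(6 * 1024 * 1024)
print(data_bits, ecc_bits, transistors)
# 50331648 6291456 37748736
```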
That alone is over 45M transistors that Anand missed. The other overheads are harder to estimate precisely, but they add up.
If you assume 80% efficiency (bitcell transistors as a fraction of everything in the cache) with 288M in bitcells, that's 360M transistors for Penryn's cache vs 240M for Conroe's. That leaves 410M - 360M = 50M for the rest of Penryn and 290M - 240M = 50M for the rest of Conroe. I think 80% is way above normal - one of my friends in academia is currently taping out an SRAM with 60% efficiency; 70% is generally considered good.
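To make the leftover-for-the-core comparison easy to redo, here's a small sketch. It assumes "efficiency" means bitcell transistors divided by total cache transistors. The 192M bitcell figure for Conroe is my back-derivation from the 240M above (4MB at 6T per bit, using the article's decimal megabytes), and the function name is hypothetical.

```python
def core_budget(total_transistors, bitcell_transistors, efficiency):
    """Transistors left for everything outside the L2 cache.

    efficiency = bitcell transistors / total cache transistors (assumed definition).
    """
    cache_total = bitcell_transistors / efficiency
    return total_transistors - cache_total


# Totals and bitcell counts in transistors; 0.80 is the efficiency assumed in the text.
penryn = core_budget(410e6, 288e6, 0.80)
conroe = core_budget(290e6, 192e6, 0.80)
print(round(penryn / 1e6), round(conroe / 1e6))
# 50 50
```

The efficiency is a parameter so you can try other values, but the result is only as good as the decimal-megabyte bitcell counts going in.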