kernelc
Member
I don't think Bulldozer's L2 is bad at all, at least compared to AMD's own previous CPUs... from Anandtech's review: http://forums.anandtech.com/newreply.php?do=newreply&noquote=1&p=33513489
L1/L2 latency-
FX-8150: 4/21 cycles
Phenom II X6: 3/14 cycles
L1 cache latency increased for clock speed reasons. The higher L2 latency has two causes: the cache is much larger at 1MB, and it is shared, while in Phenom II it is dedicated. Despite that, it's delivering more bandwidth than the Phenom II, and I think that's quite respectable.
Hi,
at the module level, L2 cache bandwidth should be quite respectable. However, the problems are:
- it seems that a single integer core (i.e. a single thread) can use only half of the L2 bandwidth;
- as L1 is a write-through cache, once the WCC (Write Coalescing Cache) is full, L1 write speed is tied to L2's write speed.
While Phenom II also had a relatively slow L2, its L1 was a write-back design: this means that, in many workloads, the large (64 KB) L1 cache could completely mask L2 speed/latency. This is not always the case with Bulldozer: if the WCC fills up, L1 write speed becomes tied to L2 speed.
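To make the WCC point concrete, here is a toy model in Python. It is only a sketch under loud assumptions: the buffer size (64 entries), the store rate (2 per cycle) and the L2 drain rate (1 per cycle) are made-up illustrative numbers, not Bulldozer's real figures, and `simulate_store_stream` is a hypothetical helper, not a model of the actual hardware.

```python
def simulate_store_stream(n_stores, wcc_entries, stores_per_cycle, l2_drain_per_cycle):
    """Cycles needed to push n_stores through a write-through L1
    whose stores land in a small coalescing buffer that drains to L2.

    All parameters are illustrative assumptions, not measured values.
    """
    buffered = 0   # entries currently waiting in the buffer
    retired = 0    # stores accepted so far
    cycles = 0
    while retired < n_stores:
        cycles += 1
        # L2 drains some entries from the buffer each cycle.
        buffered = max(0, buffered - l2_drain_per_cycle)
        # The core can only issue stores into the free buffer space.
        free = wcc_entries - buffered
        accepted = min(stores_per_cycle, free, n_stores - retired)
        buffered += accepted
        retired += accepted
    return cycles


# Short burst: the buffer never fills, stores retire at the core's rate.
short_burst = simulate_store_stream(32, 64, 2, 1)
# Long stream: the buffer fills and the store rate falls to the L2 drain rate.
long_stream = simulate_store_stream(1000, 64, 2, 1)
print(short_burst, long_stream)
```

With these assumed numbers, the 32-store burst finishes in 16 cycles (2 stores/cycle), while the 1000-store stream needs 937 cycles instead of the 500 it would take if the buffer never filled. A write-back L1 would not show this cliff as long as the working set fits in L1.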
I wrote a Bulldozer analysis some months ago, describing this problem in detail: http://www.ilsistemista.net/index.p...n-whats-wrong-with-amd-bulldozer.html?start=4
Shared L2 cache in Core Duo increased latency to 14 cycles, up from 10 cycles.
I think the problem is again the module concept. You can see that even Sandy Bridge-E, which directly adds 2 more cores and scales the interconnect and memory bandwidth, shows diminishing returns. Now add lower performance/clock, and a module concept that delivers smaller gains than adding full cores.
- Going from 6 K10 cores to 8 Bulldozer cores is, in theory, a 33% increase.
- But in reality the gain ends up smaller than that. For 50% more cores, the 3960X ends up being mostly low-40% faster.
- Then you add that modules don't deliver as much as the same number of full cores.
- And the rest of the applications don't benefit from more cores, and there performance/clock is lower.
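The arithmetic in the list above can be sketched in a few lines of Python. Assumptions: "low-40%" is taken as exactly 40%, the 3960X is compared against a 4-core part (which is what "50% more cores" implies), and Intel's scaling efficiency is applied to AMD purely for illustration. The function names are hypothetical.

```python
def theoretical_gain(old_cores, new_cores):
    """Ideal speedup from core count alone, as a fraction (0.5 == 50%)."""
    return new_cores / old_cores - 1.0


def scaling_efficiency(old_cores, new_cores, observed_gain):
    """Share of the ideal gain that actually shows up in benchmarks."""
    return observed_gain / theoretical_gain(old_cores, new_cores)


# 6 K10 cores -> 8 Bulldozer cores: about 33% in theory.
k10_to_bulldozer = theoretical_gain(6, 8)

# 4 -> 6 Sandy Bridge cores, assuming "low-40%" means 0.40 observed.
sbe_efficiency = scaling_efficiency(4, 6, 0.40)

# Rough projection if Bulldozer scaled as well as SB-E does,
# before any module penalty or performance/clock loss.
projected_gain = k10_to_bulldozer * sbe_efficiency

print(f"theoretical: {k10_to_bulldozer:.1%}")
print(f"SB-E efficiency: {sbe_efficiency:.0%}")
print(f"projected: {projected_gain:.1%}")
```

Under these assumptions the 33% theoretical gain shrinks to roughly 27% from core scaling alone, and the module and performance/clock penalties come on top of that.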
Obviously, increasing core count past a certain point gives diminishing returns, especially in the desktop space. After all, while SB-E is an 8-core design, Intel doesn't have a single desktop SKU with all 8 cores enabled.
Relying on further clock increases won't work with Steamroller, as 28nm might end up performing somewhat worse than 32nm. 28nm doesn't just forgo SOI (which is only responsible for a few %, but still), it may also be a lower-power process. Fortunately, 28nm should improve leakage characteristics, as that's the benefit of a slower transistor.
I'm very curious to see GF's 28nm process. Maybe they surprise us, maybe not! :hmm:
