Question regarding cache (AMD)

Thunder 57

Platinum Member
Aug 19, 2007
2,647
3,706
136
AMD finally seems to have a good cache structure all around with Zen, something they have not had in quite some time. Bulldozer was a mess all over, whereas Phenom had good L1 and L2 caches, but its L3 was never great.

So my question is, what was with the Phenom II's 48-way L3 cache, and BD's 64 (!) way L3? I've never seen anywhere near that many ways before or since. Did that contribute to their horrid latency? Was there some reasoning behind doing this?

Unrelated but fun trivia about Duron, since I was reading the CPU upgrade history thread: Duron had more L1 cache than L2 cache. Talk about odd!

EDIT: I just looked up the first K10 with the 2MB L3 cache, and it was 32-way. That seems closer to normal, though it still seems like a lot of ways for such a small cache.
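For anyone who wants to put numbers on this, here is a rough sketch of the set-count arithmetic (sets = capacity / (ways x line size)). Only the 2MB/32-way pairing and the way counts come from this post; the 64-byte line size and the 6MB and 8MB capacities are assumptions filled in for illustration, and `num_sets` is just a made-up helper name.

```python
MIB = 1024 * 1024


def num_sets(size_bytes: int, ways: int, line_bytes: int = 64) -> int:
    """Return the number of sets: capacity / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


# (label, capacity, ways). Capacities marked "assumed" are NOT from the thread.
configs = [
    ("K10, 2 MB, 32-way", 2 * MIB, 32),
    ("Phenom II, 6 MB (assumed), 48-way", 6 * MIB, 48),
    ("Bulldozer, 8 MB (assumed), 64-way", 8 * MIB, 64),
    ("Zen, 8 MB (assumed), 16-way", 8 * MIB, 16),
]

for label, size, ways in configs:
    print(f"{label}: {num_sets(size, ways)} sets")
```

Under those assumptions the 48-way figure at least lines up with a 6MB capacity: 48 ways keeps the set count at a power of two (2048), which a 32-way or 64-way split of 6MB would not. Whether that was the actual design reasoning is not something this sketch can confirm.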
 

VirtualLarry

No Lifer
Aug 25, 2001
56,229
9,990
126
I'm not a cache expert, but I think the number of "ways" is basically how many cache lines can be held in the cache at once for addresses that share the SAME index bits. So it's partly a function of how many address bits they want to use for the index versus the cache tags. Fewer index bits means more aliasing, which means you need more "ways" for the cache to stay effective.

Newer CPUs seem to use more address bits for this, and therefore need fewer "ways".
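To make the "same address bits" idea concrete, here is a minimal sketch of how an address is usually split into tag, set index, and line offset. The geometry (64-byte lines, 1024 sets) is an assumed example, not a claim about any particular AMD part, and `split_address` is a hypothetical helper.

```python
LINE_BYTES = 64    # assumed cache line size
NUM_SETS = 1024    # assumed set count (both must be powers of two here)

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 6 bits select the byte in a line
INDEX_BITS = NUM_SETS.bit_length() - 1      # 10 bits select the set


def split_address(addr: int) -> tuple[int, int, int]:
    """Split an address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


a = 0x12340
b = a + NUM_SETS * LINE_BYTES   # 64 KiB further on: same index, different tag

print(split_address(a))
print(split_address(b))
```

Both addresses land in the same set and differ only in their tag, so they compete for that set's ways. With one way they would keep evicting each other; with more ways they can coexist.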

Then there's the whole "Machine Intelligence" thing in Ryzen, which I know is used for branch prediction. It might be used somehow for cache support as well; I don't know. (Maybe someone more expert in the Zen architecture will chime in here.)
 

Nothingness

Platinum Member
Jul 3, 2013
2,371
713
136
The ways are accessed in parallel when looking for an address in the cache (tag lookup). This means that the more ways you have, the more power you burn (though that can be reduced with way prediction). But having more ways is a good way (pun intended) to reduce conflicts (and also aliasing for virtually indexed caches), so you need to make a trade-off between cache efficiency and power.

The Wikipedia entry is a good read: https://en.wikipedia.org/wiki/CPU_cache
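As a rough illustration of that trade-off, here is a toy set-associative cache with LRU replacement. It is only a sketch: `SetAssocCache` and `replay` are made-up names, LRU is an assumed policy, and counting one tag comparison per way is a crude stand-in for lookup power, not a model of any real CPU.

```python
from collections import OrderedDict


class SetAssocCache:
    """Toy set-associative cache with LRU replacement (illustrative only)."""

    def __init__(self, num_sets: int, ways: int, line_bytes: int = 64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0
        self.tag_compares = 0   # crude proxy for lookup power

    def access(self, addr: int) -> bool:
        line = addr // self.line_bytes
        index, tag = line % self.num_sets, line // self.num_sets
        ways_in_set = self.sets[index]
        self.tag_compares += self.ways      # every way is checked in parallel
        if tag in ways_in_set:
            ways_in_set.move_to_end(tag)    # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(ways_in_set) >= self.ways:
            ways_in_set.popitem(last=False)  # evict the least recently used
        ways_in_set[tag] = True
        return False


def replay(cache: SetAssocCache, addrs: list[int], rounds: int) -> SetAssocCache:
    """Access the same addresses repeatedly and return the cache."""
    for _ in range(rounds):
        for addr in addrs:
            cache.access(addr)
    return cache


# Same capacity (64 lines), different shapes: 32 sets x 2 ways vs 8 sets x 8 ways.
addrs = [k * 2048 for k in range(4)]   # four lines that share a set in both
narrow = replay(SetAssocCache(num_sets=32, ways=2), addrs, rounds=10)
wide = replay(SetAssocCache(num_sets=8, ways=8), addrs, rounds=10)

for name, c in (("2-way", narrow), ("8-way", wide)):
    print(f"{name}: hits={c.hits} misses={c.misses} tag_compares={c.tag_compares}")
```

With this deliberately conflict-heavy pattern the 2-way cache misses on all 40 accesses, while the 8-way cache misses only the first 4, but it does four times as many tag comparisons along the way.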
 

Thunder 57

Platinum Member
Aug 19, 2007
2,647
3,706
136
OK, that makes sense. So from the sounds of it, the number of ways shouldn't affect latency, just power? I'm still curious as to why they went with so many ways while Intel never did, and with Zen they are back down to 16 for L3. Maybe it was a learning process? Or maybe it was deemed better for those architectures?

I still don't understand why Phenom/BD had such horrid L3 latency, but I guess no one really knows.
 

Nothingness

Platinum Member
Jul 3, 2013
2,371
713
136
Latency will be slightly increased by the number of ways (more wires, and more comparators to select from to see if there's a hit), but by how much, I don't know enough to say :)

As for why they chose that number of ways for L3, I'm afraid I don't know.
 

BigDaveX

Senior member
Jun 12, 2014
440
216
116
Thunder 57 said: "I still don't understand why Phenom/BD had such horrid L3 latency, but I guess no one really knows."

Historically, AMD's cache architectures have generally been slower than Intel's. Athlon and Athlon XP both had strong L1 caches, but their L2 caches were pretty weaksauce. Athlon 64's L2, while faster, was still slower than those of its Intel rivals, but that was cancelled out by the stupidly efficient on-die memory controller.

Phenom suffered because its last-level cache was an L3 that was not particularly big and was clocked a good deal slower than the cores; Core 2's LLC was its L2, which was 2x-3x the size (not counting the split L2 cache on the Core 2 Quad) and clocked at the same speed as the core. Phenom II was actually a lot more comparable with the first-gen i7s in terms of cache structure, but the i7 had too many other advantages.

As for Bulldozer... well, what didn't AMD screw up there?
 

rvborgh

Member
Apr 16, 2014
195
94
101
I heard that the L3 in Phenom was more optimized for servers. If you check out the A8-3870K, you can see that getting rid of the L3 didn't affect much. I also read a while ago that the L3 determined the upper bound of write performance for Phenom. Someone correct me if I am wrong.