Question: Zen local/far (CCX) L3 cache miss lookup policy

AshlayW

Member
Sep 6, 2018
32
97
91
www.eridonia-archives.com
Hello.

I just have a quick question for anyone knowledgeable on this subject. I would ideally like to ask Dr. Ian Cutress himself too, but I am sure he is very busy.

I have made a picture that summarises my question quite effectively, but I will also type it out here.

The Zen architecture, from gen 1 through gen 3, is built around L3 cache domains. These domains, being victim caches, are essentially local to their attached cores: each one holds data evicted from those cores' L2 caches. My question starts with:

When a core misses its local L3 cache, in the CCX it sits in, what policy does the processor follow for the next stage of looking up the requested data?

The L3 cache domains must be coherent to a degree to allow CCX-to-CCX communication and data transfer, presumably via the L3 cache slices local to each core. This data must flow over the Infinity Fabric. But the question is:

When the local L3 cache is missed, does the core look up the data in the other CCXs' L3 caches (far L3), and if so, how is that handled internally:

Cache Coherency on Zen2 and 3.jpg

I have 3 concepts here for how the circuit might handle it. Each has a drawback, but I am inclined to believe Figure B is how the circuit handles this lookup. It would hurt processor performance significantly to have cores potentially fetching the same cache line from DRAM via the off-die PHY, increasing power consumption and creating unnecessary memory traffic. Figure C is thus very unlikely, but I included it as an example.

Figure B would create a lot of fabric traffic but hide latency to memory by performing the lookups in parallel and discarding the non-hits, after which DRAM would be accessed.

Figure A would avoid the parallel fabric traffic but add a lot of latency to the cache hierarchy, hurting memory performance. (This could explain Ryzen's higher DRAM latency, but do we measure actual differences in latency between processor SKUs with more or fewer CCXs? For example, the 3300X has just one domain, whereas the 3900X has 4; thus, a 3900X using Figure A would incur a penalty of at least 4 x 12 clock cycles before reaching memory, so I think this is unlikely.)
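To make the trade-off between the three figures concrete, here is a toy latency model. It is only a sketch under stated assumptions: Figure A is read as probing the far L3s one after another, Figure B as probing them all at once, and Figure C as skipping them and going straight to DRAM. The 12-cycle probe cost is borrowed from the estimate above, the 200-cycle DRAM cost is an invented placeholder, and the function name `miss_latency` is made up for illustration. None of these are measured Zen figures.

```python
def miss_latency(policy, far_ccx_count, probe_cycles, dram_cycles, hit_at=None):
    """Cycles spent after a local L3 miss, under a deliberately simple model.

    hit_at: 0-based index of the far CCX that holds the line, or None if no
    far L3 has it. All cycle counts are placeholders, not measured values.
    """
    if policy == "A":  # probe the far L3s one after another, then DRAM
        if hit_at is not None:
            return (hit_at + 1) * probe_cycles
        return far_ccx_count * probe_cycles + dram_cycles
    if policy == "B":  # probe all far L3s in parallel, then DRAM
        if hit_at is not None:
            return probe_cycles
        return probe_cycles + dram_cycles
    if policy == "C":  # skip the far L3s and go straight to DRAM
        return dram_cycles
    raise ValueError(f"unknown policy: {policy!r}")


# Hypothetical 4-CCX part (3 far CCXs), with the line in no far L3 at all.
for policy in "ABC":
    print(policy, miss_latency(policy, 3, 12, 200))
```

In this model the full-miss cost of Figure A grows with the number of CCXs, while B and C stay flat, which is the effect one would look for when comparing SKUs with different CCX counts.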

Additional question:

On how the cores fetch data from memory after missing L3: perhaps there is a collection of SRAM tags (a LUT) within the fabric circuit that keeps track of which core pulled which data into L2 (data that was then evicted into that CCX's L3), so that the lookup wouldn't actually require any cache-walking, which would add tons of latency. This would be broadly similar to Figure B, but with the core probing the LUT tags for a hit on that specific line and requesting it only after the hit is confirmed (even if the far L3 domains are slow to access, pulling the line from there still uses less power and DRAM bandwidth than going to memory).
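The LUT idea can be sketched as a small lookup table keyed by cache-line address. This is purely a hypothetical illustration of the concept, not a description of how the fabric is actually built: the class `TagDirectory` and its method names are invented, and the only hardware detail assumed is a 64-byte cache line.

```python
class TagDirectory:
    """Hypothetical fabric-side table: cache-line address -> owning CCX."""

    LINE_BYTES = 64  # assumed cache-line size

    def __init__(self):
        self._owner = {}

    def _line(self, address):
        # Every byte address inside one cache line maps to the same entry.
        return address // self.LINE_BYTES

    def record_fill(self, address, ccx_id):
        # Called when a CCX pulls a line into its L2/L3.
        self._owner[self._line(address)] = ccx_id

    def record_drop(self, address):
        # Called when the line leaves that CCX's caches entirely.
        self._owner.pop(self._line(address), None)

    def lookup(self, address):
        # Returns the owning CCX id, or None if no CCX is known to hold it.
        return self._owner.get(self._line(address))


directory = TagDirectory()
directory.record_fill(0x1000, ccx_id=2)
print(directory.lookup(0x1010))  # same 64-byte line as 0x1000
print(directory.lookup(0x2000))  # never filled
```

With a table like this, a single lookup decides whether to request the line from one specific far CCX or to go to DRAM, instead of probing every far L3.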

If anyone knows any further details on how cores handle accessing data from other CCXs, I would be very much appreciative!

Thanks,
Ash
 
Reactions: Tlh97 and Schmide

naukkis

Senior member
Jun 5, 2002
706
578
136
After a local L3 miss, the request is made to the memory controller, so it's Figure C. There are no paths between CCXs. The memory controller either has a dictionary for exclusive-state lines or polls the other CCXs together with the memory access; either way, that coherency check can be made simultaneously with the memory access.
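As a rough illustration of why the overlap matters, a minimal sketch: if the coherency check and the DRAM access start together, the miss costs the slower of the two, not their sum. The function name and cycle counts are hypothetical, and the assumption that another CCX holding the line supplies it as soon as the check completes is a simplification, not a confirmed detail.

```python
def controller_miss_latency(probe_cycles, dram_cycles, remote_has_line):
    """Toy cost of a local L3 miss handled at the memory controller.

    The coherency check and the DRAM access are assumed to start together.
    """
    if remote_has_line:
        # Assumption: the owning CCX supplies the line once the check is done.
        return probe_cycles
    # Nobody else holds the line: wait for whichever finishes last.
    return max(probe_cycles, dram_cycles)


print(controller_miss_latency(12, 200, remote_has_line=False))
print(controller_miss_latency(12, 200, remote_has_line=True))
```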
 
Reactions: Tlh97 and moinmoin