I am not sure I understood properly, but I'll try to share my point of view on this. I am happy to be corrected.
When a thread is executing and the data/instruction is not found in L2, the L3 is probed; on a hit, a line in L2 is evicted and replaced with the one from L3. On a miss, the request goes to main memory.
If multiple CCXs are present, on an L3 miss the other L3s are probed (if there is an entry in the coherency directory; otherwise the request goes to main memory).
My understanding is that there are proactive coherency probes: directory entries are maintained across the L3s and are used to invalidate data on other CCXs when a line has been modified by one CCX. Additionally, the L3 directory contains entries for the private L2 caches.
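To make sure I'm describing the same flow, here is a toy Python model of that lookup order. It is only my mental picture, not AMD's actual protocol; the classes, the region-based directory and the region size are all invented for illustration.

```python
# A toy model of the lookup order described above. This is just my mental
# picture, not AMD's actual protocol; every class, name and the region size
# are invented for illustration.

REGION_SIZE = 4096    # assumed granularity of the coherency directory
LINE_SIZE = 64

class CCX:
    def __init__(self, name):
        self.name = name
        self.l2 = set()   # line addresses held in the private L2s
        self.l3 = set()   # line addresses held in the shared L3

class Directory:
    """Per region, remembers which CCXs might hold lines from it."""
    def __init__(self):
        self.owners = {}  # region index -> set of CCX names

    def note(self, region, ccx_name):
        self.owners.setdefault(region, set()).add(ccx_name)

    def possible_owners(self, region, requester):
        return self.owners.get(region, set()) - {requester}

def service_request(addr, local, others, directory):
    """Where does a request that already missed in L1 get serviced?"""
    line = addr // LINE_SIZE * LINE_SIZE
    region = addr // REGION_SIZE
    if line in local.l2:
        return "L2 hit"
    if line in local.l3:
        return "local L3 hit"      # fill L2; the L2 victim drops into L3
    # L3 miss: probe other CCXs across the fabric only if the directory
    # says one of them might hold the line; otherwise go straight to DRAM.
    for name in directory.possible_owners(region, local.name):
        remote = next(c for c in others if c.name == name)
        if line in remote.l3 or line in remote.l2:
            return f"remote hit in {name} (cross-fabric probe)"
    return "main memory"

ccx0, ccx1 = CCX("CCX0"), CCX("CCX1")
d = Directory()
ccx1.l3.add(0x1000)
d.note(0x1000 // REGION_SIZE, "CCX1")
print(service_request(0x1000, ccx0, [ccx1], d))  # remote hit in CCX1 ...
print(service_request(0x9000, ccx0, [ccx1], d))  # main memory
```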
Probing other L3s goes across the fabric, hence some additional latency. But the cost does not depend much on how many L3s there are (at least to some extent), so it scales reasonably across a lot of CCXs.
The directory entries across the L3s refer only to regions, not specific cache lines, so they can cover large memory sizes. Probing the L2s works at cache-line granularity, but it is much faster as well.
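A quick back-of-the-envelope on why region-granularity entries keep the directory small; the 2 KB region size here is an assumed value picked for illustration, not a known Zen parameter, and only the ratio matters:

```python
# Directory size: tracking regions instead of lines.
tracked_footprint = 256 * 2**20   # suppose the directory must cover 256 MB
line_size   = 64                  # bytes per cache line
region_size = 2 * 2**10           # assumed region granularity

print(tracked_footprint // line_size)    # 4194304 entries at line granularity
print(tracked_footprint // region_size)  # 131072 entries at region granularity
```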
I don't know whether an inclusive cache or some other arrangement would be better suited in this case.
All of this is already the case on Zen 2. So for Zen 3:
- For a single core on the 8-core CCX, the bigger L3 means a lot of space to work with before eviction to main memory finally happens. Additionally, thanks to prefetching into L3, the chance of a hit when data is not found in L2 is higher with a bigger L3 (possibly avoiding the probes across to other L3s, which incur a penalty).
- For a very high core count part, the traffic on the Infinity Fabric for highly threaded loads is probably going to be much lower than with a 4-core CCX, simply because there are fewer CCXs to talk to (see the rough arithmetic after this list).
- On the other hand, the 8-core CCX would be more complex due to the extra routing between the cores/L2 and the L3, but that is counterbalanced by a single fabric connection instead of the two previously needed for two 4-core CCXs. Fewer SerDes blocks.
- Requests to the IMC are also streamlined, because for an 8-core part, for example, the IMC (coherency slave) is handling only one master. This matters because DRAM has long recovery times, and contention from multiple masters could degrade memory performance.
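To put a rough number on the fabric-traffic point from the list above: for the same total core count, 8-core CCXs halve the number of L3 domains, so each L3 miss has fewer remote L3s that could possibly be probed. A trivial sketch:

```python
# Same core count, half as many L3 domains with 8-core CCXs,
# so fewer remote L3s that a miss could ever need to probe.
for total_cores in (16, 32, 64):
    for ccx_size in (4, 8):
        ccxs = total_cores // ccx_size
        print(f"{total_cores} cores, {ccx_size}-core CCX: "
              f"{ccxs} CCXs, {ccxs - 1} remote L3 domains to probe")
```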
I don't see why the number of SerDes blocks would go down. It is probably connecting to the same IO die and needs the same bandwidth; it still has to feed 8 cores. We probably don't get a new IO die until Zen 4 with DDR5 and PCI Express 5.0. Connecting to the same IO die should mean exactly the same number of SerDes.
I don't know how the current links are arranged. Previously, for Zen 1, they were 32-bit links (IFOP) between chips, running at roughly half the clock of the 16-bit external links (IFIS). I was assuming that each CCD still had a single 32-bit link to the IO die. There shouldn't be any SerDes between the two on-die CCXs and the fabric link on the die; that seems to be what you are implying here, and it would be a waste. On-chip links would be 256 bits wide and would only be packetized (256-bit -> 8 x 32-bit) for going over the external SerDes link. It wouldn't make any sense to packetize it on chip, send it somewhere still on chip, depacketize it, switch it at 256-bit width or wider, and then repacketize it for sending to the IO die.
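For what it's worth, here is the serialization arithmetic behind that argument, using the widths assumed in this post (256-bit on-die fabric, 32-bit link to the IO die); these are not confirmed Zen 2/3 figures:

```python
# Quick serialization arithmetic for the point above.
cache_line_bits = 64 * 8             # one 64-byte cache line

on_die_width = 256                   # bits per cycle on the on-die fabric
link_width   = 32                    # bits per cycle over the external link

print(on_die_width // link_width)        # 8  -> one on-die flit = 8 link beats
print(cache_line_bits // on_die_width)   # 2  -> beats per line on die
print(cache_line_bits // link_width)     # 16 -> beats per line over the link
# Packetizing on chip, depacketizing to switch at 256-bit width, then
# repacketizing for the IO die would add two width conversions for nothing.
```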
Increasing FP performance probably means they need to improve cache bandwidth. L2 and L3 deliver 32 bytes per cycle; L1 can do 2 x 32-byte loads and 1 x 32-byte store per cycle. L2 and L3 bandwidth is therefore only enough for one operand of a 256-bit AVX2 unit, while Zen 2 has 2 FMA units that can each consume 3 x 32-byte operands per clock. This is why GPUs, with their massive bandwidth and data paths, are much better at some algorithms: they actually have the bandwidth to feed their compute resources more continuously.
So how are they going to get the extra bandwidth? Doubling the paths to 64 bytes seems unlikely. Perhaps the L1 will be able to load more than 2 x 32 bytes per clock? That could improve performance significantly without even increasing the number of AVX units. I was thinking they would go up to 3 AVX256 units, but it may be all cache improvements. It would be great if they managed to reduce latency in L1/L2 to make up for the slightly higher L3 latency.
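Putting the numbers quoted above side by side (per core, per clock) shows how far the demand of two FMA pipes outruns the cache ports; a small sketch, assuming peak FMA issue every cycle:

```python
# Worked numbers behind the bandwidth concern, per core per clock, in bytes.
fma_units        = 2
operands_per_fma = 3                  # a, b, c for a*b+c
operand_bytes    = 32                 # one 256-bit AVX2 operand

fma_demand = fma_units * operands_per_fma * operand_bytes   # 192 B/clk peak
l1_supply  = 2 * 32                   # two 32-byte loads per clock (plus 1 store)
l2_supply  = 32                       # 32 bytes per clock
l3_supply  = 32                       # 32 bytes per clock

print(fma_demand, l1_supply, l2_supply, l3_supply)   # 192 64 32 32
# Even L1 can only stream a third of what two FMA pipes could consume,
# and L2/L3 supply a single 256-bit operand per clock.
```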