Isn't that a feature? The problem is L3 size and mesh performance overall, plus the fact that the compute die is a full SoC with I/O etc. I think if Intel added a respectably sized L3, things would get better immediately. Taking a 50 ns hit just to check a 100 MB L3 cache is beyond stupid.
IF they produced compute chiplets with cores + L3, put memory and I/O on a separate die AND still shared the L3, I'd be quite happy for plenty of the workloads we have.
I see, so a bigger L3 would make it more worthwhile to take the hit (to check), because a lookup would more likely be a cache hit rather than a miss.
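That tradeoff is easy to put numbers on with a back-of-envelope average-access-time model. The latencies below are illustrative assumptions (the thread's 50 ns probe cost, plus a guessed 100 ns DRAM penalty), not measured GNR or EPYC figures:

```python
# Back-of-envelope AMAT (average memory access time) model.
# Latencies are illustrative assumptions, not measured numbers.
L3_CHECK_NS = 50   # cost of probing the L3, paid on hit AND miss
DRAM_NS = 100      # extra cost of going to memory on a miss

def avg_latency_ns(hit_rate: float) -> float:
    """Expected access time when every request probes the L3 first."""
    return L3_CHECK_NS + (1 - hit_rate) * DRAM_NS

# Skipping the probe entirely would cost DRAM_NS per access, so probing
# only pays off once hit_rate > L3_CHECK_NS / DRAM_NS (here, 50%).
for hr in (0.3, 0.5, 0.8):
    print(f"hit rate {hr:.0%}: {avg_latency_ns(hr):.0f} ns")
```

Under these assumed numbers the break-even hit rate is 50%: a small L3 that hits 30% of the time makes things worse than going straight to DRAM, while a big L3 hitting 80% of the time cuts average latency from 100 ns to about 70 ns. That's the whole argument for growing the L3 rather than shrinking the probe cost.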
If the MLID leak is close to correct, and GNR has 3 large dies with ~44 cores each, I don't think there is going to be room for much more L3. Maybe some more, but nothing like a gigabyte.
Remember, on AMD, once you go beyond an 8-core-sized workload, you communicate through memory.
We had a peek at future AMD designs, and it seems that's what AMD is going with going forward, turbocharged with MALL (Memory Attached Last Level Cache). So the request is seemingly sent to memory, but there's a cache sitting there to serve some requests without touching DRAM.
So it seems (for now) AMD has given up on any direct sharing of L3 across chiplets. I wonder if Venice changes that in any way, or if AMD just strengthens the MALL.
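The MALL idea above can be sketched as a toy memory-side cache: the requester always sends the request toward the memory controller, and a cache sitting there may intercept it. This is purely illustrative (names, FIFO eviction, and the `dram_read` stand-in are my assumptions, not AMD's design), but it shows why coherence stays simple: there's one copy on the memory side, no cross-chiplet probing:

```python
# Toy sketch of a memory-side cache (MALL-style). Illustrative only:
# class name, FIFO eviction, and dram_read are assumptions, not AMD's design.
class MemorySideCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines: dict[int, bytes] = {}  # addr -> cached line

    def read(self, addr: int) -> tuple[bytes, str]:
        if addr in self.lines:
            # Served at the memory controller without touching DRAM.
            return self.lines[addr], "MALL hit"
        data = dram_read(addr)             # fall through to memory
        if len(self.lines) >= self.capacity:
            # Evict the oldest entry (dicts keep insertion order).
            self.lines.pop(next(iter(self.lines)))
        self.lines[addr] = data
        return data, "DRAM fill"

def dram_read(addr: int) -> bytes:
    """Stand-in for an actual DRAM access."""
    return addr.to_bytes(8, "little")
```

The design choice it illustrates: because the cache is attached to memory rather than to any core cluster, every requester (CPU chiplet, GPU chiplet, whatever) takes the same path and sees the same data, which sidesteps the cross-chiplet L3 coherence problem entirely, at the cost of always paying the trip to the memory side.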
It may be possible to run some shared-cache coherency algorithm across homogeneous chiplets, but it's likely impossible in a heterogeneous package mixing CPU, GPU, and maybe AI or other dedicated chiplets.