If each chiplet has a local 32MB L3 cache, is an L4 cache (on the I/O die) smaller than 256MB (128MB or 64MB) even beneficial or needed? How much does it improve memory latency? Does it reduce cache misses by a significant amount?
The primary reason you want a cache there is that, to maintain cache coherency, you absolutely want the IO die to know exactly where each line loaded from the attached memory channels currently resides. If the IO die doesn't know that, on every memory access the system has to query every cache in the system. This is how cache coherency worked in the era of dual cores, but it starts being painful when you move up to quads, and at >8 cores you can just forget about it.
So you attach a directory structure that contains occupancy information for every line in the system. The thing is, that structure is basically identical to cache tags: querying it costs the same latency as querying a cache for occupancy. So once you have it, you might as well expand it a bit and add actual storage, turning it into a real L4 cache.
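To make the "directory entries are basically cache tags" point concrete, here is a minimal sketch. All names (`Directory`, `lookup`, the sharer-set layout) are illustrative, not AMD's actual structures; the point is just that one tag lookup answers both the coherency question and the L4-hit question.

```python
LINE_BITS = 6  # 64-byte cache lines

class DirectoryEntry:
    def __init__(self):
        self.sharers = set()   # which chiplet caches currently hold this line
        self.data = None       # None = pure directory; bytes = it's also an L4

class Directory:
    def __init__(self):
        self.entries = {}      # tag (line address) -> DirectoryEntry

    def lookup(self, addr):
        """One tag lookup answers both questions at once:
        'who has this line?' (coherency) and 'is the data here?' (L4 hit)."""
        return self.entries.get(addr >> LINE_BITS)

    def record_fill(self, addr, chiplet, data=None):
        e = self.entries.setdefault(addr >> LINE_BITS, DirectoryEntry())
        e.sharers.add(chiplet)
        if data is not None:
            e.data = data      # the "expand it a bit" step: add real storage
        return e

d = Directory()
d.record_fill(0x1000, chiplet=3, data=b"\x00" * 64)
hit = d.lookup(0x1000)
print(hit.sharers)  # which chiplet to snoop -- no broadcast needed
```

Without the `data` field this is a plain snoop filter; adding it is exactly the "might as well add storage" step described above.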
The big question in my mind is not whether there is a cache. It's not even whether all the lines in all the caches of the chiplets must also allocate a tag in the L4 -- I would be extremely surprised if they didn't. The question is whether all the data in the lower levels must be fully replicated in the L4, i.e. whether it is fully inclusive. That would be possible with eDRAM.
Note that I fully expect the L4 to be memory-side, and split per controller. That is, each memory controller (in past AMD designs each controller has two channels, so 4 controllers per EPYC IO die) has its own separate L4 cache, and only lines from that controller's memory can reside there. This substantially reduces the latency hit of looking up the L4 tags.
If it's not actually split per controller, then it's at least split into two halves, one for the top/north memory controllers and one for the bottom/south controllers.
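The per-controller split works because a physical address already maps to exactly one controller via interleaving, so only that controller's private L4 slice ever needs its tags checked. A small sketch, with assumed interleave granularity and bit positions (not AMD's real address map):

```python
NUM_CONTROLLERS = 4    # e.g. 4 dual-channel controllers on an IO die
INTERLEAVE_BITS = 8    # assume 256-byte interleave granularity
LINE_BITS = 6          # 64-byte cache lines

def controller_for(addr):
    """A physical address belongs to exactly one memory controller."""
    return (addr >> INTERLEAVE_BITS) % NUM_CONTROLLERS

# Each slice only ever holds lines from its own controller's channels,
# so a lookup touches one small tag array instead of one big shared one.
l4_slices = [dict() for _ in range(NUM_CONTROLLERS)]

def l4_lookup(addr):
    return l4_slices[controller_for(addr)].get(addr >> LINE_BITS)

# Consecutive 256-byte blocks land on different controllers:
print(controller_for(0x0000), controller_for(0x0100))
```

Because the slices partition the address space, they never need to stay coherent with each other, which is what keeps the tag lookup cheap.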
If each chiplet contains just two 4-core CCXs, why didn't AMD use four 146 mm² chiplets with four 4-core CCXs each, instead of what they have done (eight 73 mm² chiplets)? Do chiplets have to be 8-core for the current Rome design to make any sense? AMD surely didn't choose 8-core chiplets just because of yields, or binning for higher clocks and better power efficiency -- they must have had some architectural reasons behind it, right? 7 nm Vega is a relatively big chip, so a 146 mm² chiplet would have been fine.
Yes. All else being equal, if your product fits into a smaller die, smaller dies are better all the way down to ~20 mm² or so. There are always some defects that make a chip unharvestable even at lower core counts (like shorts between power and ground), so the smaller your die, the less silicon you lose per fatal defect. Smaller dies also waste less area at the edges of the wafer. I think the idea of putting two CCXs on a single chiplet is not something anyone ever seriously proposed -- it's just cost without benefit once all communication needs to go through an external chip anyway.
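The yield argument can be sketched with the classic Poisson die-yield model, Y = exp(-D·A): for a fixed total silicon budget, the fraction of usable silicon tracks the per-die yield, so halving the die area raises it. The defect density below is an illustrative assumption, not a real 7 nm figure.

```python
import math

D = 0.2  # fatal defects per cm^2 (assumed for illustration)

def die_yield(area_mm2):
    """Poisson model: probability a die of this area has zero fatal defects."""
    return math.exp(-D * area_mm2 / 100.0)  # mm^2 -> cm^2

y_small = die_yield(73)   # Rome-sized chiplet
y_large = die_yield(146)  # hypothetical double-size chiplet

# A fatal defect kills 73 mm^2 of otherwise-good logic instead of 146 mm^2:
print(f" 73 mm^2 chiplet yield: {y_small:.1%}")  # 86.4%
print(f"146 mm^2 chiplet yield: {y_large:.1%}")  # 74.7%
```

Real yield models (Murphy, negative binomial) cluster defects and soften this gap, and harvesting salvages some defective large dies, but the direction of the effect is the same.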
Once they decided to go with chiplets, the design tradeoff they optimized was the size of the CCX vs. the size of the die. A larger CCX is better because single-threaded performance still matters, and the bigger the CCX, the more L3 cache a single thread can make use of. Denser future fabrication processes will almost certainly lead to even smaller chiplets.