I expect the number of cores per CCX will grow. If the 96-core rumour is correct, then I wouldn't be surprised if Zen 4 CCDs have 12 cores and 48 MB of L3.
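That guess is just arithmetic; here's a quick sketch assuming Zen 4 Epyc keeps Rome/Milan's eight CCDs per package and Zen 3's ratio of 4 MB of L3 per core (both are my assumptions, not confirmed specs):

```python
# Back-of-the-envelope for the rumoured 96-core part.
# Assumptions (mine): 8 CCDs per package, 4 MB L3 per core as in Zen 3.
RUMOURED_CORES = 96
CCDS_PER_PACKAGE = 8       # same layout as Rome/Milan Epyc
L3_MB_PER_CORE = 4         # Zen 3: 8 cores sharing 32 MB

cores_per_ccd = RUMOURED_CORES // CCDS_PER_PACKAGE
l3_per_ccd_mb = cores_per_ccd * L3_MB_PER_CORE
print(f"{cores_per_ccd} cores, {l3_per_ccd_mb} MB L3 per CCD")
# -> 12 cores, 48 MB L3 per CCD
```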
The main thing is, I'm betting that when AMD refactored the CCX, they built in some flexibility. Why refactor in a way that only allows eight cores if you're likely to need to refactor again for more? Just do it right the first time. The Zen 3 die shot going around kind of supports this: cores are no longer "mirrored" relative to their closest neighbours, and the interfaces between L3 blocks running east-to-west are no longer there. The area between the east and west L3 regions looks like a giant crossbar, and perhaps this can scale arbitrarily, albeit at the expense of latency.
I really doubt that they will go to a larger CCX in the next generation; I think it will stay at 8 cores for a while. Zen 3 is the new architecture; Zen 4 should mostly be improvements on Zen 3. Past history indicates that scaling becomes an issue for a monolithic cache beyond 8 cores. The L3 cache may stay at 32 MB for a while too, though they may add an L4. They could use stacking to provide a larger L3, but keep in mind that a larger cache pretty much always means a slower one, because access latency is ultimately bounded by wire delay across a physically bigger array. They may move to a 16-core die with 2 CCX per die, though; that would not be that large at 5 nm.
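As a rough illustration of that size/latency trade-off, here is a toy model under the assumption that the wire-delay component grows with the linear dimension of the array, i.e. with the square root of capacity. The baseline latency figure is an assumption for illustration, not a measured Zen 3 number:

```python
import math

# Toy model: the wire-delay part of cache latency scales with the
# physical distance across the array, i.e. roughly sqrt(capacity),
# if area scales linearly with capacity. Numbers are illustrative.
BASE_MB = 32        # Zen 3 CCX L3 capacity
BASE_CYCLES = 46    # assumed baseline L3 load-to-use latency

def est_latency(capacity_mb: float) -> float:
    """Estimated L3 latency in cycles for a given capacity."""
    return BASE_CYCLES * math.sqrt(capacity_mb / BASE_MB)

for mb in (32, 48, 64, 128):
    print(f"{mb:4d} MB -> ~{est_latency(mb):.0f} cycles")
# 32 MB stays at the baseline; 128 MB lands at roughly 2x.
```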
Also, I expect "Infinity Cache" is going to show up in more products. I kind of hope that Milan gets a 128 MB Infinity Cache in the IO die, but we may not get that until Genoa. Zen 3 is the updated core, and Ryzen based on Zen 3 uses the exact same IO die as Zen 2, so we probably have to wait for Zen 4 for the big IO updates (DDR5, PCIe 5.0, maybe Infinity Cache).

If they use stacked dies in Genoa, then the core count could be massive. Some TSMC tech may allow stacking multiple CPU dies, so they could have a 32-core part with one layer, 64 cores with two, 96 with three, 128 with four, etc. TSMC has apparently demonstrated stacks twelve dies high without micro-solder bumps. That gives much better thermal performance than bump-based stacking: there is no gap between dies, and the whole stack stays very thin, which helps heat transfer. Power consumption would be an issue, though, so clock speed would need to drop with each added layer. We already see that with Rome, where lower-core-count parts clock higher thanks to greater thermal headroom, and in odd products like the 16-core Epyc 7F52 (one active core per CCX, so lots of thermal headroom).
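To make the clocks-vs-layers trade-off concrete, here is a minimal sketch assuming a fixed socket power budget and the rough rule that dynamic power scales with cores × f × V², with voltage roughly tracking frequency (so power ~ cores × f³). The core count per layer and baseline clock are placeholders I picked, not leaked specs:

```python
# Toy model of a stacked-CCD part: each extra layer adds cores under
# the same socket power budget, so clocks have to come down.
# Assumptions (mine, not AMD's): 32 cores per layer, power ~ cores * f^3.
BASE_CORES = 32        # cores in a single layer
BASE_CLOCK_GHZ = 3.0   # assumed all-core clock for the 1-layer part
# Power of the 1-layer part defines the budget the stack must fit in.
BUDGET = BASE_CORES * BASE_CLOCK_GHZ ** 3

for layers in range(1, 5):
    cores = BASE_CORES * layers
    clock = (BUDGET / cores) ** (1 / 3)  # solve cores * f^3 = BUDGET
    print(f"{layers} layer(s): {cores:3d} cores @ ~{clock:.2f} GHz")
# 1 layer:   32 cores @ ~3.00 GHz
# 2 layers:  64 cores @ ~2.38 GHz
# 3 layers:  96 cores @ ~2.08 GHz
# 4 layers: 128 cores @ ~1.89 GHz (mirrors lower-core Rome parts clocking higher)
```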
Going to PCI Express 5.0-like speeds for the IFOP (the on-package SerDes) is probably too much power, so Zen 4 Epyc will probably be stacked with the IO in some manner, even if that means a full interposer. Some TSMC tech embeds a small piece of silicon in the package under other chips (similar to Intel's EMIB); they call that one local silicon interconnect (LSI). That should be much cheaper than putting a giant interposer under everything. For something like HBM, it would sit only partially under the HBM and the GPU or other die. That may be how they do the desktop parts cheaply if they only make one CPU die that is designed for stacking: they would just have a tiny embedded chip with the CPU die and IO die overlapping it. I guess they may also be able to make a die with both types of connections; Zen 1 had pads for four IFOP links that went completely unused in the single-die parts.

Chip stacking will allow HBM-type connections between chiplets: 1024-bit-wide, low-clocked interfaces rather than 32-bit SerDes at ridiculously high clock speeds. They may widen the internal interfaces in Zen 4 to take advantage of the wider pathways. I am also wondering whether they will find a way to leverage their GPU technology here; even a single small GPU chiplet could be very powerful, they just need a good way to access it at low latency.
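For a sense of why wide-and-slow is attractive, here's a raw-bandwidth comparison under assumed clocks (the specific rates are placeholders I picked for illustration, not actual IFOP or HBM figures):

```python
# Raw bandwidth of a narrow, fast SerDes link vs a wide, slow
# HBM-style parallel interface. Rates below are illustrative
# assumptions, not real AMD or JEDEC numbers.
def bandwidth_gbs(width_bits: int, rate_gtps: float) -> float:
    """Raw link bandwidth in GB/s (ignoring encoding/protocol overhead)."""
    return width_bits * rate_gtps / 8

serdes = bandwidth_gbs(32, 25.0)    # 32-bit SerDes at an assumed 25 GT/s
wide = bandwidth_gbs(1024, 2.0)     # 1024-bit interface at an assumed 2 GT/s

print(f"32-bit SerDes @ 25 GT/s: {serdes:6.0f} GB/s")  # 100 GB/s
print(f"1024-bit wide @ 2 GT/s:  {wide:6.0f} GB/s")    # 256 GB/s
# The wide interface delivers more bandwidth at a fraction of the
# per-pin clock, which is where the power savings come from.
```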
See this article (already posted several times) for an overview of TSMC chip stacking tech:
www.anandtech.com