Games make too frequent trips to memory for inclusive/exclusive caching to matter, and the ring bus is slower than the crossbars within a CCX, so it boils down to main-memory latency as to why Zen 2 is slower in games.
I think the overall picture is more complicated than simple "memory latency". Even if game working sets are way too large to fit into L2 or L3 and hit memory a lot, there are still substantial ways a fast inclusive cache hierarchy can help:
1) All those worker threads need to synchronize, and locks take a lot of snooping and core-to-core communication to acquire cache-line ownership and so on, even when done correctly. When it is done wrong even a tiny bit, things like false cache-line sharing can hurt big time. A fast inclusive L3 is a boon in these situations: since the L3 contains cache lines from all cores, even worst-case ping-pong is fast.
Obviously on AMD these worst cases still need to go inter-CCX and inter-CCD to check whether some other core's L2 tags contain the line in question. That 4+0 core configuration from AMD is going to be interesting for sure; keeping traffic intra-CCX only is good.
2) There is an obvious producer->consumer relationship in each game frame. Game threads prepare chunks of data that eventually get consumed (transformed, etc.) by the GPU driver and sent to the GPU to render. An inclusive L3 helps a lot in these scenarios, as the DirectX/GPU threads will find at least part of the data in L3. Hit rates can vary from 0, in the case of a very large working set that simply evicts older data, to 100%, where some I/O thread loads geometry/textures into memory and they are then immediately picked up from L3 by the GPU (of course it is more complicated due to the need to DMA, etc.).
A victim (eviction) L3 like Zen's or Skylake-X's helps as well, but in Zen's case it is CCX-limited, so there is a penalty if that "GPU" thread is running on a different CCX/CCD, and suddenly you are at the mercy of scheduling (read: minimum FPS will suffer).
3) Games can benefit big time from prefetches into L3; avoiding memory misses is still very important even if memory is relatively closer on Intel. Skylake-X supposedly has LLC prefetching, but I have no idea how much it helps.
4) The Skylake-X mesh has additional L3 gotchas that are rarely discussed. While everyone here talks about the "ring" being limited by the number of stops, with bandwidth obviously capped by the ring as well, the mesh has its own nasty limitations. In the current implementation, L3 write bandwidth seems to be incredibly anemic: barely ~100 GB/s on the 18C part, which is roughly the speed of quad-channel DDR4-3200. Reads are faster, but it is hard for gaming loads to make use of an L3 whose write bandwidth is that low; might as well go to memory.
So in the end: once a Skylake-X thread falls out of L2, it needs to go to memory; once it needs locking, it pays the "snoop every other L2" plus mesh-latency price each time; and once another thread needs something that was just calculated and evicted, it is highly likely it has to come from memory.