Discussion: [Video] Ryzen 7 3800X 5 GHz vs. Core i9 9900K 5 GHz


tamz_msc

Diamond Member
Jan 5, 2017
3,770
3,590
136

Skylake wins in every gaming test except the WoW 1% and 0.1% lows. This proves that the bottleneck in Zen 2 isn't frequency but the memory subsystem. Hopefully Zen 3 addresses this; otherwise Intel will have nothing to worry about.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,673
3,794
136
Aren't the mesh and new cache structure semi-dependent on each other to accomplish the goal of more uniform access time across a many-core monolithic die? (not a rhetorical question, [later edit] it was my understanding that the new cache structure was dictated by the new mesh arrangement, except maybe for the much larger L2)

A quick recap: in terms of intercore latency SKL-X looked better than Zen1 at launch, but not better than ring bus Broadwell-E, Haswell-E and especially consumer Kaby Lake.

View attachment 20685

We're also lacking some more recent data. Back when SKL-X launched there was some talk around here that UEFI updates helped improve the situation up to a point where it became significantly better in (gaming) benchmarks, possibly shifting the choke point towards L3 cache. The HEDT platform seems kinda forgotten by both Intel and reviewers, so I guess we'll have to wait a lot more before getting decent answers. (I have seen no Cascade Lake X latency measurements, for example.)

The best info we can still find is probably memory latency:

With Broadwell-E with similar memory to give more insight:


I think with the increasing core count the switch to the mesh was necessary. As to whether that dictated the change in cache, I'm not certain. I think Intel wanted a larger L2 but since they don't have the density (or for other reasons) their L3 is nowhere near AMD's. That basically made the switch to a non-inclusive L3 essential.

I think Intel's lead in gaming is largely due to the inclusive L3 and the ring bus. The ring bus uses less power than the mesh or IF, allowing more power to the cores. The L3 can still be inclusive because of the small L2 caches. That will eventually change as Intel adds more cores (making the ring bus ineffective) and increases the L2 cache. Isn't it rumored the L2 on newer Intel CPUs is 1.25-1.5MB?
 

tamz_msc

Diamond Member
Jan 5, 2017
3,770
3,590
136
I think Intel's lead in gaming is largely due to the inclusive L3 and the ring bus.
Games make too frequent a trip to the memory for inclusive/exclusive cache to matter and ring bus is slower than crossbars within a CCX, so it boils down to main memory latency as to why Zen 2 is slower in games. The upcoming 3300X with a 4+0 configuration would be an interesting comparison against say a Comet Lake i3 or a 7700K - AMD's best quad core against Intel's best quad core, and it remains to be seen whether the best possible configuration of Zen 2 in terms of latency can beat a similarly clocked Skylake in games.
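
The argument above can be put in rough numbers with the textbook average-access-time formula (hit time plus miss rate times miss penalty). This is only a sketch: the function name is made up for illustration, and every latency and miss rate below is a hypothetical placeholder, not a measurement of Zen 2 or Skylake.

```python
def average_access_ns(hit_ns: float, miss_rate: float, dram_ns: float) -> float:
    """Average latency of an access that reaches the last-level cache (L3).

    Textbook model: hit time + miss rate * miss penalty.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_ns + miss_rate * dram_ns


# Hypothetical placeholder numbers, NOT measurements of any real CPU.
L3_HIT_NS = 10.0
for miss_rate in (0.1, 0.5, 0.9):
    fast = average_access_ns(L3_HIT_NS, miss_rate, dram_ns=50.0)
    slow = average_access_ns(L3_HIT_NS, miss_rate, dram_ns=70.0)
    print(f"miss rate {miss_rate:.0%}: {fast:.1f} ns vs {slow:.1f} ns "
          f"(gap {slow - fast:.1f} ns)")
```

In this model the gap between the two chips grows in proportion to the miss rate, which is the sense in which frequent trips to memory would make DRAM latency the deciding factor.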
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,025
136
Anyone here decide on getting a 10900k or 10700k yet?
Games make too frequent a trip to the memory for inclusive/exclusive cache to matter and ring bus is slower than crossbars within a CCX, so it boils down to main memory latency as to why Zen 2 is slower in games. The upcoming 3300X with a 4+0 configuration would be an interesting comparison against say a Comet Lake i3 or a 7700K - AMD's best quad core against Intel's best quad core, and it remains to be seen whether the best possible configuration of Zen 2 in terms of latency can beat a similarly clocked Skylake in games.

I will definitely be curious about the latencies. I have an AIDA64 screenshot I will share tomorrow of my 1950X at 4.3 GHz and a 4+0 config.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Games make too frequent a trip to the memory for inclusive/exclusive cache to matter and ring bus is slower than crossbars within a CCX, so it boils down to main memory latency as to why Zen 2 is slower in games.

I think the overall picture is more complicated than simple "memory" latency. Even if game working sets are way too large to fit into L2 or L3 and hit memory a lot, there are still substantial ways a fast inclusive cache hierarchy can help:

1) All those worker threads need to synchronize, and locks take a lot of snooping and core-to-core communication to take cache line ownership and so on, even when it is done correctly. When it is done wrong even a tiny bit, things like false cache line sharing can hurt big time. A fast inclusive L3 is a boon in these situations, as L3 contains cache lines from all cores, so even worst-case ping-pong is fast.
Obviously on AMD these worst cases still need to go inter-CCX and inter-CCD to see if some other core's L2 tags contain the line in question. That 4+0 part from AMD is gonna be interesting for sure; intra-CCX only is good.

2) There is an obvious producer->consumer relationship in each game frame. Game threads prepare chunks of data that eventually end up consumed (transformed etc.) by GPU drivers and sent to the GPU to render. An inclusive L3 does help a lot in these scenarios, as DirectX / GPU threads will find at least part of the data in L3. The hit rates can vary from 0 in the case of a very large working set that simply evicts older data, to 100% where some I/O thread loads geometry/textures to memory and it is then immediately picked up from L3 by the GPU (of course it is more complicated due to the need to DMA etc.).
A victim (eviction) L3 like Zen's or Skylake-X's helps as well, but on Zen it is limited to one CCX, so there is a penalty if that "GPU" thread is running on a different CCX/CCD, and suddenly you are at the mercy of scheduling (read: minimum FPS will get hurt).

3) Games can benefit big time from prefetches to L3; avoiding memory misses is still very important even if memory is relatively nearer on Intel. Skylake-X supposedly can have LLC prefetching, but no idea how much it helps.

4) The Skylake-X mesh has additional L3 gotchas that are rarely discussed. While everyone here talks about the ring being limited by its number of stops and its bandwidth, the mesh also has nasty limitations. In the current implementation L3 write bandwidth seems to be incredibly anemic, barely ~100 GB/s for 18C, which is roughly the speed of quad-channel DDR4-3200. Reads are faster, but it is harder for gaming loads to make use of L3: if write bandwidth is so low, they might as well go to memory.
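
To put a number on the quad DDR4-3200 comparison in point 4, here is a minimal sketch of the theoretical peak calculation. It assumes a 64-bit (8-byte) data bus per channel and decimal gigabytes; the function name is hypothetical, and real sustained bandwidth is lower than this peak.

```python
def ddr_peak_gb_per_s(mega_transfers_per_s: int, channels: int,
                      bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes).

    Assumes each channel moves `bus_bytes` bytes per transfer (8 for a
    standard 64-bit DDR4 channel).
    """
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


quad_ddr4_3200 = ddr_peak_gb_per_s(3200, channels=4)
print(f"quad-channel DDR4-3200 peak: {quad_ddr4_3200:.1f} GB/s")
```

That works out to 25.6 GB/s per channel and 102.4 GB/s for four channels, which is why an L3 write bandwidth of ~100 GB/s is about level with main memory.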

So in the end, once a Skylake-X thread is out of L2, it needs to go to memory; once it needs locking, it has to pay the "snoop every other L2" and mesh latency price each time; and once another thread needs something, even if it was just calculated and evicted, it's highly likely it needs to go to memory.
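
The false sharing ping-pong from point 1 can be illustrated with a toy model that counts how often a cache line changes owner. This is a sketch under stated assumptions, not a real coherence simulation: it assumes 64-byte lines, tracks only the last writer of each line, and all function names are made up for illustration.

```python
from itertools import cycle, islice

CACHE_LINE_BYTES = 64  # assumed line size, typical for current x86 CPUs


def count_line_transfers(writes, line_bytes=CACHE_LINE_BYTES):
    """Count how often a cache line changes owner across a sequence of writes.

    `writes` is an iterable of (core_id, byte_address) pairs. The model only
    remembers the last writer of each line; it is not a full MESI simulation.
    """
    owner = {}
    transfers = 0
    for core, address in writes:
        line = address // line_bytes
        previous = owner.get(line)
        if previous is not None and previous != core:
            transfers += 1  # the line has to move from one core to another
        owner[line] = core
    return transfers


def interleaved_writes(addr_core0, addr_core1, rounds):
    """Core 0 and core 1 take turns writing to their own counter."""
    pattern = cycle([(0, addr_core0), (1, addr_core1)])
    return list(islice(pattern, 2 * rounds))


# Two 8-byte counters side by side share a line, so every write after the
# first one has to pull the line away from the other core.
packed = count_line_transfers(interleaved_writes(0, 8, rounds=1000))
# The same counters padded onto separate lines never change owner.
padded = count_line_transfers(interleaved_writes(0, 64, rounds=1000))
print(f"packed: {packed} transfers, padded: {padded} transfers")
```

Each of those transfers is a core-to-core round trip, so its cost depends on whether the other core sits on the same ring, across the mesh, or in another CCX/CCD.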
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
When overclocked and when running modern optimized games it does perform well, but we need to keep in mind the following:
  • it has obvious weak spots with a number of other games even against much lower clocked CPUs from the Skylake army
  • when overclocked we need to compare against overclocked Skylake, in which case it won't even have clock parity anymore

I agree with both of these points, but as mentioned before, it highly depends on the type of game. As for the latter, the Gamers Nexus review I cited had it competing against a 9900KS at 5.2 GHz and a 9700K at 5.1 GHz.

I cannot stress this enough: my intervention on the first page was aimed specifically at the claim that the mesh interconnect was not weaker than the ring bus in games. All one needs to do to disprove that is to find a meaningful category of games where the mesh fails to deliver. Everything else is just further discussion on the topic. (which may actually be very interesting as long as we keep it somewhat related to thread topic)

It would be very difficult to find the answer to this question without profiling tools or something. Intel designed the mesh interconnect for high core count CPUs, and claimed the ring bus was more effective on CPUs with 10 cores or fewer. Therefore, in games that don't utilize that many cores it's conceivable that the mesh interconnect would be a performance liability. The worst performing title from that review was Total War: Warhammer II, and that game uses no more than 4 cores.

I'd say it's likely a combination of the mesh interconnect and the cache hierarchy. Both can likely be nullified through programming techniques though I would wager, because there are certain games that use no more than 4 cores and still perform very well on Skylake-X in relation to the mainstream parts.
 