Ryzen: Strictly technical

Discussion in 'CPUs and Overclocking' started by The Stilt, Mar 2, 2017.

  1. moinmoin

    moinmoin Member

    Joined:
    Jun 1, 2017
    Messages:
    199
    Likes Received:
    108
     Nice stats, though is the above meant to be read as sarcasm? Since a CPU waiting for IO is always part of the overall CPU load, the inter-CCX bottleneck is just one place where IO can leave a core waiting for data. Under Linux the CPU time spent waiting for IO is commonly available as a separate stat (iowait).
     
  2. CHADBOGA

    CHADBOGA Golden Member

    Joined:
    Mar 31, 2009
    Messages:
    1,652
    Likes Received:
    230
    I have done my best to bring the Internet Strongman to Anandtech forums.

     I stated at least twice on the RWT forums that these forums are a proving ground on which he can test his insights, and he just ignored me.

     I sent him two tweets from different Twitter accounts saying he should come to these forums, and he blocked me on both occasions.
     
  3. Timur Born

    Timur Born Member

    Joined:
    Feb 14, 2016
    Messages:
    81
    Likes Received:
    58
     I took a much closer look at those "freezes" that happen for a few seconds under certain workloads, usually CPU stress tests, though I also see them happen under real, practical workloads. I used to call them stalls, but they are the same thing that other people call freezes in this discussion. Even if only a single logical core is (over)loaded with a certain workload, both the GUI/graphics output and keyboard/mouse input are suspended for anywhere between a short blip and several seconds (I measured up to around 4). Additionally, some but not all processing seems to stop during that time, which I reported wrongly before when I claimed that the whole system stops.

     First of all, both the suspension of graphics output and the partial continuation of background processing can be measured! It's important to note that time-counters seem to roll on regardless of any stall, which allows software to keep measuring average CPU load, CPU cycles (+delta), context switches (+delta) and frames per second. The CPU cycles delta is especially interesting, because it tells us whether a program interrupts its processing during a stall or not.

     What the delta means: when you measure over the span of a second, the delta is the number of cycles that occurred during that second. If a program interrupted its processing during a stall, the CPU cycles delta decreases on the very next tick right after the stall. If a program kept processing uninterrupted, the delta increases on the very next tick right after the stall.

    Programs that interrupt their processing include: WinRar, 7-Zip, Foobar2000, Firefox (Youtube HTML video output), Furmark
    Programs that do not interrupt their processing include: Ableton Live, MediaPlayerClassic, HWinfo

    Both WinRar's and 7-Zip's benchmark throughput drop considerably during stalls. WinRar seems to especially dislike my Reaper based workload, or rather the other way around, as WinRar interrupts Reaper even more.

    Audio and Video are two very special cases that justify some extra explanation.

    Audio:

     Audio drivers do not stall, regardless of the audio buffer size being used. I ran an RME Babyface's USB ASIO driver (isochronous USB transfer) at less than 2 ms buffer size without any interruption whatsoever, while my mouse kept stalling on the very same USB port + hub. If the application keeps processing data (Ableton Live), you can run input audio from the USB audio interface to the application and back to the interface completely uninterrupted, while the rest of your system is nearly unusable.

     If the application does interrupt its processing during a stall, then the size of the application's own audio buffer determines whether your audio stream gets interrupted. For example, if you set Foobar's own audio buffer to a size larger than the stalls (bigger than 4 seconds is good), you get no audio interruptions. And if stalls are shorter than what Firefox buffers for Youtube playback, you get no interruption there either. This is because with larger buffers the audio data has already been processed before the stall happens, and the program parts that just shovel the data to the audio driver do not seem to get interrupted during a stall.

    Video:

     Video playback and graphics output always get interrupted, which results in the GPU load and Video Engine load dropping considerably, including the GPU frequency and temperature. Since the timers keep rolling, the video/graphics output jumps forward to match the new time-frame (a 5 second stall means the video jumps 5 seconds forward). For Youtube videos in Firefox this is a true jump; for videos in MediaPlayerClassic it is a fast-forward that briefly increases the frame-rate (while maintaining the refresh-rate, because disabling VSync doesn't seem to work properly). Both Firefox Youtube and Furmark also see their average CPU load drop because the stalls interrupt their processing, even though they maintain their time-lines in the form of a straight jump.

     Interestingly, HWInfo's graph display works similarly to how Firefox Youtube playback and Furmark work. The graph does a full jump to the new time-frame instantly after a stall instead of drawing the in-between measurements. If you mouse over the graph you can see several seconds absent, corresponding to the stall time.
     
    #1728 Timur Born, Aug 23, 2017
    Last edited: Aug 23, 2017
    WiseUp216 and krumme like this.
  4. moinmoin

    moinmoin Member

    Joined:
    Jun 1, 2017
    Messages:
    199
    Likes Received:
    108
     At Hot Chips AMD gave more information about the IF links between dies. Every Zeppelin die has 4 IF links, of which even in Epyc only 3 are used, chosen based on positioning on the package to keep trace lengths short.
     [Image: slide showing Zeppelin die IF links]

    They also stated that the cost of the MCM approach is 59% the cost of a hypothetical monolithic Epyc chip, including a 10% area overhead for the MCM dies.
     [Image: slide comparing MCM vs. monolithic Epyc cost]

    Initial reporting: https://www.servethehome.com/amd-epyc-infinity-fabric-update-mcm-cost-savings/
    More and better slides: http://www.tomshardware.com/news/amd-threadripper-epyc-mcm-cost,35306.html
     Though I'd appreciate it if anybody could share a more complete set of the slides, if available.
     
  5. deadhand

    deadhand Junior Member

    Joined:
    Mar 4, 2017
    Messages:
    21
    Likes Received:
    82
    Hi everyone,

    Just to update on a post I made earlier this year:
    https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/page-9#post-38776310

     I took a crack at solving this issue and the problem was indeed due to false sharing. CS:GO uses a very similar lightmap baker to the SDK 2013 branch of Valve's Source engine, so it's evident that the issue is present there as well.

    Here are the results of the false sharing removed vs. original:

    (Note: CPU is a Ryzen Threadripper - affinity masks are set to 8 threads, 16 threads, and 32 threads respectively to simulate 1 CCX, 2 CCX, and 4 CCX processors.)
    The negative scaling I experienced with the dual Xeon tests (machine I previously used) is also eliminated (not shown below).

    Lower is better!

     [Image: benchmark results chart]

    Here is the pull request with the fix: (Literally two lines of code)
    https://github.com/ValveSoftware/source-sdk-2013/pull/436

     It seems that AMD CPUs, particularly AMD FX CPUs but also Ryzen, are much more susceptible to the effects of false sharing than Intel CPUs.

    Also, I'd like to point out that 'vrad' has been used to show off the benefits of multi-core processors in the past and as a benchmark:

    https://www.anandtech.com/show/2489/11

    It's very likely that this issue was present, even back then.

     EDIT: Also, the comparatively poor scaling from 16 to 32 threads is likely due to SMT and Amdahl's law (or rather, the rest of the code, which used to take a comparatively small portion of the time, now takes a relatively large portion and seems to have some scaling issues of its own).

    The speedup in the fixed section of the code is likely a fair bit higher than what's shown here.
     
    #1730 deadhand, Oct 3, 2017
    Last edited: Oct 3, 2017
    Burpo, lightmanek, Despoiler and 12 others like this.
  6. tamz_msc

    tamz_msc Golden Member

    Joined:
    Jan 5, 2017
    Messages:
    1,532
    Likes Received:
    1,402
     Very nice work! One question though - was this done with NUMA enabled or disabled? Is there any performance difference between the two? Interested to know, since last time you commented that this is a memory-bound situation.
     
    Drazick likes this.
  7. CatMerc

    CatMerc Senior member

    Joined:
    Jul 16, 2016
    Messages:
    682
    Likes Received:
    652
    @The Stilt I have a question. Let's say I buy an AM4 motherboard and a Summit Ridge CPU now, and then later replace that Summit Ridge with a Pinnacle Ridge.
    Does the chipset affect memory latency at all? I'd be completely satisfied with X370 I/O wise, but if waiting for X470 would mean getting lower memory latency I'd do it.

    Basically I'm asking if pairing Pinnacle Ridge with X470 would give performance advantages over X370.
     
    Drazick likes this.
  8. LightningZ71

    LightningZ71 Member

    Joined:
    Mar 10, 2017
    Messages:
    84
    Likes Received:
    45
     From what little seems to be leaking onto the internet, the biggest change between X370 and X470 will be the upgrade of the chipset<=>processor link from the existing PCI-E 3 to PCI-E 4. This would likely mean a bit less bandwidth contention between the chipset-connected devices for processor and DMA transactions. I suspect there may be some modernization of the USB 3.x interfaces too, but nothing drastic. Memory latency between the processor and the RAM is almost entirely determined by the supported speeds and the DDR controller, which on Ryzen sits on the CPU die rather than the chipset, so it may improve through minor tweaks during the generation change but wouldn't differ between chipset generations. The only other possible difference would be modifications the mobo makers make to the board layouts themselves to help run the DRAM faster, but I feel there is very little that can be done on that front.
     
    Drazick likes this.
  9. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,318
    Likes Received:
    1,756
    No.
     The chipset just provides additional IO to the CPU, similar to what e.g. a PCI-E to USB3 peripheral controller would.
    The chipset isn't required for the CPU to function and on Crosshair VI Hero you can actually completely disable the external PCH.
     
    Burpo, Drazick, coercitiv and 6 others like this.
  10. IRobot23

    IRobot23 Member

    Joined:
    Jul 3, 2017
    Messages:
    172
    Likes Received:
    28
     Could AMD improve DDR4 latency with Pinnacle Ridge?
     
    CatMerc likes this.
  11. PhonakV30

    PhonakV30 Senior member

    Joined:
    Oct 26, 2009
    Messages:
    765
    Likes Received:
    236
     I think it depends on the Infinity Fabric?
     
  12. deadhand

    deadhand Junior Member

    Joined:
    Mar 4, 2017
    Messages:
    21
    Likes Received:
    82
    The problem was essentially what I described here: (except it's technically not really 'false sharing' in this case, though the effects are identical)
    https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/page-23#post-38790480

     I initially thought it might have been bandwidth related, but after reducing the scene size to something extremely small (less than a megabyte) and increasing the sample counts to prevent sub-1-second execution times on my test scene (in this application the scene size controls the sample count, but that count can be increased by a user-defined factor), I came to the conclusion that bandwidth could not be involved: throughput was not improving at all, and neither was thread scaling (in terms of light sampling, the number of samples per second was essentially identical regardless of scene size when offset by the sample-count scaling).

     You can think of false sharing as a 'serialized portion' of code in Amdahl's law terms, one that gets worse with more threads. When threads are false sharing between CCXs (or between sockets), the cache-line update latency is worse than within a CCX, so threads waiting for the line to return to a shared state wait longer than they would within a single CCX.

    The most interesting thing about this to me is that the effects of false sharing seem much less significant on something like an i7-7700k than AMD CPUs, even with a hypothetical single CCX Ryzen quad core. The speedup of my fix on this program was relatively minor on an i7-7700k (~38% vs. the hypothetical single CCX Ryzen @ 56%).

    This was of course a really extreme case, but it might be responsible for some thread scaling issues in some other applications, and it's quite a fixable problem (if it's found... that's the hard part)
     
    #1737 deadhand, Oct 7, 2017
    Last edited: Oct 7, 2017
    Schmide and CatMerc like this.
  13. CatMerc

    CatMerc Senior member

    Joined:
    Jul 16, 2016
    Messages:
    682
    Likes Received:
    652
     Wouldn't the lower L3 latency on Skylake account for its lesser penalty from false sharing?
     
    Drazick likes this.
  14. tamz_msc

    tamz_msc Golden Member

    Joined:
    Jan 5, 2017
    Messages:
    1,532
    Likes Received:
    1,402
    L3 being inclusive might also be a factor in favor of Skylake.
     
    Drazick likes this.
  15. deadhand

    deadhand Junior Member

    Joined:
    Mar 4, 2017
    Messages:
    21
    Likes Received:
    82
     Since Ryzen's L3 stores shadow tags for the data in the L2 caches of the cores within a CCX, it might be argued that it behaves a little like an inclusive cache. (From what I understand, on an L3 miss it checks the shadow tags to see whether the data is in another core's L2 cache before going to memory.)

    I think so, yes, but the inter-CCX issue seems to make it much worse.
     
    Drazick likes this.
  16. IRobot23

    IRobot23 Member

    Joined:
    Jul 3, 2017
    Messages:
    172
    Likes Received:
    28
     Does anyone know how 12nm will affect Pinnacle Ridge?
     
    kostarum likes this.
  17. DrMrLordX

    DrMrLordX Diamond Member

    Joined:
    Apr 27, 2000
    Messages:
    8,838
    Likes Received:
    838
    No clue. It depends on whether it will be a straight die shrink, or whether Pinnacle Ridge will have any fundamental design changes vs Summit Ridge.
     
  18. maddie

    maddie Golden Member

    Joined:
    Jul 18, 2010
    Messages:
    1,697
    Likes Received:
    349
     With 12nm AMD has the opportunity to either lower power usage, increase clocks, or a combination of the two. As the present ecosystem of motherboards, coolers, etc. is configured for the present CPU wattages, namely 65W and 95W, I believe they will go for a clock increase: they are already power efficient but need to close the ST performance gap with Intel on the client side.

     Realize that 12nm should use less power at present clocks, so they could also offer lower-power server products for less heavily stressed server farms, using the same die clocked as at present.

     All of this is separate from any improvements in the die layout to improve IPC.

     AFAIK, density is minimally improved.
     
  19. DrMrLordX

    DrMrLordX Diamond Member

    Joined:
    Apr 27, 2000
    Messages:
    8,838
    Likes Received:
    838
    If I recall correctly, AMD won't update their Epyc lineup on 12nm anyway (or if they do, it comes later than Pinnacle Ridge), so I'm not sure that they're going to make many moves aimed at server farms.
     
    lightmanek likes this.
  20. IRobot23

    IRobot23 Member

    Joined:
    Jul 3, 2017
    Messages:
    172
    Likes Received:
    28
     Thanks for the reply.
     DF speed is the biggest problem with Ryzen's memory latency. Some say the DF speed should run 1:1 with the DDR4 MT/s, others say that locking it around 2 GHz should be enough. I don't know much about Ryzen's DF beyond its speed and bandwidth; it's basically the new NB.

     I would certainly like to see Ryzen at core clock = DF clock (1200 MHz with 3200 MHz DDR4) vs an i7 8700K at 1200 MHz + NB at 1200 MHz with DDR4-3200. Has anyone done that kind of test, to see how well Ryzen's CCX design does at lower clocks?

     Could AMD simply improve DF bandwidth? More bytes per cycle?
     
  21. moinmoin

    moinmoin Member

    Joined:
    Jun 1, 2017
    Messages:
    199
    Likes Received:
    108
     Indeed, Epyc/Threadripper will next be updated with Zen 2. So for 12LP/Zen+/Pinnacle Ridge (likely also Raven Ridge, but the silence there is deafening) they will remove the now-superfluous parts of the uncore, like the multi-die/multi-socket interfaces etc. I'd expect some low-hanging-fruit fixes on the cores and IMC as well, but the actual design improvements there should come with Zen 2.
     
    Drazick likes this.
  22. raghu78

    raghu78 Diamond Member

    Joined:
    Aug 23, 2012
    Messages:
    3,684
    Likes Received:
    862
     AMD could definitely improve DF speeds, but they have to balance the power increase against the performance gains. I think there is a chance that Pinnacle Ridge might support DDR4-4000+ speeds and thus have higher DF speeds. I would like to see a fixed DF speed of 2-2.4 GHz on Pinnacle Ridge, if possible. With 7nm Zen 2, AMD could push DF speeds into the 3 GHz range. AMD has quite a few levers to work with to improve performance.
     
    Drazick likes this.
  23. IRobot23

    IRobot23 Member

    Joined:
    Jul 3, 2017
    Messages:
    172
    Likes Received:
    28
     Well, the DF already has good bandwidth at those clocks. Since it's also meant for the server market, I assume they are going to stick with low power; for desktop they could optimize for higher clocks.
     1600 MHz+ would be "killer".
     
  24. raghu78

    raghu78 Diamond Member

    Joined:
    Aug 23, 2012
    Messages:
    3,684
    Likes Received:
    862
    https://semiaccurate.com/2017/05/17/amds-details-epyc-server-ambitions/

     EPYC will remain on 14LPP. Ryzen, and most likely Threadripper, will get an update on 12LP, as Ryzen really needs higher clock frequencies to compete with Coffee Lake. Since Pinnacle Ridge dies are desktop-only, AMD could tweak fabric speeds specifically to reduce memory latency and improve gaming performance.
     
    Drazick likes this.
  25. DrMrLordX

    DrMrLordX Diamond Member

    Joined:
    Apr 27, 2000
    Messages:
    8,838
    Likes Received:
    838
     Improving DF speeds would also improve inter-CCX communication (thread jumping) and reduce the effects of "false sharing", which would be nice for those titles that inexplicably run like crap on Ryzen.
     
    Drazick likes this.