Ryzen: Strictly technical

Discussion in 'CPUs and Overclocking' started by The Stilt, Mar 2, 2017.

  1. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    419
    Likes Received:
    288
    See this post.
     
  2. lolfail9001

    lolfail9001 Senior member

    Joined:
    Sep 9, 2016
    Messages:
    839
    Likes Received:
    292
    So basically it still looks SMT-aware to me; it's just that disabling core parking significantly increases the number of jumps between cores.
     
  3. innociv

    innociv Member

    Joined:
    Jun 7, 2011
    Messages:
    52
    Likes Received:
    17
    Anyway, re: how to multithread games

    I'm pretty sure the optimal thing, since there clearly seem to be DX12 and Vulkan issues where they rely on the shared cache and interconnect, is to have those running on the other CCX and "only" use those 8 threads for the actual graphics API calls. You'd also want the graphics drivers running on only one CCX, and somehow on that same one, despite the driver being a separate process.
    This might seem suboptimal, but the other CCX wouldn't sit there doing nothing. You still have your main loop, I/O handlers, sound, and so on running on that CCX, and you can be preparing the next draw calls to be sent to the other CCX.
    That, then, would be your only cross-CCX traffic: going from draw-call prep to the actual draw calls. Which should be very little, since you'd run your physics and transforms on the "multithread" CCX.

    But even then... I'm not totally sure.
    Like, it'd seem you want your heavy single-threaded tasks on one CCX without using SMT, so you don't want to load anything more onto it.
    Then on the other CCX, you run the tasks that are cheap to parallelize but suffer from cross-CCX issues: physics, object transforms, DX12/Vulkan... Pretty much the only single threads there being the main graphics API thread, your render loop, and your main dispatchers for those heavily parallel tasks if needed.
    But... it's the things happening on your main threads that trigger those object transforms.
    So on top of that, you need some very tightly optimized way for the inputs and game systems on one side to call for transformations to happen on the render side.

    In my mind, that's probably roughly the optimal setup.
    And this is similar to how many games are set up, as far as I'm aware: 3-8 main threads that can spawn another dozen or two (or three) for their highly parallel tasks, which get stacked on top of the heavier single-threaded ones.

    I think what a lot of people miss, with the traditional and seemingly intuitive view of how multithreading works, is this: you have, say, a main thread, a render thread, sound, and I/O, and people assume these are separate systems you can split up easily.
    But... that's not really the case. There's a performance penalty to splitting those up. Clock for clock (or watt for watt, maybe more aptly), things run worse split up like this; they tend to rely on the L3 to make it work a bit better, but it's still not quite as good.
    Then add to that, people tend to think physics is another thing you might put on one thread.
    But no, physics is often something that runs more efficiently when split across many threads and using SMT. Similar to how in Cinebench the 1800X gets a 162 single-threaded score, yet the multi-core score is 1624: 10 times higher, not 8 times higher as a layman might expect.

    So this is why I think the optimal layout is to keep your main thread, and the other things that need to stay "close" to it, on one CCX, and put the things that benefit from being parallel on the other.
    Oh, and the other thing is that you can't benefit from SMT as much if you have something uneven blocking it; but on the other CCX, with almost none of that going on, you're free to write everything in a way that can benefit from SMT.
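
    To make that layout concrete, here's a minimal sketch of that kind of pinning on Linux using pthread affinity. It assumes logical CPUs 0-7 sit on one CCX and 8-15 on the other (the actual numbering depends on the BIOS/kernel, so check with lscpu or hwloc first), and the render/worker split is purely illustrative, not how any particular engine does it:

    Code:
    // Illustrative only: latency-sensitive threads on one CCX, the
    // embarrassingly-parallel workers on the other.  Assumes logical
    // CPUs 0-7 = CCX0 and 8-15 = CCX1 on an 8C/16T Ryzen.
    #include <pthread.h>
    #include <sched.h>
    #include <atomic>
    #include <thread>
    #include <vector>

    static void pin_to_cpus(std::thread& t, int first, int last) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = first; cpu <= last; ++cpu) CPU_SET(cpu, &set);
        pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
    }

    std::atomic<bool> running{true};

    void render_loop()    { while (running) { /* main loop, API calls, I/O, sound */ } }
    void physics_worker() { while (running) { /* transforms, physics jobs          */ } }

    int main() {
        std::thread render(render_loop);
        pin_to_cpus(render, 0, 7);                  // "serial" side stays on CCX0

        std::vector<std::thread> workers;
        for (int i = 0; i < 8; ++i) {               // parallel pool stays on CCX1
            workers.emplace_back(physics_worker);
            pin_to_cpus(workers.back(), 8, 15);
        }

        running = false;                            // stop immediately in this demo
        render.join();
        for (auto& w : workers) w.join();
    }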

    This is what I was going over in my post above yours..

    That no, this can't be it, because it'd be 98ns to write to memory from the pinging core.
    But when it comes to reading... say it instantly starts to read; wouldn't it try to read from memory at the same time as looking in its L3 cache? It should catch the data a moment after it was just written.

    If Ryzen had a latency of 142ns every time it tried to read from memory, that would mean it's checking the L3 cache and then memory each time, and the real latency would be 98 minus 42ns, since you'd be factoring in the L3 lookup before the memory access actually starts. Which I don't believe; I think they're attempted simultaneously...

    I'm explaining this very poorly...
    Basically, if the MMU checks L3 before even trying to read memory, that'd mean the real memory latency is 98 minus 42ns.
    So the write should have been 56ns, and the read on the other core should have been 98ns instead of 142ns. You're missing 44ns.
    None of that can be true as far as I can see, but I think I'm still wording this poorly.
     
  4. Kromaatikse

    Kromaatikse Member

    Joined:
    Mar 4, 2017
    Messages:
    71
    Likes Received:
    151
    My point is that *core parking* is SMT aware, but the core-parking algorithm is *not* part of the scheduler.
     
    Ajay and looncraz like this.
  5. Kromaatikse

    Kromaatikse Member

    Joined:
    Mar 4, 2017
    Messages:
    71
    Likes Received:
    151
    I don't think the 142ns figure is from a single cross-CCX access. I think it might be a more complex operation, such as a semaphore, which requires multiple accesses. That's why we need the code performing this test, so we know what it's actually measuring.
     
    lightmanek, Ajay, CatMerc and 4 others like this.
  6. innociv

    innociv Member

    Joined:
    Jun 7, 2011
    Messages:
    52
    Likes Received:
    17
    Agreed.

    But in addition, whatever does account for that latency, I don't think it's purely to blame for some games performing terribly (especially DX12/Vulkan).
    I have a feeling there's something besides just the cross-CCX bandwidth that's getting overloaded when something tries to cross them too much.
    DX12 and Vulkan (two cases where we see Ryzen performance seriously drop) likely rely on the L3 cache, sure, but I'd think the 22GB/s is enough for their purposes if it were only down to that.
     
  7. lolfail9001

    lolfail9001 Senior member

    Joined:
    Sep 9, 2016
    Messages:
    839
    Likes Received:
    292
    Fair enough.

    And my point is that I see no convincing evidence in your screenshots that the scheduler is not SMT/CMT aware, because there's simply no conflict from scheduling two threads on the same physical core to be seen with a single-threaded load.
     
  8. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    640
    Likes Received:
    1,329
    There is no 1400X :p

    Ryzen 5 1400 : 4/8 @ 3.2~3.4: $169
    Ryzen 5 1500X : 4/8 @ 3.5~3.7: $189

    There's no gap in which to place a 1400X.
     
    Drazick likes this.
  9. imported_jjj

    imported_jjj Senior member

    Joined:
    Feb 14, 2009
    Messages:
    447
    Likes Received:
    278
    There is: the 1400 has 50MHz of XFR while the 1500X has 200MHz, so there's a 450MHz gap (3.45GHz vs. 3.9GHz at the top end).
     
  10. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    640
    Likes Received:
    1,329
    I don't have enough access to answer that question - I'm just reversing what I can find in the BIOS ROMs.
     
    Drazick likes this.
  11. William Gaatjes

    Joined:
    May 11, 2008
    Messages:
    14,791
    Likes Received:
    56
    Yeah, I found that out as well while reading the AnandTech site.
    My previous information came from wccftech.

    What does the X stand for?
    The 1400 seems to have XFR as well.
    It just seems artificially limited.
     
  12. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    640
    Likes Received:
    1,329
    I love the logic, but I think the memory latency figures for Ryzen are (at least partly) inclusive of the CCX and DF latencies. Otherwise increasing core frequency shouldn't reduce them much (if at all) - but I saw a 10ns drop and 9GB/s more bandwidth going from a core clock of 3GHz to 3.8GHz, without touching memory settings.

    How would a program work to isolate just the IMC to RAM latency? This is something I've never explored (well, creating benchmarks at all, actually, is something I've never had cause to do outside of critical program code).

    All that comes to mind is benchmarking memcpy() performance and the average time for accesses to return (using the TSC for timing): measure how long it takes, from a single core, to access a memory address you haven't dirtied.

    How Intel systems can show latencies of 19ns is beyond me. There's some black magic going on there... that's lower than the time it takes to get data into a core from memory.
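
    For isolating load latency rather than bandwidth (which is what memcpy() really measures), the usual trick is a pointer chase over a buffer much larger than L3, so every load misses cache and depends on the previous one. A rough sketch, assuming x86 and the __rdtsc() intrinsic; the buffer size and iteration count are arbitrary:

    Code:
    // Rough pointer-chase latency sketch: each load depends on the
    // previous one, so out-of-order execution can't hide the miss latency.
    #include <x86intrin.h>   // __rdtsc()
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        const size_t N = 64 * 1024 * 1024 / sizeof(size_t);  // ~64 MiB, well past L3
        std::vector<size_t> next(N);
        std::iota(next.begin(), next.end(), size_t{0});

        // Sattolo's algorithm: one big random cycle, so every hop is a
        // dependent, cache-unfriendly load.
        std::mt19937_64 rng{42};
        for (size_t i = N - 1; i > 0; --i) {
            std::uniform_int_distribution<size_t> pick(0, i - 1);
            std::swap(next[i], next[pick(rng)]);
        }

        const size_t iters = 20'000'000;
        size_t idx = 0;
        unsigned long long t0 = __rdtsc();
        for (size_t i = 0; i < iters; ++i)
            idx = next[idx];                  // each load depends on the last
        unsigned long long t1 = __rdtsc();

        volatile size_t sink = idx;           // keep the loop from being optimised out
        (void)sink;
        std::printf("~%.1f TSC ticks per dependent load\n",
                    double(t1 - t0) / double(iters));
        // Divide by the TSC frequency (not the current core clock) for ns.
    }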
     
    Drazick and powerrush like this.
  13. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    419
    Likes Received:
    288
    I think it all leads back to how AIDA64 benchmarks the cache.
     
    lightmanek and Dresdenboy like this.
  14. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    640
    Likes Received:
    1,329
    The accurate way to test it would be to hold a spin-lock in one thread protecting a structure that simply stores a TSC value, then set that TSC value and release the spinlock from another thread. This release would need to happen at an interval MUCH greater than the largest possible latency, so something like 1ms should be fine. Real-time thread priorities must be used.

    The reader thread would keep an array of (rdtsc - ping->tsc) results. The rdtsc instruction takes about 60 cycles, the subtraction would just about be free, and the results would need to be stored on the stack - which sets some requirements for the operation (I like requirements, keeps the code simple :p).

    Using semaphores or sockets would be a sure way to end up including other variables into the results.
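
    A rough sketch of that ping/pong idea, simplified to a single atomic stamp rather than a full spin-lock, and without the real-time priorities, so treat the numbers as indicative only. The core numbers are assumptions (pick one core per CCX for the cross-CCX case), and it assumes an invariant TSC that's synchronised across cores:

    Code:
    // Writer stamps the TSC into a shared cache line; a reader on another
    // core spins until it sees the new stamp and records (its TSC - stamp).
    #include <x86intrin.h>
    #include <pthread.h>
    #include <sched.h>
    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    struct alignas(64) Ping { std::atomic<uint64_t> tsc{0}; };  // own cache line
    static Ping ping;
    static std::atomic<bool> done{false};

    static void pin_self(int cpu) {
        cpu_set_t set; CPU_ZERO(&set); CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main() {
        const int samples = 1000;
        std::vector<uint64_t> deltas;
        deltas.reserve(samples);

        std::thread reader([&] {
            pin_self(0);                             // e.g. a core on CCX0
            uint64_t last = 0;
            while (!done.load(std::memory_order_acquire)) {
                uint64_t v = ping.tsc.load(std::memory_order_acquire);
                if (v != last) {                     // new stamp arrived
                    deltas.push_back(__rdtsc() - v); // includes rdtsc overhead
                    last = v;
                }
            }
        });

        std::thread writer([&] {
            pin_self(8);                             // e.g. a core on CCX1
            for (int i = 0; i < samples; ++i) {
                ping.tsc.store(__rdtsc(), std::memory_order_release);
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
            done.store(true, std::memory_order_release);
        });

        writer.join();
        reader.join();

        uint64_t sum = 0, mn = ~0ull;
        for (uint64_t d : deltas) { sum += d; mn = d < mn ? d : mn; }
        std::printf("%zu samples, min %llu, avg %llu TSC cycles\n", deltas.size(),
                    (unsigned long long)mn,
                    (unsigned long long)(deltas.empty() ? 0 : sum / deltas.size()));
    }

    Pinning both threads inside one CCX and then one on each CCX (and subtracting the rdtsc overhead) should expose the cross-CCX penalty.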
     
    Drazick likes this.
  15. innociv

    innociv Member

    Joined:
    Jun 7, 2011
    Messages:
    52
    Likes Received:
    17
    XFR was AMD's biggest mistake.

    They should have just labeled them as 3.45GHz and 3.9GHz "Precision Boost". Instead they make them look lower clocked and spread the myth that they overclock as far as your cooling allows. :(

    "X" just stands for "it's better xD"
     
  16. Elixer

    Elixer Diamond Member

    Joined:
    May 7, 2002
    Messages:
    8,777
    Likes Received:
    189
    Isn't the Infinity Fabric that ties the CCX units together based off SeaMicro's Freedom Fabric?
    That interconnect (SeaMicro's Freedom Fabric) was supposed to have over a terabit/sec of bandwidth and < 6 us latency per node.
    Obviously, something is getting in the way here to slow down the transfer rate, assuming pcper's graph is correct, but, again, we need the darn source to see what they are actually doing.
    In fact, I would be curious to see that same code run on a Linux 4.11+ kernel, and see what it shows there as well.
     
    lightmanek and powerrush like this.
  17. DisEnchantment

    Joined:
    Mar 3, 2017
    Messages:
    65
    Likes Received:
    42
    lightmanek and looncraz like this.
  18. powerrush

    powerrush Junior Member

    Joined:
    Aug 18, 2016
    Messages:
    20
    Likes Received:
    4
    The whole problem with Ryzen is that "coherent" data fabric. It runs at half the RAM speed.
     
  19. imported_jjj

    imported_jjj Senior member

    Joined:
    Feb 14, 2009
    Messages:
    447
    Likes Received:
    278
    The GMI links are a key part of this thing too, and having 4 dies play well together is important.
    So is there any chance all data goes through some GMI-related "filter" so they can get similar latency on-die and in-package?
     
  20. imported_jjj

    imported_jjj Senior member

    Joined:
    Feb 14, 2009
    Messages:
    447
    Likes Received:
    278
    Why do folks complain about this? It's pretty much the default way to go about it, as it scales with memory bandwidth needs. Running it faster to reduce latency could be an upside, but it's less efficient.
     
    lightmanek and Dresdenboy like this.
  21. coffeemonster

    coffeemonster Member

    Joined:
    Apr 18, 2015
    Messages:
    135
    Likes Received:
    35
    I really think they should have made a 1400X with the 1500X's clocks (4/8 @ 3.5~3.7, $189)
    and had the 1500X be the true scaled-down 1800X gamer quad @ 3.6~4.0, $209 or so.
     
  22. CrazyElf

    CrazyElf Member

    Joined:
    May 28, 2013
    Messages:
    79
    Likes Received:
    9
    Has anyone tried ECC on Ryzen? ASRock's X370 boards should support it. No idea about the other vendors.


    AIDA64 wasn't giving correct numbers for L2 and L3, I know, but they said that L1 and RAM were ok.

    From what I understood, a fix was to be incorporated into the next version of AIDA64? Do you have any more information?


    True enough. Right now we are working with fixed memory multipliers and locked timings. No idea what Ryzen's memory controller is capable of once fully unlocked.

    Hope it's good as it could really make a difference.



    Not sure either.

    An interesting question - what is the weighting?

    Some of the time, the memory won't even need to be accessed, because the data will be in the other CCX's L3 (or even in an L2, since the L3 is a victim cache?).


    I guess the answer might be:

    Weighted average inter-CCX latency = latency within CCX + (% of time data is in L3 of other CCX x amount of time to access other CCX + % of time you need to get data from DRAM x time it takes to access DRAM) + probability of cache miss x average time of cache miss


    I'm thinking that:
    1. Overclocking the core might reduce latency within the CCX (since they're all tied together: CPU registers, L1, L2, and L3)
    2. Overclocking RAM might reduce latency in all areas where there is 32B/cycle


    So that would mean all the caches and the data fabric.


    Then the average latency between the cores would be (and this is assuming you are getting data out of a core):
    1. 3/7 of the time (there are 3 other cores in your CCX and 7 others on the die), you will have the data within your own CCX
    2. 4/7 of the time you need to go to the other CCX, and in some of those cases you find the data in the L3 cache of one of the cores in the other CCX
    3. In the other cases of that 4/7, you go outside your CCX but have to go to DRAM, because the data is not in the other CCX's L3 cache at all
    4. I guess then there are the cache misses within the CCX as well

    So overall average latency between cores is, including both within and outside of the CCX:

    3/7 x average latency within CCX + 4/7 x probability of going to L3 in other CCX x time to access other CCX L3 cache + 4/7 x probability of not finding data in other CCX x time it takes to go to DRAM + probability of cache miss inside CCX x time penalty for cache miss
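
    As a toy version of that weighted average, with every probability and latency made up purely for illustration (none of these are measured values):

    Code:
    // Toy evaluation of the weighted-average model above.
    // Every number here is a made-up placeholder, NOT a measurement.
    #include <cstdio>

    int main() {
        double lat_same_ccx   = 45.0;   // ns: data found within your own CCX
        double lat_other_l3   = 140.0;  // ns: hit in the other CCX's L3
        double lat_dram       = 180.0;  // ns: miss there too, go to DRAM
        double p_other_l3     = 0.5;    // chance the line is in the other CCX's L3
        double p_miss_own_ccx = 0.1;    // extra misses within your own CCX
        double miss_penalty   = 90.0;   // ns: average cost of such a miss

        // 3/7 of the other cores share your CCX, 4/7 sit on the other CCX.
        double avg = (3.0 / 7.0) * lat_same_ccx
                   + (4.0 / 7.0) * (p_other_l3 * lat_other_l3
                                    + (1.0 - p_other_l3) * lat_dram)
                   + p_miss_own_ccx * miss_penalty;

        std::printf("average core-to-core latency: %.1f ns\n", avg);
    }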


    I wonder what this is going to look like for Naples. How would they get 8 CCXs to talk to each other? I don't think that mesh is practical for 8 CCXs, so a bi-directional ring might be necessary. We will know by next quarter.
     
    #997 CrazyElf, Mar 18, 2017 at 7:10 PM
    Last edited: Mar 18, 2017 at 7:28 PM
    lightmanek and Dresdenboy like this.
  23. CataclysmZA

    CataclysmZA Junior Member

    Joined:
    Mar 15, 2017
    Messages:
    6
    Likes Received:
    7
    When a request or ping is made to the L3 in a different CCX, both the L3 cache in that second CCX and the unified IMC are pinged/strobed/checked at the same time.

    If the inter-CCX latency of 140ns average is correct, then it can only be that the two latency values are being reported together, regardless of whether or not any data is retrieved in the next clock cycle.
     
  24. Elixer

    Elixer Diamond Member

    Joined:
    May 7, 2002
    Messages:
    8,777
    Likes Received:
    189
    lightmanek and Dresdenboy like this.
  25. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    640
    Likes Received:
    1,329

    Exactly this. XFR would have been great if it also allowed one thread to hit something like 4.2GHz - a clock rate no one can achieve with ambient cooling and tolerable voltages.

    Now, if XFR was something we could manually adjust as part of overclocking... that'd be insanely awesome. I know my 1700X can do 4.1GHz on two cores, but it can't handle 3.9GHz on all cores at the same voltage.

    3.8GHz all-core and a 4.1GHz dual-core turbo would make that thing an absolute beast. Instead I'm stuck with 3.8GHz fixed - missing out on a good 8% of single-threaded performance potential.
     
    Drazick likes this.