Ryzen: Strictly technical

Discussion in 'CPUs and Overclocking' started by The Stilt, Mar 2, 2017.

  1. theevilsharpie

    theevilsharpie Platinum Member

    Joined:
    Nov 2, 2009
    Messages:
    2,323
    Likes Received:
    13
    The Windows scheduler is already NUMA-aware and can tell the difference between logical and physical cores. It has everything needed to schedule work on Ryzen efficiently (except for perhaps some power control stuff), and just needs to be aware of the processor's core and cache topology.
     
  2. Ajay

    Ajay Platinum Member

    Joined:
    Jan 8, 2001
    Messages:
    2,863
    Likes Received:
    75
    We can't know that for sure. We don't have the algorithms for recent versions of the Windows scheduler. I would be surprised if there weren't server applications out there that rely on scheduler profiles to optimize performance - the same goes for embedded Windows applications. There are, no doubt, quirks to be avoided as well (something game devs might know about). This OS has been evolving for over 20 years - who really knows what's in there (aside from MS)?

    So those recommendations work great as guideposts for developing a change to the functional specification, but the devil is in the implementation.
     
  3. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    Getting to look at the algorithms might not be possible, but Windows scheduling is pretty well-documented.
     
    Kromaatikse likes this.
  4. Ajay

    Ajay Platinum Member

    Joined:
    Jan 8, 2001
    Messages:
    2,863
    Likes Received:
    75
    Thanks, that's a slightly newer description than the one I have bookmarked. Kernel symbols are available for Win10/7 etc., so one could install them, debug into the kernel, and take a look at the scheduler code.

    Anywho, that link is just a description of the Windows scheduling API, with some operational details (but very few!). It's not enough to give much of a clue how the scheduler actually behaves at runtime - normal operation, edge cases, and any hazards (if there are any). This isn't a trivial snippet of code - here is the current Linux version...
    .h: https://github.com/torvalds/linux/blob/master/include/linux/sched.h
    .c: https://github.com/torvalds/linux/blob/master/kernel/sched/fair.c
     
  5. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    If only M$FT were generous enough to put their Windows code on github.
     
    Ajay likes this.
  6. iBoMbY

    iBoMbY Member

    Joined:
    Nov 23, 2016
    Messages:
    89
    Likes Received:
    43
    Yes, it is NUMA aware, but Ryzen is not reported as two NUMA nodes to the system.
     
  7. Kromaatikse

    Kromaatikse Member

    Joined:
    Mar 4, 2017
    Messages:
    74
    Likes Received:
    156
    I found this:
    Thread Ideal Processor
    When you specify a thread ideal processor, the scheduler runs the thread on the specified processor when possible. Use the SetThreadIdealProcessor function to specify a preferred processor for a thread. This does not guarantee that the ideal processor will be chosen but provides a useful hint to the scheduler.​

    So, a game which is aware of Ryzen's special topology can influence Windows' scheduling behaviour without relying on an affinity mask. This is fortunate since the latter appears to be broken. The problem is that each and every game dev needs to think about and set this correctly.
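    A minimal sketch of how a game might act on this, assuming the common Windows enumeration on an 8-core Ryzen (SMT siblings on adjacent logical CPUs, CCX0 on logical 0-7, CCX1 on 8-15 - an assumption that should be verified with GetLogicalProcessorInformationEx on real hardware). `ccx_ideal_cpu` is an illustrative helper, not a real API:

```c
/* Sketch: steering worker threads onto CCXs via SetThreadIdealProcessor.
 * Assumed topology (verify before relying on it): 8 cores, SMT pairs on
 * adjacent logical CPUs, so CCX0 = logical 0-7 and CCX1 = logical 8-15. */
#include <assert.h>

/* Map a worker index to a logical CPU so the first four workers land on
 * distinct physical cores of CCX0, the next four on CCX1, and so on. */
unsigned ccx_ideal_cpu(unsigned worker_index)
{
    unsigned core = worker_index % 8; /* 8 physical cores total      */
    unsigned ccx  = core / 4;         /* 4 cores per CCX             */
    unsigned slot = core % 4;         /* core within its CCX         */
    return ccx * 8 + slot * 2;        /* first SMT thread of the core */
}

#ifdef _WIN32
#include <windows.h>
/* Hint the scheduler; returns the previous ideal processor, or -1 on error. */
static DWORD hint_ideal(HANDLE thread, unsigned worker_index)
{
    return SetThreadIdealProcessor(thread, ccx_ideal_cpu(worker_index));
}
#endif
```

    Unlike an affinity mask, this leaves the scheduler free to migrate the thread if the hinted CPU is busy - which is exactly the "useful hint, not a guarantee" behaviour described above.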

    I remain convinced that the scheduler itself is completely oblivious to SMT, NUMA, etc. Any illusion otherwise is given by the core-parking algorithm (which is at least SMT aware), and by affinity optimisations applied internally or externally to a given process. The documentation linked above talks a lot about applications needing to take responsibility for optimising their own affinity settings for topological considerations.
     
    lightmanek, Ajay and Dresdenboy like this.
  8. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    Thankfully Codemasters correctly identifies the topology; it only fails when syncing the configuration files through Steam while migrating a previous installation of F1 2016 - in that case it doesn't update its detection of the new configuration.

    If this is the situation, then who knows what other developers might be doing - the Ghost Recon Wildlands example (typical of Ubisoft) doesn't inspire much confidence.
     
  9. Dresdenboy

    Dresdenboy Golden Member

    Joined:
    Jul 28, 2003
    Messages:
    1,687
    Likes Received:
    455
    Didn't Robert Hallock/AMD state that F1 detected 16 cores?
     
    looncraz likes this.
  10. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    Ah, I stand corrected then. The 16-core detection is in addition to the syncing issue. GRW on the other hand is the one that blatantly shows 16 cores in the results of its built-in benchmark.
     
    Dresdenboy likes this.
  11. imported_jjj

    imported_jjj Senior member

    Joined:
    Feb 14, 2009
    Messages:
    491
    Likes Received:
    309
    Minkoff and Dresdenboy like this.
  12. innociv

    innociv Member

    Joined:
    Jun 7, 2011
    Messages:
    52
    Likes Received:
    17
    Even if F1 and other games using the engine correctly identified 8 cores with SMT, that doesn't make them "aware of the topology".

    Only if an application identifies that there are two L3 caches, one per set of 4 cores, with significant latency between them, is it actually aware of the topology.
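    That cache-sharing information is exactly what AMD exposes through CPUID leaf 0x8000001D (Cache Properties). A sketch of decoding its EAX register - the field layout is assumed from AMD's leaf mirroring Intel's leaf 4, and the raw value in the usage note below is a hand-crafted illustration, not captured from hardware:

```c
/* Sketch: decoding EAX of AMD CPUID leaf 0x8000001D to find how many
 * logical processors share a cache. On Zen the L3 sub-leaf should report
 * 8 sharers (4 cores x 2 SMT threads) per CCX, not 16 for the whole chip. */
#include <assert.h>

typedef struct {
    unsigned type;    /* 1 = data, 2 = instruction, 3 = unified */
    unsigned level;   /* 1..3 */
    unsigned sharers; /* logical processors sharing this cache */
} cache_info;

/* Decode the EAX register for one cache sub-leaf. */
cache_info decode_cache_eax(unsigned eax)
{
    cache_info ci;
    ci.type    = eax & 0x1f;                /* bits 4:0                */
    ci.level   = (eax >> 5) & 0x7;          /* bits 7:5                */
    ci.sharers = ((eax >> 14) & 0xfff) + 1; /* bits 25:14, minus one   */
    return ci;
}
```

    For example, a unified L3 shared by 8 logical processors would encode as 0x1C063: type 3, level 3, sharers 8 - an application that walks these sub-leaves can tell a 2x4-core part from a true 8-core one.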

    Hm.
    So they can use the CPUID to disable Windows 7 updates on Ryzen CPUs, but they can't use it to change the scheduling pattern. Got it. :)

    It would, wouldn't it?
    Isn't there more latency from one L3 cache to the other, so it's best to cluster a process's threads together regardless of whether it's a "true 8 core" or a "2x4 core"?

    I think many people did assume the scheduler was already aware of the latency of cross-communication from one core to the next, as it seems like such a straightforward optimization to do.

    Jaguar is based on Bobcat which is based on K10.
    Jaguar has more in common with K8 and Ryzen as well than Bulldozer.

    Ryzen is sort of Phenom IV... if Phenom III ever existed. But you could think of Bobcat and Jaguar as Phenom III.
     
    #837 innociv, Mar 16, 2017
    Last edited: Mar 16, 2017
  13. OrangeKhrush

    OrangeKhrush Senior member

    Joined:
    Feb 11, 2017
    Messages:
    210
    Likes Received:
    326
    Yesterday I posted talk from my source of a new platform; today Chiphell leaked the X399, which my source confirmed is real.

    I did confirm that the new silicon revisions with "ironed out" issues extend all the way from these behemoths to the entry-level SKUs. I'm not really getting my knickers in a bunch about this anymore - performance is coming, but only if you haven't bought a chip yet.
     
    lightmanek, Madpacket and dnavas like this.
  14. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    I don't see how a 16C32T chip solves the inter-CCX data fabric latency issue from a purely HW design standpoint, unless it's a monolithic block like Intel's. Or are they not using the 4-core CCX blocks?

    How does a silicon revision solve what is fundamentally an interconnectivity issue?
     
  15. lolfail9001

    lolfail9001 Senior member

    Joined:
    Sep 9, 2016
    Messages:
    858
    Likes Received:
    297
    Increase data fabric clock?
     
  16. piesquared

    piesquared Golden Member

    Joined:
    Oct 16, 2006
    Messages:
    1,325
    Likes Received:
    176
    Speed up the DF? We've already seen how much increasing the BCLK increases performance. Although I'm very happy with the performance of my Ryzen 1700 already.
     
  17. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    There is a point beyond which increasing the BCLK affects PCI-E bandwidth, and the threshold at which links drop from PCI-E 3.0 to 2.0 operation is pretty low. Unless the DF is made independent of the IMC, latency issues will remain.
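    The arithmetic behind that: on AM4 the reference clock feeds PCI-E as well, so the effective transfer rate scales linearly with BCLK. A sketch - the 105 MHz figure in the usage note is an illustrative margin, not a spec value:

```c
/* Sketch: effective PCI-E transfer rate as a function of BCLK. On AM4 the
 * reference clock is shared, so overclocking BCLK pushes PCI-E out of spec. */
#include <assert.h>

/* Effective PCI-E 3.0 rate in MT/s for a given BCLK in MHz (nominal 100). */
unsigned pcie3_rate_mts(unsigned bclk_mhz)
{
    return 8000u * bclk_mhz / 100u; /* 8 GT/s base, linear with BCLK */
}
```

    At 105 MHz BCLK the links would have to run at 8400 MT/s, beyond what most boards tolerate at Gen3 - hence firmware falling back to 2.0 operation well before the BCLK gains become interesting.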
     
  18. Kromaatikse

    Kromaatikse Member

    Joined:
    Mar 4, 2017
    Messages:
    74
    Likes Received:
    156
    Well, this is of course the danger Microsoft inherently runs by punting topology awareness to applications. In general, application developers aren't as aware of these details as system devs, and certainly the spread of knowledge is more uneven.
     
    deadhand and Ajay like this.
  19. dnavas

    dnavas Member

    Joined:
    Feb 25, 2017
    Messages:
    37
    Likes Received:
    11
    I'm sitting next to my computer, which is now doing work that my poor 860 couldn't have managed, so it's hard to be too disappointed. But.... Drat. I'm still seeing 6 hour encodes, so if they can ship a quad-channel, 16 core 3.6-4.0GHz 200W TDP monster (32 PCI-E lanes?) on an ATX-sized board (hah!), AMD is going to get a lot of money from my wallet. Probably E-ATX, huh? :sigh:
     
  20. dnavas

    dnavas Member

    Joined:
    Feb 25, 2017
    Messages:
    37
    Likes Received:
    11
    Wasn't there a mention of being able to run the fabric clock at 1:1 with the memory clock? Could it be that there was a problem in the current revision that didn't allow clocks that high?
     
    looncraz likes this.
  21. OrangeKhrush

    OrangeKhrush Senior member

    Joined:
    Feb 11, 2017
    Messages:
    210
    Likes Received:
    326
    That is likely the answer - they left things available to play with.
     
  22. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    Typically games just use the CPUID for detection and store the result in an XML file. Are there applications that actually detect and store the cache hierarchy in a separate file?
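    For reference, the kind of CPUID detection games typically do is just decoding the family/model signature from leaf 1 EAX - which says nothing about CCXs or cache layout. A sketch of that decode (Zen reports family 0x17; the raw value 0x00800F11 used in the test is the signature reported by the Ryzen 7 1800X):

```c
/* Sketch: decoding family/model/stepping from CPUID leaf 1 EAX.
 * This identifies the chip (Zen = family 0x17) but carries no topology
 * information at all -- that requires the cache/topology leaves. */
#include <assert.h>

typedef struct { unsigned family, model, stepping; } cpu_sig;

cpu_sig decode_leaf1_eax(unsigned eax)
{
    cpu_sig s;
    unsigned base_family = (eax >> 8) & 0xf;
    unsigned ext_family  = (eax >> 20) & 0xff;
    unsigned base_model  = (eax >> 4) & 0xf;
    unsigned ext_model   = (eax >> 16) & 0xf;

    s.stepping = eax & 0xf;
    /* The extended fields only apply when the base family field is 0xF. */
    s.family = base_family + (base_family == 0xf ? ext_family : 0);
    s.model  = (base_family == 0xf) ? ((ext_model << 4) | base_model)
                                    : base_model;
    return s;
}
```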
     
  23. lolfail9001

    lolfail9001 Senior member

    Joined:
    Sep 9, 2016
    Messages:
    858
    Likes Received:
    297
    I believe a presently disabled ability to run the DF at 2x the memory bus clock was mentioned (so basically running it at the tick rate of the memory). Should AMD be able to use it in some new silicon revision, the DF issues may for all intents and purposes be resolved.
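    The arithmetic, as a sketch: on Zen the data fabric runs at MEMCLK, i.e. half the DDR transfer rate, and the rumoured mode would double that. The figures here are illustrative of the ratio, not measured:

```c
/* Sketch: fabric clock as a function of memory speed. mult = 1 models the
 * current 1:2 behaviour (DF at MEMCLK); mult = 2 models the rumoured mode
 * running the DF at the memory tick rate. */
#include <assert.h>

/* Fabric clock in MHz for a DDR4 transfer rate given in MT/s. */
unsigned df_clock_mhz(unsigned ddr_mts, unsigned mult)
{
    return ddr_mts / 2u * mult;
}
```

    So DDR4-2666 gives a 1333 MHz fabric today, while the 2x mode on the same memory would put it at 2666 MHz - which is why it would largely neutralise the inter-CCX penalty.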
     
    shing3232, CatMerc and OrangeKhrush like this.
  24. formulav8

    formulav8 Diamond Member

    Joined:
    Sep 18, 2000
    Messages:
    6,338
    Likes Received:
    157
    I wouldn't be surprised if a future BIOS allows you to change the divider's ratio.
     
    looncraz likes this.
  25. OrangeKhrush

    OrangeKhrush Senior member

    Joined:
    Feb 11, 2017
    Messages:
    210
    Likes Received:
    326
    Get me a 1500 with newer revision silicon and a beer