Ryzen: Strictly technical

Discussion in 'CPUs and Overclocking' started by The Stilt, Mar 2, 2017.

  1. knowndragon

    knowndragon Junior Member

    Joined:
    Apr 3, 2017
    Messages:
    17
    Likes Received:
    4
    Lots of info to read here. I appreciate the OP for taking the time. This is being seen and linked on other forums I belong to.

    So, as a rule of thumb (I have not read all the posts in this thread yet, but I will): if you can't get past the stock XFR frequency, is the best thing to do to leave it alone at stock? I am going to try a base clock overclock combined with a multiplier change, if that is possible.
     
  2. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    715
    Likes Received:
    1,633
    Thanks, I refreshed the cached versions in case they get used again. I was originally planning to use PHP charts, which are compute-heavy, so I was going to cache the pages full-time in memory. Not sure how the empty cached versions came up at all, though... pretty strange.
     
    Drazick likes this.
  3. Chl Pixo

    Chl Pixo Junior Member

    Joined:
    Mar 9, 2017
    Messages:
    11
    Likes Received:
    2
    @looncraz good info.
    Now if only the virtualization were not so lacking, it would be great.
    I am still seeing a big performance drop on a passed-through GPU.

    I am curious whether the new AGESA resolves the problem with IOMMU and AMD cards in the chipset slot.
    Currently no Linux will boot, at least on the ASUS Prime X370-Pro, if there is a GCN-based card in the chipset slot.
    I tested this with an RX 460, R9 290 and R7 260X.
    An old HD 6450 works fine and, according to what I have read online, Nvidia cards work too.
     
  4. Dresdenboy

    Dresdenboy Golden Member

    Joined:
    Jul 28, 2003
    Messages:
    1,730
    Likes Received:
    554
    Someone noticed that the 1T SB/XV charts are swapped, Edit: but they seem OK to me.
     
    #1479 Dresdenboy, Apr 10, 2017
    Last edited: Apr 10, 2017
    Drazick likes this.
  5. mattiasnyc

    mattiasnyc Member

    Joined:
    Mar 30, 2017
    Messages:
    122
    Likes Received:
    66
    I don't think it's off-topic, it relates to strictly technical issues.... And I'm the same Mattias as on Gearslutz! :)

    I don't think the CPU bound x16 is slowing down, I just think it's an odd coincidence that the boards with legacy PCI have chosen to not offer x8/x8 for the CPU PCIe 3.0 x16.

    Either way, as soon as I get confirmation that the adapter outputs the correct voltage required for my Lynx card I'll probably settle for that as opposed to legacy PCI on board. That'll allow me to do x8/x8 for video work in the future.
     
  6. dnavas

    dnavas Member

    Joined:
    Feb 25, 2017
    Messages:
    131
    Likes Received:
    46
    Yes, I see all of the charts swapped -- SandyBridge on top. :shrug: I'm more interested in all the remaining pages. I hope your time frees up.

    I've been using Ryzen for video editing, and I've noticed two things.
    First, the lack of quicksync really hampers decode acceleration, to the point where a 7700k may well be a better bet for some users. Single-threaded UHD avc decode in my NLE can't complete at 60p, so unless the video is packaged with multiple slices (two might work at ~4.1G, but I'm only stable at 3.9), I'm not real-time in my software. That's a problem. Fortunately I have quad-slice cam output, so I'm fine for now, and can only hope that the software eventually makes use of nvdec (or some upgraded amd equivalent).
    Worse (and second), if the decode threads, which are long-pole items, wind up having their time stolen by SMT'd threads, I'm in for a bad time. There are very strange performance pits. I can put a few cams-worth of video in a loop, and one time through the loop everything is good, and another time through it, we're hiccoughing like crazy. This usually happens when the processor is nearly fully utilized (over 80%). I haven't re-run that test after I upgraded my BIOS (to F5g on a Gaming 5, which is supposed to have 1.0.0.4), so maybe things have improved, but it feels like some kind of scheduling problem. If the threads are scheduled on top of each other (err, same core), they seem to stay that way until the next go-around. :shrug: [This is Win7, btw]

    Hopefully those observations are useful in some way. Obviously more investigation would be required. A longer, video-editing-focused review which I would sum up with "not all things parallelizable are parallelized" is here https://www.pugetsystems.com/labs/a...2017-AMD-Ryzen-7-1700X-1800X-Performance-909/. As someone who wants to know why their benchmarks look a certain way, this kind of article will likely grate, but the kind of software behavior it demonstrates is going to be a problem for some HEDT targets. That said, I'd be lying if I claimed I wasn't interested in a 16 core anyway. :)

    Thanks for the time!
     
    #1481 dnavas, Apr 10, 2017
    Last edited: Apr 10, 2017
    french toast likes this.
  7. Kromaatikse

    Kromaatikse Member

    Joined:
    Mar 4, 2017
    Messages:
    83
    Likes Received:
    169
    This is something I see quite specifically in Gentoo Linux, which builds all of its packages from source on the end system. During this build process, there are several distinct phases which occur entirely sequentially:
    • Source archives are verified, unpacked and patched. This is basically a serial operation, although decompressors effectively parallelise with the unarchivers and patchers they directly feed. In any case it only takes a long time for very big packages.
    • The build tree is configured. More often than not, this involves a GNU Autotools script, which is notoriously slow and pedantic - and also completely serial.
    • The source code is compiled and linked. This is theoretically the meat of the business, and is usually properly parallelised on large packages, as you'd expect of a multi-file compiler workload. There may be a few bottlenecks in the dependency chain, but that's it.
    • The build products, documentation, etc. are installed. This is mostly a disk-limited operation, but with one or two notable exceptions: in particular Glibc inexplicably delays building locale descriptors to this stage and does not parallelise this 100+ step (by default) process.
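To put rough numbers on the phases above, here is a minimal Python sketch of Amdahl's law applied to a build in which only the compile phase scales with core count. The phase durations are made-up placeholders, not measurements from any real package, and perfect scaling of the compile phase is an assumption.

```python
# Hypothetical single-core phase times in seconds (illustrative only).
# Each entry is (seconds on one core, whether the phase is parallelisable).
PHASES = {
    "unpack": (20.0, False),
    "configure": (60.0, False),
    "compile": (800.0, True),
    "install": (40.0, False),
}


def build_time(cores: int, phases=PHASES) -> float:
    """Total wall time if parallel phases scale perfectly with core count."""
    total = 0.0
    for seconds, parallel in phases.values():
        total += seconds / cores if parallel else seconds
    return total


def speedup(cores: int, phases=PHASES) -> float:
    """Speedup over one core; the serial phases bound it (Amdahl's law)."""
    return build_time(1, phases) / build_time(cores, phases)


if __name__ == "__main__":
    for n in (1, 4, 8, 16):
        print(f"{n:2d} cores: {build_time(n):6.1f} s, speedup {speedup(n):.2f}x")
```

With these placeholder numbers, the 120 s of serial work caps the speedup at 920/120, or roughly 7.7x, no matter how many cores are added.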
     
  8. i-know-not

    i-know-not Junior Member

    Joined:
    Mar 2, 2017
    Messages:
    8
    Likes Received:
    12
  9. Dresdenboy

    Dresdenboy Golden Member

    Joined:
    Jul 28, 2003
    Messages:
    1,730
    Likes Received:
    554
    It's not my review. The charts are meant to show performance relative to XV (top) and SB (bottom). Thus they have to include SB (top) and XV (bottom).

    Your observation reminds me of a question I asked in the past regarding performance measurement of actual video editing, not just the (overnight) rendering. This would include the storage subsystem, the memory subsystem, and the CPU of course.

    SMT-related things (e.g. background threads reducing the performance of foreground threads with user interaction) could be improved by setting affinity, using Process Lasso, etc. But a smarter scheduler would help, too, of course.
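As a rough illustration of the affinity idea, the Python sketch below builds a bitmask that selects one logical CPU per physical core, so a latency-sensitive thread is never placed beside an SMT sibling. It assumes logical CPUs are numbered so that 2k and 2k+1 share a core, which should be verified on the actual system; the function name is hypothetical.

```python
def one_thread_per_core_mask(logical_cpus: int, threads_per_core: int = 2) -> int:
    """Return an affinity bitmask with the first logical CPU of each core set.

    Assumes sibling threads are numbered consecutively (0/1, 2/3, ...).
    """
    if logical_cpus <= 0 or threads_per_core <= 0:
        raise ValueError("counts must be positive")
    mask = 0
    for cpu in range(0, logical_cpus, threads_per_core):
        mask |= 1 << cpu
    return mask


if __name__ == "__main__":
    # 8 cores / 16 threads, as on a Ryzen 7.
    print(hex(one_thread_per_core_mask(16)))  # prints 0x5555
```

A mask like this can be passed to tools that accept a hexadecimal affinity value, for example `start /affinity 5555 <program>` in the Windows command prompt.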

    What kind of NVMs, SSDs, HDDs do you use - and how much RAM?
     
    Drazick likes this.
  10. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    715
    Likes Received:
    1,633
    I have no idea how the charts could ever be swapped, they're hard-coded in place inside their cells and have never been placed in the wrong cell. What browser are you using? And, are you sure they are swapped? The results relative to Excavator will contain the Sandy Bridge results, whereas the results relative to Sandy Bridge will contain the Excavator results.

    What program do you use for video editing? Proprietary solutions for generic problems always irk me. QuickSync isn't anything special, it's just GPU compute.
     
    Drazick likes this.
  11. DeeJayBump

    DeeJayBump Member

    Joined:
    Oct 9, 2008
    Messages:
    52
    Likes Received:
    61
    Thanks for all the hard work you've provided us with all of this Ryzen testing, first of all.

    As for reversed charts, using Pale Moon, the charts [in both Single Thread + Multi-Thread sections] are reversed for me as well. Appears that the charts themselves are misnamed [Excavator-named charts lack Excavator results, SB-named charts lack SB results] which appears to be the issue.
     
  12. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    715
    Likes Received:
    1,633
    "Relative to Excavator" charts will not contain Excavator results - they would all just be 100% ;-)

    Relative charts generally exclude that to which they are relative as that would just be the 100% marker.
     
    Drazick and Dresdenboy like this.
  13. dnavas

    dnavas Member

    Joined:
    Feb 25, 2017
    Messages:
    131
    Likes Received:
    46
    Yes, I know. I reopened the page and all seems reasonable. Perhaps I misread the first time through? I dunno -- been a long couple of days....

    Edius comes in two different versions, and the workgroup version (which I have) indicates that it supports multi-CPU systems. I don't know the extent to which it is getting in its own way. I should look into Lasso.

    Well, the OS is on an SSD. I'll probably graduate it to nvme, but not in any hurry.
    Most folks that I know have local raid arrays, but I prefer to edit in the location of the final resting place for my bits, so I have a rather unusual setup where the NAS sits next to my PC. I've got 8 spinning 4TB drives in raid-6 fronted by two 256GB SSDs in raid-0 in read-only (SATA, because qnap doesn't support pcie-based nvme). Now that I've started dealing with 4k, I'm considering a dedicated 10GbE connection, although I don't normally do multicam, and 140mbps is a pretty simple thing for straight gigabit. I'm more concerned about the pre-rendered stuff being able to be streamed adequately. The NAS has 8GB, my computer has 16GB. Edius itself doesn't really require a lot of ram.
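A quick back-of-the-envelope check of the network headroom, as a Python sketch. The 140 Mbps per-stream figure is from the post; the 0.9 usable-link fraction is an assumed allowance for protocol overhead, not a measured value.

```python
def max_streams(link_mbps: float, stream_mbps: float, usable_fraction: float = 0.9) -> int:
    """Whole number of constant-bitrate streams that fit on a link."""
    if stream_mbps <= 0:
        raise ValueError("stream bitrate must be positive")
    return int(link_mbps * usable_fraction // stream_mbps)


if __name__ == "__main__":
    for name, link in (("1GbE", 1_000), ("10GbE", 10_000)):
        print(name, max_streams(link, 140))
```

Under that assumption a gigabit link carries about six such streams, so a single camera stream is comfortable and it is multicam or higher-bitrate pre-rendered material that would motivate 10GbE.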

    Well, I was sure this afternoon, but looking at it again, they seem as expected. Perhaps I was confused. The current labels (relative to ...) make things look fine to me this evening.

    If decode were a solved problem, editing long-gop 4:2:2 4k video wouldn't be such a difficult task. It is, though, because QS, nvdec, etc. don't support 4:2:2. Generally, you talk to CPU people and they say "but, that's just gpu" and you talk to gpu people and they say "yeah, but who watches 4:2:2 video?" So you have nvdec supporting 8k hevc formats for the broad consuming public -- all 2 of them -- but not the 4:2:2 format that's required for delivered video in various places around the world. You have Intel with qsv in their consumer chips, but not the CPUs which would otherwise be more useful in editing. Because it's just gpu. And the hardware is only there because it's useful for driving down the "watts-while-watching-bluray" numbers. And who uses an 8-core processor to watch blu-rays? :cry:

    The thing is, that hardware is really useful. In Edius it's easily worth a couple of cores. I don't have numbers for the 7700k, but the higher the decode resolution that gets supported, the greater the number of lower resolution simultaneous decodes that can run. It's why Vegas is making such a big deal of their support of it. Meanwhile we're staring at the 8k freight train and looking backwards in time towards the use of proxies. Unpleasant :(

    But, off-topic.
    Given the current immaturity of the platform, against the shifting sands of bios updates and game patches, to attempt what you've attempted is a thing worthy of note. I do appreciate it. I think it'll be really important in a few months when Zen heads up against Skylake-X. It'll likely be processor count vs frequency, and understanding the shortcomings (and not) of the former is going to go a long way to having a good discussion about the merits of the platforms. Thanks muchly.
     
  14. DeeJayBump

    DeeJayBump Member

    Joined:
    Oct 9, 2008
    Messages:
    52
    Likes Received:
    61
    Got it. Brain cramp on my part, sorry.
     
    looncraz and Dresdenboy like this.
  15. Paratus

    Paratus Lifer

    Joined:
    Jun 4, 2004
    Messages:
    10,259
    Likes Received:
    1,935
    Ars has a pretty interesting article on performance improvements from patches to games, Windows and the processor microcode. They take a pretty deep dive into what's going on.

    https://arstechnica.com/information...ryzen-showing-just-what-can-and-cant-be-done/

    It basically shows what I already expected: several areas where Ryzen appears to significantly underperform against Broadwell are mostly due to the lack of Ryzen-specific optimizations.
     
    looncraz, malitze and Dresdenboy like this.
  16. Dygaza

    Dygaza Member

    Joined:
    Oct 16, 2015
    Messages:
    171
    Likes Received:
    34
  17. CatMerc

    CatMerc Senior member

    Joined:
    Jul 16, 2016
    Messages:
    682
    Likes Received:
    652
    Heh, I see they took my request. I emailed them about doing these tests :p

    At 1066MHz clock, a message takes 100ns to cross the CCX barrier. At 1600MHz, a message takes 71ns to cross the CCX barrier. Looks like around linear scaling.
    4000MHz RAM should reduce this to 55ns, so a total of 95ns ish. Close to Intel's 80ns.

    If they fixed the DF clock at 4GHz, it would be 70ns ish worst case, or just 27.5ns added from the data fabric.
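The extrapolation above can be reproduced with a small Python sketch, assuming the cross-CCX portion of the latency is inversely proportional to the data fabric clock. The 40 ns fixed portion is an assumption inferred from the 55 ns + 40 ns total of about 95 ns quoted in the post; the function names are illustrative.

```python
FIXED_NS = 40.0  # assumed clock-independent part (intra-CCX latency)


def fabric_ns(ref_ns: float, ref_mhz: float, target_mhz: float) -> float:
    """Scale a measured fabric latency to another fabric clock (latency ~ 1/clock)."""
    return ref_ns * ref_mhz / target_mhz


def total_ns(ref_ns: float, ref_mhz: float, target_mhz: float) -> float:
    """Fixed part plus the scaled fabric part."""
    return FIXED_NS + fabric_ns(ref_ns, ref_mhz, target_mhz)


if __name__ == "__main__":
    # Points quoted in the post: (fabric clock in MHz, fabric latency in ns).
    points = [(1066.0, 100.0), (1600.0, 71.0)]
    for target in (2000.0, 4000.0):
        for mhz, ns in points:
            print(f"from {mhz:.0f} MHz -> {target:.0f} MHz: "
                  f"fabric {fabric_ns(ns, mhz, target):.1f} ns, "
                  f"total {total_ns(ns, mhz, target):.1f} ns")
```

Both reference points give roughly 53 to 57 ns of fabric latency at a 2000 MHz fabric clock and 27 to 28 ns at 4000 MHz, in line with the 55 ns and 27.5 ns figures in the post.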
     
    #1492 CatMerc, Apr 11, 2017
    Last edited: Apr 11, 2017
    Dresdenboy, french toast and TerionX6 like this.
  18. imported_jjj

    imported_jjj Senior member

    Joined:
    Feb 14, 2009
    Messages:
    660
    Likes Received:
    430
    french toast and T1beriu like this.
  19. TerionX6

    TerionX6 Junior Member

    Joined:
    Jun 29, 2015
    Messages:
    14
    Likes Received:
    20
    Ryzen CCX latencies
    2133 > 2400
    1/8th increase in DF/mem clock gives ~1/12.5th decrease in latency
    scaling of .96

    2400 > 2933
    1/4.5th increase in DF/mem clock gives ~1/8.19th decrease in latency
    scaling of .918

    2933 > 3200
    1/11th increase in DF/mem clock gives ~1/25.8th decrease in latency
    scaling of .95

    Following these figures, I no longer believe a 2GHz DF clock (4GHz RAM speed) would lower latency so much. My calculations show ~95ns at best, which is within 15% of Intel's monolithic approach. Still impressive!
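One way to sanity-check ratios like these is to compare the observed latency drop with the drop expected if latency were exactly inversely proportional to clock. The Python sketch below does that. The "efficiency" definition is an assumption about what "scaling" means here, so its results only roughly match the quoted figures (about 0.97, 0.93 and 0.95 for the three steps).

```python
def scaling_efficiency(old_mts: float, new_mts: float, latency_drop: float) -> float:
    """Ideal latency ratio divided by observed latency ratio.

    latency_drop is the observed fractional decrease (e.g. 1/12.5).
    A result of 1.0 means latency fell exactly in proportion to the clock increase.
    """
    ideal_ratio = old_mts / new_mts      # latency ratio if latency ~ 1/clock
    observed_ratio = 1.0 - latency_drop  # latency ratio actually seen
    return ideal_ratio / observed_ratio


if __name__ == "__main__":
    # (old speed, new speed, observed fractional latency decrease) from the post.
    steps = [(2133, 2400, 1 / 12.5), (2400, 2933, 1 / 8.19), (2933, 3200, 1 / 25.8)]
    for old, new, drop in steps:
        print(f"{old} -> {new}: efficiency {scaling_efficiency(old, new, drop):.3f}")
```

Values below 1.0 are what one would expect if part of the measured latency does not depend on the fabric clock at all.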

    Ciao,
    Terion
     
    T1beriu likes this.
  20. CatMerc

    CatMerc Senior member

    Joined:
    Jul 16, 2016
    Messages:
    682
    Likes Received:
    652
    I'm glad someone finally tested core to core latency for Skylake quad cores. It appears the intra CCX latency is identical to a quad core Skylake at 40ns. Inter-CCX latency however jumps up even beyond the 80ns mark of Broadwell-E. I imagine with 4000MT/s RAM, the latency difference against Broadwell-E won't be significant enough to have a real effect, while in best case it will still be better.

    The question is how Skylake-X fares. Will it maintain the 40ns latency to all cores? Because if so, then that fabric will have some work to do lol
     
  21. TerionX6

    TerionX6 Junior Member

    Joined:
    Jun 29, 2015
    Messages:
    14
    Likes Received:
    20
    I am more curious about the latency differences between Naples and Intel's top-end server SKUs. It's said that the latency of Intel's ring implementation grows as more cores are added. While Naples will have to deal not just with inter-CCX comms but also with communication delays between the 8-core clusters, Intel's current designs have to deal with ever larger ring delays. If we could get our hands on latency tests of those fancy 28-core Xeons...

    That said, I read someone mention that they expect a mesh-based, KNL-like topology for future Xeons. I can't imagine this would be available on Skylake or any Intel 14nm design.
     
  22. hondaman

    hondaman Senior member

    Joined:
    Oct 9, 1999
    Messages:
    210
    Likes Received:
    0
    I have an ASRock Taichi with the v2.0 BIOS. I have an RX 460 in the PCI-E 1 slot (nearest the CPU) and an Nvidia 1070 in the "middle" PCI-E slot, running Ubuntu 17.04 beta. I've been trying, and failing, to get PCI-E passthrough working.
     
  23. SpecChum

    SpecChum Member

    Joined:
    Aug 16, 2007
    Messages:
    31
    Likes Received:
    8
    As you know, my 1700 couldn't (well, not consistently) hit 3200 on my gskill 3200c14 memory so I decided to buy another and swap it out last night.

    Result?

    3200c14 ram first time every time. Nothing has changed but the CPU.

    Obviously not conclusive by any means, but food for thought.
     
    tamz_msc, krumme and looncraz like this.
  24. Timur Born

    Timur Born Member

    Joined:
    Feb 14, 2016
    Messages:
    81
    Likes Received:
    58
    Ambient 21°C, Radiator 21.5°C, Sense Skew enabled (Auto/defaults), "Power Saver" W10 profile

    Idle:

    [IMG]

    Idle with WmiPrvSE.EXE background load:

    [IMG]

    Different CPU load profiles (power vs. temperature), x-axis not aligned:

    Power:
    [IMG]
    Temperature:
    [IMG]

    Sorry for the typo, I meant core 15 "odd". Cores 1-16 (15 = odd) in Statuscore correspond to cores 0-15 in Task Manager, so core 15 in Statuscore is core 14 (even) in Task Manager. Statuscore uses different CPU instruction sets for its even and odd core stress tests!

    And to make things a bit more complicated: :p

    [IMG]
     
    lightmanek likes this.
  25. Timur Born

    Timur Born Member

    Joined:
    Feb 14, 2016
    Messages:
    81
    Likes Received:
    58
    Using Statuscore's stresstest on different CPU cores:

    [IMG]
    [IMG]