Ryzen: Strictly technical

Discussion in 'CPUs and Overclocking' started by The Stilt, Mar 2, 2017.

  1. moinmoin

    moinmoin Member

    Joined:
    Jun 1, 2017
    Messages:
    199
    Likes Received:
    108
    The big problem with this whole hardware bug is that to this day nobody has been able to boil the cause down to a minimal example, which necessarily means nobody knows how to completely avoid it. Without any way to guarantee the bug is avoided, no software at any level, be it kernel, OS or applications, can be expected to even try to avoid it. That most existing owners haven't encountered it so far doesn't mean this will remain true with all future software, so potentially everybody can be affected at some point.

    Of course you can pretend it will never affect you, and I hope that works out well for you. But AMD at the very least needs to offer a software workaround that guarantees the hardware bug is never triggered, even in the cases that currently make it easy to trigger. If even that is not possible, the only prudent option is an offer to exchange older chips.
     
    CatMerc, wildhorse2k and Space Tyrant like this.
  2. IEC

    IEC Lifer

    Joined:
    Jun 10, 2004
    Messages:
    12,828
    Likes Received:
    1,945
    Neither I nor any of the people I have built Ryzen systems for (I think I'm up to 8 rigs total now?) have run into this issue. Granted, anything I'm compiling usually takes minutes, not days.
     
    Drazick likes this.
  3. mattiasnyc

    mattiasnyc Member

    Joined:
    Mar 30, 2017
    Messages:
    122
    Likes Received:
    66
    That's probably true for a very large number of CPUs, unless they were all completely flawless, which I seriously doubt. In other words, people worry about this for good reason if they run software that taxes the CPU very heavily under very specific workloads; outside of that it's really a tempest in a teapot. If not all CPUs are flawless, then every single time you buy a CPU and are unaware of a flaw in it you're subject to exactly the same situation; you just don't know about it, so you don't worry about it.

    That sounds only moderately sincere and more passive-aggressive than anything. I don't "pretend" it won't ever affect me; I'm simply saying that the probability is very, very high that it will never affect most users. Why? Because the CPU has been out and about for months now and the vast majority of users haven't run into it. Not even reviewers did for a long time.

    The sky isn't falling. It's just needless drama from all of those who most likely won't be affected.
     
  4. Space Tyrant

    Space Tyrant Member

    Joined:
    Feb 14, 2017
    Messages:
    53
    Likes Received:
    29
    Yes, clearly it is uncommon to encounter this problem 'naturally', and it's also difficult to put it into 'proper' perspective when it's reported by only a (small) self-selected subgroup. But it does seem to be something that occurs only rarely under 'natural' conditions.

    However, I ran into it -- without specifically looking for it -- a couple of days ago. I was compiling the latest stable kernel to experiment with. Along the way, I decided to benchmark SMT to see the boost it would provide in the task of compiling C code. First, I noticed erratic timing results, then found some binaries were missing...

    I've never encountered it compiling my own projects, the largest of which is about 18,000 lines of code and takes about 5 seconds to build.

    But when compiling Linux (~ 11 minutes on my system) I would hit it in about half of the attempts. That dropped to ~1 in 50 after I disabled ASLR, however.
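
    In case anyone wants to reproduce that, ASLR can be toggled at runtime through the standard sysctl knob (nothing Ryzen-specific here; remember to re-enable it when you're done):

    Code:
        # Check the current setting (2 = full randomization is the usual default)
        cat /proc/sys/kernel/randomize_va_space

        # Disable ASLR system-wide for newly started processes
        echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

        # Or disable it for a single build only
        setarch "$(uname -m)" -R make -j"$(nproc)"

        # Re-enable full randomization afterwards
        echo 2 | sudo tee /proc/sys/kernel/randomize_va_space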

    As for the SMT timing difference -- I got a ~27% speedup. Nice!

    Edit: number typo
     
    #1704 Space Tyrant, Aug 5, 2017
    Last edited: Aug 5, 2017
  5. BradC

    BradC Junior Member

    Joined:
    Apr 24, 2017
    Messages:
    18
    Likes Received:
    12
    What about long-running processes? I've had normal processes die with less than 6 weeks of uptime. Like anything, lots of load brings it on faster, but a normal system running a couple of VMs and some native daemons like CUPS (which I ordinarily reboot maybe once a year for a kernel upgrade) should not have processes dying randomly. Its predecessor, an FX-8350 (which is still in active service until I can get the Ryzen stable), will happily tool along without random process deaths for as long as I care to run it. So yes, I can bring it on rapidly by running a stress test, but long-running processes die too, just far less frequently, and it's much harder to reproduce at will.

    Not everybody is trying to use these things as a desktop, where the odd dead process isn't an issue or reboots are frequent enough that processes don't run long enough to get affected (although my Debian-based iMac desktop only gets about 2 reboots a year anyway).
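
    For what it's worth, a quick way to check whether a box has been quietly losing processes is to grep the kernel log, since unhandled user-space segfaults are normally logged there (rough sketch, assuming the default logging of unhandled signals):

    Code:
        # Entries look like: myapp[1234]: segfault at ... ip ... sp ... error 4 in ...
        dmesg -T | grep -i segfault

        # On a systemd box, the previous boot can be checked as well
        journalctl -k -b -1 | grep -i segfault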
     
    Space Tyrant likes this.
  6. scannall

    scannall Senior member

    Joined:
    Jan 1, 2012
    Messages:
    862
    Likes Received:
    376
    Is it that specific compiler? I'm wondering if there would be a difference using LLVM and Clang for instance.
     
  7. Space Tyrant

    Space Tyrant Member

    Joined:
    Feb 14, 2017
    Messages:
    53
    Likes Received:
    29
    Yes, I'm running GCC... but the problem evidently isn't there.

    Clang does demonstrate the problem. It isn't GCC, Clang, or Linux specific. It's been demonstrated under Windows when running the same loads under WSL, as well as on a couple of BSD variants, which are derivatives of the original AT&T Unix code with no 'genetic' relationship to Linux.

    Edit: related links here: https://www.reddit.com/r/Amd/comments/6rtqj0/information_i_could_find_on_these_segfault_issues/
     
    #1707 Space Tyrant, Aug 6, 2017
    Last edited: Aug 6, 2017
    moinmoin likes this.
  8. ashetos

    ashetos Senior member

    Joined:
    Jul 23, 2013
    Messages:
    222
    Likes Received:
    8
    By briefly looking at the AMD community forum, Phoronix and Reddit, it seems the consensus right now is:
    - The Phoronix custom test is flawed; it produces segmentation faults on other processors from Intel and AMD due to some buggy PHP component
    - kill-ryzen.sh is the test to use (see the rough sketch after this list)
    - There is no confirmed segmentation fault on EPYC chips yet (B2 stepping)
    - Some newer Ryzen chips from people who RMA'ed the faulty ones don't produce the segmentation fault (maybe new microcode? new stepping?)
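
    For reference, the gist of kill-ryzen.sh is to hammer the CPU with parallel compile jobs in a loop and stop on the first segfault. A simplified sketch follows (not the actual script, which builds GCC itself in several isolated workspaces; the source path here is a placeholder):

    Code:
        #!/bin/bash
        # Simplified kill-ryzen style stress loop: repeatedly run a parallel
        # build of a large C tree and stop as soon as a compiler segfaults.
        set -u
        SRC=$HOME/src/linux   # placeholder; any big, already-configured C code base works
        run=0
        while true; do
            run=$((run + 1))
            make -C "$SRC" clean >/dev/null 2>&1
            if ! make -C "$SRC" -j"$(nproc)" >/dev/null 2>build.log; then
                # GNU make reports recipes killed by SIGSEGV as
                # "*** [...] Segmentation fault" on its error output.
                if grep -q "Segmentation fault" build.log; then
                    echo "Segfault during build run $run, see build.log"
                else
                    echo "Build run $run failed for another reason, see build.log"
                fi
                break
            fi
            echo "Build run $run completed cleanly"
        done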
     
  9. moinmoin

    moinmoin Member

    Joined:
    Jun 1, 2017
    Messages:
    199
    Likes Received:
    108
    The poster stated there is no new microcode (it's the one in AGESA 1.0.0.6a) and no new stepping (still B1). The only difference appears to be the batch code on the CPU being 1725 SUS instead of 1716 PGT, where the number may indicate the manufacturing date, with the first two digits being the year and the latter two the week, so mid June instead of mid April. That poster is also the only source so far, afaik.
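
    If the first two digits really are the year and the last two the week, a quick sanity check of those two codes looks like this (a rough sketch assuming GNU date and simple week counting from January 1st, so only accurate to within a few days):

    Code:
        # 1725 -> 2017, week 25; 1716 -> 2017, week 16
        for code in 1725 1716; do
            yr=20${code:0:2}
            wk=${code:2:2}
            date -d "$yr-01-01 +$((wk - 1)) weeks" "+$code ~ %Y-%m-%d (%B)"
        done
        # prints roughly: 1725 ~ 2017-06-18 (June) and 1716 ~ 2017-04-16 (April)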
     
  10. BradC

    BradC Junior Member

    Joined:
    Apr 24, 2017
    Messages:
    18
    Likes Received:
    12
    That is brilliant news, frankly. Now all we need is some significant testing on Threadripper and we'll be able to get an idea whether it is an early stepping issue (B1 vs B2) or something specific to the Ryzen implementation. Given that some of the guys on the AMD forum have had 3 different sets of results with 3 different chips (same stepping, same microcode, different manufacturing date), it does lean toward variations in the processors themselves, but ultimately only AMD knows and they aren't telling.
     
  11. piesquared

    piesquared Golden Member

    Joined:
    Oct 16, 2006
    Messages:
    1,418
    Likes Received:
    249

    You must have clicked on a different link? This is what that reddit post says:

    So this particular test by Phoronix is invalidated.
     
    #1712 piesquared, Aug 6, 2017
    Last edited: Aug 6, 2017
  12. BradC

    BradC Junior Member

    Joined:
    Apr 24, 2017
    Messages:
    18
    Likes Received:
    12
    Were you expecting me to say it's terrible news that the test is invalid and Epyc isn't affected? I think it's great that Epyc does not suffer from this issue (or at least the samples that were tested don't).

    Epyc is stepping B2; Ryzen (and reportedly Threadripper) is B1. This just adds data points to help narrow things down. If Threadripper is unaffected, that's even better for AMD and users.

    They'll get Ryzen sorted eventually. If the whole architecture were showing the issue, it'd be catastrophic.
     
    moinmoin and lightmanek like this.
  13. teejee

    teejee Member

    Joined:
    Jul 4, 2013
    Messages:
    193
    Likes Received:
    14
    tamz_msc and Space Tyrant like this.
  14. moinmoin

    moinmoin Member

    Joined:
    Jun 1, 2017
    Messages:
    199
    Likes Received:
    108
    So Phoronix gets a response from AMD after this issue has been running for this long? Not sure how to feel about that timing.

    Anyway, AMD employee amdmatt posted the following in the long-running AMD community thread:

    "We have been working closely with a small but important subset of Linux users that have experienced segment faults when running heavy or looping compilations on their Ryzen CPU-based systems. The results of our testing and analysis indicate that segment faults can be caused by memory allocation, system configurations, thermal environments, and system marginality when running looping or heavy Linux compile workloads. The marginality is stimulated with very heavy workloads and when the system environment is not ideal. AMD is working with individual users to diagnose the issues.

    We are confident that we can help each of you identify the source of the marginality and eliminate the segment faults. We encourage all of our Linux users who are experiencing segment faults under compile workloads to continue working with AMD Customer Care. We are committed to solving this issue for all of you."
     
  15. itsmydamnation

    itsmydamnation Golden Member

    Joined:
    Feb 6, 2011
    Messages:
    1,579
    Likes Received:
    514
    So it's probably a timing/binning/power-consumption-triggered issue on a timing-critical circuit or something. If that's what it is, poor old internet strongman, now he will have nothing to do...
     
    CHADBOGA and .vodka like this.
  16. .vodka

    .vodka Senior member

    Joined:
    Dec 5, 2014
    Messages:
    953
    Likes Received:
    899
    Don't underestimate the strongman. He'll always find something new to fuel his crusade against the evil in this world.

    -------------------------------------

    The interesting thing here is that Threadripper uses B1 stepping chips and AM4 Ryzen also uses B1 chips, yet TR doesn't have the problem while some AM4 Ryzens do.

    Bad batch and that's it?
     
    Drazick and CatMerc like this.
  17. Veradun

    Veradun Senior member

    Joined:
    Jul 29, 2016
    Messages:
    208
    Likes Received:
    192
    Or maybe... uhm... "hyperbinning"?
     
  18. plopke

    plopke Member

    Joined:
    Jan 26, 2010
    Messages:
    171
    Likes Received:
    44
    Is that a good thing or a bad thing?
     
  19. itsmydamnation

    itsmydamnation Golden Member

    Joined:
    Feb 6, 2011
    Messages:
    1,579
    Likes Received:
    514
    It's just a thing; broken ones probably can't be fixed while maintaining their current bin. Going forward, chip quality improves and more specific tests can be carried out.

    With a new stepping the circuits in question could be reworked to remove the race condition etc.
     
    coercitiv likes this.
  20. moinmoin

    moinmoin Member

    Joined:
    Jun 1, 2017
    Messages:
    199
    Likes Received:
    108
    Considering that on repeated RMAs AMD now apparently manually recreates the customers' systems to find and send chips that don't show the issue, I'd call it a bad thing. But it's certainly nice of them to actually make the effort.
     
  21. ub4ty

    ub4ty Senior member

    Joined:
    Jun 21, 2017
    Messages:
    249
    Likes Received:
    274
    The first thing I'm doing once I get Threadripper is testing whether what they stated is actually true... If it isn't, I am going to make sure it's known that it's more than likely going right back to where it came from, along with any supporting hardware.


    The timing is clear. Once it gained prominence and more public-facing visibility, they decided to somewhat spill the beans. As long as it was restricted to a forum post, where people did most of the hard work to uncover and detail the bug, they didn't care to comment publicly.

    The poor old internet strongman is likely the only reason this bug got discovered and addressed. Funny that you forget that after all the hard work is done and the issue is being addressed without any effort of your own, beyond making comments like this about a group of diligent people who dedicated their time to getting a fix in and the problem addressed. More specifically, the problem seems to be with the op cache, whose circuitry/logic/timings would be severely tested in a multi-threaded code compilation scenario. I doubt the internet weakman peanut-gallery commenters have the understanding, or the concern for anything beyond themselves, to have uncovered such an issue. So I thank the strongman for doing the unpaid work for general society that the weakman couldn't bring themselves to do. Time and time again it's clear and proven that it's because of people who actually care that complex systems don't fall apart altogether when pushed to their spec.
     
    CHADBOGA and Space Tyrant like this.
  22. moinmoin

    moinmoin Member

    Joined:
    Jun 1, 2017
    Messages:
    199
    Likes Received:
    108
    Phoronix is still plenty niche, targeting supposedly the only audience affected by this issue. Rather, the fact that they went public without any fix, workaround or further promise thereof tells me they came to the conclusion that it's not something they can fix in microcode or software, but rather a silicon/binning issue of early chips.
     
  23. .vodka

    .vodka Senior member

    Joined:
    Dec 5, 2014
    Messages:
    953
    Likes Received:
    899
    Nah, it wasn't poor old juan who single-handedly made all of this happen. He sure ranted pages and pages on this, as he does on everything AMD. I wonder why he isn't here to spread his word. I also see many of the resident trolls have moved over to other forums like [H], something I welcome because it makes for a much better experience here, since we aren't pulling our hair out trying to have an adult conversation. I'd better hope I didn't just summon him, or them, back here...

    There are actual people out there who aren't on a crusade against AMD, who got hit by this problem and spoke up, more and more every day... search around a bit; it's all over major forums and specific communities.

    If AMD is accepting RMAs for this issue to the extent of replicating customer systems to find a proper CPU to send, then it's a weird bug to track down. Again, TR and AM4 Ryzen both use B1 stepping chips; TR doesn't have the problem, yet some AM4 Ryzens do. What the hell?

    I'm again putting my money on a bad batch: something apart from the chip design itself got borked somewhere, and compilation triggers the issue. "Performance marginality" sounds like that.
     
    #1724 .vodka, Aug 9, 2017
    Last edited: Aug 9, 2017
    Space Tyrant, Drazick and IEC like this.
  24. Timur Born

    Timur Born Member

    Joined:
    Feb 14, 2016
    Messages:
    81
    Likes Received:
    58
    Surprise, surprise. Turns out that CCX bottlenecks can be objectively measured as increased CPU load!

    The following tests ran 8 (main) threads of DAWBench (RCX-EXT, 7 tracks x 8 FX plugins) in Reaper on 4 cores. Left side is 4 cores of CCX1, right side is 2 cores of CCX0 + 2 cores of CCX1. SMT was disabled.

    What can be seen here is that when running this particular test load on a single CCX (left), memory bandwidth, memory latency and Infinity Fabric frequency/latency don't matter at all. As such I conclude that the increased CPU load is caused by inter-CCX communication. The results also show that the additional CPU load drops as Infinity Fabric frequency/latency improves, but at some point it stops improving.

    This also demonstrates that the Infinity Fabric bottlenecks memory latency up to a certain point. Latencies should all have been roughly the same in all measurements, with lower frequency being offset by tighter timings. Granted, I used the same (tight) secondary and tertiary timings and only bothered to change the primary ones, but you get the picture.
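
    As a rough sanity check on "lower frequency offset by tighter timings": first-word CAS latency is approximately CL x 2000 / MT/s nanoseconds. The speed/CL pairs below, other than 2133-C9, are just illustrative examples, not my exact settings:

    Code:
        # Approximate first-word latency in ns for a few speed/CL combinations
        for cfg in "2133 9" "2666 12" "3200 14"; do
            set -- $cfg
            echo "DDR4-$1 CL$2: $(echo "scale=2; $2 * 2000 / $1" | bc) ns"
        done
        # -> ~8.4 ns, ~9.0 ns, ~8.7 ns: roughly the same ballpark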

    (2133-C15 returned about the same results as 2133-C9, so only one is posted here)

    [Four screenshots of the CPU load measurements for the different memory/Infinity Fabric configurations were attached here.]
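
    For anyone wanting to approximate the same single-CCX vs. split-CCX comparison on Linux, process affinity can be set with taskset (a rough sketch: ./my_workload is a placeholder, and the core numbering assumes an 8-core part with SMT off where logical CPUs 0-3 are CCX0 and 4-7 are CCX1, so verify your own mapping first):

    Code:
        # Four cores on one CCX
        taskset -c 4,5,6,7 ./my_workload

        # Two cores on each CCX (forces inter-CCX traffic)
        taskset -c 2,3,4,5 ./my_workload

        # Check how logical CPUs map to cores/caches on your system
        lscpu -e
        lstopo --no-io   # from hwloc, shows the L3/CCX grouping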
     
    coercitiv, MajinCry, tamz_msc and 8 others like this.