Ryzen: Strictly technical

moinmoin · Aug 5, 2017

mattiasnyc said:
I disagree. Most owners probably never encounter this issue, as far as I can tell. If this was 'bad news' for "all the owners" that haven't gotten an RMA then surely they'd all already be suffering from issues. That doesn't seem to be the case at all. Actually I don't recall a single review that disclosed this particular problem. Not one. Surely this tells us that this issue is very much a "corner case", or whatever one calls it....

The big problem about this whole hardware bug is that to this day nobody was able to boil down the cause to a minimum example which necessarily means nobody knows how to completely avoid it. Without any way to guarantee the hardware bug is avoided no software at any level, be it kernel, OS or programs can be expected to even try avoiding it. That most existing owners didn't encounter it so far doesn't mean this will be true with all future software as well, so potentially everybody can be affected at some point.

Of course you can pretend it will never affect you and I hope this works out well for you. But AMD at the very least needs to offer a software workaround that guarantees that the hardware bug is never triggered even in all the cases that currently make bug easy to trigger. If even that is not possible only an offer to exchange older chips is prudent.

IEC · Aug 5, 2017

Neither I or any of the people I have built Ryzen systems for (I think I'm up to 8 rigs total now?) have run into this issue. Granted, anything I'm compiling usually takes minutes, not days.

mattiasnyc · Aug 5, 2017

moinmoin said:
The big problem about this whole hardware bug is that to this day nobody was able to boil down the cause to a minimum example which necessarily means nobody knows how to completely avoid it. Without any way to guarantee the hardware bug is avoided no software at any level, be it kernel, OS or programs can be expected to even try avoiding it. That most existing owners didn't encounter it so far doesn't mean this will be true with all future software as well, so potentially everybody can be affected at some point.

That's probably true for a very large amount of CPUs unless they all were completely flawless, which I seriously doubt. In other words, people worry about this for good reason if they run software that taxes the CPU very heavily under very specific workloads, and outside of that it's really a tempest in a teapot. If not all CPUs are flawless then every single time you buy a CPU and are unaware of a flaw in it you're subject to exactly the same situation, you just don't know about it, so you don't worry about it.

moinmoin said:
Of course you can pretend it will never affect you and I hope this works out well for you.

That sound moderately sincere and more passive aggressive than anything. I don't "pretend" it won't ever affect me, I'm simply saying that the probability is very very high that it will never affect most users. Why? Because the CPU has been out and about for months now and the vast majority of users haven't run into it. Not even reviewers did for a long time.

The sky isn't falling. It's just needless drama from all of those who most likely won't be affected.

Space Tyrant · Aug 5, 2017

IEC said:
Neither I or any of the people I have built Ryzen systems for (I think I'm up to 8 rigs total now?) have run into this issue. Granted, anything I'm compiling usually takes minutes, not days.

Yes, clearly it is uncommon to encounter this problem 'naturally'. And it's also difficult to put it into 'proper' perspective when it's reported by only a (small) self-selected subgroup. But it does seem to be something that occurs rarely under 'natural' conditions.

However, I ran into it -- without specifically looking for it -- a couple of days ago. I was compiling the latest stable kernel to experiment with. Along the way, I decided to benchmark SMT to see the boost it would provide in the task of compiling C code. First, I noticed erratic timing results, then found some binaries were missing...

I've never encountered it compiling my projects, the largest portion of which is about 18,000 lines of code and takes about 5 seconds to build.

But when compiling Linux (~ 11 minutes on my system) I would hit it in about half of the attempts. That dropped to ~1 in 50 after I disabled ASLR, however.

As for the SMT timing difference -- I got a ~27% speedup. Nice!

Edit: number typo

BradC · Aug 6, 2017

IEC said:
Neither I or any of the people I have built Ryzen systems for (I think I'm up to 8 rigs total now?) have run into this issue. Granted, anything I'm compiling usually takes minutes, not days.

What about long running processes? I've had normal processes die with less than 6 weeks of uptime. Like anything lots of load brings it on faster, but a normal system running a couple of VMs and some native daemons like CUPS(which I ordinarily reboot maybe once a year for a kernel upgrade) should not have processes dying randomly. It's predecessor FX-8350 (which is still in active service until I can get the Ryzen stable) will happily tool along without random process deaths for as long as I care to run it. So yes, I can bring it on rapidly by running a stress test, but long term processes die too, just far less frequently and much harder to reproduce at will.

Not everybody is trying to use these things as a desktop where the odd dead process isn't an issue or reboots are frequent enough not to leave processes running long enough to get affected (although my Debian based iMac desktop only gets about 2 reboots a year anyway).

scannall · Aug 6, 2017

Space Tyrant said:
Yes, clearly it is uncommon to encounter this problem 'naturally'. And it's also difficult to put it into 'proper' perspective when it's reported by only a (small) self-selected subgroup. But it does seem to be something that occurs rarely under 'natural' conditions.

However, I ran into it -- without specifically looking for it -- a couple of days ago. I was compiling the latest stable kernel to experiment with. Along the way, I decided to benchmark SMT to see the boost it would provide in the task of compiling C code. First, I noticed erratic timing results, then found some binaries were missing...

I've never encountered it compiling my projects, the largest portion of which is about 18,000 lines of code and takes about 5 seconds to build.

But when compiling Linux (~ 11 minutes on my system) I would hit it in about half of the attempts. That dropped to ~1 in 50 after I disabled ASLR, however.

As for the SMT timing difference -- I got a ~27% speedup. Nice!

Edit: number typo

Is it that specific compiler? I'm wondering if there would be a difference using LLVM and Clang for instance.

Space Tyrant · Aug 6, 2017

scannall said:
Is it that specific compiler? I'm wondering if there would be a difference using LLVM and Clang for instance.

Yes, I'm running GCC... but the problem evidently isn't there.

Clang does demonstrate the problem. It isn't GCC, Clang, or Linux specific. It's been demonstrated under Windows when running the same loads under WSL as well as a couple of BSD variants -- which are derivatives of the original AT&T Unix code with no 'genetic' relationship to Linux.

Edit: related links here: https://www.reddit.com/r/Amd/comments/6rtqj0/information_i_could_find_on_these_segfault_issues/

ashetos · Aug 6, 2017

By briefly looking at amd community forum, phoronix and reddit it seems that the consensus right now is:
- Phoronix custom test is flawed, it produces segmentation faults on other processors from intel and AMD due to some buggy PHP component
- kill-ryzen.sh is the test to use
- There is no confirmed segmentation fault on EPYC chips yet (B2 stepping)
- Some newer ryzen chips from people who RMA'ed the faulty ones don't produce the segmentation fault (maybe new microcode? new stepping?)

moinmoin · Aug 6, 2017

ashetos said:
- Some newer ryzen chips from people who RMA'ed the faulty ones don't produce the segmentation fault (maybe new microcode? new stepping?)

The poster stated no new microcode (the one in AGESA 1.0.0.6a) and no new stepping (still B1). The only difference appears to be the code on the CPU being 1725 SUS instead 1716 PGT where the number may indicate the manufacturing date, with the first two digits is the year and the latter two the week, so mid June instead mid April. That poster is also the only source so far afaik.

DrMrLordX · Aug 6, 2017

https://www.reddit.com/r/Amd/comments/6runcc/reported_epyc_segfault_might_not_be_true/

Hmmmmmm!

BradC · Aug 6, 2017

DrMrLordX said:
https://www.reddit.com/r/Amd/comments/6runcc/reported_epyc_segfault_might_not_be_true/

Hmmmmmm!

That is brilliant news frankly. Now all we need is some significant testing on Threadripper and we'll be able to get an idea if it is an early stepping issue (B1 vs B2) or something specific in the Ryzen implementation. Given some of the guys on the AMD forum have had 3 different sets of results with 3 different chips (same stepping, same microcode, different manuf date) it does lean toward variations in the processor, but ultimately only AMD knows and they aren't telling.

piesquared · Aug 6, 2017

BradC said:
That is brilliant news frankly. Now all we need is some significant testing on Threadripper and we'll be able to get an idea if it is an early stepping issue (B1 vs B2) or something specific in the Ryzen implementation. Given some of the guys on the AMD forum have had 3 different sets of results with 3 different chips (same stepping, same microcode, different manuf date) it does lean toward variations in the processor, but ultimately only AMD knows and they aren't telling.

You must have clicked on a different link? This is what that reddit post says:

Posters on AMD Support page are stating that the EPYC segfaults from the Phoronix test suite are not a problem with the CPU. But rather it is a know segfault that occurs with PHP and conftest. Phoronix needs to be run without PHP.

So this particular test by Phoronix is invalidated.

BradC · Aug 6, 2017

piesquared said:
So this particular test by Phoronix is invalidated.

Were you expecting me to say it's terrible news that the test is invalid and Epyc isn't affected? I think it's great that Epyc does not suffer from this issue (or at least the samples that were tested).

Epyc is stepping B2, Ryzen (and reportedly Threadripper) is/are B1. This just helps with additional datapoints to narrow things down. If Threadripper is unaffected then it's even better for AMD and users.

They'll get Ryzen sorted eventually. If the whole architecture was showing the issue then it'd be catastrophic.

teejee · Aug 7, 2017

Response from AMD below. EPYC and TR not affected by the issue. But seems like early Ryzen needs to be RMA'd if you have this issue. I e no micro code fix possible as I interpret AMD's response.

https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response

Sent from my LG-D855 using Tapatalk

moinmoin · Aug 7, 2017

So Phoronix gets a response from AMD after this issue has been running for this long? Not sure what to feel about that timing.

Anyway AMD employee amdmatt posted following in the long running AMD community thread:

"We have been working closely with a small but important subset of Linux users that have experienced segment faults when running heavy or looping compilations on their Ryzen CPU-based systems. The results of our testing and analysis indicate that segment faults can be caused by memory allocation, system configurations, thermal environments, and system marginality when running looping or heavy Linux compile workloads. The marginality is stimulated with very heavy workloads and when the system environment is not ideal. AMD is working with individual users to diagnose the issues.

We are confident that we can help each of you identify the source of the marginality and eliminate the segment faults. We encourage all of our Linux users who are experiencing segment faults under compile workloads to continue working with AMD Customer Care. We are committed to solving this issue for all of you."

itsmydamnation · Aug 7, 2017

so its probably a timing/binnig/power consumption triggered issue on a timing critical circuit or something. If thats what it is poor old internet strong man, now he will have nothing to do.......

.vodka · Aug 7, 2017

itsmydamnation said:
so its probably a timing/binnig/power consumption triggered issue on a timing critical circuit or something. If thats what it is poor old internet strong man, now he will have nothing to do.......

Don't underestimate the strongman. He'll always find something new to fuel his crusade against the evil in this world.

-------------------------------------

The interesting thing here is that Threadripper is using stepping B1 chips, AM4 Ryzen is also using B1 chips. TR doesn't have the problem, some AM4 Ryzens do.

Bad batch and that's it?

Veradun · Aug 8, 2017

.vodka said:
The interesting thing here is that Threadripper is using stepping B1 chips, AM4 Ryzen is also using B1 chips. TR doesn't have the problem, some AM4 Ryzens do.

Bad batch and that's it?

Or maybe... uhm... "hyperbinning" ?

plopke · Aug 8, 2017

itsmydamnation said:
so its probably a timing/binnig/power consumption triggered issue on a timing critical circuit or something. If thats what it is poor old internet strong man, now he will have nothing to do.......

Is that a good thing or a bad thing ?

itsmydamnation · Aug 9, 2017

plopke said:
Is that a good thing or a bad thing ?

It's just a thing, broken ones probably can't be fixed while maintaining its current bin. Going forward chip quality improves and more specific tests can be carried out.

With a new stepping the curcuits in question could be worked on to remove the race condition etc.

moinmoin · Aug 9, 2017

plopke said:
Is that a good thing or a bad thing ?

Considering AMD now on repeated RMA's apparently manually recreates the customers' systems to find and send chips that don't showcase the issue I'd call it a bad thing. But it's certainly nice of them they actually do the effort.

ub4ty · Aug 9, 2017

BradC said:
That is brilliant news frankly. Now all we need is some significant testing on Threadripper and we'll be able to get an idea if it is an early stepping issue (B1 vs B2) or something specific in the Ryzen implementation. Given some of the guys on the AMD forum have had 3 different sets of results with 3 different chips (same stepping, same microcode, different manuf date) it does lean toward variations in the processor, but ultimately only AMD knows and they aren't telling.

The first thing I'm doing once I get threadripper is testing to see if what they stated is actually true... If it isn't, I am going to be sure to make sure it's known its more than likely going right back to where it came from along with any supporting hardware.

moinmoin said:
So Phoronix gets a response from AMD after this issue has been running for this long? Not sure what to feel about that timing.

Anyway AMD employee amdmatt posted following in the long running AMD community thread:

"We have been working closely with a small but important subset of Linux users that have experienced segment faults when running heavy or looping compilations on their Ryzen CPU-based systems. The results of our testing and analysis indicate that segment faults can be caused by memory allocation, system configurations, thermal environments, and system marginality when running looping or heavy Linux compile workloads. The marginality is stimulated with very heavy workloads and when the system environment is not ideal. AMD is working with individual users to diagnose the issues.

We are confident that we can help each of you identify the source of the marginality and eliminate the segment faults. We encourage all of our Linux users who are experiencing segment faults under compile workloads to continue working with AMD Customer Care. We are committed to solving this issue for all of you."

The timing is clear. Once it got prominent and more publicly facing visibility, they decided to somewhat spill the beans. As long as it was restricted to a forum post where people did most of the hard work to uncover and detail the bug, they didn't care to comment publicly.

itsmydamnation said:
so its probably a timing/binnig/power consumption triggered issue on a timing critical circuit or something. If thats what it is poor old internet strong man, now he will have nothing to do.......

The poor old internet strong man is the only reason this bug likely got discovered and addressed. Funny that you forget about that after all the hard work is done and the issue is being addressed w/o any effort of your own beyond making comments like this about a group of diligent people who dedicated time to get a fix in and the problem addressed. More clearly the problem seems to be with OPCache whose circuitry/logic/timings would be several tested in a scenario of multi-threaded code compilation. I doubt internet weakmen peanut gallery commenters have the understanding or even the common concern beyond themselves to have uncovered such an issue. So, I thank the strong man for doing the unpaid work for general society that weak man couldn't bring themselves to do. Time and time again, it's clear and proven that it's because of people that actually care that complex systems don't all together fall apart when pushed to their spec.

moinmoin · Aug 9, 2017

ub4ty said:
The timing is clear. Once it got prominent and more publicly facing visibility, they decided to somewhat spill the beans. As long as it was restricted to a forum post where people did most of the hard work to uncover and detail the bug, they didn't care to comment publicly.

Phoronix is still plenty niche, targeting the supposedly only audience affected by this issue. Rather the fact that they went public without any fix, workaround or further promises thereof tells me they came to the conclusion that it's not something they can fix in microcode or software but is rather a silicon/binning issue of early chips.

.vodka · Aug 9, 2017

ub4ty said:
The poor old internet strong man is the only reason this bug likely got discovered and addressed.

Nah, it wasn't poor old juan who single handedly made all of this happen. He sure ranted pages and pages on this, as he does on everything AMD. I wonder why isn't he here to spread his word. I also see many of the resident trolls have moved over to other forums like [H], something I welcome because it makes for a much better experience here since we aren't pulling our hair out trying to have an adult conversation. I'd better hope I didn't summon him or them again back here...

There are actual people out there that aren't in a crusade against AMD who got hit by this problem and spoke up, more and more every day... search around a bit, it's all over major forums and specific communities.

If AMD is accepting RMAs for this issue to the extent of replicating customer systems to find a proper CPU to send, then it's a weird bug to track down. Again, TR and AM4 Ryzen both use B1 stepping chips, TR doesn't have the problem, some AM4 Ryzens do. What the hell?

I'm again putting my money on a bad batch, something somewhere apart from the chip itself got borked somewhere and compilation triggers the issue. "Performance marginality" sounds like that.

Timur Born · Aug 21, 2017

Surprise, surprise. Turns out that CCX bottlenecks can be objectively measured as increased CPU load!

The following tests ran 8 (main) threads of DAWBench (RCX-EXT, 7 tracks x 8 FX plugins) in Reaper on 4 cores. Left side is 4 cores of CCX1, right side is 2 cores of CCX0 + 2 cores of CCX1. SMT was disabled.

What can be seen here is that for running this particular test load on a single CCX (left) then memory bandwidth, memory latency and Infinity Fabric frequency or latency all don't matter. As such I conclude that the increased CPU load is based on inter CCX communication. This also demonstrates that the additional CPU load drops as Infinity Fabric frequency/latency improves, but at one point it doesn't improve anymore.

This also demonstrates that the Infinity Fabric bottlenecks memory latency up to a certain point. Latencies should all have been roughly the same in all measurements, with lower frequency being offset by tighter timings. Granted, I did used the same (tight) secondary and tertiary timings and only cared to change the primary ones, but you get the picture.

(2133-C15 returned about the same results as 2133-C9, so only one is posted here)

Ryzen: Strictly technical

Diamond Member

Elite Member

Senior member

Member

Junior Member

Golden Member

Member

Senior member

Diamond Member

Lifer

Junior Member

Golden Member

Junior Member

Senior member

Diamond Member

Diamond Member

Golden Member

Senior member

Senior member

Diamond Member

Diamond Member

Senior member

Diamond Member

Golden Member

Senior member