Ryzen: Strictly technical


moinmoin

Diamond Member
Jun 1, 2017
4,934
7,619
136
Sorry, by "easy to reproduce" I should have made clear that I hit it about 10 times a day (at least) on a staging system that I'm really like to put into production. AGESA upgrade made zero difference and I shouldn't have to fart around with different version compilers or options that my UEFI doesn't have in any case. This shouldn't happen on plain old x86 code that works on every other CPU reliably. My point was there appears to be no outward community interaction on the part of AMD and a boat load of people putting relatively reliable test cases together because the bug *is* so damn easy to hit.

Sure, it's not a 10-line test case that crashes every time (yet), but AMD would have a bucketload more knowledge and instrumentation at their disposal. I would assume they're looking into it; it's just odd to get zero feedback at all.

I certainly could not recommend anyone put a Zen-based system into a production environment until they get it sorted. I've seen other processes die the same way gcc does, it's just a lot more "random" and consequently harder to reproduce. That sort of unpredictability does not instil confidence. I'm sure they'll get it sorted, but a quick "hey, we are actually really looking into this" would be helpful. So far I haven't really seen that level of engagement.
Oh absolutely, it's a very serious hardware bug that is commonly and easily encountered with the Unix userland (not only under Linux but also BSD, and in WSL under Windows which uses the Windows kernel, so it's confirmed to be an issue in the hardware, not in any specific software). My point was that we still don't know exactly what triggers the issue, so nobody has a workaround that always works either. As a result it's pure guesswork as to what exactly exacerbates the issue and what supposedly solves it. AMD absolutely must completely fix this issue before they ramp up Epyc shipments (which is supposed to happen later this year).

Another educated guess FWIW by the maintainer of DragonFly BSD:
"Hi, Matt Dillon here. Yes, I did find what I believe to be a hardware issue with Ryzen related to concurrent operations. In a nutshell, for any given hyperthread pair, if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ, the hyperthread issuing the IRETQ can stall indefinitely until the other hyperthread with the cpu-bound loop pauses (aka HLT until next interrupt). After this situation occurs, the system appears to destabilize. The situation does not occur if the cpu-bound loop is on a different core than the core doing the IRETQ. The %rip the IRETQ returns to (e.g. userland %rip address) matters a *LOT*. The problem occurs more often with high %rip addresses such as near the top of the user stack, which is where DragonFly's signal trampoline traditionally resides. So a user program taking a signal on one thread while another thread is cpu-bound can cause this behavior. Changing the location of the signal trampoline makes it more difficult to reproduce the problem. I have not been able to completely mitigate it. When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state.

The bug is completely unrelated to overclocking. It is deterministically reproducible.

I sent a full test case off to AMD in April.

I should caution here that I only have ONE Ryzen system (1700X, Asus mobo), so it's certainly possible that it is a bug in that system or a bug in DragonFly (though it seems unlikely given the particular hyperthread pairing characteristics of the bug). Only IRETQ seems to trigger it in the manner described above, which means that AMD can probably fix it with a microcode update.
"
Source
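
For a rough idea of the pattern Dillon describes, here is a minimal sketch — assuming Linux with pthreads, and assuming CPUs 0 and 1 happen to be SMT siblings on the machine; it is not his actual DragonFly test case. One thread spins in a cpu-bound loop while its sibling keeps taking a signal, so every handler return exercises the kernel's return-to-user IRETQ path.

Code:
/* Illustrative sketch only: one SMT sibling spins, the other keeps taking
 * signals so its handler returns go through the kernel's IRETQ path.
 * Assumptions: Linux, CPUs 0 and 1 are SMT siblings. Build: gcc -O2 -pthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <signal.h>

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void handler(int sig) { (void)sig; }   /* empty; the return path is the point */

static void *spinner(void *arg)
{
    (void)arg;
    pin_to_cpu(0);              /* assumed SMT sibling of CPU 1 */
    for (;;) ;                  /* cpu-bound loop that never sleeps */
}

static void *signal_taker(void *arg)
{
    (void)arg;
    pin_to_cpu(1);
    for (;;)
        pthread_kill(pthread_self(), SIGUSR1);   /* take a signal over and over */
}

int main(void)
{
    pthread_t a, b;
    signal(SIGUSR1, handler);                    /* install handler before threads start */
    pthread_create(&a, NULL, spinner, NULL);
    pthread_create(&b, NULL, signal_taker, NULL);
    pthread_join(a, NULL);                       /* never returns; run until a stall shows up */
    return 0;
}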
 

Schmide

Diamond Member
Mar 7, 2002
5,581
712
126
AMD is a power-of-2 designer; IMO they will always conform to that and subdivide everything around it. Their 4-core CCX seems to run on 8 paths (6 core-to-core, 1 memory, 1 other), all bi-directional. If you increase the core count you run into pathway spaghetti: for each power of 2 the number of pathways is the binomial coefficient C(2^x, 2) (a short sketch of the arithmetic follows the list).

2=1
4=6
8=28
16=120
32=496
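
A minimal sketch of that arithmetic, just to show where the figures above come from (the fully connected link count for n endpoints is C(n, 2) = n*(n-1)/2):

Code:
/* Links needed to fully connect n = 2^x endpoints: C(n, 2) = n*(n-1)/2.
 * Prints the same figures listed above. */
#include <stdio.h>

int main(void)
{
    for (int x = 1; x <= 5; x++) {
        unsigned n = 1u << x;                       /* 2, 4, 8, 16, 32 */
        printf("%2u endpoints -> %3u links\n", n, n * (n - 1) / 2);
    }
    return 0;
}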

I think AMD's next iteration will have the same CCX count but beef up the L3 and Infinity Fabric communication. Probably lay 4 CCXs on a die with much improved local communication.
 

maddie

Diamond Member
Jul 18, 2010
4,723
4,628
136
Having more than two CCX units per die breaks the interconnect arrangement between individual Zen dies in an MCM and in 2P Epyc. That would require a wholesale tear-up of their interconnect fabric. For platform stability purposes, Zen must keep sockets and external interfaces the same. This means that improvements are strictly kept in the die. With that in mind, this is what we're left with:

Larger L3
Improved schema for L3
Larger L2
Lower latency L2
More cores in the CCX
Improved DDR4 controller

I don't see them re-floorplanning individual cores at this point. Maybe some minor errata tweaks and a touch-up on critical paths, but that's it. Now, they can re-floorplan the whole uncore, as they will have more area to work with in their packages with a smaller feature size. This means there is room to resize things, like the L3. They can reshape the CCXs into a 3x2 grid, making inter-core communication less of a wiring mess while still expanding to 6 cores in each CCX. They could even do an 8-core CCX, but I don't think they would be able to expand the L3 sufficiently to keep 1 MB per core across the whole die.
Do you mean a shared L3 cache between 2 cores?

I was going to ask "why is this forbidden?" and then read your post. 4 double cores/CCX with similar interconnect complexity, or 3 double cores/CCX with reduced intra-die complexity.
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
4 double cores introduce a lot of extra latency, as each pair gets a three-port switch put between it and the rest of the die. Perhaps think of doing paired CCXs where each CCX is four cores (or three with an unused slot), with each CCX connected to an 8-port switch via its two external ports, alongside two outbound memory ports and two inter-CCX ports. Having the second CCX port may seem redundant, but it can be used as a direct link to the other CCX, as opposed to the existing link that seems to involve some other decisions and latency. It's just an idea to keep the CCX as similar to the current design as possible while making further improvements in throughput.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Oh absolutely, it's a very serious hardware bug that is commonly and easily encountered with the Unix userland (not only under Linux but also BSD, and in WSL under Windows which uses the Windows kernel, so it's confirmed to be an issue in the hardware, not in any specific software). My point was that we still don't know exactly what triggers the issue, so nobody has a workaround that always works either. As a result it's pure guesswork as to what exactly exacerbates the issue and what supposedly solves it. AMD absolutely must completely fix this issue before they ramp up Epyc shipments (which is supposed to happen later this year).

Another educated guess FWIW by the maintainer of DragonFly BSD:
"Hi, Matt Dillon here. Yes, I did find what I believe to be a hardware issue with Ryzen related to concurrent operations. In a nutshell, for any given hyperthread pair, if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ, the hyperthread issuing the IRETQ can stall indefinitely until the other hyperthread with the cpu-bound loop pauses (aka HLT until next interrupt). After this situation occurs, the system appears to destabilize. The situation does not occur if the cpu-bound loop is on a different core than the core doing the IRETQ. The %rip the IRETQ returns to (e.g. userland %rip address) matters a *LOT*. The problem occurs more often with high %rip addresses such as near the top of the user stack, which is where DragonFly's signal trampoline traditionally resides. So a user program taking a signal on one thread while another thread is cpu-bound can cause this behavior. Changing the location of the signal trampoline makes it more difficult to reproduce the problem. I have not been able to completely mitigate it. When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state.

The bug is completely unrelated to overclocking. It is deterministically reproducible.

I sent a full test case off to AMD in April.

I should caution here that I only have ONE Ryzen system (1700X, Asus mobo), so it's certainly possible that it is a bug in that system or a bug in DragonFly (though it seems unlikely given the particular hyperthread pairing characteristics of the bug). Only IRETQ seems to trigger it in the manner described above, which means that AMD can probably fix it with a microcode update.
"
Source

Serious question: if I purchase this processor for a workstation and I encounter such a bug at every juncture of my work tasks, rendering it useless, is that grounds for a full refund from various e-retailers? I can't imagine it not being the case, and I wonder whether those highlighting this as an issue have sent their processors and parts back for a refund.

I am likely going to invest in the platform but will obviously not hesitate to issue a return if I face a bug in their hardware that prevents me from doing work. What should I expect? Should I call various e-retailers and inquire about how they'd handle this?
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
Option 1 also complicates stuff, though.

The cores in one CCX are, according to the block diagrams, connected similarly to the dies in Naples - each core has a link to every other core. You can do that with 3 links per node if you have 4 cores in a CCX, but if there were six, each of those cores would need 5 links - 15 paths crossing within the CCX, an awful lot of wiring. With 4 cores, you only have 6 paths and just two of them cross each other.
Can you link the block diagram you mean? That's definitely new to me.

Edit: nvm, thought you meant across CCX. Yeah, wiring complexity will increase somewhere and that's unavoidable, but in order to maintain high performance and compactness, I think the compromise should be made on the CCX level.
[Attached image: sDxLaWT.png]
 

moinmoin

Diamond Member
Jun 1, 2017
4,934
7,619
136
Serious question: if I purchase this processor for a workstation and I encounter such a bug at every juncture of my work tasks, rendering it useless, is that grounds for a full refund from various e-retailers? I can't imagine it not being the case, and I wonder whether those highlighting this as an issue have sent their processors and parts back for a refund.

I am likely going to invest in the platform but will obviously not hesitate to issue a return if I face a bug in their hardware that prevents me from doing work. What should I expect? Should I call various e-retailers and inquire about how they'd handle this?
Once AMD admits to the problem it should be grounds for a full refund (which is exactly why they won't admit it, instead keeping mum and everything fuzzy while hopefully fixing it soon). I'd wait until this is resolved unless you have use cases not affected by the bug.
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
Once AMD admits to the problem it should be grounds for a full refund (which is exactly why they won't admit it, instead keeping mum and everything fuzzy while hopefully fixing it soon). I'd wait until this is resolved unless you have use cases not affected by the bug.
That's assuming it's not fixable with a new microcode, which would be quite unusual.
 

moinmoin

Diamond Member
Jun 1, 2017
4,934
7,619
136
That's assuming it's not fixable with a new microcode, which would be quite unusual.
Everybody assumes it's fixable; it's just the long wait with no feedback of any kind that naturally makes people start to doubt that.
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
Everybody assumes it's fixable; it's just the long wait with no feedback of any kind that naturally makes people start to doubt that.
Obviously if it's not fixable this is VERY serious. Otherwise, this is just classic AMD miscommunication.
I wouldn't be surprised if most of the people working on a fix aren't even aware of what people outside the company are thinking.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
I'd wait until this is resolved unless you have use cases not affected by the bug.

Buying without a firm confirmation that the bug is already resolved would be madness. The very random nature of the crashes, and the fact that they raise SIGSEGV at what looks like a normal RIP address post-mortem, means things went very wrong. It could be pretty much anything.
 

moinmoin

Diamond Member
Jun 1, 2017
4,934
7,619
136
Obviously if it's not fixable this is VERY serious. Otherwise, this is just classic AMD miscommunication.
I wouldn't be surprised if most of the people working on a fix aren't even aware of what people outside the company are thinking.
Unless AMD intends to shelve Epyc again after all the fanfare, they must resolve the bug anyway. In that regard I don't think there's anything to worry about; it's just a matter of time.

All indications point to the bug being in the microcode. Unlike with, say, the Windows scheduler, there is neither a way to put the blame on others nor to publicly pretend all is fine as it is. There is also no way to completely mitigate the problem. It's completely up to AMD to fix it, and I consider non-communication better than miscommunication. Any communication would very likely involve nothing more than putting off details à la "we are aware of the issue and have nothing to announce at this point" anyway.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Obviously if it's not fixable this is VERY serious. Otherwise, this is just classic AMD miscommunication.
I wouldn't be surprised if most of the people working on a fix aren't even aware of what people outside the company are thinking.
Trust me, they're very aware, and a bug like this has high visibility within a company.
Given its severity, it is simply understood and communicated that no one is supposed to talk about it outside of official channels and spokespeople. In such cases, if this is violated, it is usually grounds for termination. As for the persistence of this bug without a fix, I imagine it's because something is fundamentally broken and requires a hell of a workaround. If you're an engineer in tech, you know exactly how these kinds of things work and how little sleep you get trying to resolve the problem. On one hand it sucks; on the other it's quite exciting to:
  1. Find out the root cause of the bug
  2. Find an elegant solution
That being said, I have enough complexity already in my dev efforts. I can't add on top of that the possibility of hardware doing something fundamentally flawed and causing crashes. Maybe in a test rig, which I have set up; however, most certainly not in the upper reaches of their offerings.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Buying without a firm confirmation that the bug is already resolved would be madness. The very random nature of the crashes, and the fact that they raise SIGSEGV at what looks like a normal RIP address post-mortem, means things went very wrong. It could be pretty much anything.
I am bringing a test rig in-house ATM.
However, indeed... This is a show-stopper for progressing beyond a test-rig.
Maybe August brings resolution.
 

Jan Olšan

Senior member
Jan 12, 2017
273
276
136
Well, it doesn't hang in stressful computations; it supposedly does something buggy inside microcode. So it is likely not a bug on the order of the TLB erratum or FDIV. I'd expect it to be fixable with ucode if it happens in ucode.

But on the other hand, it has to do with SMT, and that is a hard thing. Intel also had to disable TSX on their chips...
 

.vodka

Golden Member
Dec 5, 2014
1,203
1,537
136
Unless AMD intends to shelve Epyc again after all the fanfare, they must resolve the bug anyway. In that regard I don't think there's anything to worry about; it's just a matter of time.

Well, Epyc and TR are using Zeppelin stepping B2...

Has anyone with access to Epyc tested for this particular bug?
 

tamz_msc

Diamond Member
Jan 5, 2017
3,719
3,554
136
OK, so the problem has reared its head beyond looping GCC compiles from a ramdisk.
http://phoronix.com/scan.php?page=news_item&px=Ryzen-Test-Stress-Run

Using PTS_CONCURRENT_TEST_RUNS=4 TOTAL_LOOP_TIME=60 phoronix-test-suite stress-run build-linux-kernel build-php build-apache pgbench apache redis will have the Phoronix Test Suite continually running four different benchmarks simultaneously for a period of 60 minutes. As soon as one test finishes, another is fired up. The stress-run algorithm randomly picks the tests of your set to run, but does look at the test profiles to try to ensure that all subsystems are being stressed at all times.
While with the ryzen-fail demo program it could take up to half an hour to get a failure reported when SMT was disabled, with the Phoronix Test Suite stress-run on the Ryzen 7 1800X with eight cores and no SMT I managed to get the first segmentation fault just 229 seconds after the system was booted... And the segmentation faults would continue every few minutes in this configuration under the immense workloads.
This is just getting harder and harder to ignore. Needs more testing on EPYC platforms, TR too.
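
For anyone wondering what these loop-compile reproducers (like the ryzen-fail demo mentioned above) boil down to, here is a minimal, hypothetical C harness — illustrative only, not the actual demo or the PTS code, and "test.c" is just a placeholder source file. It forks a batch of parallel compile jobs over and over and flags any child killed by SIGSEGV:

Code:
/* Hypothetical reproduction harness (illustrative only): repeatedly fork
 * parallel gcc jobs and report any child that dies with SIGSEGV, which is
 * how the Ryzen build failures show up. "test.c" is a placeholder. */
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

enum { JOBS = 16 };            /* oversubscribe all SMT threads */

int main(void)
{
    for (;;) {
        pid_t pids[JOBS];
        for (int i = 0; i < JOBS; i++) {
            pids[i] = fork();
            if (pids[i] < 0) { perror("fork"); return 1; }
            if (pids[i] == 0) {
                execlp("gcc", "gcc", "-O2", "-c", "test.c", "-o", "/dev/null", (char *)NULL);
                _exit(127);    /* exec failed */
            }
        }
        for (int i = 0; i < JOBS; i++) {
            int status = 0;
            waitpid(pids[i], &status, 0);
            if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV) {
                fprintf(stderr, "child %d segfaulted -- bug reproduced\n", (int)pids[i]);
                return 1;
            }
        }
    }
}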
 

tamz_msc

Diamond Member
Jan 5, 2017
3,719
3,554
136
Has it been reproduced on an OS other than Linux?
It has been reproduced with WSL (Windows Subsystem for Linux).

It gets worse, the PTS segfaults with Epyc as well:
https://www.reddit.com/r/Amd/comments/6rmq6q/epyc_7551_mining_performance/dl6fcar/

This, coupled with the fact that an AMD engineer's reddit post about looking into the kill_ryzen.sh script was subsequently removed, means that this could soon blow out of proportion and become a huge problem for Epyc adoption in the datacenter. Customers basically won't touch Epyc unless this gets resolved immediately.

AMD needs to double down on finding the root cause and fixing it, or they'll be left with a black eye.
 

moinmoin

Diamond Member
Jun 1, 2017
4,934
7,619
136
In the AMD forum people are now reporting successful runs with recent new CPUs (after RMA).
Which wouldn't really be good news for all the owners already out there. Still hope the issue is fixable (or at least avoidable, come on!) through a microcode update or some such.
 

mattiasnyc

Senior member
Mar 30, 2017
356
337
136
Which wouldn't really be good news for all the owners already out there. Still hope the issue is fixable (or at least avoidable, come on!) through a microcode update or some such.

I disagree. Most owners probably never encounter this issue, as far as I can tell. If this were "bad news" for "all the owners" who haven't gotten an RMA, then surely they'd all already be suffering from issues. That doesn't seem to be the case at all. Actually, I don't recall a single review that disclosed this particular problem. Not one. Surely this tells us that the issue is very much a "corner case", or whatever one calls it...
 