- Jun 1, 2017
Oh absolutely, it's a very serious hardware bug that is commonly and easily encountered from the Unix userland (not only under Linux but also on the BSDs and in WSL under Windows, which uses the Windows kernel, so the issue is confirmed to be in the hardware rather than in any specific software). My point was that we still don't know exactly what triggers it, so nobody has a workaround that always works either. As a result, what exacerbates the issue and what supposedly solves it is a purely random set of guesses. AMD absolutely must fix this completely before they ramp up Epyc shipments (which is supposed to happen later this year).

Sorry, by "easy to reproduce" I should have made clear that I hit it about 10 times a day (at least) on a staging system that I'd really like to put into production. The AGESA upgrade made zero difference, and I shouldn't have to fart around with different compiler versions or options that my UEFI doesn't offer in any case. This shouldn't happen with plain old x86 code that works reliably on every other CPU. My point was that there appears to be no outward community interaction on AMD's part, despite a boatload of people putting together relatively reliable test cases, precisely because the bug *is* so damn easy to hit.
Sure, it's not a 10-line test case that crashes every time (yet), but AMD would have a bucketload more knowledge and instrumentation at their disposal. I would assume they'd be looking into it; it's just odd to get zero feedback at all.
I certainly could not recommend anyone put a Zen-based system into a production environment until they get it sorted. I've seen other processes die the same way gcc does; it's just a lot more "random" and consequently harder to reproduce. That sort of unpredictability does not instill confidence. I'm sure they'll get it sorted, but a quick "hey, we are actually really looking into this" would be helpful. So far I haven't really seen that level of engagement.
Another educated guess FWIW by the maintainer of DragonFly BSD:
"Hi, Matt Dillon here. Yes, I did find what I believe to be a hardware issue with Ryzen related to concurrent operations. In a nutshell, for any given hyperthread pair, if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ, the hyperthread issuing the IRETQ can stall indefinitely until the other hyperthread with the cpu-bound loop pauses (aka HLT until next interrupt). After this situation occurs, the system appears to destabilize. The situation does not occur if the cpu-bound loop is on a different core than the core doing the IRETQ.

The %rip the IRETQ returns to (e.g. userland %rip address) matters a *LOT*. The problem occurs more often with high %rip addresses such as near the top of the user stack, which is where DragonFly's signal trampoline traditionally resides. So a user program taking a signal on one thread while another thread is cpu-bound can cause this behavior. Changing the location of the signal trampoline makes it more difficult to reproduce the problem. I have not been able to completely mitigate it. When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state.
The bug is completely unrelated to overclocking. It is deterministically reproducible.
I sent a full test case off to AMD in April.
I should caution here that I only have ONE Ryzen system (1700X, Asus mobo), so it's certainly possible that it is a bug in that system or a bug in DragonFly (though it seems unlikely given the particular hyperthread-pairing characteristics of the bug). Only IRETQ seems to trigger it in the manner described above, which means that AMD can probably fix it with a microcode update."