AMD Zen: A properly functioning microarchitecture?

eton975 · Oct 6, 2018

I'm sure many of us here have vague memories of chips that had real-world, unfixable bugs and problems. Problems that would not go away with a better motherboard, BIOS updates or OS reinstalls - only with rewritten software. Problems that others might even blame on a bad overclock, until the OP reverted to stock settings and the problem persisted.

Pentiums with incorrect FDIV lookup tables and bus locks from F00F... opcodes, Cyrixes that had weird issues with anything assuming Pentium-level instructions, K6s that had stability issues with too much RAM, Athlons that would go up in smoke for lack of a heatspreader or a fast thermal diode.

But most of those examples are in the distant past, likely buried in landfill. PCs, their components and their marketing have changed over the years - expansion slot, form factor and firmware standards have become easier to work with and harder to mess up. You can't just plug in the motherboard power backwards and see it go up in smoke anymore unless you're pushing down with the force of a steamroller. Can't insert your CPU backwards and fry it these days like you could your Socket 3 486!

Mid-end motherboards today come in glossy, made-to-fit boxes, have sleek plastic shrouds, a jet-black silkscreen and fancy lettering; they are often fashionable, 'refined' products, are they not, far cries from the crude green and brown boards of the 80s and 90s or the more fisher-price boards around the new millennium (even modern ones sans UEFI, IMO). Surely the janky past of glitchiness is behind us, with our modern equipment offering a smooth, trouble-free computing experience?

Now (well, really around a year ago), I hear that a large number of early Ryzens have a manufacturing defect - ringing, out-of-spec inductance somewhere, grounding issue, insufficient capacitance in power layers/ripple smoothing, design features being packed too close, mask defect blowing open part of a line? - that causes segfaults compiling GCC, and not every time either. Even after microcode and UEFI updates. RMA required. Huge numbers of chips affected. Virtual 8086 mode semi-broken (apparently fixed with AGESA 1006?), causing issues with some VMs or running 16-bit software on 32-bit Windows. Reports of unusual crashes in Maya? Handbrake? Broken C6 sleep mode causing lockups when idle in Linux and BSD?

And if these obvious symptoms are cropping up, how many silent issues might be occurring in the background that don't trigger a full-blown segfault on Ryzen? (Not to say that errata on Intel are never an issue)

To me, this feels a bit (a bit) like the ominous days of Super Socket 7 and its standard melange. What kinds of little issues could come up at the worst time to bite you in the back? As if termites infested some of the towers and towers of software written for PCs, but you don't know exactly which ones until you walk in and try to run them. Like a piece of the digital landscape has broken off, but you're blind and can only hope you don't fall down the hole it created.

(The things that really come to mind on the recent Intel side are issues with atomicity on Skylake HT (test program was tight assembly loops?), Meltdown/Spectre of course and x87 having questionable precision? Also curious about chipset/USB stability on the last 10 years worth of the AMD lines? Have heard of some issues regarding USB latency with SB7/9xx on the AMD side... Serato appears to be an example of software affected.)

Perhaps it's just paranoia on my part, but I think I make somewhat of a point that these issues should have been ironed out. These are not all just hypothetical issues listed on a datasheet. Sometimes I wonder if deterministic faults on 2 different PCs with similar architecture could cause real issues for DC projects like BOINC that depend on cross-system validation? Computers that don't compute the way we expect them to...
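
To make the cross-system validation worry concrete, here is a minimal sketch of quorum-style result checking. It is not BOINC's actual validator; the `validate` function, the quorum size and the sample values are all hypothetical and only illustrate the idea that a result is accepted once enough hosts agree on it.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def validate(results: Sequence[Hashable], quorum: int = 2) -> Optional[Hashable]:
    """Return the value reported by at least `quorum` hosts, or None if no value reaches quorum."""
    if not results:
        return None
    # Pick the most frequently reported value and check whether enough hosts agree on it.
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None


if __name__ == "__main__":
    correct, faulty = 42, 41  # hypothetical work-unit results

    # One good host and one faulty host disagree, so nothing is accepted.
    print(validate([correct, faulty]))

    # Two hosts sharing the same deterministic fault agree with each other,
    # so the wrong value reaches quorum.
    print(validate([faulty, faulty]))
```

In this simplified model a random, one-off error gets caught because the hosts disagree, while two machines that share the same deterministic fault would validate each other's wrong answer.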

Zen uarch family erratum sheet

Anecdotes from people here with Ryzen systems would be interesting to hear.

William Gaatjes · Oct 6, 2018

I am not sure what the issue was with the segmentation fault, but it seems to be fixed with a GCC patch, if I am not mistaken.
I use ARM GCC compilers but I never had issues. But that may be because it is not related.

What I have read about ancient DOS virtual machines crashing...
Ryzen has very good virtualization features in hardware. When you choose to use archaic virtual machine software that tries to do everything in software....
Maybe it is time to upgrade that virtual machine to one that properly supports hardware virtualization techniques.

To be honest, the only experience I have with virtual machines is VMware.
I use VMware and never had any issues while running at least Windows XP with my Ryzen 2600.
It can be that it is fixed for the 2000 series or VMware leaves the handling over to Ryzen.
This article digs deeper into it:
http://www.os2museum.com/wp/vme-broken-on-amd-ryzen/

After analyzing the problem, it’s now clear what’s happening. As incredible as it is, Ryzen has buggy VME implementation; specifically, the INT instruction is known to misbehave in V86 mode with VME enabled when the given vector is redirected (i.e. it should use standard real-mode IVT and execute in V86 mode without faulting). The INT instruction simply doesn’t go where it’s supposed to go which leads to more or less immediate crashes or hangs.

How did AMD miss it? Because only 32-bit OSes are affected, and only when running 16-bit real-mode code. Except with Windows XP and Server 2003 it’s much worse and these systems may not even boot.

To be clear, the problem is not at all specific to virtualization. It has been confirmed on a Ryzen 5 1500X running FreeDOS—which comes with the JemmEx memory manager, which enables VME by default. Until VME was disabled, any attempt to boot with JemmEx failed with invalid opcode exceptions. After disabling VME, FreeDOS worked normally.

That is not surprising because when the problematic INT instruction is executed inside a VM using AMD-V, it is almost always executed without any intervention from the hypervisor, which means the hypervisor has no opportunity to mess anything up.

Now, back to the XP trouble. Windows NT supports VME at least since NT 4.0 and enables it automatically. That is the case for NT 4.0, XP, Windows 7, etc. For the most part, it would only matter when running a 16-bit DOS or Windows application (such as EDIT.COM which comes with Windows).

Windows XP and Server 2003 (that is NT 5.1 and 5.2) is significantly more affected because it was the first Windows OS that shipped with a generic display driver using VBE (VESA BIOS Extensions), and the only Windows family which executed the BIOS code inside NTVDM (with VME on, if available). Starting with Vista, presumably due to increased focus on 64-bit OSes where V86 mode is entirely unavailable, the video BIOS is executed indirectly, likely using pure software emulation.

The upshot is that the problem is visible in Windows versions at least from NT 4.0 and up, but XP and Server 2003 may entirely fail to boot, either hanging or crashing just before bringing up the desktop. Other operating systems which use VME are affected as well (OS/2, DOS with certain memory managers).

The workaround is simple—if possible, mask out the VME CPUID bit (bit 1 in register EDX of leaf 1), which is something hypervisors typically allow. Windows does not require VME and without VME, XP can be booted normally on Ryzen CPUs, at least in a VM.
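
As a rough illustration of the masking the article describes, here is a minimal sketch that tests and clears the VME flag in a CPUID leaf 1 EDX value. The helper names `has_vme` and `mask_vme` and the sample EDX value are made up for illustration; in practice a hypervisor exposes this through its own CPUID override settings rather than through code like this.

```python
# Bit 1 of EDX from CPUID leaf 1 is the VME (Virtual-8086 Mode Extensions) flag.
VME_BIT = 1 << 1


def has_vme(edx: int) -> bool:
    """Return True if the VME feature bit is set in the given EDX value."""
    return bool(edx & VME_BIT)


def mask_vme(edx: int) -> int:
    """Return the EDX value a guest would see with the VME bit hidden."""
    return edx & ~VME_BIT & 0xFFFFFFFF


if __name__ == "__main__":
    sample_edx = 0x178BFBFF  # hypothetical leaf 1 EDX value, not read from real hardware
    print(has_vme(sample_edx))
    print(hex(mask_vme(sample_edx)))
    print(has_vme(mask_vme(sample_edx)))
```

A guest that only sees the masked value should not try to enable VME, which is why the article reports XP booting normally in a VM with the bit hidden.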

scannall · Oct 6, 2018

I'm not sure what your point is really. Every CPU has errata. Every single one. It's usually worked around in either BIOS or software. And once in a while they have to replace a bunch of chips. Something both Intel and AMD have had to do.

Ryzen is obviously a well functioning architecture. Millions out there prove that, after the early ones were replaced. I suppose one could argue that Intel should recall and replace every Meltdown susceptible chip out in the wild. That won't happen of course, and it works as designed. But it was a very poor design decision that has cost large companies millions of dollars.

NTMBK · Oct 6, 2018

They've shipped millions of units. If there were lingering issues we'd have heard about them.

eton975 · Oct 6, 2018

William Gaatjes said:
I am not sure what the issue was with the segmentation fault, but it seems to be fixed with a GCC patch, if I am not mistaken.
I use ARM GCC compilers but I never had issues. But that may be because it is not related.

Segfault was apparently a manufacturing issue of some kind. AMD reps have basically confirmed there is no microcode fix for that issue (though updated GCC may not hit the processor the right way to cause it), and it does not occur on older AMD or Intel systems. There is an RMA process set up for it.

William Gaatjes said:
What I have read about ancient DOS virtual machines crashing...
Ryzen has very good virtualization features in hardware. When you choose to use archaic virtual machine software that tries to do everything in software....
Maybe it is time to upgrade that virtual machine to one that properly supports hardware virtualization techniques.

To be honest, the only experience I have with virtual machines is VMware.
I use VMware and never had any issues while running at least Windows XP with my Ryzen 2600.
It can be that it is fixed for the 2000 series or VMware leaves the handling over to Ryzen.
This article digs deeper into it:
http://www.os2museum.com/wp/vme-broken-on-amd-ryzen/

Read that article a while back - I have heard that IDT/IVT can cause serious issues if you're not paying close attention when designing BIOS and microcode, so interrupt vectors getting garbled during processing is not fun.

However, unless I'm totally mistaken, isn't all of this VM software using hardware-assisted virtualisation? Is it not relying on the CPU's hardware VME support..? Are you suggesting that legacy stuff like this should be trapped and emulated by the hypervisor in software (as a workaround or encouraged by the processor vendors?), while protected and long mode operation should be done "natively" through the hardware VM features? Not to bite, but if so I think you could say so more clearly.

Additionally, there are still regular complaints about crashes from people trying to install Windows 98 on VMware and Virtualbox forums, that curiously do not occur when they test on their Intel-based machines. I am not sure whether these have much to do with VME however - I have not read in very far, nor seen the exact recommendations support has made for masking CPUID flags.

eton975 · Oct 6, 2018

scannall said:
I'm not sure what your point is really. Every CPU has errata. Every single one. It's usually worked around in either BIOS or software. And once in a while they have to replace a bunch of chips. Something both Intel and AMD have had to do.

Ryzen is obviously a well functioning architecture. Millions out there prove that, after the early ones were replaced. I suppose one could argue that Intel should recall and replace every Meltdown susceptible chip out in the wild. That won't happen of course, and it works as designed. But it was a very poor design decision that has cost large companies millions of dollars.

Basically it is that while things may appear alright on the surface for those millions, could the bulk of the iceberg be underneath the waves? Especially with nondeterministic stuff like the cause of segfault - could BSOD this or that be a hardware issue rather than a software one? (But be attributed to 'typical OS/software bugs')

In terms of lingering issues - millions of the old Ryzen chips are floating around unreplaced. While not all of them seem affected, it would seem...? a substantial portion of them are.

As for hearing about them - lots of anecdotes here and there about an issue with X software running on a Ryzen system. Though that doesn't mean it's a Ryzen problem of course, and they are just anecdotes.

VirtualLarry · Oct 6, 2018

I've had a number of BSODs on my Windows 10 system with a Ryzen 5 1600 CPU. More than I would like, but then again, I'm overclocked, although I haven't always been. For the most part, though, it's stable.

Edit: It should be stated, that I did purchase this one "early", and it probably has "the bug".

William Gaatjes · Oct 6, 2018

VirtualLarry said:
I've had a number of BSODs on my Windows 10 system with a Ryzen 5 1600 CPU. More than I would like, but then again, I'm overclocked, although I haven't always been. For the most part, though, it's stable.

I only have the memory overclocked but with relaxed timings. And I never have any issues at all.
Runs fantastic.

Markfw · Oct 6, 2018

I have 5 TR systems (4 x 1950x and a 2990wx) and 4 Ryzen (1800x, 1700x, 2 x 2700x) systems. Not once have I had any issue, and they all run 24/7/365 @ 100% load.

There is nothing wrong with these chips. This whole thread reeks of trolling.

maddie · Oct 6, 2018

eton975 said:
Basically it is that while things may appear alright on the surface for those millions, could the bulk of the iceberg be underneath the waves? Especially with nondeterministic stuff like the cause of segfault - could BSOD this or that be a hardware issue rather than a software one? (But be attributed to 'typical OS/software bugs')

In terms of lingering issues - millions of the old Ryzen chips are floating around unreplaced. While not all of them seem affected, it would seem...? a substantial portion of them are.

As for hearing about them - lots of anecdotes here and there about an issue with X software running on a Ryzen system. Though that doesn't mean it's a Ryzen problem of course, and they are just anecdotes.

If I were you, I'd do the only safe thing. Stay as far away from all computers as possible.

Abwx · Oct 6, 2018

maddie said:
If I were you, I'd do the only safe thing. Stay as far away from all computers as possible.

Not all, the one below does +, -, / and x ops and even roots of any order, which is quite a difficulty in usual CPUs, and as far as I'm aware it has no bugs and about no risk of BSOD:

https://upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Boulier1.JPG/220px-Boulier1.JPG

chrisjames61 · Oct 6, 2018

VirtualLarry said:
I've had a number of BSODs on my Windows 10 system with a Ryzen 5 1600 CPU. More than I would like, but then again, I'm overclocked, although I haven't always been. For the most part, though, it's stable.

I would say that overclocking throws everything out the window. Pretty sure your overclock actually isn't 100% stable.

chrisjames61 · Oct 6, 2018

eton975 said:
I'm sure many of us here have vague memories of chips that had real-world, unfixable bugs and problems. Problems that would not go away with a better motherboard, BIOS updates or OS reinstalls - only with rewritten software. Problems that others might even blame on a bad overclock, until the OP reverted to stock settings and the problem persisted.

Pentiums with incorrect FDIV lookup tables and bus locks from F00F... opcodes, Cyrixes that had weird issues with anything assuming Pentium-level instructions, K6s that had stability issues with too much RAM, Athlons that would go up in smoke for lack of a heatspreader or a fast thermal diode.

But most of those examples are in the distant past, likely buried in landfill. PCs, their components and their marketing have changed over the years - expansion slot, form factor and firmware standards have become easier to work with and harder to mess up. You can't just plug in the motherboard power backwards and see it go up in smoke anymore unless you're pushing down with the force of a steamroller. Can't insert your CPU backwards and fry it these days like you could your Socket 3 486!

Mid-end motherboards today come in glossy, made-to-fit boxes, have sleek plastic shrouds, a jet-black silkscreen and fancy lettering; they are often fashionable, 'refined' products, are they not, far cries from the crude green and brown boards of the 80s and 90s or the more fisher-price boards around the new millennium (even modern ones sans UEFI, IMO). Surely the janky past of glitchiness is behind us, with our modern equipment offering a smooth, trouble-free computing experience?

Now (well, really around a year ago), I hear that a large number of early Ryzens have a manufacturing defect - ringing, out-of-spec inductance somewhere, grounding issue, insufficient capacitance in power layers/ripple smoothing, design features being packed too close, mask defect blowing open part of a line? - that causes segfaults compiling GCC, and not every time either. Even after microcode and UEFI updates. RMA required. Huge numbers of chips affected. Virtual 8086 mode semi-broken (apparently fixed with AGESA 1006?), causing issues with some VMs or running 16-bit software on 32-bit Windows. Reports of unusual crashes in Maya? Handbrake? Broken C6 sleep mode causing lockups when idle in Linux and BSD?

And if these obvious symptoms are cropping up, how many silent issues might be occurring in the background that don't trigger a full-blown segfault on Ryzen? (Not to say that errata on Intel are never an issue)

To me, this feels a bit (a bit) like the ominous days of Super Socket 7 and its standard melange. What kinds of little issues could come up at the worst time to bite you in the back? As if termites infested some of the towers and towers of software written for PCs, but you don't know exactly which ones until you walk in and try to run them. Like a piece of the digital landscape has broken off, but you're blind and can only hope you don't fall down the hole it created.

(The things that really come to mind on the recent Intel side are issues with atomicity on Skylake HT (test program was tight assembly loops?), Meltdown/Spectre of course and x87 having questionable precision? Also curious about chipset/USB stability on the last 10 years worth of the AMD lines? Have heard of some issues regarding USB latency with SB7/9xx on the AMD side... Serato appears to be an example of software affected.)

Perhaps it's just paranoia on my part, but I think I make somewhat of a point that these issues should have been ironed out. These are not all just hypothetical issues listed on a datasheet. Sometimes I wonder if deterministic faults on 2 different PCs with similar architecture could cause real issues for DC projects like BOINC that depend on cross-system validation? Computers that don't compute the way we expect them to...

Zen uarch family erratum sheet

Anecdotes from people here with Ryzen systems would be interesting to hear.

It is your paranoia. I have two Ryzen systems, one CHVI Hero with an R5 1600 and a Gaming Titanium with an R5 2400G, and have never encountered any problems. In fact, perusing forums, I see scant to no evidence of what your post is implying.

Sable · Oct 6, 2018

Markfw said:
I have 5 TR systems (4 x 1950x and a 2990wx) and 4 Ryzen (1800x, 1700x, 2 x 2700x) systems. Not once have I had any issue, and they all run 24/7/365 @ 100% load.

There is nothing wrong with these chips. This whole thread reeks of trolling.

I cannot capitalise FUD enough. Oh wait, I just did. (The lead-in about ye olde Pentiums having errata to give an unbiased feel was a nice touch though)

Markfw · Oct 6, 2018

Sable said:
I cannot capitalise FUD enough. Oh wait, I just did. (The lead-in about ye olde Pentiums having errata to give an unbiased feel was a nice touch though)

Are you agreeing with me that this thread is FUD ?

Just to be clear that you were not calling me out, since you quoted me.

IEC · Oct 6, 2018

Markfw said:
Are you agreeing with me that this thread is FUD ?

Just to be clear that you were not calling me out, since you quoted me.

He's agreeing with you. I haven't encountered any of the many published errata for CPUs since the Pentium III days. I doubt I'll encounter any of them in the future, either. Intel or AMD.

Markfw · Oct 6, 2018

IEC said:
He's agreeing with you. I haven't encountered any of the many published errata for CPUs since the Pentium III days. I doubt I'll encounter any of them in the future, either. Intel or AMD.

This whole thread is just ridiculous. All CPUs have errata; how he can get "a properly functioning microarchitecture" out of that just seems like trolling.

chrisjames61 · Oct 6, 2018

The guy did a post and run so I am pretty sure it was a troll that backfired on him.

DrMrLordX · Oct 6, 2018

I have a day-one 1800x that has run everything I threw at it flawlessly while not overclocked.

Admittedly, I have not compiled the Linux kernel or GCC, which are the two workloads that exposed the apparent erratum for which AMD already offered an RMA. So no issue there. AMD's 2700x never had that problem.

eton975 · Oct 6, 2018

Sable said:
I cannot capitalise FUD enough. Oh wait, I just did. (The lead-in about ye olde Pentiums having errata to give an unbiased feel was a nice touch though)

Markfw said:
This whole thread is just ridiculous. All CPUs have errata; how he can get "a properly functioning microarchitecture" out of that just seems like trolling.

Basically my point is that these errata appeared a little (maybe a lot) more serious and pervasive than normal to me. I remember the Sandy Bridge chipset recall, Atoms self-bricking and Phenom issues as the several 'big' examples of modern arches having serious hard-to-fix issues. If you patch over one issue but others keep coming up and crashing you, modifying results (in real-world workloads), I think it's pretty fair to say your µarch is booby-trapped (or in a much more finicky state than others) if you want to lean on it for correct, trouble-free computation. Isn't that one of the selling points of a computer? This is all assuming that the issues with Zen and Ryzen are coming up more than others though, and problems are not creeping up silently on other processors often.

Balance: Do I need to bash Intel harder? Meltdown/Spectre have absolutely been proven in the real world.

Markfw · Oct 6, 2018

eton975 said:
Basically my point is that these errata appeared a little (maybe a lot) more serious and pervasive than normal to me. I remember the Sandy Bridge chipset recall, Atoms self-bricking and Phenom issues as the several 'big' examples of modern arches having serious hard-to-fix issues. If you patch over one issue but others keep coming up and crashing you, modifying results (in real-world workloads), I think it's pretty fair to say your µarch is booby-trapped (or in a much more finicky state than others) if you want to lean on it for correct, trouble-free computation. Isn't that one of the selling points of a computer? This is all assuming that the issues with Zen and Ryzen are coming up more than others though, and problems are not creeping up silently on other processors often.

Balance: Do I need to bash Intel harder? Meltdown/Spectre have absolutely been proven in the real world.

Intel may have been proven with Meltdown/spectre, but you have proved nothing with Ryzen, other than you are trolling, and EVERY response has backed me up on this.

eton975 · Oct 6, 2018

No, I am not trolling. I am not sealioning. I might be specially pleading - I am not sure.

Segfault has been 'proven'. VME was proven and fixed - but there are still examples of people that can't get the old stuff to run on Win98 on Ryzen (with SVM enabled, other VMs work fine), but the exact same image mysteriously works fine(?) on Intel systems. In the case of C6 the lockups that users encounter seem to go away with a little python script to disable it or sometimes through disabling idle current in UEFI, and there is an acknowledged issue with MWAIT in the AMD datasheet (the instruction used to put the processor into lower-power states). For the other issues it's far more murky, but I can think of one example where something crops up on Ryzen and only Ryzen repeatedly.

VirtualLarry · Oct 6, 2018

I don't know about Ryzen CPUs, other than the earliest batch, which I probably have, having errata that are user-visible, but what is going on with their APUs?

Their 2200G / 2400G won't even BOOT Windows 7. How the heck does that happen, for a CPU claiming x86 / x64 compatibility? Isn't Windows 7 64-bit the "Gold Standard" for "Legacy OSes"?

And I can't seem to keep my 2200G stable on Linux Mint 19 64-bit for more than a few days, even updating to the 4.19-RC4 kernel.

Edit: Not trying to FUD here in my post, just that it's well-known that the Ryzen CPUs will work with Windows 7, but the APUs will not, and my personal experiences trying to get Linux Mint 19 to work with my 2200G, on I think an Asus B350M-E Prime mobo.

eton975 · Oct 6, 2018

VirtualLarry said:
I don't know about Ryzen CPUs, other than the earliest batch, which I probably have, having errata that are user-visible, but what is going on with their APUs?

Their 2200G / 2400G won't even BOOT Windows 7. How the heck does that happen, for a CPU claiming x86 / x64 compatibility? Isn't Windows 7 64-bit the "Gold Standard" for "Legacy OSes"?

And I can't seem to keep my 2200G stable on Linux Mint 19 64-bit for more than a few days, even updating to the 4.19-RC4 kernel.

There are serious ACPI issues, AFAIK, along with weird issues with the iGPU. I hear that sometimes even using a dGPU and disabling iGPU is not enough.

DaveSimmons · Oct 6, 2018

eton975 said:
hear that sometimes even using a dGPU and disabling iGPU is not enough.

Links? Evidence?

Simone: My best friend’s sister’s boyfriend’s brother’s girlfriend heard from this guy who knows this kid who’s going with a girl who saw Ferris pass-out at 31 Flavors last night. I guess it’s pretty serious.

You're throwing around a lot of hearsay and third-hand rumors.

Could AMD CPUs have unfixed flaws? Yes. Could intel CPUs have unfixed flaws? Yes.

Could intel CPUs have undiscovered speculative execution exploits? Yes. Could AMD? Yes.

If you have a fear-based preference for intel, then buy intel and be happy. The same for AMD.

Don't expect to convince us here based on literal FUD (Fear, Uncertainty and Doubt) rather than credible evidence.
