News 2990WX Threadripper Performance Regression FIXED (for certain workloads) on Windows*

IEC

Super Moderator
Jun 10, 2004
13,444
129
126
#1
*Thread title was originally a copy-paste from video title. Editorial comment added in parentheses.

Per Level1Techs, it appears there is a Windows kernel bug that has led to strange results such as the TR 2990WX losing to the TR 2950X in some tests (Adobe Premiere, Indigo Renderer, Blender, 7-Zip, etc.). It turns out it's mostly not a memory bandwidth issue. It's a Windows scheduler bug that burns CPU cycles unproductively because of how it handles more than 2 NUMA nodes (possibly due to a band-aid fix for XCC Xeons). He proves it by comparing a 2990WX and an Epyc 7551 on Windows and on Linux, and by using coreprio to manipulate the performance.
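
For anyone unfamiliar with steering a process onto particular cores, here is a minimal Python sketch of CPU affinity on Linux. It is only an illustration of the general idea, not what coreprio does internally and not a Windows fix. The helper names (`parse_cpulist`, `node_cpus`, `pin_to_cpus`) are made up for this example, and `node_cpus` assumes the kernel exposes NUMA nodes under `/sys/devices/system/node`.

```python
import os

def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-7,32-39' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A single id like '5' has no upper bound, so reuse the lower one.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def node_cpus(node):
    """Return the CPU ids of one NUMA node (assumes the sysfs node directory exists)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())

def pin_to_cpus(cpus, pid=0):
    """Restrict a process (0 = the caller) to the given CPUs; return the new affinity set."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is only available on Linux")
    # Only keep CPUs the process is currently allowed to use.
    target = set(cpus) & os.sched_getaffinity(pid)
    if not target:
        raise ValueError("none of the requested CPUs are available to this process")
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)
```

On a machine with several NUMA nodes, `pin_to_cpus(node_cpus(0))` would keep the current process on node 0's cores, which is one way to see how much a workload cares about node placement.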

Full article:
https://level1techs.com/article/unlocking-2990wx-less-numa-aware-apps

Conclusion:
"The rumors of a memory bandwidth problem, even with 32 cores (at least in these instances), has been greatly exaggerated."

Interpretation:
With server-like CPUs now easily available for consumers, Microsoft has some catching up if they want us to run Windows rather than Linux.

Update 1/14/2019:
AMD comments on Threadripper 2 Performance and Windows Scheduler (AT article by Ian Cutress)
 

The Stilt

Golden Member
Dec 5, 2015
1,709
64
106
#2
0:40 "And it's not AMD's fault".

An issue of this magnitude remains unfixed / undetected for 5 months+, so whose fault exactly is it, if not AMD's?
It's GROSS incompetence (if accurate).
 
Mar 13, 2006
10,089
48
126
#26
He proves it by comparing a 2990WX and an Epyc 7551 on Windows and on Linux, and by using coreprio to manipulate the performance.
Did he really "prove" it's a bug? Internet randos that think they know more than AMD or MS are a dime a dozen. There's even a guy on this very forum that has claimed he knows more about AMD CPUs than AMD themselves.
 

ericlp

Diamond Member
Dec 24, 2000
5,941
31
106
#30
Windows is a virus! Quickly get rid of it and install Linux. LOL.

More than likely, the reason this is taking so long to get addressed is that most people who buy these chips aren't running Windows to begin with, and for good reason.
 

Markfw

CPU Moderator, VC&G Moderator, Elite Member
Super Moderator
May 16, 2002
16,942
342
136
#3
And how is Microsoft's bug AMD's fault? Not bugging them enough? I am sure they did.
 
Feb 23, 2017
400
230
96
#4
Much like how Intel was GROSSLY incompetent for not fixing/detecting Meltdown/Spectre for 5+ years?
 

The Stilt

Golden Member
Dec 5, 2015
1,709
64
106
#5
And how is Microsoft's bug AMD's fault? Not bugging them enough? I am sure they did.
So basically you are saying that Microsoft would be refusing to fix an obvious bug in their OS, pointed out by AMD?
Makes perfect sense, especially when AMD hasn't made any statements regarding it.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
64
106
#6
Much like how Intel was GROSSLY incompetent for not fixing/detecting Meltdown/Spectre for 5+ years?
Yeah, noticing that your CPU is lacking half of the performance is just as hard as discovering design-related hardware errata that have no symptoms.
 

ub4ty

Senior member
Jun 21, 2017
749
304
96
#7
Process affinity? Duh.
This is amateur hour. What the hell are both AMD and Microsoft doing that this lasted so long? Was it such a mystery? This is also why I run all of my dev boxes on Linux. Windows is hot garbage for any serious compute tasks. Still, though, I wouldn't dare touch the 16+ core count Threadripper CPUs with that weird configuration whereby they have no direct I/O. I'd just buy a proper Epyc system. I hope the new chip architecture, with a dedicated I/O chip that all of the core complexes hook into, resolves this wonky foolishness.
 

Markfw

CPU Moderator, VC&G Moderator, Elite Member
Super Moderator
May 16, 2002
16,942
342
136
#8
Yeah, noticing that your CPU is lacking half of the performance is just as hard as discovering design-related hardware errata that have no symptoms.
I noticed a problem on mine, and just went to Linux on that box. Problem solved.

As for AMD pointing something out to MS, I am sure we are not privy to all communications between the two.
 

ub4ty

Senior member
Jun 21, 2017
749
304
96
#9
I noticed a problem on mine, and just went to linux on that box. Problem solved.

As for AMD pointing something out to MS, I am sure we are not privy to all communications between the two.
Yeah, this is too basic for them to not have known. Any low level software dev could have discovered this in a day's time and performance regression bugs of this magnitude are typically given high priority and assigned to a tiger team of engineers internally before the product gets anywhere near shipment. They probably kept it under wraps because it would do nothing but harm sales knowing that Microsoft is likely looking into and addressing it.

https://developer.amd.com/amd-uprof/
 

IEC

Super Moderator
Jun 10, 2004
13,444
129
126
#10
Without access to privileged information, you cannot assume one way or another what, if anything, each party knows or has reported to another.

It wasn't uncommon in my time in software that some seemingly simple fixes were put on the back burner for months or even years due to it being triaged as low impact/low # of affected users. For context, said software is considered "mission critical".

So rushing to judgment without knowing all the facts is a bit premature.
 

mattiasnyc

Senior member
Mar 30, 2017
266
79
76
#11
Where does it say that performance was "FIXED" on Windows? What am I missing?
 
Sep 9, 2017
52
7
41
#12

It's the same for Intel as well, the ecosystem is different and Microsoft needs to adapt to the new hardware.
This needs to be addressed sooner rather than later. Even though only a minority of people are buying those flagship CPUs, they aren't exactly cheap and the core war is only getting more intense.
 

ub4ty

Senior member
Jun 21, 2017
749
304
96
#13
Without access to privileged information, you cannot assume one way or another what, if anything, each party knows or has reported to another.

It wasn't uncommon in my time in software that some seemingly simple fixes were put on the back burner for months or even years due to it being triaged as low impact/low # of affected users. For context, said software is considered "mission critical".

So rushing to judgment without knowing all the facts is a bit premature.
You don't need access to privileged information to decipher this. If you've worked in the industry, you know exactly how this works.

AMD 100% knew about this. It would have been discovered in basic systems testing.
Once AMD found out about this, it should have taken a competent engineer less than a day with basic profiling tools to discover the root issue residing with Windows. From there comes a Sev 1 (show stopper) that goes all the way up the management chain to very high levels, showing that a brand new high performance product is underperforming due to a bug. Microsoft was likely directly contacted through high level channels and made aware of it, and both parties decided it would be better to keep this under wraps while they sort out a fix. The bug then gets marked as non-public and it's managed internally between AMD/Microsoft. The show-stopper attribute is taken off and the sev is whittled down for political reasons. This is what I've seen time and time again in my experience. The idea that Microsoft/AMD both didn't know about this at a high level is laughable given how straightforward and glaring an issue it is, and given the common sense systems testing on Windows/Linux that of course would have discovered it before it even shipped.

As an engineer, I don't buy into any foolish PR masking that's done after the fact. There are non-publicly tracked bugs that would make people's heads spin at every tech company in existence. However, the idea that no one knows about them (at a high level) is laughable. Release management knows about all of this stuff. It's their job to de-escalate and mitigate it after the fact for max profit.

> Sev 1 (show stopper) comes up in Release management meeting
Performance is impacted? By how much? Does it boot? Take show stopper off. Sev 3 Performance bug. Get our contact at Microsoft on the line. I want a tiger team on this and mark it (non-public). This will not be shared on the public bug portal.

This practice is so commonly known that big customers often have a team whose dedicated job is to tease out the bugs that companies don't tell them about.

That being said.. Crappy Windows rears its head again. I could imagine someone made the joke that no one with such compute demands uses Windows anyway.
 
Apr 27, 2000
10,476
326
126
#14
That being said.. Crappy Windows rears its head again. I could imagine someone made the joke that no one with such compute demands uses Windows anyway.
I would imagine that 2990WX users probably did what @Markfw did by switching to Linux.
 

Markfw

CPU Moderator, VC&G Moderator, Elite Member
Super Moderator
May 16, 2002
16,942
342
136
#15
I would imagine that 2990WX users probably did what @Markfw did by switching to Linux.
Also, I run Linux Mint 19 on all of my boxes except 2 (12 on Linux) since it's way more efficient even on 8 threads. I only have one of those though; the rest are at 16 or 32 threads, except the 2990WX@64.
 

StinkyPinky

Diamond Member
Jul 6, 2002
6,323
13
126
#16
Process affinity? Duh.
This is amateur hour. What the hell are both AMD and Microsoft doing that this lasted so long? Was it such a mystery? This is also why I run all of my dev boxes on Linux. Windows is hot garbage for any serious compute tasks. Still, though, I wouldn't dare touch the 16+ core count Threadripper CPUs with that weird configuration whereby they have no direct I/O. I'd just buy a proper Epyc system. I hope the new chip architecture, with a dedicated I/O chip that all of the core complexes hook into, resolves this wonky foolishness.
Perhaps the simplest explanation is the best one? Maybe it's a hard bug to fix. Could be deep in the core of the OS.
 

ub4ty

Senior member
Jun 21, 2017
749
304
96
#17
Perhaps the simplest explanation is the best one? Maybe it's a hard bug to fix. Could be deep in the core of the OS.
I said nothing about the complexity of the bug.
Instead, I provided a simple explanation as to why AMD and Microsoft were clearly aware of it, which is what was being debated... An intern in QA/Test could find and isolate this bug in system test over lunch. There's no question AMD/Microsoft knew about it.
 

Markfw

CPU Moderator, VC&G Moderator, Elite Member
Super Moderator
May 16, 2002
16,942
342
136
#18
Regardless of the complexity, why is it that the most prevalent and best funded OS has this bug and free Linux does not? Why does Linux give me 30% more performance on the same hardware?

Microsoft Windows is just a ripoff.
 

ub4ty

Senior member
Jun 21, 2017
749
304
96
#19
Regardless of the complexity, why is it that the most prevalent and best funded OS has this bug and free Linux does not? Why does Linux give me 30% more performance on the same hardware?

Microsoft Windows is just a ripoff.
NUMA config + full and proper support was never really needed, or didn't have to be fully fleshed out, on the consumer side of things, so they likely just got caught with their pants down, whereas Linux constantly churns through tons of enterprise hardware environments and is at home in such an environment. Windows 10 is great for casual desktop use, but I've never seen any serious enterprise hardware run on it.
 

Mopetar

Diamond Member
Jan 31, 2011
4,270
199
126
#20
Yeah, noticing that your CPU is lacking half of the performance is just as hard as discovering design-related hardware errata that have no symptoms.
I think it's a confluence of several factors. First, prior to AMD offering Threadripper, you wouldn't have people running into this bug on Windows. Until the system fails, no one even notices the problem. The other side of it is that even with people discovering it, the priority is probably quite low since there aren't a lot of affected users and the severity is much lower than fixing security critical bugs. Then it's a matter of fixing the problem in a way that doesn't break something else, which can be tricky if you don't understand the code all that well. Or perhaps it requires tearing out a pretty large chunk of code that's poorly designed in order to do things properly.

Regardless of the complexity, why is it that the most prevalent and best funded OS has this bug and free Linux does not? Why does Linux give me 30% more performance on the same hardware?
That's the wonder of open source. As soon as someone finds the bug, they can create a patch for it. The poor schlub at Microsoft probably has to dig through the bowels of ancient poorly documented code to track down the source of a problem that they don't understand terribly well and probably care about even less. With Linux you can examine and fix the code yourself, or at the very least figure out who introduced the bug and work with them to fix it if they're still active with the project.
 

Z15CAM

Golden Member
Nov 20, 2010
1,915
3
81
www.flickr.com
#21
I've been browsing Level1Techs for several years - they're pretty darn good - better than the top noted tech sites on the internet today. Eh-Hummm !

Not saying Cutress is late ;o(
 

BigDaveX

Senior member
Jun 12, 2014
314
21
101
#22
NUMA config + full and proper support was never really needed, or didn't have to be fully fleshed out, on the consumer side of things, so they likely just got caught with their pants down, whereas Linux constantly churns through tons of enterprise hardware environments and is at home in such an environment. Windows 10 is great for casual desktop use, but I've never seen any serious enterprise hardware run on it.
While NUMA support has never been a widely-needed feature on the desktop, it's not really like Microsoft could have been completely unaware it was a potential issue, since the lack of NUMA support in Windows back then almost single-handedly killed the Quad FX platform back in the day. And I'm pretty sure there were more than a few enthusiasts running Socket G34 Opterons around the turn of the decade as well.
 
Apr 16, 2014
156
0
101
#23
It's pretty important on my overclocked quad 61xx Opteron to have NUMA mode on, and the SRAT table on, when running single threaded games like Balanced Annihilation. Some software like Cinebench loves non-NUMA though... (node interleaving). Running it in NUMA gives almost a 1000cb deficit (2300 vs 3229).
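
To put those two Cinebench numbers in proportion, here is a quick back-of-the-envelope calculation in Python. It assumes 2300 is the NUMA-mode score and 3229 is the node-interleaved score, which is how the sentence above reads; the variable names are just for illustration.

```python
# Scores quoted in the post (assumed mapping: NUMA mode vs node interleaving).
numa_score = 2300
interleaved_score = 3229

# Absolute gap in Cinebench points.
deficit = interleaved_score - numa_score

# Gap relative to the better (interleaved) result, in percent.
percent = deficit / interleaved_score * 100

print(f"deficit: {deficit} cb ({percent:.1f}% below the interleaved score)")
```

So the gap is 929 cb, a bit under 29% of the interleaved result.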


While NUMA support has never been a widely-needed feature on the desktop, it's not really like Microsoft could have been completely unaware it was a potential issue, since the lack of NUMA support in Windows back then almost single-handedly killed the Quad FX platform back in the day. And I'm pretty sure there were more than a few enthusiasts running Socket G34 Opterons around the turn of the decade as well.
 

kjboughton

Senior member
Dec 19, 2007
330
5
116
#24
But, but, BUT..... it’s nothing to do with NUMA!
You guys are funny.
 
