News 2990WX Threadripper Performance Regression FIXED (for certain workloads) on Windows*


IEC

Elite Member
Super Moderator
Jun 10, 2004
14,330
4,918
136
*Thread title was originally a copy-paste from video title. Editorial comment added in parentheses.

Per Level1Techs, it appears there is a Windows kernel bug behind the strange results like the TR 2990WX losing to the TR 2950X in tests such as Adobe Premiere, Indigo Renderer, Blender, 7-Zip, etc. It turns out it's mostly not a memory bandwidth issue: it's a Windows scheduler bug that burns CPU cycles unproductively in how it handles >2 NUMA nodes (possibly due to a band-aid fix for XCC Xeons). He demonstrates this by comparing a 2990WX and an Epyc 7551 on both Windows and Linux, and by using coreprio to manipulate the performance.

Full article:
https://level1techs.com/article/unlocking-2990wx-less-numa-aware-apps

Video:

Conclusion:
"The rumors of a memory bandwidth problem, even with 32 cores (at least in these instances), has been greatly exaggerated."

Interpretation:
With server-like CPUs now easily available for consumers, Microsoft has some catching up if they want us to run Windows rather than Linux.

Update 1/14/2019:
AMD comments on Threadripper 2 Performance and the Windows Scheduler (AT article by Ian Cutress)
 
Last edited:

naukkis

Senior member
Jun 5, 2002
706
578
136
The Windows scheduler is not senselessly moving threads around over and over; it does so to prevent stress on the TIM by making the heat distribution more uniform. It also prevents single cores from degrading over time from running at full turbo all the time, and there is no performance penalty whatsoever for normal CPUs, so... win-win.

I don't know if it was ever really necessary at some point, but by now it's a totally worthless and stupid thing to do. Even Intel systems have preferred, higher-binned turbo cores that boost higher than the other cores, so a high-priority thread has to be run on one single core all the time. And supposedly that isn't harming anything, or else the whole preferred-turbo-core scheme is faulty.
 
  • Like
Reactions: Encrypted11

ub4ty

Senior member
Jun 21, 2017
749
898
96
There's nothing wrong w/ AMD's processor 1st off.
The architectural concept is known as NUMA and an OS should support it intelligently at the scheduler level. The 16+ core count threadrippers have a unique configuration indeed but NUMA stands for Non-uniform Memory Access and that just falls under another NUMA config. If Windows isn't supporting this properly, it's the issue. Windows 10 no doubt supports the power standards that differentiate a hard plugged desktop vs a mobile battery powered device. Windows scheduler is just hot garbage (period). That being said, any software that can scale to 32 cores should have the proper plugs and settings to handle NUMA configurations and core parking. Overall, the least of all problems fall on AMD. Lastly, anyone buying these processors should have known this. I avoided buying them for these very reasons. I'm not comfortable with a whole die not having direct I/O access. You're obviously going to take a performance hit and for my compute needs that'd be too much. That being said, I could never imagine running these processors on Winblows.

https://stackoverflow.com/questions/28921328/why-does-windows-switch-processes-between-processors
There are solid and understood reasons for this behavior but I've even known Windows in the past to spaz out and do this way too often or have severe scheduling bugs.

Linux is also not foreign to this issue :
https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/
 

Shivansps

Diamond Member
Sep 11, 2013
3,855
1,518
136
There's nothing wrong w/ AMD's processor 1st off.
The architectural concept is known as NUMA and an OS should support it intelligently at the scheduler level. The 16+ core count threadrippers have a unique configuration indeed but NUMA stands for Non-uniform Memory Access and that just falls under another NUMA config. If Windows isn't supporting this properly, it's the issue. Windows 10 no doubt supports the power standards that differentiate a hard plugged desktop vs a mobile battery powered device. Windows scheduler is just hot garbage (period). That being said, any software that can scale to 32 cores should have the proper plugs and settings to handle NUMA configurations and core parking. Overall, the least of all problems fall on AMD. Lastly, anyone buying these processors should have known this. I avoided buying them for these very reasons. I'm not comfortable with a whole die not having direct I/O access. You're obviously going to take a performance hit and for my compute needs that'd be too much. That being said, I could never imagine running these processors on Winblows.

https://stackoverflow.com/questions/28921328/why-does-windows-switch-processes-between-processors
There are solid and understood reasons for this behavior but I've even known Windows in the past to spaz out and do this way too often or have severe scheduling bugs.

Linux is also not foreign to this issue :
https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/

The concept of NUMA nodes without local memory is new to the Windows ecosystem, and I knew this was going to be an issue since day 1.
 
  • Like
Reactions: ozzy702

TheELF

Diamond Member
Dec 22, 2012
3,973
730
126
I don't know if it was ever really necessary at some point, but by now it's a totally worthless and stupid thing to do. Even Intel systems have preferred, higher-binned turbo cores that boost higher than the other cores, so a high-priority thread has to be run on one single core all the time. And supposedly that isn't harming anything, or else the whole preferred-turbo-core scheme is faulty.
https://www.nsaneforums.com/topic/294848-intel-turbo-boost-max-technology-30-v1001031-x64/
Yes, exactly. It's called Intel Turbo Boost Max Technology 3.0, and Intel has a tool that will diagnose your CPU to find which cores boost how high and tell the task scheduler to prefer those cores; it also allows you to select cores manually.
If you rely on task manager alone you will not get the best performance.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
There is nothing wrong with them, but they are also not mainstream CPUs; you can't expect them to run at top performance on the same OS that has to work on Celerons and Atoms.
Agreed. This product was a write-off for me for several reasons pre-launch, when I surmised that two dies wouldn't have direct I/O access. I'll be waiting to see if AMD addresses this by giving the new Threadripper the I/O die Rome has. 16 cores is the sensible limit for me until that point.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,556
14,512
136
Agreed. This product was a write-off for me for several reasons pre-launch, when I surmised that two dies wouldn't have direct I/O access. I'll be waiting to see if AMD addresses this by giving the new Threadripper the I/O die Rome has. 16 cores is the sensible limit for me until that point.
I am not worried about the I/O; while it slows things down a little, with my 3.6 GHz on all cores (OC'ed) compared to the 7601 @ 3 GHz, I still have more throughput at less than half the price.
 

moinmoin

Diamond Member
Jun 1, 2017
4,952
7,661
136
Thread migration by the Windows scheduler is a known and regularly discussed issue, even here on this forum, e.g.:

Microsoft's .NET documentations states:
A process thread can migrate from processor to processor, with each migration reloading the processor cache. Specifying a processor for a thread can improve performance under heavy system loads by reducing the number of times the processor cache is reloaded.

The source of the problem appears to be that the Windows scheduler combines context switches with thread migration, something that was essentially a necessity for load balancing on chips with few hardware threads. The time slice available to a thread, a quantum, is (or used to be?) a multiple of a clock interval (which is dependent on the CPU, reportedly around 15 ms; it can be measured with ClockRes). As of XP/2003/Vista, a short quantum (desktop usage) is 2 times that (so around 30 ms), whereas a long quantum (server usage) is 12 times (so around 180 ms). So every time a thread's time slice is used up, there is a high possibility of it being moved to another available hardware thread, thrashing the current cache, needing to move the required data over, and waking that hardware thread up if it was idle before. More documentation
If this has changed since and there's better documentation, I, and I'm sure many others, would be glad to see it. The fact that discussions about the Windows scheduler and its thread migration have been going on ever since chips with more than 2 hardware threads became easily accessible makes me think not much has changed.

Microsoft's suggested optimization here was for software developers to make use of UMS (user-mode scheduling) or to manually set processor affinity. Meanwhile, third parties try to fill the unoptimized void with the likes of the independent Process Lasso, Intel's Turbo Boost Max Technology 3.0, AMD's Dynamic Local Mode, etc.
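For what it's worth, the manual-affinity workaround can be sketched in a few lines. This is a Linux-side sketch using Python's `os.sched_setaffinity` (on Windows the analogous call is `SetProcessAffinityMask`); the choice of core subset is purely illustrative:

```python
import os

# Pin the calling process to a subset of hardware threads (e.g. one
# NUMA node's worth of cores), instead of letting the scheduler
# migrate it across all of them.
allowed = sorted(os.sched_getaffinity(0))            # 0 = calling process
subset = set(allowed[: max(1, len(allowed) // 2)])   # illustrative subset
os.sched_setaffinity(0, subset)
print(sorted(os.sched_getaffinity(0)))               # now restricted
```

Tools like Process Lasso or coreprio effectively automate this kind of pinning from the outside.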
 

naukkis

Senior member
Jun 5, 2002
706
578
136
Has anyone tried whether this fix also boosts normal desktop Ryzen performance? Desktop Ryzens with a split L3 will also suffer if the scheduler pointlessly jumps threads between CCXs.
 

Hitman928

Diamond Member
Apr 15, 2012
5,282
7,907
136
AMD comments on the problem.

https://www.anandtech.com/show/1385...eadripper-2-performance-and-windows-scheduler

TL;DR

Microsoft's scheduler basically assumes you're running a VM host machine if you have multiple NUMA nodes. It handles this by assigning each thread a "best" NUMA node, and any thread running on its best NUMA node gets priority and kicks other threads off that node, whether or not it is also their best NUMA node.

So if you are trying to use multiple nodes to run the same 1-2 heavy compute tasks, they all get assigned the same "best" NUMA node and constantly kick each other off that node over and over again, causing a severe performance regression. Microsoft fixed this for 2 NUMA nodes at some point as Intel workstations with 2 NUMA nodes became more popular, but anything over 2 doesn't have the fix. AMD has multiple support tickets in to get the problem fixed. Linux handles multiple NUMA nodes much more intelligently.
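The eviction churn described above can be illustrated with a toy model (all names and numbers here are mine, not from the article or from AMD): if every heavy thread is handed the same "best" node, they endlessly kick each other off it, while spreading them one per node produces no evictions at all.

```python
# Toy model (not the real scheduler): count evictions when every heavy
# thread is assigned the same "preferred" node vs. spread across nodes.
NODES = 4    # e.g. the 2990WX's four NUMA nodes
THREADS = 4
TICKS = 100

def run(prefer):
    """prefer(thread_id) -> node the scheduler places that thread on."""
    evictions = 0
    occupant = [None] * NODES   # simplification: one heavy thread per node
    for _tick in range(TICKS):
        for t in range(THREADS):
            node = prefer(t)
            if occupant[node] not in (None, t):
                evictions += 1          # kick the other thread off
            occupant[node] = t
    return evictions

same_best = run(lambda t: 0)            # everyone gets node 0
spread    = run(lambda t: t % NODES)    # one thread per node
print(same_best, spread)                # constant churn vs. no churn
```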
 
  • Like
Reactions: lightmanek

ub4ty

Senior member
Jun 21, 2017
749
898
96
AMD comments on the problem.

https://www.anandtech.com/show/1385...eadripper-2-performance-and-windows-scheduler

TL;DR

Microsoft's scheduler basically assumes you're running a VM host machine if you have multiple NUMA nodes. It handles this by assigning each thread a "best" NUMA node, and any thread running on its best NUMA node gets priority and kicks other threads off that node, whether or not it is also their best NUMA node.

So if you are trying to use multiple nodes to run the same 1-2 heavy compute tasks, they all get assigned the same "best" NUMA node and constantly kick each other off that node over and over again, causing a severe performance regression. Microsoft fixed this for 2 NUMA nodes at some point as Intel workstations with 2 NUMA nodes became more popular, but anything over 2 doesn't have the fix. AMD has multiple support tickets in to get the problem fixed. Linux handles multiple NUMA nodes much more intelligently.
Winblows strikes again. Obvious is obvious.
 

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
Man, I really get the feeling this is a stupid mask issue, especially with Ian's result that excluding core 0 gets better performance.

Somewhere the core provisioning routine sees one more core than there actually is.

AKA the mask for 32 spots is 32-1 = 31
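The off-by-one being speculated about is easy to picture with affinity-mask arithmetic (purely illustrative, not the actual Windows code):

```python
# An affinity mask has one bit per hardware thread.
def full_mask(n_cpus):
    return (1 << n_cpus) - 1

print(hex(full_mask(32)))  # 0xffffffff: all 32 threads usable
# The speculated bug: building the mask for 31 CPUs instead of 32
# silently drops the top hardware thread.
print(hex(full_mask(31)))  # 0x7fffffff: thread 31 missing
```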
 

BigDaveX

Senior member
Jun 12, 2014
440
216
116
In other words, Microsoft managed to fix Windows just enough to avoid a repeat of Quad FX, but left it sufficiently broken to tank performance with any more NUMA nodes than that. Why am I not surprised?