News 2990WX Threadripper Performance Regression FIXED (for certain workloads) on Windows*


IEC

Elite Member
Super Moderator
Jun 10, 2004
14,330
4,918
136
*Thread title was originally a copy-paste from video title. Editorial comment added in parentheses.

Per Level1Techs, it appears there is a Windows kernel bug behind the strange results like the TR 2990WX losing to the TR 2950X in tests such as Adobe Premiere, Indigo Renderer, Blender, 7-Zip, etc. It turns out it's mostly not a memory bandwidth issue: it's a Windows scheduler bug that burns CPU cycles unproductively in how it handles >2 NUMA nodes (possibly due to a band-aid fix for XCC Xeons). He demonstrates this by comparing a 2990WX and an Epyc 7551 on both Windows and Linux, and by using coreprio to manipulate the performance.

Full article:
https://level1techs.com/article/unlocking-2990wx-less-numa-aware-apps

Video:

Conclusion:
"The rumors of a memory bandwidth problem, even with 32 cores (at least in these instances), has been greatly exaggerated."

Interpretation:
With server-like CPUs now easily available for consumers, Microsoft has some catching up if they want us to run Windows rather than Linux.

Update 1/14/2019:
AMD comments on Threadripper 2 Performance and the Windows Scheduler (AT article by Ian Cutress)
 
Last edited:

naukkis

Senior member
Jun 5, 2002
706
578
136
The Windows scheduler is not senselessly moving threads around over and over; it does so to prevent stress on the TIM by making the heat distribution more uniform. It also prevents single cores from degrading over time from running at full turbo all the time, and there is no performance penalty whatsoever for normal CPUs, so... win-win.

I don't know if it was ever really necessary at some point, but by now it's a totally worthless and stupid thing to do. Even Intel systems have preferred, higher-binned turbo cores that boost higher than the other cores, so a high-priority thread has to be run on one single core all the time. And supposedly that isn't harming anything, or else the whole preferred-turbo-core scheme is faulty.
 
  • Like
Reactions: Encrypted11

ub4ty

Senior member
Jun 21, 2017
749
898
96
There's nothing wrong w/ AMD's processor 1st off.
The architectural concept is known as NUMA and an OS should support it intelligently at the scheduler level. The 16+ core count threadrippers have a unique configuration indeed but NUMA stands for Non-uniform Memory Access and that just falls under another NUMA config. If Windows isn't supporting this properly, it's the issue. Windows 10 no doubt supports the power standards that differentiate a hard plugged desktop vs a mobile battery powered device. Windows scheduler is just hot garbage (period). That being said, any software that can scale to 32 cores should have the proper plugs and settings to handle NUMA configurations and core parking. Overall, the least of all problems fall on AMD. Lastly, anyone buying these processors should have known this. I avoided buying them for these very reasons. I'm not comfortable with a whole die not having direct I/O access. You're obviously going to take a performance hit and for my compute needs that'd be too much. That being said, I could never imagine running these processors on Winblows.

https://stackoverflow.com/questions/28921328/why-does-windows-switch-processes-between-processors
There are solid and understood reasons for this behavior but I've even known Windows in the past to spaz out and do this way too often or have severe scheduling bugs.

Linux is also not foreign to this issue :
https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/
 

Shivansps

Diamond Member
Sep 11, 2013
3,855
1,518
136
There's nothing wrong w/ AMD's processor 1st off.
The architectural concept is known as NUMA and an OS should support it intelligently at the scheduler level. The 16+ core count threadrippers have a unique configuration indeed but NUMA stands for Non-uniform Memory Access and that just falls under another NUMA config. If Windows isn't supporting this properly, it's the issue. Windows 10 no doubt supports the power standards that differentiate a hard plugged desktop vs a mobile battery powered device. Windows scheduler is just hot garbage (period). That being said, any software that can scale to 32 cores should have the proper plugs and settings to handle NUMA configurations and core parking. Overall, the least of all problems fall on AMD. Lastly, anyone buying these processors should have known this. I avoided buying them for these very reasons. I'm not comfortable with a whole die not having direct I/O access. You're obviously going to take a performance hit and for my compute needs that'd be too much. That being said, I could never imagine running these processors on Winblows.

https://stackoverflow.com/questions/28921328/why-does-windows-switch-processes-between-processors
There are solid and understood reasons for this behavior but I've even known Windows in the past to spaz out and do this way too often or have severe scheduling bugs.

Linux is also not foreign to this issue :
https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/

The concept of NUMA nodes without local memory is new to the Windows ecosystem, and I knew this was going to be an issue since day 1.
 
  • Like
Reactions: ozzy702

TheELF

Diamond Member
Dec 22, 2012
3,973
730
126
I don't know if it was ever really necessary at some point, but by now it's a totally worthless and stupid thing to do. Even Intel systems have preferred, higher-binned turbo cores that boost higher than the other cores, so a high-priority thread has to be run on one single core all the time. And supposedly that isn't harming anything, or else the whole preferred-turbo-core scheme is faulty.
https://www.nsaneforums.com/topic/294848-intel-turbo-boost-max-technology-30-v1001031-x64/
Yes, exactly. It's called Intel Turbo Boost Max Technology 3.0, and Intel has a tool that will diagnose your CPU to find which cores boost how high and tell the task scheduler to prefer those cores; it also allows you to select cores manually.
If you rely on task manager alone you will not get the best performance.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
There is nothing wrong with them, but they are also not mainstream CPUs; you can't expect them to run at top performance on the same OS that has to work on Celerons and Atoms.
Agreed. This product was a write-off for me for several reasons pre-launch, when I surmised that two dies wouldn't have direct I/O access. I'll be waiting to see if AMD addresses this by giving the new Threadripper the I/O die Rome has. 16 cores is the sensible limit for me until that point.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,556
14,512
136
Agreed. This product was a write-off for me for several reasons pre-launch, when I surmised that two dies wouldn't have direct I/O access. I'll be waiting to see if AMD addresses this by giving the new Threadripper the I/O die Rome has. 16 cores is the sensible limit for me until that point.
I am not worried about the I/O; while it slows things down a little, with my 3.6 GHz on all cores (OC'ed) compared to the 7601 @ 3 GHz, I still have more throughput at less than half the price.
 

moinmoin

Diamond Member
Jun 1, 2017
4,952
7,661
136
Thread migration by the Windows scheduler is a known and regularly discussed issue, even here on this forum, e.g.:

Microsoft's .NET documentations states:
A process thread can migrate from processor to processor, with each migration reloading the processor cache. Specifying a processor for a thread can improve performance under heavy system loads by reducing the number of times the processor cache is reloaded.

The source of the problem appears to be that the Windows scheduler combines context switches with thread migration, something that was essentially a necessity for load balancing on chips with few hardware threads. The time slice available to a thread, a quantum, is (or used to be?) a multiple of a clock interval (which is dependent on the CPU, reportedly around 15 ms; it can be measured with ClockRes). As of XP/2003/Vista, a short quantum (desktop usage) is 2 times that (so around 30 ms), whereas a long quantum (server usage) is 12 times (so around 180 ms). So every time a thread's time slice is used up, there is a high possibility of it being moved to another available hardware thread, thrashing the current cache, needing to move the required data over, and waking that hardware thread up if it was idle before. More documentation
If this has changed since and there's better documentation, I, and I'm sure many others, would be glad to see it. The fact that discussions about the Windows scheduler and its thread migration have been going on ever since chips with more than 2 hardware threads became easily accessible makes me think not much has changed.

Microsoft's suggested optimization here was for software developers to make use of UMS (user-mode scheduling) or to manually set processor affinity. Meanwhile, third parties try to fill the unoptimized void with the likes of the independent Process Lasso, Intel's Turbo Boost Max Technology 3.0, AMD's Dynamic Local Mode, etc.
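For what it's worth, the manual-affinity workaround can be sketched in a few lines. This is a Linux-side sketch using Python's `os.sched_setaffinity` (on Windows the analogous call is `SetProcessAffinityMask`); the choice of core subset is purely illustrative:

```python
import os

# Pin the calling process to a subset of hardware threads (e.g. one
# NUMA node's worth of cores), instead of letting the scheduler
# migrate it across all of them.
allowed = sorted(os.sched_getaffinity(0))            # 0 = calling process
subset = set(allowed[: max(1, len(allowed) // 2)])   # illustrative subset
os.sched_setaffinity(0, subset)
print(sorted(os.sched_getaffinity(0)))               # now restricted
```

Tools like Process Lasso or coreprio effectively automate this kind of pinning from the outside.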
 

naukkis

Senior member
Jun 5, 2002
706
578
136
Has anyone tried whether this fix also boosts normal desktop Ryzen performance? Desktop Ryzens with a split L3 will also suffer if the scheduler pointlessly jumps threads between CCXs.
 

Hitman928

Diamond Member
Apr 15, 2012
5,282
7,907
136
AMD comments on the problem.

https://www.anandtech.com/show/1385...eadripper-2-performance-and-windows-scheduler

TL;DR

Microsoft's scheduler basically assumes you're running a VM host machine if you have multiple NUMA nodes. It handles this by assigning each thread a "best" NUMA node, and any thread running on its best NUMA node gets priority and kicks other threads off that node, whether or not it is also their best NUMA node.

So if you are trying to use multiple nodes to run the same 1-2 heavy compute tasks, they all get assigned the same "best" NUMA node and constantly kick each other off that node over and over again, causing a severe performance regression. Microsoft fixed this for 2 NUMA nodes at some point as Intel workstations with 2 NUMA nodes became more popular, but anything over 2 doesn't have the fix. AMD has multiple support tickets in to get the problem fixed. Linux handles multiple NUMA nodes much more intelligently.
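The eviction churn described above can be illustrated with a toy model (all names and numbers here are mine, not from the article or from AMD): if every heavy thread is handed the same "best" node, they endlessly kick each other off it, while spreading them one per node produces no evictions at all.

```python
# Toy model (not the real scheduler): count evictions when every heavy
# thread is assigned the same "preferred" node vs. spread across nodes.
NODES = 4    # e.g. the 2990WX's four NUMA nodes
THREADS = 4
TICKS = 100

def run(prefer):
    """prefer(thread_id) -> node the scheduler places that thread on."""
    evictions = 0
    occupant = [None] * NODES   # simplification: one heavy thread per node
    for _tick in range(TICKS):
        for t in range(THREADS):
            node = prefer(t)
            if occupant[node] not in (None, t):
                evictions += 1          # kick the other thread off
            occupant[node] = t
    return evictions

same_best = run(lambda t: 0)            # everyone gets node 0
spread    = run(lambda t: t % NODES)    # one thread per node
print(same_best, spread)                # constant churn vs. no churn
```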
 
  • Like
Reactions: lightmanek

ub4ty

Senior member
Jun 21, 2017
749
898
96
AMD comments on the problem.

https://www.anandtech.com/show/1385...eadripper-2-performance-and-windows-scheduler

TL;DR

Microsoft's scheduler basically assumes you're running a VM host machine if you have multiple NUMA nodes. It handles this by assigning each thread a "best" NUMA node, and any thread running on its best NUMA node gets priority and kicks other threads off that node, whether or not it is also their best NUMA node.

So if you are trying to use multiple nodes to run the same 1-2 heavy compute tasks, they all get assigned the same "best" NUMA node and constantly kick each other off that node over and over again, causing a severe performance regression. Microsoft fixed this for 2 NUMA nodes at some point as Intel workstations with 2 NUMA nodes became more popular, but anything over 2 doesn't have the fix. AMD has multiple support tickets in to get the problem fixed. Linux handles multiple NUMA nodes much more intelligently.
Winblows strikes again. Obvious is obvious.
 

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
Man, I really get the feeling this is a stupid mask issue, especially with Ian's result that excluding core 0 gets better performance.

Somewhere the core provisioning routine sees one more core than there actually is.

AKA the mask for 32 spots is 32-1 = 31
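The off-by-one being speculated about is easy to picture with affinity-mask arithmetic (purely illustrative, not the actual Windows code):

```python
# An affinity mask has one bit per hardware thread.
def full_mask(n_cpus):
    return (1 << n_cpus) - 1

print(hex(full_mask(32)))  # 0xffffffff: all 32 threads usable
# The speculated bug: building the mask for 31 CPUs instead of 32
# silently drops the top hardware thread.
print(hex(full_mask(31)))  # 0x7fffffff: thread 31 missing
```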
 

BigDaveX

Senior member
Jun 12, 2014
440
216
116
In other words, Microsoft managed to fix Windows just enough to avoid a repeat of Quad FX, but left it sufficiently broken to tank performance with any more NUMA nodes than that. Why am I not surprised?