News 2990WX Threadripper Performance Regression FIXED (for certain workloads) on Windows*

Jun 5, 2002
162
7
101
#51
The Windows scheduler is not senselessly moving threads around over and over. It does so to prevent stress on the TIM by making the heat distribution more uniform, and it also prevents single cores from degrading over time by running at full turbo all the time. There is also no performance penalty whatsoever for normal CPUs, so... win-win.
I don't know whether it was ever really necessary, but at least now it's a totally worthless and stupid thing to do. Even Intel systems have preferred, high-binned turbo cores that boost higher than the other cores, so a high-priority thread has to be run on one single core all the time. Presumably that isn't harming anything, or else the whole preferred turbo-core scheme is faulty.
 

ub4ty

Senior member
Jun 21, 2017
749
304
96
#52
There's nothing wrong w/ AMD's processor 1st off.
The architectural concept is known as NUMA, and an OS should support it intelligently at the scheduler level. The 16+ core count Threadrippers do have a unique configuration, but NUMA stands for Non-Uniform Memory Access, and this is just another NUMA config. If Windows isn't supporting this properly, Windows is the issue. Windows 10 no doubt supports the power standards that differentiate a hard-plugged desktop from a battery-powered mobile device. The Windows scheduler is just hot garbage (period). That being said, any software that can scale to 32 cores should have the proper plugs and settings to handle NUMA configurations and core parking. Overall, the least of the blame falls on AMD. Lastly, anyone buying these processors should have known this. I avoided buying them for these very reasons. I'm not comfortable with a whole die not having direct I/O access. You're obviously going to take a performance hit, and for my compute needs that'd be too much. That being said, I could never imagine running these processors on Winblows.

https://stackoverflow.com/questions/28921328/why-does-windows-switch-processes-between-processors
There are solid and understood reasons for this behavior but I've even known Windows in the past to spaz out and do this way too often or have severe scheduling bugs.

Linux is no stranger to this issue either:
https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/
 

Shivansps

Platinum Member
Sep 11, 2013
2,340
132
126
#53
The concept of NUMA nodes without local memory is new to the Windows ecosystem, and I knew this was going to be an issue since day 1.
 

TheELF

Platinum Member
Dec 22, 2012
2,637
44
106
#54
I don't know whether it was ever really necessary, but at least now it's a totally worthless and stupid thing to do. Even Intel systems have preferred, high-binned turbo cores that boost higher than the other cores, so a high-priority thread has to be run on one single core all the time. Presumably that isn't harming anything, or else the whole preferred turbo-core scheme is faulty.
https://www.nsaneforums.com/topic/294848-intel-turbo-boost-max-technology-30-v1001031-x64/
Yes, exactly. It's called Intel Turbo Boost Max Technology 3.0, and Intel has a tool that will diagnose your CPU to find out how high each core boosts and tell the task scheduler to prefer the best cores. It also allows you to select cores manually.
If you rely on Task Manager alone you will not get the best performance.
 

TheELF

Platinum Member
Dec 22, 2012
2,637
44
106
#55
There's nothing wrong w/ AMD's processor 1st off.
There is nothing wrong with them, but they are also not mainstream CPUs. You can't expect them to run at top performance on the same OS that has to work on Celerons and Atoms.
 

ub4ty

Senior member
Jun 21, 2017
749
304
96
#56
There is nothing wrong with them, but they are also not mainstream CPUs. You can't expect them to run at top performance on the same OS that has to work on Celerons and Atoms.
Agreed. This product was a write-off for me for several reasons pre-launch, when I surmised that two dies wouldn't have direct I/O access. I'll be waiting to see if AMD is going to address this by giving the new Threadripper the I/O chip Rome has. 16 cores is the sensible limit for me up until that point.
 

Markfw

CPU Moderator, VC&G Moderator, Elite Member
Super Moderator
May 16, 2002
16,947
359
136
#57
Agreed. This product was a write-off for me for several reasons pre-launch, when I surmised that two dies wouldn't have direct I/O access. I'll be waiting to see if AMD is going to address this by giving the new Threadripper the I/O chip Rome has. 16 cores is the sensible limit for me up until that point.
I am not worried about the I/O. While it slows things down a little, with my 3.6 GHz on all cores (OC'ed) compared to the 7601 @ 3 GHz, I still have more throughput at less than half the price.
 

moinmoin

Senior member
Jun 1, 2017
625
144
96
#58
Thread migration by the Windows scheduler is a known and regularly discussed issue, even here on this forum.

Microsoft's .NET documentation states:
A process thread can migrate from processor to processor, with each migration reloading the processor cache. Specifying a processor for a thread can improve performance under heavy system loads by reducing the number of times the processor cache is reloaded.

The source of the problem appears to be that the Windows scheduler combines context switches with thread migration, something that was essentially a necessity for load balancing on chips with few hardware threads. The time slice available to a thread, a quantum, is (or used to be?) a multiple of the clock interval (which depends on the CPU, is reportedly around 15 ms, and can be measured with ClockRes). As of XP/2003/Vista a short quantum (desktop usage) is 2 times that (so around 30 ms), whereas a long quantum (server usage) is 12 times that (so around 180 ms). So every time a thread's time slice is used up, there is a high possibility of it being moved to another available hardware thread, trashing the current cache, requiring the needed data to be moved over, and waking that hardware thread up if it was idle before.
If this has changed since and there's better documentation, I, and I'm sure many others, would be glad to see it. The fact that discussions about the Windows scheduler and its thread migration have been going on ever since chips with more than 2 hardware threads became easily accessible makes me think not much has changed.
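As a rough illustration of the quantum arithmetic in this post, here is a minimal Python sketch. The 15 ms clock interval and the 2x / 12x multipliers are the figures reported above, not measured values, and the function names are made up for this example. The "switch points per second" number is only an upper bound on how often a thread that always uses its full quantum could be rescheduled (and therefore possibly migrated), not a measurement of actual migrations.

```python
# Figures as reported in the post; real systems may differ.
CLOCK_INTERVAL_MS = 15.0   # reported approximate clock interval
SHORT_QUANTUM_TICKS = 2    # desktop usage
LONG_QUANTUM_TICKS = 12    # server usage


def quantum_ms(ticks, clock_interval_ms=CLOCK_INTERVAL_MS):
    """Length of one quantum in milliseconds."""
    return ticks * clock_interval_ms


def switch_points_per_second(ticks, clock_interval_ms=CLOCK_INTERVAL_MS):
    """Upper bound on quantum expirations per second for a thread
    that always runs until its quantum is used up."""
    return 1000.0 / quantum_ms(ticks, clock_interval_ms)


if __name__ == "__main__":
    for name, ticks in (("short", SHORT_QUANTUM_TICKS), ("long", LONG_QUANTUM_TICKS)):
        print(f"{name} quantum: {quantum_ms(ticks):.0f} ms, "
              f"up to {switch_points_per_second(ticks):.1f} switch points/s")
```

With these assumed numbers, a busy thread on a desktop configuration reaches a possible migration point roughly six times as often as on a server configuration.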

Microsoft's answer to this was for software developers to make use of UMS (user-mode scheduling) or to set processor affinity manually. Meanwhile, third parties try to fill the unoptimized void with the likes of the independent Process Lasso, Intel's Turbo Boost Max Technology 3.0, AMD's Dynamic Local Mode, etc.
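For readers who have not set affinity by hand, here is a minimal Python sketch of the idea. It is an assumption-laden illustration, not what any of the tools above do: it uses `os.sched_setaffinity`, which Python only provides on Linux and some other Unix systems. On Windows the equivalent step would go through the Win32 `SetProcessAffinityMask` / `SetThreadAffinityMask` calls or a third-party library, which this sketch does not attempt. The helper name `pin_current_process` is made up for this example.

```python
import os


def pin_current_process(cpus):
    """Restrict the calling process to the given set of logical CPUs.

    Returns the resulting affinity set, or None on platforms where
    Python does not expose os.sched_setaffinity (e.g. Windows, macOS).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, cpus)      # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)
        target = {min(allowed)}        # pick one CPU we are allowed to use
        print("pinned to:", pin_current_process(target))
        os.sched_setaffinity(0, allowed)  # restore the original affinity
    else:
        print("affinity control is not available through os on this platform")
```

Once a process is pinned like this, the scheduler can no longer migrate it between cores, which is the cache-reload saving the .NET documentation quoted above refers to.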
 

Mockingbird

Senior member
Feb 12, 2017
572
49
96
#59
Post deleted due to Off-topic
Markfw
Anandtech Moderator
 
Last edited by a moderator:
Jun 5, 2002
162
7
101
#60
Has anyone tested whether this fix also boosts normal desktop Ryzen performance? Desktop Ryzens with split L3 will also suffer if the scheduler pointlessly moves threads between CCXs.
 
Apr 15, 2012
1,601
59
136
#61
AMD comments on the problem.

https://www.anandtech.com/show/1385...eadripper-2-performance-and-windows-scheduler

TL;DR

Microsoft's scheduler basically assumes you're running a VM host machine if you have multiple NUMA nodes. The way the scheduler handles this is by assigning a best NUMA node, and any thread assigned to a best NUMA node gets priority and kicks other threads off that node, whether or not they are also on their best NUMA node.

So, if you are trying to use multiple nodes to run the same 1-2 heavy compute tasks, then all their threads get assigned to the same "best" NUMA node and are constantly kicking each other off that node over and over again, causing a severe performance regression. Microsoft fixed this for 2 NUMA nodes at some point because Intel workstations with 2 NUMA nodes were becoming more popular, but anything over 2 doesn't have this fix. AMD has multiple support tickets in to fix the problem. Linux handles multiple NUMA nodes much more intelligently.
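To make the pile-up described above concrete, here is a toy Python model. It is not the Windows scheduler and makes no claim about its internals. It only counts how many threads end up sharing one "best" node under two hypothetical assignment policies. The node and core counts (4 nodes of 8 cores) and all names are assumptions chosen for illustration.

```python
from collections import Counter


def assign_best_nodes(num_threads, num_nodes, policy):
    """Return the 'best' node index for each thread under a toy policy."""
    if policy == "same_node":
        # Every thread of the task gets the same best node.
        return [0] * num_threads
    if policy == "round_robin":
        # Threads are spread evenly over all nodes.
        return [t % num_nodes for t in range(num_threads)]
    raise ValueError(f"unknown policy: {policy}")


def oversubscription(assignment, num_nodes, cores_per_node):
    """Number of threads that do not fit on their best node's cores."""
    counts = Counter(assignment)
    return sum(max(0, counts[n] - cores_per_node) for n in range(num_nodes))


if __name__ == "__main__":
    for policy in ("same_node", "round_robin"):
        nodes = assign_best_nodes(32, 4, policy)
        print(policy, "->", oversubscription(nodes, 4, 8), "threads without a free core")
```

In this toy model, threads that do not fit on their best node are the ones that would keep displacing each other.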
 

ub4ty

Senior member
Jun 21, 2017
749
304
96
#62
Winblows strikes again. Obvious is obvious.
 

Schmide

Diamond Member
Mar 7, 2002
5,232
49
106
#63
Man, I really get the feeling this is a stupid mask issue. Especially with Ian's finding that excluding core 0 gives better performance.

Somewhere the core provisioning routine sees one more core than there actually is.

AKA the mask for 32 spots is 32-1 = 31
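For reference, here is how an affinity bitmask for 32 slots works out, as a minimal Python sketch. It only illustrates the bit arithmetic behind the off-by-one guess above and says nothing about what the Windows code actually does. The helper names are made up for this example.

```python
def full_mask(num_cpus):
    """Bitmask with one bit set for each of num_cpus logical CPUs."""
    return (1 << num_cpus) - 1


def exclude_cpu(mask, cpu):
    """Clear the bit for a single CPU index."""
    return mask & ~(1 << cpu)


def cpus_in_mask(mask):
    """List the CPU indices whose bits are set in mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


if __name__ == "__main__":
    mask = full_mask(32)
    print(hex(mask))                         # all 32 bits set
    print(max(cpus_in_mask(mask)))           # highest valid index is 31, not 32
    print(hex(exclude_cpu(mask, 0)))         # same mask with core 0 removed
```

The 32 slots are numbered 0 through 31, so a routine that treats 32 as a valid index would be looking one slot past the end of the mask.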
 
Jun 12, 2014
314
24
101
#65
In other words, Microsoft managed to fix Windows just enough to avoid a repeat of Quad FX, but left it sufficiently broken to tank performance with any more NUMA nodes than that. Why am I not surprised?
 

