Ryzen: Strictly technical

Page 27 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.
Status
Not open for further replies.

malventano

Junior Member
May 27, 2009
18
19
76
PCPer.com
The problem is the "Processor group" split, not the "NUMA group" split. If you could create two NUMA groups, without creating two Processor groups, it might give a better result, but it doesn't look like Microsoft is offering that option.

Actually, when I tried testing the split on Ryzen this morning, it did not work as it does with other CPUs. It only ended up disabling the cores after the boundary that should have started an additional NUMA node. Perhaps there is some whitelisting / configuration that needs to be coded into Windows to properly support manual NUMA splitting with Ryzen.

Either way, at least we can put the scheduler rumors to bed now that AMD has released a statement confirming the scheduler is not the issue (in Windows 10 and in Windows 7).
 
  • Like
Reactions: looncraz

Kromaatikse

Member
Mar 4, 2017
83
169
56
Colour me confused, then.

My best interpretation of AMD's statement is: Windows is choosing the cores in the right order for performance (physical before virtual). But this is not implemented by the scheduler - rather, by the core parking algorithm.

They are however saying nothing about Windows' insistence on moving threads around so restlessly. That *is* a property of the scheduler, and it *is* harmful, on both Intel and AMD CPUs.
 
  • Like
Reactions: powerrush

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Actually, when I tried testing the split on Ryzen this morning, it did not work as it does with other CPUs. It only ended up disabling the cores after the boundary that should have started an additional NUMA node. Perhaps there is some whitelisting / configuration that needs to be coded into Windows to properly support manual NUMA splitting with Ryzen.

Either way, at least we can put the scheduler rumors to bed now that AMD has released a statement confirming the scheduler is not the issue (in Windows 10 and in Windows 7).

I had the same experience when I set groupsize on Ryzen with Windows 7. Haven't tried with Windows 10, yet.

I will say one thing: Ryzen has the worst cache performance characteristics of any CPU of mine I have tested (so far) when dealing with random data.

We're talking about THREE TIMES worse... That's the main problem with some of the games.

Mind you, I'm still testing with DDR4-2133 CL15 2T settings, so that has an impact.
 
  • Like
Reactions: T1beriu and Drazick

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Colour me confused, then.

My best interpretation of AMD's statement is: Windows is choosing the cores in the right order for performance (physical before virtual). But this is not implemented by the scheduler - rather, by the core parking algorithm.

They are however saying nothing about Windows' insistence on moving threads around so restlessly. That *is* a property of the scheduler, and it *is* harmful, on both Intel and AMD CPUs.

My guess is that Microsoft plainly said: we aren't going to fix it for you, AMD. So AMD decided to stay in Microsoft's good graces, lest Microsoft further help Intel's monopoly.
 

innociv

Member
Jun 7, 2011
54
20
76
AMD engineers have to understand this scheduling problem and how this is predominately a Windows problem and not application problem...

So why does the marketing department and customer service give another answer? Has MS said "your problem not ours" and AMD doesn't want to hurt their relationship with MS...? This makes no damn sense.

lol. looncraz beat me to it.
 
  • Like
Reactions: looncraz

looncraz

Senior member
Sep 12, 2011
722
1,651
136
It is disabled.
"Comme pour au dessus, nous réalisons les tests à 3 GHz, le SMT est désactivé pour limiter la variabilité."

3GHz and SMT disabled

Thanks, I'm testing with both configurations right now. Cache performance seems to be positively abhorrent with random accesses. Even in-page.
 
  • Like
Reactions: Drazick

Kromaatikse

Member
Mar 4, 2017
83
169
56
My guess is that Microsoft plainly said: we aren't going to fix it for you, AMD. So AMD decided to stay in Microsoft's good graces, lest Microsoft further help Intel's monopoly.

So we're back to the stupid manual workaround using CPU affinity. Hooray.

This is actually quite typical Microsoft attitude. As long as they have some convoluted workaround that needs applying on an application-by-application or installation-by-installation basis, they can pad their revenue streams from Support Incidents and training courses for Microsoft Certified Highly Paid Consultants. Indirectly, it even perversely reinforces the lock-in factor, because people are loath to discard all those tweaks they've built up over the years, even if the best alternative doesn't need them.
 
  • Like
Reactions: looncraz

imported_jjj

Senior member
Feb 14, 2009
660
430
136
Thanks, I'm testing with both configurations right now. Cache performance seems to be positively abhorrent with random accesses. Even in-page.

Any guesses as to why?
BTW, found this memory latency scaling graph, wonder if it can get a bit better if they open up access to more timings.Can't quickly find any tests with bclk OC as those could be helpful.
As for M$,maybe they are upset that Ryzen works on Win 7 since forcing AMD to not officially support it wasn't enough.
amd-ryzen-ddr4-scaling-latency.jpg
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
They are however saying nothing about Windows' insistence on moving threads around so restlessly. That *is* a property of the scheduler, and it *is* harmful, on both Intel and AMD CPUs.
The justification for moving threads is simple: scheduler cycles threads on any CPU all the damn time, so trying to force a thread onto the same core is actually a gamble for cache remaining hot against a whole quantum of time for thread being wasted. Workaround is basic: force highest priority on threads you do not want to see shuffled that often.

You may argue that Win10 is ignorant of CCX structure, but that's about it.

Also, i am the only one who finds the policy of reporting junction temperature offset by 20 degrees on X SKUs for nothing but to force fan to be louder... weird?
I will say one thing: Ryzen has the worst cache performance characteristics of any CPU of mine I have tested (so far) when dealing with random data.

We're talking about THREE TIMES worse... That's the main problem with some of the games.
Wait, you are talking about L3, right (since well, L2 looks alright from what we see)? As in, it is slower than Phenom II? I'll blame memory instead of accepting it.
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
The justification for moving threads is simple: scheduler cycles threads on any CPU all the damn time, so trying to force a thread onto the same core is actually a gamble for cache remaining hot against a whole quantum of time for thread being wasted. Workaround is basic: force highest priority on threads you do not want to see shuffled that often.

As I noted earlier, this is about threads which have nothing competing with them for the same core. The optimal strategy in that case, from a cache-miss, branch-predict and context-switch perspective, is always to keep it on that same core.
 
  • Like
Reactions: rvborgh

Pookums

Member
Mar 6, 2017
32
13
36
Any guesses as to why?
BTW, found this memory latency scaling graph, wonder if it can get a bit better if they open up access to more timings.Can't quickly find any tests with bclk OC as those could be helpful.
amd-ryzen-ddr4-scaling-latency.jpg

Thank you for that, I was correct, its definitely the UMC that is the issue. (or UMC + fabric). Its half speed MEMCLK will be its limitation, and with the Memory issues is likely to not be fixed this go around. Hardlocking the UMC and fabric speeds, preventing a software workaround, is the worst decision AMD could have possibly made.

Look at 2933 cl14 vs 3200 cl 16. With those timings 2933 should be faster(though scaling is never perfect). A massive decrease of roughly 4(approximately the average jump between 266mhz leaps) should not be possible. Not unless the bottleneck is the UMC.

6900k has 2.8ghz uncore. Skylakes have substantially faster. At 3200mhz, it would have 1.6ghz on ryzen. You would need Memory speeds in the range of at least 4800mhz to be able to reach anything close to parity(likely even more as 6900k OCers raise it to around 3.5 to maximize its performance).

The only change that occured for those latency jumps is the gradual bump of UMC + fabric speeds in that benchmark.

Edit: fixed last sentence. Not sure what that was about.
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
Also, i am the only one who finds the policy of reporting junction temperature offset by 20 degrees on X SKUs for nothing but to force fan to be louder... weird?

If it's the only reported temperature directly from the CPU, then yes it is weird. I would expect to see a true reading alongside the "fan reference" tweaked one.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
So we're back to the stupid manual workaround using CPU affinity. Hooray.

This is actually quite typical Microsoft attitude. As long as they have some convoluted workaround that needs applying on an application-by-application or installation-by-installation basis, they can pad their revenue streams from Support Incidents and training courses for Microsoft Certified Highly Paid Consultants. Indirectly, it even perversely reinforces the lock-in factor, because people are loath to discard all those tweaks they've built up over the years, even if the best alternative doesn't need them.

I think AMD is preferring to work with the specific game companies whose engines have issues rather than dealing with Wintel.

I've found at least two bugs in Windows related to Ryzen (setting groupsize disables half of Ryzen cores, but acts properly on Intel (Windows 7 - haven't tested with 10, yet) and setting affinity to every other logical core forces workloads to just two cores (Windows 10, about to test on 7)).
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Any guesses as to why?
BTW, found this memory latency scaling graph, wonder if it can get a bit better if they open up access to more timings.Can't quickly find any tests with bclk OC as those could be helpful.
As for M$,maybe they are upset that Ryzen works on Win 7 since forcing AMD to not officially support it wasn't enough.
amd-ryzen-ddr4-scaling-latency.jpg


That's a difference of only 5ns - or ~15 cycles.

I'm seeing differences of 60 cycles on Intel and > 300 cycles on Ryzen.

I'll make charts soon.
 
  • Like
Reactions: Drazick

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
As I noted earlier, this is about threads which have nothing competing with them for the same core. The optimal strategy in that case, from a cache-miss, branch-predict and context-switch perspective, is always to keep it on that same core.
That would work if your threads never had to interact with OS.
If it's the only reported temperature directly from the CPU, then yes it is weird. I would expect to see a true reading alongside the "fan reference" tweaked one.
Their statement clearly implies it is the only reading of temperature from CPU.
That's a difference of only 5ns - or ~15 cycles.

I'm seeing differences of 60 cycles on Intel and > 300 cycles on Ryzen.

I'll make charts soon.
Wait, wait wait, difference between what?
 

Blake_86

Junior Member
Mar 13, 2017
21
3
36
Actually, when I tried testing the split on Ryzen this morning, it did not work as it does with other CPUs. It only ended up disabling the cores after the boundary that should have started an additional NUMA node. Perhaps there is some whitelisting / configuration that needs to be coded into Windows to properly support manual NUMA splitting with Ryzen.

Either way, at least we can put the scheduler rumors to bed now that AMD has released a statement confirming the scheduler is not the issue (in Windows 10 and in Windows 7).

Hi, i took a look at your test suite on Pcper. Assuming your result are correct, there are a few things i still cant catch.
The inter CCX connection for example. It is a fact that because of nature of the data fabric shifting tread from a CCX to another result in a heavy penalty. Also , it's been prooved that, at least in some scenarios, Windows 10 allocate the treads without the precaution to shift CCX as little is possible.

https://www.youtube.com/watch?v=JbryPYcnscA
https://www.youtube.com/watch?v=BORHnYLLgyY

So, why AMD refuse in any means to culprit windows 10?
 
  • Like
Reactions: powerrush

TheELF

Diamond Member
Dec 22, 2012
3,967
720
126
I've found at least two bugs in Windows related to Ryzen (setting groupsize disables half of Ryzen cores, but acts properly on Intel (Windows 7 - haven't tested with 10, yet) and setting affinity to every other logical core forces workloads to just two cores (Windows 10, about to test on 7)).
Are they windows bugs? Did AMD release any info on SMT yet? If it is like IBM's power it might just join the two smt threads of a core into a super slice which would be proper and logical behaviour. (from the standpoint of someone who has no idea (me) about how power works)
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Wait, wait wait, difference between what?

Intel has 60 cycles latency in random access... Ryzen has > 300cycles in the same test.

Ryzen cache latency performance is simply abysmal. It has a narrow window where it's better than intel for linear accesses thanks to a larger L2 per core.

I'm currently testing with SMT off, results are only slightly better - and only in a very narrow region.

After this series of tests I will disable a CCX and test once again, since these numbers are likely including the CCX penalty.

Then, finally, I will overclocked the memory and test with one CCX, no SMT, and max memory overclock (which is a meager 2667... using DDR4-3200 RAM).
 

Blake_86

Junior Member
Mar 13, 2017
21
3
36
Intel has 60 cycles latency in random access... Ryzen has > 300cycles in the same test.

Ryzen cache latency performance is simply abysmal. It has a narrow window where it's better than intel for linear accesses thanks to a larger L2 per core.

I'm currently testing with SMT off, results are only slightly better - and only in a very narrow region.

After this series of tests I will disable a CCX and test once again, since these numbers are likely including the CCX penalty.

Then, finally, I will overclocked the memory and test with one CCX, no SMT, and max memory overclock (which is a meager 2667... using DDR4-3200 RAM).
we are waiting for your results!
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
Are they windows bugs? Did AMD release any info on SMT yet? If it is like IBM's power it might just join the two smt threads of a core into a super slice which would be proper and logical behaviour. (from the standpoint of someone who has no idea (me) about how power works)

Considering that CPU affinity works perfectly on Linux on Ryzen, yes it's a Windows bug.

SMT is simply the ability to share the execution resources of one physical core between two virtual ones. When only one of those virtual cores is actually running, it gets *all* of the physical core's resources. That's the same as on Intel's HT, which is just a brand name for SMT technology.
 

Elixer

Lifer
May 7, 2002
10,376
762
126
My guess is that Microsoft plainly said: we aren't going to fix it for you, AMD. So AMD decided to stay in Microsoft's good graces, lest Microsoft further help Intel's monopoly.
Or, AMD's PR arm isn't talking to the engineers...
Or, MS said that will cost millions, and AMD didn't want to pay for the update, so, now, all is fine, brush it under the rug, nothing new here, carry on!
 
Status
Not open for further replies.