Ryzen: Strictly technical


theevilsharpie

Platinum Member
Nov 2, 2009
2,322
14
81
And here is the rub - Zen doesn't have significant penetration on Windows desktops or servers yet. So Mickey$oft has no incentive to risk destabilizing Intel scheduling performance for the sake of Ryzen processors. Evidently, it's not as simple as reading the CPUID and using switch/case conditionals.

The Windows scheduler is already NUMA-aware and can tell the difference between logical and physical cores. It has everything needed to schedule work on Ryzen efficiently (except for perhaps some power control stuff), and just needs to be aware of the processor's core and cache topology.
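All of that topology is already exposed to user mode, too. Here's a rough, untested sketch (my own, not a claim about what the kernel does internally) using the documented GetLogicalProcessorInformationEx call; on an 8-core Ryzen I'd expect it to show 8 physical cores with SMT and two L3 slices, though that expectation is an assumption on my part:

Code:
#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    // First call with a null buffer just to learn how large the buffer must be.
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationAll, nullptr, &len);
    std::vector<char> buf(len);
    auto* first = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buf.data());
    if (!GetLogicalProcessorInformationEx(RelationAll, first, &len)) return 1;

    int cores = 0, smtCores = 0, l3 = 0;
    for (char* p = buf.data(); p < buf.data() + len; ) {
        auto* e = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(p);
        if (e->Relationship == RelationProcessorCore) {
            ++cores;                                          // one record per physical core
            if (e->Processor.Flags & LTP_PC_SMT) ++smtCores;  // core exposes two logical CPUs
        } else if (e->Relationship == RelationCache && e->Cache.Level == 3) {
            ++l3;                                             // Ryzen should report one L3 per CCX
        }
        p += e->Size;                                         // records are variable-length
    }
    printf("cores: %d (SMT on %d), L3 slices: %d\n", cores, smtCores, l3);
    return 0;
}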
 

Ajay

Lifer
Jan 8, 2001
15,468
7,874
136
I think looncraz's findings and the suggestions he made regarding improving Windows scheduling would benefit Intel CPUs as well.

We can't know that for sure. We don't have the algorithms for recent versions of the Windows scheduler. I would be surprised if there weren't server applications out there that rely on scheduler profiles to optimize performance - same thing for embedded Windows applications. There are, no doubt, quirks to be avoided as well (something game devs might know of). This OS has been evolving for over 20 years - who really knows what's in there (aside from MS)?

So those recommendations work great as guideposts for developing a change to the functional specification, but the devil is in the implementation.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,821
3,643
136
We can't know that for sure. We don't have the algorithms for recent versions of the Windows scheduler. I would be surprised if there weren't server applications out there that rely on scheduler profiles to optimize performance - same thing for embedded Windows applications. There are, no doubt, quirks to be avoided as well (something game devs might know of). This OS has been evolving for over 20 years - who really knows what's in there (aside from MS)?

So those recommendations work great as guideposts for developing a change to the functional specification, but the devil is in the implementation.
Getting to look at the algorithms might not be possible, but Windows scheduling is pretty well-documented.
 
  • Like
Reactions: Kromaatikse

Ajay

Lifer
Jan 8, 2001
15,468
7,874
136
Getting to look at the algorithms might not be possible, but Windows scheduling is pretty well-documented.
Thanks, that's a slightly newer description than the one I have bookmarked. Kernel symbols are available for Win10/7 etc., so one could install them, debug into the kernel, and take a look at the scheduler code.

Anywho, that link is just a description of the Windows scheduling API, with some operational details (but very few!). It's not enough to have much of a clue how the scheduler actually behaves in real time - normally, in edge cases, and with any hazards (if there are any). This isn't a trivial snippet of code - here is the current Linux version...
.h: https://github.com/torvalds/linux/blob/master/include/linux/sched.h
.c: https://github.com/torvalds/linux/blob/master/kernel/sched/fair.c
 

iBoMbY

Member
Nov 23, 2016
175
103
86
The Windows scheduler is already NUMA-aware and can tell the difference between logical and physical cores. It has everything needed to schedule work on Ryzen efficiently (except for perhaps some power control stuff), and just needs to be aware of the processor's core and cache topology.

Yes, it is NUMA-aware, but Ryzen is not reported as two NUMA nodes to the system.
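Easy enough to verify from user mode - a small sketch like this (untested, and the single-node result is my expectation for a desktop Ryzen rather than something I've measured here) shows why the scheduler's NUMA logic never gets a chance to distinguish the two CCXes:

Code:
#include <windows.h>
#include <cstdio>

int main() {
    ULONG highest = 0;
    // Documented Win32 call; node numbers are 0-based, so nodes = highest + 1.
    if (GetNumaHighestNodeNumber(&highest))
        printf("NUMA nodes reported: %lu\n", highest + 1);  // expect 1 on a single-socket Ryzen
    return 0;
}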
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
Getting to look at the algorithms might not be possible, but Windows scheduling is pretty well-documented.

I found this:
Thread Ideal Processor
When you specify a thread ideal processor, the scheduler runs the thread on the specified processor when possible. Use the SetThreadIdealProcessor function to specify a preferred processor for a thread. This does not guarantee that the ideal processor will be chosen but provides a useful hint to the scheduler.

So, a game which is aware of Ryzen's special topology can influence Windows' scheduling behaviour without relying on an affinity mask. This is fortunate since the latter appears to be broken. The problem is that each and every game dev needs to think about and set this correctly.
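As a rough illustration (my own sketch, not anything I've seen a game actually ship), an engine that has detected the CCX layout could hand that hint to its worker threads along these lines - assuming the common enumeration where logical processors 0-7 sit on the first CCX of an 8C16T part:

Code:
#include <windows.h>

void SuggestFirstCcx(HANDLE worker, DWORD workerIndex) {
    // Spread workers across logical CPUs 0..7 (CCX0 under the assumed enumeration).
    DWORD ideal = workerIndex % 8;

    // This is only a hint: the thread may still run elsewhere if its ideal
    // processor is busy, which is exactly why it doesn't carry the risks of
    // a hard affinity mask.
    DWORD previous = SetThreadIdealProcessor(worker, ideal);
    if (previous == (DWORD)-1) {
        // Call failed; the scheduler simply keeps its default behaviour.
    }
}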

I remain convinced that the scheduler itself is completely oblivious to SMT, NUMA, etc. Any illusion otherwise is given by the core-parking algorithm (which is at least SMT aware), and by affinity optimisations applied internally or externally to a given process. The documentation linked above talks a lot about applications needing to take responsibility for optimising their own affinity settings for topological considerations.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,821
3,643
136
Thankfully Codemasters correctly identifies the topology; it is only when syncing the configuration files through Steam while migrating a previous installation of F1 2016 that it fails to update its detection of the new configuration.

If this is the situation, then who knows what other developers might be doing - the Ghost Recon Wildlands example (typical of Ubisoft) doesn't inspire much confidence.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Thankfully Codemasters correctly identifies the topology; it is only when syncing the configuration files through Steam while migrating a previous installation of F1 2016 that it fails to update its detection of the new configuration.

If this is the situation, then who knows what other developers might be doing - the Ghost Recon Wildlands example (typical of Ubisoft) doesn't inspire much confidence.
Didn't Robert Hallock/AMD state that F1 detected 16 cores?
 
  • Like
Reactions: looncraz

tamz_msc

Diamond Member
Jan 5, 2017
3,821
3,643
136
Didn't Robert Hallock/AMD state that F1 detected 16 cores?
Ah, I stand corrected then. The 16-core detection is in addition to the syncing issue. GRW, on the other hand, is the one that blatantly shows 16 cores in the results of its built-in benchmark.
 
  • Like
Reactions: Dresdenboy

innociv

Member
Jun 7, 2011
54
20
76
Even if F1 and other games using the engine correctly identified 8 cores with SMT, that doesn't make them "aware of the topology".

Only an application that identifies that it has 2 L3 caches per set of 4 cores, with significant latency to cross between them, is actually aware of the topology.

So Mickey$oft has no incentive to risk destabilizing Intel scheduling performance for the sake of Ryzen processors. Evidently, it's not as simple as reading the CPUID and using switch/case conditionals.
Hm.
So they can use the CPUID to disable Windows 7 updates on Ryzen CPUs, but they can't use it to change the scheduling pattern. Got it. :)
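The detection side really is trivial. Here's a sketch of that "switch/case on CPUID" idea - purely illustrative on my part, not a claim about how MS actually gates updates or scheduling:

Code:
#include <intrin.h>
#include <cstring>
#include <cstdio>

static bool IsZen() {
    int r[4];
    __cpuid(r, 0);                       // leaf 0: vendor string in EBX, EDX, ECX
    char vendor[13] = {};
    memcpy(vendor + 0, &r[1], 4);
    memcpy(vendor + 4, &r[3], 4);
    memcpy(vendor + 8, &r[2], 4);
    if (strcmp(vendor, "AuthenticAMD") != 0) return false;

    __cpuid(r, 1);                       // leaf 1: base + extended family in EAX
    unsigned family = ((r[0] >> 8) & 0xF) + ((r[0] >> 20) & 0xFF);
    return family == 0x17;               // family 17h = Zen/Ryzen
}

int main() { printf("Zen detected: %d\n", IsZen()); return 0; }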

I think looncraz's findings and the suggestions he made regarding improving Windows scheduling would benefit Intel CPUs as well.
It would, wouldn't it?
Isn't there more latency from one side of the L3 cache to the other, so it's best to cluster threads for a process together regardless of whether it's a "true 8-core" or a "2x4 core"?

I think many people did assume the scheduler was already aware of the latency of cross-communication from one core to the next, as it seems like such a straightforward optimization to make.

In my view, there are fewer similarities between anything in the Bulldozer family and Ryzen, than between Ryzen and old K8 multi-socket Opterons, at least as far as topology goes. And that's a *good* thing. But if you want another AMD CPU with four cores per last-level cache and significantly more latency between those caches, look no further than Jaguar, as used in the PS4 and XBone. Of course, those consoles don't run Windows, but a specialised game-centric OS.
Jaguar is based on Bobcat, which is based on K10.
Jaguar also has more in common with K8 and Ryzen than with Bulldozer.

Ryzen is sort of Phenom IV... if Phenom III ever existed. But you could think of Bobcat and Jaguar as Phenom III.
 
Last edited:

OrangeKhrush

Senior member
Feb 11, 2017
220
343
96
Yesterday I posted talk from my source of a new platform; today Chiphell leaked the X399, which my source confirmed is real.

Public knowledge by now, but AMD has a new HEDT platform coming out in a couple of months.
You'll see more of it at Computex, I believe.
It's a 16-core/32-thread, quad-channel behemoth. And it is insanely quick in the tests that Ryzen is already excelling at - Cinebench and all the other related productivity programs. The gaming issues that were causing the Ryzen AM4 CPUs to behave erratically, to say the least, have been ironed out. It's akin to a newer revision on a newer platform. This should be competing with the Xeon and, of course, the 6950X Intel offers for $1700-$1800 USD, but at about $1,000 USD, if not less for some SKUs. Coming soon.
The CPUs are pretty big physically, about twice the size of current 6950X CPUs, perhaps a bit more.
And if you were hoping for pins, nope, it's strictly LGA!
It's NOT 8-channel, but quad.

It will be a splendid competition between X299 and this AMD platform. Skylake-X is pretty good - not revolutionary, but a meaningful step up in IPC - and the clocks are pretty high as well. Whether Intel will have a 32-core part to compete on X299 remains to be seen, but the HEDT landscape is going to change quite a bit in the next 4 to 6 months.

I did confirm that the new silicon revisions with the "ironed out" issues extend all the way from these behemoths to the entry-level SKUs. I am not really getting my knickers in a bunch about this anymore; performance is coming, but only if you haven't bought a chip yet.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,821
3,643
136
I don't see how a 16C32T chip solves the inter-CCX data fabric latency issue from a purely HW design standpoint, unless it's a monolithic block like Intel. Unless they aren't using the 4-core CCX blocks?

How does a silicon revision solve what is fundamentally an interconnectivity issue?
 

piesquared

Golden Member
Oct 16, 2006
1,651
473
136
I don't see how a 16C32T chip solves the inter-CCX data fabric latency issue from a purely HW design standpoint, unless it's a monolithic block like Intel. Unless they aren't using the 4-core CCX blocks?

How does a silicon revision solve what is fundamentally an interconnectivity issue?

Speed up the DF? We've already seen how much increasing the BCLK increases performance. Although I'm very happy with the performance of my Ryzen 1700 already.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,821
3,643
136
There is a point beyond which BCLK increases affect PCI-E bandwidth - the BCLK threshold at which the slots drop from PCI-E 3.0 to 2.0 operation is pretty low. Unless the DF is made independent of the IMC, latency issues will remain.
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
If this is the situation, then who knows what other developers might be doing - the Ghost Recon Wildlands example(typical of Ubisoft) doesn't inspire much confidence.

Well, this is of course the danger Microsoft inherently runs by punting topology awareness to applications. In general, application developers aren't as aware of these details as system devs, and certainly the spread of knowledge is more uneven.
 
  • Like
Reactions: deadhand and Ajay

dnavas

Senior member
Feb 25, 2017
355
190
116
Yesterday I posted talk from my source of a new platform; today Chiphell leaked the X399, which my source confirmed is real. I did confirm that the new silicon revisions with the "ironed out" issues extend all the way from these behemoths to the entry-level SKUs. I am not really getting my knickers in a bunch about this anymore; performance is coming, but only if you haven't bought a chip yet.

I'm sitting next to my computer now, which is doing work that my poor 860 couldn't have managed, so it's hard to be too disappointed. But.... Drat. I'm still seeing 6-hour encodes, so if they can ship a quad-channel, 16-core, 3.6-4.0GHz, 200W TDP monster (32 PCIe lanes?) on an ATX-sized board (hah!), AMD is going to get a lot of money from my wallet. Probably E-ATX, huh? :sigh:
 

OrangeKhrush

Senior member
Feb 11, 2017
220
343
96
There is a point beyond which BCLK increases affect PCI-E bandwidth - the BCLK threshold at which the slots drop from PCI-E 3.0 to 2.0 operation is pretty low. Unless the DF is made independent of the IMC, latency issues will remain.

That is likely the answer - they left things available to play with.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,821
3,643
136
Well, this is of course the danger Microsoft inherently runs by punting topology awareness to applications. In general, application developers aren't as aware of these details as system devs, and certainly the spread of knowledge is more uneven.
Typically games just use the CPUID for detection and store it in an XML file. Are there applications that actually detect and store the cache hierarchy in a separate file?
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
There is a point beyond which BCLK increases affect PCI-E bandwidth. That threshold is pretty low for PCI-E 3.0 to 2.0 operation. Unless the DF is made independent of the IMC, latency issues will remain.
I believe a presently disabled ability to run the DF at 2x the memory bus clock was mentioned (so basically running it at the tick rate of the memory). Should AMD be able to use it with some new silicon revision, the DF issues may, for all intents and purposes, be resolved.
 