Ryzen: Strictly technical

Discussion in 'CPUs and Overclocking' started by The Stilt, Mar 2, 2017.

  1. theevilsharpie

    theevilsharpie Platinum Member

    Joined:
    Nov 2, 2009
    Messages:
    2,323
    Likes Received:
    13
    The Windows scheduler is already NUMA-aware and can tell the difference between logical and physical cores. It has everything needed to schedule work on Ryzen efficiently (except for perhaps some power control stuff), and just needs to be aware of the processor's core and cache topology.
     
  2. Ajay

    Ajay Platinum Member

    Joined:
    Jan 8, 2001
    Messages:
    2,863
    Likes Received:
    75
    We can't know that for sure. We don't have the algorithms for recent versions of the Windows scheduler. I would be surprised if there weren't server applications out there that rely on scheduler profiles to optimize performance - the same goes for embedded Windows applications. There are, no doubt, quirks to be avoided as well (something game devs might know about). This OS has been evolving for over 20 years - who really knows what's in there (aside from MS)?

    So those recommendations work great as guideposts for developing a change to the functional specification, but the devil is in the implementation.
     
  3. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    Getting to look at the algorithms might not be possible, but Windows scheduling is pretty well-documented.
     
    Kromaatikse likes this.
  4. Ajay

    Ajay Platinum Member

    Joined:
    Jan 8, 2001
    Messages:
    2,863
    Likes Received:
    75
    Thanks, that's a slightly newer description than the one I have bookmarked. Kernel symbols are available for Win10/7 etc., so one could install them, debug into the kernel, and take a look at the scheduler code.

    Anywho, that link is just a description of the Windows scheduling API, with some operational details (but very few!). It's not enough to give much of a clue how the scheduler actually behaves at runtime - normal operation, edge cases, and any hazards (if there are any). This isn't a trivial snippet of code - here is the current Linux version...
    .h: https://github.com/torvalds/linux/blob/master/include/linux/sched.h
    .c: https://github.com/torvalds/linux/blob/master/kernel/sched/fair.c
     
  5. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    If only M$FT were generous enough to put their Windows code on github.
     
    Ajay likes this.
  6. iBoMbY

    iBoMbY Member

    Joined:
    Nov 23, 2016
    Messages:
    89
    Likes Received:
    43
    Yes, it is NUMA aware, but Ryzen is not reported as two NUMA nodes to the system.
     
  7. Kromaatikse

    Kromaatikse Member

    Joined:
    Mar 4, 2017
    Messages:
    74
    Likes Received:
    156
    I found this:
    Thread Ideal Processor
    When you specify a thread ideal processor, the scheduler runs the thread on the specified processor when possible. Use the SetThreadIdealProcessor function to specify a preferred processor for a thread. This does not guarantee that the ideal processor will be chosen but provides a useful hint to the scheduler.​

    So, a game which is aware of Ryzen's special topology can influence Windows' scheduling behaviour without relying on an affinity mask. This is fortunate since the latter appears to be broken. The problem is that each and every game dev needs to think about and set this correctly.
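    A minimal sketch of how a game might act on this, assuming the common Windows enumeration on an 8-core Ryzen (SMT siblings on adjacent logical CPUs, CCX0 on logical 0-7, CCX1 on 8-15 - an assumption that should be verified with GetLogicalProcessorInformationEx on real hardware). `ccx_ideal_cpu` is an illustrative helper, not a real API:

```c
/* Sketch: steering worker threads onto CCXs via SetThreadIdealProcessor.
 * Assumed topology (verify before relying on it): 8 cores, SMT pairs on
 * adjacent logical CPUs, so CCX0 = logical 0-7 and CCX1 = logical 8-15. */
#include <assert.h>

/* Map a worker index to a logical CPU so the first four workers land on
 * distinct physical cores of CCX0, the next four on CCX1, and so on. */
unsigned ccx_ideal_cpu(unsigned worker_index)
{
    unsigned core = worker_index % 8; /* 8 physical cores total      */
    unsigned ccx  = core / 4;         /* 4 cores per CCX             */
    unsigned slot = core % 4;         /* core within its CCX         */
    return ccx * 8 + slot * 2;        /* first SMT thread of the core */
}

#ifdef _WIN32
#include <windows.h>
/* Hint the scheduler; returns the previous ideal processor, or -1 on error. */
static DWORD hint_ideal(HANDLE thread, unsigned worker_index)
{
    return SetThreadIdealProcessor(thread, ccx_ideal_cpu(worker_index));
}
#endif
```

    Unlike an affinity mask, this leaves the scheduler free to migrate the thread if the hinted CPU is busy - which is exactly the "useful hint, not a guarantee" behaviour described above.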

    I remain convinced that the scheduler itself is completely oblivious to SMT, NUMA, etc. Any illusion otherwise is given by the core-parking algorithm (which is at least SMT aware), and by affinity optimisations applied internally or externally to a given process. The documentation linked above talks a lot about applications needing to take responsibility for optimising their own affinity settings for topological considerations.
     
    lightmanek, Ajay and Dresdenboy like this.
  8. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    Thankfully Codemasters correctly identifies the topology; it only fails when syncing the configuration files through Steam while migrating a previous installation of F1 2016 - in that case it doesn't update its detection of the new configuration.

    If this is the situation, then who knows what other developers might be doing - the Ghost Recon Wildlands example (typical of Ubisoft) doesn't inspire much confidence.
     
  9. Dresdenboy

    Dresdenboy Golden Member

    Joined:
    Jul 28, 2003
    Messages:
    1,687
    Likes Received:
    455
    Didn't Robert Hallock/AMD state that F1 detected 16 cores?
     
    looncraz likes this.
  10. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    Ah, I stand corrected then. The 16-core detection is in addition to the syncing issue. GRW on the other hand is the one that blatantly shows 16 cores in the results of its built-in benchmark.
     
    Dresdenboy likes this.
  11. imported_jjj

    imported_jjj Senior member

    Joined:
    Feb 14, 2009
    Messages:
    491
    Likes Received:
    309
    Minkoff and Dresdenboy like this.
  12. innociv

    innociv Member

    Joined:
    Jun 7, 2011
    Messages:
    52
    Likes Received:
    17
    Even if F1 and other games using the engine correctly identified 8 cores with SMT, that doesn't make them "aware of the topology".

    Only if an application identifies that there are two L3 caches, one per set of 4 cores, with significant latency between them, is it actually aware of the topology.
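    That cache-sharing information is exactly what AMD exposes through CPUID leaf 0x8000001D (Cache Properties). A sketch of decoding its EAX register - the field layout is assumed from AMD's leaf mirroring Intel's leaf 4, and the raw value in the usage note below is a hand-crafted illustration, not captured from hardware:

```c
/* Sketch: decoding EAX of AMD CPUID leaf 0x8000001D to find how many
 * logical processors share a cache. On Zen the L3 sub-leaf should report
 * 8 sharers (4 cores x 2 SMT threads) per CCX, not 16 for the whole chip. */
#include <assert.h>

typedef struct {
    unsigned type;    /* 1 = data, 2 = instruction, 3 = unified */
    unsigned level;   /* 1..3 */
    unsigned sharers; /* logical processors sharing this cache */
} cache_info;

/* Decode the EAX register for one cache sub-leaf. */
cache_info decode_cache_eax(unsigned eax)
{
    cache_info ci;
    ci.type    = eax & 0x1f;                /* bits 4:0                */
    ci.level   = (eax >> 5) & 0x7;          /* bits 7:5                */
    ci.sharers = ((eax >> 14) & 0xfff) + 1; /* bits 25:14, minus one   */
    return ci;
}
```

    For example, a unified L3 shared by 8 logical processors would encode as 0x1C063: type 3, level 3, sharers 8 - an application that walks these sub-leaves can tell a 2x4-core part from a true 8-core one.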

    Hm.
    So they can use the CPUID to disable Windows 7 updates on Ryzen CPUs, but they can't use it to change the scheduling pattern. Got it. :)

    It would, wouldn't it?
    Isn't there more latency from one L3 cache to the other, so it's best to cluster a process's threads together regardless of whether it's a "true 8 core" or a "2x4 core"?

    I think many people did assume the scheduler was already aware of the latency of cross-communication from one core to the next, as it seems like such a straightforward optimization to do.

    Jaguar is based on Bobcat which is based on K10.
    Jaguar has more in common with K8 and Ryzen as well than Bulldozer.

    Ryzen is sort of Phenom IV... if Phenom III ever existed. But you could think of Bobcat and Jaguar as Phenom III.
     
    #837 innociv, Mar 16, 2017
    Last edited: Mar 16, 2017
  13. OrangeKhrush

    OrangeKhrush Senior member

    Joined:
    Feb 11, 2017
    Messages:
    210
    Likes Received:
    326
    Yesterday I posted talk from my source of a new platform; today Chiphell leaked the X399, which my source confirmed is real.

    I did confirm that the new silicon revisions with "ironed out" issues extend all the way from these behemoths to the entry-level SKUs. I'm not really getting my knickers in a bunch about this anymore - performance is coming, but only if you haven't bought a chip yet.
     
    lightmanek, Madpacket and dnavas like this.
  14. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    I don't see how a 16C32T chip solves the inter-CCX data fabric latency issue from a purely HW design standpoint, unless it's a monolithic block like Intel's. Or are they not using the 4-core CCX blocks?

    How does a silicon revision solve what is fundamentally an interconnectivity issue?
     
  15. lolfail9001

    lolfail9001 Senior member

    Joined:
    Sep 9, 2016
    Messages:
    858
    Likes Received:
    297
    Increase data fabric clock?
     
  16. piesquared

    piesquared Golden Member

    Joined:
    Oct 16, 2006
    Messages:
    1,325
    Likes Received:
    176
    Speed up the DF? We've already seen how much increasing the BCLK increases performance. Although I'm very happy with the performance of my Ryzen 1700 already.
     
  17. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    There is a point beyond which increasing the BCLK affects PCI-E bandwidth, and the threshold at which links drop from PCI-E 3.0 to 2.0 operation is pretty low. Unless the DF is made independent of the IMC, latency issues will remain.
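    The arithmetic behind that: on AM4 the reference clock feeds PCI-E as well, so the effective transfer rate scales linearly with BCLK. A sketch - the 105 MHz figure in the usage note is an illustrative margin, not a spec value:

```c
/* Sketch: effective PCI-E transfer rate as a function of BCLK. On AM4 the
 * reference clock is shared, so overclocking BCLK pushes PCI-E out of spec. */
#include <assert.h>

/* Effective PCI-E 3.0 rate in MT/s for a given BCLK in MHz (nominal 100). */
unsigned pcie3_rate_mts(unsigned bclk_mhz)
{
    return 8000u * bclk_mhz / 100u; /* 8 GT/s base, linear with BCLK */
}
```

    At 105 MHz BCLK the links would have to run at 8400 MT/s, beyond what most boards tolerate at Gen3 - hence firmware falling back to 2.0 operation well before the BCLK gains become interesting.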
     
  18. Kromaatikse

    Kromaatikse Member

    Joined:
    Mar 4, 2017
    Messages:
    74
    Likes Received:
    156
    Well, this is of course the danger Microsoft inherently runs by punting topology awareness to applications. In general, application developers aren't as aware of these details as system devs, and certainly the spread of knowledge is more uneven.
     
    deadhand and Ajay like this.
  19. dnavas

    dnavas Member

    Joined:
    Feb 25, 2017
    Messages:
    37
    Likes Received:
    11
    I'm sitting next to my computer, which is now doing work that my poor 860 couldn't have managed, so it's hard to be too disappointed. But.... Drat. I'm still seeing 6 hour encodes, so if they can ship a quad-channel, 16 core 3.6-4.0GHz 200W TDP monster (32 PCI-E lanes?) on an ATX-sized board (hah!), AMD is going to get a lot of money from my wallet. Probably E-ATX, huh? :sigh:
     
  20. dnavas

    dnavas Member

    Joined:
    Feb 25, 2017
    Messages:
    37
    Likes Received:
    11
    Wasn't there a mention of being able to run the fabric clock at 1:1 with the memory clock? Could it be that there was a problem in the current revision that didn't allow clocks that high?
     
    looncraz likes this.
  21. OrangeKhrush

    OrangeKhrush Senior member

    Joined:
    Feb 11, 2017
    Messages:
    210
    Likes Received:
    326
    That is likely the answer - they left things available to play with.
     
  22. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    424
    Likes Received:
    290
    Typically games just use the CPUID for detection and store the result in an XML file. Are there applications that actually detect and store the cache hierarchy in a separate file?
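    For reference, the kind of CPUID detection games typically do is just decoding the family/model signature from leaf 1 EAX - which says nothing about CCXs or cache layout. A sketch of that decode (Zen reports family 0x17; the raw value 0x00800F11 used in the test is the signature reported by the Ryzen 7 1800X):

```c
/* Sketch: decoding family/model/stepping from CPUID leaf 1 EAX.
 * This identifies the chip (Zen = family 0x17) but carries no topology
 * information at all -- that requires the cache/topology leaves. */
#include <assert.h>

typedef struct { unsigned family, model, stepping; } cpu_sig;

cpu_sig decode_leaf1_eax(unsigned eax)
{
    cpu_sig s;
    unsigned base_family = (eax >> 8) & 0xf;
    unsigned ext_family  = (eax >> 20) & 0xff;
    unsigned base_model  = (eax >> 4) & 0xf;
    unsigned ext_model   = (eax >> 16) & 0xf;

    s.stepping = eax & 0xf;
    /* The extended fields only apply when the base family field is 0xF. */
    s.family = base_family + (base_family == 0xf ? ext_family : 0);
    s.model  = (base_family == 0xf) ? ((ext_model << 4) | base_model)
                                    : base_model;
    return s;
}
```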
     
  23. lolfail9001

    lolfail9001 Senior member

    Joined:
    Sep 9, 2016
    Messages:
    858
    Likes Received:
    297
    I believe a presently disabled ability to run the DF at 2x the memory bus clock was mentioned (so basically running it at the tick rate of the memory). Should AMD be able to use it in some new silicon revision, the DF issues may for all intents and purposes be resolved.
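    The arithmetic, as a sketch: on Zen the data fabric runs at MEMCLK, i.e. half the DDR transfer rate, and the rumoured mode would double that. The figures here are illustrative of the ratio, not measured:

```c
/* Sketch: fabric clock as a function of memory speed. mult = 1 models the
 * current 1:2 behaviour (DF at MEMCLK); mult = 2 models the rumoured mode
 * running the DF at the memory tick rate. */
#include <assert.h>

/* Fabric clock in MHz for a DDR4 transfer rate given in MT/s. */
unsigned df_clock_mhz(unsigned ddr_mts, unsigned mult)
{
    return ddr_mts / 2u * mult;
}
```

    So DDR4-2666 gives a 1333 MHz fabric today, while the 2x mode on the same memory would put it at 2666 MHz - which is why it would largely neutralise the inter-CCX penalty.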
     
    shing3232, CatMerc and OrangeKhrush like this.
  24. formulav8

    formulav8 Diamond Member

    Joined:
    Sep 18, 2000
    Messages:
    6,338
    Likes Received:
    157
    I wouldn't be surprised if a future BIOS allows you to change the divider's ratio.
     
    looncraz likes this.
  25. OrangeKhrush

    OrangeKhrush Senior member

    Joined:
    Feb 11, 2017
    Messages:
    210
    Likes Received:
    326
    Get me a 1500 with newer revision silicon and a beer