Ryzen: Strictly technical


R0H1T

Platinum Member
Jan 12, 2013
In terms of Ethernet - Ethernet as used with "dumb hubs" is a half-duplex protocol. Data fabrics are designed to support multiple full-duplex data paths, and for that reason are used in Ethernet switches, which are very much more efficient than Ethernet hubs and can support Ethernet in full-duplex mode. In fact, it's impossible to use Gigabit Ethernet with dumb hubs, because it doesn't *have* a half-duplex mode.

So forget about 40% of anything.

Based on stuff I've found out in the past few days, here's how Windows appears to assign threads to cores. I'm basing this on experiments with Windows 7, but I believe Win10 is essentially similar.

1: After every timeslice, the core assignment for each thread is calculated anew, and is not stable. This results in frequent migrations of threads between cores, unless only one core is available to that thread (of which more below) or all other available cores are already busy.

2: The timeslice is based on the thread priority, which itself is influenced by the process priority. Higher priorities get longer timeslices - "realtime" gets a (pseudo?) infinite timeslice. You can therefore reduce the thread migration frequency just by increasing the process priority.

3: Core availability is determined solely by Core Parking. The scheduler itself **is not SMT aware**, but the Core Parking algorithm is. Core Parking is not part of the scheduler - it is part of the power management subsystem, and its tunables can be made available in the Power Options control panel with some registry tweaks. I emphasise: if all cores are unparked, Windows will happily migrate threads randomly onto both physical and virtual cores, just as if they were all physical.

4: If the thread or process has CPU Affinity set, that further limits the available set of cores for that thread, on top of Core Parking. However, if all affinity-set cores for a thread are parked, Core Parking is overridden and the thread is assigned to one of its affinity cores anyway. Core Parking can detect this and will unpark cores which see significant affinity activity. NB: a parked core is not necessarily shut down (in a C-state) - but a parked core is *usually* idle and *therefore* usually shut down. Idle cores sitting in C-states are what allow the remaining cores to boost beyond the all-core turbo speed automatically. (A sketch of setting priority and affinity by hand appears after this list.)

5: On Windows 7, by default all physical cores have at least one virtual core unparked at all times. This might be different on Win10, based on screenshots. There is a toggle for this behaviour.

6: The Core Parking algorithm is very dumb - it doesn't seem to know how many threads are "runnable" at any given instant - and is therefore tuned to be very liberal about how many cores are unparked. I can fully believe that it will unpark about twice as many cores as there are full-time threads. This means that a 2T workload will appear to be spread evenly across 4 cores on a many-core system - but this **must not** be taken as evidence of NUMA awareness.
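To make points 2 and 4 above concrete, here is a minimal sketch of doing both by hand with ordinary Win32 calls - raising the priority class and pinning the process to the first CCX. The 0xFF mask assumes the usual numbering of two logical CPUs per physical core with CCX0 first; verify that on your own machine before using it.

```c
/* Minimal sketch (untested): pin the current process to the first Ryzen CCX
 * (logical processors 0-7 = physical cores 0-3 with SMT) and raise its
 * priority class so timeslices are longer and migrations less frequent.
 * The 0xFF mask is an assumption about CPU numbering - check it first. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE self = GetCurrentProcess();

    /* Longer timeslices -> fewer forced migrations (point 2 above). */
    if (!SetPriorityClass(self, ABOVE_NORMAL_PRIORITY_CLASS))
        fprintf(stderr, "SetPriorityClass failed: %lu\n", GetLastError());

    /* Restrict the whole process to logical CPUs 0-7 (one CCX), point 4. */
    DWORD_PTR ccx0_mask = 0xFF;
    if (!SetProcessAffinityMask(self, ccx0_mask))
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());

    /* ... run the workload ... */
    return 0;
}
```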

In short, it's a mess. As I noted over on Reddit, Linux does all of this about 500% better, because Linux people care about technical results, not risk-averse arse-covering.

Well, now the chickens have come home to roost in Redmond, as far as technical debt in the basic scheduler is concerned. Here are a couple of things that could be fixed by one competent engineer in a week:

1: Make the core assignment algorithm stable, i.e. so that it prefers to assign the core that the thread is already running on, or the core it last ran on (if that core isn't busy) - sketched below. This will have immediate performance benefits on *all* SMP, SMT, CMT, and NUMA systems, not just Ryzen. Less context-switch overhead, fewer cache misses, better branch prediction, more truly-idle cores that can be C-stated. It's pure win.

2: Rework the Core Parking algorithm to use the runnable thread count to guide the number of unparked cores. On NUMA systems, unpark cores preferentially on the same node for runnable threads belonging to the same process, and preferentially on different nodes for runnable threads in different processes. Better yet, scrap Core Parking entirely and work on the best practical implementation of Point 1.

That, of course, is how Linux does it.
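Purely as an illustration (invented structures, nothing like the real scheduler internals), the "sticky" preference from point 1 boils down to something like this:

```c
/* Illustrative sketch only - not Windows source, just the heuristic from
 * point 1 expressed as code. pick_core() prefers the core the thread last
 * ran on, falls back to any idle core, and only then accepts waiting. */
#include <stdbool.h>

#define NCORES 16

typedef struct {
    int last_core;   /* core this thread last ran on, -1 if none yet */
} thread_t;

/* idle[i] is true if core i currently has nothing runnable */
static int pick_core(const thread_t *t, const bool idle[NCORES])
{
    /* 1. Stay put: warm caches, trained branch predictor, no migration. */
    if (t->last_core >= 0 && idle[t->last_core])
        return t->last_core;

    /* 2. Otherwise take any idle core (a NUMA/CCX-aware version would
     *    search the thread's own node first). */
    for (int i = 0; i < NCORES; i++)
        if (idle[i])
            return i;

    /* 3. All cores busy: keep the previous core and wait our turn. */
    return t->last_core >= 0 ? t->last_core : 0;
}
```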
So, about my other thread, perhaps someone with a Ryzen could try this out ~ https://forums.anandtech.com/threads/rightmark-processor-power-management-panel.2500879/

The tool in question isn't an OC utility per se; it does, however, adjust advanced parameters for core parking, processor performance, idle states and other platform-specific settings (via the registry), which you could use to see how the scheduler (adversely) affects Win10 performance on Ryzen.
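For anyone who would rather poke the same knobs programmatically, here's a rough sketch using the Win32 power-management API. The two GUIDs are the commonly published values for the processor-settings subgroup and the "core parking min cores" setting - treat them as assumptions and verify them under HKLM\SYSTEM\CurrentControlSet\Control\Power\PowerSettings before relying on this.

```c
/* Sketch (assumptions flagged): set "Processor performance core parking
 * min cores" to 100% on the active power scheme, which effectively keeps
 * all cores unparked. GUID values below are from memory - verify them.
 * Build with: cl parking.c /link powrprof.lib */
#include <windows.h>
#include <powrprof.h>
#include <stdio.h>

static const GUID kProcessorSubgroup =   /* 54533251-82be-4824-96c1-47b60b740d00 */
    { 0x54533251, 0x82be, 0x4824, { 0x96,0xc1,0x47,0xb6,0x0b,0x74,0x0d,0x00 } };
static const GUID kCoreParkingMinCores = /* 0cc5b647-c1df-4637-891a-dec35c318583 */
    { 0x0cc5b647, 0xc1df, 0x4637, { 0x89,0x1a,0xde,0xc3,0x5c,0x31,0x85,0x83 } };

int main(void)
{
    GUID *scheme = NULL;
    if (PowerGetActiveScheme(NULL, &scheme) != ERROR_SUCCESS) {
        fprintf(stderr, "PowerGetActiveScheme failed\n");
        return 1;
    }

    /* 100 = keep 100%% of cores unparked (AC value; use
     * PowerWriteDCValueIndex as well if you care about battery). */
    DWORD rc = PowerWriteACValueIndex(NULL, scheme, &kProcessorSubgroup,
                                      &kCoreParkingMinCores, 100);
    if (rc == ERROR_SUCCESS)
        rc = PowerSetActiveScheme(NULL, scheme);   /* re-apply the scheme */

    LocalFree(scheme);
    printf(rc == ERROR_SUCCESS ? "Core parking minimum set to 100%%\n"
                               : "Failed (run elevated?)\n");
    return rc == ERROR_SUCCESS ? 0 : 1;
}
```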
 

unseenmorbidity

Golden Member
Nov 27, 2016

Drawcall testing of CCX communication penalty.
tl;dw - 17 FPS when threads are on one CCX, 14 FPS when threads are split across CCXs.


Here it can be seen that Windows randomly assigns threads to the CCX's. Sometimes it splits them, sometimes it keeps them on one.

Wow that is a massive difference just because of sheer chance. It's no wonder gaming benchmarks are so varied. It was a literal crapshoot.
 

rvborgh

Member
Apr 16, 2014
i've been yelling for the past week trying to get Ryzen people to use Process Lasso... (it seems that Ryan Shrout was not even aware of its existence) it's good that they are finally doing it (i saw huge gains in fps on my quad NUMA setup using it).

i wonder if the SMT functionality in Ryzen cores could report back on the "SMT"-ness (ie the amount of gain in throughput of the threads it is running due to SMT vs. not running it)... and notify Windows to "switch off" the SMT (ie not schedule more than a single heavyweight thread on a single real core) in real time. i think this would really help things.

i think it's absolutely crazy that Windows schedules things so badly.

One thing nice about Ryzen... lots of real cores... i think this would be beneficial for games with few heavy duty threads.


Drawcall testing of CCX communication penalty.
tl;dw - 17 FPS when threads are on one CCX, 14 FPS when threads are split across CCXs.


Here it can be seen that Windows randomly assigns threads to the CCX's. Sometimes it splits them, sometimes it keeps them on one.
 

R0H1T

Platinum Member
Jan 12, 2013
i've been yelling for the past week trying to get Ryzen people to use Process Lasso... (it seems that Ryan Shrout was not even aware of its existence) it's good that they are finally doing it (i saw huge gains in fps on my quad NUMA setup using it).

i wonder if the SMT functionality in Ryzen cores could report back on the "SMT"-ness (ie the amount of gain in throughput of the threads it is running due to SMT vs. not running it)... and notify Windows to switch off the SMT (ie not schedule more than a single heavyweight thread on a single real core) in real time. i think this would really help things.

i think it's absolutely crazy that Windows schedules things so badly.

One thing nice about Ryzen... lots of real cores... i think this would be beneficial for games with few heavy duty threads.
Do you have any numbers you'd wanna share?

I've been using the same utility for as long as I can remember & it definitely works better than the (Windows) OS scheduler; heck, even Intel's SpeedStep is something that might've been inspired by it.
 

rvborgh

Member
Apr 16, 2014
none that would really be applicable to the Ryzen folks (i will be putting together a Ryzen system in the next few months or so after things settle down) aside from telling them they really need to use Process Lasso.

On my quad Opteron setup (8 dies in 4 sockets)... i think i went from 230 or so fps to 335 fps by using Process Lasso to force the threads of a game i play onto the cores in the first socket (same problem as a lot of games have with Ryzen... they have no idea what to do with 48 threads, much less 16). Quad socket Opterons share something in common with Ryzen... HT links between sockets (albeit Ryzen's interconnect is all on a single die)... avoiding unnecessary trips over those links really helps.

Tuning and setting games up is effortless really, and the settings will apply any time you play the game in the future... so you don't have to worry about Windows randomly rotating threads onto suboptimal core combinations.

Do you have any numbers you'd wanna share?

I've been using the same utility for as long as I can remember & it definitely works better than the (Windows) OS scheduler; heck, even Intel's SpeedStep is something that might've been inspired by it.
 

Kromaatikse

Member
Mar 4, 2017
Apparently Allyn at PCPer redid the test, and now admits he is wrong.

That result is definitely consistent with Core Parking as I know it. Actually, I'm surprised so much of the work is landing on the first CCX; on Win7, there's usually a 50-50 or 60-40-ish load split between any two cores that are sharing the load. I do suspect Win10 has either an updated parking algorithm or better default parameters.

To expand a little more on why the Core Parking algorithm is broken, understand that it only looks at the *actual* CPU consumption. So if it sees one core loaded to 100% and all other cores perfectly idle, it thinks there is exactly 1 core's worth of load, even if there are actually 8 full-time threads fighting over that single core. Therefore, in order to recognise when more cores need to be unparked, it must first unpark some more cores - a Catch-22. And of course, as soon as an extra core is unparked, the scheduler will start needlessly migrating threads to and fro, even if there are plenty of unparked cores to go around. Also, the extra cores to be unparked will be on separate physical cores, so with a 4T workload on Ryzen, they'll be on the second CCX.
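A toy model of that Catch-22 - emphatically not Windows code, just the feedback problem expressed in a few lines:

```c
/* Toy model of the Catch-22 described above: a utilisation-driven parking
 * policy can never observe more demand than it has already unparked,
 * because N runnable threads on M unparked cores produce at most M cores'
 * worth of measured load. */
#include <stdio.h>

static int observed_load(int runnable_threads, int unparked_cores)
{
    /* Measured busy-core count is capped by how many cores are awake. */
    return runnable_threads < unparked_cores ? runnable_threads
                                             : unparked_cores;
}

int main(void)
{
    /* 8 compute threads, but only 1 core unparked: the policy "sees"
     * exactly one core's worth of load and has no reason to unpark more. */
    printf("measured load = %d core(s)\n", observed_load(8, 1));
    return 0;
}
```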

If the parking algorithm was conscious of its own "trial" unparking behaviour, it could unpark a virtual core when speculating. This would allow it to detect additional load without incurring the power, turbo and cache penalties of activating an extra physical core. (Only the scheduler itself can avoid the context-switch penalties.)
 

Kromaatikse

Member
Mar 4, 2017

Drawcall testing of CCX communication penalty.
tl;dw - 17 FPS when threads are on one CCX, 14 FPS when threads are split across CCXs.


Here it can be seen that Windows randomly assigns threads to the CCX's. Sometimes it splits them, sometimes it keeps them on one.

In that video, I noticed him calling all the even-numbered cores "physical" and the odd-numbered ones "virtual". This is a misconception of SMT.

Each physical core has two threads, which appear as two virtual cores to the OS. Both virtual cores will operate at the same performance unless one is explicitly prioritised - older HT CPUs didn't have this capability, and not all OS versions know how to drive it.

Obviously, if either or both of the virtual cores are active, the physical core is also active; conversely, if both of the virtual cores are halted, then the physical core is halted. The trick with SMT is that when both virtual cores are active at the same time, they share the various execution resources of the same physical core; this reduces the performance of each thread, but usually increases total performance.

So when I (or someone else who really knows what they're talking about) say "putting it on physical cores", this is really shorthand for "using at most one virtual core per physical core". The opposite would be to assign a 2T workload to *both* threads of a *single* physical core, eg. CPUs 0 and 1 in Windows, and we would expect to see a performance decrease but also a power consumption decrease in that case.
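For completeness, here is roughly what "at most one virtual core per physical core" looks like when done programmatically - a sketch using the standard Win32 topology query, untested, so treat it as illustrative rather than definitive:

```c
/* Sketch (untested): build an affinity mask that selects exactly one
 * logical CPU per physical core - the "one thread per physical core"
 * placement described above - and apply it to the current process. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    GetLogicalProcessorInformation(NULL, &len);   /* query required size */
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info = malloc(len);
    if (!info || !GetLogicalProcessorInformation(info, &len)) {
        fprintf(stderr, "GetLogicalProcessorInformation failed\n");
        return 1;
    }

    DWORD_PTR mask = 0;
    DWORD count = len / sizeof(*info);
    for (DWORD i = 0; i < count; i++) {
        if (info[i].Relationship != RelationProcessorCore)
            continue;
        /* Keep only the lowest-numbered logical CPU of each physical core. */
        DWORD_PTR m = info[i].ProcessorMask;
        mask |= m & (~m + 1);
    }
    free(info);

    printf("one-logical-per-core mask: 0x%llx\n", (unsigned long long)mask);
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask))
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
    return 0;
}
```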
 

lolfail9001

Golden Member
Sep 9, 2016
i've been yelling trying to get Ryzen people for the past week to use Process Lasso... (it seems that Ryan Schrout was not even aware of its existence) its good that they are finally doing it (i saw huge gains in fps on my quad NUMA setup using it).
From looncraz's testing, it looks like it is presently broken with Win10 and Ryzen. As in, Win10 literally assigns every single thread of the lassoed process to just cores 0 and 2 if you try to set affinities on just one thread per core.
 

Panino Manino

Senior member
Jan 28, 2017
Sorry if this is old, but is the information in these images correct?

[attached image: XO8qbUz.png]
 

rvborgh

Member
Apr 16, 2014
That's true... you can see this visually if you run "ThreadRacer" (another older Bitsum utility) or write your own. On my i7-4770, assigning two "heavy" threads to a single core (ie two logical cores mapped to a single physical core, as you mentioned) results in both of them running at an apparent half speed (although if SMT is working well for the code in question, you get better combined throughput).
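A rough sketch of that kind of experiment (untested; it assumes the usual Windows numbering where logical CPUs 0 and 1 are the two SMT threads of the first physical core):

```c
/* ThreadRacer-style experiment: pin two spinning threads to logical CPUs
 * 0 and 1 (assumed SMT siblings) and time a fixed amount of work.
 * Re-run with masks 0x1 and 0x4 (separate physical cores) to compare. */
#include <windows.h>
#include <stdio.h>

#define ITERS 400000000ULL

static DWORD WINAPI spin(LPVOID arg)
{
    volatile unsigned long long x = 0;
    for (unsigned long long i = 0; i < ITERS; i++)
        x += i;                       /* busy work the optimiser can't drop */
    (void)arg;
    return 0;
}

int main(void)
{
    HANDLE t[2];
    DWORD_PTR masks[2] = { 0x1, 0x2 };    /* CPU 0 and CPU 1: SMT siblings */

    DWORD start = GetTickCount();
    for (int i = 0; i < 2; i++) {
        t[i] = CreateThread(NULL, 0, spin, NULL, CREATE_SUSPENDED, NULL);
        SetThreadAffinityMask(t[i], masks[i]);
        ResumeThread(t[i]);
    }
    WaitForMultipleObjects(2, t, TRUE, INFINITE);
    printf("elapsed: %lu ms\n", GetTickCount() - start);
    return 0;
}
```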

In that video, I noticed him calling all the even-numbered cores "physical" and the odd-numbered ones "virtual". This is a misconception of SMT.

Each physical core has two threads, which appear as two virtual cores to the OS. Both virtual cores will operate at the same performance unless one is explicitly prioritised - older HT CPUs didn't have this capability, and not all OS versions know how to drive it.

Obviously, if either or both of the virtual cores are active, the physical core is also active; conversely, if both of the virtual cores are halted, then the physical core is halted. The trick with SMT is that when both virtual cores are active at the same time, they share the various execution resources of the same physical core; this reduces the performance of each thread, but usually increases total performance.

So when I (or someone else who really knows what they're talking about) say "putting it on physical cores", this is really shorthand for "using at most one virtual core per physical core". The opposite would be to assign a 2T workload to *both* threads of a *single* physical core, eg. CPUs 0 and 1 in Windows, and we would expect to see a performance decrease but also a power consumption decrease in that case.
 

JoeRambo

Golden Member
Jun 13, 2013
i think it's absolutely crazy that Windows schedules things so badly.

Really? Well, think about it from the scheduler's point of view:

1) A load of 4 threads that makes heavy use of the L3 cache on a CCX, with little inter-thread communication - you want them spread across the CCXs in a 2+2 pattern to make use of the 8MB of cache on each CCX.
2) The same load, but with heavy inter-thread communication -> now you want them all run on one CCX to make use of the shared L3 and avoid inter-CCX communication, which is "Infinity" only in name and comes with nasty latency. Each probe needs to hit the L3 and the L2 of each and every core in the target CCX to see if a particular cache line is there.

How should the scheduler know which type of load, (1) or (2), it is dealing with? That is generally a hard CS problem and will not be solved any time soon. (Before you reply with an award-winning algorithm, think also about power-saving settings that would prefer to keep cores and core complexes downclocked and downvolted.)

Process affinity is nice and not new, but it is not a general solution. It works great for things like "limit LinX to 4 physical cores" when seeking max throughput. But start doing "let's limit this draw-call generator to this CCX" and you run the risk of the kernel scheduling the graphics driver's core thread onto the other CCX, and things go unpredictably wrong from there. Is it the scheduler's fault or your own this time? What if it works when testing with a light-load tool, but fails when there is a real game load on the cores?

AMD made a conscious decision to make the CCX four cores in size so it could use the same building block in CPUs from entry level to large servers. The fact that you need a NUMA node for just 4 cores is outright ugly, as NUMA is not a magic bullet and apps would need NUMA awareness as well. (And even with NUMA, scenario (2) applies: just because you stick your command-generating threads to one NUMA-node CCX does not mean the driver that consumes them is going to run on it - in fact you have just made the OS more likely to move it to the other CCX, by virtue of keeping that CCX comparatively under-loaded.)

The good news is that one can bet the farm on AMD increasing the core count to 6-8 per CCX, being done with all those problems on the desktop and lessening their scope on servers, as 8 high-perf Ryzen cores per CCX is a nice chunk of computing.
 

Udgnim

Diamond Member
Apr 16, 2008
Guessing people that buy the 6-core R5 are really going to want 4+2 enabled cores instead of 3+3 enabled cores if their purchase is for gaming.

I'm assuming R5s are R7s with 2 cores disabled
 

Osjur

Member
Sep 21, 2013
Why has nobody tested games with one CCX completely disabled vs. both enabled at the same frequency, to see how big the latency penalty is? Or is it not possible to disable a CCX completely with the current BIOSes available from mobo makers?
 

Kromaatikse

Member
Mar 4, 2017
So, about my other thread, perhaps someone with a Ryzen could try this out ~ https://forums.anandtech.com/threads/rightmark-processor-power-management-panel.2500879/

I don't have a Ryzen myself (yet), but it seems to expose the same parameters as I've been playing with - and in a much neater and easier format than Microsoft's own control panel manages to. So, good find!

Judging by those screenshots, Win10 does expose a different set of parameters in the core-parking arena to Win7, strongly suggesting that it uses a different core-parking algorithm. The basic two-layer approach to CPU scheduling remains, however - the scheduler itself is apparently blissfully unaware of anything except an externally-set mask of unparked cores which it can use.

It might help me to understand the Win10 algorithm if I had the detailed help-text from each of the core-parking options. I don't need fullscreen screenshots for that, just the window or even the plain text (option name, help text, default values from each of Powersave, Balanced and High Performance profiles).
 

Osjur

Member
Sep 21, 2013
Kromaatikse, can you link one of those tests? I can only find tests with a "4 core" Ryzen vs the 7700, but not against an 8-core Ryzen.
 

gtbtk

Junior Member
Mar 11, 2017
@Kromaatikse You are showing your ignorance about Ethernet networking. Gigabit, 10Gig, 100BaseT, 10BaseT, 10Base2 all use CSMA/CD (Carrier Sense Multiple Access with Collision Detection). That basically means that a device on the network just throws packets of data out addressed to another device and sees if the packet or group of packets generates an acknowledgement. If the ack doesn't come back, it sends it again, because it assumes that the last packet had a collision with a packet from another device going elsewhere. In the millisecond lifetimes of data packets, there are lots of empty spaces available up until you get to about a 40% load on a multi-drop network.

More bandwidth, full duplex, switches, jumbo packets etc. have all been invented to create order on a network and mitigate some of the downsides of CSMA/CD, much like traffic lights or a traffic cop do when trying to manage heavy motor vehicle traffic. Switches mitigate the problem by making every segment point-to-point, so that only the switch port and the attached device are on that particular network. The switch then holds the packet data and forwards it when there is a gap in the traffic. The protocol still waits for an acknowledgement and will resend if necessary. Token Ring and FDDI were protocols that managed traffic by token passing (you can only send data if you hold the token), but that approach carries too much overhead, and it is more efficient to send and pray, assuming that a large percentage of the time you won't need to resend. The times you do resend are the "cost of doing business", as it were. The plan all falls down as traffic loads reach that 40% number.

The "40% and everything grinds to a halt" principle works with other things we do in life as well. Road traffic will hit a certain volume and end up gridlocked. Traffic lights, stop signs and roundabouts have all been created to manage traffic flow, just like network switches.

The data fabric is a network that the chip uses internally. I am not suggesting that it is using Ethernet internally; it has its own protocol that still needs to send, receive and make sure that the data arrived at the other end. Those principles are common to all networks, so I was using Ethernet as an analogy to illustrate what I was talking about. Maybe I am just too old, and because of the high bandwidth and network switches the downsides of collision detection have been hidden from view, so no one bothers to learn about it any more. We all deal with the performance degradation collision detection can suffer from every time we use a wifi network in a busy wifi area, if that makes it easier to visualize.

If the schedulers and thread switching were creating the entire problem, you would not be seeing good performance in ANY benchmarks, because thread switching is tied solely to CPU activity. It may well be using the data fabric to switch CCX; for better or worse, that is how the chip architecture is designed to work. Without any extra load on the DF, like a graphics load, the switching is not creating any noticeable impact on performance. The only performance impact you are seeing is when memory, GPU, CPU threads and storage access are all working hard and communicating with each other at the same time. While I am sure you are seeing the CCX switching and it appears that it is an issue, what you are measuring is a symptom of the underlying cause, which is the current bandwidth restriction of the DF.

Take a look at the combined scores on the range of Firestrike results I have included below.

The first group are Ryzen 1800X systems with GPUs that range from a Titan XP to a 1080 Ti to SLI and a single 1070. The Physics scores, which only load the CPU, all fall roughly where we would expect a 4GHz 8-core to be, while the combined scores are all limited to 6000-7000 because the system as a whole is hitting a limitation.

The second group are Intel 6900K and 7700K systems with similar GPUs; I even put in a 1070 SLI result and it didn't make any difference. Physics scores land where we expect given the maturity and OC levels that are achievable, but the combined scores all top out at 9000-10000. The Intel architecture has a limitation too, but it is higher than current knowledge of the AMD platform allows it to reach.

The third image is an 1800X with a 1060 and a 7700K with a 1060. Physics scores are in range, but combined scores are lower and roughly the same as each other. That is more in line with what using a lower-powered GPU would lead you to expect.

I did not include it, but a single RX 480 shows similar combined scores of 5000-5400 on both the 1800X and the 7700K, while Crossfire 480s increase to the 7000s on the 7700K and top out at about 5800 on the 1800X.

If the CCX were the root of the problem and not a symptom, then you would see a similar disparity between Intel and AMD regardless of the GPU used, and that does not appear to be the case.

I'm sure the data fabric can be tuned to provide better performance as the platform gains maturity. Faster memory support is certainly beneficial, as it also increases the DF clock. At least on Intel, the CPU PLL voltage, together with the IO voltage, allows tuning of the equivalent interconnects. I highly recommend that owners do some tests while adjusting SOC and CPU PLL voltages, because if I'm right, that area of the BIOS is where the tuning improvements are going to come from.

[attached: Firestrike result screenshots]
 

Osjur

Member
Sep 21, 2013
Yeah, it shows that 4 is better than 2+2 as a thought, but it doesn't tell us why 4+4 and 3+3 are so much faster.
 

Kromaatikse

Member
Mar 4, 2017
You are showing your ignorance about Ethernet networking. Gigabit, 10Gig, 100BaseT, 10BaseT, 10Base2 all use CSMA/CD (Carrier Sense Multiple Access with Collision Detection). That basically means that a device on the network just throws packets of data out addressed to another device and sees if the packet or group of packets generates an acknowledgement. If the ack doesn't come back, it sends it again, because it assumes that the last packet had a collision with a packet from another device going elsewhere.

Oh dear. Your first paragraph and substantive statement in the post, and it's flat-out wrong. I stopped reading.

The older "bus Ethernet" standards: 10base-2, 10base-5, 10base-T half-duplex, 100base-TX half-duplex. These are all half-duplex protocols and assuming a shared-bus topology. The twisted-pair wiring used by 10base-T and 100base-TX actually supports full-duplex signalling, but this capability is ignored in the half-duplex mode. The coaxial cable used by the others mentioned physically supports only half-duplex signalling. This is the version of the protocol using CSMA/CD.

CSMA/CD stands for Carrier Sense Multiple Access with Collision Detection. It should not be confused with CSMA/CA which uses Collision Avoidance. Wifi is a CSMA/CA technology, which uses the link-level acknowledgements and retries you describe; it's very inefficient and leads to high latency and jitter.

CSMA/CD as used in Ethernet does not have link-level acknowledgement. However, each sender does listen for interference with their transmission while it is in progress - this is the "collision detection". If there's a collision, the transmission is stopped immediately, a random wait time is inserted, and the frame is then retried. I won't go into exhaustive detail about exponential backoff and congestion collapse - that's all easy to find if you look for it, and will be drummed into your head at any undergraduate-level compsci course you actually attend the lectures for.
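Since it keeps coming up, the retry rule is simple enough to sketch in a few lines. The constants (512-bit-time slots, an exponent capped at 10, giving up after 16 collisions) are the standard 802.3 half-duplex values; the code itself is just an illustration.

```c
/* Sketch of 802.3 truncated binary exponential backoff: after the nth
 * collision of the same frame, wait a random number of slot times
 * (one slot = 512 bit times) drawn from [0, 2^min(n,10) - 1]; the frame
 * is dropped after 16 consecutive collisions. */
#include <stdio.h>
#include <stdlib.h>

/* Returns the number of slot times to wait before the next retry of a
 * frame that has now collided 'collisions' times, or -1 to give up. */
static int backoff_slots(int collisions)
{
    if (collisions >= 16)
        return -1;                               /* excessive collisions */
    int k = collisions < 10 ? collisions : 10;   /* truncate the exponent */
    unsigned range = 1u << k;                    /* 2^k possible waits    */
    return rand() % range;
}

int main(void)
{
    for (int c = 1; c <= 16; c++) {
        int s = backoff_slots(c);
        if (s < 0)
            printf("collision %2d -> give up, frame dropped\n", c);
        else
            printf("collision %2d -> wait %4d slot times\n", c, s);
    }
    return 0;
}
```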

The newer "switched Ethernet" standards: 10base-T full-duplex, 100base-TX full-duplex, 1000base-T (aka GigE). These are all full-duplex protocols assuming a switched-star (or star-of-stars) topology. They do not rely on CSMA/CD, but instead some implementations support explicit flow control using pause/resume control frames. The frames themselves are exactly the same as on "bus Ethernet", and your average Ethernet switch can seamlessly forward traffic between "bus" and "switch" networks. Indeed, early switches were designed to link several buses together, to reduce the collision domain and thereby increase reliability.

1000base-T is particularly interesting because it transmits both ways on all four wire pairs simultaneously, at 250Mbps per direction on each pair. This is enabled by self-cancelling and echo-cancelling, in the same way (though much more sophisticated) as your average landline telephone. 10base-T and 100base-TX use only one pair per direction; you can actually send two independent older Ethernet lines down the same four-pair cable! In both cases, with full-duplex communication on each twisted-pair cable, there is obviously no possibility of a collision, so the whole CSMA/CD algorithm is switched off.

So please refrain from claiming I don't know anything about Ethernet.
 

lolfail9001

Golden Member
Sep 9, 2016
What I don't like about the specific benchmark is that it is single player. Multiplayer BF1 needs ~50% more CPU resources.
Fair enough, but BF1 is only one of the games here. And meaningful replication of BF1 multiplayer is troublesome as it is.
 

OrangeKhrush

Senior member
Feb 11, 2017
Apparently Allyn at PCPer redid the test, and now admits he is wrong. He hasn't removed the article though...


Did he expressly state that, or did he just come to that conclusion implicitly through inference? It would be mighty big of him to admit it is wrong and then retract. But we will wait and see on this.
 

dfk7677

Member
Sep 6, 2007
Fair enough, but BF1 is only one of the games here. And meaningful replication of BF1 multiplayer is troublesome as it is.
Of course replication is quite difficult and, if I may, not useful. You cannot go into firefights (which are the most taxing parts of multiplayer) and have an exact replication. That is why, for CPU benchmarking, the best scenario is the tabletop view as a spectator on a full Conquest server on a specific map.

I am insisting on BF1 because it is the most optimized game for multiple threads out there.
 