Ryzen: Strictly technical

lolfail9001 · Mar 12, 2017

dfk7677 said:
But the case for Ryzen in games doesn't involve 100% working threads, quite the opposite. Still the framerate is capped by something else.

It does involve 100% CPU load threads, but Win10 load balancing masks that, you can see the visible effect in looncraz and PCPer's task manager screenshots: load balancing basically masks 100% load on 3 or 4 threads as ~50%ish load on a whole bunch of threads instead.

deadhand · Mar 12, 2017

dfk7677 said:
But the case for Ryzen in games doesn't involve 100% working threads, quite the opposite. Still the framerate is capped by something else.

It is exactly the same for dual socket Xeon systems.

I should note that a real-world case of false sharing would not necessarily have to show 100% utilization on any thread due to the polling interval. The tests that produces the utilization graphs in my post were designed (by the author, not me) to only test the effects of false sharing, constantly for a period of time, to an extreme. In the same sense a programmer would likely notice that a task fails to scale to > some number of threads and possibly restrict it to 'x' number of threads where there is still some degree of scaling. (Perhaps assuming the issue is elsewhere)

EDIT: I'll show my own test results of thread scaling with different degrees of false sharing. The graph I'll post consists of values that are actually fairly consistent. It's my other tests that were not.

EDIT: Nvm, perhaps another time once the other issues are sorted out.

lolfail9001 · Mar 12, 2017

deadhand said:
In effect, 'False Sharing' is what occurs when a thread is writing to data located on the same cache line that another thread is attempting to access. As I understand it, only one core can have a lock on the cache line when it's being modified, so this produces a dependency between threads that is not obvious to the programmer. It's a serialization of resource access where the entire cache line is bounced back and forth between cores, even if each thread are operating on completely different (but within the same cache line) memory locations, and are otherwise embarrassingly parallel.

Wait, wait, wait, Ryzen's l1 and l2 are local to each core, and l3 is a victim cache. The only clear path to false sharing that i see here is usage of SMT.

malventano · Mar 12, 2017

unseenmorbidity said:
Apparently Allyn at PCper redid the test,and now admits he is wrong. He hasn't removed the article though...

Nothing there is an admission of wrong. You can't just turn on NUMA as it is meant to segment memory spaces, not caches / CCX modules. The scheduler can't be expected to be aware of this as the CCX segmentation does not appear to be part of the CPUID. Further, the primary point of the article was to dispel the misinformation re: Windows 10 scheduler not properly handling physical vs. logical cores, and that point is still made with the pic I put in the comments. All that 'counters' is that a lighter load was repeatably allocated within a single CCX while the heavier load spilled over. There is no surprise here, as that is exactly what is happening in games, etc. The fix for this is to get the scheduler to prioritize CCX's to the point where it will load up the second logical cores before spreading threads to the other CCX, but again, there's nothing in place to direct it to do that and NUMA is not the magical answer here.

deadhand · Mar 12, 2017

lolfail9001 said:
Wait, wait, wait, Ryzen's l1 and l2 are local to each core, and l3 is a victim cache. The only clear path to false sharing that i see here is usage of SMT.

Is it? That's interesting. I wasn't aware of that.

Chl Pixo · Mar 12, 2017

lolfail9001 said:
Wait, wait, wait, Ryzen's l1 and l2 are local to each core, and l3 is a victim cache. The only clear path to false sharing that i see here is usage of SMT.

Could the L3 be incorrectly used?
I mean most games would use intel or ms compilers.

Trender · Mar 12, 2017

Kromaatikse said:
Thanks to core parking, it appears that single-threaded benchmarks are in fact being kept on one CCX - most of the time. So, even if they are frequently moved between cores, they only suffer context switching and extra L2 misses, which hit in the L3 cache instead. That's a relatively minor problem, and Ryzen is well-equipped to deal with it since its L3 cache is high-bandwidth and reasonably low-latency.

With a full multi-threaded benchmark which uses all available cores (virtual and otherwise), the scheduler doesn't move threads around because there are no idle cores to move them to. Context-switch overhead and excess cache misses go away. Furthermore, most workloads of this type are "embarrassingly parallelisable" which means very little communication between threads is necessary for correct results - mostly "I've finished this batch" and "Here's another one to work on". Inter-CCX traffic therefore remains low, and Ryzen still performs very well.

Games don't cleanly fall into either of the above categories. Modern game engines are multithreaded to some degree, but they generally can't keep all 16 hardware threads busy at once, yet they *can* keep the CPU busy enough for many (if not all) cores to be unparked. Worse, they are not running clean, uniform, embarrassingly-parallelisable algorithms, but a heterogeneous mixture of producers and consumers which are *constantly* communicating and synchronising among themselves. This, for Ryzen, is the worst-case scenario.

And that's why we're talking about the problem in these terms - if we can tame Windows' scheduler, Ryzen will run faster in games.

But then (from what I understood) won't be just need to be playing games only on 4 cores? Because if use more than 4 Cores then its changed CCX between the other module of 4 cores and thats the slow problem?

itsmydamnation · Mar 12, 2017

gtbtk said:
@Kromaatikse You are showing your ignorance about ethernet networking. Gigabit, 10Gig, 100BaseT 10BaseT 2Base2 all use CSMA/CD (Carrier Sense Multiple Access with Collision Detection). That basically means that a device on the network just throws packets of data out addressed to another device and sees if the packet or group of packets generates an acknowledgement. If the ack doesnt come back, it sends it again because it assumes that the last packet had a collision with a packet from another device going elsewhere. In the millisecond lifetimes of data packets, there are lots of empty spaces available up until you get to about a 40% load on a multi drop network.

More bandwidth, Full Duplex, switches, Jumbo packets etc have all been invented to create order on a network and mitigate some of the down sides of CSMA/CD much like Traffic lights or a traffic cop does when trying to manage heavy motor vehicle traffic. Switches mitigate the problem by making every segment point to point so that only the switch port and the attached device are on that particular network. The switch then holds the packet data and forwards it when there is a gap in the traffic. The protocol still waits for an acknowledgement and will resend if necessary. Token ring and FDDI were protocols that manage traffic by token passing (you can only send data if you hold the token) but that approach provides too much overhead and it is more efficient to send and pray assuming that a large percentage of the time you won't need to resend. The times you do resend is the "cost of doing business" as it were. The plan all falls down as traffic loads reach that 40% number.

The 4

What are you talking about.....

Networking is a terrible example because of how dumb typical window scaling is , with simple segmentation and small amount of buffers ( something like -64k256k to a port) with a "good" TCP window scaler or actually supporting things like ECN/flow control/remote side buffer credits well see near line rate regardless of number of senders/receivers.

trying to drag this back to CSMA/CA on shared medium make no sense. There is even a point in the hot chips Q/A where micheal clarke states that the L3 is single ported but there are buffers around the L3 that each core can write to.

We know Ifinity fabric has both a control and a data plan and i think its a pretty safe assumption there are also buffers on CCX ingress and or Egress depending on exactly how they handle data flow.

Chl Pixo said:
Could the L3 be incorrectly used?
I mean most games would use intel or ms compilers.

What do you mean? compilers can't choose which level of cache to put things in. high end OOOE engines tend to no like software prefetching as it can confuse the predictors.

the L3 is an eviction cache and that is fine.

malventano · Mar 12, 2017

JDG1980 said:
The "devious scandal" was that Nvidia flat-out lied to reviewers in its spec sheets, claiming that the GTX 970 had 64 ROPs and 2MB of L2 cache when in fact it had only 56 ROPs and 1.75 MB of L2 cache. This was illegal and immoral regardless of whether any of this had any real-world effect on performance or not.

Despite all of the pitchforks and torches, it appeared to us to really just be a miscommunication between the group writing up the spec sheets vs. the group actually designing the architecture. I'm basing that on the type of reaction we got from Nvidia when we asked them about it. It really was more of a 'wait, what?' response as opposed to a Jedi hand wave trying to convince us that nothing was wrong with the spec sheet.

lolfail9001 · Mar 12, 2017

deadhand said:
Is it? That's interesting. I wasn't aware of that.

https://en.wikichip.org/wiki/amd/microarchitectures/zen

Kind of. With what we know of associated latencies ( http://www.hardware.fr/articles/956-23/retour-sous-systeme-memoire.html ), it does not look for hits outside of CCX L3 either, but that's a guess.

Ajay · Mar 12, 2017

lolfail9001 said:
Wait, wait, wait, Ryzen's l1 and l2 are local to each core, and l3 is a victim cache. The only clear path to false sharing that i see here is usage of SMT.

L3$ is 'mostly' a victim cache. I haven't read yet under what circumstances it behaves inclusively.
I'm not even sure what the L2 eviction protocol is. Has AMD published this info, or does one need to be a registered developer to get it? Public info seems to be coming out in dribs and drabs. Hmm, guess I should go look...

deadhand · Mar 12, 2017

malventano said:
Despite all of the pitchforks and torches, it appeared to us to really just be a miscommunication between the group writing up the spec sheets vs. the group actually designing the architecture. I'm basing that on the type of reaction we got from Nvidia when we asked them about it. It really was more of a 'wait, what?' response as opposed to a Jedi hand wave trying to convince us that nothing was wrong with the spec sheet.

You would think that the most basic information about a product would be conveyed correctly, or that someone aware of it might speak up when false information is on the product packaging and spec sheets themselves.

lolfail9001 · Mar 12, 2017

Ajay said:
L3$ is 'mostly' a victim cache. I haven't read yet under what circumstances it behaves inclusively.

Mostly!? That slide does not really afford any double interpretations

HC28.AMD.Mike%20Clark.final-page-013.jpg

unseenmorbidity · Mar 12, 2017

malventano said:
Nothing there is an admission of wrong. You can't just turn on NUMA as it is meant to segment memory spaces, not caches / CCX modules. The scheduler can't be expected to be aware of this as the CCX segmentation does not appear to be part of the CPUID. Further, the primary point of the article was to dispel the misinformation re: Windows 10 scheduler not properly handling physical vs. logical cores, and that point is still made with the pic I put in the comments. All that 'counters' is that a lighter load was repeatably allocated within a single CCX while the heavier load spilled over. There is no surprise here, as that is exactly what is happening in games, etc. The fix for this is to get the scheduler to prioritize CCX's to the point where it will load up the second logical cores before spreading threads to the other CCX, but again, there's nothing in place to direct it to do that and NUMA is not the magical answer here.

Odd, that's kind of the opposite of what you said here,

https://twitter.com/ryanshrout/status/840377942932893696

malventano · Mar 12, 2017

Bacon1 said:
Any chance you will do an updated version of that test? The lack of proper freesync vs gsync comparisons in the last year+ is pretty sad.

With the addition of LFC to FreeSync panels that have a sufficient FPS range to support it, the playing field is mostly equal. The only real difference I've seen anymore is that most of the FreeSync panels still don't get overdrive as good as it could be (example 1 2 3), particularly when operating in the VRR range. G-Sync still does a better job at overdrive regardless of refresh rate. So long as you are ok with possibly imperfect overdrive, and you ensure the panel supports LFC, there's no longer a reason to have to jump ship on your GPU to get the panel that does what you want it to.

I guess the answer to your question is that there isn't a big reason to repeat the test, unless you can think of something more specific you were hoping to see from such a test?

unseenmorbidity · Mar 12, 2017

malventano said:
Despite all of the pitchforks and torches, it appeared to us to really just be a miscommunication between the group writing up the spec sheets vs. the group actually designing the architecture. I'm basing that on the type of reaction we got from Nvidia when we asked them about it. It really was more of a 'wait, what?' response as opposed to a Jedi hand wave trying to convince us that nothing was wrong with the spec sheet.

You sound more like a lobbyist here than a critic.

malventano · Mar 12, 2017

unseenmorbidity said:
Odd, that's kind of the opposite of what you said here,

https://twitter.com/ryanshrout/status/840377942932893696

1. Not my tweet.
2. It's not the scheduler's fault for not knowing something that the current CPUID framework may be incapable of communicating.

malventano · Mar 12, 2017

unseenmorbidity said:
You sound more like a lobbyist here than a critic.

Just calling it like I saw it, but you're entitled to your opinion. I'd like to think that folks would put a bit more weight on someone who was a party to the actual phone calls, but logic like that doesn't work on folks with an obvious axe to grind.

lolfail9001 · Mar 12, 2017

malventano said:
2. It's not the scheduler's fault for not knowing something that the current CPUID framework may be incapable of communicating.

From Linux patches, it does communicate CCX number via Apic, but it requires separate patch to recognize it, see Dresdenboy's post http://dresdenboy.blogspot.ru/2016/02/amd-zeppelin-cpu-codename-confirmed-by.html

For present implementation: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/amd.c

Code:

        if (cpuid_edx(0x80000006)) {
            if (c->x86 == 0x17) {
                /*
                 * LLC is at the core complex level.
                 * Core complex id is ApicId[3].
                 */
                per_cpu(cpu_llc_id, cpu) = c->apicid >> 3;
            } else {
                /* LLC is at the node level. */
                per_cpu(cpu_llc_id, cpu) = node_id;
            }
        }

piesquared · Mar 12, 2017

malventano said:
Just calling it like I saw it, but you're entitled to your opinion. I'd like to think that folks would put a bit more weight on someone who was a party to the actual phone calls, but logic like that doesn't work on folks with an obvious axe to grind.

I'm curious about this- when you have the quad core in the socket, it would take probably about 5 minutes of your time.

Maybe you could do some tests to see how 8 cores compare to 4 cores in gaming during quick movements like in the example i gave @ 10:46.

https://www.youtube.com/shared?ci=WBtyzo1BZp8

That looks very fluid, id be interested to know if a quad core can replicate that smoothness without having to compromise.

deadhand · Mar 12, 2017

unseenmorbidity said:
You sound more like a lobbyist here than a critic.

malventano said:
1. Not my tweet.
2. It's not the scheduler's fault for not knowing something that the current CPUID framework may be incapable of communicating.

I don't think this thread is an appropriate place for this.

Ajay · Mar 12, 2017

lolfail9001 said:
Mostly!? That slide does not really afford any double interpretations

Sigh

From your wikichip link (and elsewhere):

The L3 cache is an 8 MiB 16-way set associative victim cache and is mostly exclusive of the L2.

lolfail9001 · Mar 12, 2017

Ajay said:
Sigh From your wikichip link (and elsewhere):

Ah, confirmation bias, i missed the mostly part in wikichip link.

Question arises, in what sense is it "mostly".

malventano · Mar 12, 2017

lolfail9001 said:
From Linux patches, it does communicate CCX number via Apic, but it requires separate patch to recognize it, see Dresdenboy's post http://dresdenboy.blogspot.ru/2016/02/amd-zeppelin-cpu-codename-confirmed-by.html

For present implementation: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/amd.c

Code:

if (cpuid_edx(0x80000006)) { if (c->x86 == 0x17) { /* * LLC is at the core complex level. * Core complex id is ApicId[3]. */ per_cpu(cpu_llc_id, cpu) = c->apicid >> 3; } else { /* LLC is at the node level. */ per_cpu(cpu_llc_id, cpu) = node_id; } }

Interesting. Curious why AMD would not have informed MS of this with enough lead time to get the feature added to their scheduler prior to Ryzen's release. It is possible that MS reserves such updates for major releases, as it is a very low-level fix that requires lots of QC and testing.

deadhand · Mar 12, 2017

malventano said:
Interesting. Curious why AMD would not have informed MS of this with enough lead time to get the feature added to their scheduler prior to Ryzen's release. It is possible that MS reserves such updates for major releases, as it is a very low-level fix that requires lots of QC and testing.

Given that they apparently didn't give enough lead time to motherboard OEMs (which i'd consider to be far more critical), I don't see this as particularly shocking.

Ryzen: Strictly technical

Golden Member

Junior Member

Golden Member

Junior Member

Junior Member

Junior Member

Junior Member

Diamond Member

Junior Member

Golden Member

Lifer

Junior Member

Golden Member

Golden Member

Junior Member

Golden Member

Junior Member

Junior Member

Golden Member

Golden Member

Junior Member

Lifer

Golden Member

Junior Member

Junior Member