Ryzen: Strictly technical

coercitiv · Mar 15, 2017

OrangeKhrush said:
possibly just stability and power state issues at present.

What stability issues?

Oleyska · Mar 15, 2017

OrangeKhrush said:
I have a source inside the PC testing circles, he had access to Ryzen from the first samples to QS samples and he posted two interesting things about Ryzen.

basically you are actually BETA testing.

OrangeKhrush said:
It looks right about on par, possibly just stability and power state issues at present.

Motherboard, Memory,Overclock and bios. nothing have given me issues - Product competes, feels finished to me.
I've had no stability issues the past 2 weeks, I've had no memory issues

C6H owners are the only ones which seems to be beta testing to me.

Mockingbird · Mar 15, 2017

OrangeKhrush said:
I have a source inside the PC testing circles, he had access to Ryzen from the first samples to QS samples and he posted two interesting things about Ryzen.

basically you are actually BETA testing.

AMD basically pulled an all-nighter with Zen.

It reminds me a little bit of my friends and I.

The quarter is ten weeks long excluding finals week.

Week 4: most classes have their first midterms

Week 6: first project/report/essay for each of classes are due

Week 8: most classes have their second midterms

Week 10: second project/report/essay for each of the classes are due

Week 11: Final Exams

Naturally, since we are always busy, we have to occasionally pull all-nighters.

The paper gets the points across, but are literally filled with spelling and grammar errors that would have been easily been corrected had there been more time.

The same can be said for AMD and Zen.

Kromaatikse · Mar 15, 2017

innociv said:
I think you misread/misheard.

The only update for Linux, which happened right away, was to correctly assign SMT threads.

-znver1 is relying upon the btver1 scheduler model. Btver1 is for AMD's Bobcat.
People are finding using the Haswell scheduler model improves performance 5-10% in Linux, but they are working on a proper Zen scheduling model that should be more like 10-20% improvements in some cases.

And this is exactly what Windows needs, its own scheduler model for Ryzen...

I think you might be confusing process scheduling with instruction scheduling. The latter is a function of the compiler; Windows apps, like Linux distro packages, generally optimise for a generic CPU to maximise compatibility.

keymaster151 · Mar 15, 2017

Since we know of two different games that treat Ryzen as a 16 core processor it made me think. Is there a way to force a process to treat the CPU as 8/16 instead? If this is an issue with more games, could this be the cause of the less than optimal performance we are seeing with SMT enabled in some games?

lolfail9001 · Mar 15, 2017

keymaster151 said:
Since we know of two different games that treat Ryzen as a 16 core processor it made me think. Is there a way to force a process to treat the CPU as 8/16 instead? If this is an issue with more games, could this be the cause of the less than optimal performance we are seeing with SMT enabled in some games?

Win10 scheduler already does that. Besides, didn't you see that explicitly setting R7 as 8 core in F1's case led to measly 3% improvement?

keymaster151 · Mar 15, 2017

But if the scheduler already does that, how come there is any improvement at all?

lolfail9001 · Mar 15, 2017

keymaster151 said:
But if the scheduler already does that, how come there is any improvement at all?

Because SMT by the virtue of simply being enabled gimps few queues (uop queue, retire queue and store queue) with static partitioning. If you can harness the full throughput (and Zen apparently does have more non-AVX throughput than Skylake) with SMT it is great. Not so much otherwise.

keymaster151 · Mar 15, 2017

Hmm. The post about F1 2016 made no mention of turning off SMT, only that they set it to 8 physical cores instead of 16.

Kromaatikse · Mar 15, 2017

That's one particular game which has a "clever" but decidedly faulty system for optimising its use of threads for particular machines. That it has a configuration file associated with that system is fortuitous, as it allows experimenting as AMD did internally.

Each such game will have its own quirks and will require separate attention.

dnavas · Mar 15, 2017

looncraz said:
One thing that does scale, in the real world, with frequency that is unexpected: memory performance. Usually changing the core frequency doesn't have much of an impact on memory reads and writes - maybe 500MB/s or so. I'm seeing 35GB/s changing to 43GB/s going from 3GHz to 3.8GHz - and Geekbench memory scores jumping from 3500 @ 3Ghz to 4000 at 3.8GHz.

How fast is the memory you're running? Do you see changes in latency? Someone (I forget who) was saying they were seeing occasional 300ns latencies. I'm wondering if there's a general problem between the CPU and the MC?
Your Windows 7 problems are troubling, given I'm using Win7, though I think I'll wait on making more than just memory system changes atm.

CataclysmZA · Mar 15, 2017

For those of you still testing inter-CCX communication and trying to set core affinities, Ryzen has a peculiar habit of pinging/strobing both RAM and the L3 cache in adjacent cores when the L3 cache of one core doesn't contain the needed data. If it doesn't receive a reply from either of those (and it probably needs a reply from both), it is supposed to ping the L3 caches connected to cores in the other CCX. Considering that going to the other CCX adds 100ns of latency, I think that if the second memory controller is only connected directly to the second CCX, that this is what is causing some of the issues with games that can, and do, scale across CCXs, or somehow have equal amounts of threads in both complexes. CCX bandwidth was initially pegged at 22GB/s, right? Well, that's the theoretical peak transfer rate of DDR4-2666. Maybe that's why it scales with RAM, rather than relying on an uncore rate like Intel's architectures. If your RAM is faster, then you access the cache with less latency at that same speed when running across CCX modules.

L3 isn't completely a victim cache, to be clear. It's mostly a victim cache for L2 data, but it can also be used as a regular L3 cache all the same. I wager that some games that don't load up the caches enough to spill into L3 do relatively well in the performance stakes. Games that also don't have a lot of inter-thread dependency also don't seem to have an issue running well on Ryzen (Mafia III being one of them, because it also performs absurdly well on Piledriver). If most of L3 is filled with L2 cache data though, that means that RAM acts as a last-level cache, and both it and the L3 caches get pinged regardless of what data is in them. This won't help, I imagine, if your second memory controller has a longer pathway than the path along the fabric to the second CCX.

Correct me if I'm wrong, obviously. I'm still learning here.

Kromaatikse · Mar 15, 2017

CataclysmZA said:
RAM acts as a last-level cache

This is an oxymoron and makes no sense. Please stop repeating it from wherever you heard it.

My understanding of the cache-lookup procedure on Ryzen is as follows:

First cache lookup is to L1-D of same core. L2 isn't touched unless this fails. At this point TLB lookup (for virtual memory) has already succeeded, one way or another.
Next the local L2 cache is checked. This takes a bit longer, as there are more ways to go through and it's further away.
If the local L2 cache misses, the request is sent straight to the L3 cache. This holds, among other things, a partial copy of the L2 tags for other cores in the same CCX. If one of *those* hits, the request is then forwarded to that L2 cache in a partially-decoded state; the correct cache completes the lookup and supplies the data. Because the L2-L1 hierarchy is inclusive, it is not necessary to also perform a lookup in other L1 caches.
If the L2 tag lookups fail, the L3 lookup was already in progress and now completes. If it hits, the data is supplied by the local L3 cache and promoted to the L2 and L1 caches.
If the local L3 cache lookup fails, the request is broadcast over Infinity Fabric to the other L3 cache(s) in the system (possibly plural because of Naples & Snowy Owl). The hybrid L2/L3 lookup procedure above is thus repeated.
If all of these lookups fail, the request is routed to the appropriate RAM controller. This is the final resort, and is only initiated after *all* possible cache lookups have proved fruitless.

Blake_86 · Mar 15, 2017

Kromaatikse said:
This is an oxymoron and makes no sense. Please stop repeating it from wherever you heard it.

My understanding of the cache-lookup procedure on Ryzen is as follows:

First cache lookup is to L1-D of same core. L2 isn't touched unless this fails. At this point TLB lookup (for virtual memory) has already succeeded, one way or another.

Next the local L2 cache is checked. This takes a bit longer, as there are more ways to go through and it's further away.

If the local L2 cache misses, the request is sent straight to the L3 cache. This holds, among other things, a partial copy of the L2 tags for other cores in the same CCX. If one of *those* hits, the request is then forwarded to that L2 cache in a partially-decoded state; the correct cache completes the lookup and supplies the data. Because the L2-L1 hierarchy is inclusive, it is not necessary to also perform a lookup in other L1 caches.

If the L2 tag lookups fail, the L3 lookup was already in progress and now completes. If it hits, the data is supplied by the local L3 cache and promoted to the L2 and L1 caches.

If the local L3 cache lookup fails, the request is broadcast over Infinity Fabric to the other L3 cache(s) in the system (possibly plural because of Naples & Snowy Owl). The hybrid L2/L3 lookup procedure above is thus repeated.

If all of these lookups fail, the request is routed to the appropriate RAM controller. This is the final resort, and is only initiated after *all* possible cache lookups have proved fruitless.

so the problems may begin at the level of the last 2 points, right? The More requests misses and fall back to L3 or ram level, the more cycles are lost waiting for data because the fabric bandwidth is shared between all the requests.

OrangeKhrush · Mar 15, 2017

Maybe someone can make out what this means

https://www.techpowerup.com/231518/a...ng-am4-updates

This. The info is incomplete, this support for these memory speeds won't be for AM4, but another platform.

Not sure if that was in reference to my last post, but all those are AM4 and the same platform as the ones we have right now.
Why it's rather challenging to add new memory divider support for CPUs is because boards are designed with specific trace lengths and all signal routing is done for specific memory frequencies.
It's not all the same, from 2133 to 3,600MHz. So even though the dividers may be added via uCode. They may not work at all simply because the board track/trace layout cannot handle such frequencies.
Again, not saying it's impossible, but these frequencies are for another platform built with these memory dividers in mind from the beginning.

lolfail9001 · Mar 15, 2017

Kromaatikse said:
My understanding of the cache-lookup procedure on Ryzen is as follows:

Your understanding is most likely wrong, because hardware.fr tests strongly imply that any accesses in any block larger than 8MB go straight to DRAM.
http://www.hardware.fr/articles/956-23/retour-sous-systeme-memoire.html

Blake_86 · Mar 15, 2017

lolfail9001 said:
Your understanding is most likely wrong, because hardware.fr tests strongly imply that any accesses in any block larger than 8MB go straight to DRAM.

they assumed this obderving the abnormous jump in latency when the accessses grow larger then 8 MB right? If the latency go sky-level mean there is an acces on the ram. The test is empiric so is it possible to assume it 's 100% correct?

dfk7677 · Mar 15, 2017

lolfail9001 said:
Your understanding is most likely wrong, because hardware.fr tests strongly imply that any accesses in any block larger than 8MB go straight to DRAM.
http://www.hardware.fr/articles/956-23/retour-sous-systeme-memoire.html

Could the latency of cross-CCX access be the same as the one of the memory?

lolfail9001 · Mar 15, 2017

dfk7677 said:
Could the latency of cross-CCX access be the same as the one of the memory?

Well, PCPer numbers suggest it could be. But that would be a terrific mess up at design stage, so i am optimistic and think it was a deliberate choice to skip global L3 altogether.

deadhand · Mar 15, 2017

Blake_86 said:
they assumed this obderving the abnormous jump in latency when the accessses grow larger then 8 MB right? If the latency go sky-level mean there is an acces on the ram. The test is empiric so is it possible to assume it 's 100% correct?

It's very easy to test the wrong thing with a micro benchmark, should be very careful about such things.

dfk7677 · Mar 15, 2017

If cross-CCX latency is the same as the memory's then skipping the other CCX's L3 would be the same thing. Does this mean that Ryzen 7 has an effective 8MB L3$ when more than 4 cores are used by the same program, instead of 16MB?

richaron · Mar 15, 2017

I have a theory the dual CCX idea is basically a working development board for refining software (/hardware) for future AMD MCM (/"infinity fabric") systems. And further the dual CCX idea is currently a more applicable and "simplified" (i.e. lower initial complexity [software already knows SMP, now only 2 distinct sections needing communication]) version of Bulldozer's CMP idea; which I postulate had the same final goal (i.e. software/platform development on a competitive system) but was much too drastic a change for software to capitalize upon.

Did you say strictly technical or tin foil hats welcome?

Kromaatikse · Mar 15, 2017

lolfail9001 said:
Your understanding is most likely wrong, because hardware.fr tests strongly imply that any accesses in any block larger than 8MB go straight to DRAM.
http://www.hardware.fr/articles/956-23/retour-sous-systeme-memoire.html

Ah, but that's expected given the above, and knowing that the L3 cache only contains data which has been evicted from its *local* L2 caches. So if you run a single-threaded microbenchmark, you'll see an 8MB LLC.

You also have to be careful about the stride of such a test's accesses, to ensure they cover all four address-interleaved slices of the L3 instead of accidentally hitting only one of them.

The latency of inter-CCX cache access is still less than going all the way to RAM, but is more complicated to measure.

Kromaatikse · Mar 15, 2017

richaron said:
I have a theory the dual CCX idea is basically a working development board for refining software (/hardware) for future AMD MCM (/"infinity fabric") systems. And further the dual CCX idea is currently a more applicable and "simplified" (i.e. lower initial complexity [software already knows SMP, now only 2 distinct sections needing communication]) version of Bulldozer's CMP idea; which I postulate had the same final goal (i.e. software/platform development on a competitive system) but was much too drastic a change for software to capitalize upon.

Did you say strictly technical or tin foil hats welcome?

No, I think the CCX module is genuinely a tool for making larger multi-core designs than would otherwise by feasible for AMD. They are using the same Infinity Fabric based design for the larger Naples and Snowy Owl server/workstation parts, which are in fact MCM.

As for Bulldozer, it's been proved conclusively that there was no "potential performance" that could feasibly be unlocked by adopting better optimisations. Not only that, but K10 would in many workloads have achieved better performance in the same power and die size budgets on the same process, compared to *either* Bulldozer or Piledriver. Steamroller and Excavator might have been small improvements, but not nearly enough to justify the R&D costs and ecosystem disruption.

The difference with Ryzen is that the "potential performance" does actually exist this time, and is easily demonstrated in productivity benchmarks. Games are achieving anomalously low performance in comparison, though it's still much better performance than anything from the Bulldozer family could even dream about. These are fixable problems.

CataclysmZA · Mar 15, 2017

Kromaatikse said:
The latency of inter-CCX cache access is still less than going all the way to RAM, but is more complicated to measure.

To clarify all the CCX stuff, I've sent a few questions to my contact at AMD and hopefully we'll have some answers by the weekend. I specifically asked about latency to all points of the system, including from CCX0 to IMC1 and IMC2 in a Ryzen 7 processor.

I'm not too clued up on how Ryzen's cache structure works either, so I'm trying to learn about it from as many sources as I can. There's still so much about infinity fabric that we don't know.

Ryzen: Strictly technical

Diamond Member

Junior Member

Senior member

Member

Junior Member

Golden Member

Junior Member

Golden Member

Junior Member

Member

Senior member

Junior Member

Member

Junior Member

Senior member

Golden Member

Junior Member

Member

Golden Member

Junior Member

Member

Golden Member

Member

Member

Junior Member