Ryzen: Strictly technical

Kromaatikse · Mar 23, 2017

Ajay said:
There is a description of the scheduler from 'Windows Internals: part 1' by Russinovich and Solomon here (PDF) :
https://download.microsoft.com/down...2B9FD877995/97807356648739_SampleChapters.pdf

The 7th edition, including windows 10 and Server 2016 is due out this year (I'm probably going to get it - I'm so outdated).

That was an interesting read, but unfortunately the algorithms it describes in critical areas are not consistent with the actual behaviour observed (neither on Win7 nor Win10). Therefore I cannot rely on any of the information in it.

looncraz · Mar 23, 2017

lolfail9001 said:
Is it? Memory access is still uniform, it is more of an Skulltrail than NUMA config, now that i consider it. In fact, it is literally Skulltrail config, just on single die: 2 quad cores with plenty of cache for them connected by a bus with each other, memory and all the IO.

It's much closer to NUMA than to a monolithic core. You have half the CPU which can only talk to the other half, AFAICT, by going to the IMC. All of the prefetch logic become useless as soon as data needs to be shared between those halves of the CPU.

This is very close to the dynamic you see with NUMA, just it's occurring on-die.

I am currently preparing to update to the latest Windows 10 because my existing install is performing pretty horridly - hoping a simple upgrade installation will do the trick.

Trovaricon · Mar 23, 2017

lolfail9001 is right, it is more like a Skulltrail platform but with extremely buffed up "FSB". According to AMD slides any access to memory goes through "FSB" (data fabric) to memory controller. Therefore Idea of "every ccx has a single channel IMC = has lower latency when accessing its memory space" is not correct.

NUMA has nothing to do with cache... What about 2 or 4 cores sharing a L2? Or even L1i in case of Bulldozer uArch? Do you consider them "NUMA-like" architectures?

If one writes algorithm (in)correctly you can outperform SandyBridge i3 by Piledriver A10 core in a single threaded micro benchmark (personal experience)

The point is, that AMD again came with not-so-copy-paste (from Intel) architecture CPU building block - it is much less exotic than Bulldozer, but it is not 1to1 intel replacement when it comes to design of (some) high performance algorithms.

CatMerc · Mar 23, 2017

Trovaricon said:
lolfail9001 is right, it is more like a Skulltrail platform but with extremely buffed up "FSB". According to AMD slides any access to memory goes through "FSB" (data fabric) to memory controller. Therefore Idea of "every ccx has a single channel IMC = has lower latency when accessing its memory space" is not correct.

NUMA has nothing to do with cache... What about 2 or 4 cores sharing a L2? Or even L1i in case of Bulldozer uArch? Do you consider them "NUMA-like" architectures?

If one writes algorithm (in)correctly you can outperform SandyBridge i3 by Piledriver A10 core in a single threaded micro benchmark (personal experience)

The point is, that AMD again came with not-so-copy-paste (from Intel) architecture CPU building block - it is much less exotic than Bulldozer, but it is not 1to1 intel replacement when it comes to design of (some) high performance algorithms.

It's NUMA like, not NUMA. I don't think anyone claims otherwise.
If you think about the L3 as "memory", then it's very much NUMA like.

I believe Bulldozer's design has identical access times to its L2 cache between two cores in a module, so it's not exactly the same situation. With Zen the design has two blocks with non uniform access to L3 cache.

itsmydamnation · Mar 23, 2017

Trovaricon said:
lolfail9001 is right, it is more like a Skulltrail platform but with extremely buffed up "FSB". According to AMD slides any access to memory goes through "FSB" (data fabric) to memory controller.

if you mean HT/QPI and the ring bus are all "FSB"'s then sure.

The point is, that AMD again came with not-so-copy-paste (from Intel) architecture CPU building block - it is much less exotic than Bulldozer, but it is not 1to1 intel replacement when it comes to design of (some) high performance algorithms.

Its all a matter of trade off's, Zepplin trades inter ccx latency for easy to scale core count and a single SOC that has a massive TAM.

Maybe you can explain better why Ryzen is lagging behind Intel IPC in less threaded tasks? I will wait...

Its inter CCX latency obviously, but bandwidth or clock doesn't singularly determine performance and in the case of increased memory clock improving performance massively we are far more likely seeing the improved latency then the throughput itself. But as i said before we have no idea yet how the inter CCX cache coherency protocol works. Hopefully we can find out when Naples is released.

Inter CCX thread ping pong isn't ideal, but threads should still be shared across CCX's to minimize hot spots etc.

coercitiv · Mar 23, 2017

looncraz said:
It's much closer to NUMA than to a monolithic core.

Can't we just please everyone and say it's closer to LEGO than anything else?

Kromaatikse · Mar 23, 2017

coercitiv said:
Can't we just please everyone and say it's closer to LEGO than anything else?

I certainly wouldn't want to tread on one in the middle of the night!

Trovaricon · Mar 23, 2017

Non uniform memory architecture

CatMerc said:
It's NUMA like, not NUMA. I don't think anyone claims otherwise.
If you think about the L3 as "memory", then it's very much NUMA like.

Cache is not memory! Cache is a cache. (edit: it is not a memory in the context of this discussion)
Where is written that UMA CPU's LLC should span over the entire processor (to all cores)?
Is Athlon x2 (K8 & K10) Athlon x4 (K10), all Bulldozer uarch based APUs a NUMA-like architecture? Jaguar in x2/x4 configuration has unified LLC, x8 (consoles) hasn't.

Yes, most Intel architectures (products based on them) are designed this way. And... yes, most AMD products aren't (that is nothing new).

CatMerc said:
I believe Bulldozer's design has identical access times to its L2 cache between two cores in a module, so it's not exactly the same situation. With Zen the design has two blocks with non uniform access to L3 cache.

Well then jumping of a single thread L2 data-fitting algorithm in Bulldozer uArch shows different performance depending on where does the thread gets re-scheduled especially considering the high latency low bandwidth L3, doesn't it?

What about tests with 10-15MB dataset, intentional algorithmic jumping around memory (to always cause cache-miss)

If AMD manages to connect zeppelin dies together using the same bandwidth / latency data fabric as within the die between CCXs (pipe dream) then you end up with UMA multi-die CPU.

Kromaatikse · Mar 23, 2017

Trovaricon said:
If AMD manages to connect zeppelin dies together using the same bandwidth / latency data fabric as within the die between CCXs (pipe dream) then you end up with UMA multi-die CPU.

Pipe dream? All they have to do is send the Infinity Fabric link between MCM dies using, probably, something like the interposer technology already used for HBM. It's eminently doable, especially for Snowy Owl (the 16-core workstation part) where they only need to connect two dies together that way.

However, I think I heard something about Naples (32-core) having an L4 cache in addition to the four Zeppelins. I reckon that, as well as eliminating the IMC itself from inter-CCX transfers (if that's really what happens, of which I'm skeptical) and reducing the average latency due to relatively slow ECC server-certified RAM, the L4 cache die serves as an Infinity Fabric hub, simplifying the problem of connecting four dies together.

CatMerc · Mar 23, 2017

Kromaatikse said:
Pipe dream? All they have to do is send the Infinity Fabric link between MCM dies using, probably, something like the interposer technology already used for HBM. It's eminently doable, especially for Snowy Owl (the 16-core workstation part) where they only need to connect two dies together that way.

However, I think I heard something about Naples (32-core) having an L4 cache in addition to the four Zeppelins. I reckon that, as well as eliminating the IMC itself from inter-CCX transfers (if that's really what happens, of which I'm skeptical) and reducing the average latency due to relatively slow ECC server-certified RAM, the L4 cache die serves as an Infinity Fabric hub, simplifying the problem of connecting four dies together.

Didn't hear about Naples having L4 cache, probably some theoretical discussion.

As for connecting the two CCX, another metal layer should do the trick in a revision, which would be cheaper than an interposer.

CatMerc · Mar 23, 2017

Trovaricon said:
Non uniform memory architecture

Cache is not memory! Cache is a cache. (edit: it is not a memory in the context of this discussion)
Where is written that UMA CPU's LLC should span over the entire processor (to all cores)?
Is Athlon x2 (K8 & K10) Athlon x4 (K10), all Bulldozer uarch based APUs a NUMA-like architecture? Jaguar in x2/x4 configuration has unified LLC, x8 (consoles) hasn't.

Yes, most Intel architectures (products based on them) are designed this way. And... yes, most AMD products aren't (that is nothing new).

Well then jumping of a single thread L2 data-fitting algorithm in Bulldozer uArch shows different performance depending on where does the thread gets re-scheduled especially considering the high latency low bandwidth L3, doesn't it?

What about tests with 10-15MB dataset, intentional algorithmic jumping around memory (to always cause cache-miss)

If AMD manages to connect zeppelin dies together using the same bandwidth / latency data fabric as within the die between CCXs (pipe dream) then you end up with UMA multi-die CPU.

NUMA like, not NUMA. I don't think anyone is contesting that.

It has similar behavior to NUMA is the point.

ryzenmaster · Mar 23, 2017

Kromaatikse said:
Is this result with Core Parking on? If so, how does it change with it off? (Or vice versa.)

This would be with core parking disabled. I'll have to run some tests with core parking enabled, though I suspect it will still migrate threads, just not on the ones that are parked when running lightly threaded workloads.

At any rate I shouldn't have to rely on crippling compute capacity in order to get the most out of it. I have 8 cores and I want them all be used, so I prefer to keep core parking off. I did this even on my Phenom II x6 1090T ever since some results suggested that Battlefield 4 performance may suffer from core parking.

looncraz · Mar 23, 2017

Trovaricon said:
lolfail9001 is right, it is more like a Skulltrail platform but with extremely buffed up "FSB". According to AMD slides any access to memory goes through "FSB" (data fabric) to memory controller. Therefore Idea of "every ccx has a single channel IMC = has lower latency when accessing its memory space" is not correct.

NUMA has nothing to do with cache... What about 2 or 4 cores sharing a L2? Or even L1i in case of Bulldozer uArch? Do you consider them "NUMA-like" architectures?

If one writes algorithm (in)correctly you can outperform SandyBridge i3 by Piledriver A10 core in a single threaded micro benchmark (personal experience)

The point is, that AMD again came with not-so-copy-paste (from Intel) architecture CPU building block - it is much less exotic than Bulldozer, but it is not 1to1 intel replacement when it comes to design of (some) high performance algorithms.

This is partly determined by what is acting as the LLC for the CPU and how cores can communicate with each other.

If Ryzen had an L4 which handled L3 evictions and global data, then there'd be nothing NUMA-like about Ryzen. Instead, Ryzen has to core complexes which are fully independent of each other - if they communicate at all, it's through memory.

It is very much like a dual socket system - and we treat those CPUs specially.

Trovaricon · Mar 23, 2017

Kromaatikse said:
Pipe dream? All they have to do is send the Infinity Fabric link between MCM dies using, probably, something like the interposer technology already used for HBM. It's eminently doable, especially for Snowy Owl (the 16-core workstation part) where they only need to connect two dies together that way.

However, I think I heard something about Naples (32-core) having an L4 cache in addition to the four Zeppelins. I reckon that, as well as eliminating the IMC itself from inter-CCX transfers (if that's really what happens, of which I'm skeptical) and reducing the average latency due to relatively slow ECC server-certified RAM, the L4 cache die serves as an Infinity Fabric hub, simplifying the problem of connecting four dies together.

Yes, AMD's presentation about Infinity Fabric describes Scalable Data fabric as Multi Socket & Multi Die Ready.
The question is what will be the inter-die width of Infinity Fabric (its DF part) "channel" & latency.
What we know is that CCX is connected through 256bit "channel" to on-die DF but what is the data fabric "backbone" bandwidth and how it performs with more than 2 CCXes moving data around (slide about DF has "engines" / "hubs" drawn in a schema) remains to be seen.

Did anyone try / seen test to "hammer" DataFabric "backbone" (hub/engine/whatever)?
A Diagram of Ryzen’s Clock Domains on 18th page contains information about "width" of components connected to DF: CCx 32B/c, IMC 32B/c, IO Hub 32B/c. To really test DF "switching speed" you need to access RAM at max speed and also IO devices (some kind of PCI-e bandwidth test) at the same time.

Atari2600 · Mar 23, 2017

Here is a stupid question - I assume all requests for data are sent to local CCX, foreign CCX and main system memory at the same time? Then the uncore loads in the first positive return that comes its way?

[it must do... it'd be daft to wait for a return from L1 before interrogating L2, wait for that return before interrogating local L3, wait for that return before interrogating foreign L3 etc etc]

powerrush · Mar 23, 2017

imported_jjj said:
That seems mostly a matter of DRAM latency and not CCX to CCX.
You don't specify the CAS for the DRAM used but with 2133 CL14 you would get around 100ns while with 3200 CL14 you would get low 70s.
The issue with CCX to CCX seems to be the data path not BW or latency. It doesn't go straight to the other CCX and we don't quite know what it does and why.
Clocking the DF higher would reduce latency but the latency will still be high if the data always goes through a bunch of hops.

The latency of memory is probably the main problem, but the CCX designs add another layer to this problem.

How the process thread is transported through the CCX ? It jumps directly through the data fabric to the ccx or is a DMA operation ?

imported_jjj · Mar 23, 2017

powerrush said:
The latency of memory is probably the main problem, but the CCX designs add another layer to this problem.

How the process thread is transported through the CCX ? It jumps directly through the data fabric to the ccx or is a DMA operation ?

Wasn't saying that there is a memory latency problem though. Broadwell-E is in the 60s with decent memory (3200 CL16) and i think Summit Ridge can get there or very close soon.
The fabric is just a coherent interconnect and due to its marketing name, people are freaking out too much about it.
My (crazy) theory is that the CCX to CCX implementation is GMI related, this die was made for MCM with a very limited budget.Ofc that's speculation and should not be presented as fact.

piesquared · Mar 23, 2017

I don't see a 'problem' at all with mine. What i do see, and expected to see, is an opportunistic minority throwing around negative adjectives. Excellent marketing too all around, there's nothing they could have done better really. Patches are already inbound to close the single area where there is a small gap in performance, some games. For other games like Mafia III, Ryzen outperforms everything which indicate the architecture is solid for gaming, developers only need to update the code as it's a different architecture that they have been developing games on for 10 years. Yet it's a beast out of the box for productivity.

Ajay · Mar 23, 2017

Kromaatikse said:
That was an interesting read, but unfortunately the algorithms it describes in critical areas are not consistent with the actual behaviour observed (neither on Win7 nor Win10). Therefore I cannot rely on any of the information in it.

Yeah, there are details that are clearly missing. The thread state table and state machine are far to simplistic to determine how the scheduler is going to behave. I think this is the reason that the authors put a fair bit of focus on the use of utilities - so that programmers can tune their code based on observed behaviors. It definitely doesn't appear to be the case that the scheduler is actually deterministic - I agree with you there.

This is unsurprising given that there is no module dedicated to thread scheduling and dispatch are distributed throughout the kernel!!

The Windows scheduling code is implemented in the kernel. There’s no single “scheduler” module or routine, however—the code is spread throughout the kernel in which scheduling-related events occur. The routines that perform these duties are collectively called the kernel’s dispatcher. The following events might require thread dispatching:

■■ A thread becomes ready to execute—for example, a thread has been newly created or has just
been released from the wait state.
■■ A thread leaves the running state because its time quantum ends, it terminates, it yields
execution,
or it enters a wait state.
■■ A thread’s priority changes, either because of a system service call or because Windows itself
changes the priority value.
■■ A thread’s processor affinity changes so that it will no longer run on the processor on which it
was running.

Crazy stuff

Edit: Some of the quote didn't get pasted.

CataclysmZA · Mar 23, 2017

Atari2600 said:
Here is a stupid question - I assume all requests for data are sent to local CCX, foreign CCX and main system memory at the same time? Then the uncore loads in the first positive return that comes its way?

Mostly correct, at least from what we know. When there's a data request for data that isn't on the local CCX L3, a strobe is sent out to the second CCX's L3 caches and to RAM at the same time. There's no information about the local CCX request being done at the same time, although it would make sense to do that given all the available bandwidth.

Ajay said:
This is unsurprising given that there is no module dedicated to thread scheduling and dispatch are distributed throughout the kernel!!

Crazy stuff

Contrast that to the Linux CFS: https://en.wikipedia.org/wiki/Completely_Fair_Scheduler

Microsoft does some really crazy stuff with their kernel.

lolfail9001 · Mar 23, 2017

looncraz said:
You have half the CPU which can only talk to the other half, AFAICT, by going to the IMC.

Did you test it? AMD's slides sure suggest otherwise. In fact, if you were right then 4+0 config would not really be uniform in memory access at ALL.

CatMerc said:
It's NUMA like, not NUMA. I don't think anyone claims otherwise.
If you think about the L3 as "memory", then it's very much NUMA like.

But we have no real clue if L3 is ever touched by different CCX. hardware.fr's test heavily implies it does not, imho. So, it is literally symmetric in regards to memory and other I/O.

itsmydamnation said:
if you mean HT/QPI and the ring bus are all "FSB"'s then sure.

Ring bus on Intel stuff would be a fair comparison, but does not afford painting Ryzen as Skulltrail the SoC.

looncraz said:
Instead, Ryzen has to core complexes which are fully independent of each other - if they communicate at all, it's through memory.

It is very much like a dual socket system - and we treat those CPUs specially.

Oh, please, you are well aware that dual socket system is not always NUMA-like. Sure, modern systems are all NUMA cause of memory controllers being integrated on all the modern CPUs since Nehalem. But the old ones...

piesquared said:
I don't see a 'problem' at all with mine. What i do see, and expected to see, is an opportunistic minority throwing around negative adjectives. Excellent marketing too all around, there's nothing they could have done better really. Patches are already inbound to close the single area where there is a small gap in performance, some games. For other games like Mafia III, Ryzen outperforms everything which indicate the architecture is solid for gaming, developers only need to update the code as it's a different architecture that they have been developing games on for 10 years. Yet it's a beast out of the box for productivity.

Got anything technical to say?

looncraz · Mar 23, 2017

lolfail9001 said:
Oh, please, you are well aware that dual socket system is not always NUMA-like. Sure, modern systems are all NUMA cause of memory controllers being integrated on all the modern CPUs since Nehalem. But the old ones...

Every multi-socket system with a unified LLC per socket is NUMA-like, with the node being the socket.

A multi-die processor with no unified LLC but with shared cache between more than one core is NUMA-like (Core 2 Quad).

AMD construction core CPUs with L3 do not apply, but those without absolutely do - but inter-core bandwidth and latency were plenty enough for the performance the cores offered, negating the penalty... but Windows was modified to accommodate it.

Sharing memory between two sockets incurs a non-uniform penalty compared to accessing memory on just one socket when the CPU in each socket has a LLC - even if the memory controller is part of an external north bridge.

It is only NUMA-like as this only impacts in-use/cached memory and not stale memory.

The difference in L3 bandwidth and latency within the CCX and to main memory is on the order of 500%. Depending on what memory you are accessing with what CCX at a given time, you may pay a 500% penalty for treating Ryzen as a monolithic processor.

That's the same concern with NUMA, except that it's a more static situation.

Trovaricon · Mar 23, 2017

I see that everyone started to apply his point of view from a software side - algorithm optimization problems on description of hardware architecture (including me).
Big dataset processing vs. producer-consumer with small dataset vs. "I don't know what else - its after midnight" problems.

looncraz said:
It is only NUMA-like as this only impacts in-use/cached memory and not stale memory.

The difference in L3 bandwidth and latency within the CCX and to main memory is on the order of 500%. Depending on what memory you are accessing with what CCX at a given time, you may pay a 500% penalty for treating Ryzen as a monolithic processor.

That's the same concern with NUMA, except that it's a more static situation.

Now imagine if Intel's L3 wasn't inclusive - this is actually the food for though we should probably focus on. It is the source of inter-core latency & bandwidth Intel reigns in. A single high priority "rogue" thread could destroy most of the benefits of LLC.
Victim partitioned L3 vs Incusive non-partitioned L3. Do we know anything about "coherent fabric" role of Infinity Fabric in on-die communication?

coercitiv said:
Can't we just please everyone and say it's closer to LEGO than anything else?

I have a feeling you knew where this "terminus technicus crossfire" would lead. If we can't call it by its name - Non uniform/segmented/partitioned last level cache, then LEGO it is!

Dresdenboy · Mar 23, 2017

I think, we already have a good understanding of what's happening in and between the 2 CCXs in a Ryzen CPU, mostly as analysis of specific behavioural patterns in kind of test cases (incl. the one PCPer did).

Real world software shows a mix of different effects. And we've also seen that Win performance profiles, core parking, SMT off can show performance gains similar to moving from x+x to 2x+0 CCX configurations. So how might these effects be related?

looncraz · Mar 23, 2017

Trovaricon said:
Now imagine if Intel's L3 wasn't inclusive - this is actually the food for though we should probably focus on. It is the source of inter-core latency & bandwidth Intel reigns in. A single high priority "rogue" thread could destroy most of the benefits of LLC.

Ryzen uses a mostly exclusive L3 per CCX. It keeps a copy of the L2 tags and the burden of getting data from a neighboring core's L2 is actually quite similar to the cost of getting it from the L3, AFAICT from testing. I can only imagine AMD specifically designed it to be that way.

I think a future improvement for the L3 would be to prefetch directly into it - that should help with in-page random access latency and would help some of the algorithms which are currently behaving badly on Ryzen. There are a lot of caveats to that, though.

From what I can tell on Ryzen, any one core can only evict to 4MB of L3 - which is why single-threaded cache latency tests show a sudden latency hit when exceeding 4MB. Each core can read from any part of the L3, though, so there will be multiple cache tag searches at once.

Some of my testing which uses a mutex-free user-mode spinlock seems to suggest inter-CCX latency is only 32~60 cycles for commands or simple data (one way). As soon as the data is more than 128-bits, it seems latency skyrockets - but I have to modify my test more - it's pretty cruddy right now.

I think the command bus is much lower latency than the data bus and can even carry a small data payload. This would help explain why Ryzen has amazing multi-threaded scaling even across CCXes, but light, data-heavy, workloads suffer immensely.

Ryzen: Strictly technical

Member

Senior member

Member

Golden Member

Diamond Member

Diamond Member

Member

Member

Member

Golden Member

Golden Member

Member

Senior member

Member

Golden Member

Junior Member

Senior member

Golden Member

Lifer

Junior Member

Golden Member

Senior member

Member

Golden Member

Senior member