Ryzen: Strictly technical


Kromaatikse

Member
Mar 4, 2017
83
169
56
There is a description of the scheduler from 'Windows Internals, Part 1' by Russinovich and Solomon here (PDF):
https://download.microsoft.com/down...2B9FD877995/97807356648739_SampleChapters.pdf

The 7th edition, covering Windows 10 and Server 2016, is due out this year (I'm probably going to get it - I'm so outdated).

That was an interesting read, but unfortunately the algorithms it describes in critical areas are not consistent with the actual behaviour observed (on either Win7 or Win10). Therefore I cannot rely on any of the information in it.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Is it? Memory access is still uniform; it is more of a Skulltrail than a NUMA config, now that I consider it. In fact, it is literally a Skulltrail config, just on a single die: two quad-cores, each with plenty of cache, connected by a bus to each other, to memory, and to all the IO.

It's much closer to NUMA than to a monolithic core. You have half the CPU which, AFAICT, can only talk to the other half by going through the IMC. All of the prefetch logic becomes useless as soon as data needs to be shared between those halves of the CPU.

This is very close to the dynamic you see with NUMA, except that it's occurring on-die.
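Something like this would put a number on that dynamic (my own untested sketch, in C++ on Windows; the mapping of logical CPUs 0-3 to CCX0 and 4-7 to CCX1, with SMT off, is an assumption you'd need to verify). Bounce a one-int token between two pinned threads and compare same-CCX vs. cross-CCX pinning:

```cpp
// Sketch: one-token "ping-pong" between two pinned threads.
// Pin to cores 0 and 1 for same-CCX, 0 and 4 for cross-CCX (assumed layout).
#include <windows.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

static std::atomic<int> token{0};
constexpr int kIters = 1'000'000;

// Pin the calling thread to one logical CPU.
void pin_current_thread(DWORD core) {
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << core);
}

void player(DWORD core, int me, int other) {
    pin_current_thread(core);
    for (int i = 0; i < kIters; ++i) {
        while (token.load(std::memory_order_acquire) != me) { /* spin */ }
        token.store(other, std::memory_order_release);
    }
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    std::thread a(player, DWORD(0), 0, 1);  // core 0 (CCX0, assumed)
    std::thread b(player, DWORD(4), 1, 0);  // core 4 (CCX1, assumed); try 1 for same-CCX
    a.join(); b.join();
    // Each iteration of thread a is one full round trip of the token.
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("avg round trip: %.1f ns\n", double(ns) / kIters);
}
```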

I am currently preparing to update to the latest Windows 10 because my existing install is performing pretty horridly - I'm hoping a simple upgrade installation will do the trick.
 
  • Like
Reactions: CatMerc and Drazick

Trovaricon

Member
Feb 28, 2015
28
41
91
lolfail9001 is right; it is more like a Skulltrail platform but with an extremely buffed-up "FSB". According to AMD's slides, any access to memory goes through the "FSB" (data fabric) to the memory controller. Therefore the idea that "every CCX has a single-channel IMC = has lower latency when accessing its own memory space" is not correct.

NUMA has nothing to do with cache... What about 2 or 4 cores sharing an L2? Or even an L1i, in the case of the Bulldozer uarch? Do you consider those "NUMA-like" architectures?

If one writes an algorithm (in)correctly, a Piledriver A10 core can outperform a Sandy Bridge i3 in a single-threaded micro-benchmark (personal experience).

The point is that AMD has again come up with a CPU building block that is not a copy-paste of Intel's architecture. It is much less exotic than Bulldozer, but it is not a 1-to-1 Intel replacement when it comes to the design of (some) high-performance algorithms.
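As a generic illustration of that point (my own sketch, not the Piledriver-vs-i3 benchmark mentioned above): the exact same counting loop, with two different data layouts, can differ severalfold because of false sharing, and how much it costs differs per microarchitecture:

```cpp
// Sketch: two counters in one cache line (false sharing) vs. padded to
// separate lines. The penalty for the packed layout is uarch-dependent.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Packed { std::atomic<long> a{0}, b{0}; };        // share a cache line
struct Padded {
    alignas(64) std::atomic<long> a{0};                 // one line each
    alignas(64) std::atomic<long> b{0};
};

template <class T> double run() {
    T c;
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) c.a++; });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) c.b++; });
    t1.join(); t2.join();
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::printf("packed: %.3fs  padded: %.3fs\n", run<Packed>(), run<Padded>());
}
```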
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
lolfail9001 is right; it is more like a Skulltrail platform but with an extremely buffed-up "FSB". According to AMD's slides, any access to memory goes through the "FSB" (data fabric) to the memory controller. Therefore the idea that "every CCX has a single-channel IMC = has lower latency when accessing its own memory space" is not correct.

NUMA has nothing to do with cache... What about 2 or 4 cores sharing an L2? Or even an L1i, in the case of the Bulldozer uarch? Do you consider those "NUMA-like" architectures?

If one writes an algorithm (in)correctly, a Piledriver A10 core can outperform a Sandy Bridge i3 in a single-threaded micro-benchmark (personal experience).

The point is that AMD has again come up with a CPU building block that is not a copy-paste of Intel's architecture. It is much less exotic than Bulldozer, but it is not a 1-to-1 Intel replacement when it comes to the design of (some) high-performance algorithms.
It's NUMA-like, not NUMA. I don't think anyone claims otherwise.
If you think of the L3 as "memory", then it's very much NUMA-like.

I believe Bulldozer's design has identical access times to its L2 cache for both cores in a module, so it's not exactly the same situation. With Zen, the design has two blocks with non-uniform access to the L3 cache.
 
  • Like
Reactions: looncraz

itsmydamnation

Platinum Member
Feb 6, 2011
2,743
3,074
136
lolfail9001 is right; it is more like a Skulltrail platform but with an extremely buffed-up "FSB". According to AMD's slides, any access to memory goes through the "FSB" (data fabric) to the memory controller.
If you mean HT/QPI and the ring bus are all "FSBs", then sure.

The point is that AMD has again come up with a CPU building block that is not a copy-paste of Intel's architecture. It is much less exotic than Bulldozer, but it is not a 1-to-1 Intel replacement when it comes to the design of (some) high-performance algorithms.

It's all a matter of trade-offs. Zeppelin trades inter-CCX latency for an easily scalable core count and a single SoC that has a massive TAM.

Maybe you can explain better why Ryzen is lagging behind Intel's IPC in lightly threaded tasks? I will wait...

It's inter-CCX latency, obviously, but bandwidth or clock alone doesn't determine performance, and in the case of increased memory clocks improving performance massively, we are far more likely seeing the improved latency than the throughput itself. As I said before, though, we have no idea yet how the inter-CCX cache coherency protocol works. Hopefully we can find out when Naples is released.

Inter-CCX thread ping-pong isn't ideal, but threads should still be spread across CCXs to minimize hot spots, etc.
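For anyone who wants to experiment with the opposite policy, here is a minimal sketch (mine, not anything AMD or Microsoft recommends) of confining a process to a single CCX on Windows. The assumption that logical CPUs 0-7 map to CCX0 on an SMT-on 8-core part needs verifying on your own system:

```cpp
// Sketch: restrict the current process to one CCX via its affinity mask.
#include <windows.h>
#include <cstdio>

int main() {
    // Logical CPUs 0-7 assumed to be CCX0 on an SMT-on 8-core Ryzen.
    const DWORD_PTR ccx0_mask = 0xFF;
    if (!SetProcessAffinityMask(GetCurrentProcess(), ccx0_mask)) {
        std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    std::printf("process confined to CCX0\n");
    // ... run or spawn the latency-sensitive workload from here ...
}
```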
 
  • Like
Reactions: Dresdenboy

Trovaricon

Member
Feb 28, 2015
28
41
91
NUMA = non-uniform memory access.
It's NUMA-like, not NUMA. I don't think anyone claims otherwise.
If you think of the L3 as "memory", then it's very much NUMA-like.
Cache is not memory! Cache is a cache. (Edit: it is not memory in the context of this discussion.)
Where is it written that a UMA CPU's LLC should span the entire processor (all cores)?
Are the Athlon X2 (K8 & K10), the Athlon X4 (K10), and all Bulldozer-uarch-based APUs NUMA-like architectures? Jaguar in its x2/x4 configuration has a unified LLC; the x8 (consoles) doesn't.

Yes, most Intel architectures (and products based on them) are designed this way. And... yes, most AMD products aren't (that is nothing new).

I believe Bulldozer's design has identical access times to its L2 cache for both cores in a module, so it's not exactly the same situation. With Zen, the design has two blocks with non-uniform access to the L3 cache.
Well then, a single-threaded algorithm whose data fits in L2 shows different performance on the Bulldozer uarch depending on where the thread gets re-scheduled, especially considering the high-latency, low-bandwidth L3, doesn't it?

What about tests with a 10-15 MB dataset, where the algorithm intentionally jumps around memory (to always cause a cache miss)?
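A minimal version of such a test could look like this (my sketch; the 12 MB buffer size and iteration count are arbitrary). Building a single-cycle random permutation stops the pointer chain from collapsing into a small, cache-resident loop:

```cpp
// Sketch: pointer-chase through a randomly permuted ~12 MB buffer so
// nearly every access misses cache and the prefetcher can't help.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    const size_t n = 12u * 1024 * 1024 / sizeof(size_t);   // ~12 MB of indices
    std::vector<size_t> next(n);
    std::iota(next.begin(), next.end(), size_t{0});

    // Sattolo's algorithm: a single cycle over all elements.
    std::mt19937_64 rng{42};
    for (size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }

    size_t idx = 0;
    const long iters = 20'000'000;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) idx = next[idx];       // dependent loads
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("%.1f ns/access (idx=%zu)\n", double(ns) / iters, idx);
}
```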

If AMD manages to connect Zeppelin dies together using a data fabric with the same bandwidth/latency as the one between CCXs within the die (a pipe dream), then you end up with a UMA multi-die CPU.
 
  • Like
Reactions: icelight_

Kromaatikse

Member
Mar 4, 2017
83
169
56
If AMD manages to connect Zeppelin dies together using a data fabric with the same bandwidth/latency as the one between CCXs within the die (a pipe dream), then you end up with a UMA multi-die CPU.
Pipe dream? All they have to do is run the Infinity Fabric link between MCM dies using, probably, something like the interposer technology already used for HBM. It's eminently doable, especially for Snowy Owl (the 16-core workstation part), where they only need to connect two dies together that way.

However, I think I heard something about Naples (32-core) having an L4 cache in addition to the four Zeppelins. I reckon that, as well as eliminating the IMC itself from inter-CCX transfers (if that's really what happens, of which I'm skeptical) and reducing the average latency of the relatively slow ECC server-certified RAM, the L4 cache die would serve as an Infinity Fabric hub, simplifying the problem of connecting four dies together.
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
Pipe dream? All they have to do is run the Infinity Fabric link between MCM dies using, probably, something like the interposer technology already used for HBM. It's eminently doable, especially for Snowy Owl (the 16-core workstation part), where they only need to connect two dies together that way.

However, I think I heard something about Naples (32-core) having an L4 cache in addition to the four Zeppelins. I reckon that, as well as eliminating the IMC itself from inter-CCX transfers (if that's really what happens, of which I'm skeptical) and reducing the average latency of the relatively slow ECC server-certified RAM, the L4 cache die would serve as an Infinity Fabric hub, simplifying the problem of connecting four dies together.
I didn't hear about Naples having an L4 cache; that was probably some theoretical discussion.

As for connecting the two CCXs, another metal layer should do the trick in a revision, which would be cheaper than an interposer.
 
  • Like
Reactions: powerrush

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
NUMA = non-uniform memory access.

Cache is not memory! Cache is a cache. (Edit: it is not memory in the context of this discussion.)
Where is it written that a UMA CPU's LLC should span the entire processor (all cores)?
Are the Athlon X2 (K8 & K10), the Athlon X4 (K10), and all Bulldozer-uarch-based APUs NUMA-like architectures? Jaguar in its x2/x4 configuration has a unified LLC; the x8 (consoles) doesn't.

Yes, most Intel architectures (and products based on them) are designed this way. And... yes, most AMD products aren't (that is nothing new).

Well then, a single-threaded algorithm whose data fits in L2 shows different performance on the Bulldozer uarch depending on where the thread gets re-scheduled, especially considering the high-latency, low-bandwidth L3, doesn't it?

What about tests with a 10-15 MB dataset, where the algorithm intentionally jumps around memory (to always cause a cache miss)?

If AMD manages to connect Zeppelin dies together using a data fabric with the same bandwidth/latency as the one between CCXs within the die (a pipe dream), then you end up with a UMA multi-die CPU.
NUMA-like, not NUMA. I don't think anyone is contesting that.

The point is that it has behavior similar to NUMA.
 
  • Like
Reactions: powerrush

ryzenmaster

Member
Mar 19, 2017
40
89
61
Is this result with Core Parking on? If so, how does it change with it off? (Or vice versa.)

This would be with core parking disabled. I'll have to run some tests with core parking enabled, though I suspect it will still migrate threads, just not onto the cores that are parked when running lightly threaded workloads.

At any rate, I shouldn't have to rely on crippling compute capacity in order to get the most out of it. I have 8 cores and I want them all to be used, so I prefer to keep core parking off. I did this even on my Phenom II X6 1090T, ever since some results suggested that Battlefield 4 performance may suffer from core parking.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
lolfail9001 is right; it is more like a Skulltrail platform but with an extremely buffed-up "FSB". According to AMD's slides, any access to memory goes through the "FSB" (data fabric) to the memory controller. Therefore the idea that "every CCX has a single-channel IMC = has lower latency when accessing its own memory space" is not correct.

NUMA has nothing to do with cache... What about 2 or 4 cores sharing an L2? Or even an L1i, in the case of the Bulldozer uarch? Do you consider those "NUMA-like" architectures?

If one writes an algorithm (in)correctly, a Piledriver A10 core can outperform a Sandy Bridge i3 in a single-threaded micro-benchmark (personal experience).

The point is that AMD has again come up with a CPU building block that is not a copy-paste of Intel's architecture. It is much less exotic than Bulldozer, but it is not a 1-to-1 Intel replacement when it comes to the design of (some) high-performance algorithms.

This is partly determined by what is acting as the LLC for the CPU and how cores can communicate with each other.

If Ryzen had an L4 which handled L3 evictions and global data, then there'd be nothing NUMA-like about Ryzen. Instead, Ryzen has two core complexes which are fully independent of each other - if they communicate at all, it's through memory.

It is very much like a dual-socket system - and we treat those CPUs specially.
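For reference, this is roughly what "treating them specially" looks like from software (a sketch of mine using the standard Windows NUMA APIs; Ryzen currently reports a single node, so this shows how NUMA-aware code would adapt, not what it does today):

```cpp
// Sketch: query NUMA topology and allocate memory on a preferred node.
#include <windows.h>
#include <cstdio>

int main() {
    ULONG highest = 0;
    if (!GetNumaHighestNodeNumber(&highest)) return 1;
    std::printf("NUMA nodes reported: %lu\n", highest + 1);

    // Allocate 1 MB with a preference for node 0.
    void* p = VirtualAllocExNuma(GetCurrentProcess(), nullptr, 1 << 20,
                                 MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                 0 /* preferred node */);
    if (p) VirtualFree(p, 0, MEM_RELEASE);
}
```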
 

Trovaricon

Member
Feb 28, 2015
28
41
91
Pipe dream? All they have to do is run the Infinity Fabric link between MCM dies using, probably, something like the interposer technology already used for HBM. It's eminently doable, especially for Snowy Owl (the 16-core workstation part), where they only need to connect two dies together that way.

However, I think I heard something about Naples (32-core) having an L4 cache in addition to the four Zeppelins. I reckon that, as well as eliminating the IMC itself from inter-CCX transfers (if that's really what happens, of which I'm skeptical) and reducing the average latency of the relatively slow ECC server-certified RAM, the L4 cache die would serve as an Infinity Fabric hub, simplifying the problem of connecting four dies together.
Yes, AMD's presentation about Infinity Fabric describes the Scalable Data Fabric as multi-socket & multi-die ready.
The question is what the inter-die width and latency of the Infinity Fabric (its DF part) "channel" will be.
What we know is that a CCX is connected to the on-die DF through a 256-bit "channel", but the data fabric "backbone" bandwidth, and how it performs with more than 2 CCXs moving data around (the slide about the DF has "engines"/"hubs" drawn in a schema), remains to be seen.

Has anyone tried, or seen, a test that "hammers" the Data Fabric "backbone" (hub/engine/whatever)?
The diagram of Ryzen's clock domains on page 18 contains information about the "width" of the components connected to the DF: CCX 32 B/c, IMC 32 B/c, IO Hub 32 B/c. To really test the DF's "switching speed" you would need to access RAM at max speed and IO devices (some kind of PCIe bandwidth test) at the same time.
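A crude user-mode half of such a test might look like this (my sketch; the core-to-CCX mapping is an assumption, and the PCIe side would need a separate device-level test). Stream from both CCXs at once and compare the aggregate against a single-CCX run:

```cpp
// Sketch: saturate memory paths from both CCXs simultaneously.
#include <windows.h>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Stream through a private 64 MB buffer (far larger than L3) from a
// thread pinned to the given logical CPU; returns rough MB/s.
double stream_mb_s(DWORD core) {
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << core);
    std::vector<char> buf(64u << 20);
    volatile long long sink = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int pass = 0; pass < 8; ++pass)
        for (size_t i = 0; i < buf.size(); i += 64)   // one touch per cache line
            sink += buf[i];
    double s = std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();
    return 8.0 * (buf.size() >> 20) / s;
}

int main() {
    double r[2] = {0, 0};
    std::thread a([&] { r[0] = stream_mb_s(0); });    // CCX0 (assumed)
    std::thread b([&] { r[1] = stream_mb_s(4); });    // CCX1 (assumed)
    a.join(); b.join();
    std::printf("aggregate: %.0f MB/s (compare with a single-CCX run)\n",
                r[0] + r[1]);
}
```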
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
Here is a stupid question: I assume all requests for data are sent to the local CCX, the foreign CCX, and main system memory at the same time? Then the uncore loads in the first positive return that comes its way?

[It must do... it'd be daft to wait for a return from L1 before interrogating L2, wait for that return before interrogating the local L3, wait for that return before interrogating the foreign L3, etc.]
 

powerrush

Junior Member
Aug 18, 2016
20
4
41
That seems mostly a matter of DRAM latency and not CCX-to-CCX.
You don't specify the CAS latency of the DRAM used, but with 2133 CL14 you would get around 100 ns, while with 3200 CL14 you would get into the low 70s.
The issue with CCX-to-CCX seems to be the data path, not BW or latency. It doesn't go straight to the other CCX, and we don't quite know what it does or why.
Clocking the DF higher would reduce latency, but the latency will still be high if the data always goes through a bunch of hops.

The latency of memory is probably the main problem, but the CCX design adds another layer to this problem.

How is a process thread transported between CCXs? Does it jump directly through the data fabric to the other CCX, or is it a DMA operation?
 

imported_jjj

Senior member
Feb 14, 2009
660
430
136
The latency of memory is probably the main problem, but the CCX design adds another layer to this problem.

How is a process thread transported between CCXs? Does it jump directly through the data fabric to the other CCX, or is it a DMA operation?

I wasn't saying that there is a memory latency problem, though. Broadwell-E is in the 60s (ns) with decent memory (3200 CL16), and I think Summit Ridge can get there, or very close, soon.
The fabric is just a coherent interconnect; because of its marketing name, people are freaking out too much about it.
My (crazy) theory is that the CCX-to-CCX implementation is GMI-related: this die was made for MCM on a very limited budget. Of course, that's speculation and should not be presented as fact.
 
Last edited:

piesquared

Golden Member
Oct 16, 2006
1,651
473
136
I don't see a 'problem' at all with mine. What I do see, and expected to see, is an opportunistic minority throwing around negative adjectives. Excellent marketing all around, too; there's nothing they could have done better, really. Patches are already inbound to close the single area where there is a small gap in performance: some games. In other games, like Mafia III, Ryzen outperforms everything, which indicates the architecture is solid for gaming; developers only need to update their code, as it's a different architecture from the one they have been developing games on for 10 years. And it's a beast out of the box for productivity.
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
That was an interesting read, but unfortunately the algorithms it describes in critical areas are not consistent with the actual behaviour observed (on either Win7 or Win10). Therefore I cannot rely on any of the information in it.

Yeah, there are details that are clearly missing. The thread state table and state machine are far too simplistic to determine how the scheduler is going to behave. I think this is the reason the authors put a fair bit of focus on the use of utilities - so that programmers can tune their code based on observed behaviors. It definitely doesn't appear to be the case that the scheduler is actually deterministic - I agree with you there.

This is unsurprising given that there is no module dedicated to thread scheduling, and dispatch is distributed throughout the kernel!
The Windows scheduling code is implemented in the kernel. There's no single "scheduler" module or routine, however—the code is spread throughout the kernel in which scheduling-related events occur. The routines that perform these duties are collectively called the kernel's dispatcher. The following events might require thread dispatching:

• A thread becomes ready to execute—for example, a thread has been newly created or has just been released from the wait state.
• A thread leaves the running state because its time quantum ends, it terminates, it yields execution, or it enters a wait state.
• A thread's priority changes, either because of a system service call or because Windows itself changes the priority value.
• A thread's processor affinity changes so that it will no longer run on the processor on which it was running.

Crazy stuff o_O
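You can actually watch the dispatcher do its thing from user mode with something like this (a quick sketch of mine; on Ryzen, the processor number jumping between the assumed 0-7 and 8-15 ranges would indicate a cross-CCX migration):

```cpp
// Sketch: sample which logical CPU the current thread runs on while busy.
#include <windows.h>
#include <cstdio>

int main() {
    DWORD last = GetCurrentProcessorNumber();
    std::printf("started on CPU %lu\n", last);
    for (long long i = 0; i < 2'000'000'000LL; ++i) {   // busy work + sampling
        DWORD now = GetCurrentProcessorNumber();
        if (now != last) {
            std::printf("migrated: CPU %lu -> CPU %lu at i=%lld\n",
                        last, now, i);
            last = now;
        }
    }
}
```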

Edit: Some of the quote didn't get pasted.
 
Last edited:

CataclysmZA

Junior Member
Mar 15, 2017
6
7
81
Here is a stupid question: I assume all requests for data are sent to the local CCX, the foreign CCX, and main system memory at the same time? Then the uncore loads in the first positive return that comes its way?

Mostly correct, at least from what we know. When there's a request for data that isn't in the local CCX's L3, a probe is sent out to the second CCX's L3 and to RAM at the same time. There's no information about the local CCX request being done at the same time, although it would make sense to do that given all the available bandwidth.

This is unsurprising given that there is no module dedicated to thread scheduling, and dispatch is distributed throughout the kernel!

Crazy stuff o_O

Contrast that to the Linux CFS: https://en.wikipedia.org/wiki/Completely_Fair_Scheduler

Microsoft does some really crazy stuff with their kernel.
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
You have half the CPU which, AFAICT, can only talk to the other half by going through the IMC.
Did you test it? AMD's slides sure suggest otherwise. In fact, if you were right, then the 4+0 config would not be uniform in memory access at ALL.
It's NUMA-like, not NUMA. I don't think anyone claims otherwise.
If you think of the L3 as "memory", then it's very much NUMA-like.
But we have no real clue whether the L3 is ever touched by a different CCX; hardware.fr's test heavily implies it is not, IMHO. So it is literally symmetric with regard to memory and other I/O.
If you mean HT/QPI and the ring bus are all "FSBs", then sure.
The ring bus on Intel parts would be a fair comparison, but it does not justify painting Ryzen as Skulltrail-the-SoC.
Instead, Ryzen has two core complexes which are fully independent of each other - if they communicate at all, it's through memory.

It is very much like a dual-socket system - and we treat those CPUs specially.
Oh, please, you are well aware that a dual-socket system is not always NUMA-like. Sure, modern systems are all NUMA because memory controllers have been integrated on all modern CPUs since Nehalem. But the old ones...

I don't see a 'problem' at all with mine. What I do see, and expected to see, is an opportunistic minority throwing around negative adjectives. Excellent marketing all around, too; there's nothing they could have done better, really. Patches are already inbound to close the single area where there is a small gap in performance: some games. In other games, like Mafia III, Ryzen outperforms everything, which indicates the architecture is solid for gaming; developers only need to update their code, as it's a different architecture from the one they have been developing games on for 10 years. And it's a beast out of the box for productivity.
Got anything technical to say?
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Oh, please, you are well aware that a dual-socket system is not always NUMA-like. Sure, modern systems are all NUMA because memory controllers have been integrated on all modern CPUs since Nehalem. But the old ones...

Every multi-socket system with a unified LLC per socket is NUMA-like, with the node being the socket.

A multi-die processor with no unified LLC but with shared cache between more than one core is NUMA-like (Core 2 Quad).

AMD construction-core CPUs with an L3 do not apply, but those without one absolutely do - though inter-core bandwidth and latency were plenty for the performance those cores offered, negating the penalty... and Windows was modified to accommodate it.

Sharing memory between two sockets incurs a non-uniform penalty compared to accessing memory on just one socket when the CPU in each socket has an LLC - even if the memory controller is part of an external north bridge.

It is only NUMA-like, as this impacts only in-use/cached memory and not stale memory.

The difference in L3 bandwidth and latency within the CCX and to main memory is on the order of 500%. Depending on what memory you are accessing with what CCX at a given time, you may pay a 500% penalty for treating Ryzen as a monolithic processor.

That's the same concern with NUMA, except that it's a more static situation.
 
  • Like
Reactions: CatMerc and Drazick

Trovaricon

Member
Feb 28, 2015
28
41
91
I see that everyone (including me) has started to apply their own software-side point of view - algorithm optimization problems - to the description of the hardware architecture.
Big-dataset processing vs. producer-consumer with a small dataset vs. "I don't know what else - it's after midnight" problems.

It is only NUMA-like, as this impacts only in-use/cached memory and not stale memory.

The difference in L3 bandwidth and latency within the CCX and to main memory is on the order of 500%. Depending on what memory you are accessing with what CCX at a given time, you may pay a 500% penalty for treating Ryzen as a monolithic processor.

That's the same concern with NUMA, except that it's a more static situation.
Now imagine if Intel's L3 weren't inclusive - this is actually the food for thought we should probably focus on. It is the source of the inter-core latency & bandwidth Intel reigns supreme in. A single high-priority "rogue" thread could destroy most of the benefits of the LLC.
Victim-partitioned L3 vs. inclusive non-partitioned L3. Do we know anything about the "coherent fabric" role of the Infinity Fabric in on-die communication?

Can't we just please everyone and say it's closer to LEGO than anything else? :D
I have a feeling you knew where this "terminus technicus crossfire" would lead. If we can't call it by its name - a non-uniform/segmented/partitioned last-level cache - then LEGO it is!
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I think we already have a good understanding of what's happening in and between the two CCXs in a Ryzen CPU, mostly from analysis of specific behavioural patterns in various test cases (incl. the one PCPer did).

Real-world software shows a mix of different effects. And we've also seen that Windows performance profiles, core parking, and turning SMT off can show performance gains similar to moving from x+x to 2x+0 CCX configurations. So how might these effects be related?
 
  • Like
Reactions: Drazick

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Now imagine if Intel's L3 weren't inclusive - this is actually the food for thought we should probably focus on. It is the source of the inter-core latency & bandwidth Intel reigns supreme in. A single high-priority "rogue" thread could destroy most of the benefits of the LLC.

Ryzen uses a mostly exclusive L3 per CCX. It keeps a copy of the L2 tags, and the burden of getting data from a neighboring core's L2 is actually quite similar to the cost of getting it from the L3, AFAICT from testing. I can only imagine AMD specifically designed it to be that way.

I think a future improvement for the L3 would be to prefetch directly into it - that should help with in-page random access latency and would help some of the algorithms which are currently behaving badly on Ryzen. There are a lot of caveats to that, though.

From what I can tell on Ryzen, any one core can only evict to 4MB of L3 - which is why single-threaded cache latency tests show a sudden latency hit when exceeding 4MB. Each core can read from any part of the L3, though, so there will be multiple cache tag searches at once.

Some of my testing, which uses a mutex-free user-mode spinlock, seems to suggest inter-CCX latency is only 32~60 cycles for commands or simple data (one way). As soon as the data is more than 128 bits, latency seems to skyrocket - but I have to modify my test more; it's pretty cruddy right now. (A rough reconstruction of the idea is sketched below.)

I think the command bus is much lower latency than the data bus and can even carry a small data payload. This would help explain why Ryzen has amazing multi-threaded scaling even across CCXs while light, data-heavy workloads suffer immensely.
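For the curious, the idea behind the test looks roughly like this (not my actual code, just a reconstruction of the concept; the payload size and the core-to-CCX mapping are the knobs to play with, and the numbering is assumed):

```cpp
// Sketch: a writer pinned to one CCX publishes a payload guarded by a
// flag; a reader pinned to the other CCX spins for it. Vary kPayload
// across the 128-bit boundary to probe the effect described above.
#include <windows.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <thread>

constexpr size_t kPayload = 16;        // bytes; try 8, 16, 32, 64, 128...
constexpr int    kIters   = 200'000;

alignas(64) static char shared_buf[256];
static std::atomic<int> flag{0};       // 0 = writer's turn, 1 = data ready

void pin(DWORD core) {
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << core);
}

int main() {
    std::thread writer([] {
        pin(0);                        // CCX0 (assumed numbering, SMT off)
        char src[256] = {1};
        for (int i = 0; i < kIters; ++i) {
            while (flag.load(std::memory_order_acquire) != 0) { /* spin */ }
            std::memcpy(shared_buf, src, kPayload);
            flag.store(1, std::memory_order_release);
        }
    });
    std::thread reader([] {
        pin(4);                        // CCX1 (assumed numbering)
        char dst[256];
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < kIters; ++i) {
            while (flag.load(std::memory_order_acquire) != 1) { /* spin */ }
            std::memcpy(dst, shared_buf, kPayload);
            flag.store(0, std::memory_order_release);
        }
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%zu-byte handoff: ~%.0f ns per round trip\n",
                    kPayload, double(ns) / kIters);
    });
    writer.join();
    reader.join();
}
```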
 
Last edited: