
Ryzen: Strictly technical

Status
Not open for further replies.

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
You know what, if it was 32B for both ways (so 16B one-way), then AMD's number of 22GB/s makes perfect sense for DDR4-2666.
I know that this number popped up in an article out of nowhere. So how was it arrived at in the first place?

When one can see read bandwidths close to 42GB/s with that memory by accessing UMCs sitting in different corners of the chip, why shouldn't it be possible to get as much from the neighbouring CCX? Has it already been proven that no memory access is involved in those CCX ping-pong games? So far it isn't the bandwidth that has been measured, but the latency, which magically matches the memory latency.
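The 22GB/s hypothesis above is easy to check arithmetically. A minimal sketch, assuming a 16-byte one-way link clocked at DDR4-2666's memory clock (1333 MHz, half the transfer rate):

```python
# Checking the hypothesis above: a 16-byte-per-cycle one-way link running at
# DDR4-2666's memory clock lands almost exactly on AMD's quoted ~22 GB/s figure.
# The link width and clock are the post's assumptions, not confirmed specs.
def one_way_bw_gbs(bytes_per_cycle, memclk_mhz):
    """One-way link bandwidth in GB/s (decimal gigabytes)."""
    return bytes_per_cycle * memclk_mhz * 1e6 / 1e9

print(one_way_bw_gbs(16, 1333))   # ~21.3 GB/s, i.e. roughly AMD's 22 GB/s number
```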
 

looncraz

Senior member
Sep 12, 2011
716
1,638
136
I know that this number popped up in an article out of nowhere. So how was it arrived at in the first place?

When one can see read bandwidths close to 42GB/s with that memory by accessing UMCs sitting in different corners of the chip, why shouldn't it be possible to get as much from the neighbouring CCX? Has it already been proven that no memory access is involved in those CCX ping-pong games? So far it isn't the bandwidth that has been measured, but the latency, which magically matches the memory latency.
HT 3.1 at 1.2GHz, 11GB/s each way = 22GB/s aggregate bandwidth (I hate this type of math, TBH).

That's the only logic I could think up. It has been debunked, though.

32B/cycle at 1.2GHz works out to about 38GB/s. It appears that bandwidth is achievable in both directions concurrently (if not, that'd certainly help explain some latency issues).

This figure actually works pretty well for what we see.

DDR4-3200: ~48GiB/s
DDR4-2933: ~44GiB/s
DDR4-2800: ~42GiB/s
DDR4-2667: ~40GiB/s
DDR4-2400: ~36GiB/s
DDR4-2133: ~32GiB/s

The interesting thing, though, is that this would mean we are seeing somewhere very near to 100% perfect throughput on the data fabric.
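The figures in the list above follow directly from the 32B/cycle assumption if the data fabric runs at the memory clock (half the DDR4 transfer rate). A quick sketch of that arithmetic:

```python
# Sketch of the arithmetic behind the list above: if the data fabric moves
# 32 bytes per cycle and runs at memclk (half the DDR4 transfer rate), its
# one-way bandwidth tracks memory speed directly. 32B/cycle is the thread's
# working assumption, not a confirmed spec.
DF_BYTES_PER_CYCLE = 32

def df_bandwidth_gib(ddr4_rate_mts):
    """One-way data-fabric bandwidth in GiB/s for a DDR4-<rate> kit."""
    memclk_hz = ddr4_rate_mts / 2 * 1e6   # DDR: memclk is half the transfer rate
    return memclk_hz * DF_BYTES_PER_CYCLE / 2**30

for rate in (3200, 2933, 2800, 2667, 2400, 2133):
    print(f"DDR4-{rate}: ~{df_bandwidth_gib(rate):.0f} GiB/s")
# Reproduces the list above: ~48, ~44, ~42, ~40, ~36, ~32 GiB/s
```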
 

looncraz

Senior member
Sep 12, 2011
716
1,638
136
Oh, of course, that does not excuse using such band-aids however.
I wouldn't call it abuse - Windows has long needed to change how it handles load balancing. I believe it moves threads around somewhat randomly within a node because Microsoft didn't want to invest in a proper scheduler... and why would they? What they have does well enough for most cases.

The core parking logic helps limit where threads go and another routine keeps sorting a list of lowest utilized cores, with parked cores not present in the list. The scheduler can just assign threads to the cores in that list round-robin style, which is the fastest way to do it. This is why Windows application performance can often be higher than interactive preemptive schedulers on other OSes - the OS isn't stealing 300us from your threads every 3~6ms - it is stealing 200us every 10ms.

Caveat: I haven't actually kept up with the Windows kernel developments since Windows 7.
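The dispatch scheme described above (a utilization-sorted list with parked cores filtered out, consumed round-robin) could be sketched like this. This is a toy model of my reading of the post, not actual Windows kernel behavior:

```python
# Toy model of the described scheme: parked cores are excluded from a
# utilization-sorted ready list, and threads are handed out round-robin.
# Core IDs, utilization values, and thread names are illustrative only.
from itertools import cycle

def build_ready_list(utilization, parked):
    """Cores sorted by load (lowest first), with parked cores filtered out."""
    eligible = [c for c in utilization if c not in parked]
    return sorted(eligible, key=lambda c: utilization[c])

util = {0: 0.9, 1: 0.1, 2: 0.4, 3: 0.2}
parked = {3}                       # core 3 is parked, so it never receives work
ready = build_ready_list(util, parked)
dispatch = cycle(ready)            # round-robin assignment is just a cycle
threads = ["A", "B", "C", "D"]
placement = {t: next(dispatch) for t in threads}
print(ready)                       # lowest-utilized first: [1, 2, 0]
print(placement)                   # {'A': 1, 'B': 2, 'C': 0, 'D': 1}
```

Round-robin over a pre-sorted list is cheap precisely because the sorting happens in a separate routine, which matches the post's point about low scheduling overhead.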
 

tamz_msc

Platinum Member
Jan 5, 2017
2,679
2,355
106
Intel doesn't have a monolithic block with data hyperloops between each core and the IMC either. There is a ring bus.

I'd wait for further data points before putting all the blame on the data fabric latency. On Reddit I saw similar comments, suggesting the CCX link would be comparable to Intel's old separate quad-core dies connected via FSB. It's actually not that bad. And the ring-bus-based designs also don't have direct connections between each core and the memory controller. In all those cases, the access requests and returning data (or store addresses and data) have to pass one or more hops/ring-bus stops to get to the UMC/IMC, and again on the way back to the core.

That's what I assume to be happening:
CCX mem access:
Core -> check L1 tags (LSU) -> check L2 tags (L2 IF) -> check L3 tags (L3 IF/CCX XBar) -> [clock domain crossing] send request to DF (router or XBar?) -> (1+ hops?) -> UMC -> access DRAMs -> data received at UMC -> transmit 64B line to DF (2 cycles) -> (1+ hops) -> receive data at CCX [clock domain crossing] -> move data to requesting core.
Intel ring bus mem access:
Core -> check L1 tags (LSU) -> check L2 tags (L2 IF) -> check local L3 tags (L3 IF) -> [clock domain crossing to core like ring bus clock] send request via ring bus -> (1 to n hops) -> IMC -> access DRAMs -> data received at IMC -> transmit 64B line to DF -> (1 to n hops) -> receive data at core

So the UMC accesses via DF should add at least one hop (no direct connection), or 0.5 to 0.9 ns per direction (address, then data) depending on DF clock.
On Intel's 8C ring bus SoCs the avg. distance should be 2.5 hops (1 to 4 hops per 4 core half), but at clocks as high as core clocks -> 0.6 to 0.8 ns per direction.
It would seem then that your hypothesis indicates that an additional penalty might be incurred on Ryzen during the point between transmission of the data line to DF and receiving data at the core.
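The hop arithmetic quoted above works out if one hop costs roughly one cycle of whatever clock the interconnect runs at. A sketch, using the post's own hop counts and clock ranges (assumptions, not measurements):

```python
# Rough hop-latency arithmetic from the comparison above: ~1 cycle per hop,
# at the interconnect's clock. Hop counts and clock ranges are the post's
# estimates, not measured values.
def hop_latency_ns(hops, clock_ghz):
    return hops / clock_ghz        # one cycle per hop, expressed in ns

# Zen data fabric: ~1 extra hop at memclk (DDR4-2133..3200 -> 1.066..1.6 GHz)
print(hop_latency_ns(1, 1.066))    # ~0.94 ns
print(hop_latency_ns(1, 1.6))      # ~0.63 ns
# Intel 8C ring: avg ~2.5 hops, but clocked near core speed (~3.2..4.0 GHz)
print(hop_latency_ns(2.5, 3.2))    # ~0.78 ns
print(hop_latency_ns(2.5, 4.0))    # ~0.63 ns
```

Despite needing more hops on average, the ring's much higher clock puts the two designs in the same per-direction ballpark, which is the post's point.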
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
I wrote a long comment on that post, explaining pretty much all I know about this phenomenon - which is a lot. Also offering most of the safer tweaks that can be applied easily to improve matters.
Wait, wait, wait, I have to ask this over and over, because I manage to miss the answer every time: where is the screenshot of evidence for the scheduler not being SMT-aware?
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
Run the experiment detailed in the middle of that comment, and you'll see it for yourself.
That would require spending something like a third of my remaining SSD space, and more time than I presently have, to install Windows. If I could check it myself in a moment, I would.
 

DrMrLordX

Lifer
Apr 27, 2000
16,501
5,479
136
So has anyone tried running the Unigine benchmarks on a Ryzen system?

I got all kinds of weird hitching in Heaven running @ stock using Performance that brought minfps down below 10. Running DDR4-3200 14-14-14-32 + 4 GHz OC got minfps up to 39.1 and smoothed out a lot of the hitching. Very odd. I haven't tried Balanced profile yet.
 

Chl Pixo

Junior Member
Mar 9, 2017
11
2
41
The interesting thing, though, is that this would mean we are seeing somewhere very near to 100% perfect throughput on the data fabric.
It has a separate C&C fabric, so it can be done.
What would normally be overhead would simply go through the C&C fabric.

Still, we are missing information on how the cores are connected to each other and to the DF, or I have missed it.
According to PCPerspective, it takes 42ns to reach a core on the same CCX.
If it's the same 42ns to reach the DF, that would mean 84ns to reach a core on the second CCX, plus any latency the DF itself would introduce.

EDIT: corrected the source
 
Last edited:

looncraz

Senior member
Sep 12, 2011
716
1,638
136
So has anyone tried running the Unigine benchmarks on a Ryzen system?

I got all kinds of weird hitching in Heaven running @ stock using Performance that brought minfps down below 10. Running DDR4-3200 14-14-14-32 + 4 GHz OC got minfps up to 39.1 and smoothed out a lot of the hitching. Very odd. I haven't tried Balanced profile yet.
I did extensive testing with Heaven and Valley - probably five or six hours. Never an issue.
 
  • Like
Reactions: Drazick

looncraz

Senior member
Sep 12, 2011
716
1,638
136
It has a separate C&C fabric, so it can be done.
What would normally be overhead would simply go through the C&C fabric.

Still, we are missing information on how the cores are connected to each other and to the DF, or I have missed it.
According to Hardware.fr, it takes 42ns to reach a core on the same CCX.
If it's the same 42ns to reach the DF, that would mean 82ns to reach a core on the second CCX, plus any latency the DF itself would introduce.
Well, that's the thing:



What everyone is missing is how that compares to Intel:



AMD's intra-CCX latency is HALF Intel's latency between cores (the flatness of Intel's solution is a result of having a bidirectional ring bus - you will average the same number of hops over many runs).

AMD has an additional 100ns delay when communicating between cores in other CCXes. What we need to know is if this is a direct communication or if this is done through main memory. The simplest answer is that it is done through main memory and the latency is therefore CCX latency + DF latency + IMC latency + DF latency + CCX latency.

44 + ? + 19~30 + ? + 44 = ~140

This would suggest that the data fabric latency, itself, is actually fairly low - from 16.5ns to 11ns on average. All testing seems to suggest there's no real difference between accessing one CCX over another - but there IS one. We can see it in the first image above, where the accesses to the left CCX are just a couple nanoseconds longer than for those on the right.

That feature almost certainly denotes that the data is not being directly communicated, but is hitting system memory - with the other thread requesting that data (listening to a port, accessing an address, etc...). The two CCXes would not need to know about the outside world - which has major advantages when it comes to design.

This then begs the question - what is Intel doing that is hiding the ring bus latency from benchmarks and applications? IMC latency to a core, on average, appears to be about 80ns - so they shouldn't be able to show 20ns - it should be 100ns. There may be a simple answer - I'm not fully versed on what Intel is doing these days.
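The "16.5ns to 11ns" range above can be back-solved from the decomposition: with CCX legs of 44ns each and an IMC cost between 19 and 30ns, the two data-fabric crossings must absorb whatever remains of the ~140ns total. A sketch, using the post's own figures:

```python
# Back-solving the latency decomposition above:
#   44 + ? + (19..30) + ? + 44 = ~140
# The remainder after the known legs is split across the two DF crossings.
# All inputs are the post's rough figures, not measurements of mine.
TOTAL, CCX = 140, 44

def df_crossing_ns(imc_ns):
    remainder = TOTAL - 2 * CCX - imc_ns
    return remainder / 2           # split across the two data-fabric crossings

print(df_crossing_ns(19))   # 16.5 ns per crossing (low IMC estimate)
print(df_crossing_ns(30))   # 11.0 ns per crossing (high IMC estimate)
```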
 

DrMrLordX

Lifer
Apr 27, 2000
16,501
5,479
136
I did extensive testing with Heaven and Valley - probably five or six hours. Never an issue.
Huh. Well the hitching seems mostly gone with DDR4 speeds high, but with "stock" (DDR4-2133) it was pretty bad. Interesting.

Wonder if it's a scheduler problem.
 

beginner99

Diamond Member
Jun 2, 2009
4,626
1,012
136
I wrote a long comment on that post, explaining pretty much all I know about this phenomenon - which is a lot. Also offering most of the safer tweaks that can be applied easily to improve matters.
So the Windows scheduler is indeed broken, contrary to AMD's PR release, and MS won't fix it. Thanks for the confirmation.
 
  • Like
Reactions: Kromaatikse

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
It would seem then that your hypothesis indicates that an additional penalty might be incurred on Ryzen during the point between transmission of the data line to DF and receiving data at the core.
That's true. And each hop would add the same number of nanoseconds per direction. With PCB/package-wired HyperTransport, the next die (MCM) was one hop away.
 

CataclysmZA

Junior Member
Mar 15, 2017
6
7
41
So indeed the Windows scheduler is broken in contrary to AMDs PR release and MS won't fix it. Thanks for the confirmation.
Actually, the scheduler is working as intended; it's the core parking settings that are not. This is why AMD will be issuing an update to add tweaks to the Balanced power profile to improve performance, and they're going to alter the core parking settings.
 

mat9v

Member
Mar 17, 2017
25
0
11
AMD's intra-CCX latency is HALF Intel's latency between cores (the flatness of Intel's solution is a result of having a bidirectional ring bus - you will average the same number of hops over many runs).

AMD has an additional 100ns delay when communicating between cores in other CCXes. What we need to know is if this is a direct communication or if this is done through main memory. The simplest answer is that it is done through main memory and the latency is therefore CCX latency + DF latency + IMC latency + DF latency + CCX latency.

44 + ? + 19~30 + ? + 44 = ~140

This would suggest that the data fabric latency, itself, is actually fairly low - from 16.5ns to 11ns on average. All testing seems to suggest there's no real difference between accessing one CCX over another - but there IS one. We can see it in the first image above, where the accesses to the left CCX are just a couple nanoseconds longer than for those on the right.

That feature almost certainly denotes that the data is not being directly communicated, but is hitting system memory - with the other thread requesting that data (listening to a port, accessing an address, etc...). The two CCXes would not need to know about the outside world - which has major advantages when it comes to design.

This then begs the question - what is Intel doing that is hiding the ring bus latency from benchmarks and applications? IMC latency to a core, on average, appears to be about 80ns - so they shouldn't be able to show 20ns - it should be 100ns. There may be a simple answer - I'm not fully versed on what Intel is doing these days.
But weren't those pings between cores? Not L3 caches? So you would have to subtract 20ns from the 140ns score.
So to access L3 cache on second CCX you would need about 120ns.
From AMD slides L3 was connected to DF and memory was connected to DF (on the other side) - to go to second part of L3 cache through memory is senseless as that would imply having 2 DF and memory being connected to both in a mirror-like way:
core==L3 ==DF==memory==DF==L3==core
while in fact it is like this:
core==L3==DF==L3==core
.....................||
................memory
(please ignore "." they are there because otherwise picture makes no sense)
Effective memory latency was about 98ms (I will round it to 100 for convenience).
So:
- core to L3 (20)
- core to memory (100) - from hardware.fr test
- core to another CCX L3 (120) - from core ping result
We don't know DF latency and we don't know memory to DF latency BUT:
say "a" is a latency from L3 to DF
say "b" is a latency from DF to memory
say "c" is a latency from core to L3
then
c+a+b=100 (latency access to memory)
c+a+a=120 (latency access to L3 from another CCX) - it is the same as c+a+a+c=140 if you ping a core in another CCX
from that a=50 and b=30

a is a combined latency of link between L3 cache and half trip through DF itself
b is a combined latency of link between memory and another half trip through DF itself
It is not really possible to tell what the latency of the DF itself is, but we don't need that anyway to tell that access to the second part of the L3 is NOT through memory; if it were, it would be close to 200ns, and it makes no sense from a designer's point of view - such a configuration would bring no advantages to performance OR expandability of the design.

Btw, why would 44ns be the CCX latency? If you mean L3 latency, it is about 20ns; the 44ns from the core ping test is a round trip - core=L3=core.
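The little system of equations above can be solved directly; c, a, b are the post's own symbols (core→L3, L3→DF plus half the DF trip, DF→memory plus the other half):

```python
# Solving the post's system:
#   c = 20             (core to L3)
#   c + a + b = 100    (memory access latency)
#   c + a + a = 120    (cross-CCX L3 access latency)
# All figures come from the post's cited measurements, not my own.
c = 20
a = (120 - c) / 2      # from c + 2a = 120  ->  a = 50
b = 100 - c - a        # from c + a + b = 100  ->  b = 30
print(a, b)            # 50.0 30.0

# Hypothetical cross-CCX path routed through DRAM: out and back through b.
cross_ccx_via_memory = c + a + b + b + a + c
print(cross_ccx_via_memory)   # 200.0 ns - far above the measured 120-140 ns
```

The ~200ns figure is the post's argument in a line: a through-memory route would cost far more than what is actually measured, so cross-CCX L3 access cannot be going through DRAM.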
 
Last edited:

looncraz

Senior member
Sep 12, 2011
716
1,638
136
But weren't those pings between cores? Not L3 caches? So you would have to subtract 20ns from the 140ns score.
So to access L3 cache on second CCX you would need about 120ns.
From AMD slides L3 was connected to DF and memory was connected to DF (on the other side) - to go to second part of L3 cache through memory is senseless as that would imply having 2 DF and memory being connected to both in a mirror-like way:
core==L3 ==DF==memory==DF==L3==core
while in fact it is like this:
core==L3==DF==L3==core
.....................||
................memory
(please ignore "." they are there because otherwise picture makes no sense)
Effective memory latency was about 98ms (I will round it to 100 for convenience).
So:
- core to L3 (20)
- core to memory (100) - from hardware.fr test
- core to another CCX L3 (120) - from core ping result
We don't know DF latency and we don't know memory to DF latency BUT:
say "a" is a latency from L3 to DF
say "b" is a latency from DF to memory
say "c" is a latency from core to L3
then
c+a+b=100 (latency access to memory)
c+a+a=120 (latency access to L3 from another CCX) - it is the same as c+a+a+c=140 if you ping a core in another CCX
from that a=50 and b=30

a is a combined latency of link between L3 cache and half trip through DF itself
b is a combined latency of link between memory and another half trip through DF itself
It is not really possible to tell what the latency of the DF itself is, but we don't need that anyway to tell that access to the second part of the L3 is NOT through memory; if it were, it would be close to 200ns, and it makes no sense from a designer's point of view - such a configuration would bring no advantages to performance OR expandability of the design.

Btw, why would 44ns be the CCX latency? If you mean L3 latency, it is about 20ns; the 44ns from the core ping test is a round trip - core=L3=core.
Communications wouldn't go through the L3 cache, so its latency is not part of the active communication.

Only the tag search latency would apply. You don't pay the full access and fetch penalty on a cache miss. On Ryzen, the 1/4 tags system means that quite often you could have a result within a couple cycles after submitting an address search in the event of a failure (some 65% of the time, if my mental statistical analysis is up to snuff - no promises :p).

Another interesting bit about Ryzen's cache, I believe, is that cores will only evict to their own local L3 partition - so a single core can only store its personal evicted lines in 2/4MB of L3 - which is why cache aware algorithms need to be designed as if Ryzen only has 4MB of L3. The L3 can do multiple tag searches at once.

A Ryzen quad core module is actually two dual core modules, when you look at it closely. Even the L3 is mostly just a duplication, with some functional block changes (PLL, and so on). The cores communicate via a mesh bus from L2 to L2 in the upper metal layers. So you have L1D Latency + L2 latency + bus latency + L2 latency + L1 Latency.

1.3 + 8.5 + 10 + 8.5 + 1.3 ~= 46 ns

To talk to another core in another CCX, contrary to my prior thinking, then you have the following latencies:

L1D + L2 + BUS + DFI + DF + IMC + DF + DFI + BUS + L2 + L1D

1.3 + 8.5 + 10 + ? + ? + ? + ? + ? + 10 + 8.5 + L1D ~= 140

DFI, of course, is the data fabric interface unit I assume must exist - can't just send signals however you please, you need to find a window and mix with any other traffic that might be on that bus.
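Budgeting that cross-CCX path: subtracting the legs we have estimates for leaves roughly 100ns for the DFI + DF + IMC terms (the "?" entries) combined. A sketch using the post's rough per-side figures (its assumptions, not measurements):

```python
# Budgeting the cross-CCX path sketched above:
#   L1D + L2 + BUS + ? + ? + ? + ? + ? + BUS + L2 + L1D ~= 140 ns
# Subtract the per-side costs we have estimates for; the remainder is what
# the DFI/DF/IMC legs must account for between them.
l1d, l2, bus = 1.3, 8.5, 10.0          # ns per side, the post's estimates
total = 140.0                          # observed cross-CCX ping latency
known = 2 * (l1d + l2 + bus)           # both ends of the path
unknown_budget = total - known
print(round(known, 1))                 # 39.6 ns accounted for
print(round(unknown_budget, 1))        # ~100.4 ns left for the DFI/DF/IMC legs
```

That ~100ns residual matches the "additional 100ns delay" quoted earlier in the thread for cross-CCX communication.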

When I get a board again, I will be experimenting with heavy PCI-e traffic and test memory latencies and throughput. Basically, I will clone the NVMe boot drive to an SSD RAID 0 array while running AIDA64 :p

I will not be surprised by any result I find - be it positive or horrendously negative. But, I do believe Ryzen will do just fine - I wholly believe there to be dedicated links in the fabric rather than just a dumb bus.
 
