
Ryzen: Strictly technical

Page 39
Status
Not open for further replies.

innociv

Member
Jun 7, 2011
54
20
76
edit: nvm, figured out what I had posted here.

This was under the assumption that the "ping test" measures how long it takes for one core to write something into L3 and RAM, and for another core to then read it.

I was thinking that the 142ns delay was really because it was fetching from RAM, but no, that's unlikely even if memory latency were as low as 70ns, when it usually seems to be in the 75-105ns range for Ryzen.
 
Last edited:

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
edit: nvm, figured out what I had posted here.

This was under the assumption that the "ping test" measures how long it takes for one core to write something into L3 and RAM, and for another core to then read it.

I was thinking that the 142ns delay was really because it was fetching from RAM, but no, that's unlikely even if memory latency were as low as 70ns, when it usually seems to be in the 75-105ns range for Ryzen.
Just a thought: the UMCs likely have write buffers, and so do the CCXs internally. We might see a faster UMC response due to the data still sitting in the buffer (a kind of LTSF). The CCX buffers might just add a few more (CCX-clock) cycles.
 
  • Like
Reactions: looncraz
May 11, 2008
18,310
829
126
Well, after reading the updates on the 1400X and the 1500X, and keeping the current status of the inter-CCX communication in mind, it seems the 1400X is the processor I am going to buy, together with a B350 board. I am sticking with my choice from a few months ago.

Why?
Well, the 1500X has 4 cores, 8 threads, and 16MB of L3 cache.
This suggests a 2+2 setup where 2 cores are disabled on each CCX.
The 1400X has 4 cores, 8 threads, and 8MB of L3 cache.
This suggests a 4+0 setup where one CCX is entirely disabled.
The L3 is primarily a victim cache for L2.
If I understand it correctly, data evicted from L2 ends up in L3. If that data is needed again, it is still present in L3.
New data from main memory is stored directly in L2.
Shadow copies of the L2 cache tags, and other bookkeeping important for L2, are held in the L3 as well.
My guess is that a 2+2 setup will not gain much from the 16MB, and that the inter-CCX communication will hold it back even more in comparison to a 4+4 setup.
Of course, there are no reviews yet, but since a processor design is all about balancing the right sizes of local on-die storage, I doubt 2 cores per CCX would have much to gain from 2x the L3.
Especially with generically coded programs.

Am I making any sense so far?
Or is there a flaw in my reasoning?
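The L2/L3 fill policy described above (new fills go straight to L2; the L3 only receives L2 victims) can be sketched in a few lines of Python. This is purely illustrative: fully-associative LRU with made-up sizes, nothing like the real set-associative Zen caches.

```python
from collections import OrderedDict

class VictimCacheSketch:
    """Toy model of an L2 backed by a victim L3, sizes in 'lines'."""

    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()  # address -> line, in LRU order
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def access(self, addr):
        """Return which level served the access: 'L2', 'L3' or 'MEM'."""
        if addr in self.l2:
            self.l2.move_to_end(addr)
            return "L2"
        if addr in self.l3:
            # Hit in the victim cache: promote the line back into L2.
            del self.l3[addr]
            self._fill_l2(addr)
            return "L3"
        # Miss everywhere: new data from memory goes straight into L2,
        # bypassing L3 (the L3 only ever receives L2 victims).
        self._fill_l2(addr)
        return "MEM"

    def _fill_l2(self, addr):
        if len(self.l2) >= self.l2_lines:
            victim, _ = self.l2.popitem(last=False)  # evict LRU line
            self.l3[victim] = None                   # victim lands in L3
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)          # L3 victim is dropped
        self.l2[addr] = None

cache = VictimCacheSketch(l2_lines=2, l3_lines=4)
print([cache.access(a) for a in (0, 1, 2, 0)])  # ['MEM', 'MEM', 'MEM', 'L3']
```

With a 2-line L2, the third access evicts the first line into L3, so re-reading it hits the victim cache and promotes it back into L2 - which is why a workload whose hot set fits the combined capacity can still benefit from the extra L3.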
 
  • Like
Reactions: looncraz

looncraz

Senior member
Sep 12, 2011
716
1,638
136
Just a thought: the UMCs likely have write buffers, and so do the CCXs internally. We might see a faster UMC response due to the data still sitting in the buffer (a kind of LTSF). The CCX buffers might just add a few more (CCX-clock) cycles.
Glad I read your comment before posting mine :p This was exactly what I was going to mention as a possibility: the write-coalescing buffer for the IMC could, in theory, be used to service requests when it holds data waiting to be written to memory, which should at least sometimes be the case with inter-CCX communication... but that check actually needs to be made. PHY-to-RAM latency is probably only 20ns or so, maybe less. IMC to PHY, however, occurs over the data fabric - which, interestingly, is how interleaving occurs, if the AGESA entries I've seen are to be believed.

Data from a CCX goes along the data fabric to the IMC. Data from the IMC goes to each PHY, in 256B, 512B, 1KiB, or 2KiB chunks, over the data fabric once again.

So the memory path is a pretty crazy thing to explore...

L1D + L2 + BUS + DFI + DF + DFI + IMC + DFI + DF + DFI + PHY + BUS + RAM

Most of that is just because the PHYs are on the opposite sides of the die... a wholly baffling situation. In fact, there's a great deal about the Ryzen die that is crazy baffling... so much SRAM... so many functional blocks that seem to defy identification.
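Treating the path above as a simple additive latency budget makes the open question concrete: the endpoint latencies are roughly known, so whatever remains of a measured total has to be hiding in the fabric hops. The per-hop numbers below are placeholders taken from this thread, not measurements.

```python
# Rough additive model of the memory path sketched above. Only the
# L1D/L2/bus figures and the ~20ns PHY-to-RAM guess come from the
# discussion; every DFI/DF/IMC hop is an unknown (None).
memory_path = [
    ("L1D", 1.3), ("L2", 8.5), ("BUS", 10.0),
    ("DFI", None), ("DF", None), ("DFI", None),
    ("IMC", None), ("DFI", None), ("DF", None), ("DFI", None),
    ("PHY", None), ("BUS+RAM", 20.0),
]

known = sum(ns for _, ns in memory_path if ns is not None)
unknown_hops = [name for name, ns in memory_path if ns is None]
print(f"known contributions: {known:.1f} ns, unknown hops: {unknown_hops}")
```

If a full memory access really costs ~98ns, roughly 58ns of it would be unaccounted for and must be split across those eight fabric hops - which is the crux of the debate in this thread.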

--

People often miss just how simple each component has to be in these systems for everything to work. And Ryzen certainly works. I don't remember the last time a new platform's teething issues didn't include stability problems. Skylake has stability problems and it was mostly just an iterative design change.

One CCX does not know about the other - it knows about writing to a memory address. The receiving CCX doesn't know who had any given data; it just knows the address to read. Creating a dedicated CCX-to-CCX communication network/protocol would be fraught with peril... but it's ripe ground for future revision, given the potential rewards.
 

looncraz

Senior member
Sep 12, 2011
716
1,638
136
Well, after reading the updates on the 1400X and the 1500X, and keeping the current status of the inter-CCX communication in mind, it seems the 1400X is the processor I am going to buy, together with a B350 board.

Why?
Well, the 1500X has 4 cores, 8 threads, and 16MB of L3 cache.
This suggests a 2+2 setup where 2 cores are disabled on each CCX.
The 1400X has 4 cores, 8 threads, and 8MB of L3 cache.
This suggests a 4+0 setup where one CCX is entirely disabled.
The L3 is primarily a victim cache for L2.
If I understand it correctly, data evicted from L2 ends up in L3. If that data is needed again, it is still present in L3.
New data from main memory is stored directly in L2.
Shadow copies of the L2 cache tags, and other bookkeeping important for L2, are held in the L3 as well.
My guess is that a 2+2 setup will not gain much from the 16MB, and that the inter-CCX communication will hold it back even more in comparison to a 4+4 setup.
Of course, there are no reviews yet, but since a processor design is all about balancing the right sizes of local on-die storage, I doubt 2 cores per CCX would have much to gain from 2x the L3.
Especially with generically coded programs.

Am I making any sense so far?
Or is there a flaw in my reasoning?
I see no flaws and came to much the same conclusion. 2+2 is not what I want. However, there are apps that prefer 2+2 over 4+0 thanks to that extra cache. But I'd still take 4+0 over 2+2.
 
May 11, 2008
18,310
829
126
I see no flaws and came to much the same conclusion. 2+2 is not what I want. However, there are apps that prefer 2+2 over 4+0 thanks to that extra cache. But I'd still take 4+0 over 2+2.
Thank you. :)

Internal CCX communication is much faster than communication between the two CCXs. So most generically coded programs would benefit more from the 1400X - with, of course, the ever-present few exceptions.
 

May 11, 2008
18,310
829
126
I will wait for some time to be certain. I would like the 1400X instead of a possible 1400, although if the news is to be believed, the next one down the line will be the 1300.
When I buy, I want the hardware to be stable with a matured BIOS.
I will go for a Gigabyte motherboard again. I have good experiences with Gigabyte, and since the release of Ryzen, Gigabyte seems to be a good choice for a stable board.
I would also like a different custom cooler, and I still have to do thorough research on what memory is best.
I will also be including an NVMe drive, and that is going to be the most expensive part of the whole new system: a Samsung 960 Pro 512GB.

I will be upgrading from an A10-6700 CPU, so it will be a big win in the end.
I can reuse the RX 480 card that I have, and use the A10-6700 as a home theater system or backup PC.
 
May 11, 2008
18,310
829
126
I meant: why did AMD go along with such an abuse of that Windows behavior?
Consider the next generation of consoles to appear: 4-core Jaguar module >> 4-core Zen module.
And since the 4-core module design is still present, all those games will be easier to port from the old Jaguar-based consoles.
It is all too convenient for my taste.
For AMD this is a big win. The next step up would be 8 cores in a CCX.

edit:
For server software that is NUMA-aware of the CPU's design, none of this is an issue.
It only matters for an OS with a lot of abstraction hiding the hardware details from user programs.
And console programmers are already very familiar with the 4+4 setup of the Jaguar-based consoles, and they program close to the metal, catching any particular issues.
The PS4 and Xbox One were both announced in early 2013 and released in late 2013; Jim Keller was hired in 2012. I assume there were a lot of meetings about what the future Zen would be used for.
 
Last edited:

Chl Pixo

Junior Member
Mar 9, 2017
11
2
41
@looncraz did the AGESA have the ability to run the DF at a 2x multiplier?
If not, I may wait until the HEDT parts come out. I would like to have 8 more PCIe lanes from the CPU anyway.
I don't want to go Intel, as they change their socket too often. With AMD I could upgrade in 4 years without needing a new motherboard.
 


mat9v

Member
Mar 17, 2017
25
0
11
Communications wouldn't go through the L3 cache, so its latency is not part of the active communication. Only the tag search latency would apply. You don't pay the full access and fetch penalty on a cache miss. On Ryzen, the 1/4 tags system means that quite often you could have a result within a couple cycles after submitting an address search in the event of a failure (some 65% of the time, if my mental statistical analysis is up to snuff - no promises :p).
I hope I'm not being uneducated, but why? The core typically does not know where the required data is at the moment. So it asks the L1 data cache; if that misses, the question goes on to the L2 cache, and in case of another miss it asks the L3. Sure, tags are great, but do they lower the typical access time from 17ns to maybe 10ns? (if the core knows that the data is in L3 and asks for it directly).
Now we should ask what the problems are for our core. Say it wants to access a result computed by another thread. Threads from the same program share a memory address space and are aware of their locations, so it asks for the result. But it does not know where the required data physically resides, so it asks the core that it knows is/was running that thread. It must check the L3, then the L2, and lastly the L1 cache for the result (unless there is a mechanism/index that says this data is not in cache, so look elsewhere). If the data is not in any cache, it asks the memory controller for a fetch. Correct me if I'm wrong here.

A Ryzen quad core module is actually two dual core modules, when you look at it closely. Even the L3 is mostly just a duplication, with some functional block changes (PLL, and so on). The cores communicate via a mesh bus from L2 to L2 in the upper metal layers. So you have L1D Latency + L2 latency + bus latency + L2 latency + L1 Latency.

1.3 + 8.5 + 10 + 8.5 + 1.3 ~= 46 ns

To talk to another core in another CCX, contrary to my prior thinking, then you have the following latencies:

L1D + L2 + BUS + DFI + DF + IMC + DF + DFI + BUS + L2 + L1D

1.3 + 8.5 + 10 + ? + ? + ? + ? + ? + 10 + 8.5 + L1D ~= 140

DFI, of course, is the data fabric interface unit I assume must exist - can't just send signals however you please, you need to find a window and mix with any other traffic that might be on that bus.
From the hardware.fr test, L1 latency is 1.3ns, L2 latency is 6.3ns, and L3 latency is 17.3ns.
So going by the closest route, pinging another core in the same "dual" on the same CCX would be:
1.3 + 6.3 + bus + 6.3 + 1.3 = 15.2ns + bus (I have never seen an L2 mesh bus in any Ryzen diagram - are you sure about its existence?)
Pinging cores showed over 40ns - are you suggesting that the bus between L2 caches takes 25ns?
Or maybe the ping went through the L3 cache instead? That certainly seems closer: 2 x 17.3 = 34.6ns (and accounting for the tests probably being run at different CPU frequencies - up to a 20% difference - that could mean ~42ns, as in the core pinging results).
BTW, 1.3 + 8.5 + 10 + 8.5 + 1.3 = 29.6ns, not 46ns.
http://images.anandtech.com/doci/10591/HC28.AMD.Mike Clark.final-page-013.jpg
This slide suggests that there is no special bus linking L2 straight to the DF; rather, it goes through the L3 to get there. Please share the source of your info if otherwise.
Why are you so bent on all CCX-to-CCX queries going through the memory controller? This slide again suggests otherwise: http://images.anandtech.com/doci/11170/AMD_Ryzen_Mark_Papermaster_Final-page-009_575px.jpg
It shows cores and hubs connected to the Infinity Fabric, with the memory controllers as other entities connected to the IF.
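For what it's worth, the arithmetic in this exchange is easy to check mechanically. The 1.3ns/6.3ns figures are the hardware.fr L1/L2 latencies quoted above, and the ~42ns same-CCX ping is the rough measured value under discussion:

```python
# Same-CCX ping via L1 -> L2 -> bus -> L2 -> L1, bus latency unknown.
l1, l2 = 1.3, 6.3
fixed = l1 + l2 + l2 + l1
assert abs(fixed - 15.2) < 1e-9  # 15.2 ns + bus, as stated above

# If the measured same-CCX ping is ~42 ns, the implied bus share is:
implied_bus = 42.0 - fixed
print(f"implied bus latency: {implied_bus:.1f} ns")  # ~26.8 ns

# And the earlier estimate indeed sums to 29.6 ns, not 46 ns:
assert abs((1.3 + 8.5 + 10 + 8.5 + 1.3) - 29.6) < 1e-9
```

The ~27ns residue is the point of contention: it is large enough that either the hypothesized L2-to-L2 bus is very slow, or the ping actually takes a different route (e.g. through the L3).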
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
Threads from the same program share a memory address space and are aware of their locations, so it asks for the result. But it does not know where the required data physically resides, so it asks the core that it knows is/was running that thread. It must check the L3, then the L2, and lastly the L1 cache for the result (unless there is a mechanism/index that says this data is not in cache, so look elsewhere).
Neither the thread nor the hardware is a priori aware of which core holds the data. The hardware has to search for it, and it always searches the lowest-latency places first.

On Ryzen, the L2 cache is inclusive of the L1 cache. This means that only the L1 cache of the core requesting the data is explicitly checked - everything else is done at the L2 and L3 caches. If the data isn't in L2 cache then it *can't* be in the corresponding L1 cache.

The "partial tags" kept by the L3 cache are another such shortcut mechanism. If these partial tags don't hit (this check takes much less time than a full lookup), then the data can't be in the locations covered by them. If they *do* hit, then they point to the correct cache to query, and the lookup is completed there.

What we seem to have confusion about is exactly how one CCX communicates with another. There is a 100ns inter-CCX latency penalty measurement out there - and that's rather a lot, to the point of sounding decidedly suboptimal.
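The shortcut described above behaves like a small, lossy directory: a miss in the partial tags is definitive, while a hit only narrows down where to do the full lookup. Here is a toy sketch - the 8-bit tag width and the trivial hash are invented for illustration, since the real scheme is not public:

```python
PARTIAL_BITS = 8  # assumed width of the stored partial tag (illustrative)

def partial_tag(addr):
    # Lossy summary of an address: many addresses share one partial tag.
    return addr & ((1 << PARTIAL_BITS) - 1)

class PartialTagDirectory:
    """Toy directory of which cache *might* hold a given line."""

    def __init__(self):
        self.entries = {}  # partial tag -> set of cache ids that may hold it

    def insert(self, addr, cache_id):
        self.entries.setdefault(partial_tag(addr), set()).add(cache_id)

    def probe(self, addr):
        """Candidate caches to query; an empty set is a guaranteed miss."""
        return self.entries.get(partial_tag(addr), set())

d = PartialTagDirectory()
d.insert(0x1234, "core2_L2")
print(d.probe(0x1234))  # {'core2_L2'} - do the full lookup there
print(d.probe(0x0001))  # set() - definitive miss, no lookup needed
```

Note that probing 0x5634 would also return {'core2_L2'} even though it was never inserted: two different addresses can share a partial tag, which is exactly why a partial-tag hit still requires a full lookup before the data can be trusted.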
 

malitze

Junior Member
Feb 15, 2017
24
49
51
I wish PC Perspective listed the code they are using for the "ping test" - trying to understand this w/o the code is nuts!
Yeah, this really annoys me as well, though I can understand it at least to a certain degree. But without the code, I'm not sure what to make of the results.
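For reference, the protocol of such a "ping" test is simple even though PC Perspective's actual code isn't public: one thread raises a flag, the other clears it, and you time the round trip. A minimal Python sketch of the idea follows; real benchmarks pin each thread to a chosen core and use atomic flags in C, so numbers from this sketch reflect interpreter and scheduler overhead, not cache latency.

```python
import threading
import time

def ping_pong(iters=10_000):
    """Bounce a token between two threads; return mean round-trip seconds."""
    flag = [0]
    cond = threading.Condition()

    def responder():
        for _ in range(iters):
            with cond:
                while flag[0] != 1:   # wait for the ping
                    cond.wait()
                flag[0] = 0           # pong: hand the token back
                cond.notify()

    t = threading.Thread(target=responder)
    t.start()
    start = time.perf_counter()
    for _ in range(iters):
        with cond:
            flag[0] = 1               # ping
            cond.notify()
            while flag[0] != 0:       # wait for the pong
                cond.wait()
    elapsed = time.perf_counter() - start
    t.join()
    return elapsed / iters
```

A C version would replace the condition variable with a spinning atomic exchange and pin the two threads to specific cores, so the measured time is dominated by the cache-coherency round trip rather than the OS scheduler.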
 

CrazyElf

Member
May 28, 2013
88
21
81
Yesterday I posted my source's talk of a new platform; today Chiphell leaked the X399, which my source confirmed is real.

I did confirm that the new silicon revisions with "ironed out" issues span all the way from these behemoths down to the entry-level SKUs. I am not really getting my knickers in a bunch about this anymore; performance is coming, but only if you haven't bought a chip yet.

On the X399 report, this was just tweeted:

https://twitter.com/CPCHardware/status/843099198799187968
There are indeed HEDT-oriented 16C/32T Ryzen parts planned for X399 in 4-6 months. Clocks ~2.4/2.8GHz. 2-die MCM. 4-channel DDR4. Socket LGA SP3r2. ~150W.
Canard PC maintains CPU-Z, so I'm thinking this rumor is legit.
 
  • Like
Reactions: lightmanek

CatMerc

Golden Member
Jul 16, 2016
1,114
1,146
106
Well, if CPCHardware claims X399 is a thing, then we can say it's a thing.
 

CrazyElf

Member
May 28, 2013
88
21
81
Well, that's the thing:



...
AMD has an additional 100ns delay when communicating between cores in other CCXes. What we need to know is if this is a direct communication or if this is done through main memory. The simplest answer is that it is done through main memory and the latency is therefore CCX latency + DF latency + IMC latency + DF latency + CCX latency.

44 + ? + 19~30 + ? + 44 = ~140
I might have an answer for you.

Memory latency is 98 ns on Ryzen.
https://www.techpowerup.com/231268/amds-ryzen-cache-analyzed-improvements-improveable-ccx-compromises




Let's think about this.


We get about 45ns or so between cores within a CCX, and about 98ns to RAM. 45ns (within the CCX) + 98ns (the memory latency) = 143ns, which is about what PC Perspective is getting.

So it looks like it is going to DRAM. An L4 cache might be helpful here.

For a comparison, here's Skylake: https://techreport.com/review/31179/intel-core-i7-7700k-kaby-lake-cpu-reviewed/4

About 45ns, which is less than half of AMD's 98 ns.

If AMD could get memory latency down to Skylake levels, that would be awesome, because that would be about the same speed as the 5960X in terms of CCX + memory latency.

We expect some latency penalty from quad channel, but AMD's memory controller is slow - slower, it seems, than even Bulldozer's.
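The decomposition above amounts to a one-line sanity check: if inter-CCX traffic really round-trips through DRAM, the intra-CCX ping plus the memory latency should land on the measured figure. All three numbers are the rough values quoted in this thread:

```python
# If an inter-CCX transfer really went out to DRAM, the measured ping
# should be roughly the intra-CCX ping plus the full memory latency.
intra_ccx_ns = 45   # approximate same-CCX core-to-core ping
dram_ns = 98        # Ryzen memory latency per the TechPowerUp numbers
measured_ns = 143   # roughly what PC Perspective reports inter-CCX

predicted_ns = intra_ccx_ns + dram_ns
print(f"predicted {predicted_ns} ns vs measured ~{measured_ns} ns")
```

The close match is suggestive but not conclusive: as others in the thread point out, a transfer serviced from the IMC's write buffers over the data fabric could land in a similar latency range without ever touching DRAM.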
 

R0H1T

Platinum Member
Jan 12, 2013
2,566
142
106
On the X399 report. this was just Tweeted:

https://twitter.com/CPCHardware/status/843099198799187968


Canard PC maintains CPU-Z, so I'm thinking this rumor is legit.
So, 2x 1700(X) or something. Intel had better bring out their big guns with the next HEDT launch, or AMD might just eat their proverbial lunch at the top end. A flagship processor that beats Intel's best will sell on its name and bragging rights alone, of course.
 


CrazyElf

Member
May 28, 2013
88
21
81
That article is clickbait, just check the comments section & the original author's reservations about TPU's own conclusions.
True, but the data was from Hardware.fr, which is a pretty solid source. In any event, while the interpretation of the memory latency numbers might be incorrect, the numbers themselves are not going to change.
 

imported_jjj

Senior member
Feb 14, 2009
660
430
136
I might have an answer for you.

Memory latency is 98 ns on Ryzen.
https://www.techpowerup.com/231268/amds-ryzen-cache-analyzed-improvements-improveable-ccx-compromises




Let's think about this.


We get about 45ns or so between cores within a CCX, and about 98ns to RAM. 45ns (within the CCX) + 98ns (the memory latency) = 143ns, which is about what PC Perspective is getting.

So it looks like it is going to DRAM. An L4 cache might be helpful here.

For a comparison, here's Skylake: https://techreport.com/review/31179/intel-core-i7-7700k-kaby-lake-cpu-reviewed/4

About 45ns, which is less than half of AMD's 98 ns.

If AMD could get memory latency down to Skylake levels, that would be awesome, because that would be about the same speed as the 5960X in terms of CCX + memory latency.

We expect some latency penalty from quad channel, but AMD's memory controller is slow - slower, it seems, than even Bulldozer's.

Skylake and Kaby Lake were using 3866MHz CL18 RAM in those TR tests. Ryzen would get to about 60ns with such RAM settings - which may be achievable after the May update - and right now it gets as low as 70ns with 3200 CL14, though there is no access to secondary timings yet. After all is said and done and the BIOS settles, 60-odd ns might be doable with 3200MHz DRAM, close enough to Broadwell-E.
 

tamz_msc

Platinum Member
Jan 5, 2017
2,627
2,299
106
True, but the data was from Hardware.fr, which is a pretty solid source. In any event, while the interpretation of the memory latency numbers might be incorrect, the numbers themselves are not going to change.
I suggest you read that particular page of the Hardware.fr review. They had the AIDA64 engineers design a benchmark testing sequential L3 data access at different block sizes, for a more detailed analysis. It is not incorporated in the software release of AIDA64, even in the beta that officially enabled Ryzen support. Performance is comparable to a 6900K's - even a bit faster - up to the ~6MB mark.
 
  • Like
Reactions: Malogeek

iBoMbY

Member
Nov 23, 2016
175
103
86
Isn't it pretty strange that it already gets worse after about 6MB (L3 minus L2?)? Shouldn't it be fast up to 8.5-10MB (L2+L3) if it is a victim cache?
 
