
Ryzen: Strictly technical

Page 39
Status
Not open for further replies.

innociv

Member
Jun 7, 2011
54
20
76
edit: nvm, figured out what I had posted here.

This was under the assumption that the "ping test" measures how long it takes for one core to write something into L3 and RAM, and for another core to then read it.

I was thinking that the 142ns delay was really because it was fetching from RAM, but no, that's unlikely even if memory latency were as low as 70ns, when it usually seems to be in the 75-105ns range for Ryzen.
 
Last edited:

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
edit: nvm, figured out what I had posted here.

This was under the assumption that the "ping test" measures how long it takes for one core to write something into L3 and RAM, and for another core to then read it.

I was thinking that the 142ns delay was really because it was fetching from RAM, but no, that's unlikely even if memory latency were as low as 70ns, when it usually seems to be in the 75-105ns range for Ryzen.
Just a thought: the UMCs likely have write buffers, and so do the CCXs internally. We might see a faster UMC response due to the data still sitting in the buffer (a kind of LTSF). The CCX buffers might just add a few more (CCX-clock) cycles.
 
  • Like
Reactions: looncraz
May 11, 2008
18,310
829
126
Well, after reading the updates on the 1400X and the 1500X, and keeping the current status of the inter-CCX communication in mind, it seems the 1400X is the processor I am going to buy, together with a B350 board. I am sticking with my choice from a few months ago.

Why?
Well, the 1500X has 4 cores, 8 threads, and 16MB of L3 cache.
This suggests a 2+2 setup where 2 cores are disabled on each CCX.
The 1400X has 4 cores, 8 threads, and 8MB of L3 cache.
This suggests a 4+0 setup where one CCX is entirely disabled.
The L3 is primarily a victim cache for L2.
If I understand it correctly, data evicted from L2 ends up in L3. If that data is needed again, it is still present in L3.
New data from main memory is stored directly in L2.
Shadow copies of the L2 cache tags, and other bookkeeping important for L2, are held in the L3 as well.
My guess is that a 2+2 setup will not gain much from the 16MB, and that the inter-CCX communication will hold it back even more in comparison to a 4+4 setup.
Of course, there are no reviews yet, but since a processor design is all about balancing the right sizes of local on-die storage, I doubt 2 cores per CCX would have much to gain from 2x the L3.
Especially with generically coded programs.

Am I making any sense so far?
Or is there a flaw in my reasoning?
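The L2/L3 fill policy described above (new fills go straight to L2; the L3 only receives L2 victims) can be sketched in a few lines of Python. This is purely illustrative: fully-associative LRU with made-up sizes, nothing like the real set-associative Zen caches.

```python
from collections import OrderedDict

class VictimCacheSketch:
    """Toy model of an L2 backed by a victim L3, sizes in 'lines'."""

    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()  # address -> line, in LRU order
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def access(self, addr):
        """Return which level served the access: 'L2', 'L3' or 'MEM'."""
        if addr in self.l2:
            self.l2.move_to_end(addr)
            return "L2"
        if addr in self.l3:
            # Hit in the victim cache: promote the line back into L2.
            del self.l3[addr]
            self._fill_l2(addr)
            return "L3"
        # Miss everywhere: new data from memory goes straight into L2,
        # bypassing L3 (the L3 only ever receives L2 victims).
        self._fill_l2(addr)
        return "MEM"

    def _fill_l2(self, addr):
        if len(self.l2) >= self.l2_lines:
            victim, _ = self.l2.popitem(last=False)  # evict LRU line
            self.l3[victim] = None                   # victim lands in L3
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)          # L3 victim is dropped
        self.l2[addr] = None

cache = VictimCacheSketch(l2_lines=2, l3_lines=4)
print([cache.access(a) for a in (0, 1, 2, 0)])  # ['MEM', 'MEM', 'MEM', 'L3']
```

With a 2-line L2, the third access evicts the first line into L3, so re-reading it hits the victim cache and promotes it back into L2 - which is why a workload whose hot set fits the combined capacity can still benefit from the extra L3.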
 
  • Like
Reactions: looncraz

looncraz

Senior member
Sep 12, 2011
716
1,638
136
Just a thought: the UMCs likely have write buffers, and so do the CCXs internally. We might see a faster UMC response due to the data still sitting in the buffer (a kind of LTSF). The CCX buffers might just add a few more (CCX-clock) cycles.
Glad I read your comment before posting mine :p This was exactly what I was going to mention as a possibility: the write-coalescing buffer for the IMC could, in theory, be used to service requests when it holds data waiting to be written to memory, which should at least sometimes be the case with inter-CCX communication... but that check actually needs to be made. PHY-to-RAM latency is probably only 20ns or so, maybe less. IMC to PHY, however, occurs over the data fabric - which, interestingly, is how interleaving occurs, if the AGESA entries I've seen are to be believed.

Data from a CCX goes along the data fabric to the IMC. Data from the IMC goes to each PHY, in 256B, 512B, 1KiB, or 2KiB chunks, over the data fabric once again.

So the memory path is a pretty crazy thing to explore...

L1D + L2 + BUS + DFI + DF + DFI + IMC + DFI + DF + DFI + PHY + BUS + RAM

Most of that is just because the PHYs are on the opposite sides of the die... a wholly baffling situation. In fact, there's a great deal about the Ryzen die that is crazy baffling... so much SRAM... so many functional blocks that seem to defy identification.
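Treating the path above as a simple additive latency budget makes the open question concrete: the endpoint latencies are roughly known, so whatever remains of a measured total has to be hiding in the fabric hops. The per-hop numbers below are placeholders taken from this thread, not measurements.

```python
# Rough additive model of the memory path sketched above. Only the
# L1D/L2/bus figures and the ~20ns PHY-to-RAM guess come from the
# discussion; every DFI/DF/IMC hop is an unknown (None).
memory_path = [
    ("L1D", 1.3), ("L2", 8.5), ("BUS", 10.0),
    ("DFI", None), ("DF", None), ("DFI", None),
    ("IMC", None), ("DFI", None), ("DF", None), ("DFI", None),
    ("PHY", None), ("BUS+RAM", 20.0),
]

known = sum(ns for _, ns in memory_path if ns is not None)
unknown_hops = [name for name, ns in memory_path if ns is None]
print(f"known contributions: {known:.1f} ns, unknown hops: {unknown_hops}")
```

If a full memory access really costs ~98ns, roughly 58ns of it would be unaccounted for and must be split across those eight fabric hops - which is the crux of the debate in this thread.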

--

People often miss just how simple each component has to be in these systems for everything to work. And Ryzen certainly works. I don't remember the last time a new platform's teething issues didn't include stability problems. Skylake has stability problems and it was mostly just an iterative design change.

One CCX does not know about the other - it knows about writing to a memory address. The receiving CCX doesn't know who had any given data; it just knows the address to read. Creating a dedicated CCX-to-CCX communication network/protocol would be fraught with peril... but it's ripe ground for future revision, given the potential rewards.
 

looncraz

Senior member
Sep 12, 2011
716
1,638
136
Well, after reading the updates on the 1400X and the 1500X, and keeping the current status of the inter-CCX communication in mind, it seems the 1400X is the processor I am going to buy, together with a B350 board.

Why?
Well, the 1500X has 4 cores, 8 threads, and 16MB of L3 cache.
This suggests a 2+2 setup where 2 cores are disabled on each CCX.
The 1400X has 4 cores, 8 threads, and 8MB of L3 cache.
This suggests a 4+0 setup where one CCX is entirely disabled.
The L3 is primarily a victim cache for L2.
If I understand it correctly, data evicted from L2 ends up in L3. If that data is needed again, it is still present in L3.
New data from main memory is stored directly in L2.
Shadow copies of the L2 cache tags, and other bookkeeping important for L2, are held in the L3 as well.
My guess is that a 2+2 setup will not gain much from the 16MB, and that the inter-CCX communication will hold it back even more in comparison to a 4+4 setup.
Of course, there are no reviews yet, but since a processor design is all about balancing the right sizes of local on-die storage, I doubt 2 cores per CCX would have much to gain from 2x the L3.
Especially with generically coded programs.

Am I making any sense so far?
Or is there a flaw in my reasoning?
I see no flaws and came to much the same conclusion. 2+2 is not what I want. However, there are apps that prefer 2+2 over 4+0 thanks to that extra cache. But I'd still take 4+0 over 2+2.
 
May 11, 2008
18,310
829
126
I see no flaws and came to much the same conclusion. 2+2 is not what I want. However, there are apps that prefer 2+2 over 4+0 thanks to that extra cache. But I'd still take 4+0 over 2+2.
Thank you. :)

Internal CCX communication is much faster than communication between the two CCXs. So most generically coded programs would benefit more from the 1400X - with, of course, the ever-present few exceptions.
 

May 11, 2008
18,310
829
126
I will wait for some time to be certain. I would like the 1400X instead of a possible 1400, although if the news is to be believed, the next one down the line will be the 1300.
When I buy, I want the hardware to be stable with a matured BIOS.
I will go for a Gigabyte motherboard again. I have good experiences with Gigabyte, and since the release of Ryzen, Gigabyte seems to be a good choice for a stable board.
I would also like a different custom cooler, and I still have to do thorough research on what memory is best.
I will also be including an NVMe drive, and that is going to be the most expensive part of the whole new system: a Samsung 960 Pro 512GB.

I will be upgrading from an A10-6700 CPU, so it will be a big win in the end.
I can reuse the RX 480 card that I have, and use the A10-6700 as a home theater system or backup PC.
 
May 11, 2008
18,310
829
126
I meant: why did AMD go along with such an abuse of that Windows behavior?
Consider the next generation of consoles to appear: 4-core Jaguar module >> 4-core Zen module.
And since the 4-core module design is still present, all those games will be easier to port from the old Jaguar-based consoles.
It is all too convenient for my taste.
For AMD this is a big win. The next step up would be 8 cores in a CCX.

edit:
For server software that is NUMA-aware of the CPU's design, none of this is an issue.
It only matters for an OS with a lot of abstraction hiding the hardware details from user programs.
And console programmers are already very familiar with the 4+4 setup of the Jaguar-based consoles, and they program close to the metal, catching any particular issues.
The PS4 and Xbox One were both announced in early 2013 and released in late 2013; Jim Keller was hired in 2012. I assume there were a lot of meetings about what the future Zen would be used for.
 
Last edited:

Chl Pixo

Junior Member
Mar 9, 2017
11
2
41
@looncraz did the AGESA have the ability to run the DF at a 2x multiplier?
If not, I may wait until the HEDT parts come out. I would like to have 8 more PCIe lanes from the CPU anyway.
I don't want to go Intel, as they change their socket too often. With AMD I could upgrade in 4 years without needing a new motherboard.
 


mat9v

Member
Mar 17, 2017
25
0
11
Communications wouldn't go through the L3 cache, so its latency is not part of the active communication. Only the tag search latency would apply. You don't pay the full access and fetch penalty on a cache miss. On Ryzen, the 1/4 tags system means that quite often you could have a result within a couple cycles after submitting an address search in the event of a failure (some 65% of the time, if my mental statistical analysis is up to snuff - no promises :p).
I hope I'm not being uneducated, but why? The core typically does not know where the required data is at the moment. So it asks the L1 data cache; if that misses, the question goes on to the L2 cache, and in case of another miss it asks the L3. Sure, tags are great, but do they lower the typical access time from 17ns to maybe 10ns? (if the core knows that the data is in L3 and asks for it directly).
Now we should ask what the problems are for our core. Say it wants to access a result computed by another thread. Threads from the same program share a memory address space and are aware of their locations, so it asks for the result. But it does not know where the required data physically resides, so it asks the core that it knows is/was running that thread. It must check the L3, then the L2, and lastly the L1 cache for the result (unless there is a mechanism/index that says this data is not in cache, so look elsewhere). If the data is not in any cache, it asks the memory controller for a fetch. Correct me if I'm wrong here.

A Ryzen quad core module is actually two dual core modules, when you look at it closely. Even the L3 is mostly just a duplication, with some functional block changes (PLL, and so on). The cores communicate via a mesh bus from L2 to L2 in the upper metal layers. So you have L1D Latency + L2 latency + bus latency + L2 latency + L1 Latency.

1.3 + 8.5 + 10 + 8.5 + 1.3 ~= 46 ns

To talk to another core in another CCX, contrary to my prior thinking, then you have the following latencies:

L1D + L2 + BUS + DFI + DF + IMC + DF + DFI + BUS + L2 + L1D

1.3 + 8.5 + 10 + ? + ? + ? + ? + ? + 10 + 8.5 + L1D ~= 140

DFI, of course, is the data fabric interface unit I assume must exist - can't just send signals however you please, you need to find a window and mix with any other traffic that might be on that bus.
From the hardware.fr test, L1 latency is 1.3ns, L2 latency is 6.3ns, and L3 latency is 17.3ns.
So going by the closest route, pinging another core in the same "dual" on the same CCX would be:
1.3 + 6.3 + bus + 6.3 + 1.3 = 15.2ns + bus (I have never seen an L2 mesh bus in any Ryzen diagram - are you sure about its existence?)
Pinging cores showed over 40ns - are you suggesting that the bus between L2 caches takes 25ns?
Or maybe the ping went through the L3 cache instead? That certainly seems closer: 2 x 17.3 = 34.6ns (and accounting for the tests probably being run at different CPU frequencies - up to a 20% difference - that could mean ~42ns, as in the core pinging results).
BTW, 1.3 + 8.5 + 10 + 8.5 + 1.3 = 29.6ns, not 46ns.
http://images.anandtech.com/doci/10591/HC28.AMD.Mike Clark.final-page-013.jpg
This slide suggests that there is no special bus linking L2 straight to the DF; rather, it goes through the L3 to get there. Please share the source of your info if otherwise.
Why are you so bent on all CCX-to-CCX queries going through the memory controller? This slide again suggests otherwise: http://images.anandtech.com/doci/11170/AMD_Ryzen_Mark_Papermaster_Final-page-009_575px.jpg
It shows cores and hubs connected to the Infinity Fabric, with the memory controllers as other entities connected to the IF.
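For what it's worth, the arithmetic in this exchange is easy to check mechanically. The 1.3ns/6.3ns figures are the hardware.fr L1/L2 latencies quoted above, and the ~42ns same-CCX ping is the rough measured value under discussion:

```python
# Same-CCX ping via L1 -> L2 -> bus -> L2 -> L1, bus latency unknown.
l1, l2 = 1.3, 6.3
fixed = l1 + l2 + l2 + l1
assert abs(fixed - 15.2) < 1e-9  # 15.2 ns + bus, as stated above

# If the measured same-CCX ping is ~42 ns, the implied bus share is:
implied_bus = 42.0 - fixed
print(f"implied bus latency: {implied_bus:.1f} ns")  # ~26.8 ns

# And the earlier estimate indeed sums to 29.6 ns, not 46 ns:
assert abs((1.3 + 8.5 + 10 + 8.5 + 1.3) - 29.6) < 1e-9
```

The ~27ns residue is the point of contention: it is large enough that either the hypothesized L2-to-L2 bus is very slow, or the ping actually takes a different route (e.g. through the L3).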
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
Threads from the same program share a memory address space and are aware of their locations, so it asks for the result. But it does not know where the required data physically resides, so it asks the core that it knows is/was running that thread. It must check the L3, then the L2, and lastly the L1 cache for the result (unless there is a mechanism/index that says this data is not in cache, so look elsewhere).
Neither the thread nor the hardware is a priori aware of which core holds the data. The hardware has to search for it, and it always searches the lowest-latency places first.

On Ryzen, the L2 cache is inclusive of the L1 cache. This means that only the L1 cache of the core requesting the data is explicitly checked - everything else is done at the L2 and L3 caches. If the data isn't in L2 cache then it *can't* be in the corresponding L1 cache.

The "partial tags" kept by the L3 cache are another such shortcut mechanism. If these partial tags don't hit (this check takes much less time than a full lookup), then the data can't be in the locations covered by them. If they *do* hit, then they point to the correct cache to query, and the lookup is completed there.

What we seem to have confusion about is exactly how one CCX communicates with another. There is a 100ns inter-CCX latency penalty measurement out there - and that's rather a lot, to the point of sounding decidedly suboptimal.
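The shortcut described above behaves like a small, lossy directory: a miss in the partial tags is definitive, while a hit only narrows down where to do the full lookup. Here is a toy sketch - the 8-bit tag width and the trivial hash are invented for illustration, since the real scheme is not public:

```python
PARTIAL_BITS = 8  # assumed width of the stored partial tag (illustrative)

def partial_tag(addr):
    # Lossy summary of an address: many addresses share one partial tag.
    return addr & ((1 << PARTIAL_BITS) - 1)

class PartialTagDirectory:
    """Toy directory of which cache *might* hold a given line."""

    def __init__(self):
        self.entries = {}  # partial tag -> set of cache ids that may hold it

    def insert(self, addr, cache_id):
        self.entries.setdefault(partial_tag(addr), set()).add(cache_id)

    def probe(self, addr):
        """Candidate caches to query; an empty set is a guaranteed miss."""
        return self.entries.get(partial_tag(addr), set())

d = PartialTagDirectory()
d.insert(0x1234, "core2_L2")
print(d.probe(0x1234))  # {'core2_L2'} - do the full lookup there
print(d.probe(0x0001))  # set() - definitive miss, no lookup needed
```

Note that probing 0x5634 would also return {'core2_L2'} even though it was never inserted: two different addresses can share a partial tag, which is exactly why a partial-tag hit still requires a full lookup before the data can be trusted.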
 

malitze

Junior Member
Feb 15, 2017
24
49
51
I wish PC Perspective listed the code they are using for the "ping test" - trying to understand this w/o the code is nuts!
Yeah, this really annoys me as well, though I can understand it at least to a certain degree. But without the code, I'm not sure what to make of the results.
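For reference, the protocol of such a "ping" test is simple even though PC Perspective's actual code isn't public: one thread raises a flag, the other clears it, and you time the round trip. A minimal Python sketch of the idea follows; real benchmarks pin each thread to a chosen core and use atomic flags in C, so numbers from this sketch reflect interpreter and scheduler overhead, not cache latency.

```python
import threading
import time

def ping_pong(iters=10_000):
    """Bounce a token between two threads; return mean round-trip seconds."""
    flag = [0]
    cond = threading.Condition()

    def responder():
        for _ in range(iters):
            with cond:
                while flag[0] != 1:   # wait for the ping
                    cond.wait()
                flag[0] = 0           # pong: hand the token back
                cond.notify()

    t = threading.Thread(target=responder)
    t.start()
    start = time.perf_counter()
    for _ in range(iters):
        with cond:
            flag[0] = 1               # ping
            cond.notify()
            while flag[0] != 0:       # wait for the pong
                cond.wait()
    elapsed = time.perf_counter() - start
    t.join()
    return elapsed / iters
```

A C version would replace the condition variable with a spinning atomic exchange and pin the two threads to specific cores, so the measured time is dominated by the cache-coherency round trip rather than the OS scheduler.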
 

CrazyElf

Member
May 28, 2013
88
21
81
Yesterday I posted my source's talk of a new platform; today Chiphell leaked the X399, which my source confirmed is real.

I did confirm that the new silicon revisions with "ironed out" issues span all the way from these behemoths down to the entry-level SKUs. I am not really getting my knickers in a bunch about this anymore; performance is coming, but only if you haven't bought a chip yet.

On the X399 report, this was just tweeted:

https://twitter.com/CPCHardware/status/843099198799187968
There are indeed HEDT-oriented 16C/32T Ryzen parts planned for X399 in 4-6 months. Clocks ~2.4/2.8GHz. 2-die MCM. 4-channel DDR4. Socket LGA SP3r2. ~150W.
Canard PC maintains CPU-Z, so I'm thinking this rumor is legit.
 
  • Like
Reactions: lightmanek

CatMerc

Golden Member
Jul 16, 2016
1,114
1,146
106
Well, if CPCHardware claims X399 is a thing, then we can say it's a thing.
 

CrazyElf

Member
May 28, 2013
88
21
81
Well, that's the thing:



...
AMD has an additional 100ns delay when communicating between cores in other CCXes. What we need to know is if this is a direct communication or if this is done through main memory. The simplest answer is that it is done through main memory and the latency is therefore CCX latency + DF latency + IMC latency + DF latency + CCX latency.

44 + ? + 19~30 + ? + 44 = ~140
I might have an answer for you.

Memory latency is 98 ns on Ryzen.
https://www.techpowerup.com/231268/amds-ryzen-cache-analyzed-improvements-improveable-ccx-compromises




Let's think about this.


We get about 45ns or so between cores within a CCX, and about 98ns to RAM. 45ns (within the CCX) + 98ns (the memory latency) = 143ns, which is about what PC Perspective is getting.

So it looks like it is going to DRAM. An L4 cache might be helpful here.

For a comparison, here's Skylake: https://techreport.com/review/31179/intel-core-i7-7700k-kaby-lake-cpu-reviewed/4

About 45ns, which is less than half of AMD's 98 ns.

If AMD could get memory latency down to Skylake levels, that would be awesome, because that would be about the same speed as the 5960X in terms of CCX + memory latency.

We expect some latency penalty from quad channel, but AMD's memory controller is slow - slower, it seems, than even Bulldozer's.
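The decomposition above amounts to a one-line sanity check: if inter-CCX traffic really round-trips through DRAM, the intra-CCX ping plus the memory latency should land on the measured figure. All three numbers are the rough values quoted in this thread:

```python
# If an inter-CCX transfer really went out to DRAM, the measured ping
# should be roughly the intra-CCX ping plus the full memory latency.
intra_ccx_ns = 45   # approximate same-CCX core-to-core ping
dram_ns = 98        # Ryzen memory latency per the TechPowerUp numbers
measured_ns = 143   # roughly what PC Perspective reports inter-CCX

predicted_ns = intra_ccx_ns + dram_ns
print(f"predicted {predicted_ns} ns vs measured ~{measured_ns} ns")
```

The close match is suggestive but not conclusive: as others in the thread point out, a transfer serviced from the IMC's write buffers over the data fabric could land in a similar latency range without ever touching DRAM.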
 

R0H1T

Platinum Member
Jan 12, 2013
2,566
142
106
On the X399 report. this was just Tweeted:

https://twitter.com/CPCHardware/status/843099198799187968


Canard PC maintains CPU-Z, so I'm thinking this rumor is legit.
So, 2x 1700(X) or something. Intel had better bring out their big guns with the next HEDT launch, or AMD might just eat their proverbial lunch at the top end. A flagship processor that beats Intel's best will sell on its name and bragging rights alone, of course.
 


CrazyElf

Member
May 28, 2013
88
21
81
That article is clickbait, just check the comments section & the original author's reservations about TPU's own conclusions.
True, but the data was from Hardware.fr, which is a pretty solid source. In any event, while the interpretation of the memory latency numbers might be incorrect, the numbers themselves are not going to change.
 

imported_jjj

Senior member
Feb 14, 2009
660
430
136
I might have an answer for you.

Memory latency is 98 ns on Ryzen.
https://www.techpowerup.com/231268/amds-ryzen-cache-analyzed-improvements-improveable-ccx-compromises




Let's think about this.


We get about 45ns or so between cores within a CCX, and about 98ns to RAM. 45ns (within the CCX) + 98ns (the memory latency) = 143ns, which is about what PC Perspective is getting.

So it looks like it is going to DRAM. An L4 cache might be helpful here.

For a comparison, here's Skylake: https://techreport.com/review/31179/intel-core-i7-7700k-kaby-lake-cpu-reviewed/4

About 45ns, which is less than half of AMD's 98 ns.

If AMD could get memory latency down to Skylake levels, that would be awesome, because that would be about the same speed as the 5960X in terms of CCX + memory latency.

We expect some latency penalty from quad channel, but AMD's memory controller is slow - slower, it seems, than even Bulldozer's.

Skylake and Kaby Lake were using 3866MHz CL18 RAM in those TR tests. Ryzen would get to about 60ns with such RAM settings - which may be achievable after the May update - and right now it gets as low as 70ns with 3200 CL14, though there is no access to secondary timings yet. After all is said and done and the BIOS settles, 60-odd ns might be doable with 3200MHz DRAM, close enough to Broadwell-E.
 

tamz_msc

Platinum Member
Jan 5, 2017
2,627
2,299
106
True, but the data was from Hardware.fr, which is a pretty solid source. In any event, while the interpretation of the memory latency numbers might be incorrect, the numbers themselves are not going to change.
I suggest you read that particular page of the Hardware.fr review. They had the AIDA64 engineers design a benchmark testing sequential L3 data access at different block sizes, for a more detailed analysis. It is not incorporated in the software release of AIDA64, even in the beta that officially enabled Ryzen support. Performance is comparable to a 6900K's - even a bit faster - up to the ~6MB mark.
 
  • Like
Reactions: Malogeek

iBoMbY

Member
Nov 23, 2016
175
103
86
Isn't it pretty strange that it already gets worse after about 6MB (L3 minus L2?)? Shouldn't it be fast up to 8.5-10MB (L2+L3) if it is a victim cache?
 
