Ryzen: Strictly technical


CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
It is designed to be AMD's equivalent of Intel's X99 and soon-to-be X299 LGA2011 platforms; it's even confirmed that it is going to be LGA instead of AMD's typical PGA due to the physical size.
Of course. I'm just saying that a new fixed Zeppelin revision with a single die should still fit just fine on AM4.

They aren't making a completely new quad-CCX design; they're just taking two revised dual-CCX dies and slapping them together, as with Naples. Therefore they should be able to apply the single-die revision to AM4 CPUs too.
 

OrangeKhrush

Senior member
Feb 11, 2017
220
343
96
Of course. I'm just saying that a new fixed Zeppelin revision with a single die should still fit just fine on AM4.

They aren't making a completely new quad-CCX design; they're just taking two revised dual-CCX dies and slapping them together, as with Naples. Therefore they should be able to apply the single-die revision to AM4 CPUs too.

Ah, yes it will be; the revision will be used throughout all platforms.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
What about WinRAR tests with single-channel RAM and CCX affinity?
That test would be very interesting for Ryzen.

I disabled interleaving at the data fabric level for some quick testing to see the impact, as well as changing the interleaving sizes. Disabling interleaving, naturally, destroyed performance: it gained only ~1.5 ns of latency but lost half the bandwidth. I didn't get a chance to try with just one stick installed to force a single data pathway.

Sadly, I'm out of commission on Ryzen for a while - the ASRock board gave up the ghost just a few hours ago and doesn't even try to power on. I suspect the problem is in the soft-on circuitry, but it might be somewhere else... I restarted the system, then the fans went full-tilt for some reason; I hit reset and nothing at all happened; I held the power button for a good six seconds... nothing. So I cut the power from the PSU switch. It hasn't turned back on since :-(

And most of my data is on the NVMe drive... and I have no way of accessing that. Figures - I have more DDR4 coming tomorrow for testing... and now no board.
 
  • Like
Reactions: Drazick

iBoMbY

Member
Nov 23, 2016
175
103
86
Isn't that too low? Could you post the source of this information?
The speed you mentioned would be around the speed of single-channel DDR4-2666.
According to the clock-domain slide, the DF is 256 bits wide while a DDR4 channel is 64 bits wide.
At a 1333 MHz clock, the DF's theoretical speed is 42.6 GB/s.

If the speed were really that abysmal, using DDR4-1600 would put the DF at 12.8 GB/s.
That's almost half the speed of the 24 PCIe lanes the CPU has.
I don't think AMD is stupid enough to do something like that.

Comparing dual-channel speed to single-channel speed would show where the bottleneck is.
If it is the DF, the speed would be the same.

Well, I think it is pretty obvious that the DF speed, and the inter-CCX connection, is not half of the maximum memory bandwidth. 32 bytes per clock per direction seems to be accurate. The DF most likely works like a full-duplex network switch, though, so there may be some packet overhead, and multiple connections to the same resource will obviously slow everything down.
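
(A quick back-of-the-envelope check of those figures in Python; the 32 B/clock-per-direction width and the 1333 MHz MEMCLK come from this thread's discussion, not from an official AMD spec sheet.)

```python
# Back-of-the-envelope check of the numbers above; the 32 B/clock/direction
# DF width and the 1333 MHz MEMCLK for DDR4-2666 come from the thread,
# not from an official AMD spec sheet.
DF_BYTES_PER_CLOCK = 32      # 256-bit DF link, per direction
MEMCLK_MHZ = 1333            # DDR4-2666 -> 1333 MHz memory clock
DDR_CHANNEL_BYTES = 8        # 64-bit DDR4 channel, 2 transfers per clock

df = DF_BYTES_PER_CLOCK * MEMCLK_MHZ / 1000          # GB/s per direction
ch = DDR_CHANNEL_BYTES * MEMCLK_MHZ * 2 / 1000       # GB/s, single channel

print(f"DF, one direction:        {df:.1f} GB/s")    # ~42.7 GB/s
print(f"Single-channel DDR4-2666: {ch:.1f} GB/s")    # ~21.3 GB/s
print(f"Dual-channel DDR4-2666:   {2 * ch:.1f} GB/s")  # ~42.7 GB/s
```

One DF direction matches dual-channel bandwidth almost exactly, which is consistent with the point above that the fabric is not the half-bandwidth bottleneck the earlier figure implied.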
 

ndtech

Junior Member
Mar 14, 2017
8
2
51
I didn't get a chance to try with just one stick installed to force a single data pathway. Sadly, I'm out of commission on Ryzen for a while - the ASRock board gave up the ghost just a few hours ago and doesn't even try to power on.

So I ask other Ryzen owners to test WinRAR with single-channel RAM.
You can find the details of the required test here:

https://forums.anandtech.com/posts/38793301/

Nobody has tested Ryzen with single-channel RAM yet.
Maybe it will show some unexpected results.
 

Oleyska

Junior Member
Mar 7, 2017
6
11
16
So I ask other Ryzen owners to test WinRAR with single-channel RAM.
You can find the details of the required test here:

https://forums.anandtech.com/posts/38793301/

Nobody has tested Ryzen with single-channel RAM yet.
Maybe it will show some unexpected results.

Ryzen 1700 @ default BIOS settings, SMT enabled.
ASUS B350M-A (the M-ATX one, not the B350 Prime)
HPET
Corsair LPX 16GB DDR4-2666 CL15, D.O.C.P. profile, 1 DIMM, single channel.
Balanced power profile.

8747 KB/s - Test 4 - single-channel - WinRAR
5571 KB/s - Test 5 - single-channel - WinRAR with affinity to cores 0-7 (CCX0)
1557 KB/s - Test 6 - single-channel - WinRAR with affinity to cores 8-15 (CCX1)

I've redone the CCX1 test several times! CCX0 does not show this issue.
CCX1 only used the last 2 threads.
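
(For anyone who wants to script these runs instead of setting affinity by hand in Task Manager, here is a minimal Python sketch, assuming a stock 8C/16T Ryzen layout; psutil is a third-party package, and the WinRAR path and invocation are illustrative.)

```python
import subprocess
import psutil  # third-party: pip install psutil

CCX0 = list(range(0, 8))     # logical CPUs 0-7  (cores 0-3 + SMT siblings)
CCX1 = list(range(8, 16))    # logical CPUs 8-15 (cores 4-7 + SMT siblings)

def run_pinned(cmd, cpus):
    """Launch a process and restrict it to the given logical CPUs."""
    proc = subprocess.Popen(cmd)
    psutil.Process(proc.pid).cpu_affinity(cpus)
    return proc

# Illustrative path; start the benchmark from Tools -> Benchmark once it opens.
run_pinned([r"C:\Program Files\WinRAR\WinRAR.exe"], CCX1).wait()
```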
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
8747 KB/s - Test 4 - single-channel - WinRAR
5571 KB/s - Test 5 - single-channel - WinRAR with affinity to cores 0-7 (CCX0)
1557 KB/s - Test 6 - single-channel - WinRAR with affinity to cores 8-15 (CCX1)

I've redone the CCX1 test several times! CCX0 does not show this issue.
CCX1 only used the last 2 threads.

Very interesting. Can you move your single stick of DRAM over to the other channel? I bet if you did then your test 5 and test 6 results would swap.
 

Oleyska

Junior Member
Mar 7, 2017
6
11
16
Very interesting. Can you move your single stick of DRAM over to the other channel? I bet if you did then your test 5 and test 6 results would swap.
The cores are just parked; they won't leave parked mode in Balanced.

Edit 2: High Performance, CCX1 = 5584 KB/s
High Performance, CCX0 = 5790 KB/s

So it's not memory related; Microsoft just has issues with Ryzen (shock!).
Most users use the Balanced power profile, hence my default testing is always in Balanced :)
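
(For reference, the power plan switch can be scripted too; a sketch using the stock Windows power scheme GUIDs, which you should verify on your own system with `powercfg /list`.)

```python
# Sketch: switch power plans from a script so Balanced vs High performance
# runs can be compared back-to-back. GUIDs are the stock Windows scheme
# GUIDs; verify yours with `powercfg /list`.
import subprocess

PLANS = {
    "balanced":         "381b4222-f694-41f0-9685-ff5bb260df2e",
    "high_performance": "8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c",
}

def set_power_plan(name):
    subprocess.run(["powercfg", "/setactive", PLANS[name]], check=True)

set_power_plan("high_performance")   # keeps cores from parking mid-benchmark
```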
 

Magic Hate Ball

Senior member
Feb 2, 2017
290
250
96
I disabled interleaving at the data fabric level for some quick testing to see the impact, as well as changing the interleaving sizes. Disabling interleaving, naturally, destroyed performance: it gained only ~1.5 ns of latency but lost half the bandwidth. I didn't get a chance to try with just one stick installed to force a single data pathway.

Sadly, I'm out of commission on Ryzen for a while - the ASRock board gave up the ghost just a few hours ago and doesn't even try to power on. I suspect the problem is in the soft-on circuitry, but it might be somewhere else... I restarted the system, then the fans went full-tilt for some reason; I hit reset and nothing at all happened; I held the power button for a good six seconds... nothing. So I cut the power from the PSU switch. It hasn't turned back on since :-(

And most of my data is on the NVMe drive... and I have no way of accessing that. Figures - I have more DDR4 coming tomorrow for testing... and now no board.

Did you reset the CMOS? There should be a way to do it with the power off. My Taichi even has an external button for it on the I/O plate.
 

ndtech

Junior Member
Mar 14, 2017
8
2
51
High Performance, CCX1 = 5584 KB/s
High Performance, CCX0 = 5790 KB/s

Thank you!
If there is no difference, both IMCs probably work together as a pair.
I had hoped that each CCX contained a dedicated IMC, so one CCX could show better results in single-channel mode.
Please also test the single-threaded WinRAR benchmark for CCX0/CCX1, plus the same tests with dual-channel RAM.

You could also disable one CCX's cores in the BIOS instead of changing affinity. That's probably a more robust way to limit cores, and maybe the BIOS can switch the CCX-IMC path to a faster mode if it knows that only one CCX is enabled.
 
  • Like
Reactions: Kromaatikse

OrangeKhrush

Senior member
Feb 11, 2017
220
343
96
If they come out with new revisions, I wonder which naming scheme they'll use?

1850X
1770X
1750
1650X
1550
1450X
1370
1350
1250
1150

Holy meatballs, that would be confusing.
 

Magic Hate Ball

Senior member
Feb 2, 2017
290
250
96
If they come out with new revisions, I wonder which naming scheme they'll use?

1850X
1770X
1750
1650X
1550
1450X
1370
1350
1250
1150

Holy meatballs, that would be confusing.

Any worse than this?
4960X
4930K
4820K
4790
4790K
4770
4770K
4670
4670K
4570
4590
4460
4440
4430
4330

And that's not counting the T, S, M, and U versions of the chips, when the mobile ones might not even match the core/thread counts of the main chip series.

Frankly, marketing departments are the bane of simplicity when it comes to hardware.
 

Oleyska

Junior Member
Mar 7, 2017
6
11
16
Thank you!
If there is no difference, both IMCs probably work together as a pair.
I had hoped that each CCX contained a dedicated IMC, so one CCX could show better results in single-channel mode.
Please also test the single-threaded WinRAR benchmark for CCX0/CCX1, plus the same tests with dual-channel RAM.

You could also disable one CCX's cores in the BIOS instead of changing affinity. That's probably a more robust way to limit cores, and maybe the BIOS can switch the CCX-IMC path to a faster mode if it knows that only one CCX is enabled.
From the results I got, I have my answers, and yours too.
Core parking is a serious issue.
There is no difference in performance.
Disabling a CCX does almost nothing in my short experience: my temperatures are almost the same, power consumption drops very little, and my overclock headroom is no larger.
I hope AMD does something fancy for their lower SKUs.

Haha, yeah, like my work laptop's "i5", which is a dual core with Hyper-Threading.
This is documented in Intel's naming scheme, and the same partly applies to GPUs as well.
I hate it, because people think their ultraportable has a proper i7 and ask me what's wrong with it because it doesn't replace their i7-5820K (yes, it has happened...).

This was at least roughly how it went:
Intel i7 on 10" and smaller = dual core with HT
Intel i7 on 10.1-13" = quad core without HT
Intel i7 on 17" and bigger = quad core with HT
i5 on 10" and smaller = dual core
i5 on 10-13" = dual core with HT
i5 on 17" and bigger = quad core without HT

And so on.
GPUs do something in this direction:
NVIDIA: GTX 980M ≈ GTX 960 or so,
GTX 1080M ≈ GTX 1060
AMD: 5970M ≈ 5870
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
Speaking of comparing desktop and laptop parts, consider the Bristol Ridge APUs.

AMD listened to criticism about Carrizo model numbers, and encoded the TDP range into the third digit of the laptop parts. So we have A12-9800P (15W) and A12-9830P (35W). This is fine.

But then we have the desktop parts, which *don't* have this encoding. Instead we have A12-9800. The only way to distinguish this 65W part from the 15W laptop part is by the lack of a P at the end. So some user asks for help with their "A12-9800" and we find out 20 minutes later that in fact they have a laptop. Would it have killed AMD to make it an A12-9860?
 

iBoMbY

Member
Nov 23, 2016
175
103
86
I had hoped that each CCX contained a dedicated IMC, so one CCX could show better results in single-channel mode.

It's quite obvious that this is not the case:

[Image: srNvaGe.jpg]


The Infinity Fabric is an on-chip switch, and everything is connected through it.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I don't see how a 16C32T chip solves the inter-CCX data fabric latency issue from a purely HW design standpoint, unless it's a monolithic block like Intel's. Unless they aren't using the 4-core CCX blocks?

How does a silicon revision solve what is fundamentally an interconnect issue?
You know that, while running at only 1.x GHz, there is likely only one hop from a CCX to memory. Inside the CCX, the L3 runs at core speed.

On 8-16C Intel chips, the ring bus can run at core clock speeds (~3x the DF speed), but on average it takes multiple hops to reach the MC and back (command -> IMC, data -> core).
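
(A toy model of that trade-off, as a Python sketch; the cycles-per-hop figure is an invented placeholder, not a published number, and only the shape of the comparison matters.)

```python
# Toy model only: cycles-per-hop is an assumed placeholder, not a
# published figure; the point is the hops-vs-clock trade-off.
DF_CLOCK_GHZ = 1.333      # data fabric at MEMCLK (DDR4-2666)
RING_CLOCK_GHZ = 4.0      # Intel ring roughly at core clock (~3x DF)
CYCLES_PER_HOP = 4        # assumed per-hop switch/ring-stop cost

def traversal_ns(hops, clock_ghz):
    return hops * CYCLES_PER_HOP / clock_ghz

print(f"Zen DF, 1 hop:      {traversal_ns(1, DF_CLOCK_GHZ):.1f} ns")
print(f"Intel ring, 6 hops: {traversal_ns(6, RING_CLOCK_GHZ):.1f} ns")
```

With these assumed numbers, one slow-clocked hop can land in the same ballpark as several fast-clocked ones, which is the point being made above.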
 
  • Like
Reactions: prtskg and looncraz

iBoMbY

Member
Nov 23, 2016
175
103
86
I also think the current source of the latency issues (both inter-CCX and memory latency) is quite obviously the Data Fabric, which, for some reason, is probably not working as fast as AMD planned. But since I think it may be based on an Ethernet-like physical layer, as Gen-Z is, there is a chance that this is a firmware(-fixable) issue.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I also think the current source of the latency issues (both inter-CCX and memory latency) is quite obviously the Data Fabric, which, for some reason, is probably not working as fast as AMD planned. But since I think it may be based on an Ethernet-like physical layer, as Gen-Z is, there is a chance that this is a firmware(-fixable) issue.
Are Intel latencies actually measured from the closest core to the IMC?
 

tamz_msc

Diamond Member
Jan 5, 2017
3,799
3,627
136
You know that, while running at only 1.x GHz, there is likely only one hop from a CCX to memory. Inside the CCX, the L3 runs at core speed.

On 8-16C Intel chips, the ring bus can run at core clock speeds (~3x the DF speed), but on average it takes multiple hops to reach the MC and back (command -> IMC, data -> core).
While I agree with what you're saying, the peculiarities of Ryzen 7's latency issues are not fully understood yet. Going back to the Hardware.fr investigation, they speculate:
In practice, a memory access by the processor does not happen in isolation. When the processor needs data, it begins by querying its cache hierarchy to see whether an up-to-date copy is there; if so, the data is read directly and no memory request is made. Ryzen's split L3, however, complicates things. When data is needed, and after checking inside its own CCX, the core sends two requests simultaneously: one to the other CCX and one to the memory controller. The two operations run in parallel (waiting for the other CCX's response first would be far too penalizing), but they impose an additional cost.
The cache issues are even more telling: they designed a benchmark that tests different buffer sizes sequentially, and the results for sizes >= 6 MB are particularly problematic. In this case, they say, it is because the access pattern never engages the other CCX, so in this benchmark Ryzen behaves as if it only has 8 MB of L3.
We asked AMD, who confirmed this: in practice, in this textbook case, the L3 of the second CCX is not used, and Ryzen behaves as if it had only 8 MB of L3.

This is a particularity of the benchmark used: to measure latency correctly, it is essential that the threads are pinned to a core and do not move. As you probably remember, the L3 cache is a "victim" cache: it holds data evicted from the L2 caches. And to fill the L2 caches, the accesses must come from inside the CCX. As a result, with our benchmark (latency is measured on a single thread) pinned to the first core, the other CCX is never solicited and its cache never fills.

Quotes are translated from Hardware.fr's French via Google Translate.
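
(For the curious, the structure of such a size-sweep latency benchmark looks roughly like the following sketch. This is not Hardware.fr's code; Python's interpreter overhead swamps the absolute numbers, and only the trend across sizes is meaningful here.)

```python
import time
import numpy as np

def chase_ns(size_bytes, iters=200_000):
    """Average ns per dependent load in a random pointer chain."""
    n = size_bytes // 8                   # one 8-byte index per slot
    perm = np.random.permutation(n)
    nxt = np.empty(n, dtype=np.int64)
    nxt[perm] = np.roll(perm, -1)         # perm[k] -> perm[k+1]: one big cycle
    i = 0
    t0 = time.perf_counter()
    for _ in range(iters):
        i = nxt[i]                        # each load depends on the previous
    return (time.perf_counter() - t0) / iters * 1e9

# Sweep footprints from cache-resident up to well past 16 MB of L3:
for kb in (16, 256, 4096, 8192, 16384, 65536):
    print(f"{kb:>6} KB: {chase_ns(kb * 1024):6.1f} ns/access")
```

The dependent loads defeat the prefetcher, so each step pays the full latency of whatever level of the hierarchy the working set spills into, which is why the >= 6 MB region is where the CCX split shows up.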
 

iBoMbY

Member
Nov 23, 2016
175
103
86
Does the Infinity Fabric scale with DDR4 latency and timings?

The Data Fabric clock is based on the memory clock, so it should get faster with higher memory clocks, but I don't think the memory timings have a direct effect on the Data Fabric. The Data Fabric probably has its own timing settings, though, which are currently not accessible. And the memory problems could be partly due to synchronization issues between the Data Fabric, the memory controller, and the memory itself.
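
(Assuming DF clock = MEMCLK, i.e. half the DDR4 transfer rate, which is what the 1333 MHz figure earlier in the thread implies, the scaling looks like this. This is a reading of the thread, not an official AMD formula.)

```python
# Assumed relation, per the discussion above: DF clock == MEMCLK
# (half the DDR4 transfer rate). Not an official AMD formula.
for mt in (2133, 2400, 2666, 2933, 3200):
    memclk = mt / 2                                   # MHz
    print(f"DDR4-{mt}: DF clock ~{memclk:.0f} MHz, "
          f"~{32 * memclk / 1000:.1f} GB/s per direction")
```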
 

mat9v

Member
Mar 17, 2017
25
0
66
We could use Process Lasso to bind all the background tasks and programs that normally run to cores on CCX1 instead of CCX0, giving a much better chance of the game threads staying on CCX0, where they belong: the Windows scheduler will see the CCX1 cores running more tasks and will schedule the game onto CCX0. It may only help somewhat, and it is not a real solution, but I think it's worth trying. See the sketch below.
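
(A rough sketch of that idea without Process Lasso, using the third-party psutil package; the game executable name is a hypothetical placeholder, and many protected system processes will simply refuse the change.)

```python
# Sketch of the idea without Process Lasso, using the third-party psutil
# package. Run elevated; protected processes will refuse anyway.
import psutil

CCX1 = list(range(8, 16))        # logical CPUs 8-15 on an 8C/16T Ryzen
KEEP_FREE = {"game.exe"}         # hypothetical: whatever should own CCX0

for p in psutil.process_iter(["name"]):
    name = (p.info["name"] or "").lower()
    if name and name not in KEEP_FREE:
        try:
            p.cpu_affinity(CCX1) # herd background work onto CCX1
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            pass                 # protected/system processes: skip
```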
 