Ryzen: Strictly technical

Page 16

CrazyElf

Member
May 28, 2013
88
21
81
There is no "uncore" in Ryzen.
The CCX essentially operates at its own clock (the core speed), while the fabric and the memory controller run at half the effective MEMCLK speed.
"Uncore" is just the term CPU-Z uses for the data fabric clock (DFICLK).


That's basically what I meant: unlike, say, Skylake, where the CPU core and uncore clocks are separate, on Ryzen it's all the same clock domain. That makes sense.
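For anyone who wants to sanity-check the relationship, here's a rough sketch of the clock domains based on The Stilt's description above (the half-of-effective-MEMCLK rule comes from his post; the example figures are just illustrations):

Code:
# Rough sketch of Ryzen clock domains, based on The Stilt's description above.
# Assumption: the fabric (DFICLK) and memory controller run at half the effective
# (DDR) MEMCLK rate, i.e. at the actual memory clock; cores run at their own clock.

def ryzen_clocks(ddr_rate_mts, core_multiplier, bclk_mhz=100.0):
    """Return the main clock domains in MHz for a given DDR rate and CPU multiplier."""
    memclk_effective = ddr_rate_mts          # e.g. DDR4-2400 -> 2400 MT/s
    memclk_actual = memclk_effective / 2     # actual memory clock
    fabric_clk = memclk_actual               # data fabric follows the memory clock
    core_clk = core_multiplier * bclk_mhz    # core clock = multiplier x BCLK
    return {"core": core_clk, "memclk": memclk_actual, "fabric": fabric_clk}

print(ryzen_clocks(2400, 36))   # {'core': 3600.0, 'memclk': 1200.0, 'fabric': 1200.0}
print(ryzen_clocks(3200, 40))   # {'core': 4000.0, 'memclk': 1600.0, 'fabric': 1600.0}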



Has anyone seen these ASRock boards?

http://www.asrock.com/MB/AMD/Fatal1ty X370 Professional Gaming/index.asp
Hyper BCLK Engine II
An additional external base clock generator that supports PCIe frequency overclocking. It provides a wider range of frequencies and more precise clock waveforms, allowing any user to get the most from their CPU investment with precise, stable overclocking.

They claim a 130 MHz base clock with their "engine". Any idea what that means?



Just marketing or does it actually do anything?


Edit: On Z170 boards apparently this played a role in the "SkyOC" function that allowed non-K CPUs to OC.
http://forum.asrock.com/forum_posts.asp?TID=2200&title=hyper-bclk-engine
 
Last edited:

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
That's basically what I meant: unlike, say, Skylake, where the CPU core and uncore clocks are separate, on Ryzen it's all the same clock domain. That makes sense.



Has anyone seen these ASRock boards?

http://www.asrock.com/MB/AMD/Fatal1ty X370 Professional Gaming/index.asp

They claim a 130 MHz base clock with their "engine". Any idea what that means?



Just marketing or does it actually do anything?


Edit: On Z170 boards apparently this played a role in the "SkyOC" function that allowed non-K CPUs to OC.
http://forum.asrock.com/forum_posts.asp?TID=2200&title=hyper-bclk-engine

Marketing, in practice.
The FCH in Zeppelin is nearly identical to the one in Carrizo / Bristol Ridge.
There is no need for an external PLL, as the FCH has its own PLLs which are designed for the purpose.
There are some advantages (like ASRock mentions) to using an external PLL, however both will work.

I've used the internal PLL on both of the previously mentioned designs and it can definitely be adjusted to operate beyond 100MHz.
136MHz is the basic ceiling for the internal PLL, however with some tricks you can go higher.
 

scannall

Golden Member
Jan 1, 2012
1,944
1,638
136
Looking at the Geekbench results, a couple of things stand out. The MT results are quite a bit different between Linux and Windows 10. Another indicator of a Windows 10 scheduling problem? Also, the Gigabyte boards seem to be running the best. I'm wondering if running on Windows 7 would bring the MT scores closer to Linux.
 

CrazyElf

Member
May 28, 2013
88
21
81
Marketing, in practice.
The FCH in Zeppelin is nearly identical to the one in Carrizo / Bristol Ridge.
There is no need for an external PLL, as the FCH has its own PLLs which are designed for the purpose.
There are some advantages (like ASRock mentions) to using an external PLL, however both will work.

I've used the internal PLL on both of the previously mentioned designs and it can definitely be adjusted to operate beyond 100MHz.
136MHz is the basic ceiling for the internal PLL, however with some tricks you can go higher.

Yeah that's pretty much what I expected.

That said, the ASRock seems like a decent board, at least once the BIOS matures. Oh, and the VRM is total overkill: 12+4 phases of 40 A TI NexFET MOSFETs and 60 A chokes. Granted, without OC headroom that doesn't matter.

Looking at the Geekbench results, a couple of things stand out. The MT results are quite a bit different between Linux and Windows 10. Another indicator of a Windows 10 scheduling problem? Also, the Gigabyte boards seem to be running the best. I'm wondering if running on Windows 7 would bring the MT scores closer to Linux.

Could be that the Linux Kernel is designed to keep things within a CCX.

Yeah I've heard that Gigabyte has the most mature BIOS, so I'm not surprised. Wait a few weeks and see if the BIOS matures.
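On the CCX point: if anyone on Linux wants to check which logical CPUs actually share an L3 (i.e. sit in the same CCX), the kernel exposes that through sysfs. A quick sketch, assuming the usual sysfs layout where cache index3 is the L3 (which holds on Ryzen):

Code:
# Group logical CPUs by the L3 cache they share - on Ryzen each group is one CCX.
# Assumes a Linux sysfs layout where cache index3 is the L3 (true on Ryzen).
import glob

ccx_map = {}
for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list")):
    cpu = path.split("/")[5]                  # e.g. "cpu0"
    with open(path) as f:
        shared = f.read().strip()             # e.g. "0-7" for CCX0 on a 1800X
    ccx_map.setdefault(shared, []).append(cpu)

for l3_group, cpus in ccx_map.items():
    print(f"L3 shared by CPUs {l3_group}: {', '.join(cpus)}")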
 

virpz

Junior Member
Sep 11, 2014
13
12
81
The Stilt said:

I would like to thank you for the work you have put in here sharing knowledge.

I've been following your thread for a few days now and I have one question, but please forgive me if it makes no sense:

What about the Windows coherency / affinity not working as it's supposed to, plus the Neural Net Prediction and/or Smart Prefetch getting fooled by this issue and aggravating the problem? Does that make any sense?
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
What does this mean for what we've seen so far... RTC bias... and what are we likely to see in the near future?
http://hwbot.org/newsflash/4335_ryz...bias_w88.110_not_allowed_on_select_benchmarks

I've been curious about this. I think it might explain some of the variation we see in results. We need to do some stopwatch confirmation of some of the scores we've seen.

This might be why some people at AMD thought disabling HPET made the system faster (but couldn't explain why).

The end result: leave HPET **ON** just like AMD's own Ryzen Master utility requires.
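To illustrate why a skewed reference timer distorts scores (a toy example with made-up numbers, not the actual benchmark code): if a benchmark divides work done by elapsed time, and the timer it reads runs slow relative to wall-clock time, the reported score is inflated by the same ratio.

Code:
# Toy illustration of "RTC bias": if the timer a benchmark reads runs slow
# relative to real time, the reported score is inflated by the same ratio.
# The numbers are made up, purely for illustration.

def reported_score(work_units, real_seconds, timer_skew):
    """timer_skew < 1.0 means the timer under-reports elapsed time."""
    measured_seconds = real_seconds * timer_skew
    return work_units / measured_seconds

work = 10_000          # arbitrary units of work completed
real_time = 100.0      # actual wall-clock seconds

print(reported_score(work, real_time, 1.00))   # 100.0  -> honest score
print(reported_score(work, real_time, 0.90))   # ~111.1 -> ~11% "free" performance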
 

CrazyElf

Member
May 28, 2013
88
21
81
Turns out the base clock generator will allow for higher RAM overclocks right now.

The way AMD has the BIOS set up, you cannot yet change the memory timings manually; each memory multiplier comes with its own fixed auto timings, and the high multipliers use very loose ones. The external base clock generator gets around this: by isolating the base clock from the rest of the platform and raising it, you can run a lower memory multiplier together with a higher base clock. That in turn allows for higher memory overclocks even though the timings themselves can't be tuned yet.

It might be an advantage, considering DRAM effectively acts as the last-level cache.
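The arithmetic behind it (the strap values below are just examples, not measured results): the effective memory speed scales with BCLK/100, so a lower strap at a raised BCLK can match or exceed what the higher straps reach on their own.

Code:
# Sketch of BCLK-based memory overclocking on AM4: the memory strap scales with
# BCLK/100, so a lower strap at a raised BCLK can match or exceed a higher strap.
# The strap/BCLK pairs below are illustrative examples, not measured results.

def effective_memclk(strap_mts, bclk_mhz):
    return strap_mts * bclk_mhz / 100.0

print(effective_memclk(3200, 100.0))   # 3200.0 MT/s  - highest strap at stock BCLK
print(effective_memclk(2666, 120.0))   # 3199.2 MT/s  - lower strap, raised BCLK
print(effective_memclk(2933, 115.0))   # 3372.95 MT/s - beyond what straps alone reach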

Buildzoid has a pretty good video about it:



I've been curious about this. I think it might explain some of the variation we see in results. We need to do some stopwatch confirmation of some of the scores we've seen.

This might be why some people at AMD thought disabling HPET made the system faster (but couldn't explain why).

The end result: leave HPET **ON** just like AMD's own Ryzen Master utility requires.


It may be best to tweak to the settings you like, then disable HPET.
 

wahdangun

Golden Member
Feb 3, 2011
1,007
148
106
Thanks Stilt for this amazing thread, but I have a request: can anyone here test overclocking with BCLK vs. the multiplier? I want to know which gives the best performance for OC in games, BCLK or multiplier overclocking.
 

lopri

Elite Member
Jul 27, 2002
13,209
594
126
I am a little confused about PCIe configuration of Summit Ridge. Exactly how many lanes are user accessible, and how many are reserved for system?
 

inf64

Diamond Member
Mar 11, 2011
3,685
3,957
136
Stilt, any idea why WinRAR performance is so abysmal? Is it due to its memory access patterns? From a few reviews I saw, it seems like Ryzen is slower than the 8350 in the ST benchmark, which is very odd.
 

lopri

Elite Member
Jul 27, 2002
13,209
594
126
If 24 lanes are user accessible, does that mean Infinity Fabric is taking up 8 lanes? AMD is claiming 128 lanes for Naples, which come from 4 Zeppelin dies. Wouldn't there be a higher bandwidth requirement for Infinity Fabric on Naples than on Summit Ridge, in which case the available lanes would be fewer than 96? I am curious how Infinity Fabric works here - there is almost no information about it. Its bandwidth/latency, as well as how it scales, are almost a complete unknown.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
If 24 lanes are user accessible, does that mean Infinity Fabric is taking up 8 lanes? AMD is claiming 128 lanes for Naples, which come from 4 Zeppelin dies. Wouldn't there be a higher bandwidth requirement for Infinity Fabric on Naples than on Summit Ridge, in which case the available lanes would be fewer than 96? I am curious how Infinity Fabric works here - there is almost no information about it. Its bandwidth/latency, as well as how it scales, are almost a complete unknown.

128 lanes - two sockets - 64 cores - 8 dice total.

That's 16 lanes per die. I had to be corrected on it as well :p It's all very confusing.

That leaves 32 lanes per CPU available for communication (8 per die) between sockets. That's only 64GB/s.

Frankly, that's not much, considering 8-channels of DDR4-2400 being attached to each socket... which is something like 150GB/s of memory bandwidth per CPU... but it might actually be plenty considering the software being run on these systems is fully aware of configurations and running it as a dual-node system with 8 CPUs per node would not be a problem in a server environment (and would have a lot of potential advantages).
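For reference, the back-of-the-envelope numbers behind those figures (assuming ~985 MB/s per PCIe 3.0 lane per direction and 19.2 GB/s per DDR4-2400 channel):

Code:
# Back-of-the-envelope bandwidth numbers behind the figures above.
# PCIe 3.0 is roughly 985 MB/s per lane per direction after encoding overhead;
# DDR4-2400 is 19.2 GB/s per channel.

PCIE3_GBPS_PER_LANE = 0.985        # GB/s, one direction
DDR4_2400_GBPS_PER_CHANNEL = 19.2  # GB/s

inter_socket_lanes = 32
link_one_way = inter_socket_lanes * PCIE3_GBPS_PER_LANE
print(f"{link_one_way:.1f} GB/s each way, ~{2 * link_one_way:.0f} GB/s aggregate")
# -> 31.5 GB/s each way, ~63 GB/s aggregate (the "only 64 GB/s" figure above)

mem_per_socket = 8 * DDR4_2400_GBPS_PER_CHANNEL
print(f"{mem_per_socket:.1f} GB/s memory bandwidth per socket")
# -> 153.6 GB/s (the "something like 150 GB/s" figure above)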
 

imported_jjj

Senior member
Feb 14, 2009
660
430
136
128 lanes - two sockets - 64 cores - 8 dice total.

That's 16 lanes per die. I had to be corrected on it as well :p It's all very confusing.

That leaves 32 lanes per CPU available for communication (8 per die) between sockets. That's only 64GB/s.

Frankly, that's not much, considering 8-channels of DDR4-2400 being attached to each socket... which is something like 150GB/s of memory bandwidth per CPU... but it might actually be plenty considering the software being run on these systems is fully aware of configurations and running it as a dual-node system with 8 CPUs per node would not be a problem in a server environment (and would have a lot of potential advantages).

Computerbase thinks they have 128 lanes per socket, and that when using dual socket, each CPU uses 64 lanes to link the two.
https://www.computerbase.de/2017-03/amd-naples-cpu-benchmarks.
So maybe there are 32 per die, but that doesn't explain why the desktop part doesn't have more usable lanes.
 
Last edited:

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Computerbase thinks they have 128 lanes per socket, and that when using dual socket, each CPU uses 64 lanes to link the two.
https://www.computerbase.de/2017-03/amd-naples-cpu-benchmarks.
So maybe there are 32 per die.

The die shot looks like there are 32 PCIe lanes, but it's rare that we get as nice a die shot for comparison as we have with Ryzen.

I marked what I think can be identified here:

http://files.looncraz.net/Zen_Die_Ident_Live.jpg

Corrections or suggestions are very welcome.

I'd REALLY like to know what those SRAM caches are for... When I first saw them I thought they might be for inter-CCX communications, but their physical locations suggest other possible uses.

There's definitely more going on than we see in Intel CPUs...
 

Minkoff

Member
Nov 7, 2013
54
8
41
The die shot looks like there are 32 PCIe lanes, but it's rare that we get as nice a die shot for comparison as we have with Ryzen.

I marked what I think can be identified here:

http://files.looncraz.net/Zen_Die_Ident_Live.jpg

Corrections or suggestions are very welcome.

I'd REALLY like to know what those SRAM caches are for... When I first saw them I thought they might be for inter-CCX communications, but their physical locations suggest other possible uses.

There's definitely more going on than we see in Intel CPUs...

Could have something to do with the Neural Net Prediction? SenseMI? Curious indeed...
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
128 lanes - two sockets - 64 cores - 8 dice total.

That's 16 lanes per die. I had to be corrected on it as well :p It's all very confusing.

That leaves 32 lanes per CPU available for communication (8 per die) between sockets. That's only 64GB/s.

Is it not 128 lanes per CPU?

Of which, 64 lanes are for communication between sockets?

Naples will be offered as either a single processor platform (1P), or a dual processor platform (2P). In dual processor mode, and thus a system with 64 cores and 128 threads, each processor will use 64 of its PCIe lanes as a communication bus between the processors as part of AMD’s Infinity Fabric. The Infinity Fabric uses a custom protocol over these lanes, but bandwidth is designed to be on the order of PCIe. As each CPU uses 64 PCIe lanes to talk to the other, this allows each of the CPUs to give 64 lanes to the rest of the system, totaling 128 PCIe 3.0 again.

http://www.anandtech.com/show/11183...ng-in-q2?_ga=1.182003296.698670777.1483630983
 

iBoMbY

Member
Nov 23, 2016
175
103
86
Yes, that's pretty interesting. It seems like they can switch one PCIe 3.0 x16 per Summit Ridge module into one Infinity Fabric link. This could make it possible for them to also switch to this protocol if you connect a Vega to a Ryzen, for example (as I believe was rumored earlier).

Edit: Which could be why the Vega High Bandwidth Cache Controller works so well, because it can access memory at up to about 25 GB/s instead of just up to about 15 GB/s.
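For what it's worth, the ~15 GB/s figure lines up with what a plain PCIe 3.0 x16 link delivers in one direction; the ~25 GB/s figure would imply the same lanes running at a faster fabric rate, which is speculation on my part:

Code:
# Rough numbers behind the 15 vs 25 GB/s comparison above.
# PCIe 3.0: 8 GT/s per lane with 128b/130b encoding -> ~0.985 GB/s per lane per direction.
pcie3_x16 = 16 * 8.0 * (128 / 130) / 8   # GB/s, one direction
print(f"PCIe 3.0 x16: ~{pcie3_x16:.1f} GB/s per direction")      # ~15.8 GB/s

# The ~25 GB/s figure quoted for an Infinity Fabric link over the same 16 lanes
# would correspond to roughly 12.5 GT/s per lane - an assumed rate, not a confirmed spec.
if_x16_assumed = 16 * 12.5 / 8
print(f"Assumed IF link: ~{if_x16_assumed:.1f} GB/s per direction")  # 25.0 GB/s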
 
Last edited:

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Stilt, any idea why WinRAR performance is so abysmal? Is it due to its memory access patterns? From a few reviews I saw, it seems like Ryzen is slower than the 8350 in the ST benchmark, which is very odd.

RAR5 is CPU limited, while LZMA2 is purely memory bandwidth limited when using multiple cores.

Although the two compression algorithms are very similar in terms of compression efficiency and speed, RAR5 is IMO significantly better, as it has one key advantage: RAR5 decompression can be multithreaded, whereas LZMA2 decompression cannot.
In the case of an extremely large archive, RAR5 is significantly faster to decompress when high compression levels have been used.
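If anyone wants to verify this on their own archives, a rough way to time it (assumes the 7z and unrar command-line tools are on the PATH; the archive names are placeholders for your own test files):

Code:
# Quick-and-dirty decompression timing for a .7z (LZMA2) vs .rar (RAR5) archive.
# Assumes the 7z and unrar command-line tools are installed and on the PATH;
# the archive names are placeholders - point them at your own test files.
import subprocess
import time

def time_cmd(cmd):
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start

print("7z  :", time_cmd(["7z", "x", "-otest_out_7z", "test_archive.7z"]), "s")
print("rar :", time_cmd(["unrar", "x", "test_archive.rar", "test_out_rar/"]), "s")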
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
thanks stilt for this amazing thread, but i have a request can anyone here test overclock with bclk vs multiplier, i wan t to know what is the best performance for oc in games bclk or multiplier overclock.

That makes no difference to the CPU.
Sure, the speed can be adjusted more precisely when an external PLL is used, however it makes no difference to the CPU whether the final frequency is generated by a combination of e.g. 30x100 or 24x125.
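In other words, the PLL only sees the final frequency:

Code:
# The core clock is the same regardless of how it is composed:
# 30 x 100 MHz and 24 x 125 MHz both land at 3000 MHz.
combos = [(30, 100.0), (24, 125.0)]
for mult, bclk in combos:
    print(f"{mult} x {bclk:g} MHz = {mult * bclk:.0f} MHz")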
 

Timur Born

Senior member
Feb 14, 2016
277
139
116
RAR5 is CPU limited, while LZMA2 is purely memory bandwidth limited when using multiple cores.
Here is what Igor Pavlov writes about that:

"About 7-Zip / LZMA speed for AMD Ryzen R7.

Decompression speed is OK at Ryzen.
Compression speed in fast mode with small dictionary probably is OK also.
Compression speed with big dictionary is not good. Compression with big dictionary uses big amount of memory and it needs low memory access latency.

And memory access latency is BAD for Ryzen R7.
Look the following review with memory tests:
http://www.hardware.fr/articles/956-22/retour-sous-systeme-memoire.html

Also maybe shared cache in Intel CPUs is better than two separated caches in Ryzen CPUs for multithreaded LZMA compressing. Probably AMD will ask Microsoft to improve thread scheduling to reduce thread walking from one CCX to another CCX. Maybe such fixed thread scheduling can help slightly in some cases, but I'm note sure that it will help for 7-Zip compression.

Probably special thread scheduler that can be embedded to 7-Zip program will help, but it's difficult to develop it. Some versions of Windows don't like when program changes thread affinity. So such feature requires big development tests with different versions of Windows and different types of CPUs. It can be difficult.

But any improvement for memory latency will help for compression speed. I suppose it's difficult for AMD to reduce memory latency in current Ryzens. I hope they will try to fix it in next Ryzen revisions, if they will have strong understanding how memory latency is important for some programs.

I didn't contact with AMD about Ryzen.
"

Although the two compression algorithms are very similar in terms of compression efficiency and speed, RAR5 is IMO significantly better, as it has one key advantage: RAR5 decompression can be multithreaded, whereas LZMA2 decompression cannot.
In the case of an extremely large archive, RAR5 is significantly faster to decompress when high compression levels have been used.
I am not sure that multithreading is the main benefit here, rather than RAR5 archives simply being faster to decompress overall. This is true even if you decompress them via 7-Zip, which only uses a single core for their decompression.

On my 4790K I did a quick comparison, compressing a bunch of PDF files into a solid archive of 4.7 GB (7z) and 4.61 GB (RAR5) size, which is about an 80% compression ratio. Best/Ultra compression setting, 64 MB dictionary size, 4 GB solid blocks. Decompression times:

7Z via 7-Zip: 1:50 min
7Z via WinRAR: 1:54 min
RAR5 via 7-Zip: 0:32 min
RAR5 via WinRAR: 0:22 min

Yes, multithreaded decompression via WinRAR results in 31% faster decompression, but even with single-threaded decompression the RAR5 format decompresses in only 29% of the time that the 7Z archive takes. Interestingly, WinRAR taxes a second core when decompressing the 7Z archive, but still takes longer.

All that being said, if a 4.7 GB archive takes less than 2 minutes to decompress, then you really have to do these chores regularly for the times to matter.

What I take from this is that 7-Zip decompression should maybe be run with its affinity manually set to a single core (CCX).
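For reference, here is a sketch of how that could be done when launching 7-Zip (Windows, using the third-party psutil module; the assumption that CCX0 maps to logical CPUs 0-7 on an 8-core Ryzen with SMT should be verified on your own system):

Code:
# Sketch: launch 7-Zip extraction pinned to one CCX, using psutil (pip install psutil).
# Assumptions: Windows, 7z on the PATH, and CCX0 = logical CPUs 0-7 on an
# 8-core Ryzen with SMT - check your own topology before relying on this mapping.
import subprocess
import psutil

CCX0_CPUS = list(range(8))   # logical CPUs 0-7, assumed to be the first CCX

proc = subprocess.Popen(["7z", "x", "-otest_out", "test_archive.7z"])
psutil.Process(proc.pid).cpu_affinity(CCX0_CPUS)   # restrict the process to CCX0
proc.wait()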
 
Last edited: