Ryzen: Strictly technical

CrazyElf · Mar 7, 2017

The Stilt said:
There is no "uncore" in Ryzen.
CCX essentially operates at it's own (core speed) and the fabrics and the memory controller at half the effective MEMCLK speed.
Uncore is just the term used by CPU-Z for data fabric (DFICLK).

That's basically what I meant - unlike say, Skylake, where the CPU core and uncore are split, on Ryzen it's all the same clocks. That makes sense.

Has anyone seen these Asrock boards?

http://www.asrock.com/MB/AMD/Fatal1ty X370 Professional Gaming/index.asp

Hyper BCLK Engine II
An additional external base clock generator that supports PCIe frequency overclocking. It provides a wider range of frequencies and more precise clock waveforms, allowing any user to get most from their CPU investment with precise, stable overclocking.

They claim a 130 MHz base clock with their "engine". Any idea what that means?

Just marketing or does it actually do anything?

Edit: On Z170 boards apparently this played a role in the "SkyOC" function that allowed non-K CPUs to OC.
http://forum.asrock.com/forum_posts.asp?TID=2200&title=hyper-bclk-engine

The Stilt · Mar 7, 2017

CrazyElf said:
That's basically what I meant - unlike say, Skylake, where the CPU core and uncore are split, on Ryzen it's all the same clocks. That makes sense.

Has anyone seen these Asrock boards?

http://www.asrock.com/MB/AMD/Fatal1ty X370 Professional Gaming/index.asp

They claim a 130 MHz base clock with their "engine". Any idea what that means?

Just marketing or does it actually do anything?

Edit: On Z170 boards apparently this played a role in the "SkyOC" function that allowed non-K CPUs to OC.
http://forum.asrock.com/forum_posts.asp?TID=2200&title=hyper-bclk-engine

Marketing, in practice.
The FCH in Zeppelin is nearly identical as on Carrizo / Bristol Ridge.
There is no need to have an external Pll as the FCH has it's own Plls which are designed for the purpose.
There are some advantages (like ASRock mentions) in using an external Pll, however both will work.

I've used the internal Pll on both of the previous mentioned designs and it definitely can be adjusted to operate beyond 100MHz.
136MHz is the basic ceiling for the internal Pll, however with some tricks you can go higher.

scannall · Mar 7, 2017

Looking at Geekbench results, and a couple of things stand out. The MT results are quite a bit different between Linux and Windows 10. Another indicator of a Windows 10 scheduling problem? Also, the Gigabyte boards seem to be running the best. I'm wondering if running on Windows 7 would bring the MT scores closer to Linux.

CrazyElf · Mar 7, 2017

The Stilt said:
Marketing, in practice.
The FCH in Zeppelin is nearly identical as on Carrizo / Bristol Ridge.
There is no need to have an external Pll as the FCH has it's own Plls which are designed for the purpose.
There are some advantages (like ASRock mentions) in using an external Pll, however both will work.

I've used the internal Pll on both of the previous mentioned designs and it definitely can be adjusted to operate beyond 100MHz.
136MHz is the basic ceiling for the internal Pll, however with some tricks you can go higher.

Yeah that's pretty much what I expected.

That said, the Asrock seems like a decent board, at least once the BIOS matures. Oh, and VRM is total overkill. 12 + 4 phase of 40A TI NextFET Mosfets and 60A chokes. Granted without OC headroom that doesn't matter.

scannall said:
Looking at Geekbench results, and a couple of things stand out. The MT results are quite a bit different between Linux and Windows 10. Another indicator of a Windows 10 scheduling problem? Also, the Gigabyte boards seem to be running the best. I'm wondering if running on Windows 7 would bring the MT scores closer to Linux.

Could be that the Linux Kernel is designed to keep things within a CCX.

Yeah I've heard that Gigabyte has the most mature BIOS, so I'm not surprised. Wait a few weeks and see if the BIOS matures.

virpz · Mar 7, 2017

The Stilt said:
.

I would like to thank you for the work you have put on here sharing knowledge.

I've been following your thread for a few days now and I have one question but please, forgive me if it makes no sense but:

What about that the windows coeherency / affinity not working as supposed plus the Neural Prediction and/or Smart prefetch getting fooled by this issue and so are aggravating the problem ? Does it makes any sense ?

fingerbob69 · Mar 7, 2017

What does this mean for what we've seen so far...RTC bias ....and likely to see in the near future?
http://hwbot.org/newsflash/4335_ryz...bias_w88.110_not_allowed_on_select_benchmarks

looncraz · Mar 7, 2017

fingerbob69 said:
What does this mean for what we've seen so far...RTC bias ....and likely to see in the near future?
http://hwbot.org/newsflash/4335_ryz...bias_w88.110_not_allowed_on_select_benchmarks

I've been curios about this. I think it might explain some of the variation we see in results. We need to do some stop-watch confirmation of some of the scores we've seen.

This might be why some people at AMD thought disabling HPET made the system faster (but couldn't explain why).

The end result: leave HPET **ON** just like AMD's own Ryzen Master utility requires.

CrazyElf · Mar 7, 2017

Turns out the base clock generator will allow for higher RAM overclocks right now.

The way AMD has their BIOS set up, you cannot yet change the memory timings. The high multipliers have very loose timings. The base clock generator allows for higher timings by isolating the Base Clock of the RAM from the CPU and then allowing you to use a lower multiplier by using a higher base clock. That in turn allows for higher memory overclocks, while keeping it at looser timings.

It might be an advantage considering DRAM is the last level cache.

Buildzoid has a pretty good video about it:

looncraz said:
I've been curios about this. I think it might explain some of the variation we see in results. We need to do some stop-watch confirmation of some of the scores we've seen.

This might be why some people at AMD thought disabling HPET made the system faster (but couldn't explain why).

The end result: leave HPET **ON** just like AMD's own Ryzen Master utility requires.

It may be best to tweak to the setting you like then disable the HPET.

Malogeek · Mar 7, 2017

If PCIe is reduced to v2 from BCLK adjustments, does that affect NVMe x4 performance?

looncraz · Mar 7, 2017

Malogeek said:
If PCIe is reduced to v2 from BCLK adjustments, does that affect NVMe x4 performance?

Yes, cuts its theoretical performance in half.

Not that you're likely to notice outside of benchmarks... My Samsung 960 Evo @ PCI-e x4 2.0

I don't have 3.0 numbers... yet.

wahdangun · Mar 7, 2017

thanks stilt for this amazing thread, but i have a request can anyone here test overclock with bclk vs multiplier, i wan t to know what is the best performance for oc in games bclk or multiplier overclock.

CrazyElf · Mar 7, 2017

See here for the 960 Evo:
http://www.thessdreview.com/featured/samsung-960-evo-m-2-nvme-ssd-review-250gb1tb/3/

lopri · Mar 8, 2017

I am a little confused about PCIe configuration of Summit Ridge. Exactly how many lanes are user accessible, and how many are reserved for system?

inf64 · Mar 8, 2017

Stilt, any idea why Winrar performance is so abysmal? Is it due to memory access patterns it has? From a few reviews I saw it seems like Ryzen is slower than 8350 in ST benchmark score which is very odd.

lopri · Mar 8, 2017

If 24 lanes are user accessible, does that mean Infinity Fabric is taking up 8 lanes? AMD is claiming 128 lanes for Naples, which come from 4 Zeppelin dies. Wouldn't there be higher bandwidth requirement for Infinity Fabric on Naples than Summit Ridge, in which case the available lanes will be fewer than 96? I am curious how Infinity Fabric works here - there is almost no information about it. Its bandwidth/latency as well as how it scales are an almost complete unknown.

looncraz · Mar 8, 2017

lopri said:
If 24 lanes are user accessible, does that mean Infinity Fabric is taking up 8 lanes? AMD is claiming 128 lanes for Naples, which come from 4 Zeppelin dies. Wouldn't there be higher bandwidth requirement for Infinity Fabric on Naples than Summit Ridge, in which case the available lanes will be fewer than 96? I am curious how Infinity Fabric works here - there is almost no information about it. Its bandwidth/latency as well as how it scales are an almost complete unknown.

128 lanes - two sockets - 64 cores - 8 dice total.

That's 16 lanes per die. I had to be corrected on it as well

It's all very confusing.

That leaves 32 lanes per CPU available for communication (8 per die) between sockets. That's only 64GB/s.

Frankly, that's not much, considering 8-channels of DDR4-2400 being attached to each socket... which is something like 150GB/s of memory bandwidth per CPU... but it might actually be plenty considering the software being run on these systems is fully aware of configurations and running it as a dual-node system with 8 CPUs per node would not be a problem in a server environment (and would have a lot of potential advantages).

imported_jjj · Mar 8, 2017

looncraz said:
128 lanes - two sockets - 64 cores - 8 dice total.

That's 16 lanes per die. I had to be corrected on it as well It's all very confusing.

That leaves 32 lanes per CPU available for communication (8 per die) between sockets. That's only 64GB/s.

Frankly, that's not much, considering 8-channels of DDR4-2400 being attached to each socket... which is something like 150GB/s of memory bandwidth per CPU... but it might actually be plenty considering the software being run on these systems is fully aware of configurations and running it as a dual-node system with 8 CPUs per node would not be a problem in a server environment (and would have a lot of potential advantages).

Computerbase thinks they have 128 lanes per socket and when using dual socket, each CPU uses 64 lanes to link the 2.
https://www.computerbase.de/2017-03/amd-naples-cpu-benchmarks.
So maybe there are 32 per die but that doesn't explain why desktop doesn't have more usable.

looncraz · Mar 8, 2017

imported_jjj said:
Computerbase thinks they have 128 lanes per socket and when using dual socket, each CPU uses 64 lanes to link the 2.
https://www.computerbase.de/2017-03/amd-naples-cpu-benchmarks.
So maybe there are 32 per die.

The die shot looks like there are 32 PCI-e lanes, but it's rare we get such a nice die shot for comparison as we have with Ryzen.

I marked what I think can be identified here:

http://files.looncraz.net/Zen_Die_Ident_Live.jpg

Corrections or suggestions are very welcome.

I'd REALLY like to know what those SRAM caches are for... When I first saw them I thought they might be for inter-CCX communications, but their physical locations suggest other possible uses.

There's definitely more going on than we see in Intel CPUs...

Minkoff · Mar 8, 2017

looncraz said:
The die shot looks like there are 32 PCI-e lanes, but it's rare we get such a nice die shot for comparison as we have with Ryzen.

I marked what I think can be identified here:

http://files.looncraz.net/Zen_Die_Ident_Live.jpg

Corrections or suggestions are very welcome.

I'd REALLY like to know what those SRAM caches are for... When I first saw them I thought they might be for inter-CCX communications, but their physical locations suggest other possible uses.

There's definitely more going on than we see in Intel CPUs...

Could have something to do with the Neural Net Prediction? SenseMI? Curious indeed...

Atari2600 · Mar 8, 2017

looncraz said:
128 lanes - two sockets - 64 cores - 8 dice total.

That's 16 lanes per die. I had to be corrected on it as well It's all very confusing.

That leaves 32 lanes per CPU available for communication (8 per die) between sockets. That's only 64GB/s.

Is it not 128 lanes per CPU?

Of which, 64 lanes are for communication between sockets?

Naples will be offered as either a single processor platform (1P), or a dual processor platform (2P). In dual processor mode, and thus a system with 64 cores and 128 threads, each processor will use 64 of its PCIe lanes as a communication bus between the processors as part of AMD’s Infinity Fabric. The Infinity Fabric uses a custom protocol over these lanes, but bandwidth is designed to be on the order of PCIe. As each core uses 64 PCIe lanes to talk to the other, this allows each of the CPUs to give 64 lanes to the rest of the system, totaling 128 PCIe 3.0 again.

http://www.anandtech.com/show/11183...ng-in-q2?_ga=1.182003296.698670777.1483630983

iBoMbY · Mar 8, 2017

Yes, that's pretty interesting. It seems like they can switch one PCIe 3.0 x16 per Summit Ridge module into one Infinity Fabric Link. This could make it possible, that they'd also be able to switch to this protocol, if you connect a Vega to a Ryzen, for example (like I believe it had been rumored earlier).

Edit: Which could be why the Vega High Bandwidth Cache Controller works so good, because it can access the memory with up to about 25 GB/s instead of just up to about 15 GB/s.

The Stilt · Mar 8, 2017

inf64 said:
Stilt, any idea why Winrar performance is so abysmal? Is it due to memory access patterns it has? From a few reviews I saw it seems like Ryzen is slower than 8350 in ST benchmark score which is very odd.

RAR5 is CPU limited, while LZMA2 is purely memory bandwidth limited when using multiple cores.

Despite the two compression algorithms are very similar in terms of compression efficiency and speed, RAR5 is IMO significantly better as it has an advantage: RAR5 decompression can be multithreaded whereas LZMA2 cannot.
In case of an extremely large archive size, RAR5 is significantly faster to decompress when high compression levels have been used.

The Stilt · Mar 8, 2017

lopri said:
I am a little confused about PCIe configuration of Summit Ridge. Exactly how many lanes are user accessible, and how many are reserved for system?

2x x16 PCie lanes per Zeppelin die.
All of them are usable, however 8 of them are consumed by the eFCH (Promontory), when present.

The Stilt · Mar 8, 2017

wahdangun said:
thanks stilt for this amazing thread, but i have a request can anyone here test overclock with bclk vs multiplier, i wan t to know what is the best performance for oc in games bclk or multiplier overclock.

That makes no difference to the CPU.
Sure, the speed can be adjusted more accurately (when external Pll) is used, however it makes no difference to the CPU if the Pll frequency is generated either by combination of e.g. 30x100 or 24x125.

Timur Born · Mar 8, 2017

The Stilt said:
RAR5 is CPU limited, while LZMA2 is purely memory bandwidth limited when using multiple cores.

Here is what Igor Pavlov writes about that:

"About 7-Zip / LZMA speed for AMD Ryzen R7.

Decompression speed is OK at Ryzen.
Compression speed in fast mode with small dictionary probably is OK also.
Compression speed with big dictionary is not good. Compression with big dictionary uses big amount of memory and it needs low memory access latency.

And memory access latency is BAD for Ryzen R7.
Look the following review with memory tests:
http://www.hardware.fr/articles/956-22/retour-sous-systeme-memoire.html

Also maybe shared cache in Intel CPUs is better than two separated caches in Ryzen CPUs for multithreaded LZMA compressing. Probably AMD will ask Microsoft to improve thread scheduling to reduce thread walking from one CCX to another CCX. Maybe such fixed thread scheduling can help slightly in some cases, but I'm note sure that it will help for 7-Zip compression.

Probably special thread scheduler that can be embedded to 7-Zip program will help, but it's difficult to develop it. Some versions of Windows don't like when program changes thread affinity. So such feature requires big development tests with different versions of Windows and different types of CPUs. It can be difficult.

But any improvement for memory latency will help for compression speed. I suppose it's difficult for AMD to reduce memory latency in current Ryzens. I hope they will try to fix it in next Ryzen revisions, if they will have strong understanding how memory latency is important for some programs.

I didn't contact with AMD about Ryzen."

Despite the two compression algorithms are very similar in terms of compression efficiency and speed, RAR5 is IMO significantly better as it has an advantage: RAR5 decompression can be multithreaded whereas LZMA2 cannot.
In case of an extremely large archive size, RAR5 is significantly faster to decompress when high compression levels have been used.

I am not sure that multi-threading is the main benefit here, rather than that RAR5 archives are overall faster to decompress. This even is true if you decompress them via 7-Zip, which only uses a single core for their decompression.

On my 4790K I did a quick comparison, compressing a bunch of PDF files into a solid archive of 4.7 gb (7z) and 4.61 gb (RAR5) size, that about 80% compression ratio. Best/Ultra compression setting, 64 mb booksize, 4 gb solid blocks. Decompression times:

7Z via 7-Zip: 1:50 min
7Z via Winrar: 1:54 min
RAR5 via 7-Zip: 0:32 min
RAR5 via Winrar: 0:22 min

Yes, multi-threaded decompression via Winrar results in 31% faster decompression, but even via single-threaded decompression the RAR5 format decompresses in only 29% of the time that the 7Z archive takes. Interestingly Winrar taxes a second core for decompressing the 7Z archive, but still takes longer.

All that being said, if a 4.7 gb archive takes less than 2 minutes to decompress than you really have to do these chores regularly to make the times matter.

What I take from this, is that 7-Zip decompression should maybe be run with manually settings its affinity to a single core (CCX).

Ryzen: Strictly technical

Member

Golden Member

Golden Member

Member

Junior Member

Member

Senior member

Member

Golden Member

Senior member

Golden Member

Member

Elite Member

Diamond Member

Elite Member

Senior member

Senior member

Senior member

Member

Golden Member

Member

Golden Member

Golden Member

Golden Member

Senior member