Ryzen: Strictly technical

The Stilt · Mar 4, 2017

CatMerc said:
Is it something you've heard or just saying that for posterity?

"Lazarus".

zir_blazer said:
Sorry The Stilt for hijacking a bit your interesing Thread, but for anyone that is into Passthrough as I do, I got the lspci output from Patrick from ServeTheHome from a Ryzen in an ASUS PRIME B350-PLUS. The PCI topology looks like this. I actually find it rather clean and flexible.
I hear from this video that there seems to be poor isolation, and everything gets in the same IOMMU Group, or something like that (I don't see that they show log info or anything related to see how it actually looks). However, that means that they DID enable the IOMMU for the Linux Kernel (Else you won't get the IOMMU Groups constructs) and the Kernel didn't panicked when doing so. Since its an isolation issue, what seems to be broken is PCIe ACS, AMD-Vi/IOMMU itself works. However, with no further info is impossible to know what is broken.
For reference, Skylake and Kaby Lake Chipsets also initially had ugly IOMMU Groups because the Chipset has a PCIe ACS related Errata, the ACS is found at an slight offset compared to the what the specifications says that it should be found at. Thus, for proper functionality, they require that you're using a Linux Kernel that had the fix to the quirk included so you get proper IOMMU Grouping. Since most people focuses on Ubuntu, chances are that if there was some last minute work or fixes for Ryzen that got include in the latest Linux Kernel, they are missing them. A bleeding edge distribution like Arch Linux, or a self compiled Linux Kernel from git, could be more interesing for Ryzen.

A thing which extremely surprises me is Ryzen performance-per-Watt, is like AMD not only got competitive with Intel, but can actually beat it if you focus in that metric. I think that Ryzen weakness is the low Frequency ceiling, 4-4.2 GHz is not enough to put a good show against high Frequency Kaby Lakes in classical desktop workloads and gaming, but for Linux Server workloads it is AMAZINGLY IMPRESSIVE. Also, the binning that 1800X may require puts it extremely close to the "factory overclock" definition, taking away the performance-per-Watt and the fun of overclocking. Ryzen could put an excellent show if it stays at the 3.3 GHz Critical 1 Point, which is why the 1700 looks soo good.
On Server workloads where performance-per-Watt is more important, is probable that Ryzen may be dramatically better. I think that ultimately, that is what will help AMD the most, especially from a profitability point of view, since AMD marketshare on Servers (Which got higher profit magins) was pretty much nonexistant since Sandy Bridge, and Ryzen could help get in with force. Desktop workloads doesn't showcase what it can truly do, it just shows its weakness. It could have made more sense if they focused on Server-first, as the original AMD K8 where Opterons came like 6 months before Athlons 64, but all the present issues shows instead that it would have been a product too inmature for that...
I believe that Ryzen is "Bulldozer done right". I had the same expectatives from Ryzen that I had from Bulldozer (Reducing the ST gap with Intel to the "good enough" point, but dramatically better MT performance at similar price points), just that this time AMD delivered.

BTW, what are the chances that we may see a midterm Ryzen refresh? I remember the Phenom II C2 vs C3 and Piledriver FX 8320E (Which was a more optimized Stepping. I recall a Thread from The Stilt talking about it some years ago). If Ryzen 4C/6C parts gets an extra 300-500 MHz headroom, it will seriously threat mainstream Kaby Lakes in ST. But at that point, it may have to face Coffee Lake instead...

I think we'll know much more about the IOV by the time the server SKUs (Naples and Snowy Owl) hit the market. There is no way to launch the server platform without having the virtualization features fully functional and documented.
Starting from Carrizo AMD has put surprisingly ample amount of resources into Linux.

If I would get to decide what AMD would do with Zeppelin:

- Iron out all of the existing shenanigans (obviously), where possible.
- Rewrite the Turbo & XFR algorithms in the SMU: Turbo & XFR are maintained during "OC-Mode" operation, Turbo & XFR CPUFID/CPUDFSId/CPUVID are made user configurable
- Start porting Zeppelin on 16nm FF+. Porting the design and all of the re-tooling it involves will be extremely costly, however it would be definitely worth it if it allows hitting even 300MHz higher Fmax on average, which it most likely would. Release as a refresh e.g. 1750, 1750X, 1850X.

The Stilt · Mar 4, 2017

looncraz said:
Can you see if applying all Windows 10 updates makes any difference in performance?

Maybe try the fast track as well?

All of the updates have been applied, however I need to try those fast tracks.

The Stilt · Mar 4, 2017

thigobr said:
Would be possible to test PCIE bandwidth to the graphics card?

I might have some tools for that, but I'm not certain if they are compatible with Zeppelin. Need to check it out.

JimmiG · Mar 4, 2017

How likely are you to hit the "few cores" boost frequency (e.g. 4 GHz for 1800X)?

I'm seeing 3.7 GHz pretty much constantly with my 1800X, even when running Furmark or Prime95 with one thread. Shouldn't it be hitting 4 GHz under such a workload? Maybe it's the Windows thread shuffling that causes more than 2 cores to be partially active at all times, blocking the boost?

I don't think there are any problems with my cooling or power supply since I'm always constantly at the "all cores boost" of 3.7 GHz rather than the base 3.6 GHz. My CPU-Z score seems to mirror those in the reviews, both for single-threaded and 16 threads (2140 / 19203 respectively), which leads me to believe Boost behaves like this for everyone.

Mockingbird · Mar 4, 2017

duplicated post

The Stilt · Mar 4, 2017

JimmiG said:
How likely are you to hit the "few cores" boost frequency (e.g. 4 GHz for 1800X)?

I'm seeing 3.7 GHz pretty much constantly with my 1800X, even when running Furmark or Prime95 with one thread. Shouldn't it be hitting 4 GHz under such a workload? Maybe it's the Windows thread shuffling that causes more than 2 cores to be partially active at all times, blocking the boost?

I don't think there are any problems with my cooling or power supply since I'm always constantly at the "all cores boost" of 3.7 GHz rather than the base 3.6 GHz. My CPU-Z score seems to mirror those in the reviews, both for single-threaded and 16 threads (2140 / 19203 respectively), which leads me to believe Boost behaves like this for everyone.

I suggest you check the frequencies using the newest HWInfo rather than with CPU-Z.
Also Prime95 is currently not working properly on Ryzen, so I suggest you try with another workload.

But yeah, generally during a true single core workload you should be able to sustain 4.0 - 4.1GHz on 1800X.

Mockingbird · Mar 4, 2017

The Stilt said:
All of the updates have been applied, however I need to try those fast tracks.

You should do a clean install because Coreinfo shows that something is not configured correctly.

Other users report theirs look like this:

starheap said:

Here is my Coreinfo output on windows 10 with an 1800x

Code:

Logical to Physical Processor Map:
**--------------  Physical Processor 0 (Hyperthreaded)
--**------------  Physical Processor 1 (Hyperthreaded)
----**----------  Physical Processor 2 (Hyperthreaded)
------**--------  Physical Processor 3 (Hyperthreaded)
--------**------  Physical Processor 4 (Hyperthreaded)
----------**----  Physical Processor 5 (Hyperthreaded)
------------**--  Physical Processor 6 (Hyperthreaded)
--------------**  Physical Processor 7 (Hyperthreaded)

Logical Processor to Socket Map:
****************  Socket 0

Logical Processor to NUMA Node Map:
****************  NUMA Node 0

No NUMA nodes.

Logical Processor to Cache Map:
**--------------  Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
**--------------  Instruction Cache   0, Level 1,   64 KB, Assoc   4, LineSize  64
**--------------  Unified Cache       0, Level 2,  512 KB, Assoc   8, LineSize  64
********--------  Unified Cache       1, Level 3,    8 MB, Assoc  16, LineSize  64
--**------------  Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
--**------------  Instruction Cache   1, Level 1,   64 KB, Assoc   4, LineSize  64
--**------------  Unified Cache       2, Level 2,  512 KB, Assoc   8, LineSize  64
----**----------  Data Cache          2, Level 1,   32 KB, Assoc   8, LineSize  64
----**----------  Instruction Cache   2, Level 1,   64 KB, Assoc   4, LineSize  64
----**----------  Unified Cache       3, Level 2,  512 KB, Assoc   8, LineSize  64
------**--------  Data Cache          3, Level 1,   32 KB, Assoc   8, LineSize  64
------**--------  Instruction Cache   3, Level 1,   64 KB, Assoc   4, LineSize  64
------**--------  Unified Cache       4, Level 2,  512 KB, Assoc   8, LineSize  64
--------**------  Data Cache          4, Level 1,   32 KB, Assoc   8, LineSize  64
--------**------  Instruction Cache   4, Level 1,   64 KB, Assoc   4, LineSize  64
--------**------  Unified Cache       5, Level 2,  512 KB, Assoc   8, LineSize  64
--------********  Unified Cache       6, Level 3,    8 MB, Assoc  16, LineSize  64
----------**----  Data Cache          5, Level 1,   32 KB, Assoc   8, LineSize  64
----------**----  Instruction Cache   5, Level 1,   64 KB, Assoc   4, LineSize  64
----------**----  Unified Cache       7, Level 2,  512 KB, Assoc   8, LineSize  64
------------**--  Data Cache          6, Level 1,   32 KB, Assoc   8, LineSize  64
------------**--  Instruction Cache   6, Level 1,   64 KB, Assoc   4, LineSize  64
------------**--  Unified Cache       8, Level 2,  512 KB, Assoc   8, LineSize  64
--------------**  Data Cache          7, Level 1,   32 KB, Assoc   8, LineSize  64
--------------**  Instruction Cache   7, Level 1,   64 KB, Assoc   4, LineSize  64
--------------**  Unified Cache       9, Level 2,  512 KB, Assoc   8, LineSize  64

iBoMbY said:
Yes, I got mine today as well, and my Coreinfo looks exactly like yours, and it seems to be correct so far (only maybe a NUMA node per CCX could be nice). Seems like they fixed that cache thing with some microcode update?

The Stilt · Mar 4, 2017

cytg111 said:
(also, why is this review in at forums? should have its own space imo)

Where would this "own place" be, in your opinion?

cytg111 · Mar 4, 2017

The Stilt said:
Where would this "own place" be, in your opinion?

I dont know, anandtechs, ars, your own site perhaps? I understand that politics would follow and I do enjoy the objectivity of this review, so theres that...
I am just saying I think its good and I think you could get paid.. thats all.

imported_jjj · Mar 4, 2017

Computerbase got a 17% boost in avg FPS in Total War: Warhammer when disabling SMT.
At 1080p with not sure what GPU as i don't speak german but Titan X or GTX 1080.

JimmiG · Mar 4, 2017

The Stilt said:
I suggest you check the frequencies using the newest HWInfo rather than with CPU-Z.
Also Prime95 is currently not working properly on Ryzen, so I suggest you try with another workload.

But yeah, generally during a true single core workload you should be able to sustain 4.0 - 4.1GHz on 1800X.

I guess something is wrong then after all

I'm seeing 4.1 GHz "Maximum" in HWInfo, but it's only sustaining 3.7 GHz.

Annoyingly it's impossible to even get trustworthy temperature readings because the BIOS from 28 February shows 56C idle, 74C full load, the one from 2/24 shows 34C idle and 58C full load. Don't want to rip out the heatsink and buy a new one only to find it was a BIOS issue all along...

bongey · Mar 4, 2017

Does it support 64-bit PCI addressing/Above 4G decoding?
Trying to figure out if Xeon Phi will work in it.

formulav8 · Mar 4, 2017

The Stilt said:
this "own place" be

I could give you your own place on my local computer side-job site I have. (not listed

Edit: This is a high profile site

gupsterg · Mar 4, 2017

@The Stilt

Thank you for the article

.

Any chance you can fill in the x.xGHz ACXFC for a R7 1700?

For example, for the 1700 SKU the clock configuration is following: 3.0GHz all core frequency (MACF), 3.7GHz single core frequency (MSCF), x.xGHz maximum all core XFR ceiling (ACXFRC) and 3.75GHz maximum single core XFR ceiling (SCXFRC).

starheap · Mar 4, 2017

For those curious mine was on a totally fresh install done from ryzen, and i used a freshly downloaded iso image from microsoft.

Mockingbird · Mar 4, 2017

Two questions for @The Stilt

1. Since the frequency of the data fabric is fixed to the memory frequency at a ratio of 1:2, does this mean that using faster memory would result in much faster performance and in what tasks would having faster fabric frequency be most beneficial?

It seems that fixing the data fabric frequency to the memory frequency impose significant restriction to the data fabric. In the future, could the data fabric frequency be decoupled from the frequency of the memory controller or perhaps the ratio could be changed for higher data fabric frequency?

2. Since low clock speed is the result of Samsung 14nm LPP and that increasing frequency beyond Critical 2 would require significantly higher voltage, would it be beneficial for AMD to instead move its production of high frequency products to TSMC?

Or rather, should AMD be focused on increasing its IPC at the given frequency?

Encrypted11 · Mar 4, 2017

There seem to be no public record of IO testing on the internet with rushed reviews.

Will you test the IO?

starheap · Mar 4, 2017

So i have a theory regarding this HLSL_Instancing test you people were doing. It is an x86 not x64 program. For example if you run the CPU-Z bench with the x86 version the performance is less than half of the x64 version. So could it just be that ryzen/amd just has bad x86 performance?

The source code is included with it so technically one could attempt to recompile it for x64 to try and confirm this.

Kromaatikse · Mar 4, 2017

I doubt it. The same general deficit in graphics performance also shows up under Linux, where 64-bit code is more ubiquitous, and most of the games showing trouble are also 64-bit.

IMHO all signs point to BIOS or driver problems, not a flaw in the CPU itself.

Ajay · Mar 4, 2017

looncraz said:
Windows load-balances the cores, so the heavy-hitter threads are being moved around between differing cores (but not the SMT thread on the same core) every 10ms or so (Windows kernel scheduling interrupt interval). As was mentioned, you're seeing an average over 0.5 second or more, so it will appear that no core is being fully utilized - but they are... momentarily.

This process, though, makes a few issues with Ryzen.

1. It effectively prevents 'AI' prefetch adaptation
- so 10~15% of its total performance is lost right there (if AMD is to be believed)

2. It shuffles data across CCXes about 50% of the time.
- This damages data locality and causes new fetches from memory.

3a. A driver may detect this cache behavior and then load up VRAM to the max for better performance...
OR
3b. nVidia is intentionally loading more data on AMD CPUs... for whatever reason.

Hmm, all because of Windows odd behavior vis-a-vis thread allocation. Windows NT, at least back to 3.5, switches threads between cores (then CPUs) according to some algorithm (always looked random to me). I remember a spat on COMP.ARCH between some server dude and Dave Cutler over this on a dual CPU system - task manager showed exactly 50% utilization on each CPU when running a single threaded process. AMD, apparently, couldn't afford too design and implement two separate CPUs for client and server (with both being monolithic)

re: 1) AMD had to know this, just based on the design. No easy way to fix it without sort of breaking windows. I think Linux is smarter about threads and data locality - probably why it performs so well with Ryzen (haven't looked at Linux scheduling in a long time, so AFAIK). Even within one CCX this will be an issue, since prefetches are to the private caches.

re: 2) As Stilt pointed out, a ring 0 proggie can change this via core parking or from the command prompt with the /affinity switch can be used. I've been using Process Lasso for years to deal with this issue in some, high performance, programs (and to manipulate priority levels). It sounds like MS may be waiting for the April launch of Redstone 2 ("creators update") to fix this within the windows scheduler when Zeppelin CPU is detected. I would guess that the fix may already be in the latest Windows "Fast Ring" updates or will be in the next couple of weeks.

I do wonder if AMD is planning on creating a monolithic design for 7nm? I don't know if their fabric can support > 12 cores with a single shared L3$ or if they can even afford to do that (since it must already be in development). If AMD is successful and is able afford to spend more on CPU development - I think we'll see something even more impressive than Zen (well, in absolute terms).

CrazyElf · Mar 4, 2017

I guess at this point we'll just have to wait for more applications to be optimized for this unique topology.

The lesson is the minimize the communication between the 2 CCX as much as possible. We may have gains that use fewer than 4 cores just use 1 CCX, while the rest of the operating system uses the rest. With the SMT off and some decent RAM overclcoks, we should actually get some decent gaming performance, perhaps even approaching the workstation benchmarks.

The Stilt said:
There are many more fabrics in Zeppelin than just the data fabric.
I'd assume the inter-CCX fabric frequency is 4x DFICLK (i.e. 5333MHz @ 2666MHz DRAM), however I don't know it as a fact.

Any idea what the speed is of the inter-CCX interconnects?

From PCGH.de: http://www.pcgameshardware.de/Ryzen-7-1800X-CPU-265804/Tests/Test-Review-1222033/

Zwei dieser CCX kommunizieren in Summit Ridge (Ryzen R7) intern über das sogenannte Inifinty Fabric, eine kohärente Verbindung. Diese Verbindung erreicht laut AMD rund 22 GByte Datendurchsatz pro Sekunde bei Single-Thread-Aufgaben und gemischten Lese-/Schreibzugriffen. Das Infinity Fabric lotst Datenanfragen an den Speichercontroller und prüft gleichzeitig den L3-Cache anderer CCX auf Verfügbarkeit der angefoderten Daten. Je nachdem, woher schneller eine Rückmeldung kommt, wird entweder auf den anderen L3 (in den allermeisten Fällen) oder auf den Speicher zugegriffen. Die Speicheranfrage wird annulliert, sofern die Daten bereits im L3 vorhanden sind.

PCGH.de like Hardware.fr says the same speed. 22 GB/s seems to be the speed, which seems slow. For a comparison, the QPI links for Haswell EP are 9.6 GT/s, which works out to 38.4 GB/s.

There's got to be a latency penalty for using RAM as last level cache here.

looncraz said:
This is how software should treat it. DRAM as LLC... not what I expected at all.

This means memory frequency is all the more important. Ryzen may be more bandwidth sensitive than latency sensitive as a result.

I apparently won't have a motherboard for a couple weeks (ARGH!) thanks to Amazon's ineptness, otherwise I would test this. For now, I'm going to write an app to do just that.

Yeah I really think that this CPU needed an L4 cache of some sort and a faster interconnect between the CCX.

Once the BIOS fixes are in, we'll need to see if the memory overclocking is more sensitive to frequency or tight timings.

piesquared · Mar 4, 2017

It will be interesting to hear developers take on this CPU.

crashtech · Mar 4, 2017

Certainly Linux is not perfect, it has a scheduler, too. Time will tell if some patches can be applied to inform the scheduler of Ryzen's new architecture.

deadhand · Mar 4, 2017

I've spent some time working with two people ('Longcat' and 'iWalkingCorpse' from AMD discord) with Ryzen systems to test Valve's CS: Go map compiling tools on Ryzen. I believe the results support the notion that the inter-CCX fabric can cause potentially significant slow-downs when the scheduler is allowed to freely shuffles threads between CCX's.

A while ago I discovered that if I use the map compiling tools across my dual E5-2680's (8 core / 16 thread SandyBridge-EP processors w/ 2.7 base, 3.5 boost clocks), I get negative scaling. By using an application called 'Process Lasso', I forced all 16 threads from 'vrad' (radiosity computations for lightmaps) and 'vvis' (generates visibility sets) to a single CPU. Performance went up dramatically. Given Ryzen's floor plan, I thought perhaps there would be a similar issue. Sure enough, there is, and here are the results:

Note: In all cases, 'vrad' (the longest running program, by far) was set to use 8 threads. This was so it aligns nicely with a single CCX.

Additionally, the results from the E5-2680 are not really intended to compare directly against the Ryzen CPUs, except to note that the benefit of having all threads mapped to the first 8 hardware threads is significantly greater on Ryzen than on my Intel CPU. Further, the dual-processor test was to show how these tools behave in a NUMA environment.

EDIT: I should also note that the e5-2680 can do an all-core boost of around ~3 ghz. I'm also using Registered ECC memory @ 1333 mhz per stick in quad channel (for each CPU), with a total of 16 modules.

(Lower is better)

The green bars are when the thread affinity is set to a single CCX, or in the case of my E5-2680's, the first 8 hardware threads. While the scheduler still shuffles the threads around within it, it appears much faster.

I should also note that there is more variance the worse the results are - I've taken the best of each set of results, but other runs on the 2 CPU / physical cores only test had a 135 point variance (908 seconds in the worst result), whereas the better performing tests, the green results - had significantly less variance between runs (~10 seconds between best and worst).

Additionally, it's interesting to note that the R7-1700 is getting better results than the R7-1800x system on the cross-CCX tests (though this may be margin of error). The RAM used by the owner of the OC'd r7-1700 is DDR4-3000 mhz, while owner of the R7-1800x is using DDR4-2400 memory.

Lastly, please take these results with a grain of salt. Testing was difficult due to multiple people testing on different machines. I wouldn't post these results if there wasn't such an obvious spread in performance between the different affinity configurations. If anyone is interested in exact methodologies, I can post below or edit this post.

Please also note that I chose these compile tools for testing as they are an excellent example of the kind of memory bottlenecks that can occur in multi-threading. It also does not scale well beyond 8 threads on 4 physical CPUs (though all tests i ran involved using the '-threads 8' flag to only allow vrad to use 8 main worker threads)

DisEnchantment · Mar 4, 2017

So probably this could explain why disabling SMT improves performance in non threaded application like games.

With threaded workloads the shuffling across CCXs is not there whereas in lightly threaded workloads what you mentioned above is a bottle neck. So it could explain why the massive MultiThread performance of Ryzen does not translate into game performance. It is actually a performance penalty.

Question though, would disabling a complete CCX actually help reduce the penalty, could someone with the Ryzen chips please try running the gaming bench with one of the CCX disabled and SMT disabled?

Ryzen: Strictly technical

Golden Member

Golden Member

Golden Member

Platinum Member

Senior member

Golden Member

Senior member

Golden Member

Lifer

Senior member

Platinum Member

Junior Member

Diamond Member

Junior Member

Junior Member

Senior member

Junior Member

Junior Member

Member

Lifer

Member

Golden Member

Lifer

Junior Member

Golden Member