Ryzen: Strictly technical


knowndragon

Junior Member
Apr 3, 2017
17
4
36
Lots of info to read here. I appreciate the OP for taking the time. This is being seen and linked on other forums I belong to.

So, as a rule of thumb (I have not read all the posts in this thread yet, but I will): if you can't get past the original XFR frequency, is the best thing to just leave it at stock? I am going to try a base-clock overclock combined with a multiplier adjustment, if that is possible.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Thanks. I just had a case where I closed and reopened the browser and found the main text content empty (Looncache enabled). That was the "X Threads" page. I then clicked on the "1 Thread" page and it was empty, too. Only once I clicked the "Home" page could I access the other pages again.

Thanks, I refreshed the cached versions in case they get used again. I was originally planning to use PHP charts, which are compute-heavy, so I was going to cache the pages full-time in memory. Not sure how the empty cached versions came up at all, though... pretty strange.
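One classic way empty pages end up in a cache is storing the output of a render that failed. A minimal guard, sketched in Python purely for illustration (the actual site logic is PHP, and this failure mode is only a guess, not a diagnosis):

Code:
# In-memory full-page cache that refuses to store empty/failed renders,
# so one bad render can't keep serving blank pages until the next refresh.
cache = {}

def get_page(key, render):
    if key not in cache:
        html = render()
        if not html or not html.strip():
            return html  # serve the bad result once, but never cache it
        cache[key] = html
    return cache[key]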
 
Reactions: Drazick

Chl Pixo

Junior Member
Mar 9, 2017
11
2
41
@looncraz good info.
Now if only virtualization were not so lacking, it would be great.
Still seeing a big performance drop on a passed-through GPU.

I am curious whether the new AGESA resolves the problem with IOMMU and AMD cards in the chipset slot.
Currently no Linux will boot, at least on the ASUS Prime X370-Pro, if there is a GCN-based card in the chipset slot.
Tested this with an RX 460, R9 290, and R7 260X.
An old HD 6450 works fine, and from what I have read online, Nvidia cards work too.
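For anyone else chasing this, a minimal sketch for checking which devices share an IOMMU group (standard sysfs layout, nothing board-specific; run it once Linux actually boots):

Code:
#!/usr/bin/env python3
# List IOMMU groups and the PCI devices in each, straight from sysfs.
# The usual first step when debugging VFIO/passthrough grouping issues.
from pathlib import Path

groups = Path("/sys/kernel/iommu_groups")
if not groups.is_dir():
    raise SystemExit("No IOMMU groups - is the IOMMU enabled in BIOS and kernel?")

for group in sorted(groups.iterdir(), key=lambda p: int(p.name)):
    for dev in sorted((group / "devices").iterdir()):
        print(f"group {group.name}: {dev.name}")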
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
http://zen.looncraz.net/ now online.

For the next few weeks, maybe as much as a month, I won't have time to finish more than what has already been done.

At that point I should also have more Intel results, as I will be upgrading an Ivy Bridge Xeon system to Ryzen and will have the parts on consignment for a short while (long enough to run a series of benchmarks at 3GHz with and without Hyper-Threading).
Someone noticed that the 1T SB/XV charts are swapped. Edit: they seem OK to me, though.
 
Reactions: Drazick

mattiasnyc

Senior member
Mar 30, 2017
356
337
136
It would be a very strange design if an onboard PCI bridge would slow down CPU bound x16 then. It's rather off-topic, but since the question came up, here is what I discuss in a Gearslutz (forum) thread:

I don't think it's off-topic, it relates to strictly technical issues.... And I'm the same Mattias as on Gearslutz! :)

I don't think the CPU-bound x16 is slowing down; I just think it's an odd coincidence that the boards with legacy PCI have chosen not to offer x8/x8 on the CPU's PCIe 3.0 x16 lanes.

Either way, as soon as I get confirmation that the adapter outputs the correct voltage for my Lynx card, I'll probably settle for that instead of onboard legacy PCI. That'll let me run x8/x8 for video work in the future.
 

dnavas

Senior member
Feb 25, 2017
355
190
116
Someone noticed that the 1T SB/XV charts are swapped. Edit: they seem OK to me, though.

Yes, I see all of the charts swapped -- SandyBridge on top. :shrug: I'm more interested in all the remaining pages. I hope your time frees up.

I've been using Ryzen for video editing, and I've noticed two things.
First, the lack of quicksync really hampers decode acceleration, to the point where a 7700k may well be a better bet for some users. Single-threaded UHD avc decode in my NLE can't complete at 60p, so unless the video is packaged with multiple slices (two might work at ~4.1G, but I'm only stable at 3.9), I'm not real-time in my software. That's a problem. Fortunately I have quad-slice cam output, so I'm fine for now, and can only hope that the software eventually makes use of nvdec (or some upgraded amd equivalent).
Worse (and second), if the decode threads, which are long-pole items, wind up having their time stolen by SMT'd threads, I'm in for a bad time. There are very strange performance pits. I can put a few cams-worth of video in a loop, and one time through the loop everything is good, and another time through it, we're hiccoughing like crazy. This usually happens when the processor is nearly fully utilized (over 80%). I haven't re-run that test since I upgraded my BIOS (to F5g on a Gaming 5, which is supposed to have 1.0.0.4), so maybe things have improved, but it feels like some kind of scheduling problem. If the threads are scheduled on top of each other (err, same core), they seem to stay that way until the next go-around. :shrug: [This is Win7, btw]

Hopefully those observations are useful in some way. Obviously more investigation would be required. A longer, video-editing-focused review, which I would sum up with "not all things parallelizable are parallelized", is here: https://www.pugetsystems.com/labs/a...2017-AMD-Ryzen-7-1700X-1800X-Performance-909/. To anyone who wants to know why their benchmarks look the way they do, this kind of article will likely grate, but the kind of software behavior it demonstrates is going to be a problem for some HEDT targets. That said, I'd be lying if I claimed I wasn't interested in a 16-core anyway. :)

Thanks for the time!
 
Reactions: french toast

Kromaatikse

Member
Mar 4, 2017
83
169
56
not all things parallelizable are parallelized
This is something I see quite specifically in Gentoo Linux, which builds all of its packages from source on the end system. During this build process, there are several distinct phases which occur entirely sequentially:
  • Source archives are verified, unpacked and patched. This is basically a serial operation, although decompressors effectively parallelise with the unarchivers and patchers they directly feed. In any case it only takes a long time for very big packages.
  • The build tree is configured. More often than not, this involves a GNU Autotools script, which is notoriously slow and pedantic - and also completely serial.
  • The source code is compiled and linked. This is theoretically the meat of the business, and is usually properly parallelised on large packages, as you'd expect of a multi-file compiler workload. There may be a few bottlenecks in the dependency chain, but that's it.
  • The build products, documentation, etc. are installed. This is mostly a disk-limited operation, but with one or two notable exceptions: in particular, Glibc inexplicably delays building locale descriptors to this stage and does not parallelise this (by default) 100+-step process.
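The net effect is plain Amdahl's law: the serial phases put a hard cap on what extra cores can do for the total build time. A minimal sketch (all phase durations are made-up placeholders, not measurements of any real package):

Code:
# Amdahl's-law sketch of a source-based package build: unpack, configure
# and install are treated as serial; only compilation scales with cores.
# Every duration below is a hypothetical placeholder.
def build_time(unpack_s, configure_s, compile_s, install_s, cores):
    return unpack_s + configure_s + compile_s / cores + install_s

base = build_time(10, 60, 600, 30, cores=1)  # 700 s single-threaded
for cores in (1, 4, 8, 16):
    t = build_time(10, 60, 600, 30, cores)
    print(f"{cores:2d} cores: {t:6.1f} s  speedup {base / t:.2f}x")
# With 100 s of serial work, even infinite cores cap the speedup at 7x here.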
 

i-know-not

Junior Member
Mar 2, 2017
13
14
41
Some official data on Zen and how to optimise for it. Found here

Google translate'd

Original

----------------------------------

[9 attached slide images from the presentation]

Haven't seen this posted yet: full pdf of GDC optimization slides
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Yes, I see all of the charts swapped -- SandyBridge on top. :shrug: I'm more interested in all the remaining pages. I hope your time frees up.
...
It's not my review. The charts are meant to show performance relative to XV (top) and SB (bottom). Thus they have to include SB (top) and XV (bottom).

Your observation reminds me of a question I asked in the past about measuring the performance of actual video editing, not just the (overnight) rendering. That would include the storage subsystem, the memory subsystem, and of course the CPU.

SMT-related issues (e.g. background threads reducing the performance of foreground threads during user interaction) could be improved by setting affinity, using Process Lasso, etc. But a smarter scheduler would help too, of course.
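For the affinity route, a minimal sketch using psutil (the PID is a hypothetical placeholder, and the "even-numbered logical CPUs are each core's first thread" layout is an assumption - verify your own topology first, e.g. with Coreinfo on Windows or lscpu on Linux):

Code:
# Pin a process to one logical CPU per physical core so its long-pole
# decode threads never share a core with an SMT sibling.
import psutil

pid = 1234  # hypothetical PID of the NLE/decoder process
p = psutil.Process(pid)
one_per_core = list(range(0, psutil.cpu_count(logical=True), 2))
p.cpu_affinity(one_per_core)
print(f"{p.name()} pinned to logical CPUs {one_per_core}")

Process Lasso does essentially this persistently, reapplying the mask whenever the process restarts.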

What kind of NVMe drives, SSDs, and HDDs do you use - and how much RAM?
 
Reactions: Drazick

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Yes, I see all of the charts swapped -- SandyBridge on top. :shrug: I'm more interested in all the remaining pages. I hope your time frees up.
...

I have no idea how the charts could ever be swapped; they're hard-coded in place inside their cells and have never been placed in the wrong cell. What browser are you using? And, are you sure they are swapped? The results relative to Excavator will contain the Sandy Bridge results, whereas the results relative to Sandy Bridge will contain the Excavator results.

What program do you use for video editing? Proprietary solutions to generic problems always irk me. QuickSync isn't anything special; it's just GPU compute.
 
Reactions: Drazick

DeeJayBump

Member
Oct 9, 2008
60
63
91
I have no idea how the charts could ever be swapped, they're hard-coded in place inside their cells and have never been placed in the wrong cell. What browser are you using? And, are you sure they are swapped? The results relative to Excavator will contain the Sandy Bridge results, whereas the results relative to Sandy Bridge will contain the Excavator results.
...

First of all, thanks for all the hard work you've put into this Ryzen testing.

As for the reversed charts: using Pale Moon, the charts [in both the Single-Thread and Multi-Thread sections] are reversed for me as well. The charts themselves appear to be misnamed [Excavator-named charts lack Excavator results, SB-named charts lack SB results], which seems to be the issue.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
As for the reversed charts: using Pale Moon, the charts [in both the Single-Thread and Multi-Thread sections] are reversed for me as well. The charts themselves appear to be misnamed [Excavator-named charts lack Excavator results, SB-named charts lack SB results], which seems to be the issue.

"Relative to Excavator" charts will not contain Excavator results - they would all just be 100% ;-)

Relative charts generally exclude the baseline they're relative to, since it would just sit at the 100% marker.
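In other words, the normalisation looks something like this (illustrative numbers, not the actual benchmark data):

Code:
# Why the baseline vanishes from a "relative to X" chart: its own entry
# would always be exactly 100%. Scores here are made up for illustration.
scores = {"Excavator": 620, "Sandy Bridge": 850, "Ryzen": 1000}
baseline = "Excavator"
relative = {cpu: round(100 * s / scores[baseline], 1)
            for cpu, s in scores.items() if cpu != baseline}
print(relative)  # {'Sandy Bridge': 137.1, 'Ryzen': 161.3}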
 

dnavas

Senior member
Feb 25, 2017
355
190
116
It's not my review. The charts are meant to show performance relative to XV (top) and SB (bottom). Thus they have to include SB (top) and XV (bottom).

Yes, I know. I reopened the page and all seems reasonable. Perhaps I misread the first time through? I dunno -- been a long couple of days....

SMT-related issues (e.g. background threads reducing the performance of foreground threads during user interaction) could be improved by setting affinity, using Process Lasso, etc. But a smarter scheduler would help too, of course.

Edius comes in two different versions, and the workgroup version (which I have) indicates that it supports multi-CPU systems. I don't know the extent to which it is getting in its own way. I should look into Lasso.

What kind of NVMe drives, SSDs, and HDDs do you use - and how much RAM?

Well, the OS is on an SSD. I'll probably graduate it to NVMe, but I'm not in any hurry.
Most folks I know have local RAID arrays, but I prefer to edit in the final resting place of my bits, so I have a rather unusual setup where the NAS sits next to my PC. I've got 8 spinning 4TB drives in RAID-6, fronted by two 256GB SSDs in RAID-0 as a read-only cache (SATA, because QNAP doesn't support PCIe-based NVMe). Now that I've started dealing with 4K I'm considering a dedicated 10GbE connection, although I don't normally do multicam, and 140Mbps is a pretty simple thing for straight gigabit. I'm more concerned about whether the pre-rendered stuff can be streamed adequately. The NAS has 8GB, my computer has 16GB. Edius itself doesn't really require a lot of RAM.
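As a quick back-of-the-envelope on that link budget (the ~70% usable-throughput figure is a rough allowance for protocol overhead, not a measurement):

Code:
# How many ~140 Mbps camera streams fit on a link? Assumes roughly 70%
# of line rate survives protocol/SMB overhead - a guess, not measured.
def max_streams(link_mbps, stream_mbps=140, usable=0.70):
    return int(link_mbps * usable // stream_mbps)

print(max_streams(1_000))   # GbE:   5 streams
print(max_streams(10_000))  # 10GbE: 50 streams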

And, are you sure they are swapped?

Well, I was sure this afternoon, but looking at it again, they seem as expected. Perhaps I was confused. The current labels (relative to ...) make things look fine to me this evening.

Proprietary solutions to generic problems always irk me. QuickSync isn't anything special; it's just GPU compute.

If decode were a solved problem, editing long-GOP 4:2:2 4K video wouldn't be such a difficult task. It is, though, because QS, nvdec, etc. don't support 4:2:2. Generally, you talk to CPU people and they say "but that's just GPU", and you talk to GPU people and they say "yeah, but who watches 4:2:2 video?" So you have nvdec supporting 8K HEVC formats for the broad consuming public -- all 2 of them -- but not the 4:2:2 format that's required for delivered video in various places around the world. You have Intel with QSV in their consumer chips, but not in the CPUs which would otherwise be more useful for editing. Because it's just GPU. And the hardware is only there because it's useful for driving down the "watts-while-watching-Blu-ray" numbers. And who uses an 8-core processor to watch Blu-rays. :cry:

The thing is, that hardware is really useful. In Edius it's easily worth a couple of cores. I don't have numbers for the 7700k, but the higher the decode resolution that gets supported, the more simultaneous lower-resolution decodes can run. It's why Vegas is making such a big deal of its support for it. Meanwhile we're staring at the 8K freight train and looking backwards in time towards the use of proxies. Unpleasant :(

But, off-topic.
Given the current immaturity of the platform, and the shifting sands of BIOS updates and game patches, attempting what you've attempted is worthy of note. I do appreciate it. I think it'll be really important in a few months when Zen heads up against Skylake-X. It'll likely be core count vs. frequency, and understanding the shortcomings (and strengths) of the former will go a long way toward a good discussion of the merits of the platforms. Thanks muchly.
 

Paratus

Lifer
Jun 4, 2004
16,613
13,296
146
Ars has a pretty interesting article on the performance improvements from patches to games, Windows, and the processor microcode. They take a pretty deep dive into what's going on.

https://arstechnica.com/information...ryzen-showing-just-what-can-and-cant-be-done/

The last few weeks have seen the release of a couple of game patches designed to address certain Ryzen issues. AMD has also released guidance to game developers on how best to use its processor, as well as a new power management profile for Windows 10. Together, we can gain some insight into some of the complexities of developing game software for modern processors and get some understanding of what kind of performance gains gamers might hope to see.

It basically shows what I already expected: the areas where Ryzen appears to significantly underperform against Broadwell are mostly due to the lack of Ryzen-specific optimizations.
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
PCPer tested the effect of memory speed on ping times between cores across CCXs.

[Chart: inter-CCX ping latency vs. memory clock]


https://www.pcper.com/reviews/Proce...Core-i5/CCX-Latency-Testing-Pinging-between-t
Heh, I see they took my request. I emailed them about doing these tests :p

At a 1066MHz clock, a message takes 100ns to cross the CCX barrier. At 1600MHz, a message takes 71ns to cross the CCX barrier. That looks like roughly linear scaling.
DDR4-4000 RAM should reduce this to ~55ns, for a total of ~95ns. Close to Intel's 80ns.

If they fixed the DF clock at 4GHz, it would be ~70ns worst case, or just 27.5ns added by the data fabric.
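A minimal sketch of that projection, assuming the CCX-hop cost scales linearly with 1/(DF clock) from the measured point above (an extrapolation, not a measurement):

Code:
# Project the inter-CCX hop cost assuming it scales as 1/(DF clock).
# Anchored on the DDR4-2133 measurement; everything else is extrapolated.
measured_clk_mhz, measured_ns = 1066, 100.0

def hop_ns(df_clk_mhz):
    return measured_ns * measured_clk_mhz / df_clk_mhz

for mts in (2133, 3200, 4000):
    df_clk = mts / 2  # Zen's data fabric runs at half the memory transfer rate
    print(f"DDR4-{mts}: ~{hop_ns(df_clk):.0f} ns")
# DDR4-3200 -> ~67 ns (measured: 71); DDR4-4000 -> ~53 ns, i.e. the ~55 ns above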
 

TerionX6

Junior Member
Jun 29, 2015
14
20
46
Ryzen CCX latencies
2133 > 2400
1/8th increase in DF/mem clock gives ~1/12.5th decrease in latency
scaling of .96

2400 > 2933
1/4.5th increase in DF/mem clock gives ~1/8.19th decrease in latency
scaling of .918

2933 > 3200
1/11th increase in DF/mem clock gives ~1/25.8th decrease in latency
scaling of .95

Following these figures, I no longer believe a 2GHz DF clock (4GHz RAM speed) would lower latency that much. My calculations show ~95ns at best, which is within ~15% of Intel's monolithic approach. Still impressive!
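You land near that figure if you model the hop as a fixed part plus a clock-scaled part. A sketch of the two-point fit (fitted to just the two numbers quoted above, so treat it as illustrative; the 40ns intra-CCX base is an assumption, roughly the quad-core figure seen in reviews):

Code:
# Two-point fit of the CCX-hop cost as hop(f) = a + b/f, i.e. a fixed
# overhead plus a fabric-clock-scaled part. Illustrative only.
f1, hop1 = 1066.0, 100.0  # DDR4-2133 point (DF clock in MHz, hop in ns)
f2, hop2 = 1600.0, 71.0   # DDR4-3200 point

b = (hop1 - hop2) / (1 / f1 - 1 / f2)
a = hop1 - b / f1
intra_ccx_ns = 40.0        # rough intra-CCX core-to-core latency

hop_4000 = a + b / 2000.0  # DDR4-4000 -> 2000 MHz DF clock
print(f"fixed ~{a:.0f} ns, hop at DDR4-4000 ~{hop_4000:.0f} ns, "
      f"total ~{intra_ccx_ns + hop_4000:.0f} ns")  # ~99 ns, near the ~95 above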

Ciao,
Terion
 
Reactions: T1beriu

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
Another review looked at cross-CCX core-to-core latency: http://www.tomshardware.com/reviews/amd-ryzen-5-1600x-cpu-review,5014-2.html
They don't look at Broadwell-E with a cache OC, though.
I'm glad someone finally tested core-to-core latency on quad-core Skylake. It appears intra-CCX latency is identical to a quad-core Skylake at 40ns. Inter-CCX latency, however, jumps beyond even the 80ns mark of Broadwell-E. I imagine that with 4000MT/s RAM the latency difference against Broadwell-E won't be significant enough to have a real effect, while in the best case it will still be better.

The question is how Skylake-X fares. Will it maintain the 40ns latency to all cores? Because if so, then that fabric will have some work to do lol
 

TerionX6

Junior Member
Jun 29, 2015
14
20
46
Rather, I am curious about the latency differences between Naples and Intel's top-end server SKUs. Intel's ring implementation is said to add more and more latency as core counts grow. While Naples will have to deal not just with inter-CCX communication but also with delays between the 8-core dies, Intel's current designs have to deal with ever-larger ring delays. If we could get our hands on latency tests of those fancy 28-core Xeons...

With that said, I've read someone mention that they expect a mesh-based, KNL-like topology for future Xeons. I can't imagine this would be available on Skylake or any Intel 14nm design.
 

hondaman

Senior member
Oct 9, 1999
210
0
71
@looncraz good info.
Now if only virtualization were not so lacking, it would be great.
Still seeing a big performance drop on a passed-through GPU.
...
I have an ASRock Taichi with the v2.0 BIOS, an RX 460 in the PCIe 1 slot (nearest the CPU), and an NV 1070 in the "middle" PCIe slot, running Ubuntu 17.04 beta. I've been trying, and failing, to do PCIe passthrough.
 

SpecChum

Member
Aug 16, 2007
31
8
81
As you know, my 1700 couldn't (well, not consistently) hit 3200 on my G.Skill 3200C14 memory, so I decided to buy another 1700 and swap it in last night.

Result?

3200C14 RAM first time, every time. Nothing has changed but the CPU.

Obviously not conclusive by any means, but food for thought.
 

Timur Born

Senior member
Feb 14, 2016
277
139
116
Ambient 21°C, Radiator 21.5°C, Sense Skew enabled (Auto/defaults), "Power Saver" W10 profile

Idle:

[screenshot]


Idle with WmiPrvSE.EXE background load:

[screenshot]


Different CPU load profiles (power vs. temperature), x-axis not aligned:

Power
[chart]

Temperature
[chart]


Sorry for the typo, I meant core 15 "odd". Cores 1-16 in Statuscore (15 = odd) correspond to cores 0-15 in Task Manager (15 = even). I meant core 15 in Statuscore (core 14 in TM). Statuscore uses different CPU instruction sets for the even- and odd-core stress tests!
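To keep the two numbering schemes straight, a trivial sketch of the mapping (assuming, as described above, that Statuscore counts cores from 1 and Task Manager from 0):

Code:
# Statuscore numbers cores 1-16; Task Manager numbers them 0-15.
# The off-by-one flips parity: Statuscore 15 (odd) = Task Manager 14 (even).
def statuscore_to_tm(n):
    return n - 1

for n in (1, 14, 15, 16):
    tm = statuscore_to_tm(n)
    print(f"Statuscore {n:2d} ({'odd' if n % 2 else 'even'}) -> "
          f"Task Manager {tm:2d} ({'odd' if tm % 2 else 'even'})")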

And to make things a bit more complicated: :p

[screenshot]
 
Reactions: lightmanek