Ryzen: Strictly technical


knowndragon

Junior Member
Apr 3, 2017
17
4
36
Lots of info to read here. I appreciate the OP for taking the time. This is being seen and linked on other forums I belong to.

So, as a rule of thumb (I have not read all the posts in this thread yet, but I will): if you can't get past the original XFR frequency, is the best thing to just leave it at stock? I am going to try a base-clock overclock combined with a multiplier adjustment, if that is possible.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Thanks. I just had a case where I closed and reopened the browser and found the main text content empty (Looncache enabled). That was the "X Threads" page. I then clicked on the "1 Thread" page and it was empty, too. Only once I clicked the "Home" page could I access the other pages again.

Thanks, I refreshed the cached versions in case they get used again. I was originally planning to use PHP charts, which are compute-heavy, so I was going to cache the pages full-time in memory. Not sure how the empty cached versions came up at all, though... pretty strange.
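One classic way empty pages end up in a cache is storing the output of a render that failed. A minimal guard, sketched in Python purely for illustration (the actual site logic is PHP, and this failure mode is only a guess, not a diagnosis):

Code:
# In-memory full-page cache that refuses to store empty/failed renders,
# so one bad render can't keep serving blank pages until the next refresh.
cache = {}

def get_page(key, render):
    if key not in cache:
        html = render()
        if not html or not html.strip():
            return html  # serve the bad result once, but never cache it
        cache[key] = html
    return cache[key]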
 
Reactions: Drazick

Chl Pixo

Junior Member
Mar 9, 2017
11
2
41
@looncraz good info.
Now if only virtualization were not so lacking, it would be great.
Still seeing a big performance drop on a passed-through GPU.

I am curious whether the new AGESA resolves the problem with IOMMU and AMD cards in the chipset slot.
Currently no Linux will boot, at least on the ASUS Prime X370-Pro, if there is a GCN-based card in the chipset slot.
Tested this with an RX 460, R9 290, and R7 260X.
An old HD 6450 works fine, and from what I have read online, Nvidia cards work too.
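For anyone else chasing this, a minimal sketch for checking which devices share an IOMMU group (standard sysfs layout, nothing board-specific; run it once Linux actually boots):

Code:
#!/usr/bin/env python3
# List IOMMU groups and the PCI devices in each, straight from sysfs.
# The usual first step when debugging VFIO/passthrough grouping issues.
from pathlib import Path

groups = Path("/sys/kernel/iommu_groups")
if not groups.is_dir():
    raise SystemExit("No IOMMU groups - is the IOMMU enabled in BIOS and kernel?")

for group in sorted(groups.iterdir(), key=lambda p: int(p.name)):
    for dev in sorted((group / "devices").iterdir()):
        print(f"group {group.name}: {dev.name}")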
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
http://zen.looncraz.net/ now online.

For the next few weeks, maybe as much as a month, I won't have time to finish more than what has already been done.

At that point I should also have more Intel results, as I will be upgrading an Ivy Bridge Xeon system to Ryzen and will have the parts on consignment for a short while (long enough to run a series of benchmarks at 3GHz with and without Hyper-Threading).
Someone noticed that the 1T SB/XV charts are swapped. Edit: they seem OK to me, though.
 
Reactions: Drazick

mattiasnyc

Senior member
Mar 30, 2017
356
337
136
It would be a very strange design if an onboard PCI bridge would slow down CPU bound x16 then. It's rather off-topic, but since the question came up, here is what I discuss in a Gearslutz (forum) thread:

I don't think it's off-topic, it relates to strictly technical issues.... And I'm the same Mattias as on Gearslutz! :)

I don't think the CPU-bound x16 is slowing down; I just think it's an odd coincidence that the boards with legacy PCI have chosen not to offer x8/x8 on the CPU's PCIe 3.0 x16 lanes.

Either way, as soon as I get confirmation that the adapter outputs the correct voltage for my Lynx card, I'll probably settle for that instead of onboard legacy PCI. That'll let me run x8/x8 for video work in the future.
 

dnavas

Senior member
Feb 25, 2017
355
190
116
Someone noticed that the 1T SB/XV charts are swapped. Edit: they seem OK to me, though.

Yes, I see all of the charts swapped -- SandyBridge on top. :shrug: I'm more interested in all the remaining pages. I hope your time frees up.

I've been using Ryzen for video editing, and I've noticed two things.
First, the lack of quicksync really hampers decode acceleration, to the point where a 7700k may well be a better bet for some users. Single-threaded UHD avc decode in my NLE can't complete at 60p, so unless the video is packaged with multiple slices (two might work at ~4.1G, but I'm only stable at 3.9), I'm not real-time in my software. That's a problem. Fortunately I have quad-slice cam output, so I'm fine for now, and can only hope that the software eventually makes use of nvdec (or some upgraded amd equivalent).
Worse (and second), if the decode threads, which are long-pole items, wind up having their time stolen by SMT'd threads, I'm in for a bad time. There are very strange performance pits. I can put a few cams-worth of video in a loop, and one time through the loop everything is good, and another time through it, we're hiccoughing like crazy. This usually happens when the processor is nearly fully utilized (over 80%). I haven't re-run that test since I upgraded my BIOS (to F5g on a Gaming 5, which is supposed to have 1.0.0.4), so maybe things have improved, but it feels like some kind of scheduling problem. If the threads are scheduled on top of each other (err, same core), they seem to stay that way until the next go-around. :shrug: [This is Win7, btw]

Hopefully those observations are useful in some way. Obviously more investigation would be required. A longer, video-editing-focused review, which I would sum up with "not all things parallelizable are parallelized", is here: https://www.pugetsystems.com/labs/a...2017-AMD-Ryzen-7-1700X-1800X-Performance-909/. To anyone who wants to know why their benchmarks look the way they do, this kind of article will likely grate, but the kind of software behavior it demonstrates is going to be a problem for some HEDT targets. That said, I'd be lying if I claimed I wasn't interested in a 16-core anyway. :)

Thanks for the time!
 
Reactions: french toast

Kromaatikse

Member
Mar 4, 2017
83
169
56
not all things parallelizable are parallelized
This is something I see quite specifically in Gentoo Linux, which builds all of its packages from source on the end system. During this build process, there are several distinct phases which occur entirely sequentially:
  • Source archives are verified, unpacked and patched. This is basically a serial operation, although decompressors effectively parallelise with the unarchivers and patchers they directly feed. In any case it only takes a long time for very big packages.
  • The build tree is configured. More often than not, this involves a GNU Autotools script, which is notoriously slow and pedantic - and also completely serial.
  • The source code is compiled and linked. This is theoretically the meat of the business, and is usually properly parallelised on large packages, as you'd expect of a multi-file compiler workload. There may be a few bottlenecks in the dependency chain, but that's it.
  • The build products, documentation, etc. are installed. This is mostly a disk-limited operation, but with one or two notable exceptions: in particular, Glibc inexplicably delays building locale descriptors to this stage and does not parallelise this (by default) 100+-step process.
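The net effect is plain Amdahl's law: the serial phases put a hard cap on what extra cores can do for the total build time. A minimal sketch (all phase durations are made-up placeholders, not measurements of any real package):

Code:
# Amdahl's-law sketch of a source-based package build: unpack, configure
# and install are treated as serial; only compilation scales with cores.
# Every duration below is a hypothetical placeholder.
def build_time(unpack_s, configure_s, compile_s, install_s, cores):
    return unpack_s + configure_s + compile_s / cores + install_s

base = build_time(10, 60, 600, 30, cores=1)  # 700 s single-threaded
for cores in (1, 4, 8, 16):
    t = build_time(10, 60, 600, 30, cores)
    print(f"{cores:2d} cores: {t:6.1f} s  speedup {base / t:.2f}x")
# With 100 s of serial work, even infinite cores cap the speedup at 7x here.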
 

i-know-not

Junior Member
Mar 2, 2017
13
14
41
Some official data on Zen and how to optimise for it. Found here

Google translate'd

Original

----------------------------------

[9 attached slide images from the presentation]

Haven't seen this posted yet: full pdf of GDC optimization slides
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Yes, I see all of the charts swapped -- SandyBridge on top. :shrug: I'm more interested in all the remaining pages. I hope your time frees up.
...
It's not my review. The charts are meant to show performance relative to XV (top) and SB (bottom). Thus they have to include SB (top) and XV (bottom).

Your observation reminds me of a question I asked in the past about measuring the performance of actual video editing, not just the (overnight) rendering. That would include the storage subsystem, the memory subsystem, and of course the CPU.

SMT-related issues (e.g. background threads reducing the performance of foreground threads during user interaction) could be improved by setting affinity, using Process Lasso, etc. But a smarter scheduler would help too, of course.
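For the affinity route, a minimal sketch using psutil (the PID is a hypothetical placeholder, and the "even-numbered logical CPUs are each core's first thread" layout is an assumption - verify your own topology first, e.g. with Coreinfo on Windows or lscpu on Linux):

Code:
# Pin a process to one logical CPU per physical core so its long-pole
# decode threads never share a core with an SMT sibling.
import psutil

pid = 1234  # hypothetical PID of the NLE/decoder process
p = psutil.Process(pid)
one_per_core = list(range(0, psutil.cpu_count(logical=True), 2))
p.cpu_affinity(one_per_core)
print(f"{p.name()} pinned to logical CPUs {one_per_core}")

Process Lasso does essentially this persistently, reapplying the mask whenever the process restarts.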

What kind of NVMe drives, SSDs, and HDDs do you use - and how much RAM?
 
Reactions: Drazick

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Yes, I see all of the charts swapped -- SandyBridge on top. :shrug: I'm more interested in all the remaining pages. I hope your time frees up.
...

I have no idea how the charts could ever be swapped; they're hard-coded in place inside their cells and have never been placed in the wrong cell. What browser are you using? And, are you sure they are swapped? The results relative to Excavator will contain the Sandy Bridge results, whereas the results relative to Sandy Bridge will contain the Excavator results.

What program do you use for video editing? Proprietary solutions to generic problems always irk me. QuickSync isn't anything special; it's just GPU compute.
 
Reactions: Drazick

DeeJayBump

Member
Oct 9, 2008
60
63
91
I have no idea how the charts could ever be swapped, they're hard-coded in place inside their cells and have never been placed in the wrong cell. What browser are you using? And, are you sure they are swapped? The results relative to Excavator will contain the Sandy Bridge results, whereas the results relative to Sandy Bridge will contain the Excavator results.
...

First of all, thanks for all the hard work you've put into this Ryzen testing.

As for the reversed charts: using Pale Moon, the charts [in both the Single-Thread and Multi-Thread sections] are reversed for me as well. The charts themselves appear to be misnamed [Excavator-named charts lack Excavator results, SB-named charts lack SB results], which seems to be the issue.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
As for the reversed charts: using Pale Moon, the charts [in both the Single-Thread and Multi-Thread sections] are reversed for me as well. The charts themselves appear to be misnamed [Excavator-named charts lack Excavator results, SB-named charts lack SB results], which seems to be the issue.

"Relative to Excavator" charts will not contain Excavator results - they would all just be 100% ;-)

Relative charts generally exclude the baseline they're relative to, since it would just sit at the 100% marker.
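In other words, the normalisation looks something like this (illustrative numbers, not the actual benchmark data):

Code:
# Why the baseline vanishes from a "relative to X" chart: its own entry
# would always be exactly 100%. Scores here are made up for illustration.
scores = {"Excavator": 620, "Sandy Bridge": 850, "Ryzen": 1000}
baseline = "Excavator"
relative = {cpu: round(100 * s / scores[baseline], 1)
            for cpu, s in scores.items() if cpu != baseline}
print(relative)  # {'Sandy Bridge': 137.1, 'Ryzen': 161.3}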
 

dnavas

Senior member
Feb 25, 2017
355
190
116
It's not my review. The charts are meant to show performance relative to XV (top) and SB (bottom). Thus they have to include SB (top) and XV (bottom).

Yes, I know. I reopened the page and all seems reasonable. Perhaps I misread the first time through? I dunno -- been a long couple of days....

SMT-related issues (e.g. background threads reducing the performance of foreground threads during user interaction) could be improved by setting affinity, using Process Lasso, etc. But a smarter scheduler would help too, of course.

Edius comes in two different versions, and the workgroup version (which I have) indicates that it supports multi-CPU systems. I don't know the extent to which it is getting in its own way. I should look into Lasso.

What kind of NVMe drives, SSDs, and HDDs do you use - and how much RAM?

Well, the OS is on an SSD. I'll probably graduate it to NVMe, but I'm not in any hurry.
Most folks I know have local RAID arrays, but I prefer to edit in the final resting place of my bits, so I have a rather unusual setup where the NAS sits next to my PC. I've got 8 spinning 4TB drives in RAID-6, fronted by two 256GB SSDs in RAID-0 as a read-only cache (SATA, because QNAP doesn't support PCIe-based NVMe). Now that I've started dealing with 4K I'm considering a dedicated 10GbE connection, although I don't normally do multicam, and 140Mbps is a pretty simple thing for straight gigabit. I'm more concerned about whether the pre-rendered stuff can be streamed adequately. The NAS has 8GB, my computer has 16GB. Edius itself doesn't really require a lot of RAM.
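As a quick back-of-the-envelope on that link budget (the ~70% usable-throughput figure is a rough allowance for protocol overhead, not a measurement):

Code:
# How many ~140 Mbps camera streams fit on a link? Assumes roughly 70%
# of line rate survives protocol/SMB overhead - a guess, not measured.
def max_streams(link_mbps, stream_mbps=140, usable=0.70):
    return int(link_mbps * usable // stream_mbps)

print(max_streams(1_000))   # GbE:   5 streams
print(max_streams(10_000))  # 10GbE: 50 streams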

And, are you sure they are swapped?

Well, I was sure this afternoon, but looking at it again, they seem as expected. Perhaps I was confused. The current labels (relative to ...) make things look fine to me this evening.

Proprietary solutions to generic problems always irk me. QuickSync isn't anything special; it's just GPU compute.

If decode were a solved problem, editing long-GOP 4:2:2 4K video wouldn't be such a difficult task. It is, though, because QS, nvdec, etc. don't support 4:2:2. Generally, you talk to CPU people and they say "but that's just GPU", and you talk to GPU people and they say "yeah, but who watches 4:2:2 video?" So you have nvdec supporting 8K HEVC formats for the broad consuming public -- all 2 of them -- but not the 4:2:2 format that's required for delivered video in various places around the world. You have Intel with QSV in their consumer chips, but not in the CPUs which would otherwise be more useful for editing. Because it's just GPU. And the hardware is only there because it's useful for driving down the "watts-while-watching-Blu-ray" numbers. And who uses an 8-core processor to watch Blu-rays. :cry:

The thing is, that hardware is really useful. In Edius it's easily worth a couple of cores. I don't have numbers for the 7700k, but the higher the decode resolution that gets supported, the more simultaneous lower-resolution decodes can run. It's why Vegas is making such a big deal of its support for it. Meanwhile we're staring at the 8K freight train and looking backwards in time towards the use of proxies. Unpleasant :(

But, off-topic.
Given the current immaturity of the platform, and the shifting sands of BIOS updates and game patches, attempting what you've attempted is worthy of note. I do appreciate it. I think it'll be really important in a few months when Zen heads up against Skylake-X. It'll likely be core count vs. frequency, and understanding the shortcomings (and strengths) of the former will go a long way toward a good discussion of the merits of the platforms. Thanks muchly.
 

Paratus

Lifer
Jun 4, 2004
16,613
13,296
146
Ars has a pretty interesting article on the performance improvements from patches to games, Windows, and the processor microcode. They take a pretty deep dive into what's going on.

https://arstechnica.com/information...ryzen-showing-just-what-can-and-cant-be-done/

The last few weeks have seen the release of a couple of game patches designed to address certain Ryzen issues. AMD has also released guidance to game developers on how best to use its processor, as well as a new power management profile for Windows 10. Together, we can gain some insight into some of the complexities of developing game software for modern processors and get some understanding of what kind of performance gains gamers might hope to see.

It basically shows what I already expected: the areas where Ryzen appears to significantly underperform against Broadwell are mostly due to the lack of Ryzen-specific optimizations.
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
PCPer tested the effect of memory speed on ping times between cores across CCXs.

[Chart: inter-CCX ping latency vs. memory clock]


https://www.pcper.com/reviews/Proce...Core-i5/CCX-Latency-Testing-Pinging-between-t
Heh, I see they took my request. I emailed them about doing these tests :p

At a 1066MHz clock, a message takes 100ns to cross the CCX barrier. At 1600MHz, a message takes 71ns to cross the CCX barrier. That looks like roughly linear scaling.
DDR4-4000 RAM should reduce this to ~55ns, for a total of ~95ns. Close to Intel's 80ns.

If they fixed the DF clock at 4GHz, it would be ~70ns worst case, or just 27.5ns added by the data fabric.
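A minimal sketch of that projection, assuming the CCX-hop cost scales linearly with 1/(DF clock) from the measured point above (an extrapolation, not a measurement):

Code:
# Project the inter-CCX hop cost assuming it scales as 1/(DF clock).
# Anchored on the DDR4-2133 measurement; everything else is extrapolated.
measured_clk_mhz, measured_ns = 1066, 100.0

def hop_ns(df_clk_mhz):
    return measured_ns * measured_clk_mhz / df_clk_mhz

for mts in (2133, 3200, 4000):
    df_clk = mts / 2  # Zen's data fabric runs at half the memory transfer rate
    print(f"DDR4-{mts}: ~{hop_ns(df_clk):.0f} ns")
# DDR4-3200 -> ~67 ns (measured: 71); DDR4-4000 -> ~53 ns, i.e. the ~55 ns above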
 

TerionX6

Junior Member
Jun 29, 2015
14
20
46
Ryzen CCX latencies
2133 > 2400
1/8th increase in DF/mem clock gives ~1/12.5th decrease in latency
scaling of .96

2400 > 2933
1/4.5th increase in DF/mem clock gives ~1/8.19th decrease in latency
scaling of .918

2933 > 3200
1/11th increase in DF/mem clock gives ~1/25.8th decrease in latency
scaling of .95

Following these figures, I no longer believe a 2GHz DF clock (4GHz RAM speed) would lower latency that much. My calculations show ~95ns at best, which is within ~15% of Intel's monolithic approach. Still impressive!
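You land near that figure if you model the hop as a fixed part plus a clock-scaled part. A sketch of the two-point fit (fitted to just the two numbers quoted above, so treat it as illustrative; the 40ns intra-CCX base is an assumption, roughly the quad-core figure seen in reviews):

Code:
# Two-point fit of the CCX-hop cost as hop(f) = a + b/f, i.e. a fixed
# overhead plus a fabric-clock-scaled part. Illustrative only.
f1, hop1 = 1066.0, 100.0  # DDR4-2133 point (DF clock in MHz, hop in ns)
f2, hop2 = 1600.0, 71.0   # DDR4-3200 point

b = (hop1 - hop2) / (1 / f1 - 1 / f2)
a = hop1 - b / f1
intra_ccx_ns = 40.0        # rough intra-CCX core-to-core latency

hop_4000 = a + b / 2000.0  # DDR4-4000 -> 2000 MHz DF clock
print(f"fixed ~{a:.0f} ns, hop at DDR4-4000 ~{hop_4000:.0f} ns, "
      f"total ~{intra_ccx_ns + hop_4000:.0f} ns")  # ~99 ns, near the ~95 above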

Ciao,
Terion
 
Reactions: T1beriu

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
Another review looked at cross-CCX core-to-core latency: http://www.tomshardware.com/reviews/amd-ryzen-5-1600x-cpu-review,5014-2.html
They don't look at Broadwell-E with a cache OC, though.
I'm glad someone finally tested core-to-core latency on quad-core Skylake. It appears intra-CCX latency is identical to a quad-core Skylake at 40ns. Inter-CCX latency, however, jumps beyond even the 80ns mark of Broadwell-E. I imagine that with 4000MT/s RAM the latency difference against Broadwell-E won't be significant enough to have a real effect, while in the best case it will still be better.

The question is how Skylake-X fares. Will it maintain the 40ns latency to all cores? Because if so, then that fabric will have some work to do lol
 

TerionX6

Junior Member
Jun 29, 2015
14
20
46
Rather, I am curious about the latency differences between Naples and Intel's top-end server SKUs. Intel's ring implementation is said to add more and more latency as core counts grow. While Naples will have to deal not just with inter-CCX communication but also with delays between the 8-core dies, Intel's current designs have to deal with ever-larger ring delays. If we could get our hands on latency tests of those fancy 28-core Xeons...

With that said, I've read someone mention that they expect a mesh-based, KNL-like topology for future Xeons. I can't imagine this would be available on Skylake or any Intel 14nm design.
 

hondaman

Senior member
Oct 9, 1999
210
0
71
@looncraz good info.
Now if only virtualization were not so lacking, it would be great.
Still seeing a big performance drop on a passed-through GPU.
...
I have an ASRock Taichi with the v2.0 BIOS, an RX 460 in the PCIe 1 slot (nearest the CPU), and an NV 1070 in the "middle" PCIe slot, running Ubuntu 17.04 beta. I've been trying, and failing, to do PCIe passthrough.
 

SpecChum

Member
Aug 16, 2007
31
8
81
As you know, my 1700 couldn't (well, not consistently) hit 3200 on my G.Skill 3200C14 memory, so I decided to buy another 1700 and swap it in last night.

Result?

3200C14 RAM first time, every time. Nothing has changed but the CPU.

Obviously not conclusive by any means, but food for thought.
 

Timur Born

Senior member
Feb 14, 2016
277
139
116
Ambient 21°C, Radiator 21.5°C, Sense Skew enabled (Auto/defaults), "Power Saver" W10 profile

Idle:

[screenshot]


Idle with WmiPrvSE.EXE background load:

[screenshot]


Different CPU load profiles (power vs. temperature), x-axis not aligned:

Power
[chart]

Temperature
[chart]


Sorry for the typo, I meant core 15 "odd". Cores 1-16 in Statuscore (15 = odd) correspond to cores 0-15 in Task Manager (15 = even). I meant core 15 in Statuscore (core 14 in TM). Statuscore uses different CPU instruction sets for the even- and odd-core stress tests!
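To keep the two numbering schemes straight, a trivial sketch of the mapping (assuming, as described above, that Statuscore counts cores from 1 and Task Manager from 0):

Code:
# Statuscore numbers cores 1-16; Task Manager numbers them 0-15.
# The off-by-one flips parity: Statuscore 15 (odd) = Task Manager 14 (even).
def statuscore_to_tm(n):
    return n - 1

for n in (1, 14, 15, 16):
    tm = statuscore_to_tm(n)
    print(f"Statuscore {n:2d} ({'odd' if n % 2 else 'even'}) -> "
          f"Task Manager {tm:2d} ({'odd' if tm % 2 else 'even'})")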

And to make things a bit more complicated: :p

[screenshot]
 
Reactions: lightmanek