AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

Page 234 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.
Status
Not open for further replies.

bjt2

Senior member
Sep 11, 2016
784
180
86
I don't think anything particular happens. If the reviewers were to use only 256-bit workloads, there wouldn't be many workloads left in their test suites.
I don't see any reason to avoid 256-bit workloads either, since they are becoming more and more common (which they would already be, if Intel hadn't been sandbagging for years). For example, it is hard if not impossible to find a modern video encoder that doesn't support 256-bit AVX/AVX2 (x264, x265, VPx).

For example, in x265 the gain from AVX2 is >20% on Haswell and newer. Many of the heavier workloads, e.g. rendering (Blender, Embree, etc.), support AVX/AVX2, as do many scientific workloads/libraries. Since Ryzen in its 8C/16T configuration is a HEDT-oriented part, I see no reason to exclude those workloads.

Since Handbrake in the New Horizon demo seems to show Zen 7% faster than BDW-E (and so ~4% faster than Skylake), I suppose that Zen also gains from 256-bit instructions...
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,795
3,224
136
Indeed.
NBody, Linpack, x265, WinRAR and Himeno do.
So 5 of the 26 workloads.
Awesome. Are you also going to test on an 8-core Haswell/Broadwell with the same forced vector widths?

Looks like you're going to have the best data of any of the reviewers! Are you going to publish it or just dump it all in a forum post?

Since Handbrake in the New Horizon demo seems to show Zen 7% faster than BDW-E (and so ~4% faster than Skylake), I suppose that Zen also gains from 256-bit instructions...
I think it would very much be determined by load/store: if you get good register reuse and thus lower the proportion of loads and stores per x86 op, then Zen should be fine.

For example, Linpack will probably fall quite a long way behind Broadwell-E.
The number of floating-point operations is 2/3·n³ and the number of data references, both loads and stores, is also 2/3·n³. Thus, for every add/multiply pair we must perform a load and store of the elements, unfortunately obtaining little reuse of data.
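Taking that ~1:1 flop-to-reference balance at face value, here's a back-of-the-envelope sketch of why the store port, rather than the FP pipes, would set the Linpack ceiling. The per-cycle widths are the figures discussed in this thread (Zen storing 128 bits/cycle, Broadwell-E/Skylake 256), used purely as model inputs, not measurements:

```python
# With one load and one store per add/multiply pair (Linpack's ~1:1 balance),
# the sustained FP rate is capped by the slower of the FP pipes and the store
# port. Widths are in bits per cycle; doubles are 64 bits. Model only.
def capped_flops_per_cycle(fp_bits, store_bits, flops_per_store=2):
    fp_limit = fp_bits / 64                          # DP ops issuable per cycle
    store_limit = (store_bits / 64) * flops_per_store  # flops "paid for" by stores
    return min(fp_limit, store_limit)

zen = capped_flops_per_cycle(fp_bits=512, store_bits=128)  # store-bound: 4.0
bdw = capped_flops_per_cycle(fp_bits=512, store_bits=256)  # store-bound: 8.0
```

In this toy model Broadwell-E sustains twice Zen's rate on a Linpack-like mix, despite identical execution width, which is the point being made above.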
 

Justinbaileyman

Golden Member
Aug 17, 2013
1,980
249
106
Um, does Ryzen actually have 256-bit instructions?? I thought it was 128-bit split into 2 sets acting like 256-bit?? I am really, really liking the fact it's kicking butt in Handbrake, as this is the whole main purpose of me purchasing this CPU!!
 

imported_jjj

Senior member
Feb 14, 2009
660
430
136
It's an old debate: how much software do you use that is properly threaded? That needs threading?

To me the lack of scaling above 4 cores feels like a myth at this point, and that's why I am asking, as I can't quite find what needs to scale and doesn't.
Wasn't being hostile, just trying to find the truth.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,795
3,224
136
Um, does Ryzen actually have 256-bit instructions?? I thought it was 128-bit split into 2 sets acting like 256-bit?? I am really, really liking the fact it's kicking butt in Handbrake, as this is the whole main purpose of me purchasing this CPU!!

Yes, Zen supports 256-bit instructions (Bulldozer did as well). They decode to 1 µop, and when a 256-bit op reaches the FPU scheduler/dispatch it is split into 2 µops, one for the lower half and one for the upper half. These are then executed over the 4 128-bit pipelines as needed. Remember these pipelines are fully pipelined, meaning that in the typical case, even if an instruction takes 5 cycles to complete, the same pipeline can still issue a new instruction the next cycle; so your 256-bit op takes 6 cycles where your 128-bit op takes 5.

Because Zen has 4 128-bit pipelines that can all perform a large number of operations, throughput in the FPU shouldn't be a problem except for FMA. The bigger problem is getting these 256-bit vectors in and out of the core. Zen can execute 512 bits of AVX/AVX2 data a cycle but can only store 128 bits. Skylake can execute 512 bits of AVX/AVX2/FMA a cycle and can store 256 bits.
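A rough way to picture that in/out bottleneck, as a sketch using only the widths claimed above (load bandwidth and every other limit are deliberately ignored):

```python
# Cycles per loop iteration of 256-bit SIMD work, limited by either the
# execution width or the store width (both in bits per cycle).
# A model of the widths quoted in this thread, not a measurement.
def cycles_per_iter(ops256, stores256, exec_bits, store_bits):
    exec_cycles = ops256 * 256 / exec_bits     # time to issue the math
    store_cycles = stores256 * 256 / store_bits  # time to drain the results
    return max(exec_cycles, store_cycles)      # slowest resource wins

# A streaming "a[i] = b[i] * c[i]" loop: 1 multiply + 1 store per iteration.
zen = cycles_per_iter(1, 1, exec_bits=512, store_bits=128)  # 2.0: store-bound
sky = cycles_per_iter(1, 1, exec_bits=512, store_bits=256)  # 1.0
```

So on a store-heavy streaming loop this model has Zen at half Skylake's per-clock rate even though the execution width is the same, which is exactly the "getting vectors in and out of the core" problem.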
 

imported_jjj

Senior member
Feb 14, 2009
660
430
136
Frankly I cannot see anything wrong in the "conclusion" example you gave, as long as both the strengths and the weaknesses are equally taken into account, well documented & explained.

That is slightly harder to arrange than before, but I should be able to provide application specific power measurements at least for some workloads.

I am very curious about 2-5% load too; maybe you could try to plot a curve, W vs. load percentage, at least for the 6900K, since doing it for everything would be a lot of work.
Multitasking gaming plus light workloads could be seen as an attempt to favor Ryzen, but I think it's a valid real-world scenario; I've seen it done for Broadwell-E before.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Awesome. Are you also going to test on an 8-core Haswell/Broadwell with the same forced vector widths?

Looks like you're going to have the best data of any of the reviewers! Are you going to publish it or just dump it all in a forum post?

I wasn't using any specific compiler options in terms of the vector width; I left that for the compiler to decide.
As for the other compiler settings for the custom binaries, I used /QaxCORE-AVX2 for ICL (combined with /O2 and /fp:precise), and -O3 together with the commonly (µarch-)supported instruction sets for GCC.

No µarch-specific optimization (ICL 2017 now allows that too, for some reason ;)) was done for any of the binaries, and everything that was compiled with ICL/iFortran was manually patched to treat both vendors equally (dispatcher).

I didn't choose ICL by accident: I tested all of the compilers (when possible) prior to selecting the one I ended up using. On average ICL 2017 performed 13% better than MSVC 2015 or GCC in FP code, while the total range of the difference was −100% to +700%... Pretty funny that the µarchs which gained the most from using ICL 2017 were Excavator and Zen. The difference with Zen wasn't nearly as large as it was with XV, though.

In my tests Zen goes against Excavator, Haswell-E and Kaby Lake for IPC tests, and head to head (absolute performance) against the 5960X and 7700K. The SMT yield is also measured on all three.

Due to the amount of stuff in my write-up it isn't really feasible to publish it on the forums, so I plan to release it as a PDF document. Regardless of what it ends up being, I need to have it looked over before I release any of it; some of the stuff might need some sanitization...
If everything goes as planned, it will contain some explanation of how various things work in Zen, along with other interesting things.
 

CentroX

Senior member
Apr 3, 2016
351
152
116
[Attached image: aeHfbd7.jpg]


People say this is the flagship 1800X score.

If so, it is kicking an $1800 CPU!!
 

Justinbaileyman

Golden Member
Aug 17, 2013
1,980
249
106
Ok, thanks for the info. So basically Skylake should have faster throughput since it can store 256 bits vs. Ryzen, which can only store 128 bits? Is this only with AVX and AVX2 instructions?
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,795
3,224
136
Ok, thanks for the info. So basically Skylake should have faster throughput since it can store 256 bits vs. Ryzen, which can only store 128 bits? Is this only with AVX and AVX2 instructions?
Yes, only with 256-bit AVX and AVX2 instructions (both CPUs can run 128-bit ops as well; the same applies to FMA). But if your instruction mix has a lower proportion of loads/stores, there is enough time to load and store the data without it becoming a bottleneck. A lot of SIMD FP workloads are very load/store heavy, though.
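To put numbers on the register-reuse point, a small sketch comparing memory references per FLOP for a few textbook kernels (the per-iteration counts are my own illustrative figures, double precision assumed):

```python
# Memory references per FLOP for three common kernels. Lower means the
# load/store ports have more slack and are less likely to bottleneck.
kernels = {
    # y[i] = a*x[i] + y[i]: pure streaming, every flop pair touches memory
    "daxpy (streaming)":             dict(loads=2, stores=1, flops=2),
    # acc += x[i]*y[i], accumulator kept in a register: no stores in the loop
    "dot product (register acc)":    dict(loads=2, stores=0, flops=2),
    # 4x4 register-blocked matmul tile: 8 loads feed 16 FMAs (32 flops)
    "4x4 blocked matmul tile":       dict(loads=8, stores=0, flops=32),
}
refs_per_flop = {
    name: (k["loads"] + k["stores"]) / k["flops"] for name, k in kernels.items()
}
# daxpy: 1.50 refs/flop, dot: 1.00, blocked tile: 0.25
```

The blocked tile reuses each loaded value across several FMAs, which is the "good register reuse" that lets a narrow store port keep up.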
 

imported_jjj

Senior member
Feb 14, 2009
660
430
136
We also need memory scaling: clocks and timings.
I usually favor tight timings over clocks, but with 8 cores and 2 channels, decent BW might matter more, to a point.
 

PPB

Golden Member
Jul 5, 2013
1,118
168
106
I don't think anything particular happens. If the reviewers were to use only 256-bit workloads, there wouldn't be many workloads left in their test suites.
I don't see any reason to avoid 256-bit workloads either, since they are becoming more and more common (which they would already be, if Intel hadn't been sandbagging for years). For example, it is hard if not impossible to find a modern video encoder that doesn't support 256-bit AVX/AVX2 (x264, x265, VPx).

For example, in x265 the gain from AVX2 is >20% on Haswell and newer. Many of the heavier workloads, e.g. rendering (Blender, Embree, etc.), support AVX/AVX2, as do many scientific workloads/libraries. Since Ryzen in its 8C/16T configuration is a HEDT-oriented part, I see no reason to exclude those workloads.
Vray supports Embree, but only for certain calculations (and uses fp32 instead of DP). The rest is good ol' SSE2/3 and mostly fp64.

And we are talking about THE biased renderer of the archviz/CG industry.

Sent from my XT1040 using Tapatalk
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,150
136
I wasn't using any specific compiler options in terms of the vector width; I left that for the compiler to decide.
As for the other compiler settings for the custom binaries, I used /QaxCORE-AVX2 for ICL (combined with /O2 and /fp:precise), and -O3 together with the commonly (µarch-)supported instruction sets for GCC.

No µarch-specific optimization (ICL 2017 now allows that too, for some reason ;)) was done for any of the binaries, and everything that was compiled with ICL/iFortran was manually patched to treat both vendors equally (dispatcher).

I didn't choose ICL by accident: I tested all of the compilers (when possible) prior to selecting the one I ended up using. On average ICL 2017 performed 13% better than MSVC 2015 or GCC in FP code, while the total range of the difference was −100% to +700%... Pretty funny that the µarchs which gained the most from using ICL 2017 were Excavator and Zen. The difference with Zen wasn't nearly as large as it was with XV, though.

In my tests Zen goes against Excavator, Haswell-E and Kaby Lake for IPC tests, and head to head (absolute performance) against the 5960X and 7700K. The SMT yield is also measured on all three.

Due to the amount of stuff in my write-up it isn't really feasible to publish it on the forums, so I plan to release it as a PDF document. Regardless of what it ends up being, I need to have it looked over before I release any of it; some of the stuff might need some sanitization...
If everything goes as planned, it will contain some explanation of how various things work in Zen, along with other interesting things.
Make sure to spam that PDF link everywhere, wouldn't want to miss it :p
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Awesome. Are you also going to test on an 8-core Haswell/Broadwell with the same forced vector widths?

Looks like you're going to have the best data of any of the reviewers! Are you going to publish it or just dump it all in a forum post?


I think it would very much be determined by load/store: if you get good register reuse and thus lower the proportion of loads and stores per x86 op, then Zen should be fine.

For example, Linpack will probably fall quite a long way behind Broadwell-E.

If the data are truly streamed, without reuse, then only the RAM BW matters. Even if you have a 512-bit datapath, the bottleneck will be the RAM BW...
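A minimal sketch of that roofline-style argument (all the numbers are hypothetical; e.g. ~40 GB/s for dual-channel DDR4 is an assumed figure, not a measurement):

```python
# With no data reuse, sustained GFLOP/s is the smaller of the core's compute
# peak and what memory bandwidth can feed. SIMD width only raises the first
# term, so a streaming kernel stays pinned to the bandwidth ceiling.
def stream_gflops(bw_gbps, bytes_per_flop, core_peak_gflops):
    return min(core_peak_gflops, bw_gbps / bytes_per_flop)

# A triad-like loop moving ~24 bytes per 2 flops (12 B/flop) on an assumed
# ~40 GB/s dual-channel setup: capped near 3.3 GFLOP/s, far below any
# modern core's SIMD peak, regardless of datapath width.
capped = stream_gflops(bw_gbps=40, bytes_per_flop=12, core_peak_gflops=100)
```

Doubling the core's peak in this model changes nothing: the `min` stays on the bandwidth side, which is exactly the point being made above.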
 

lopri

Elite Member
Jul 27, 2002
13,209
594
126
I really don't need anything other than these two:

1. Super Pi (any digit count, 1M or more)
2. Linpack using SSE4 (turn HT off), problem size something like 50000.

Someone please run them on Zen and report back.
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
People say this is the flagship 1800X score.
Listen and believe? Please, you know better.

Besides, that SMT scaling looks ******* good.

EDIT: Noticed The Stilt's explanation. Well, that should cover it, actually. May just be sort of legitimate.
 

KTE

Senior member
May 26, 2016
478
130
76
Frankly I cannot see anything wrong in the "conclusion" example you gave, as long as both the strengths and the weaknesses are equally taken into account, well documented & explained.



That is slightly harder to arrange than before, but I should be able to provide application specific power measurements at least for some workloads.
This might be a lot harder for you to test, but is Docker containerisation (build) also something you can test?

The reason I'm asking is that it is critical for enterprise cloud migrations right now.

Sent from HTC 10
(Opinions are own)
 

inf64

Diamond Member
Mar 11, 2011
3,706
4,047
136
[Attached image: aeHfbd7.jpg]


People say this is the flagship 1800X score.

If so, it is kicking an $1800 CPU!!

Looks legit if compared to the 1500 12T model:
ST: 1888 × 4.2 (XFR?) / 3.7 = 2143; XFR at work? Just by using 4 GHz as the Turbo clock we land 7% short of the alleged score of the top model.
MT: 12544 × 8/6 × 3.6/3.4 = 17709; the 6C part seems to have run this test at 3.4 GHz.
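The arithmetic above checks out; here's a quick sketch of it (the scores and clocks are the alleged figures from this thread, not confirmed specs):

```python
# Projecting the alleged 6C/12T scores to the rumored 8C flagship by scaling
# ST with clock (3.7 -> 4.2 GHz, XFR?) and MT with cores and clock
# (6C @ 3.4 GHz -> 8C @ 3.6 GHz). Pure linear scaling, so an upper bound.
st_6c, mt_6c = 1888, 12544

st_proj = st_6c * 4.2 / 3.7               # single-thread, clock-scaled
mt_proj = mt_6c * (8 / 6) * (3.6 / 3.4)   # multi-thread, core- and clock-scaled

print(round(st_proj), round(mt_proj))     # ~2143 and ~17709, as above
```

Linear scaling ignores memory-bandwidth sharing across the extra cores, so the real MT figure would likely land a bit below this projection.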
 

IEC

Elite Member
Super Moderator
Jun 10, 2004
14,343
4,952
136
If it were real, and the 6-core result is also real, it would suggest that XFR pushes ST clocks to some 4.3 GHz for this result.
If it is real.

I'm going to assume it's fake until proven otherwise. I've seen a bunch of fakes from Chinese forums already...
 

Teizo

Golden Member
Oct 28, 2010
1,271
31
91
If it were real, and the 6-core result is also real, it would suggest that XFR pushes ST clocks to some 4.3 GHz for this result.
If it is real.
I noticed in the supposed 6-core picture that it was running at only 0.374 V at 3.4 GHz without turbo (unless CPU-Z just isn't reading everything correctly). Color me skeptical on how legit that picture is. Not that I want it to be... it's just that we live in the era of fake news now.

http://pclab.pl/zdjecia/artykuly/blind/2017/02/amdryz/amd582.jpg
 