AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

coercitiv · Feb 15, 2017

Dresdenboy said:
Maybe cache thrashing causes a perf stall/drop (per core) on SMT machines.

To be honest, at this moment I am unable to make any sense of the PCGamer Prime numbers. SMT does not seem to yield any benefit. If we apply frequency scaling to the 7600K score towards either the 7700K @ 3.4GHz or the 7700K at stock, the scores are so close that the delta is within the margin of error.

Nothingness · Feb 15, 2017

itsmydamnation said:
Just guessing (watching taskmgr) they run it on all threads in parallel so on an 8T processor its really a 32mb benchmark, that would make both cache latency and memory latency important.

Not if memory is shared.

Agent-47 said:
the algorithm has a lot of if statements, one every few steps to see if the conditions are met. that's a lot of branching. remember the Fritz Chess benchmark? it had poor results too (8c vs 4c comparison).

Yes, branching is certainly more of an issue than the use of mod.

Asterox · Feb 15, 2017

https://warosu.org/g/thread/50289226

malitze · Feb 15, 2017

Agent-47 said:
the algorithm has a lot of if statements, one every few steps to see if the conditions are met. that's a lot of branching. remember the Fritz Chess benchmark? it had poor results too (8c vs 4c comparison).

It is more a question of the pattern in which these branches are taken or not taken. I took this simple Python implementation of the sieve of Atkin algorithm and added a simple 2-bit predictor to roughly approximate how well such a predictor would work. With a limit of 100000 that is what I got:

Code:

All branches: 440832, taken: 40954 (0.09290160423925668 %), predicted correctly: 399873 (0.9070870535714286 %)

As I said, that is a very rough approximation, but I think the more sophisticated predictors in modern CPUs should not have a hard time with this particular use case, unless I happened to overlook something crucial.
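For anyone who wants to try the same kind of experiment, here is a minimal sketch of the idea. It is not the script used above: it instruments a plain sieve of Eratosthenes instead of the sieve of Atkin, keeps one 2-bit saturating counter per branch site, and the names `TwoBitPredictor` and `sieve_with_branch_stats` are made up for illustration. The numbers it prints will therefore differ from the ones quoted above.

```python
from collections import defaultdict


class TwoBitPredictor:
    """One 2-bit saturating counter per branch site.

    Counter values 0-1 predict "not taken", 2-3 predict "taken".
    """

    def __init__(self):
        self.counters = defaultdict(lambda: 1)  # start at "weakly not taken"
        self.branches = 0
        self.taken = 0
        self.correct = 0

    def record(self, site, outcome):
        """Predict the branch at `site`, score it, then train on the real outcome."""
        outcome = bool(outcome)
        state = self.counters[site]
        predicted_taken = state >= 2
        self.branches += 1
        self.taken += outcome
        self.correct += predicted_taken == outcome
        # Saturating update: move towards 3 when taken, towards 0 when not taken.
        self.counters[site] = min(state + 1, 3) if outcome else max(state - 1, 0)
        return outcome


def sieve_with_branch_stats(limit):
    """Sieve of Eratosthenes where each data-dependent `if` goes through the predictor."""
    bp = TwoBitPredictor()
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    n = 2
    while n * n <= limit:
        if bp.record("is_prime_check", is_prime[n]):
            for multiple in range(n * n, limit + 1, n):
                is_prime[multiple] = False
        n += 1
    primes = [i for i in range(limit + 1) if bp.record("collect", is_prime[i])]
    return primes, bp


primes, bp = sieve_with_branch_stats(100000)
print(f"All branches: {bp.branches}, taken: {bp.taken} ({bp.taken / bp.branches:.2%}), "
      f"predicted correctly: {bp.correct} ({bp.correct / bp.branches:.2%})")
```

Because the predictor is simulated in software, the result depends only on the algorithm and the input limit, not on the CPU the script runs on.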

OrangeKhrush · Feb 15, 2017

My SFF system is aging, the 4460 at 3.2/3.4 can barely hit 2000 on Passmark Singlethread. Add the option of better IPC over moarrrr cores and SFF fus ro dah to the competition.

OrangeKhrush · Feb 15, 2017

Primes.....the only one left is Optimus.

Synthetic of a synthetic who cares

Nothingness · Feb 15, 2017

malitze said:
It is more a question of the pattern in which these branches are taken or not taken. I took this simple Python implementation of the sieve of Atkin algorithm and added a simple 2-bit predictor to roughly approximate how well such a predictor would work. With a limit of 100000 that is what I got:

Code:

All branches: 440832, taken: 40954 (0.09290160423925668 %), predicted correctly: 399873 (0.9070870535714286 %)

As I said that is a very rough approximation but I think more sophisticated predictors in modern CPUs should not have a hard time with this particular use case, unless I happened to overlook something crucial

The optimized reference implementation of Bernstein has more than 2% of branch mispred on an Ivy Bridge:

Code:

primegen-0.97$ perf stat -e instructions,branch-misses,branches,cpu-cycles ./primespeed 32000000
(...)
1973815 primes up to 32000000.
(...)
Performance counter stats for './primespeed 32000000':
       145,635,054      instructions              #    2.26  insns per cycle       
           309,039      branch-misses             #    2.16% of all branches       
        14,316,256      branches                                                   
        64,559,941      cpu-cycles                                                 
       0.028545228 seconds time elapsed

That is 2.1 MPKI which is indeed rather low (SPECint 2000 has about 5 MPKI on such a machine).
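For reference, the derived figures can be recomputed from the raw counters in the perf output above. This is just a small sketch of the arithmetic; the helper names `mpki`, `miss_rate` and `ipc` are my own, not part of perf.

```python
def mpki(branch_misses, instructions):
    """Branch mispredictions per 1000 instructions."""
    return branch_misses / instructions * 1000


def miss_rate(branch_misses, branches):
    """Fraction of executed branches that were mispredicted."""
    return branch_misses / branches


def ipc(instructions, cycles):
    """Instructions retired per CPU cycle."""
    return instructions / cycles


# Raw counters copied from the perf stat run above (primespeed 32000000).
instructions = 145_635_054
branch_misses = 309_039
branches = 14_316_256
cycles = 64_559_941

print(f"MPKI:      {mpki(branch_misses, instructions):.2f}")
print(f"Miss rate: {miss_rate(branch_misses, branches):.2%}")
print(f"IPC:       {ipc(instructions, cycles):.2f}")
```

MPKI is normalised by instruction count rather than branch count, which makes it easier to compare workloads with different branch densities.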

JDG1980 · Feb 15, 2017

How did Sandy Bridge do in the PassMark Prime Number benchmark? No one seems to break these down - they just give an aggregate score for all PassMark benches. I'm curious if there was something in Haswell and up that gave Intel a big edge here, or if it's just a straight trend line for all the recent Core CPUs.

coercitiv · Feb 15, 2017

Dresdenboy said:
Being a 4T machine having 2 fast DIMMs the analysis might become skewed a bit. I'll include your numbers, too. Maybe cache thrashing causes a perf stall/drop (per core) on SMT machines.

Are you ready for more fun?! I sure am!

Got home and ran the benchmark on my mobile Haswell 4c/8t @ fixed 3.4GHz (DDR3-1600 CL11, 6MB L3). The scores were:

25-27 default
29-31 when setting affinity for logical cores 0-2-4-6 only
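If anyone wants to repeat the affinity test, the mask for "one thread per core" can be worked out as below. This is only a sketch, and it assumes the two SMT siblings of each core are numbered next to each other (0/1, 2/3, ...), which is the usual layout on Windows but is not guaranteed everywhere. The helper name `affinity_mask` is made up.

```python
def affinity_mask(logical_cpus):
    """Build a CPU affinity bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in logical_cpus:
        mask |= 1 << cpu
    return mask


# Logical CPUs 0, 2, 4, 6: one hardware thread on each of the four cores
# (assumes SMT siblings are numbered adjacently).
one_thread_per_core = [0, 2, 4, 6]
print(hex(affinity_mask(one_thread_per_core)))  # 0x55

# On Linux the same set could be applied to the current process with
# os.sched_setaffinity(0, set(one_thread_per_core)); that call is Linux-only.
```

On Windows the hex value is what `start /affinity` expects, or the same cores can be ticked by hand in Task Manager.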

lobz · Feb 15, 2017

coercitiv said:
8 bots 8 threads...

now that would be utterly surprising as a cause

Agent-47 · Feb 15, 2017

malitze said:
It is more a question of the pattern in which these branches are taken or not taken. I took this simple Python implementation of the sieve of Atkin algorithm and added a simple 2-bit predictor to roughly approximate how well such a predictor would work. With a limit of 100000 that is what I got:

Code:

All branches: 440832, taken: 40954 (0.09290160423925668 %), predicted correctly: 399873 (0.9070870535714286 %)

As I said that is a very rough approximation but I think more sophisticated predictors in modern CPUs should not have a hard time with this particular use case, unless I happened to overlook something crucial

not bad for a first post!

we already know Intel has good branch prediction, so you can have a conclusive answer if you run it on Zen and then compare to Intel.

do you mind sharing the code? I would like to compare the numbers on FX. if an FX has the same level of performance, we know for sure the branch prediction requirements of this algorithm are not relevant to the poor scores

tamz_msc · Feb 15, 2017

malitze said:
It is more a question of the pattern in which these branches are taken or not taken. I took this simple Python implementation of the sieve of Atkin algorithm and added a simple 2-bit predictor to roughly approximate how well such a predictor would work. With a limit of 100000 that is what I got:

Code:

All branches: 440832, taken: 40954 (0.09290160423925668 %), predicted correctly: 399873 (0.9070870535714286 %)

As I said that is a very rough approximation but I think more sophisticated predictors in modern CPUs should not have a hard time with this particular use case, unless I happened to overlook something crucial

Your percentage figures have the decimal point shifted two places left.

malitze · Feb 15, 2017

Agent-47 said:
not bad for a first post!

we already know Intel has good branch prediction, so you can have a conclusive answer if you run it on Zen and then compare to Intel.

do you mind sharing the code? I would like to compare the numbers on FX. if an FX has the same level of performance, we know for sure the branch prediction requirements of this algorithm are not relevant to the poor scores

I wouldn't mind, but since it is just simulating a simple branch predictor in software it won't make a difference if run on a different CPU. A much better way, one that would actually consider the hardware it is run on, was shown by Nothingness a few posts ago, if you have access to some kind of Linux.

I ran it on my i7-3520m for comparison:

Code:

primegen-0.97]$ perf stat -e instructions,branch-misses,branches,cpu-cycles ./primespeed 32000000
(...)
 
 Performance counter stats for './primespeed 32000000':
 
       145,055,943      instructions:u            #    2.85  insn per cycle                                           
           295,536      branch-misses:u           #    2.08% of all branches       
        14,199,131      branches:u                                                 
        50,950,712      cpu-cycles:u                                               
 
       0.025605543 seconds time elapsed

inf64 · Feb 15, 2017

You are reading too much into these prime benchmarks. It is just one workload, and as can be seen from other tests it is not reflective of the general performance of the core. Every core has some weak(er) points, so who cares about a few tests? For desktop we have several things that matter: rendering, encoding, gaming, streaming (while gaming) and multitasking. In all of these scenarios Ryzen will be good. If it "sux" in something vs its main competition, it had better be a wprime benchmark, because nobody will care.

AtenRa · Feb 15, 2017

inf64 said:
You are reading too much into these prime benchmarks. It is just one workload, and as can be seen from other tests it is not reflective of the general performance of the core. Every core has some weak(er) points, so who cares about a few tests? For desktop we have several things that matter: rendering, encoding, gaming, streaming (while gaming) and multitasking. In all of these scenarios Ryzen will be good. If it "sux" in something vs its main competition, it had better be a wprime benchmark, because nobody will care.

I'm sure some will make it the second most important benchmark over the next few weeks.

Greyguy1948 · Feb 15, 2017

If we talk about branches in SPECint more info is here:
https://www.spec.org/workshops/2007...ance_Characterization_SPEC_CPU_Benchmarks.pdf

429.mcf and 471.omnetpp are hard to predict.

Nothingness · Feb 15, 2017

inf64 said:
You are reading too much into these prime benchmarks. It is just one workload, and as can be seen from other tests it is not reflective of the general performance of the core. Every core has some weak(er) points, so who cares about a few tests? For desktop we have several things that matter: rendering, encoding, gaming, streaming (while gaming) and multitasking. In all of these scenarios Ryzen will be good. If it "sux" in something vs its main competition, it had better be a wprime benchmark, because nobody will care.

Come on, we have to waste our time while waiting for real benchmarks

Nothingness · Feb 15, 2017

Greyguy1948 said:
If we talk about branches in SPECint more info is here:
https://www.spec.org/workshops/2007...ance_Characterization_SPEC_CPU_Benchmarks.pdf

429.mcf and 471.omnetpp are hard to predict.

These are not branch mispredictions, but cache misses. For branch mpred you want astar or gobmk (Fig 3).

.vodka · Feb 15, 2017

JDG1980 said:
How did Sandy Bridge do in the PassMark Prime Number benchmark? No one seems to break these down - they just give an aggregate score for all PassMark benches. I'm curious if there was something in Haswell and up that gave Intel a big edge here, or if it's just a straight trend line for all the recent Core CPUs.

I originally did this vs the Ryzen baseline when it showed up, but decided to run some extra numbers.

CPU mark, 2500k @ 4.5GHz, fixed 10-11-10-30-1T timings

Memory mark, 2500k @ 4.5GHz, fixed 10-11-10-30-1T timings

I might have made a mistake here and there since I feel like crap (stupid cold/flu-ish), I could do the same with the CPU clocked up to 5.1GHz if needed.

I realize I'm varying bandwidth and latency at the same time having left timings equal throughout the run, but the results mirror itsmydamnation's Ivy Bridge results. Sandy and Ivy are pretty much the same at a high level after all, so it was expected.

Extrapolating to Ryzen, its roughly 14ns result could be compared to my DDR3-1333 15ns run. Look how much performance was left on the table by not using faster RAM in latency-sensitive workloads.

I know that if I get my hands on Ryzen I'll get sticks capable of what my DDR3-1866 does (around 10ns or better; DDR4-3000 CAS 15 would be a perfect 10ns to start with), so as not to unnecessarily gimp the CPU.
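The nanosecond figures above follow from simple arithmetic on data rate and CAS latency. Here is a rough sketch of that conversion; it assumes the quoted numbers refer to CAS latency only (not full load-to-use memory latency) and that the first value of the 10-11-10-30 timings is the CAS latency. The function name `cas_latency_ns` is made up.

```python
def cas_latency_ns(data_rate_mt_s, cas_cycles):
    """Convert CAS latency from memory clock cycles to nanoseconds.

    DDR transfers twice per clock, so the memory clock in MHz is half the
    data rate in MT/s, and one clock cycle lasts 2000 / data_rate nanoseconds.
    """
    return cas_cycles * 2000 / data_rate_mt_s


# DDR3 configurations at CL10, as in the fixed 10-11-10-30 timings (assumed CAS = 10).
for rate in (1333, 1866):
    print(f"DDR3-{rate} CL10: {cas_latency_ns(rate, 10):.2f} ns")

# The DDR4 kit mentioned as a starting point for Ryzen.
print(f"DDR4-3000 CL15: {cas_latency_ns(3000, 15):.2f} ns")
```

With fixed timings, raising the data rate lowers the latency in nanoseconds as well as raising bandwidth, which is why both vary together in the runs above.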

lopri · Feb 15, 2017

That is awesome work. :beer:

cytg111 · Feb 15, 2017

Yea, great work. Damn, that scaling. Now to keep speed the same and only mess with latencies?

KTE · Feb 15, 2017

XFR to me is a gimmick, except for mobile users, and maybe, just maybe, standard Joe Bloggs.

Clocks are limited as it is. Keeping a constant FO4 is fine, but the problems with a 14nm high-clock design are the tiny wires, RC not scaling, major EM issues, DP complexity, huge fin variations affecting current, the end of Dennard scaling meaning high current densities and therefore high localized heat with ~30C deltas, and Vt variability being a significant showstopper. Plenty of research papers have discussed these at length.

IBM's J. Warnock had a research paper discussing this based on their own 4-5.7GHz chips. And these are watercooled/chilled-watercooled chips still facing these issues.

If XFR is dependent on average chip temps, that might work well, but if a single hotspot kills XFR, then it wouldn't be of much use except as a benchmark winner.

Voltages for Turbo and XFR clocks are also key here.

Sent from HTC 10
(Opinions are own)

CentroX · Feb 15, 2017

Someone on NeoGAF said that GloFo's 14nm process is equivalent to Intel's 22nm. Are they really that far behind?

Sven_eng · Feb 15, 2017

CentroX said:
Someone on NeoGAF said that GloFo's 14nm process is equivalent to Intel's 22nm. Are they really that far behind?

Haswell-E is close to 2x the size of Ryzen, so what do you think?

revanchrist · Feb 15, 2017

CentroX said:
Someone on NeoGAF said that GloFo's 14nm process is equivalent to Intel's 22nm. Are they really that far behind?

https://www.semiwiki.com/forum/content/6498-2017-leading-edge-semiconductor-landscape.html
https://www.semiwiki.com/forum/content/6160-2016-leading-edge-semiconductor-landscape.html

Intel_10nm > TSMC_10nm > Samsung/Glofo_10nm > Intel_14nm > Samsung/Glofo_14nm > TSMC_16nm > Intel_22nm
