AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

TurtleCrusher · Aug 25, 2016

ShintaiDK said:
Same reason why the Zen benchmark can't be trusted on so many levels.

It won't be trusted until a variety of benchmarks are released from independent or verified sources. A little optimism never hurt, though.

The Stilt · Aug 25, 2016

Scholzpdx said:
Why would we download custom-compiled builds when there's standard versions out there, such as blenchmark? As a developer, it's pretty damn easy to sneak in a few instructions that would cripple certain architectures.

I guess you didn't bother to read the previous posts?
MSVC 2013 is old, and ICC generally provides better performance than any other compiler. I wanted to see whether the anomalies (ultra-high SMT yield) in Cycles were caused by the compiler itself, and also what kind of performance impact AVX2 makes.

Mr Evil · Aug 25, 2016

Scholzpdx said:
As a developer, it's pretty damn easy to sneak in a few instructions that would cripple certain architectures.

Arachnotronic said:
Really now? Some examples, please.

For instance, if you can arrange to have AVX-SSE transitions, you will hurt some architectures more than others. Or you can have faulty feature detection that fails to optimize for some architectures.
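
A minimal sketch of the "faulty feature detection" case, assuming a dispatcher that picks a code path at startup. The kernel names and both functions are hypothetical, made up for illustration; they are not taken from Blender or any real library.

```python
# Hypothetical kernel names; the dispatch logic is the point, not the kernels.

def pick_kernel_by_vendor(vendor, flags):
    """Faulty dispatch: gates the fast path on the CPU vendor string."""
    if vendor == "GenuineIntel" and "avx" in flags:
        return "avx_kernel"
    return "sse2_kernel"


def pick_kernel_by_flags(vendor, flags):
    """Fair dispatch: looks only at the feature flags the CPU advertises."""
    if "avx" in flags:
        return "avx_kernel"
    return "sse2_kernel"


# An AVX-capable CPU from a vendor the faulty check does not recognise.
amd_flags = {"sse2", "sse4_1", "avx"}
print(pick_kernel_by_vendor("AuthenticAMD", amd_flags))  # sse2_kernel
print(pick_kernel_by_flags("AuthenticAMD", amd_flags))   # avx_kernel
```

Both dispatchers behave the same on the vendor the first one was written for, which is why this kind of slowdown is easy to miss unless the binary is tested on more than one architecture.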

Abwx · Aug 25, 2016

The Stilt said:
I guess you didn't bother to read the previous posts?
MSVC 2013 is old, and ICC generally provides better performance than any other compiler. I wanted to see whether the anomalies (ultra-high SMT yield) in Cycles were caused by the compiler itself, and also what kind of performance impact AVX2 makes.

The single-frame render I posted takes 54.3s on an Athlon 845 and 54s on an i3 4340.
To get the same time at the same frequency as the i3, Zen's 40% IPC improvement would have to be augmented by a 30% SMT gain, or any other combination that yields 1.82x the throughput per core relative to XV; otherwise it couldn't match BDW in AMD's demo...

You'll notice that the Athlon's 2 FPUs match the i3's 2 FPUs, and we know that Zen has easily as much FP resource as a module..
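
The 1.82x figure is just the two gains multiplied together. A quick sketch of that arithmetic, using only the numbers from the post; the helper function name is made up for illustration.

```python
ipc_gain = 1.40   # the 40% IPC uplift over XV quoted in the post
smt_gain = 1.30   # the 30% SMT yield the post assumes

combined = ipc_gain * smt_gain
print(f"{combined:.2f}x per-core throughput vs XV")  # 1.82x


def required_smt_gain(target, ipc):
    """SMT yield needed to reach `target` throughput for a given IPC gain."""
    return target / ipc


# "Any other proportion": a smaller IPC gain needs a larger SMT yield.
print(f"{required_smt_gain(1.82, 1.30):.2f}x")  # 1.40x
```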

DrMrLordX · Aug 25, 2016

Arachnotronic said:
It's marketing, they are going to put their best foot forward. Come on now.

When was the last time fp was AMD's strong suit? Look at all the Zen core diagrams. Does that look fp-heavy to you?

krumme · Aug 26, 2016

DrMrLordX said:
When was the last time fp was AMD's strong suit?

When K6-2 paved its way into the laptop market?

DrMrLordX · Aug 26, 2016

krumme said:
When K6-2 paved its way into the laptop market?

Hmm, not sure about what was in the lappie market back then, but k6-2 (and k6-3) had the same non-pipelined fp problem that the k6 had, making it vastly inferior to the Pentium II/III in fp operations. I think the last time AMD had a clear fp advantage in anything was 2005.

krumme · Aug 26, 2016

DrMrLordX said:
Hmm, not sure about what was in the lappie market back then, but k6-2 (and k6-3) had the same non-pipelined fp problem that the k6 had, making it vastly inferior to the Pentium II/III in fp operations. I think the last time AMD had a clear fp advantage in anything was 2005.

Yes. I am desperately trying to get the point across that a potent FPU costs resources beyond the FPU part itself. You end up with a less efficient and more costly product. And it has consequences. While the K7 entered the enthusiast desktop market, K6-2 was still selling well in laptops. With a dog-slow FPU at even less than Pentium level, as I recall.

If there is one reason to have hopes for Zen, it's that they have scaled down the FPU vs Intel's solutions. It takes enormous pressure off die size, front end, and development cost. It's imo a straightforward win. The reality is that if AMD had gone for a 256b FPU, the cache system and front end would have ended up slower and less efficient. What market is there for such a product? The great thing about the current solution is that e.g. an effective cache also helps the FPU. Saved production cost and improved general performance at the same time.

Intel is going to fight off IBM solutions and the like. Their core is meant to cover a very wide server market. By not addressing a very small market and using a smallish FPU, AMD wins a far, far bigger market all the way down to near ARM levels using the same arch. And it gains a much-needed cost and efficiency uplift vs an opponent with an order of magnitude more resources.

If an AMD Zen core incl. L2 is 5.5 to 6mm2 (180mm2 for 8c) and as efficient as claimed, it's in a good position imo. Let's see the size. BD was a lot of wasted space...

Zor Prime · Aug 26, 2016

Yeah, K6-anything didn't have a pipelined FPU, which really showed up in Quake 2 benchmarks. With K6-2/3 came 3DNow!, and a patch was released that lifted FPS quite a lot. Integer performance per MHz was generally faster than the competing Intel offering at the time, though. AMD eventually rectified its lackluster FPU performance in K7/K8.

majord · Aug 26, 2016

Rectified, and well and truly surpassed P6.

As mentioned, AMD had strong FP performance right up until Core, though defining a strong 'FPU' has become more complicated now.

Tuna-Fish · Aug 27, 2016

DrMrLordX said:
Hmm, not sure about what was in the lappie market back then, but k6-2 (and k6-3) had the same non-pipelined fp problem that the k6 had, making it vastly inferior to the Pentium II/III in fp operations.

3dnow! was at the time the world's best at what it did. I know a place that bought a datacenter full of them because they were by far (like 5 times or more) the cheapest way to buy 32-bit FP throughput, and they used them for rendering.

Of course, since they required you to write special code for an AMD-only extension with no good compiler support, only very few applications actually made any use of it.

DrMrLordX · Aug 27, 2016

Tuna-Fish said:
3dnow! was at the time the world's best at what it did. I know a place that bought a datacenter full of them because they were by far (like 5 times or more) the cheapest way to buy 32-bit FP throughput, and they used them for rendering.

Of course, since they required you to write special code for an AMD-only extension with no good compiler support, only very few applications actually made any use of it.

Interesting perspectives on 3DNow. I never knew anyone took it seriously at any level. It certainly didn't make much of a splash in the gaming arena, especially not after Intel launched SSE support on Pentium IIIs. So point taken.

As to AMD making a good choice with 128-bit FMACs and the like . . . time will tell. Intel has managed to launch mainstream processors with 256-bit FMACs and full AVX2 support (albeit with the tendency to downclock to deal with increased heat). Clearly it can be done, and at least in Intel's case, the performance/design penalties for doing so have not knocked them off the top of the market. The 6700k has no diesize and frontend problems, at least nothing that Intel couldn't handle anyway. Such a featureset may well have bloated Skylake's development costs well beyond the realm of the reasonable, never mind that Intel has a huge R&D budget so . . . what amounts to reasonable for them might not be so reasonable for anyone else.

I still think anyone serious about instruction-level parallelism may be stuck on Intel for awhile. If AMD still expects people to offload those kinds of workloads to GPUs then they'd better step up their development game. HSA is still basically dormant for their APU products, and the only firm out there producing low-latency, SVM-capable interconnects for dGPUs is Nvidia. PCI-SIG isn't doing AMD any favors either. Thus far, Zen hasn't done anything to change that situation.

That's one of the reasons why I'm so surprised that AMD actually chose something like Blender to demonstrate Zen rather than an Int-heavy benchmark. I'm also surprised that AVX2 seems to do nothing for Blender. So maybe that's part of the reason why they chose it.

Abwx · Aug 27, 2016

DrMrLordX said:
I'm also surprised that AVX2 seems to do nothing for Blender. So maybe that's part of the reason why they chose it.

So if it performs well it's because AMD is tricking the thing, and surely not because AVX isn't a cure for throughput limitations? Hey, we were told that it allows double the computation capability...

https://indico.cern.ch/event/327306...ttachments/635800/875267/HaswellConundrum.pdf

Go directly to page 5....

DrMrLordX · Aug 28, 2016

Abwx said:
So if it performs well it's because AMD is tricking the thing, and surely not because AVX isn't a cure for throughput limitations? Hey, we were told that it allows double the computation capability...

https://indico.cern.ch/event/327306...ttachments/635800/875267/HaswellConundrum.pdf

Go directly to page 5....

I didn't say anything about AMD tricking anything. Settle down, Beavis. Yes there are situations where AVX2 might not help (such as loops working outside of L1/L2). And then there are some where it really helps. It sure lights a fire under y-cruncher. I don't know enough about Blender's render engine to say whether or not it could be useful, but unless it's full of 64-bit divisions and/or sqrts it seems like it could help somewhere.

itsmydamnation · Aug 28, 2016

DrMrLordX said:
I didn't say anything about AMD tricking anything. Settle down, Beavis. Yes there are situations where AVX2 might not help (such as loops working outside of L1/L2). And then there are some where it really helps. It sure lights a fire under y-cruncher. I don't know enough about Blender's render engine to say whether or not it could be useful, but unless it's full of 64-bit divisions and/or sqrts it seems like it could help somewhere.

Like what?

AVX2 != FMA
AVX2 != 256bit FP ops
AVX2 = 256bit int SIMD ops
AVX = 256bit FP ops
AVX2 = some bit manipulation/shifting/memory/gather instructions that don't exist in AVX and that can be used on XMM or YMM registers.

I know you keep trying to say what AVX2 is, but look at
http://www.agner.org/optimize/instruction_tables.pdf and read the start of the document for a general description of each instruction set......

So what exactly are the operations in AVX2 that are going to significantly help a render (floating point)? Why do we keep pretending like AVX2 is slow on AMD? 256-bit operations will be at 1/2 the peak throughput of Haswell and later, but that's because it has 1/2 the execution width; that's exactly the same situation as AVX.
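
The "half the execution width, half the peak throughput" point can be put as back-of-the-envelope arithmetic. This is a sketch under stated assumptions: the two-unit, 256-bit case matches the commonly documented Haswell FMA layout, while the two-unit, 128-bit case is only the "half width" scenario from the post, not a confirmed Zen figure. The function name is made up.

```python
def peak_sp_flops_per_cycle(fma_units, vector_bits):
    """Theoretical single-precision FLOPs per cycle per core."""
    lanes = vector_bits // 32        # 32-bit float lanes per vector
    return fma_units * lanes * 2     # one FMA = multiply + add = 2 FLOPs per lane


wide = peak_sp_flops_per_cycle(fma_units=2, vector_bits=256)    # assumed Haswell-like
narrow = peak_sp_flops_per_cycle(fma_units=2, vector_bits=128)  # assumed half-width design

print(wide, narrow, narrow / wide)  # 32 16 0.5
```

These are peak numbers only; whether a renderer gets anywhere near them depends on how much of its code is vectorized and on memory bandwidth.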

IntelUser2000 · Aug 28, 2016

AVX2 != FMA

What are you talking about? AVX2 is about FMA. It also adds 256-bit Integer.

http://www.anandtech.com/show/6355/intels-haswell-architecture/8

AVX = 256-bit
AVX2 = 256-bit + FMA
AVX3 = 2x AVX2

itsmydamnation · Aug 28, 2016

No, FMA is a separate instruction set; you can have FMA without AVX2 support if you design it like that. You know how, when you look at CPU-Z, FMA is listed with its own flag for support?

You know how Piledriver supports FMA3 but not AVX2...
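
One way to see that these are separate feature bits is to look at the CPU flags an OS reports. The sketch below parses a Linux /proc/cpuinfo style flags string; `avx`, `avx2` and `fma` are the flag names Linux uses, but the two sample strings are abbreviated illustrations, not dumps from real machines, and the function name is made up.

```python
def simd_support(flags_line):
    """Report AVX, AVX2 and FMA3 support from a space-separated flags string."""
    flags = set(flags_line.split())
    return {name: name in flags for name in ("avx", "avx2", "fma")}


# Abbreviated, illustrative flag strings (assumptions, not real dumps).
piledriver_like = "sse2 sse4_1 sse4_2 avx fma fma4 xop"
haswell_like = "sse2 sse4_1 sse4_2 avx avx2 fma"

print(simd_support(piledriver_like))  # {'avx': True, 'avx2': False, 'fma': True}
print(simd_support(haswell_like))     # {'avx': True, 'avx2': True, 'fma': True}
```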

DrMrLordX · Aug 28, 2016

itsmydamnation said:
So what exactly are the operations in AVX2 that are going to significantly help a render (floating point).

Mmmmkay:

In 2014, Intel introduced the Intel® Xeon®
processor E5-2600 v3 family, which
includes the Intel® Advanced Vector
Extensions 2 (Intel® AVX2) instruction
set. Intel AVX2 extends Intel SSE and Intel
AVX with 256-bit integer instructions and
also adds support for floating-point
fused multiply-add instructions, and
gather operations.

from http://www.intel.com/content/www/us...n-e5-v3-advanced-vector-extensions-paper.html

It doesn't add much on the fp side, but it does add something.

If you insist that only AVX - not AVX2 - has to do with fp, then fine. Not going to argue with you since we're getting off-topic anyway.

The point still stands that AMD's performance with AVX and AVX2 is quite poor up through XV, and their support in Zen/Summit Ridge doesn't seem to be much better. I'm not currently certain of whether The Stilt disabled both AVX and AVX2 in his custom blender builds when showing that Intel CPUs had no difference in performance with and without AVX2. But it's still a curious situation. You'd think that AVX support would make a difference even if AVX2 does not.

Why do we keep pretending like AVX2 is slow on AMD?

. . . because, at least on Carrizo, it is? Carrizo manages much better performance with XOP. The only time I've seen AVX2 directly tested on Carrizo (y-cruncher), Carrizo managed just as well with SSE3. I was not impressed. AVX performance was also quite sad, just as it's sad on my SR.

Dresdenboy · Aug 28, 2016

DrMrLordX said:
. . . because, at least on Carrizo, it is? Carrizo manages much better performance with XOP. The only time I've seen AVX2 directly tested on Carrizo (y-cruncher), Carrizo managed just as well with SSE3. I was not impressed. AVX performance was also quite sad, just as it's sad on my SR.

SR/XV have one integer pipe less in the FPU compared to BD/PD, and 256b usually doesn't gain anything there. AMD uses AVX-128 optimization in GCC.

itsmydamnation · Aug 28, 2016

DrMrLordX said:
The point still stands that AMD's performance with AVX and AVX2 is quite poor up through XV, and their support in Zen/Summit Ridge doesn't seem to be much better. I'm not currently certain of whether The Stilt disabled both AVX and AVX2 in his custom blender builds when showing that Intel CPUs had no difference in performance with and without AVX2. But it's still a curious situation. You'd think that AVX support would make a difference even if AVX2 does not.

There are a few things here. Piledriver AVX performance is fine; there is a slight perf drop with 256-bit, but perf per clock is right around IB, which was the last Intel uarch to have 256/128 load/store.

In SR and EX AMD have done "stuff" to the FPU, I guess in the name of die size and power. As well as losing one of the pipelines, it seems to have odd scheduling/execution behavior where a single core can't get full throughput. I don't know if they have done things that also hurt 256-bit vectors.

. . . because, at least on Carrizo, it is? Carrizo manages much better performance with XOP. The only time I've seen AVX2 directly tested on Carrizo (y-cruncher), Carrizo managed just as well with SSE3. I was not impressed. AVX performance was also quite sad, just as it's sad on my SR.

128bit or 256bit vectors?

The Stilt · Aug 28, 2016

DrMrLordX said:
I'm not currently certain of whether The Stilt disabled both AVX and AVX2 in his custom blender builds when showing that Intel CPUs had no difference in performance with and without AVX2. But it's still a curious situation. You'd think that AVX support would make a difference even if AVX2 does not.

In the previous builds I posted, I only disabled AVX2 (AVX was enabled). Now I made an additional build (57.5MB) with only SSE2, SSE3 and SSE4.1 kernels enabled (CCX_HAS_AVX & CCX_HAS_AVX2 = False).

On Piledriver there was no difference whatsoever, and on Haswell the difference pretty much falls within the margin of error (< 1.5%).

If you want to compare the builds between each other, use "ICL" & "ICLWOAVX2" builds from the previous package with this one.

AVX and AVX2 should definitely help even in pure FP workload. In this package you can find a simple Monte Carlo raytracer (based on SmallPT port). Exactly the same code and build options, excluding the allowed instructions (Arch SSE4.2 / AVX / AVX2). SSE4.2 being the baseline, AVX boosts the performance by 6.1% and AVX2 by 15.8% (tested on Haswell-EP).
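
For reading those two percentages: both are relative to the SSE4.2 baseline, so the step from AVX to AVX2 is smaller than 15.8%. A quick sketch, assuming "boosts the performance by X%" means a throughput ratio of 1 + X/100 (that interpretation is an assumption, not something stated in the post).

```python
avx_vs_sse = 1.061    # AVX build: +6.1% over the SSE4.2 baseline
avx2_vs_sse = 1.158   # AVX2 build: +15.8% over the SSE4.2 baseline

# Gain of the AVX2 build over the AVX build, not over the baseline.
avx2_vs_avx = avx2_vs_sse / avx_vs_sse
print(f"AVX2 over AVX: +{(avx2_vs_avx - 1) * 100:.1f}%")  # +9.1%

# Same result expressed as render time relative to the SSE4.2 build.
avx2_time = 1 / avx2_vs_sse
print(f"AVX2 render time: {avx2_time:.3f} of baseline")  # 0.864
```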

Abwx · Aug 28, 2016

The Stilt said:
AVX boosts the performance by 6.1% and AVX2 by 15.8% (tested on Haswell-EP).

Are you sure that it's not FMA that increases the perf with the AVX2 build?

As pointed out by some members, AVX2 and FMA are two different things, but in Intel's offerings FMA is available only on AVX2-capable CPUs, hence it gets used in AVX2 compilations even though it's not part of that instruction set.

DrMrLordX · Aug 28, 2016

itsmydamnation said:
128bit or 256bit vectors?

Pretty sure 256bit

The Stilt said:
In the previous builds I posted, I only disabled AVX2 (AVX was enabled). Now I made an additional build (57.5MB) with only SSE2, SSE3 and SSE4.1 kernels enabled (CCX_HAS_AVX & CCX_HAS_AVX2 = False).

On Piledriver there was no difference whatsoever, and on Haswell the difference pretty much falls within the margin of error (< 1.5%).

If you want to compare the builds between each other, use "ICL" & "ICLWOAVX2" builds from the previous package with this one.

AVX and AVX2 should definitely help even in pure FP workload. In this package you can find a simple Monte Carlo raytracer (based on SmallPT port). Exactly the same code and build options, excluding the allowed instructions (Arch SSE4.2 / AVX / AVX2). SSE4.2 being the baseline, AVX boosts the performance by 6.1% and AVX2 by 15.8% (tested on Haswell-EP).

Thank you for the clarification.

JoeRambo · Aug 29, 2016

There are 2 ways AVX(1) can help FP performance:

1) The obvious help from 256-bit-wide vector registers that can process 8 single- or 4 double-precision values at the same time. But the compiler can only help so much in extracting extra performance, because code is hard to autovectorize (even harder without hints like __builtin_assume_aligned, no-aliasing guarantees, etc.), and even if it is vectorized, there are natural limits in the data structures. For example, there are only XYZ coordinates in 3D vectors and so on. So 128 -> 256 could bring little if any gain.
Also, SB/IV could not really sustain 256-bit operations if there were too many loads/stores, and only Haswell added enough cache BW to increase

2) The hidden gem of AVX was the so-called VEX encoding scheme: it cleared up the mess of Intel opcodes and created a scheme that was not only sane but also extensible. It did so by allowing instructions to have more operands, so you could reduce register pressure (both for the compiler and the CPU). It was a godsend in 32-bit OS mode and less important in 64-bit OS mode (since that mode allows more XMM registers).

The Stilt said:
AVX and AVX2 should definitely help even in pure FP workload.

For pure FP non-vectorized code, SSE4 already had all that was needed to process it; AVX can only help in subtle ways, as mentioned in (2).
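
The XYZ limit from point (1) can be shown as lane utilization. This sketch assumes the simplest layout, one 3-component single-precision vector per register; a layout that packs several points' X values together would fill the lanes, so the numbers apply only under that assumption. The function name is made up.

```python
def lane_utilization(components, vector_bits, elem_bits=32):
    """Fraction of SIMD lanes doing useful work with one small vector per register."""
    lanes = vector_bits // elem_bits
    used = min(components, lanes)
    return used / lanes


print(lane_utilization(3, 128))  # 0.75  (XYZ in a 4-lane SSE register)
print(lane_utilization(3, 256))  # 0.375 (XYZ in an 8-lane AVX register)
```

Going from 128 to 256 bits adds four lanes, but under this layout none of them get used, which is the "little if any gain" case.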

Nothingness · Aug 29, 2016

The Stilt said:
AVX and AVX2 should definitely help even in pure FP workload. In this package you can find a simple Monte Carlo raytracer (based on SmallPT port). Exactly the same code and build options, excluding the allowed instructions (Arch SSE4.2 / AVX / AVX2).

It would have been nice to put the source code in the archive to see how SmallPT was changed.

SSE4.2 being the baseline, AVX boosts the performance by 6.1% and AVX2 by 15.8% (tested on Haswell-EP).

I agree with Abwx: the AVX2 speedup is most likely due to the use of FMA instructions.

More generally, given how simple SmallPT is, it might spend proportionally more time in pure vectorizable computation than a more complex renderer such as Blender, thus giving AVX/AVX2 use more speedup than you'd get in Blender.

OTOH it's possible that Blender was optimized to do packet processing.

To sum up: it's hard to draw any conclusion on the usefulness of AVX/AVX2 using run times alone.
