The theoretical limit for HT/SMT-2 scaling is +100%, since no more than 2 threads can run per core on Hyperthreading machines - at best you get 200% of single-thread throughput. It might be a fun exercise to produce code that actually gets there. ^^
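Just as a sketch of what that exercise could look like (assuming Linux, g++ and that logical CPUs 0 and 1 are SMT siblings of the same physical core - check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list or lstopo): pin two complementary, latency-bound loops to the two siblings and compare the combined run against each loop alone. If they barely compete for units, the combined time stays near the slower single run, i.e. close to the +100% limit.

```cpp
// Sketch only. Assumes Linux/glibc and that logical CPUs 0 and 1 share a core.
// Build: g++ -O2 -pthread smt_friendly.cpp -o smt_friendly
#include <pthread.h>
#include <sched.h>
#include <chrono>
#include <cstdio>
#include <thread>

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Serial integer multiply-add chain: latency-bound, leaves most issue slots free.
static void int_chain(long iters) {
    volatile long x = 1;                     // volatile keeps the loop from being optimized away
    for (long i = 0; i < iters; ++i) x = x * 3 + 1;
}

// Serial FP divide chain: bound by the divider, barely touches the integer units.
static void fp_chain(long iters) {
    volatile double d = 1e9;
    for (long i = 0; i < iters; ++i) d = d / 1.0000001;
}

// Run one function pinned to one logical CPU and return its wall-clock time.
static double run_alone(void (*fn)(long), int cpu, long iters) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread t([=] { pin_to_cpu(cpu); fn(iters); });
    t.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const long N = 200000000L;
    double a = run_alone(int_chain, 0, N);
    double b = run_alone(fp_chain, 0, N);

    // Now both at once, on the two SMT siblings of the same core.
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([=] { pin_to_cpu(0); int_chain(N); });
    std::thread t2([=] { pin_to_cpu(1); fp_chain(N); });
    t1.join(); t2.join();
    double both = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    // Near the theoretical limit, 'together' approaches max(a, b) instead of a + b.
    std::printf("int alone: %.2fs  fp alone: %.2fs  together: %.2fs\n", a, b, both);
    return 0;
}
```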
The other extreme you cite is a power virus - or close to it: Prime95, which I used. It got slowed down by nearly 50% because the less prioritized threads sitting on the same core occupy the available resources. That's the problem with the missing prioritization of Hyperthreads/logical cores. ATM I don't know where this equalization happens, maybe already during fetch (in alternating cycles?).

That's also what BD did in its shared front-end units, but that part isn't SMT, it's fine-grained MT. So once instructions of the second thread enter the OoO section, the first thread can't do anything against them. Not even trying to occupy the whole core (impossible), or just the AGUs for example, would help, because that thread wouldn't get a sufficient supply of instructions to keep playing this game.
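For the opposite case, a hedged sketch in the same spirit as Prime95 (not its actual code, just a dense AVX FMA kernel): run one copy alone, then one copy on each SMT sibling. On cores where a single thread can already keep the FMA pipes busy, each copy typically drops to roughly half of its solo speed, so the two threads just split the machine. Again assumes Linux, an FMA-capable CPU and siblings on logical CPUs 0 and 1.

```cpp
// Sketch only. Build: g++ -O2 -mfma -pthread smt_hostile.cpp -o smt_hostile
#include <immintrin.h>
#include <pthread.h>
#include <sched.h>
#include <chrono>
#include <cstdio>
#include <thread>

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Throughput-bound 256b FMA kernel: several independent accumulators so one
// thread alone already keeps the FMA ports close to fully busy.
static double fma_kernel(long iters) {
    __m256d b = _mm256_set1_pd(1.0000000001), c = _mm256_set1_pd(1e-9);
    __m256d a0 = c, a1 = c, a2 = c, a3 = c, a4 = c, a5 = c, a6 = c, a7 = c;
    for (long i = 0; i < iters; ++i) {
        a0 = _mm256_fmadd_pd(a0, b, c);
        a1 = _mm256_fmadd_pd(a1, b, c);
        a2 = _mm256_fmadd_pd(a2, b, c);
        a3 = _mm256_fmadd_pd(a3, b, c);
        a4 = _mm256_fmadd_pd(a4, b, c);
        a5 = _mm256_fmadd_pd(a5, b, c);
        a6 = _mm256_fmadd_pd(a6, b, c);
        a7 = _mm256_fmadd_pd(a7, b, c);
    }
    double out[4];
    _mm256_storeu_pd(out, _mm256_add_pd(_mm256_add_pd(a0, a1), _mm256_add_pd(a2, a3)));
    double s = out[0] + out[1] + out[2] + out[3];
    _mm256_storeu_pd(out, _mm256_add_pd(_mm256_add_pd(a4, a5), _mm256_add_pd(a6, a7)));
    return s + out[0] + out[1] + out[2] + out[3];   // keep the result live
}

static double run_fma_on(int cpu, long iters) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread t([=] { pin_to_cpu(cpu); volatile double s = fma_kernel(iters); (void)s; });
    t.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const long N = 100000000L;
    double alone = run_fma_on(0, N);   // one copy, core otherwise idle

    // Two identical copies on the SMT siblings: both fight for the same FMA pipes.
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([=] { pin_to_cpu(0); volatile double s = fma_kernel(N); (void)s; });
    std::thread t2([=] { pin_to_cpu(1); volatile double s = fma_kernel(N); (void)s; });
    t1.join(); t2.join();
    double shared = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    // With one thread already saturating the FP pipes, 'shared' tends toward 2 * 'alone',
    // i.e. each copy runs at roughly half of its solo speed.
    std::printf("alone: %.2fs  two copies on siblings: %.2fs\n", alone, shared);
    return 0;
}
```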
I see what you mean. But among the millions of instructions executed during one thread's timeslice there are so many different types (simple ALU, IMUL, IDIV, AGU, loads, stores, FP SIMD, Int SIMD, FMUL, FADD, shuffle, etc.) that there will always be some room left for instructions from the other thread - especially with 8 or more execution units.
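To make that concrete, here is a made-up, ordinary hot loop annotated with the unit classes it roughly maps to; even this single snippet spreads its work over branches, simple ALU ops, loads/AGUs and FP math, and still can't fill all 8+ ports every cycle, which is exactly the slack a sibling thread can use.

```cpp
// Illustrative only: a hypothetical hot loop, annotated with the execution-unit
// classes each statement roughly maps to on a modern x86 core.
#include <cstdio>
#include <vector>

static float score(const float* w, const int* idx, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {   // simple ALU: counter, compare, branch
        int j = idx[i] & 1023;      // AGU + load, then simple ALU (AND)
        s += w[j] * 1.5f;           // AGU + load, FP multiply + add (possibly fused)
    }
    return s;                       // plenty of issue slots per cycle stay unused
}

int main() {
    std::vector<float> w(1024, 0.5f);
    std::vector<int> idx(1 << 20);
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = static_cast<int>(i * 7);
    std::printf("score = %f\n", score(w.data(), idx.data(), static_cast<int>(idx.size())));
    return 0;
}
```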
Agreed, the realistic range for SMT scaling is more like that. As often said, there will be ST code (like Blender with 1T) that might show even higher IPC on Zen, and there will be other code with lower IPC. That's natural given the wide variety of instruction mixes; 256b AVX alone will be enough to show a difference.
You are right, increased memory fetch latency due to OC'ing is another factor: the DRAM latency in nanoseconds stays roughly the same, so a higher core clock means more stall cycles per miss (e.g. ~70 ns is ~280 core cycles at 4 GHz but ~315 cycles at 4.5 GHz).