The way I see it, code with SMT yields higher than 50% would be a poorly optimized load. Theoretically, if you have a thread that is so well optimized that it never stalls, it would keep a given core 100% busy. In other words, another hyper-thread may never get a chance to run on it. We can examine some code which has 0% SMT scaling to learn what kind of load that is. Likely highly repetitive code with no branch mispredictions or cache misses.
The theoretical limit for HT/SMT-2 scaling is 100%, as no more than two threads (200% of a single thread's throughput) can run on one core of a Hyperthreading machine. It might be a fun exercise to produce such code. ^^
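For fun, here's a minimal C sketch of roughly what such a load might look like: four independent add chains that stay in registers, no memory accesses inside the loop, and one perfectly predicted branch. Whether it actually saturates the integer ports and leaves ~0% for a sibling thread depends on the core and on what the compiler emits (it might vectorize the loop, for instance), so check the generated assembly before trusting any numbers.

```c
/* alu_burner.c - attempt at a loop that keeps a core's integer ALUs busy
 * every cycle.  Only an illustration; the compiler may transform the
 * loop, so inspect the assembly. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

uint64_t alu_burner(uint64_t iters)
{
    uint64_t a = 1, b = 2, c = 3, d = 5;     /* independent accumulators */
    for (uint64_t i = 0; i < iters; i++) {
        a += i;                              /* four add chains with no  */
        b += i ^ 1;                          /* cross-dependencies, so   */
        c += i ^ 2;                          /* several can issue in the */
        d += i ^ 3;                          /* same cycle               */
    }
    return a ^ b ^ c ^ d;                    /* keep the result live     */
}

int main(int argc, char **argv)
{
    uint64_t iters = argc > 1 ? strtoull(argv[1], NULL, 0) : 1000000000ULL;
    printf("%llu\n", (unsigned long long)alu_burner(iters));
    return 0;
}
```

Compile with something like gcc -O2, pin two copies to the two logical cores of one physical core, and see whether the total wall time roughly doubles (~0% scaling) or stays put (~100%).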
The other extreme you cite is a power virus, or close to it: Prime95, which I used. It got slowed down by nearly 50% thanks to the less prioritized threads sitting in the same cores and occupying available resources. That's the thing with missing prioritization of Hyperthreads/logical cores. ATM I don't know where this equalization happens, maybe already during fetch (in alternating cycles?). This is also what BD did in its shared front-end units, but that part is not SMT, it's fine-grained MT. So once instructions of the second thread enter the OoO section, the first thread can do nothing against it. Not even occupying the whole core (impossible) or just the AGUs, for example, would help, because that thread wouldn't get enough supply of instructions to keep playing this game.
"only one thread can execute at the same time with shared resources" What I mean by that is that while red boxes are for instance all competitively shared, given the nature of identical workloads in other threads they will all overlap on same instructions. Meaning they will be bottle necked in a same way. So either one thread can run or the other.. not two at the same time. They are competing for the time on the red boxes. So for both threads to get an equal share of execution there at least need to be 50% of time when one thread stalls, to give another thread an opportunity to run. And that would result in near 100% SMT scaling.
I see what you mean. But amongst the millions of instructions executed during one thread's timeslice, there are so many different types (simple ALU, IMUL, IDIV, AGU, loads, stores, FP SIMD, int SIMD, FMUL, FADD, shuffle, etc.) that there will always be some room for other instructions, especially with 8 or more execution units.
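If anyone wants to test this kind of contention directly, here's a rough Linux/pthreads harness (a sketch, not a polished tool): it runs a kernel on one logical CPU, then on two, and reports the scaling. It assumes logical CPUs 0 and 1 are SMT siblings, which isn't true on every system -- check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list first. Swap in whatever kernel you want to study: identical port-bound loops for the worst case, a mix of int/FP/memory work for the best.

```c
/* smt_scaling.c - rough harness: run a kernel on one logical CPU, then on
 * two logical CPUs assumed to share a physical core, and compare aggregate
 * throughput.  Assumes Linux + glibc and that CPUs 0 and 1 are siblings.
 * Compile: gcc -O2 -pthread smt_scaling.c -o smt_scaling */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 500000000ULL

static volatile uint64_t sink;   /* keeps the kernel's result "used" */

/* Replace with whatever kernel you want to study. */
static uint64_t kernel(uint64_t iters)
{
    uint64_t a = 1, b = 2, c = 3, d = 5;
    for (uint64_t i = 0; i < iters; i++) {
        a += i; b += i ^ 1; c += i ^ 2; d += i ^ 3;
    }
    return a ^ b ^ c ^ d;
}

static void *worker(void *arg)
{
    int cpu = *(const int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    sink = kernel(ITERS);
    return NULL;
}

static double run(int nthreads, int cpus[])
{
    struct timespec t0, t1;
    pthread_t tid[2];

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, &cpus[i]);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    int cpus[2] = { 0, 1 };                  /* assumed SMT siblings */
    double one = run(1, cpus);
    double two = run(2, cpus);
    /* Each thread does the same amount of work, so total work doubles
     * with two threads; scaling is the extra aggregate throughput.   */
    double scaling = (2.0 * one / two - 1.0) * 100.0;

    printf("1 thread: %.2f s, 2 threads: %.2f s, SMT scaling: ~%.0f%%\n",
           one, two, scaling);
    return 0;
}
```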
Ultimately, in regards to Ryzen's single-thread performance, I see Ryzen having perhaps 50% scaling as the worst-case scenario. And in the best case (for single-thread performance) it could even have less than 40% scaling, which would mean that, at least in Blender-type workloads, it could have better single-thread performance than Broadwell-E.
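Just to spell out the arithmetic behind that, with made-up numbers (none of these are real benchmark results): if the measured SMT-on multi-threaded score is taken as fixed, a lower SMT gain implies a higher per-core single-thread share of it.

```c
/* back_of_envelope.c - hypothetical numbers only: holding the SMT-on
 * multi-threaded score fixed, a lower SMT gain implies a higher
 * per-core single-thread score. */
#include <stdio.h>

int main(void)
{
    double mt_score = 100.0;                    /* assumed SMT-on score */
    int    cores    = 8;                        /* physical cores       */
    double gains[]  = { 0.50, 0.40, 0.30 };     /* assumed SMT gains    */

    for (int i = 0; i < 3; i++) {
        /* mt_score = cores * st_per_core * (1 + gain)
         * => st_per_core = mt_score / (cores * (1 + gain)) */
        double st_per_core = mt_score / (cores * (1.0 + gains[i]));
        printf("SMT gain %2.0f%% -> implied per-core ST score %.2f\n",
               gains[i] * 100.0, st_per_core);
    }
    return 0;
}
```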
Agreed. The realistic range for SMT scaling is more like that. As often said, there will be ST code (like Blender with 1T) which might have even higher IPC on Zen, and there will be other code with lower IPC. That's natural with the wide variety of instruction mixes. 256b AVX will be enough to show a difference.
Higher clocks could increase the scaling, because by OC'ing the CPU you're effectively increasing the memory fetch latency in core cycles, producing more opportunity for hyperthreading.
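To put rough numbers on that (the ~80 ns DRAM round-trip below is just an assumed figure, not a measurement): if the latency stays about constant in nanoseconds, the same miss costs more core cycles at a higher clock, i.e. more idle issue slots a sibling thread could fill.

```c
/* latency_cycles.c - the OC argument in numbers: with a roughly constant
 * DRAM round-trip in nanoseconds, a higher core clock means each miss
 * costs more core cycles.  80 ns is only an assumed latency. */
#include <stdio.h>

int main(void)
{
    double dram_ns = 80.0;                        /* assumed miss latency */
    double clocks_ghz[] = { 3.0, 3.5, 4.0, 4.5 };

    for (int i = 0; i < 4; i++)
        printf("%.1f GHz: a %.0f ns miss costs ~%.0f core cycles\n",
               clocks_ghz[i], dram_ns, dram_ns * clocks_ghz[i]);
    return 0;
}
```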
You are right. Increased memory fetch latency (in core cycles) due to OC'ing is another factor.