New Zen microarchitecture details

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Now we have established some kind of agreement that a 95w tdp ~180mm2 cpu on a low freq process is not 2% faster than a 140w tdp 240mm2 die on a high freq process costing 1100 usd.

It would be a bit more interesting - to say the least - if we got some info about the efficiency.
Very little has been written about it imo, relative to its crucial importance.
A statistical approach: We might have seen a benchmark residing on the tail of the distribution. Let's wait for more samples (not ES).

Regarding die size:
AMD left out a lot of logic related to Intel's powerful 2x256b SIMD bandwidth (from caches to EX). And they do this 8 times per Zeppelin die.

Also, some article mentioned that they use a high-density variant of the process (not sure if they mean SoC-density metal layers or HD libs). Either would point to both a small die size and power-efficient (knee of the curve) clocking in the high-core-count range.
 

KTE

Senior member
May 26, 2016
478
130
76
I've pointed this one out to you before, possibly even in this thread. When making a clean design, a designer can just pick the amount of IPC they target. I could design you a CPU that has twice the IPC of Skylake. It wouldn't even be hard. It would be slow, of course, because IPC is a tradeoff against clock speed. Your entire argument is nonsensical. If AMD wanted to, they could have doubled the IPC over EXC, but obviously they decided that would have cost them too much clock speed and so targeted a lower IPC uplift.

Of course, being just a second faster at maybe even lower power means nothing. ;)

And who said "at least 40% IPC gain"? I'm sure I could construct a scenario where Zen performs worse than XV*1.4. Maybe it is this kind of misperception that results in disappointment in the end.

And is your last point about something that can't be possible because it doesn't happen every day? See it this way: XV is still an execution-throughput-crippled design with long latencies in the FPU, lots of resource collisions and inefficient caches. Lots of roadblocks. Saying 40% gain over HSW would be a totally different story. While we are at it - how is POWER9 in this regard?

Another view of your logic, pre 2008: House pricing can't go down. Because last time that has actually happened in recent times is... when?
You're reading into it way too much. It was a very straightforward question.

So both of you believe average 40% over EXC is a given. There. Wasn't hard, was it?

And Tuna-Fish, no, IPC can't just be picked by a designer from a hat. That's quite ludicrous to believe at the bleeding edge. You can set a target and aim for changes to reach it. Ultimately you are fighting size and power for the x86 market BEFORE clocks. Aka POWER.

Sent from HTC 10
(Opinions are own)
 

inf64

Diamond Member
Mar 11, 2011
3,697
4,015
136
You're reading into it way too much. It was a very straightforward question.

So both of you believe average 40% over EXC is a given. There. Wasn't hard, was it?

And Tuna-Fish, no, IPC can't just be picked by a designer from a hat. That's quite ludicrous to believe at the bleeding edge. You can set a target and aim for changes to reach it. Ultimately you are fighting size and power for the x86 market BEFORE clocks. Aka POWER.

Sent from HTC 10
(Opinions are own)

Wait, you find it hard to believe AMD could have achieved an average 40% ST IPC improvement? Have you seen the amount of improvements Zen has vs EX, both brute force (sheer number of units) and fine tuning (including novel things they incorporated)? I would be surprised if they don't achieve a 50% improvement, given how much they have added to the core vs EX in int, fp and memory management.
 

Glo.

Diamond Member
Apr 25, 2015
5,704
4,548
136
Summit Ridge's problem is that it's going to be a ~3GHz Broadwell IPC-at-best octa-core competing with a 4.2GHz base/4.5GHz turbo Core i7 7700K in the consumer/client market. Also, I have a sneaking suspicion that Broadwell-E will be a better overclocker than Summit Ridge.

AMD looks like it will have a better offering for those who prefer AMD than the Bulldozer-based family, but Zen based products look like they're still going to be a tough sell in just about every segment of the PC market.
And you are basing this on what? Your suspicion. ONLY!

We cannot draw any conclusions about how good Zen is from a Blender comparison with Broadwell-E CPUs, yet most naysayers here are already jumping to the conclusion that it will be worse than Broadwell-E.

It is too early to jump to conclusions one way or the other. Doing so can be considered spreading FUD.
 

TimCh

Member
Apr 7, 2012
54
47
91
So in short, you hope on cheap prices of Zen, due to the performance metrics lack of previous designs post Conroe and good faith without merit in history.

Single-core K8 pricing wasn't the only case. AMD did the exact same with the X2.

CPU                  Clock speed   L2 cache size   Price
Athlon 64 X2 4200+   2.2GHz        512KB           $537
Athlon 64 X2 4400+   2.2GHz        1024KB          $581
Athlon 64 X2 4600+   2.4GHz        512KB           $803
Athlon 64 X2 4800+   2.4GHz        1024KB          $1001

AMD isn't a charity company like some believe. Zen will be priced exactly where it fits into the performance metrics and not a penny cheaper.
The X2 4200+ was faster than the $1000 Pentium Extreme Edition 840, literally offering twice the performance per dollar.
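A quick back-of-the-envelope check of that ratio, as a sketch only. It assumes, conservatively, that the two chips perform equally (the `relative_perf` values are placeholders, not benchmark results); the prices are the $537 from the table above and the roughly $1000 quoted for the Pentium Extreme Edition 840.

```python
# Rough performance-per-dollar comparison using the prices quoted in this post.
# ASSUMPTION: relative_perf values are placeholders (equal performance),
# not measured benchmark results.

def perf_per_dollar(relative_perf: float, price_usd: float) -> float:
    """Return performance units per dollar."""
    return relative_perf / price_usd

x2_4200 = perf_per_dollar(relative_perf=1.0, price_usd=537)
pentium_ee_840 = perf_per_dollar(relative_perf=1.0, price_usd=1000)

# How many times more performance per dollar the X2 4200+ offers
ratio = x2_4200 / pentium_ee_840
print(f"X2 4200+ perf/$ advantage at equal performance: {ratio:.2f}x")
```

At equal performance the gap is about 1.86x from price alone, so a performance lead of roughly 7% would be enough to push it to 2x.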
 

NTMBK

Lifer
Nov 14, 2011
10,232
5,012
136
Summit Ridge's problem is that it's going to be a ~3GHz Broadwell IPC-at-best octa-core competing with a 4.2GHz base/4.5GHz turbo Core i7 7700K in the consumer/client market. Also, I have a sneaking suspicion that Broadwell-E will be a better overclocker than Summit Ridge.

AMD looks like it will have a better offering for those who prefer AMD than the Bulldozer-based family, but Zen based products look like they're still going to be a tough sell in just about every segment of the PC market.

I wouldn't go quite that far. It will certainly struggle to compete in the HEDT niche if it can't clock at 4GHz, but the vast majority of Intel's CPUs don't clock anywhere near there. I am more interested in how Zen competes in servers and mobile, as those are two very profitable places to be. Two core Zen SoC with a Polaris derived GPU could be a compelling laptop part.
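To put rough numbers on the clock-speed concern, single-thread throughput can be approximated as IPC times frequency. The sketch below takes the ~3GHz figure and the 4.2GHz/4.5GHz 7700K clocks from the quoted post and assumes, purely for illustration, equal IPC on both sides (Zen's real IPC and shipping clocks are not known at this point).

```python
# Single-thread throughput modeled as IPC * clock frequency.
# ASSUMPTION: IPC is set equal (1.0) for both chips; only the clocks,
# taken from the quoted post, differ.

def st_perf(ipc: float, clock_ghz: float) -> float:
    """Relative single-thread performance in arbitrary units."""
    return ipc * clock_ghz

zen_es = st_perf(ipc=1.0, clock_ghz=3.0)
i7_base = st_perf(ipc=1.0, clock_ghz=4.2)
i7_turbo = st_perf(ipc=1.0, clock_ghz=4.5)

print(f"7700K base lead:  {i7_base / zen_es - 1:.0%}")
print(f"7700K turbo lead: {i7_turbo / zen_es - 1:.0%}")

# IPC advantage a 3.0GHz part would need to match the 4.5GHz turbo clock
required_ipc_ratio = 4.5 / 3.0
print(f"IPC ratio needed to match turbo: {required_ipc_ratio:.2f}x")
```

Under these assumptions the clock gap alone is worth 40% to 50% in lightly threaded work, which is why the final shipping frequency matters so much for the client parts.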
 

AtenRa

Lifer
Feb 2, 2009
14,001
3,357
136
I wouldn't go quite that far. It will certainly struggle to compete in the HEDT niche if it can't clock at 4GHz, but the vast majority of Intel's CPUs don't clock anywhere near there. I am more interested in how Zen competes in servers and mobile, as those are two very profitable places to be. Two core Zen SoC with a Polaris derived GPU could be a compelling laptop part.

Yeah, servers first and mobile second are the areas where ZEN has to be good in order to generate cash for AMD. The third segment is entry to mid-range ($50 to $200) desktop SKUs; this is also where ZEN APUs and 4-6 core iGPU-less ZEN SKUs will try to increase desktop market share for AMD. That market doesn't care about having the highest IPC no matter what; an IvyBridge/Haswell-IPC dual core + HT or quad core with Polaris 11 graphics will be a Core i3/i5 killer.

Edit: What I hope AMD will do is not introduce ZEN SKUs at very high prices (due to low availability) and then reduce the prices. They should introduce the ZEN models in such a way that they will not need to lower prices before a new SKU replaces the old one, much like Intel has done for the last 10 years or so. Unless we have a nice price war, then things change ;)
 

KTE

Senior member
May 26, 2016
478
130
76
Wait, you find it hard to believe AMD could have achieved an average 40% ST IPC improvement? Have you seen the amount of improvements Zen has vs EX, both brute force (sheer number of units) and fine tuning (including novel things they incorporated)? I would be surprised if they don't achieve a 50% improvement, given how much they have added to the core vs EX in int, fp and memory management.
I don't find it hard to believe after Hot Chips.

Sent from HTC 10
(Opinions are own)
 
Mar 10, 2006
11,715
2,012
126
And you are basing this on what? Your suspicion. ONLY!

It's based on the limited information that has been put out there publicly + my years of experience of following these companies and knowing the typical PR stunt tricks (everyone does them).

We cannot draw any conclusions about how good Zen is from a Blender comparison with Broadwell-E CPUs, yet most naysayers here are already jumping to the conclusion that it will be worse than Broadwell-E.

This Zen demo was literally AMD's big reveal for this architecture, the best chance for them to grab headlines, generate buzz, get analysts talking, get investors excited, etc. If we assume that AMD's PR people are competent (and I think they are), then this demo was crafted to put Zen in the best possible light.

If, in a cherry picked demo, the best AMD can do is come within a couple of percentage points of Intel's 2014 CPU architecture (which has been replaced by one with both superior perf/clock and frequency capability), then I think it's reasonable to think that on balance Broadwell should still deliver better perf/clock.

Anyway, if Zen really had enough potential in the tank to hit Broadwell-like frequencies (let alone Skylake), then they wouldn't have had to knee-cap the Broadwell and bring it down to 3GHz -- is that extra 200MHz from the Zen chip that hard to wring out?
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
You're reading into it way too much. It was a very straightforward question.

So both of you believe average 40% over EXC is a given. There. Wasn't hard, was it?

And Tuna-Fish, no, IPC can't just be picked by a designer from a hat. That's quite ludicrous to believe at the bleeding edge. You can set a target and aim for changes to reach it. Ultimately you are fighting size and power for the x86 market BEFORE clocks. Aka POWER.
This last point is, what Mike Clark stressed the whole time. They had to balance everything, and have thrown out features for later or ever, if they shift the balance in a wrong direction. The good thing as it seems is, that due to some front loaded big cuts (e.g. 256b SIMD, use of high performance libs), they didn't exactly have to struggle with size.
 

leoneazzurro

Senior member
Jul 26, 2016
919
1,450
136
It depends. A 40% ST IPC improvement over Excavator is surely possible: Excavator is certainly not the best example of single-thread performance around, and given the details of the core it seems plausible. Which is good, as the high end of AMD's desktop lineup is still based on the Piledriver core, which is slower than Steamroller, which in turn is slower than Excavator, given equal memory subsystems. So, in the end, Zen's IPC improvement over AMD's current desktop top of the line could possibly be in the +60% range, not limited to +40%. Moreover, FP capability per core is at least doubled compared to XV, which also bodes well for FP-intensive code.
Of course, this doesn't mean much without knowing the clock speeds, but again, 3GHz for an ES is not terrible, even if low compared to Intel's top offerings. Normally ES are not the highest-clocked parts that will appear, even if a 4GHz part at launch is very unlikely. Also, it would be interesting to know the real power consumption: if Zen cannot scale up well, it could instead scale down well and thus be more competitive in the mobile market. We'll see. I hope competition will rise again, as customers (that's us) will benefit from it.
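The +40% and +60% figures compound rather than add, so together they imply a specific Excavator-over-Piledriver gap. A minimal sketch of that arithmetic, treating both percentages as the speculative estimates they are in this post (neither is a measured result):

```python
# Relative IPC uplifts are multiplicative, not additive.
# ASSUMPTION: both inputs are the speculative figures from this post.
zen_over_xv = 1.40   # Zen vs Excavator (the "+40%" target)
zen_over_pd = 1.60   # Zen vs Piledriver (the "+60% range" guess)

# If both hold, Excavator's implied IPC lead over Piledriver is their ratio
implied_xv_over_pd = zen_over_pd / zen_over_xv
print(f"Implied Excavator over Piledriver: {implied_xv_over_pd - 1:.1%}")
```

So a +60% gain over Piledriver corresponds to Excavator being roughly 14% ahead of Piledriver per clock, not 20%.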
 

AMDisTheBEST

Senior member
Dec 17, 2015
682
90
61
Zen already proves itself to be faster than Broadwell hertz for hertz. The remaining question is just the frequency and its power consumption. If AMD's 8 cores can match the 4.5 gigahertz of Intel's Kaby Lake i7 while charging a price just slightly higher than an i5... I can guarantee you prices will drop like a rock and value for the money will be at its best in 10 years.
 

Abwx

Lifer
Apr 2, 2011
10,937
3,439
136
This Zen demo was literally AMD's big reveal for this architecture, the best chance for them to grab headlines, generate buzz, get analysts talking, get investors excited, etc. If we assume that AMD's PR people are competent (and I think they are), then this demo was crafted to put Zen in the best possible light.

Or rather in a good light without disclosing the full extent of the perf, because what would be the point of giving your competitor clues so he can better compete against you? That's surely what they did; anything else would be bad strategy, as for the time being they have nothing to sell that could benefit from better numbers.

The remaining question is just the frequency and its power consumption.

https://forums.anandtech.com/threads/new-zen-microarchitecture-details.2465645/page-110
 

AtenRa

Lifer
Feb 2, 2009
14,001
3,357
136
If i say ZEN could have higher Throughput per core, looking strictly at the execution units (Wider than SKL), will it be correct ??
Can ZEN Fetch, Decode, Execute and Retire one Integer and one FP Thread simultaneously (HT) per Cycle without having those threads fight for available resources ?? Dresdenboy ????

AMD ZEN

HC28.AMD.Mike%20Clark.final-page-007.jpg



Intel Skylake

16hrty8.jpg
 

Abwx

Lifer
Apr 2, 2011
10,937
3,439
136
Can ZEN Fetch, Decode, Execute and Retire one Integer and one FP Thread simultaneously (HT) per Cycle without having those threads fight for available resources

AMD ZEN

HC28.AMD.Mike%20Clark.final-page-007.jpg



Intel Skylake

16hrty8.jpg

It is written on their slides: 4 uops/cycle for the FP part; for integer it's 4 uops/cycle from the decoder + 2 uops/cycle from the trace cache, and both the FP and INT schedulers can work simultaneously. AMD also said that the uops from the trace cache are equivalent to x86 instructions in most cases.
 

zentan

Member
Jan 23, 2015
177
5
36
Zen already proves itself to be faster than Broadwell hertz for hertz. The remaining question is just the frequency and its power consumption. If AMD's 8 cores can match the 4.5 gigahertz of Intel's Kaby Lake i7 while charging a price just slightly higher than an i5... I can guarantee you prices will drop like a rock and value for the money will be at its best in 10 years.
Where did you see the proof? Please share it with us too.

Anyway, people should be cautious about taking controlled-room demos as some kind of actual performance review. Not long ago AMD showed a system with a Polaris card consuming 54W less than a GTX 950 equipped system. From various reviews it became apparent that RX 460 SKUs either consumed a bit more or about the same, and in some cases 20-25W less, but still nowhere close to what the controlled demo suggested if taken as a real-world reference. Also, if AMD actually has a product that is better than Broadwell clock for clock, then they probably were/are being quite conservative with their "40%" IPC estimate.
 

HiroThreading

Member
Apr 25, 2016
173
29
91
I will gladly pay $2000 for a 16c Zen.. I am not hoping for anything.

AMD in a vacuum will charge as much as they can get like any company, but Zen is not being released into the vacuum.
http://www.anandtech.com/show/1745/5

From that same link:

One thing that we noticed in our first review of the Athlon 64 X2 processor was that AMD was surely getting their money's worth out of each X2 sale, especially compared to Intel. Dating back to the launch of the Pentium D, Intel's entry-level Pentium D 820 only came with an $80 premium over its identical single core counterpart. Back then, AMD's cheapest core, the X2 4200+ commanded a $265 premium for its second core.

With the introduction of the Manchester core in the Athlon 64 X2 3800+, AMD introduces a much more reasonably priced dual-core CPU, where the cost of the second core has finally dropped to $160. It's still not as low as Intel's lowest, but it is fairly competitive with Intel's closest priced dual core competitor - the 3.0GHz Pentium D 830.

AMD's Athlon 64 and X2 processors were f***ing expensive. They obviously had the best performance (especially for games). However, if all you needed was a dual core CPU for multitasking and some productivity, then the Pentium D offered really good value.

If anything, it was Intel with the Core 2 Duo that caused a major price war (especially with the E6300 and E6600), suppressing CPU prices to, really, this very day.
 

KTE

Senior member
May 26, 2016
478
130
76
This last point is, what Mike Clark stressed the whole time. They had to balance everything, and have thrown out features for later or ever, if they shift the balance in a wrong direction. The good thing as it seems is, that due to some front loaded big cuts (e.g. 256b SIMD, use of high performance libs), they didn't exactly have to struggle with size.
I'll talk to the POWER9 team very soon about this and try and get their [informal] insight about this balancing. Maybe even tomorrow.

Sent from HTC 10
(Opinions are own)
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
If i say ZEN could have higher Throughput per core, looking strictly at the execution units (Wider than SKL), will it be correct ??
Can ZEN Fetch, Decode, Execute and Retire one Integer and one FP Thread simultaneously (HT) per Cycle without having those threads fight for available resources ?? Dresdenboy ????
Hehe, I had the same topic in the P3D editors forum. ;)

Due to mem accesses (incl. L1), limited ILP, and FP instruction mix/latencies there won't be too many cycles, where either INT, FP, or both are maxed out in throughput. Due to that, this scenario (Int+FP threads) shouldn't be limited that much by the front end.

I think, similar to SKL, uOp cache and decode outputs are being muxed into the uOp queue, thus they are not providing any advantage by combination. Anyway Zen will have an advantage by avoiding crowded issue ports. But this advantage might get lost for example with AVX256.

How about that thread prioritization thing, which I mentioned a while ago in this forum when we all discussed Zen's SMT capabilities? ;)

@all:
BTW did anyone here notice that the stack engine + memfile reduces AGU pressure (remember the 3rd AGU discussions?). That memfile seems to be a small stack cache. A well known MPR editor will cover that topic soon. ^^
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I'll talk to the POWER9 team very soon about this and try and get their [informal] insight about this balancing. Maybe even tomorrow.
That's good. I'd like to hear what they say.

We forgot some variables: complexity, time, minds (thinking up/knowing fancy stuff like hash perceptrons), IP costs, testability..
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Could you post the link? I can't seem to find it.

As for hyperthreading, what I did find interesting was that running Linpack on a quad-core Haswell with HTT and AVX2, on one thread per core at 3GHz, resulted in a throughput of 166GFLOPS. If, however, a simple app was run on each of the other 4 threads, using just GPRs and simple MOV instructions in a loop (so no memory accesses or CPU cache usage), Linpack throughput dropped to 98GFLOPS, a 40% drop in performance.
That's a real problem. This even happens with those parallel process' threads having a lower priority as long as there is no SMT capable scheduler taking care of that.
(4/8/15)
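The numbers in that quoted Linpack example can be sanity-checked with a short sketch. The peak figure assumes 16 double-precision FLOPs per cycle per core for Haswell with AVX2 FMA (2 FMA units x 4 doubles per 256-bit vector x 2 operations per FMA); the measured values are simply the ones quoted above.

```python
# Sanity check of the quoted Linpack figures on a quad-core Haswell at 3GHz.
cores = 4
clock_ghz = 3.0
# ASSUMPTION: 2 FMA units * 4 doubles per 256-bit vector * 2 ops per FMA
flops_per_cycle = 2 * 4 * 2

peak_gflops = cores * clock_ghz * flops_per_cycle

alone_gflops = 166.0          # Linpack alone, one thread per core (quoted)
with_siblings_gflops = 98.0   # with a MOV loop on each sibling thread (quoted)

efficiency = alone_gflops / peak_gflops
drop = 1 - with_siblings_gflops / alone_gflops

print(f"Theoretical peak:       {peak_gflops:.0f} GFLOPS")
print(f"Linpack efficiency:     {efficiency:.1%}")
print(f"Loss from sibling load: {drop:.1%}")
```

The quoted 166GFLOPS is about 86% of the 192GFLOPS theoretical peak, and the drop to 98GFLOPS works out to roughly 41%, in line with the "40%" in the quote.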

As AtenRa wrote, they have some experience (which includes test and verification). I saw one patent filing, where they take care of thread priorities in their SMT implementation. That would be interesting, if they implemented it in Zen...
(10/7/15)

Of course the integral counts and it won't be that much during gaming (adapting to target stream resolution).

During gaming with well threaded game code, Quick Sync shouldn't eat too much away.

Edit: I did some measurements on my i7-5600U work notebook. This diagram shows the Prime95 FFT times when using 4 threads.
prime95_with_backgrouymupt.png


I have multiple taskmanager screenshots of the different situations, but don't want to spam the thread for now.

The background task is my genetic programming trading system software, which runs with 4 threads and usually a priority one level below "normal". It does lots of floating point calculations, pointer handling, memory accesses and branching. For the tests I set it to the lowest level.

Prime95 sets priority to "normal" for each subtest and does heavy AVX work.

What we're seeing here was my vague impression before based on uarch and is a solid fact now. There is no priority in HT. Skylake might have improved that, but I didn't hear about this.
(11/18/15)

I wanted to build up tension. ;)
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,762
3,131
136
BTW did anyone here notice that the stack engine + memfile reduces AGU pressure (remember the 3rd AGU discussions?). That memfile seems to be a small stack cache. A well known MPR editor will cover that topic soon. ^^
I'm sure most people who have read/know about the stack cache papers saw it. What is interesting is that it seems far more "internal" to the core than I expected; I assumed it would sit in line with the L1s. I guess they will see even greater savings than in the paper because of lessened data movement? So is the L0 in other leaks the uop cache or the stack?
 

hrga225

Member
Jan 15, 2016
81
6
11
Hehe, I had the same topic in the P3D editors forum. ;)

How about that thread prioritization thing, which I mentioned a while ago in this forum when we all discussed Zen's SMT capabilities? ;)

@all:
BTW did anyone here notice that the stack engine + memfile reduces AGU pressure (remember the 3rd AGU discussions?). That memfile seems to be a small stack cache. A well known MPR editor will cover that topic soon. ^^
I missed the thread prioritization discussion. So they are doing it in hardware, overriding the OS? What mechanism do you think they are using?

Regarding the stack cache, as far as I am aware (I read about it somewhere), it reduces power consumption by reducing AGU pressure. Now, that 3rd AGU, or the lack of it to be precise, was never much of an issue to me in the first place. Yes, it would be a bottleneck in certain workloads. So, in the end, having 2 AGUs was a design choice rather than a design tradeoff.