New Zen microarchitecture details

itsmydamnation · Jun 18, 2016

KTE said:
BTW... Looking at DT Excavator, I wouldn't want AMD to project anything based off it!

It's certainly not even 5% better than PD on average. It wins few but loses more with a landslide!

Absolute performance wise, it's ppp. It needs FAR higher clocks but it doesn't scale at all. It's a mobile chip, simply put, low frequency+graphics optimized.

It's no better than the old Regor chips were at the time. Kuma caned them, literally.

Sent from HTC 10
(Opinions are own)

It's not even close to Regor vs Kuma. For one, in almost anything that is not a throughput test/workload, Excavator is anywhere up to 15% faster per clock than Kaveri.

An actual fair comparison would be something like an X6 vs Llano: L3 vs no L3, APU focused vs not. Regor vs Kuma was about cost reduction.

The only way Trinity wins out vs Excavator per clock is because of the extra L2, and that's completely disregarding TDP. As pointed out by The Stilt, the L2 is very power hungry and clock limited. If you look at integer IPC outside of the L2 difference, there is nothing in Excavator that sacrifices performance for lower power.

the funny thing about IPC is the C, so the point of your post was........

coercitiv · Jun 18, 2016

The Stilt said:
Since certain people claim that shrinking virtually any design from 28nm bulk to 14nm FinFet LPP will immediately increase the Fmax by 50 - 100%, I'm really keen to see how that will work out :sneaky:

It's not about automatic Fmax but rather about giving the chip some breathing room in power usage, hence potentially higher clocks.

However, I have yet to see people hope or claim 50-100% increase, when even Nvidia brought about 40% at best. How about we hope for 20-40% instead?

The Stilt · Jun 18, 2016

KTE said:
BTW... Looking at DT Excavator, I wouldn't want AMD to project anything based off it!

It's certainly not even 5% better than PD on average. It wins few but loses more with a landslide!

Absolute performance wise, it's ppp. It needs FAR higher clocks but it doesn't scale at all. It's a mobile chip, simply put, low frequency+graphics optimized.

It's no better than the old Regor chips were at the time. Kuma caned them, literally.

Sent from HTC 10
(Opinions are own)

The average figures quoted by AMD are pretty accurate based on my own testing. According to AMD, Steamroller is 10% faster than Piledriver and Excavator is 5% faster than Steamroller. There certainly are extremes in both directions, but generally that's the actual average.

So on average Excavator should have 15.5% higher IPC than Piledriver. In most cases it has, unless its primary weakness is exploited (insufficient L2).
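
For anyone wondering where the 15.5% comes from: it looks like the two quoted gains multiplied together rather than added. A minimal sketch of that arithmetic, assuming each gain is relative to the previous core (the function name is only illustrative):

```python
# Compound per-generation IPC gains.
# Assumption: each percentage is relative to the previous generation,
# so the gains multiply instead of adding up.
def compound_gain(*gains_pct):
    total = 1.0
    for gain in gains_pct:
        total *= 1.0 + gain / 100.0
    return (total - 1.0) * 100.0

# Piledriver -> Steamroller (+10%), Steamroller -> Excavator (+5%)
pd_to_xv = compound_gain(10.0, 5.0)
print(f"Piledriver -> Excavator: {pd_to_xv:.1f}% higher IPC")
```

Simple addition would give 15%; the extra half a percent is the 5% applied on top of the already 10% faster Steamroller.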

itsmydamnation · Jun 18, 2016

I posted this on Beyond3D in regard to the discussion on Zeppelin PCI-E/GMI; figured I'd share it here and see what people think:

Yes, I would assume that GMI uses PCI-E physicals/encoding. I've been looking at die shots trying to get an idea of the possible amount of PCI-E in the Zeppelin die shot by roughly comparing the relative size of a PCI-E interface to the memory interface. The best comparison I have so far is to use the good Bulldozer die shots and use the HT links as a guess. Each 16-bit HT link (32-bit bidirectional) is 76 pins. PCI-E x16 is around 160 pins. On Zambezi the HT links are spread thin; on Magny-Cours they are wider but shorter. It looks like you can get approx. 8 HT interfaces in the same space as the 128-bit memory interface (608 HT pins).

On Zeppelin each GMI interface is about 2/3 of the width of the 64-bit memory interface but about 1/3 longer, so the two GMI interfaces in total would be around the same size as a 128-bit DDR interface. So that is approx. 608 pins, which is approx. 64 lanes of PCI-E.
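
The pin-to-lane estimate above can be sketched as follows. All inputs are the rough figures from the post; the constant names are just labels for this example, and the whole thing assumes area scales with pin count.

```python
# Rough pin-budget estimate using the figures quoted in the post.
# Assumption: die area of an interface scales with its pin count.
HT_PINS_PER_LINK = 76         # one 16-bit HT link
HT_LINKS_IN_DDR128_AREA = 8   # eyeballed from the Bulldozer die shots
PCIE_X16_PINS = 160           # approximate pin count of a x16 link

pin_budget = HT_LINKS_IN_DDR128_AREA * HT_PINS_PER_LINK   # 608 pins
pins_per_lane = PCIE_X16_PINS / 16                        # 10 pins per lane
lanes = pin_budget / pins_per_lane

print(f"{pin_budget} pins -> about {lanes:.1f} PCI-E lanes")
```

That lands at roughly 61 lanes, which is in the same ballpark as the approx. 64 lanes quoted.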

So now the question is how much interconnect between Zeppelin SoCs is enough? Is 16 lanes (25 Gbps) enough? If so, two Zeppelin SoCs could provide 64 lanes to GPU and 32 lanes to motherboard-based PCI-E. So long as the shortest path is taken between CPU and GPU, the small inter-Zeppelin GMI link shouldn't be a problem.

On the 32-core part a full mesh would leave a total of 64 lanes for PCI-E; if it's a dual ring, then 128 lanes.

So how's that for a mighty long bow! :runaway:

KTE · Jun 18, 2016

The Stilt said:
The average figures quoted by AMD are pretty accurate based on my own testing. According to AMD, Steamroller is 10% faster than Piledriver and Excavator is 5% faster than Steamroller. There certainly are extremes in both directions, but generally that's the actual average.

So on average Excavator should have 15.5% higher IPC than Piledriver. In most cases it has, unless its primary weakness is exploited (insufficient L2).

In which DT apps? (PD vs SR) Are we talking real apps or synthetics here, and did you try CPU Queen or Fritz Chess by any chance?

K10 is still faster than even EX...

http://www.pcgameshardware.de/Athlon-X4-845-CPU-261962/Tests/Excavator-Benchmarks-Test-1191570/
http://www.overclockersclub.com/reviews/amd_athlon_x4_845_cpu/5.htm
http://excavator.looncraz.net/
http://www.ferra.ru/ru/system/review/amd-excavator-athlon-x4-845/#.V2UtSZ_TXqA
http://www.neoseeker.com/Articles/Hardware/Reviews/amd-athlon-x4-845/2.html

Sent from HTC 10
(Opinions are own)

Dresdenboy · Jun 18, 2016

The Stilt said:
The average figures quoted by AMD are pretty accurate based on my own testing. According to AMD, Steamroller is 10% faster than Piledriver and Excavator is 5% faster than Steamroller. There certainly are extremes in both directions, but generally that's the actual average.

So on average Excavator should have 15.5% higher IPC than Piledriver. In most cases it has, unless its primary weakness is exploited (insufficient L2).

Let's not forget the missing L3 and the higher IMC latencies, which might also make XV's results worse, as shown by KTE.

Dresdenboy · Jun 18, 2016

KTE said:
In which DT apps? (PD vs SR) Are we talking real apps or synthetics here, and did you try CPU Queen or Fritz Chess by any chance?

K10 is still faster than even EX...

http://www.pcgameshardware.de/Athlon-X4-845-CPU-261962/Tests/Excavator-Benchmarks-Test-1191570/
http://www.overclockersclub.com/reviews/amd_athlon_x4_845_cpu/5.htm
http://excavator.looncraz.net/
http://www.ferra.ru/ru/system/review/amd-excavator-athlon-x4-845/#.V2UtSZ_TXqA
http://www.neoseeker.com/Articles/Hardware/Reviews/amd-athlon-x4-845/2.html

Were TDPs taken care of?

Here are some results with x4 845 @ 95W:
http://www.planet3dnow.de/cms/22697-erste-benchmarks-des-athlon-x4-845/

And regarding IPC (similar to Looncraz's work), there is also the P3DNow test:
http://www.planet3dnow.de/cms/18564...cavator-leistungsvergleich-der-architekturen/
(article index is below the headline)

KTE · Jun 18, 2016

OT:

It has always been a pretty funny exercise comparing benchmarks online: 16 years on and enthusiasts still haven't developed firm approaches. Quite a conundrum.

Before Zen is out, I should say it. I have a problem with some common data. Let's face it,

Some benches are purely synthetics, true to their name with little correlation to reality.
Some synthetics show best case for an architecture and some show worst case results.
Then there are those which are popular among benchers.
There are even synthetic benchmarks which contain a good instruction mix and which correlate very well with what you find in the public domain.
Then benches which are testing a future capability.
Also benches which scale well or poorly from 1C to nC.
Furthermore, there are those which are optimized for one architecture more than the other.

Then you have the real world apps, which can be divided to be:

1. All or any of the above...
2. RW apps most use and those most do not use.
3. Benches running tasks and/or functions which most use/do not use.
4. Benches which run at sizes uncommon in RW.
5. Benches which show results in percentages which mean little for actual runtimes (relevancy threshold).
6. Then you have outliers, corner cases (best/worst) in the results, improper setups, test bed hampering, etc.

My gripe is usually with 3-5. It is so easy to get caught up in numbers. What does 5fps mean at 70fps? What does 7s mean to unzip anything, unless it's a sub-25s runtime? But even then? Does 20s vs 25s boot time really matter? Would consumers perceive it? Would it improve their survival age by a day?

We need to be able to look for a relevance threshold when advising and comparing, especially as reviewers. This is done by taking into consideration the benchmark, its result in actual figures, its applicability to the real world, and its total runtime and sizes. Percentage statistics are useful for analysis, to determine which is faster, but they don't tell the complete tale for an end user.
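
One way to picture the "relevance threshold" idea is to report the absolute difference next to the percentage and only flag results that clear some minimum absolute gap. This is only a sketch: the function names and the 10-second threshold are made up for illustration and are not from any review methodology.

```python
# Report both absolute and percentage differences for a benchmark result,
# and flag whether the absolute gap clears a chosen threshold.
# Assumption: the threshold value is entirely hypothetical.
def compare(baseline, candidate):
    absolute = candidate - baseline
    percent = absolute / baseline * 100.0
    return absolute, percent

def is_relevant(baseline, candidate, min_abs):
    absolute, _ = compare(baseline, candidate)
    return abs(absolute) >= min_abs

# Boot time example from the post: 25 s vs 20 s.
abs_diff, pct_diff = compare(25.0, 20.0)
print(f"boot time: {abs_diff:+.1f} s ({pct_diff:+.1f}%)")
print("relevant at a 10 s threshold:", is_relevant(25.0, 20.0, min_abs=10.0))
```

A 20% improvement sounds big as a percentage, but it is 5 seconds in absolute terms, which is the kind of gap the post is questioning.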

KTE · Jun 18, 2016

Dresdenboy said:
Were TDPs taken care of?

Here are some results with x4 845 @ 95W:
http://www.planet3dnow.de/cms/22697-erste-benchmarks-des-athlon-x4-845/

And regarding IPC (similar to Looncraz's work), there is also the P3DNow test:
http://www.planet3dnow.de/cms/18564...cavator-leistungsvergleich-der-architekturen/
(article index is below the headline)

Thanks (I haven't seen it)

Is MusicIsMyLife still around?

Dresdenboy · Jun 19, 2016

KTE said:
Thanks (I haven't seen it)

Is MusicIsMyLife still around?

He's still posting there in the forum, but less often, also not actively working on articles.

KTE · Jun 19, 2016

Dresdenboy said:
He's still posting there in the forum, but less often, also not actively working on articles.

It would be good to see him back now in order to review Zen... As he's very knowledgeable with AMD CPUs and had pioneering experience from the Agena days.

Sent from HTC 10
(Opinions are own)

happy medium · Jun 19, 2016

AMD Zen ~= Intel 3930k @ 130watts.

That's a good guess. and more than I expect.

moonbogg · Jun 19, 2016

happy medium said:
AMD Zen ~= Intel 3930k @ 130watts.

That's a good guess. and more than I expect.

If that's the case then Zen would have to be given away for nearly free. Most people would just buy a new Intel quad and get better performance. If it performs like Sandy Bridge, then AMD is stuck being the budget brand forever. Also, people don't care about 8 cores and they don't need them. A few of us around here might care but no one else does and no one needs it. Sad truth is, a fast quad is going to be just fine for another decade.

Abwx · Jun 19, 2016

happy medium said:
AMD Zen ~= Intel 3930k @ 130watts.

That's a good guess. and more than I expect.

Lol, quite a good guess of course, they said twice the throughput of an FX8350 while a 3930K has barely 25% higher throughput than a 8350..

http://www.hardware.fr/articles/940-19/indices-performance-cpu.html

Azuma Hazuki · Jun 19, 2016

Well, according to Passmark's CPU list that puts it at about 12000 points, where the 6700K is just below 11000. If they're selling this for $400 or below, especially $350 or below, this is a no-brainer. Single thread is fast enough, multithread is exceptional, and with DX12 this should be a decent enough gamer CPU.

Personally, as a Gentoo fan who's stuck on Arch because compiling on a Core 2 Duo is made of pain, I am looking forward to this.

The Stilt · Jun 19, 2016

Abwx said:
they said twice the throughput of an FX8350

Could you post a link where AMD said that? AMD said "Orochi"; the "FX-8350" was nothing but an interpretation by WCCF.

If you look at the first (Orochi vs. Summit) and the second (Excavator vs. Zen) version of the slide, there is a pretty clear pattern. I think that's the very reason why the slides got pulled away :sneaky:

When viewed in the original size, the height of Summit's / Zen's column is 658 pixels in the first version of the slide (Orochi vs. Summit) and 667 pixels in the second version (Excavator vs. Zen). Meanwhile the heights of Orochi's and Excavator's columns are 328 and 370 pixels.

658 / 328 = 2.006, i.e. ~100.6% higher (Orochi vs. Summit)
667 / 370 = 1.803, i.e. ~80.3% higher (Excavator vs. Zen)

In the very same section of the slide AMD states "significant performance leap expected - 40% IPC improvement".

Is the ~80% higher column for Zen (vs. Excavator) just a coincidence, or does the slide have a 2:1 scale?

50.3% would fit perfectly as the difference between Orochi (Piledriver) and Zen, considering the average performance difference between Orochi and Excavator in Cinebench.
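
The pixel arithmetic is easy to re-check. A minimal sketch, using the column heights from the post; the 2:1 scale is the hypothesis being tested here, not something the slide states:

```python
# Column heights in pixels, as measured on the original slides.
def percent_higher(tall_px, short_px):
    return (tall_px / short_px - 1.0) * 100.0

orochi_vs_summit = percent_higher(658, 328)
xv_vs_zen = percent_higher(667, 370)

# Hypothesis: the chart exaggerates differences by 2:1,
# so the real gap would be half of what is drawn.
SCALE = 2.0
print(f"Orochi vs Summit: drawn {orochi_vs_summit:.1f}%, "
      f"at 2:1 scale {orochi_vs_summit / SCALE:.1f}%")
print(f"Excavator vs Zen: drawn {xv_vs_zen:.1f}%, "
      f"at 2:1 scale {xv_vs_zen / SCALE:.1f}%")
```

Halving the drawn Excavator vs. Zen gap gives roughly 40%, which lines up with the "40% IPC improvement" text on the same slide, and halving the Orochi gap gives the 50.3% figure.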

Here are the originals (no resizing, lossless).

coercitiv · Jun 20, 2016

The Stilt said:
50.3% would fit perfectly as the difference between Orochi (Piledriver) and Zen, considering the average performance difference between Orochi and Excavator in Cinebench.

In other words, Zen 8C will have double the throughput only at the same frequency; once a ~25% clock speed difference gets factored in, the throughput advantage drops from +100% to +50%.

It all depends on final clocks.
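
A quick sketch of that reasoning, assuming throughput scales linearly with clock speed and reading the "~25% difference" as Zen running at 75% of the Orochi part's clock (both are assumptions, and the function name is only illustrative):

```python
# Relative throughput = per-clock throughput ratio * clock ratio.
# Assumption: performance scales linearly with frequency.
def relative_throughput(per_clock_ratio, clock_ratio):
    return per_clock_ratio * clock_ratio

same_clock = relative_throughput(2.0, 1.00)    # double at equal clocks
lower_clock = relative_throughput(2.0, 0.75)   # Zen at 75% of the clock

print(f"same clocks: +{(same_clock - 1) * 100:.0f}%")
print(f"25% lower clocks: +{(lower_clock - 1) * 100:.0f}%")
```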

Abwx · Jun 20, 2016

The Stilt said:
658 / 328 = 2.006, i.e. ~100.6% higher (Orochi vs. Summit)
667 / 370 = 1.803, i.e. ~80.3% higher (Excavator vs. Zen)

In the very same section of the slide AMD states "significant performance leap expected - 40% IPC improvement".

Is the ~80% higher column for Zen (vs. Excavator) just a coincidence, or does the slide have a 2:1 scale?
50.3% would fit perfectly as the difference between Orochi (Piledriver) and Zen, considering the average performance difference between Orochi and Excavator in Cinebench.

Why should there be a 2/1 scale..?..

The 80% obviously doesn't apply to the IPC, so it's the other metric mentioned; that is, a Zen core has 80% more perf than an XV core. So this obviously applies to throughput; as you know, 80% over XV is 100% better than Piledriver in, say, Cinebench 11.5.

You know that I do not use CB R15 because it doesn't mimic the results of CB 11.5 for SR and XV, so there's another factor at play, possibly cache size, and this produces non-significant results IPC-wise for these APUs, as the L3-cache-equipped FX doesn't seem to suffer from this detail.

KTE · Jun 20, 2016

Abwx said:
Lol, quite a good guess of course, they said twice the throughput of an FX8350 while a 3930K has barely 25% higher throughput than a 8350..

http://www.hardware.fr/articles/940-19/indices-performance-cpu.html

Of course, that depends on which application AMD means (unknown) or how many of them they are considering.

Dresdenboy · Jun 20, 2016

KTE said:
It would be good to see him back now in order to review Zen... As he's very knowledgeable with AMD CPUs and had pioneering experience from the Agena days.

Yep, that'll be good. I also wrote a few articles there (and am still waiting for my BD sample), and provided one part about a uarch last year. Maybe I'll be involved regarding Zen.

KTE · Jun 20, 2016

Dresdenboy said:
Yep, that'll be good. I also wrote a few articles there (and am still waiting for my BD sample ), and provided one part about a uarch last year. Maybe I'll be involved regarding Zen.

It'd be good if you wrote the architectural side to Zen at least. Esp. in the preface to mention the more important bandwidths and instruction latencies that have changed. You rarely if ever see that covered anywhere (except Aces/RWT, formerly).

Also an idea for you to pass on for Zen (I haven't spoken to MusicIsMyLife since 2007): Electrical testing of the various voltage supply lines (like ht4u.net used to do).

superstition · Jun 21, 2016

KTE said:
We need to be able to look for a relevance threshold when advising and comparing, especially as reviewers. This is done by taking into consideration the benchmark, its result in actual figures, its applicability to the real world, and its total runtime and sizes. Percentage statistics are useful for analysis, to determine which is faster, but they don't tell the complete tale for an end user.

The best thing is to use benchmarks that measure the different aspects of the CPU in a concise and targeted manner, such as:

cache performance
integer performance
floating-point performance
AVX performance
AVX-2 performance
graphical performance (for integrated graphics)

performance per watt, maximum and minimum
performance per watt over time

A one-size-fits-all benchmark can have big drawbacks, like relying overly on a FP-heavy benchmark to characterize the difference between an 8370 and a 3770K. 8 integer cores and 4 floating point units is a design that is going to look particularly sub-par in a FP-heavy benchmark, unless those 4 floating point units are really powerful.

"Real-world" is tricky because applications can be coded in a manner to favor one architecture over another. Cinebench, for instance, could be updated to further favor Intel by leaning even more heavily on something like AVX-2. I assume Skylake and Kaby Lake are going to have stronger AVX-2 performance than Zen. So, all one needs to cook up a benchmark that proves how sad Zen is is something "real world" like Cinebench "12" that leans very heavily on a specific Intel advantage. The opposite is a benchmark that doesn't use AVX at all in a circumstance where it would provide additional performance.

If Intel were putting Broadwell C-style EDRAM in its chips a "real world" benchmark could also be designed to heavily favor a large victim cache. Since Zen is unlikely to have anything like that initially then that could be a big marketing edge. "Zen falls short of Kaby Lake by 60%!" (in benchmark that leans heavily on 128 MB of L4 cache). Of course, if Intel and the benchmark maker were to pursue this more obviously-exposed avenue it would have been wise to put the L4 on all Skylake as well as Broadwell E parts. Even more clever would have been to withhold the L4 until Skylake to claim that the big performance advantage is due to the newer Skylake cores.

KTE · Jun 21, 2016

superstition said:
The best thing is to use benchmarks that measure the different aspects of the CPU in a concise and targeted manner, such as:

cache performance
integer performance
floating-point performance
AVX performance
AVX-2 performance
graphical performance (for integrated graphics)

performance per watt, maximum and minimum
performance per watt over time

A one-size-fits-all benchmark can have big drawbacks, like relying overly on a FP-heavy benchmark to characterize the difference between an 8370 and a 3770K. 8 integer cores and 4 floating point units is a design that is going to look particularly sub-par in a FP-heavy benchmark, unless those 4 floating point units are really powerful.

"Real-world" is tricky because applications can be coded in a manner to favor one architecture over another. Cinebench, for instance, could be updated to further favor Intel by leaning even more heavily on something like AVX-2. I assume Skylake and Kaby Lake are going to have stronger AVX-2 performance than Zen. So, all one needs to cook up a benchmark that proves how sad Zen is is something "real world" like Cinebench "12" that leans very heavily on a specific Intel advantage. The opposite is a benchmark that doesn't use AVX at all in a circumstance where it would provide additional performance.

If Intel were putting Broadwell C-style EDRAM in its chips a "real world" benchmark could also be designed to heavily favor a large victim cache. Since Zen is unlikely to have anything like that initially then that could be a big marketing edge. "Zen falls short of Kaby Lake by 60%!" (in benchmark that leans heavily on 128 MB of L4 cache). Of course, if Intel and the benchmark maker were to pursue this more obviously-exposed avenue it would have been wise to put the L4 on all Skylake as well as Broadwell E parts. Even more clever would have been to withhold the L4 until Skylake to claim that the big performance advantage is due to the newer Skylake cores.

Yes, but here's the kicker:

i) there is no such application targeting all... Accurate for all.
ii) showing a CPU in a bench highly optimized/tuned to extract good performance will generally not be representative of RW apps.

In the end, the point of benchmarks is to be able to advise people on what is worth purchasing for them for their usage patterns.

I'll dissect Excavator PR from actual results to show this as soon as I get some time.

Sent from HTC 10
(Opinions are own)

Ajay · Jun 21, 2016

The Stilt said:

Since neither axis is labelled for the Zen vs Excavator chart - what we have here is a pure marketing chart. There is no valid engineering data and, hence, no valid conclusion can be drawn from it. This is pure pixie dust

Burpo · Jun 21, 2016

Lol, that slide has been altered so many times.. This is the sloppy one where the word Zen doesn't even line up with the word Excavator.. Fonts & size are different.. The original slide had no names below the graph..
