[Techpowerup] AMD "Zen" CPU Prototypes Tested, "Meet all Expectations"

looncraz · Nov 12, 2015

Exophase said:
I am very skeptical of this claim. Unless the microcode was contributing to something very, very broken there's no way any change in it would result in a 2x performance difference. Even a CPU like Pentium 4 wouldn't be spending that much time in microcode.

I took it as an exaggeration when I heard it, but it's conceivable that the fact that the original 08088 code is still in place, and all of the errata over the years requires workarounds, we could well see a few places where it is a significant enough of a concern.

I would, however, like to state I didn't know the guy long and have little reason to take his word over anyone else with experience at Intel. We simply called him "drunk guy" to give you a clue. He did, however, absolutely work for Intel, and didn't volunteer the information until he realized who I was after a mutual friend made the connection.

Sweepr · Nov 12, 2015

Fjodor2001 said:
So increased Skylake-E count over Haswell-E and Broadwell-E is just a speculation.

Surprise, surprise. Here's your source:

I got something wrong though. They didn't wait Skylake-E, they are already doing it with Broadwell-E.

Fjodor2001 said:
Which was not what was being discussed. A recap for you: The discussion between us in this thread started by me mentioning that I think Skylake-E is not likely to have significantly higher IPC and performance than Broadwell-E and Haswell-E.

+25% cores/thread, slightly higher IPC and that's only Broadwell-E. Feeling pretty confident about Skylake-E right now.

looncraz · Nov 12, 2015

Phynaz said:
Copy Exactly is used in manufacturing, not design. It means every line in every fab is an exact clone of every other. If an optimization is developed at one location it is rolled out to other locations. This is in contrast to other fabs that use SPM on a per line basis. The purpose is to reduce cost.

I believe I read somewhere that Intel syncs their manufacturing processes every 90 days.

Intel page.

The name is certainly similar, but he was specifically referring to microcode (I'm a software developer, so the software aspect was the common thread between us). Turned out he had my software installed on his laptop and was a fan of mine, how neat is that? :thumbsup:

Dresdenboy · Nov 12, 2015

I think, Zen is about power and size, good legacy ST and good FP MT performance.

NostaSeronx · Nov 12, 2015

I think, Zen is about theft of IP. It worked for Apple, it will work for us.

AMD Twister/Cyclone via Zen Hype!

Lawsuit pending.

looncraz · Nov 12, 2015

DrMrLordX said:
I'm not sure that's accurate. The_Stilt was nice enough to post some XV numbers @ 3.4 GHz (no throttle/turbo) using his dev platform, and adding 40% to his Cinebench R10 numbers would put Zen ahead of Skylake per clock (assuming the same number of cores). 8c/16t Zen would be an R10 monster. Now when you take into account that R10 is mostly an fp SSE2 benchmark, and when you take into account that much of Zen's improvements in IPC over XV will come from:

1). Faster cache
2). Presumably shorter pipeline (less performance loss from stalls)
3). additional fp resources

you will probably see improvements from Zen on the high side in a benchmark like Cinebench R10. The open question is: how high will Zen clock? Maybe not all that high (at least as a base clock), but we'll find out soon enough.

I compared Excavator to Steamroller directly and estimated a 4~15% IPC increase, largely in floating point. The average was 9.85%. This tied it with the results pathway I could calculate for Penryn very closely, still losing in FPU, but winning in integer. The common benchmarks across these, as you might imagine, is limited, so I went with the average performance change between generations.

http://www.anandtech.com/bench/product/49?vs=1492

Here, remember, in single threaded, the A4 is at 3.9GHz, and Steamroller usually sticks to its max turbo speed in single threaded loads, so that's a pretty safe estimate.

Core 2 has far better FP IPC than Steamroller, and Excavator is about 10% ahead of Steamroller, FPU-wise. Of course, we don't really know as much about Excavator as I'd like, but my calculations are based off every benchmark I've been able to find from a reputable source with properly disclosed conditions.

AtenRa · Nov 12, 2015

Dresdenboy said:
I think, Zen is about power and size, good legacy ST and good FP MT performance.

As i have said before, im expecting a very high perf/watt for ZEN. They will not aim for the top absolute performance this time.

looncraz · Nov 12, 2015

ShintaiDK said:
Broadwell-E will be 3.3Ghz for 8 cores and 3Ghz for 10 cores. All at 140W.

Anyone in their right mind still believing in 4Ghz Haswell IPC 8C/16T at 95W with 14 LPP for Zen?

Yup.

A little math can help:
140 / 10 = 14W @ 3GHz

14*4 = 56
56 * 1.7, to equal 4GHz = 95W

This seems it would be right around the power curve we'd expect, actually. For a quad core :thumbsup:

The eight core, though, no way. Unless AMD has really done something crazy awesome. However, it's important to note just how power efficient Keller's other contemporary designs have been. His focus has been high performance (for their market), wide, nice clocking, power efficient, CPUs for quite some time.

If that 8 core comes clocked at 3GHz, it could be at 95W. Intel's ratings are also incompatible with AMD's, so this entire comparison is useless - but fun.

looncraz · Nov 12, 2015

NostaSeronx said:
1). Inclusive 8 MB L3 Cache per 4 cores. Kind of shoots down that faster cache notion.
2). BD -> XV has 15 pipeline stages for the Integer side. ZN/ZN+ has 17 pipeline stages for the Integer side.
3). The additional FP resources is negated by having MAC units rather than FMAC units. 2 128b AVX Adds + 2 128b AVX Muls or 1 128b/256b AVX Muladd.

At 14nm, an identical cache should be able to operate with lower latencies. If they improved the L3 by 15% and the L2 by 50%, and provided larger L1 caches with a 1cycle shorter latency, they could still easily support a 40% IPC improvement.

2)
I've never believed the 15 stage figure. None of the benchmarks or characteristics bear that out. I fully suspect that there are 19 or 20 stages in Bulldozer.

The biggest hint is the minimum misprediction penalty. Which is 20 for Bulldozer.

http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/2

Considering all other architectures a misprediction penalty is the exact same as the pipeline length (excluding Sandy Bridge's 3-cycle optimization), I'd say the safest bet is 20 stages. Unless there is more information available I have not seen.

3)
FMAC vs MAC: FMAC does only one rounding, MAC typically does two.
a += b*c

The impact on performance between the two is not usually drastic.

looncraz · Nov 12, 2015

NostaSeronx said:
I think, Zen is about theft of IP. It worked for Apple, it will work for us.

AMD Twister/Cyclone via Zen Hype!

Lawsuit pending.

High level design isn't IP, though it can sometimes be patented.

There's no denying the superficial similarities between Apple's Cyclone and Zen, though.

However, with Keller, IIRC, involved with both projects, it only makes sense. Keller would be the one in the hotseat, though, legally speaking. AMD can just rely on plausible deniability.

Dresdenboy · Nov 12, 2015

NostaSeronx said:
I think, Zen is about theft of IP. It worked for Apple, it will work for us.

AMD Twister/Cyclone via Zen Hype!

Lawsuit pending.

This will be interesting to watch.

NostaSeronx · Nov 12, 2015

looncraz said:
The biggest hint is the minimum misprediction penalty. Which is 20 for Bulldozer.

Unless, you look at the unconditional direct branches and returns which is 15 cycles.

It is explained very vaguely in documentation that there is a bubble for conditional and indirect branches. Which is why it is 20 cycles, not 15 cycles like the above. So, ya the complex branches that are difficult to calculate take a 5 cycle penalty when mispredicted. While, the simple branches take a 0 cycle penalty when mispredicted.

(Not documentated but the misprediction has also been suggested by some architects to be the same as L2 latency. Thus, BD has 21 cycle misprediction, PD has 20 cycle misprediction, SR has 19 cycle misprediction, and XV has 17 cycle misprediction)

looncraz · Nov 12, 2015

NostaSeronx said:
Unless, you look at the unconditional direct branches and returns which is 15 cycles.

It is explained very vaguely in documentation that there is a bubble for conditional and indirect branches. Which is why it is 20 cycles, not 15 cycles like the above. So, ya the complex branches that are difficult to calculate take a 5 cycle penalty when mispredicted. While, the simple branches take a 0 cycle penalty when mispredicted.

(Not documentated but the misprediction has also been suggested by some architects to be the same as L2 latency. Thus, BD has 21 cycle misprediction, PD has 20 cycle misprediction, SR has 19 cycle misprediction, and XV has 17 cycle misprediction)

That would actually make sense. Particularly the L2 latency relationship given my understanding of how the misprediction flush operates, which would [almost?] always result in an L2 read (and one reason why I've thought my understanding might be wrong :\). If that's the case, then a faster L2 will have a more impressive benefit to Zen than reducing the pipeline length. That could indicate Zen will have 15~20 stages like Excavator and could be expected to clock similarly aside from the assumed added complexity in the scheduler functionality... though there are ways around that. :thumbsup:

Dresdenboy · Nov 12, 2015

AtenRa said:
As i have said before, im expecting a very high perf/watt for ZEN. They will not aim for the top absolute performance this time.

In another forum I asked for FMA usage in applications/games. Now I'd also like to know, if these are ST or MT softwares. My point is:
It's more likely to see non-FMA code as being singlethreaded, thus needing more FMULs/FADDs. And if code uses FMA, it's also very likely multithreaded.

So what could AMD have done? Balance the design for these targets at high efficiency by avoiding any loosing of frequency and power headroom due to a too high FMA/AVX throughput. So Zen cores might have good throughput and an unshared L3 for legacy code at boosted clocks (lasting longer due to power efficiency) and high FMA/AVX MT throughput by providing more cores operating at the knee of the freq/power curve. Adding cores adds FPUs, AGUs, L1 and L2 caches and cache bandwidth.

looncraz said:
The eight core, though, no way. Unless AMD has really done something crazy awesome. However, it's important to note just how power efficient Keller's other contemporary designs have been. His focus has been high performance (for their market), wide, nice clocking, power efficient, CPUs for quite some time.

If that 8 core comes clocked at 3GHz, it could be at 95W. Intel's ratings are also incompatible with AMD's, so this entire comparison is useless - but fun.

Haswell/Skylake are big cores, which also need more power when doing full 256b FMACs with PRF and L1 at their limits. Intel chips have a specific AVX base clock and turbo headroom. Removing that burden in a design would allow to use more cores. GPUs also scale by adding compute cores. It's also much simpler than to work on the shader uarch.

I looked for Haswell vs. older uarch Linpack or DGEMM perf/W numbers, but couldn't find any after some time of searching. It would be a point to support

Regarding XV IPC:
Have you seen the Carrizo measurements on http://instlatx64.atw.hu/ ? There is also a short direct comparison between cores:
http://users.atw.hu/instlatx64/BDZvsPLDvsSTMvsCRZ.txt

Tuna-Fish · Nov 12, 2015

looncraz said:
The name is certainly similar, but he was specifically referring to microcode (I'm a software developer, so the software aspect was the common thread between us). Turned out he had my software installed on his laptop and was a fan of mine, how neat is that? :thumbsup:

While what you say is probably true of a lot of microcoded instructions, the big reason they are not being updated is that they are deprecated. You absolutely could not get significant speedup of any kind with better microcode simply because modern compilers almost never emit instructions that run microcode.

Microcode is for weird corner cases and legacy software.

Exophase · Nov 12, 2015

NostaSeronx said:
Unless, you look at the unconditional direct branches and returns which is 15 cycles.

It is explained very vaguely in documentation that there is a bubble for conditional and indirect branches. Which is why it is 20 cycles, not 15 cycles like the above. So, ya the complex branches that are difficult to calculate take a 5 cycle penalty when mispredicted. While, the simple branches take a 0 cycle penalty when mispredicted.

Lower misprediction penalty on unconditional direct branches is made possible by the correction not being dependent on a result writeback (flags or register). So it can be resolved earlier in the pipeline instead of at the end. Doesn't mean that the pipeline should be considered shorter.

A lower penalty for returns makes no sense, and indeed doesn't line up with what Agner Fog tested:

The misprediction penalty is specified as minimum 20 clock cycles for conditional and indirect branches and 15 clock cycles for unconditional jumps and returns. My measurements indicated up to 19 clock cycles for conditional branches and 22 clock cycles for returns.

Given the large number of other pretty obvious errors or omissions in AMD's documentation I'm more inclined to go with Agner on this.

NostaSeronx said:
(Not documentated but the misprediction has also been suggested by some architects to be the same as L2 latency. Thus, BD has 21 cycle misprediction, PD has 20 cycle misprediction, SR has 19 cycle misprediction, and XV has 17 cycle misprediction)

Who suggests this? I don't see a reason to believe they'd be tightly coupled like that.

Exophase · Nov 12, 2015

looncraz said:
I took it as an exaggeration when I heard it, but it's conceivable that the fact that the original 08088 code is still in place, and all of the errata over the years requires workarounds, we could well see a few places where it is a significant enough of a concern.

I would, however, like to state I didn't know the guy long and have little reason to take his word over anyone else with experience at Intel. We simply called him "drunk guy" to give you a clue. He did, however, absolutely work for Intel, and didn't volunteer the information until he realized who I was after a mutual friend made the connection.

There's no way 8086/8088 microcode would even work on something like Pentium 4. Microcode is very uarch-specific and we actually know a bit about the structure of Netburst microcode which is definitely very different from 8086 microcode.

It could be algorithmically similar, but in reality most of the ucode from 8086 wouldn't even be used anymore since the functions would be hardcoded into the decoders/execution units/etc.

The biggest place where I'd expect microcode to actually be used would likely be in implicit cases like handling various exceptions/faults, replays, etc. But that would also be very uarch specific.

It's possible the guy you knew was saying something accurate but it was just misinterpreted somehow.

DrMrLordX · Nov 12, 2015

looncraz said:
I compared Excavator to Steamroller directly and estimated a 4~15% IPC increase, largely in floating point. The average was 9.85%. This tied it with the results pathway I could calculate for Penryn very closely, still losing in FPU, but winning in integer.

On fp workloads, I saw more like an 11% increase in IPC from Steamroller to Excavator. The problem is that a lot of the XV numbers going around are coming from cTDP-strangled, throttle-prone Carrizo chips. Check out what happens when you have a cTDP-unlocked developer platform with Carrizo in it:

http://www.overclock.net/t/1560230/jagatreview-hands-on-amd-fx-8800p-carrizo/400_100#post_24310470

Connecting the dots from there isn't too hard. Carrizo @ 3.4 GHz managed an R10 score of 13146. In a MT scenario, that chip is getting ~966.6 CB per thread per GHz. Zen will (allegedly) get ~1353.3 per thread per GHz in the same scenario, assuming a 40% improvement.

Now if you look here:

You'll see the 6700k getting 40731 @ 4.8 GHz. That's ~1060.7 per thread per GHz. Here we see two interesting facts: First, XV is very close to the 6700k in this old SSE2 benchmark, losing by the margin it does thanks to AMD being unable to clock it very high and deploy more modules. Secondly, a 4c/8t Zen should require only a clockspeed of ~3.8 GHz to match the 4.8 GHz 6700k in R10.

Haswell-like performance? I think not.

Dresdenboy · Nov 12, 2015

Dresdenboy said:
Haswell/Skylake are big cores, which also need more power when doing full 256b FMACs with PRF and L1 at their limits. Intel chips have a specific AVX base clock and turbo headroom. Removing that burden in a design would allow to use more cores. GPUs also scale by adding compute cores. It's also much simpler than to work on the shader uarch.

I found a gem.

As the table shows, when IPC=2, HSW consumes 11.05 nJ per cycle. Of that cost, 8.31 nJ comes from a fixed overhead cost and 2.74 nJ comes from the variable cost of the instructions. As this data implies, the actual operation cost of an instruction (e.g., the floatingpoint arithmetic implied by a floating-point instruction) is only a small fraction of the total power of the CPU.

This is about Livermore loops.
Paper: http://kentcz.com/downloads/P149-ISCA14-Preprint.pdf
Slides: http://kentcz.com/downloads/ISCA149_slides_final.pdf

So if AMD creates smaller cores for HPC throughput, it could reduce the fixed energy cost and add cores instead. This would improve the chips total HPC power efficiency. That's it.

TechGod123 · Nov 12, 2015

DrMrLordX said:
On fp workloads, I saw more like an 11% increase in IPC from Steamroller to Excavator. The problem is that a lot of the XV numbers going around are coming from cTDP-strangled, throttle-prone Carrizo chips. Check out what happens when you have a cTDP-unlocked developer platform with Carrizo in it:

http://www.overclock.net/t/1560230/jagatreview-hands-on-amd-fx-8800p-carrizo/400_100#post_24310470

Connecting the dots from there isn't too hard. Carrizo @ 3.4 GHz managed an R10 score of 13146. In a MT scenario, that chip is getting ~966.6 CB per thread per GHz. Zen will (allegedly) get ~1353.3 per thread per GHz in the same scenario, assuming a 40% improvement.

Now if you look here:

You'll see the 6700k getting 40731 @ 4.8 GHz. That's ~1060.7 per thread per GHz. Here we see two interesting facts: First, XV is very close to the 6700k in this old SSE2 benchmark, losing by the margin it does thanks to AMD being unable to clock it very high and deploy more modules. Secondly, a 4c/8t Zen should require only a clockspeed of ~3.8 GHz to match the 4.8 GHz 6700k in R10.

Haswell-like performance? I think not.

This...is exciting.

Dresdenboy · Nov 12, 2015

DrMrLordX, SKL has a SMT penalty in the score/GHz/thread calculation. XV has only a small one (CMT). That (avg.) 40% number could also stand for single threaded integer code. We don't know it.

looncraz · Nov 12, 2015

Dresdenboy said:
Regarding XV IPC:
Have you seen the Carrizo measurements on http://instlatx64.atw.hu/ ? There is also a short direct comparison between cores:
http://users.atw.hu/instlatx64/BDZvsPLDvsSTMvsCRZ.txt

http://looncraz.net/research/cpu/ipc/amd_lat/

I took all of the available data in the link you provided that matched up with Bulldozer, Piledriver, Steamroller, and Excavator and made a few charts. :thumbsup:

This would suggest Excavator is a smaller improvement than we know it to be. Obviously AMD focused on the right instructions this time. The drop from Bulldozer to Piledriver is impressive.

looncraz · Nov 12, 2015

Tuna-Fish said:
While what you say is probably true of a lot of microcoded instructions, the big reason they are not being updated is that they are deprecated. You absolutely could not get significant speedup of any kind with better microcode simply because modern compilers almost never emit instructions that run microcode.

Microcode is for weird corner cases and legacy software.

Well, the whole "modern compiled" thing may very well put his whole statement in context. This was only months after the Pentium 4 release

Maybe it is one of those things that was true then, and not even remotely true now?

Dresdenboy · Nov 12, 2015

looncraz said:
http://looncraz.net/research/cpu/ipc/amd_lat/

I took all of the available data in the link you provided that matched up with Bulldozer, Piledriver, Steamroller, and Excavator and made a few charts.

This would suggest Excavator is a smaller improvement than we know it to be. Obviously AMD focused on the right instructions this time. The drop from Bulldozer to Piledriver is impressive.

Nice graphs! It looks like instruction execution changes settled somehow with XV. Then the biggest part of the IPC improvement comes from L1 cache size and other core improvements (as usual I admit

).

looncraz · Nov 12, 2015

Exophase said:
Who suggests this? I don't see a reason to believe they'd be tightly coupled like that.

I remember reading some details about Bulldozer a loong time ago that made me believe that a branch misprediction would result in a L1 flush and the L1 would need to be refilled by the L2 before execution resumes. The problem, as I recall, was how little the integer unit can do at once. This meant, at least to me, that the branch misprediction penalty would, in effect, by constrained by the L2 latency (since a misprediction will flush the pipelines at the same time it initiated an L1 refill, there would only be, maybe, a 1 cycle penalty that was not driven by the L2).

I think this came up in a conversation with JF-AMD, but I really have no idea how to search for that :'(

[Techpowerup] AMD "Zen" CPU Prototypes Tested, "Meet all Expectations"

Where do you think this will land performance wise

Intel i7 Haswell-E 8 CORE

Intel i7 Skylake

Intel i5 Skylake

Just another Bulldozer attempt

Senior member

Diamond Member

Senior member

Golden Member

Diamond Member

Senior member

Lifer

Senior member

Senior member

Senior member

Golden Member

Diamond Member

Senior member

Golden Member

Golden Member

Diamond Member

Diamond Member

Lifer

Golden Member

Member

Golden Member

Senior member

Senior member

Golden Member

Senior member