[Techpowerup] AMD "Zen" CPU Prototypes Tested, "Meet all Expectations"

looncraz · Nov 12, 2015

Exophase said:
It's possible the guy you knew was saying something accurate but it was just misinterpreted somehow.

Very possible. Not like any of us were sober


Exophase · Nov 12, 2015

looncraz said:
Well, the whole "modern compiled" thing may very well put his whole statement in context. This was only months after the Pentium 4 release.

Maybe it is one of those things that was true then, and not even remotely true now?

No I think it was pretty true then. Most useful instructions on Pentium 4 were not microcoded (same thing can be said going back to at least 486, maybe 386). Among instructions that were microcoded, they were either niche, hard to use by a compiler, or already had a reputation for being slow in Intel processors going back several generations, like the loop instruction.

The biggest exceptions I can see to this were 16-bit multiplies, divisions, shift memory location, indirect calls, and the string instructions. But that collection of instructions isn't significant enough to allow a massive performance improvement with better microcode (assuming the microcode had a lot of room for improvement to begin with)

Exophase · Nov 12, 2015

looncraz said:
I remember reading some details about Bulldozer a loong time ago that made me believe that a branch misprediction would result in an L1 flush and the L1 would need to be refilled by the L2 before execution resumes. The problem, as I recall, was how little the integer unit can do at once. This meant, at least to me, that the branch misprediction penalty would, in effect, be constrained by the L2 latency (since a misprediction will flush the pipelines at the same time it initiates an L1 refill, there would only be, maybe, a 1 cycle penalty that was not driven by the L2).

What would you flush from L1 exactly? The pipeline could have contained no loads whatsoever yet will still be subject to the same misprediction penalty. The result could not be the entire L1 being flushed, that would have a far more dire effect.

It's possible that you could resolve L1 misses only at the end of the pipeline and force a replay, which would make the L1 miss penalty look similar to a branch misprediction penalty. But that'd be a very bad design for an OoO CPU. The L1 miss should only cause dependent instructions to stall, which may mean no stall at all if enough independent instructions can be scheduled ahead of the load. A branch misprediction, on the other hand, will almost always cause work to be thrown out because all of the instructions that come after the branch are dependent.

looncraz said:
I think this came up in a conversation with JF-AMD, but I really have no idea how to search for that :'(

I'd take it with some salt then because frankly JF-AMD said a lot of things about uarch for his company's processors that were just plain wrong. The info about K10 being 3 ALU or 3 AGU vs Bulldozer being 2 ALU + 2 AGU was particularly bad.

looncraz · Nov 12, 2015

DrMrLordX said:
On fp workloads, I saw more like an 11% increase in IPC from Steamroller to Excavator.
[snip]
Haswell-like performance? I think not.

11% will still not get you very far. But, yes, I saw some benchmarks with 15%, some with 4%. I just average them all together.

And, of course, I pretend 40% is a strictly precise improvement over the average performance of Excavator, which is largely estimated from about four benchmarks, whereas my previous estimations are often based on over two dozen benchmarks (some of them MT benchmarks run with one thread/core enabled).

So, if you consider the high-points of Excavator performance, we'll hit Skylake, if you consider the low points, we'll hit Sandy Bridge. The average falls smack on Haswell. In fact, I've fallen on Haswell IPC using at least three different methods now. Albeit, I do have a best-case, yet realistic, scenario which shows Zen kissing Skylake as it blows past it, but that would require AMD to have actually reached L2 and L3 performance parity with Intel.

Zen does look like it might have a very nice FPU, though. Its breakdown of instructions finally makes sense, after putting them all in a spreadsheet. There's something special about fp3 that allows it to gang-up with other floating point units more easily (all roads lead to fp3 kinda thing).

fop reg, mem
fmul reg, mem
ssemuladd

These each require FP3 + some other unit. Only one floating point multi-fp op, fcmp-dbl, doesn't require fp3.

Honestly, I do have a hard time constraining the performance, from what we know of Zen from the gcc patch, to 40%. It appears the most critical instructions are exactly the ones that have been optimized most (surprise, surprise). Some of these look like we can do twice as many as Bulldozer without even looking at any other possible/likely improvements (particularly simple ops, like add/fadd). And, yes, that would be Skylake territory.

looncraz · Nov 12, 2015

Dresdenboy said:
Nice graphs! It looks like instruction execution changes settled somehow with XV. Then the biggest part of the IPC improvement comes from L1 cache size and other core improvements (as usual, I admit).

I have no doubt the biggest improvement came from doubling the L1-D and specific FPU instruction optimizations being close behind.

The slower the L2, the more important the L1 storage happens to be - particularly for data, which can't always stream as easily as instructions.

At least that's the way I see it

Haserath · Nov 12, 2015

DrMrLordX said:
On fp workloads, I saw more like an 11% increase in IPC from Steamroller to Excavator. The problem is that a lot of the XV numbers going around are coming from cTDP-strangled, throttle-prone Carrizo chips. Check out what happens when you have a cTDP-unlocked developer platform with Carrizo in it:

http://www.overclock.net/t/1560230/jagatreview-hands-on-amd-fx-8800p-carrizo/400_100#post_24310470

Connecting the dots from there isn't too hard. Carrizo @ 3.4 GHz managed an R10 score of 13146. In a MT scenario, that chip is getting ~966.6 CB per thread per GHz. Zen will (allegedly) get ~1353.3 per thread per GHz in the same scenario, assuming a 40% improvement.

Now if you look here:

You'll see the 6700k getting 40731 @ 4.8 GHz. That's ~1060.7 per thread per GHz. Here we see two interesting facts: First, XV is very close to the 6700k in this old SSE2 benchmark, losing by the margin it does thanks to AMD being unable to clock it very high and deploy more modules. Secondly, a 4c/8t Zen should require only a clockspeed of ~3.8 GHz to match the 4.8 GHz 6700k in R10.

Haswell-like performance? I think not.

SMT performance is only about 30% better in cinebench. With 4 cores/4 threads that Skylake would be around 30k, which makes its score above 1500 per thread per Ghz.
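
The per-thread-per-GHz figures traded in this exchange can be reproduced with a quick sketch. The scores and clocks are the ones quoted above; the function name `per_thread_per_ghz` is purely illustrative, and the 30000 no-SMT Skylake score is only the rough "around 30k" estimate from this post, not a measurement.

```python
def per_thread_per_ghz(score, threads, ghz):
    """Normalize a Cinebench R10 score by thread count and clock speed."""
    return score / threads / ghz

# Figures quoted in the thread.
carrizo = per_thread_per_ghz(13146, threads=4, ghz=3.4)       # 2M/4T Carrizo @ 3.4 GHz
skylake_smt = per_thread_per_ghz(40731, threads=8, ghz=4.8)   # 6700k, 4c/8t @ 4.8 GHz

# Assumption: roughly 30k for the same chip with SMT off (4c/4t).
skylake_no_smt = per_thread_per_ghz(30000, threads=4, ghz=4.8)

# Hypothetical Zen: Carrizo's figure plus the claimed 40%.
zen_claimed = carrizo * 1.40

for name, value in [
    ("Carrizo", carrizo),
    ("Skylake (SMT)", skylake_smt),
    ("Skylake (no SMT, est.)", skylake_no_smt),
    ("Zen (claimed +40%)", zen_claimed),
]:
    print(f"{name}: {value:.1f} per thread per GHz")
```

The same score looks very different depending on whether you divide by 8 threads or 4, which is the whole disagreement here.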

looncraz · Nov 12, 2015

Haserath said:
SMT performance is only about 30% better in cinebench. With 4 cores/4 threads that Skylake would be around 30k, which makes its score above 1500 per thread per Ghz.

Indeed so.

40731 * 0.7 = 28511.7
28511.7 / 4 = 7127.925
7127.925 / 3.8 = 1875.77

So 1875/core/GHz.
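
As a sanity check, here is the same arithmetic as a small sketch. The helper name `per_core_per_ghz` is made up for illustration, and the 0.7 factor is this post's assumption that 70% of the score remains without SMT. Since the 40731 score was quoted at 4.8 GHz earlier in the thread, the sketch prints the result for both the 3.8 divisor used above and a 4.8 divisor.

```python
def per_core_per_ghz(score, fraction_without_smt, cores, ghz):
    """Estimate a no-SMT score per core per GHz from an SMT-enabled score."""
    return score * fraction_without_smt / cores / ghz

score = 40731  # 6700k R10 score quoted earlier in the thread

# Divisor used in the post above.
at_3_8 = per_core_per_ghz(score, 0.7, cores=4, ghz=3.8)

# Divisor if the 4.8 GHz clock quoted alongside the score is used instead.
at_4_8 = per_core_per_ghz(score, 0.7, cores=4, ghz=4.8)

print(f"dividing by 3.8 GHz: {at_3_8:.2f}")
print(f"dividing by 4.8 GHz: {at_4_8:.2f}")
```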

The link also shows the Carrizo being locked to 15W TDP, though I'm not sure that was really enforced.

That 2:1 ratio, however, is pretty close to what we see with Piledriver vs Skylake:

http://www.anandtech.com/bench/product/1544?vs=700

ST: 8372 vs 4114
MT: 29116 vs 12857

The clock speeds will be all over the place, of course, but the Skylake will pretty much NOT run at its base clock, it will be stuck at its max turbo in single threaded, and be 50% of the way there for a fully loaded CPU that is properly cooled (at least in Cinebench from my testing w/ Haswell).

So MT will be 3.7GHz Skylake vs 3.8GHz Piledriver, and ST will be 3.9GHz Skylake vs 4.0GHz Piledriver.

We know Steamroller did effectively nothing for Cinebench scores, and we know that Excavator added a decent 9.85% or so boost in Cinebench.

So, Synthetic Excavator vs Skylake, equal clocks:

ST: 4635 vs 8372
MT: 13751 vs 29116

A theoretical 40% faster Zen:

ST: 6490 vs 8372
MT: 19250 vs 29116
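
A minimal sketch of the projection step, taking the synthetic Excavator scores above as given and applying a flat 40% uplift. The names (`project`, `EXCAVATOR_EST`, `SKYLAKE`) are illustrative only, and an evenly applied 40% is an assumption, not anything AMD has specified. The figures above are these results rounded.

```python
SKYLAKE = {"ST": 8372, "MT": 29116}         # Skylake scores from the comparison above
EXCAVATOR_EST = {"ST": 4635, "MT": 13751}   # synthetic Excavator estimates from above

def project(scores, uplift):
    """Apply a flat fractional uplift to every score."""
    return {test: value * (1 + uplift) for test, value in scores.items()}

zen = project(EXCAVATOR_EST, 0.40)

for test in ("ST", "MT"):
    share = zen[test] / SKYLAKE[test]
    print(f"{test}: Zen ~{zen[test]:.0f} vs Skylake {SKYLAKE[test]} ({share:.0%} of Skylake)")
```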

So, where does that put Zen vs Intel?

How does the 4690k compare?
ST: 6490 vs 7619
MT: 19250 vs 27090

Better, but still not there. Of course, on average, 40% faster, evenly applied, will create Haswell-like performance. The FPU, however, will need to double. And the current design details suggest that is exactly what may be happening.

JDG1980 · Nov 13, 2015

ShintaiDK said:
Broadwell-E will be 3.3Ghz for 8 cores and 3Ghz for 10 cores. All at 140W.

Do we have any confirmation for the 140W figure? Just because that's what was assigned to Haswell-E doesn't mean the same will be true of Broadwell-E, which uses a different process. (And the 140W TDP of Haswell-E is a rough estimate at best. Tom's Hardware measured the i7-5960X as having a maximum load of 122W, including VRM losses. Average under 100% load was only 106W.)

ShintaiDK said:
Anyone in their right mind still believing in 4Ghz Haswell IPC 8C/16T at 95W with 14 LPP for Zen?

I'm assuming the flagship 8C/16T Zen will probably come in at 3.0 GHz base clock (turbo clocks under lightly threaded loads will be higher), and have IPC averaging somewhere above Sandy Bridge and below Haswell. The target TDP of 95W should be achievable on the Samsung-derived FinFET process with these or similar parameters. Enthusiast boards will probably have better power stages to allow additional overclocking, which would of course invalidate the stock TDP rating.

I don't expect AMD to blow Intel away. That's unrealistic given their limited resources. I do expect them to come up with a product that is reasonably competitive, given the relative stagnation in the high-end desktop market since Sandy Bridge.

TechGod123 · Nov 13, 2015

looncraz said:
I remember reading some details about Bulldozer a loong time ago that made me believe that a branch misprediction would result in a L1 flush and the L1 would need to be refilled by the L2 before execution resumes. The problem, as I recall, was how little the integer unit can do at once. This meant, at least to me, that the branch misprediction penalty would, in effect, by constrained by the L2 latency (since a misprediction will flush the pipelines at the same time it initiated an L1 refill, there would only be, maybe, a 1 cycle penalty that was not driven by the L2).

I think this came up in a conversation with JF-AMD, but I really have no idea how to search for that :'(

Because I'm not fully aware, would the L1 cache you're referring to be the instruction cache or data cache that would get flushed?

ShintaiDK · Nov 13, 2015

JDG1980 said:
Do we have any confirmation for the 140W figure?

No, but it is the platform target.

lamedude · Nov 13, 2015

Dresdenboy said:
In another forum I asked for FMA usage in applications/games.

FWIW VC 2013's C library uses them for a few things. We found out right away since MS didn't bother checking if AVX was enabled before using them (this didn't get fixed until VC 2015). You can do "bcdedit /set xsavedisable 1" and see if anything crashes.

NostaSeronx · Nov 13, 2015

-32nm PDSOI - 12 Track (High Performance Lib)- vs

14nm LPP - 10.5 Track (High Performance Lib)
-20% frequency
+50% lower power
+85% lower leakage
+55% area shrink

22nm FDSOI - 8 Track + ABB (Fast High Density Lib // FBB Focus)
+30% frequency(est.)
+45% lower power(est.)
+50% lower leakage(est.)
+65% area shrink(est. the shrink is bigger for Mixed-signal)

22nm FDSOI - 8 Track + ABB (Fast High Density Lib // RBB Focus)
+10% frequency(est.)
+65% lower power(est.)
+70% lower leakage(est.)
+65% area shrink(est. the shrink is bigger for Mixed-signal)

DrMrLordX · Nov 13, 2015

Dresdenboy said:
DrMrLordX, SKL has a SMT penalty in the score/GHz/thread calculation. XV has only a small one (CMT). That (avg.) 40% number could also stand for single threaded integer code. We don't know it.

Skylake's SMT penalty is irrelevant. Running at its highest performance level, the chip in question (6700k) has the capability to handle 8 threads. A quad-core Zen would do exactly the same thing, just as a 2M XV can handle 4 threads. If Zen is really going to have 40% higher IPC than XV, then that's going to be after we take SMT penalties into account (which is something Zen will face, since it will feature SMT). If not, then AMD is simply not telling the truth about Zen.

I doubt it'll be that much of an improvement in int-heavy code. There will still be some improvement thanks to faster cache and (probably) shorter pipeline. Most of the gains will be in floating-point code.

Haserath said:
SMT performance is only about 30% better in cinebench. With 4 cores/4 threads that Skylake would be around 30k, which makes its score above 1500 per thread per Ghz.

Now we're just playing with numbers. The only way we can objectively compare Construction core modules and SMT-capable cores is to push them to their maximum thread counts and see where we lie. Yes, Intel does sell CPUs that simply don't utilize SMT for anything, but as you clearly stated, they're slower without SMT in R10 and many other benchmarks.

Sure, you can try to eliminate SMT from the equation to make Skylake look like it's getting better performance per thread per GHz, but since you have eliminated SMT, the maximum thread count for Skylake drops by half, hurting overall performance. Why do you think Carrizo is a performance loser against Skylake today in a benchmark like R10? Simple: thread counts and clockspeeds.

looncraz said:
The link also shows the Carrizo being locked to 15W TDP, though I'm not sure that was really enforced.

It is, in all the 15W cTDP Carrizo platforms, which is why they're not necessarily the best source of data concerning XV performance. For example, in 15W mode, Carrizo is forced into using DDR3-1600 no matter what DIMMs are in use or whatever other factors prevail.

If you really want hard data on what XV can do, you need a cTDP-unlocked system. Or you're just going to have to wait for Bristol Ridge and see what it can do.

Haserath · Nov 13, 2015

DrMrLordX said:
Skylake's SMT penalty is irrelevant. Running at its highest performance level, the chip in question (6700k) has the capability to handle 8 threads. A quad-core Zen would do exactly the same thing, just as a 2M XV can handle 4 threads. If Zen is really going to have 40% higher IPC than XV, then that's going to be after we take SMT penalties into account (which is something Zen will face, since it will feature SMT). If not, then AMD is simply not telling the truth about Zen.

I doubt it'll be that much of an improvement in int-heavy code. There will still be some improvement thanks to faster cache and (probably) shorter pipeline. Most of the gains will be in floating-point code.

Now we're just playing with numbers. The only way we can objectively compare Construction core modules and SMT-capable cores is to push them to their maximum thread counts and see where we lie. Yes, Intel does sell CPUs that simply don't utilize SMT for anything, but as you clearly stated, they're slower without SMT in R10 and many other benchmarks.

Sure, you can try to eliminate SMT from the equation to make Skylake look like it's getting better performance per thread per GHz, but since you have eliminated SMT, the maximum thread count for Skylake drops by half, hurting overall performance. Why do you think Carrizo is a performance loser against Skylake today in a benchmark like R10? Simple: thread counts and clockspeeds.

Except you calculated a 3.8Ghz Zen will equal a 4.8Ghz Skylake because Skylake's SMT penalty is much bigger than Carrizo's CMT penalty...

If you scale up Carrizo's score: 13146+40% IPC+30% SMT+26.3% clock= 30218 and even if the CMT penalty isn't part of that 40%: 30218+10%= 33240, Zen would be 20% behind.
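
The compounding above can be checked with a short sketch. The uplift percentages are the ones in this post; the helper name `compound` is illustrative, and treating each factor as an independent multiplier is itself an assumption.

```python
def compound(base, *uplifts):
    """Multiply a base score by a chain of fractional uplifts."""
    result = base
    for uplift in uplifts:
        result *= 1 + uplift
    return result

carrizo_score = 13146
skylake_score = 40731

# +40% IPC, +30% SMT, +26.3% clock (3.8 GHz -> 4.8 GHz)
zen_est = compound(carrizo_score, 0.40, 0.30, 0.263)

# Optionally add 10% back if the CMT penalty is not part of the 40%.
zen_est_no_cmt_penalty = compound(zen_est, 0.10)

deficit = 1 - zen_est_no_cmt_penalty / skylake_score
print(round(zen_est), round(zen_est_no_cmt_penalty), f"{deficit:.1%} behind")
```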

IEC · Nov 13, 2015

Haserath said:
Except you calculated a 3.8Ghz Zen will equal a 4.8Ghz Skylake because Skylake's SMT penalty is much bigger than Carrizo's CMT penalty...

If you scale up Carrizo's score: 13146+40% IPC+30% SMT+26.3% clock= 30218 and even if the CMT penalty isn't part of that 40%: 30218+10%= 33240, Zen would be 20% behind.

That seems... unlikely.

But I would love to have an incentive to replace my Skylake desktop

Dresdenboy · Nov 13, 2015

It will also be interesting to see which clock frequencies Zen is able to sustain, and with how many cores.

DrMrLordX · Nov 13, 2015

Haserath said:
Except you calculated a 3.8Ghz Zen will equal a 4.8Ghz Skylake because Skylake's SMT penalty is much bigger than Carrizo's CMT penalty...

No! You are missing the point.

Skylake gets X performance at Y GHz, period. It gets THE MOST performance using SMT. The fact that it only gains 30% per physical core from SMT is a foible of the design, NOT a penalty against XV or Zen. Zen will probably have the same problem. It will probably get ~70% of its maximum performance without using SMT at all, unless their design is radically different than Intel's.

If Zen is actually going to have 40% more IPC than XV, it is going to have that higher IPC irrespective of CMT penalties or SMT penalties. If it cannot manage this, then AMD is not telling the truth about their uarch. If we are going to take AMD at their word about the IPC improvement, then Zen is going to have an R10 score 40% higher than what XV can produce when fully loaded with the maximum thread count. Attempting to eliminate SMT or CMT penalties from the equation is not only absurd, but also counter-productive since the overall IPC of a given modern uarch is heavily influenced by all of the underlying technologies involved, including CMT and SMT. CMT and SMT are different enough that comparing uarches is very difficult; competing ISA extensions (AVX2 vs XOP vs NEON) can make things even worse. At least R10 uses an older ISA extension that is supported by Skylake, XV, and Zen (and apparently it has no problem using that ISA extension on AMD CPUs). That helps to reduce some of the confusion.

By the same token, if I take a program and load up all cores (4+ threads) on a 2m/4t Bulldozer and then compare it to XV and record the difference, I don't question the difference in IPC just on the basis that BD has a module penalty when all modules are loaded. It's a problem with that particular uarch that was more-or-less fixed with Steamroller (and it certainly didn't get worse with XV).

Summary/tl;dr: 1 Zen core + SMT is going to have to manage +40% performance at the same clockspeed vs. 1 XV module in order to fulfill AMD's claim of +40% IPC. They made good (and then some) on their IPC claim moving from SR to XV. Can they do it with Zen? We'll find out next year.

And here is a sobering thought: a 4m/8t XV CPU running @ 4.8 GHz would have an R10 score of 37117, assuming the memory interface scaled upwards enough to maintain performance (it would need to be something faster than DDR3-2133 CL9/10). That is very close to what Skylake 4c/8t produces today in a workload that heavily favors Intel's design over AMD's. AMD can't (or simply refuses to) produce such a processor, so Intel has no competition. But the design chops are there.
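
For reference, a sketch of how that hypothetical score falls out of the Carrizo result, assuming perfectly linear scaling with both thread count and clock speed (which is exactly the memory-interface caveat above). The function name `scale_score` is illustrative.

```python
def scale_score(score, threads_from, threads_to, ghz_from, ghz_to):
    """Scale a score linearly with thread count and clock speed (idealized)."""
    return score * (threads_to / threads_from) * (ghz_to / ghz_from)

# 2M/4T Carrizo @ 3.4 GHz scored 13146; project to 4M/8T @ 4.8 GHz.
xv_projected = scale_score(13146, threads_from=4, threads_to=8, ghz_from=3.4, ghz_to=4.8)

skylake_score = 40731  # 6700k 4c/8t @ 4.8 GHz
print(round(xv_projected), f"{xv_projected / skylake_score:.0%} of the 6700k score")
```

Scaling the unrounded score gives about 37118; the 37117 figure comes from rounding the per-thread number to 966.6 first.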

MrTeal · Nov 13, 2015

DrMrLordX said:
Summary/tl;dr: 1 Zen core + SMT is going to have to manage +40% performance at the same clockspeed vs. 1 XV module in order to fulfill AMD's claim of +40% IPC. They made good (and then some) on their IPC claim moving from SR to XV. Can they do it with Zen? We'll find out next year.

Do you have a source for the claim that AMD is promising one Zen core will produce 40% higher IPC than one XV module? That seems highly suspect, as it would imply that an 8 core Zen chip at the same clocks would get a score 5.6X higher in MT benchmarks than the two module Carrizo chips.
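
The 5.6X figure is just the core-to-module count ratio times the claimed uplift, sketched here with an illustrative helper (`relative_mt_score` is not from any source) and assuming perfect scaling at equal clocks.

```python
def relative_mt_score(zen_cores, xv_modules, core_vs_module=1.40):
    """MT score ratio if one Zen core is worth core_vs_module XV modules."""
    return zen_cores * core_vs_module / xv_modules

# 8-core Zen vs two-module Carrizo, same clocks.
print(relative_mt_score(zen_cores=8, xv_modules=2))
```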

DrMrLordX · Nov 13, 2015

MrTeal said:
Do you have a source for the claim that AMD is promising one Zen core will produce 40% higher IPC than one XV module? That seems highly suspect, as it would imply that an 8 core Zen chip at the same clocks would get a score 5.6X higher in MT benchmarks than the two module Carrizo chips.

No, all AMD has said is that Zen will have 40% higher IPC. The whole point of my previous post was to explain why it must be that way if AMD's claim is to have any merit whatsoever. There is no other viable way to compare the microarchitectures.

We can't just give AMD the benefit of the doubt and say, "well what they meant was, take a Zen core, turn off SMT, and then compare it to how fast an XV module is with one "core" disabled in the OS, and Zen is 40% faster" because XV is going to scale downward differently losing that one core than Zen (or Skylake, or any other SMT core) will scale downward when it loses SMT. The comparison would be meaningless. All we'd be doing is arbitrarily throwing out some of the computational resources of every CPU in the comparison to try to bring them into line with a dated view of what is (or should be) a CPU core.

If you have a core or module that requires two different processing threads to make full use of all available resources, then the only viable way to measure its IPC is to load up every core/module with as many threads as it can handle, and record the total performance. If you are attempting to ascertain how poorly-threaded or lightly-threaded code will operate on these processors, then you are going to have to test on an application-by-application basis to see how that software will function on a given CPU. As it stands, software that fails to utilize the full thread count of a processor will favor SMT-based solutions over CMT-based solutions. AMD has encountered that problem often enough that they are ready to get away from CMT.

itsmydamnation · Nov 13, 2015

DrMrLordX said:
I doubt it'll be that much of an improvement in int-heavy code. There will still be some improvement thanks to faster cache and (probably) shorter pipeline. Most of the gains will be in floating-point code.

Why? It's doubled its integer execution resources. From Piledriver onward, AMD moved instructions that had been on the ALUs from K7 through Bulldozer over to the AGUs because they were so ALU-limited; for Zen they are back on the ALUs. We are going to see better cache, better predictors, and a better load/store pipeline. All of these will help integer performance just as much as FP performance.

So what's the exact reasoning for why you think we won't see 40% more perf per clock for int but we will for FP?

The SMT "penalty" is too hard to calculate at this point: how big is Zen's scheduler, how big is the PRF, the load/store queues, etc.? The variability of the bottleneck in specific workloads will also change "the penalty".

NostaSeronx · Nov 13, 2015

MrTeal said:
Do you have a source for the claim that AMD is promising one Zen core will produce 40% higher IPC than one XV module?

The actual claim is a single Zen core{1C|2T} has 40% higher IPC over a single XV core{1C|1T}.

cytg111 · Nov 13, 2015

Haserath said:
Except you calculated a 3.8Ghz Zen will equal a 4.8Ghz Skylake because Skylake's SMT penalty is much bigger than Carrizo's CMT penalty...

If you scale up Carrizo's score: 13146+40% IPC+30% SMT+26.3% clock= 30218 and even if the CMT penalty isn't part of that 40%: 30218+10%= 33240, Zen would be 20% behind.

These numbers .. based on next to no information, this thread baffles me.
Anyway, application performance doesn't scale with IPC.

Exophase · Nov 13, 2015

cytg111 said:
Anyway, application performance doesn't scale with IPC.

How not? IPC is supposed to refer to some average measurement of real-world applications' work/MHz.

cytg111 · Nov 13, 2015

Exophase said:
How not? IPC is supposed to refer to some average measurement of real-world applications' work/MHz.

Cause the CPU is just one part of the whole? If you were to scale up the rest of the gang too, busses, memory etc, that would be another story.

MrTeal · Nov 13, 2015

cytg111 said:
Cause the CPU is just one part of the whole? If you were to scale up the rest of the gang too, busses, memory etc, that would be another story.

Generally when you see IPC talked about here, it's measured and compared to other designs by measuring performance in a variety of different applications rather than something based on the number of pipelines in a core or similar. In that case, those external factors are already included in "IPC increase".
