New Zen microarchitecture details

The Stilt · May 2, 2016

naukkis said:
You compare 4-core Intel to 2-module AMD CPUs. 4-core Zen will have about 100 percent more FPU/SIMD resources and L2 bandwidth versus the two-module versions.

Every module has a 2-thread SIMD FPU, and every Zen core will have about the same FPU resources (2 threads, L2). Only L1 bandwidth will be lower, but as write throughput is L2-cache limited, the module cores can't get the full benefit from it. So 4-core Zen should be about twice as fast as a 2-module construction core chip in heavy SIMD workloads. So very close to Haswell.....

The test was ST.

naukkis · May 2, 2016

The Stilt said:
The test was ST.

Yeah, but you should compare dual-module Ex to 2-core, 4-thread Intel cores. 4-core Zen will have twice the SIMD resources and L2 bandwidth, and if in that situation it manages to be only 40% faster than dual modules, AMD has screwed something up badly.

If AMD reused the Excavator FPU, 2-core Zen has exactly the same FPU as 2-module Ex (there is some speculation that AMD even expanded some resources); only L1 bandwidth won't match. So SMT 2-core Zen should perform about the same as 2-module Ex in heavy SIMD workloads. ST performance is unknown, but I doubt that Ex has anywhere near 100% CMT scaling in heavy SIMD-optimized workloads such as this.

The Stilt · May 2, 2016

naukkis said:
Yeah, but you should compare dual-module Ex to 2-core, 4-thread Intel cores. 4-core Zen will have twice the SIMD resources and L2 bandwidth, and if in that situation it manages to be only 40% faster than dual modules, AMD has screwed something up badly.

If AMD reused the Excavator FPU, 2-core Zen has exactly the same FPU as 2-module Ex (there is some speculation that AMD even expanded some resources); only L1 bandwidth won't match. So SMT 2-core Zen should perform about the same as 2-module Ex in heavy SIMD workloads. ST performance is unknown, but I doubt that Ex has anywhere near 100% CMT scaling in heavy SIMD-optimized workloads such as this.

So you expect a single Zen CCX (4 cores) to have twice the SIMD resources of two Excavator CUs? I'm pretty sure that's not the case, unfortunately.

I also expect Zen to show larger gains in L3 bandwidth than in L2.

SAAA · May 2, 2016

Well, a 100% difference sounds like a really extreme case... care to test something other than x265?

AMD stated over 40%, but that's an average for sure so it might be 0% in some cases while 80% in others for example...

If Zen is close to being a distant relative of A9X, but born for servers instead, it shouldn't do too bad: my personal expectations put it around Sandy Bridge level IPC on most code and better when using some recent instructions (like AVX2).

naukkis · May 2, 2016

The Stilt said:
So you expect a single Zen CCX (4 cores) to have twice the SIMD resources of two Excavator CUs? I'm pretty sure that's not the case, unfortunately.

I also expect Zen to show larger gains in L3 bandwidth than in L2.

Have there been any indications that a Zen core would have fewer FPU resources than an Ex module? And if they redesigned the FPU to be simpler (why? both Zen and Ex have a 2-way SMT FPU), what could they cut?

And Zen will have an L2 per core, bandwidth-wise equal to an Ex module's but with lower latency. So 4-core Zen should have at least twice the L2 bandwidth of 2-module Ex.

Dresdenboy · May 2, 2016

The Stilt, thanks for these results!

ShintaiDK said:
Yep, no need to keep waiting for "magic". It's going to be the same case as always.

There is no need to wait. The 128b FPU thing has been known since October. The IPC increase is an average, so the upside is on fixed point/int, it seems. FP throughput won't be much higher than in XV. Latency, legacy code, and power seem to have been the main optimization targets.

Exophase · May 2, 2016

The Stilt said:
The test was ST.

Probably benefits more from a full-width AVX2 solution than most software though. Would be curious to see SB and IB results too (AFAIK it doesn't use AVX1, nothing really FP based)

AtenRa · May 2, 2016

Are you comparing Mobile 15-35W TDP vs Desktop 85-95W TDP SKUs???

The Stilt · May 2, 2016

AtenRa said:
Are you comparing Mobile 15-35W TDP vs Desktop 85-95W TDP SKUs???

Did you actually read what the post was about?

ShintaiDK · May 2, 2016

The Stilt said:
Did you actually read what the post was about?

He went straight to denial before reading it

Personal attacks are not allowed here.
Markfw900

DrMrLordX · May 2, 2016

Dresdenboy said:
FP throughput won't be much higher than in XV.

That would surprise me. I expect XV and Zen to be close(r) in int code, but the difference in fp throughput should be dramatic.

AtenRa · May 2, 2016

The Stilt said:
Did you actually read what the post was about?

Is the Excavator performance below at 15-35W TDP on a mobile platform, or what??

The Stilt said:
So:

Excavator (AVX2) = 1.25fps
Steamroller = 1.26fps (+0.8%)
Excavator (No-AVX2) = 1.30fps (+4%)
Haswell = 2.36fps (+88.8%)
Skylake = 2.55fps (+104.0%)
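
For anyone checking the percentages in the list above, they follow from dividing each result by the Excavator (AVX2) baseline. A minimal sketch, assuming only the fps figures quoted in the post; the names `results_fps` and `relative_gain_percent` are made up for illustration:

```python
# x265 single-thread results quoted in the post (fps, fixed clocks).
results_fps = {
    "Excavator (AVX2)": 1.25,      # baseline used for the percentages
    "Steamroller": 1.26,
    "Excavator (No-AVX2)": 1.30,
    "Haswell": 2.36,
    "Skylake": 2.55,
}


def relative_gain_percent(fps, baseline_fps):
    """Gain over the baseline, in percent."""
    return (fps / baseline_fps - 1.0) * 100.0


baseline = results_fps["Excavator (AVX2)"]
gains = {
    name: round(relative_gain_percent(fps, baseline), 1)
    for name, fps in results_fps.items()
}

for name, gain in gains.items():
    print(f"{name} = {results_fps[name]:.2f}fps ({gain:+.1f}%)")
```

The computed gains match the figures in the quoted list, so the percentages are all relative to the AVX2 Excavator run rather than to each other.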

The Stilt · May 2, 2016

AtenRa said:
Is the Excavator performance below at 15-35W TDP on a mobile platform, or what??

Please do elaborate: what does the TDP have to do with IPC? These are fixed frequency runs, single threaded.

AtenRa · May 2, 2016

The Stilt said:
Please do elaborate: what does the TDP have to do with IPC? These are fixed frequency runs, single threaded.

Sorry but the performance of a single Core/Thread at 15W TDP is not the same as the Core/Thread at 95W TDP.

Example, A8-7600 at 45-55-65W TDP.

Same CPU, same Turbo, same platform. Single Thread performance is not the same, simply because the Core will throttle down faster at 45W TDP than at 65W or 95W TDP etc.

A8-7600 Turbo = 3800MHz
A10-7700K (95W TDP) Turbo = 3800MHz

Have a look at ST from 45W to 95W TDP:

[Two Cinebench result images, not preserved]

The Stilt · May 2, 2016

Who said I had the TDP set to 15W on Carrizo? It is an AMD reference system, which is configured to 35W/42W by default :sneaky:

Unless of course you reckon even the 35/42W TDP limit will be exhausted by stressing a single core?

AtenRa · May 2, 2016

The Stilt said:
Who said I had the TDP set to 15W on Carrizo? It is an AMD reference system, which is configured to 35W/42W by default :sneaky:

Unless of course you reckon even the 35/42W TDP limit will be exhausted by stressing a single core?

Even at 45W TDP, Single Thread performance IS NOT the same as at 95W TDP. I have just posted Cinebench above, have a look.

Dresdenboy · May 2, 2016

naukkis said:
Have there been any indications that a Zen core would have fewer FPU resources than an Ex module? And if they redesigned the FPU to be simpler (why? both Zen and Ex have a 2-way SMT FPU), what could they cut?

And Zen will have an L2 per core, bandwidth-wise equal to an Ex module's but with lower latency. So 4-core Zen should have at least twice the L2 bandwidth of 2-module Ex.

Have a look at this table:
http://users.atw.hu/instlatx64/CPU_chart_v10.png
It sums up a lot of what is known so far. Check the XV and Zeppelin columns for the Vxxx instructions. The throughput is given as inverse throughput (1/instructions per cycle). For most FP SIMD ops, throughput is the same, and some latencies are lower (good for dependent instruction chains).
Also the FPU is 4 issue instead of 3 (for 128b ops), and can execute for example 2 FMUL and 2 FADD (64b or 128b) in one cycle vs. 2 FMUL OR 2 FADD (or one of each) with XV (3rd unit is FMISC). It depends on the code as usual.

These are theoretical considerations as anything else from fetch to retire affects the resulting performance.

In the end, SR vs. PD is an 8 FPU chip vs. a 4 FPU chip. This leads to another answer:

The Stilt said:
So you expect a single Zen CCX (4 cores) to have twice the SIMD resources of two Excavator CUs? I'm pretty sure that's not the case, unfortunately.

1 CCX: 4x 2x (FMA or FMUL+FADD)
2 XV CUs: 2x 2x (one of FMA/FMUL/FADD)
(all 128b)
Int SIMD/AVX2 is a bit different.
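
The count above can be written out as simple arithmetic. This is only a sketch of the figures given in this post (128b FP slots per cycle, counted per core or per CU), not a measured or confirmed hardware spec; the function name `fp_slots_per_cycle` is invented for illustration:

```python
def fp_slots_per_cycle(units, slots_per_unit):
    """Peak 128b FP slots per cycle for a group of cores or CUs."""
    return units * slots_per_unit


# 1 CCX: 4 cores, each with 2x (FMA or FMUL+FADD), as listed in the post.
zen_ccx = fp_slots_per_cycle(units=4, slots_per_unit=2)

# 2 XV CUs: 2 shared FPUs, each with 2x (one of FMA/FMUL/FADD).
xv_two_cus = fp_slots_per_cycle(units=2, slots_per_unit=2)

ratio = zen_ccx / xv_two_cus
print(f"1 CCX: {zen_ccx} slots, 2 XV CUs: {xv_two_cus} slots, ratio {ratio:.1f}x")
```

On these assumptions the CCX has twice the peak 128b FP slots of two XV CUs; as noted above, everything from fetch to retire still decides how much of that shows up in real code.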

el etro · May 2, 2016

The Stilt locked the clocks for the test, and the test measures single-thread performance; that is not the question. The question is that the CON core was not built for this kind of load (according to guys who really understand CPU design! Not me), so it is normal that it performs so poorly in this test, and Zen probably will not change that.

I still think of best-case Excavator IPC (integer) + 40% for Zen. Maybe 7-zip or Geekbench integer tests better represent what kind of IPC Zen may bring.

Looncraz's tests show that Intel's big-core IPC advantage is not too big in most kinds of loads, but CON loses in FP, with some outliers in the FP tests where Intel's IPC advantage goes up to 300%.

In the end the gap may not be out of reach, at least in most types of loads.

Arachnotronic · May 2, 2016

el etro said:
The Stilt locked the clocks for the test, and the test measures single-thread performance; that is not the question. The question is that the CON core was not built for this kind of load (according to guys who really understand CPU design! Not me), so it is normal that it performs so poorly in this test, and Zen probably will not change that.

I still think of best-case Excavator IPC (integer) + 40% for Zen. Maybe 7-zip or Geekbench integer tests better represent what kind of IPC Zen may bring.

Looncraz's tests show that Intel's big-core IPC advantage is not too big in most kinds of loads, but CON loses in FP, with some outliers in the FP tests where Intel's IPC advantage goes up to 300%.

In the end the gap may not be out of reach, at least in most types of loads.

That's a Sandy Bridge you're looking at there. Skylake is probably ~34% faster per clock than Sandy Bridge.

JoeRambo · May 2, 2016

The Stilt said:
Unless of course you reckon even the 35/42W TDP limit will be exhausted by stressing a single core?

LOL, application of so much salt should be restricted near wounds. :wub:

Still, AVX2 (and vector code overall) is a bad scenario for current AMD stuff. Intel simply has beastly execution resources for such code, with massive bandwidth from the caches to sustain performance. But is it very relevant for the real world? I don't think so. I do hope Zen will have decent integer performance and an SSE2/AVX vector implementation that has no glass jaws or hidden gotchas. I don't really care if they execute 256-bit vectors in 2x the time or at half the throughput (it's not like Intel jumped to full-on 256 bits with Sandy Bridge either; it was slowish, and the epic leap was made with Haswell).

Current AMD chips are so disastrously bad that it is easy to forget what really matters

The Stilt · May 2, 2016

AtenRa said:
Even at 45W TDP, Single Thread performance IS NOT the same as at 95W TDP. I have just posted Cinebench above, have a look.

A8-7600 (65W) scores exactly the same in ST as A10-7700K as long as the clocks are not throttled down by the TDP restriction.

I've said a million times that the power management on Kaveri is completely broken and that the chip doesn't know its power consumption.

Static frequency = static performance regardless of the TDP.

el etro · May 2, 2016

Arachnotronic said:
That's a Sandy Bridge you're looking at there. Skylake is probably ~34% faster per clock than Sandy Bridge.

It all depends on the load. In some of these SKL is more than 60% faster than SNB, not to mention encryption loads...

I think that overall SKL is 20% to 25% faster than SNB per clock, which is around what Anand concluded.

The Stilt · May 2, 2016

Yeah just like AtenRa said, my clocks are definitely all over the place and the IPC is completely TDP limited! Oh wait... :sneaky:

@ 2.9GHz Expected: ~1.133 (1.25/3200 * 2900)
@ 2.9GHz Measured: 1.14fps

@ 3.2GHz Measured: 1.25fps (reference)

@ 3.4GHz Expected: ~1.328 (1.25/3200 * 3400)
@ 3.4GHz Measured: 1.32fps
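
The "Expected" figures above are a straight linear scaling of the 3.2GHz reference result with clock speed. A minimal sketch of that check, assuming only the numbers in this post; `expected_fps` and `deviations` are illustrative names:

```python
REFERENCE_MHZ = 3200   # reference clock from the post
REFERENCE_FPS = 1.25   # measured at the reference clock

# Measured results at the other two fixed clocks (MHz -> fps).
measured = {2900: 1.14, 3400: 1.32}


def expected_fps(clock_mhz):
    """Scale the reference result linearly with clock frequency."""
    return REFERENCE_FPS / REFERENCE_MHZ * clock_mhz


deviations = {}
for mhz, fps in measured.items():
    expected = expected_fps(mhz)
    # Percentage by which the measurement differs from linear scaling.
    deviations[mhz] = (fps / expected - 1.0) * 100.0
    print(f"@ {mhz / 1000:.1f}GHz expected {expected:.3f}fps, "
          f"measured {fps:.2f}fps ({deviations[mhz]:+.2f}%)")
```

Both measurements land within about 1% of linear scaling, which is what you would expect if the clocks really were held fixed and not throttled during the run.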

AtenRa · May 2, 2016

The Stilt,

Just lower TDP to 15W and run single thread again and see if it will perform the same as 42W TDP. Simple as that.

edit : question

Does the Desktop Excavator (Bristol Ridge) only have 700MHz NB frequency??

The Stilt · May 2, 2016

AtenRa said:
The Stilt,

Just lower TDP to 15W and run single thread again and see if it will perform the same as 42W TDP. Simple as that.

edit : question

Does the Desktop Excavator (Bristol Ridge) only have 700MHz NB frequency??

I probably need to disable STAPM too in order to make it throttle. Otherwise it will allow boost to 25W for 200 seconds, which is almost 2/3 of the duration of the run.

All Carrizos have NBDPM, which automatically adjusts the NB frequency. Under load it jumps to 1300MHz and at idle it drops to 600-800MHz. Bristol Ridge will have the same NCLK ceiling, yes.
