New Zen microarchitecture details


The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
You are comparing a 4-core Intel to 2-module AMD CPUs. A 4-core Zen will have about 100 percent more FPU/SIMD resources and L2 bandwidth than the two-module versions.

Every module has a 2-thread SIMD FPU, and every Zen core will have about the same FPU resources (2 threads, L2). Only the L1 bandwidth will be lower, but since write throughput is L2-limited, the module's cores can't get the full benefit from it anyway. So a 4-core Zen should be about twice as fast as a 2-module construction core in heavy SIMD workloads. So very close to Haswell...

The test was ST.
 

naukkis

Senior member
Jun 5, 2002
705
576
136
The test was ST.

Yeah, but you should compare a dual-module Ex to a 2-core, 4-thread Intel part. A 4-core Zen will have twice the SIMD resources and L2 bandwidth, and if in that situation it manages to be only 40% faster than a dual-module chip, AMD has screwed something up badly.

If AMD reused the Excavator FPU, a 2-core Zen has exactly the same FPU as a 2-module Ex (there is some speculation that AMD even expanded some resources); only the L1 bandwidth won't match. So an SMT 2-core Zen should perform about the same as a 2-module Ex in heavy SIMD workloads. ST performance is unknown, but I doubt that Ex has anywhere near 100% CMT scaling in heavily SIMD-optimized workloads like this one.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Yeah, but you should compare a dual-module Ex to a 2-core, 4-thread Intel part. A 4-core Zen will have twice the SIMD resources and L2 bandwidth, and if in that situation it manages to be only 40% faster than a dual-module chip, AMD has screwed something up badly.

If AMD reused the Excavator FPU, a 2-core Zen has exactly the same FPU as a 2-module Ex (there is some speculation that AMD even expanded some resources); only the L1 bandwidth won't match. So an SMT 2-core Zen should perform about the same as a 2-module Ex in heavy SIMD workloads. ST performance is unknown, but I doubt that Ex has anywhere near 100% CMT scaling in heavily SIMD-optimized workloads like this one.

So you expect a single Zen CCX (4 cores) to have twice as much SIMD resources as two Excavator CUs? I'm pretty sure that's not the case, unfortunately.

I also expect Zen to show larger gains in L3 bandwidth than in L2.
 

SAAA

Senior member
May 14, 2014
541
126
116
Well, a 100% difference sounds like a really extreme case... care to test something other than x265?

AMD stated over 40%, but that's an average for sure so it might be 0% in some cases while 80% in others for example...

If Zen is close to being a distant relative of the A9X, but born for servers instead, it shouldn't do too badly: my personal expectation puts it around Sandy Bridge-level IPC on most code, and better when using some recent instructions (like AVX2).
 

naukkis

Senior member
Jun 5, 2002
705
576
136
So you expect a single Zen CCX (4 cores) to have twice as much SIMD resources as two Excavator CUs? I'm pretty sure that's not the case, unfortunately.

I also expect Zen to show larger gains in L3 bandwidth than in L2.

Have there been any indications that a Zen core would have fewer FPU resources than an Ex module? And if they redesigned the FPU to be simpler (why would they, when both Zen and Ex have a 2-way SMT FPU?), what could they cut?

And Zen will have an L2 per core, equal in bandwidth to an Ex module's but with lower latency. So a 4-core Zen should have at least twice the L2 bandwidth of a 2-module Ex.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
The Stilt, thanks for these results!

Yep, no need to keep waiting for "magic". It's going to be the same case as always.
There is no need to wait. The 128b FPU thing has been known since October. The IPC increase is an average, so the upside seems to be on the fixed-point/integer side. FP throughput won't be much higher than in XV. Latency, legacy code, and power seem to have been the main optimization targets.
 

AtenRa

Lifer
Feb 2, 2009
14,001
3,357
136
Please do elaborate: what does the TDP have to do with IPC? These are fixed-frequency, single-threaded runs.

Sorry, but the performance of a single core/thread at 15W TDP is not the same as that of the core/thread at 95W TDP.

Example: the A8-7600 at 45-55-65W TDP.

Same CPU, same Turbo, same platform. Single-thread performance is not the same, simply because the core will throttle down sooner at 45W TDP than at 65W or 95W TDP.

A8-7600 Turbo = 3800MHz
A10-7700K (95W TDP) Turbo = 3800MHz

Have a look at ST from 45W to 95W TDP

nmyrgj.jpg


and

2427evq.jpg
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Who said I had the TDP set to 15W on Carrizo? It is an AMD reference system, which is configured to 35W/42W by default :sneaky:

Unless of course you reckon even the 35/42W TDP limit will be exhausted by stressing a single core?
 

AtenRa

Lifer
Feb 2, 2009
14,001
3,357
136
Who said I had the TDP set to 15W on Carrizo? It is an AMD reference system, which is configured to 35W/42W by default :sneaky:

Unless of course you reckon even the 35/42W TDP limit will be exhausted by stressing a single core?

Even at 45W TDP, Single Thread performance IS NOT the same as at 95W TDP. I have just posted Cinebench above, have a look.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Have there been any indications that a Zen core would have fewer FPU resources than an Ex module? And if they redesigned the FPU to be simpler (why would they, when both Zen and Ex have a 2-way SMT FPU?), what could they cut?

And Zen will have an L2 per core, equal in bandwidth to an Ex module's but with lower latency. So a 4-core Zen should have at least twice the L2 bandwidth of a 2-module Ex.
Have a look at this table:
http://users.atw.hu/instlatx64/CPU_chart_v10.png
It sums up a lot of what is known so far. Check the XV and Zeppelin columns for the Vxxx instructions. The throughput is given as inverse throughput (cycles per instruction, i.e. 1/IPC). For most FP SIMD ops, throughput is the same, and some latencies are lower (good for dependent instruction chains).
Also, the FPU is 4-issue instead of 3 (for 128b ops) and can execute, for example, 2 FMUL and 2 FADD (64b or 128b) in one cycle, vs. 2 FMUL or 2 FADD (or one of each) with XV (the 3rd unit is FMISC). It depends on the code, as usual.
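
For anyone not used to reading such tables, inverse throughput converts to instructions per cycle by taking the reciprocal. A tiny Python sketch of that conversion; the function name and the sample values are made up for illustration and are not taken from the chart:

```python
def ops_per_cycle(inverse_throughput):
    """Convert inverse throughput (cycles per instruction) to instructions per cycle."""
    if inverse_throughput <= 0:
        raise ValueError("inverse throughput must be positive")
    return 1.0 / inverse_throughput

# Hypothetical table entries, NOT values from the linked chart.
for inv in (0.5, 1.0, 2.0):
    print(f"inverse throughput {inv} -> {ops_per_cycle(inv):.2f} instructions/cycle")
```

So a lower number in those columns means more instructions retired per cycle.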

These are theoretical considerations as anything else from fetch to retire affects the resulting performance.

In the end, SR vs. PD is an 8 FPU chip vs. a 4 FPU chip. This leads to another answer:
So you expect a single Zen CCX (4 cores) to have twice as much SIMD resources as two Excavator CUs? I'm pretty sure that's not the case, unfortunately.
1 CCX: 4x 2x (FMA or FMUL+FADD)
2 XV CUs: 2x 2x (one of FMA/FMUL/FADD)
(all 128b)
Int SIMD/AVX2 is a bit different.
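
The pipe counts above can be tallied in a few lines. This is a rough Python sketch that takes the figures in this post at face value (128b ops, peak issue per cycle); the dictionary layout and the function name are made up for illustration, and real throughput depends on everything from fetch to retire.

```python
# Peak 128b FP ops issued per cycle, per FPU, as described in this post.
# Zen core: 2 FMA, or 2 FMUL + 2 FADD.
# XV module: 2 pipes, each doing one of FMA/FMUL/FADD.
# These are the post's claims, not measured values.
FPUS = {
    "1 Zen CCX": {"fpus": 4, "fma": 2, "fmul_plus_fadd": 4},
    "2 XV CUs": {"fpus": 2, "fma": 2, "fmul_plus_fadd": 2},
}

def peak_ops(chip, mix):
    """Peak ops per cycle for a chip and instruction mix (hypothetical helper)."""
    cfg = FPUS[chip]
    return cfg["fpus"] * cfg[mix]

for mix in ("fma", "fmul_plus_fadd"):
    zen = peak_ops("1 Zen CCX", mix)
    xv = peak_ops("2 XV CUs", mix)
    print(f"{mix}: CCX {zen} vs XV {xv} per cycle ({zen / xv:.0f}x)")
```

Under those assumptions the FMA-only case comes out at 8 vs. 4 per cycle, and a mixed FMUL+FADD stream at 16 vs. 4.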
 

el etro

Golden Member
Jul 21, 2013
1,581
14
81
The Stilt locked the clocks for the test, and the test measures single-thread performance; that is not in question. The point is that the CON core was not built for this kind of load (according to guys who really understand CPU design, not me!), so it is normal that it performs so poorly in this test, and Zen probably will not change that.

I am still thinking of best-case Excavator integer IPC + 40% for Zen. Maybe 7-Zip or the Geekbench integer tests better represent what kind of IPC Zen may bring.

single-threaded-3ghz.png


Looncraz's tests show that the Intel big-core IPC advantage is not too big in most kinds of loads, but CON loses in FP, with some outliers among the FP tests where the Intel IPC advantage goes up to 300%.

In the end the difference may not be out of reach, at least in most types of loads.
 
Mar 10, 2006
11,715
2,012
126
The Stilt locked the clocks for the test, and the test measures single-thread performance; that is not in question. The point is that the CON core was not built for this kind of load (according to guys who really understand CPU design, not me!), so it is normal that it performs so poorly in this test, and Zen probably will not change that.

I am still thinking of best-case Excavator integer IPC + 40% for Zen. Maybe 7-Zip or the Geekbench integer tests better represent what kind of IPC Zen may bring.

single-threaded-3ghz.png


Looncraz's tests show that the Intel big-core IPC advantage is not too big in most kinds of loads, but CON loses in FP, with some outliers among the FP tests where the Intel IPC advantage goes up to 300%.

In the end the difference may not be out of reach, at least in most types of loads.

That's a Sandy Bridge you're looking at there. Skylake is probably ~34% faster per clock than Sandy Bridge.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Unless of course you reckon even the 35/42W TDP limit will be exhausted by stressing a single core?

LOL, application of so much salt should be restricted near wounds. :wub:

Still, AVX2 (and vector code overall) is a bad scenario for current AMD stuff. Intel simply has beastly execution resources for such code, with massive bandwidth from the caches to sustain performance. But is it very relevant for the real world? I don't think so. I do hope Zen will have decent integer performance and an SSE2/AVX vector implementation that has no glass jaws or hidden gotchas. I don't really care if they execute 256-bit vectors in 2x the time or at half the throughput. (It's not like Intel jumped to full-on 256 bits with Sandy Bridge either; it was slowish, and the epic leap was made with Haswell.)

Current AMD chips are so disastrously bad that it is easy to forget what really matters :)
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Even at 45W TDP, Single Thread performance IS NOT the same as at 95W TDP. I have just posted Cinebench above, have a look.

:rolleyes:

A8-7600 (65W) scores exactly the same in ST as A10-7700K as long as the clocks are not throttled down by the TDP restriction.

I've said a million times that the power management on Kaveri is completely broken and that the chip doesn't know its own power consumption.

Static frequency = static performance regardless of the TDP.
 

el etro

Golden Member
Jul 21, 2013
1,581
14
81
That's a Sandy Bridge you're looking at there. Skylake is probably ~34% faster per clock than Sandy Bridge.

It all depends on the load. In some of these SKL is more than 60% faster than SNB, not to mention encryption loads...

I think that overall SKL is 20% to 25% faster than SNB per clock, which is around what Anand concluded.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Yeah just like AtenRa said, my clocks are definitely all over the place and the IPC is completely TDP limited! Oh wait... :sneaky:

@ 2.9GHz Expected: ~1.133 (1.25/3200 * 2900)
@ 2.9GHz Measured: 1.14fps

@ 3.2GHz Measured: 1.25fps (reference)

@ 3.4GHz Expected: ~1.328 (1.25/3200 * 3400)
@ 3.4GHz Measured: 1.32fps
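
The expected values above follow from simple linear scaling with clock speed. A minimal Python sketch of that check, using only the numbers from this post (the function names are illustrative assumptions):

```python
# Reference point from the post: 1.25 fps measured at 3200 MHz.
REF_FPS = 1.25
REF_MHZ = 3200

def expected_fps(mhz, ref_fps=REF_FPS, ref_mhz=REF_MHZ):
    """Linear scaling: at fixed IPC, fps is proportional to frequency."""
    return ref_fps / ref_mhz * mhz

def deviation_pct(measured, expected):
    """Signed difference between measured and expected, in percent."""
    return (measured / expected - 1.0) * 100.0

# Measured results quoted in the post (MHz -> fps).
measured = {2900: 1.14, 3400: 1.32}

for mhz, fps in measured.items():
    exp = expected_fps(mhz)
    print(f"{mhz} MHz: expected {exp:.3f} fps, measured {fps:.2f} fps, "
          f"deviation {deviation_pct(fps, exp):+.2f}%")
```

Both measured points land within 1% of the linear prediction, which is what a fixed-frequency run that is not throttling would look like.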


RafWf8C.png


kHwGdJB.png


BxVyf8v.png
 

AtenRa

Lifer
Feb 2, 2009
14,001
3,357
136
The Stilt,

Just lower TDP to 15W and run single thread again and see if it will perform the same as 42W TDP. Simple as that. ;)

edit : question

Does the desktop Excavator (Bristol Ridge) only have a 700MHz NB frequency?
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
The Stilt,

Just lower TDP to 15W and run single thread again and see if it will perform the same as 42W TDP. Simple as that. ;)

edit : question

Does the desktop Excavator (Bristol Ridge) only have a 700MHz NB frequency?

I probably need to disable STAPM too in order to make it throttle. Otherwise it will allow a boost to 25W for 200 seconds, which is almost 2/3 of the duration of the run.

All Carrizos have NBDPM, which automatically adjusts the NB frequency. Under load it jumps to 1300MHz and at idle it drops to 600-800MHz. Bristol Ridge will have the same NCLK ceiling, yes.