New Zen microarchitecture details


The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
The Stilt,

Just lower TDP to 15W and run single thread again and see if it will perform the same as 42W TDP. Simple as that. ;)

edit : question

Does the desktop Excavator (Bristol Ridge) only have a 700MHz NB frequency?

This is the last derail from my side.

True 10W TDP limit (STAPM Off)

Average frequency (DAR) 1843MHz.

xv6O5gu.png


JIiRY5c.png


True 15W TDP Limit (STAPM Off)

Average frequency (DAR) 2856MHz.

SN5CvcI.png


EZYsDTF.png


~21.5W TDP is needed to keep the frequency at static 3400MHz in a single threaded X265 workload.
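
As a rough way to read the three data points above, the sketch below computes the average MHz per watt at each limit and the marginal MHz gained per extra watt between limits. The function and variable names are illustrative only, and the 21.5W entry is the estimate from this post rather than a logged average.

```python
# (TDP limit in W, average single-thread frequency in MHz), as quoted in the post.
# The 21.5 W entry is an estimate from the post, not a logged average.
points = [(10.0, 1843.0), (15.0, 2856.0), (21.5, 3400.0)]


def mhz_per_watt(points):
    """Average frequency divided by the TDP limit, for each point."""
    return [freq / tdp for tdp, freq in points]


def marginal_mhz_per_watt(points):
    """Extra MHz gained per extra watt between consecutive limits."""
    return [
        (f1 - f0) / (t1 - t0)
        for (t0, f0), (t1, f1) in zip(points, points[1:])
    ]


if __name__ == "__main__":
    for (tdp, freq), eff in zip(points, mhz_per_watt(points)):
        print(f"{tdp:>5.1f} W: {freq:.0f} MHz, {eff:.1f} MHz/W")
    for step in marginal_mhz_per_watt(points):
        print(f"marginal: {step:.1f} MHz per extra W")
```

With these inputs the marginal gain falls from roughly 203MHz per watt (10W to 15W) to roughly 84MHz per watt (15W to 21.5W), so the last few hundred MHz are considerably more expensive than the first.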
 

majord

Senior member
Jul 26, 2015
433
523
136
The x265 results are interesting. I will have to get full version numbers etc. when I get home for comparison, but the results below all use the default benchmark script, with AVX2 where supported.

As you can see, I get bang on 60% faster for Skylake in x265, which is also right on the average across all the benchmarks.

-edit , zoomed in and saw the answer - disregard


3GHz Piledriver vs. Steamroller vs. Excavator vs. Skylake 3M:


heGWi7H.png
 

monstercameron

Diamond Member
Feb 12, 2013
3,818
1
0
I probably need to disable STAPM too in order to make it throttle. Otherwise it will allow boost to 25W for 200 seconds, which is almost 2/3 of the duration of the run.

All Carrizos have NBDPM, which automatically adjusts the NB frequency. Under load it jumps to 1300MHz and at idle it drops to 600-800MHz. Bristol Ridge will have the same NCLK ceiling, yes.

What does the NB clock do for the AMD platform?
 

deasd

Senior member
Dec 31, 2013
513
724
136
The Stilt locked the clocks for the test, and the test measures single-thread performance; that is not in question. The point is that the CON cores were not built for this kind of load (according to people who really understand CPU design, not me), so it is normal that they perform so poorly in this test, and Zen probably will not change that.

I am still thinking of best-case Excavator integer IPC + 40% for Zen. Maybe 7-zip or Geekbench integer tests better represent the kind of IPC Zen may bring.

single-threaded-3ghz.png


Looncraz's tests show that the Intel big-core IPC advantage is not too large in most kinds of loads, but the CON cores lose in FP, and the FP tests include some outliers where Intel's IPC advantage goes up to 300%.

In the end, the gap may not be out of reach, at least in most types of loads.

Anything more than 150% should not be counted as a difference, especially in SIMD workloads like CB: you can get a simple linear performance uplift by widening the SIMD units and the L/S bandwidth. For scalar performance you can't do the same to the ALUs to achieve a linear uplift, because of ILP limitations. In any case, power consumption would be much higher if you added more SIMD units and wider L/S and executed longer vector instructions.
I have always thought the '40% IPC' refers to scalar performance, even if the test method is just SuperPi.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
~21.5W TDP is needed to keep the frequency at static 3400MHz in a single threaded X265 workload.
Thanks for the results! Do you have an estimate of how much of that power is for the CU? 3.4GHz is also a bit beyond the efficiency crossover point in comparison to SR. So a SR could probably use less power at the higher frequencies.

BTW I wouldn't count it as derail as this is valuable information about the base of the +40% IPC claim.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
The x265 results are interesting. I will have to get full version numbers etc. when I get home for comparison, but the results below all use the default benchmark script, with AVX2 where supported.

As you can see, I get bang on 60% faster for Skylake in x265, which is also right on the average across all the benchmarks.

-edit , zoomed in and saw the answer - disregard


3GHz Piledriver vs. Steamroller vs. Excavator vs. Skylake 3M:


heGWi7H.png

What X265 build were you using in your tests? The difference is so large that it almost seems that the recent AVX2 porting and other optimizations are completely missing.

I used build 1.9+141 (18/04/2016 commit). The same exact build, test video and the used settings can be found here: http://forums.anandtech.com/showpost.php?p=38191805&postcount=6900

In X265, setting affinity doesn't seem to cause any issues. The performance is identical, considering that there is additional overhead from the OS running on the same core. Also, on AMD, disabling the secondary core has no effect whatsoever on the performance of the remaining core in X265.

37vUD4r.png
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
OK, so a 23W limit (vs. the 21.5W I previously estimated) was required to keep the average CU0 frequency at ~3393MHz during the ST X265 workload.

I won't tell the power for each and every domain separately (for obvious reasons), but the difference between the "CU0" & "Total" power draws consists of: CU1 (PG), GPU (the whole GNB, including everything, mostly PG), CNB, IO, PCI-E PHYs and the ring oscillator.

The averages (excluding the idle parts in the beginning and the end) were:

CU0 Power: 13.221W
"Other" Power: 7.645W
Total Power: 20.867W

r9K4MEc.png
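
For reference, the split between the CU0 and "Other" averages quoted above can be worked out as in the minimal sketch below. The variable names are illustrative, and only the three figures from the post are used.

```python
# Average power figures quoted in the post (watts).
cu0_w = 13.221
other_w = 7.645
total_w = 20.867


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    print(f"CU0 share:   {share(cu0_w, total_w):.1f}%")
    print(f"Other share: {share(other_w, total_w):.1f}%")
    # The two components sum to 20.866 W; the 1 mW gap to the quoted
    # total is presumably rounding of the individual averages.
    print(f"Sum of parts: {cu0_w + other_w:.3f} W")
```

On these numbers CU0 accounts for roughly 63% of the total and everything else for roughly 37%.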
 

majord

Senior member
Jul 26, 2015
433
523
136
What X265 build were you using in your tests? The difference is so large that it almost seems that the recent AVX2 porting and other optimizations are completely missing.

I used build 1.9+141 (18/04/2016 commit). The same exact build, test video and the used settings can be found here: http://forums.anandtech.com/showpost.php?p=38191805&postcount=6900

In X265, setting affinity doesn't seem to cause any issues. The performance is identical, considering that there is additional overhead from the OS running on the same core. Also, on AMD, disabling the secondary core has no effect whatsoever on the performance of the remaining core in X265.

37vUD4r.png

That does explain it; it looks like the benchmark version I've been using is back on 1.5.

I will download your set and update the charts just for Exv vs. SL, which should bring the average up to a neat 1.6x (60%). (I'll also do your 3.2 run on the i3 to see how much it differs from the i5/i7 results with the extra L3.)

Anyway, all AVX2-heavy benchmarks tend to show around that 2x IPC vs. Excavator. Far more than anything else.

It is pretty clear this will be a weak point for Zen, core for core, but given that the types of applications AVX2 applies to tend to scale to a high number of threads, I think the power and area savings may go at least some way towards negating the performance penalty when you're talking about 8 SMT cores.
 

naukkis

Senior member
Jun 5, 2002
702
570
136
I totally missed that The Stilt's results were entirely single-threaded. With module cores you lose half of your FPU L1 bandwidth when running only a single thread, as the FPU shares the integer cores' AGUs and caches. So with highly optimized SIMD code the module cores become L1-bandwidth starved; you really need to compare with two threads to get a realistic FPU performance estimate.

With Zen (and Intel), FPU load-store capabilities aren't split between threads.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
I totally missed that The Stilt's results were entirely single-threaded. With module cores you lose half of your FPU L1 bandwidth when running only a single thread, as the FPU shares the integer cores' AGUs and caches. So with highly optimized SIMD code the module cores become L1-bandwidth starved; you really need to compare with two threads to get a realistic FPU performance estimate.

With Zen (and Intel), FPU load-store capabilities aren't split between threads.

In this context the performance of a whole Excavator CU is pretty irrelevant, since the projected 40% IPC improvement on Zen will apply on a core not on a CU (which is two cores, according to AMD).
 

naukkis

Senior member
Jun 5, 2002
702
570
136
In this context the performance of a whole Excavator CU is pretty irrelevant, since the projected 40% IPC improvement on Zen will apply on a core not on a CU (which is two cores, according to AMD).

It's pretty relevant if you want to compare with Intel (this was mainly an answer to your speculation that 8C16T Zen will lose to 4C8T Skylake in x265 encoding; I doubt it, based on the results in this thread). As module cores have the L1 split between cores, the FPU can only stretch its legs with two threads active.

http://www.ilsistemista.net/index.p...n-whats-wrong-with-amd-bulldozer.html?start=6

With highly tuned SIMD code, FPU bandwidth is critical. Haswell doubled it compared with Ivy Bridge and can still be starved with one active thread in extreme cases. AMD's construction cores have only about Ivy Bridge-equivalent L1 bandwidth with two active threads, and just half of that with one active thread.

Yes, AMD did say that with one active thread the full FPU is dedicated to that thread, but that's not exactly the case, as only the execution resources are shared.
 

guskline

Diamond Member
Apr 17, 2006
5,338
476
126
A question to Dresdenboy and Stilt since both of you appear to have a lot more knowledge in this area than others.

At this stage in Zen's development, with rumored release at the very end of the year, what is the AMD development team doing with the Zen silicon? Can I assume it is "out in the field" being used by trusted testers?

What tweaks are likely to be made?

I thought it would be fun to talk about what is likely to happen with it this summer.
 

Madpacket

Platinum Member
Nov 15, 2005
2,068
326
126
I'm sure a lot of engineering went into these specialized AVX instructions but how much use do they get on average? Outside of a few corner cases (x265 video encoding, 7zip) I don't see the big deal here for the regular consumer.

There is no benefit in being fastest in something when 95% (stat pulled out of thin air but you get the idea) of your consumers will never use it.

Maybe I'm being obtuse here but wouldn't these transistors be better spent on more cores, more cache, etc?
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
95% - More like 99.99% for the normal consumer.
It's absolutely a good move from a business perspective. Secondly, they probably had to reuse much of the FPU tech they had for resource reasons anyway. Lol.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
I'm sure a lot of engineering went into these specialized AVX instructions but how much use do they get on average? Outside of a few corner cases (x265 video encoding, 7zip) I don't see the big deal here for the regular consumer.

There is no benefit in being fastest in something when 95% (stat pulled out of thin air but you get the idea) of your consumers will never use it.

Maybe I'm being obtuse here but wouldn't these transistors be better spent on more cores, more cache, etc?

Build it and they will come. In other words you need to wait sometimes years for the software to catch up with the hardware.
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
Server CPUs are more likely to use specialized instruction sets than CPUs sold to the general public.
 

hojnikb

Senior member
Sep 18, 2014
562
45
91
It makes sense. They will make an 8-core die and sell 8-core and cut-down 6-core chips. Everything else will be Bristol Ridge with 2 and 4 "cores".
 

tential

Diamond Member
May 13, 2008
7,355
642
121
So 12 threads and 16 threads are what Zen will initially launch as.

So either a hexacore will be mainstream to compete with Zen, or Zen will have a core count advantage like I thought but lower performance per core.

This is exactly what I suspected Zen would launch as so I won't be too surprised if that's what we see.
 

deasd

Senior member
Dec 31, 2013
513
724
136
It makes sense. They will make an 8-core die and sell 8-core and cut-down 6-core chips. Everything else will be Bristol Ridge with 2 and 4 "cores".

An Excavator 'core' is not a core; two such 'cores' equal one Zen core + SMT.

So either a hexacore will be mainstream to compete with Zen, or Zen will have a core count advantage like I thought but lower performance per core.

I can't see any relationship between your speculation and 'no quad&dual' news.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Made a few more Excavator vs. Haswell tests.

The performance difference in X264 (which is supposed to be one of the few favorable workloads for 15h) was surprisingly large in favor of Haswell.

X264 0.148.x (20/4/2016)
Compiled: GCC 5.3 x86-64 + YASM 1.30
Settings: Slow, ME = UMH, RC = CRF 16.0, Threads = 1
Input: YUV 420P, 1920x1080, 30fps

Carrizo = 1.82fps
Haswell = 2.71fps (+48.9%)

VPEdDoH.png


4gQvUKZ.png



VP9 1.5.x (Master), 3/5/2016
Compiled: GCC 5.3 x86-64 + YASM 1.30
Settings: Good, Webm, CPU-Used = 3*, End-Usage=CQ, CQ-Level=12, Target-Bitrate=9000, Threads = 1
Input: YUV 420P, 1920x1080, 30fps

* A quality control parameter, has nothing to do with the number of utilized threads or cores. Range 0-8, smaller settings are slower but produce better quality and compression.

Carrizo = 1.03fps
Haswell = 1.64fps (+59.2%)

8HLc13O.png


uwPtRG0.png


"Prime"
Uses GMP 6.1 library to calculate Mersenne Prime numbers.
GMP was compiled as a "fat" binary, meaning it will use all available / supported instructions on all CPUs (effectively a CPU RTD).

Compiled: GCC 5.3 x86-64, generic binary (no architecture specific optimizations).

Carrizo = 327.366 seconds
Haswell = 192.817 seconds (+69.8%)

fxkzwro.png


Ki1qPKT.png
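
The percentages quoted for the three tests above can be reproduced with the short sketch below. Note that the two encoders are scored in frames per second (higher is better) while "Prime" is scored in seconds (lower is better), so the ratio is inverted for the latter. The helper names are illustrative, not part of any benchmark tool.

```python
def speedup_from_fps(slow_fps, fast_fps):
    """Percent advantage when higher is better (frames per second)."""
    return (fast_fps / slow_fps - 1.0) * 100.0


def speedup_from_seconds(slow_s, fast_s):
    """Percent advantage when lower is better (run time in seconds)."""
    return (slow_s / fast_s - 1.0) * 100.0


# Carrizo vs. Haswell figures as quoted in the post.
results = {
    "X264": speedup_from_fps(1.82, 2.71),
    "VP9": speedup_from_fps(1.03, 1.64),
    "Prime": speedup_from_seconds(327.366, 192.817),
}

if __name__ == "__main__":
    for name, pct in results.items():
        print(f"{name}: Haswell +{pct:.1f}%")
```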
 

nismotigerwvu

Golden Member
May 13, 2004
1,568
33
91
Correct me if I'm wrong, but I thought the reason x264 encoding was seen as a strong point for the CON cores was that it is a parallel integer task, which means you could throw a bunch of threads at the chips and they would chug right along through them. Sort of a best-case scenario for the "moar cores" approach. Running it single-threaded negates this benefit, and the end result is exactly as expected: the chip with much better single-threaded performance, well, showed greater single-threaded performance. At launch, the FX 8350 was right up there with the Ivy Bridge i7 chips (http://www.anandtech.com/show/6396/the-vishera-review-amd-fx8350-fx8320-fx6300-and-fx4300-tested/3) on x264 encoding, but it was all about TLP, not IPC.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
IIRC X264 "over allocates" each thread by 1.5x. I can check what happens without any manual thread configuration, when all cores are used.
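
To make the 1.5x figure concrete, here is a minimal sketch of what such an automatic thread count would look like. It assumes the recollection above is right and that the result is truncated to an integer; `auto_thread_count` is a hypothetical helper, not an X264 function.

```python
def auto_thread_count(logical_cpus, factor=1.5):
    """Hypothetical helper: thread count for a given number of logical CPUs.

    Assumes the 1.5x over-allocation recalled in the post, with the
    result truncated to an integer.
    """
    return int(logical_cpus * factor)


if __name__ == "__main__":
    for cpus in (2, 4, 8):
        print(f"{cpus} logical CPUs -> {auto_thread_count(cpus)} threads")
```

Under that assumption a 4-thread chip would run 6 encoder threads and an 8-thread chip would run 12.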