New Zen microarchitecture details

The Stilt · May 2, 2016

AtenRa said:
The Stilt,

Just lower TDP to 15W and run single thread again and see if it will perform the same as 42W TDP. Simple as that.

edit : question

Does the desktop Excavator (Bristol Ridge) only have a 700MHz NB frequency?

This is the last derail from my side.

True 10W TDP limit (STAPM Off)

Average frequency (DAR) 1843MHz.

True 15W TDP Limit (STAPM Off)

Average frequency (DAR) 2856MHz.

~21.5W TDP is needed to keep the frequency at static 3400MHz in a single threaded X265 workload.
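
To make the scaling easier to see, here is a quick sketch that works out how much frequency each additional watt buys between those limits. It uses only the three figures above; the 21.5W point is the estimate for holding 3400MHz, not a measured limit, and the helper name is just illustrative.

```python
# (TDP limit in W, average single-thread frequency in MHz) from the runs above.
# The 21.5 W point is an estimate for holding 3400 MHz, not a measured limit.
points = [(10.0, 1843.0), (15.0, 2856.0), (21.5, 3400.0)]

def marginal_mhz_per_watt(pts):
    """Frequency gained per extra watt between consecutive TDP limits."""
    return [
        (f1 - f0) / (w1 - w0)
        for (w0, f0), (w1, f1) in zip(pts, pts[1:])
    ]

slopes = marginal_mhz_per_watt(points)
for (w0, _), (w1, _), slope in zip(points, points[1:], slopes):
    print(f"{w0:g} W -> {w1:g} W: {slope:.1f} MHz per additional watt")
```

The second step buys far less frequency per watt than the first, which is what you would expect once voltage has to rise to reach the top clocks.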

majord · May 2, 2016

The x265 results are interesting. I will have to get the full version numbers etc. when I get home for comparison, but the results below are all from the default benchmark script, with AVX2 where supported.

As you can see, I get bang on 60% faster for Skylake in x265, which is also right on the average across all the benchmarks.

-edit , zoomed in and saw the answer - disregard

3GHz Piledriver vs. Steamroller vs. Excavator vs. Skylake 3M:

monstercameron · May 2, 2016

The Stilt said:
I probably need to disable STAPM too in order to make it throttle. Otherwise it will allow a boost to 25W for 200 seconds, which is almost 2/3 of the duration of the run.

All Carrizos have NBDPM, which automatically adjusts the NB frequency. Under load it jumps to 1300MHz and at idle it drops to 600-800MHz. Bristol Ridge will have the same NCLK ceiling, yes.

What does the NB clock do for the AMD platform?

DrMrLordX · May 2, 2016

A higher NB clock reduces memory latency and sometimes increases effective memory bandwidth.

deasd · May 2, 2016

el etro said:
The Stilt locked the clocks for the test, and the test measures single-thread performance; that is not in question. The point is that the CON core was not built for this kind of load (according to people who really understand CPU design, not me), so it is normal that it performs so poorly in this test, and Zen probably will not change that.

I still expect best-case Excavator integer IPC + 40% for Zen. Maybe 7-Zip or Geekbench integer tests better represent the kind of IPC Zen may bring.

Looncraz's tests show that the Intel big-core IPC advantage is not too large in most kinds of loads, but CON loses in FP, and the FP tests include some outliers where the Intel IPC advantage goes up to 300%.

In the end the difference may not be out of reach, at least in most types of loads.

Anything more than 150% should not be counted as a difference, especially in a SIMD workload like CB, where you can get a linear performance uplift simply by widening the SIMD units and the L/S bandwidth. For scalar performance you can't do the same to the ALUs to get a linear uplift, because of ILP limitations. Anyway, power consumption would be much higher if you added more SIMD units and wider L/S and executed a longer instruction set.
I always assumed the '40% IPC' figure is for scalar performance, even if the test method is just SuperPi...

Dresdenboy · May 3, 2016

The Stilt said:
~21.5W TDP is needed to keep the frequency at static 3400MHz in a single threaded X265 workload.

Thanks for the results! Do you have an estimate of how much of that power is for the CU? 3.4GHz is also a bit beyond the efficiency crossover point in comparison to SR. So an SR could probably use less power at the higher frequencies.

BTW I wouldn't count it as a derail, as this is valuable information about the basis of the +40% IPC claim.

The Stilt · May 3, 2016

majord said:
The x265 results are interesting. I will have to get the full version numbers etc. when I get home for comparison, but the results below are all from the default benchmark script, with AVX2 where supported.

As you can see, I get bang on 60% faster for Skylake in x265, which is also right on the average across all the benchmarks.

-edit , zoomed in and saw the answer - disregard

3GHz Piledriver vs. Steamroller vs. Excavator vs. Skylake 3M:

What X265 build were you using in your tests? The difference is so large that it almost seems that the recent AVX2 porting and other optimizations are completely missing.

I used build 1.9+141 (18/04/2016 commit). The same exact build, test video and the used settings can be found here: http://forums.anandtech.com/showpost.php?p=38191805&postcount=6900

In X265, setting affinity doesn't seem to cause any issues. The performance is identical, considering that there is additional overhead from the OS running on the same core. Also, on AMD, disabling the secondary core has no effect whatsoever on the performance of the remaining core in X265.

The Stilt · May 3, 2016

Dresdenboy said:
Do you have an estimate of how much of that power is for the CU?

You mean how much of that is allocated by the other parts of the core? I can make an additional DAR to get the exact figures.

The Stilt · May 3, 2016

OK, so a 23W limit (vs. the 21.5W I previously estimated) was required to keep the average CU0 frequency at ~3393MHz during the ST X265 workload.

I won't tell the power for each and every domain separately (for obvious reasons), but the difference between the "CU0" & "Total" power draws consists of: CU1 (PG), GPU (the whole GNB, including everything, mostly PG), CNB, IO, PCI-E PHYs and the ring oscillator.

The averages (excluding the idle parts in the beginning and the end) were:

CU0 Power: 13.221W
"Other" Power: 7.645W
Total Power: 20.867W
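
For anyone who wants the split as percentages, a minimal sketch using only the three averages above (the variable names are mine). The small gap between the sum and the reported total is presumably rounding in the per-domain averages.

```python
# Average power figures from the run above, in watts.
cu0_w = 13.221
other_w = 7.645
total_w = 20.867

# Share of the total drawn by the active compute unit vs. everything else.
cu0_share = cu0_w / total_w
other_share = other_w / total_w

# Difference between the reported total and the sum of the two parts.
rounding_gap_w = total_w - (cu0_w + other_w)

print(f"CU0:   {cu0_share:.1%} of total")
print(f"Other: {other_share:.1%} of total")
print(f"Unaccounted: {rounding_gap_w * 1000:.0f} mW")
```

So roughly 63% of the package power goes to CU0 in this single-threaded run, with the remaining 37% or so spread over the other domains.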

majord · May 3, 2016

The Stilt said:
What X265 build were you using in your tests? The difference is so large that it almost seems that the recent AVX2 porting and other optimizations are completely missing.

I used build 1.9+141 (18/04/2016 commit). The same exact build, test video and the used settings can be found here: http://forums.anandtech.com/showpost.php?p=38191805&postcount=6900

In X265, setting affinity doesn't seem to cause any issues. The performance is identical, considering that there is additional overhead from the OS running on the same core. Also, on AMD, disabling the secondary core has no effect whatsoever on the performance of the remaining core in X265.

That does explain it; it looks like the benchmark version I've been using is back on 1.5.

I will download your set and update the charts just for Exv vs SL, which should bring the average up to a neat 1.6x (60%). (I'll also do your 3.2 run on the i3 to see how much it differs from the i5/i7 results with the extra L3.)

Anyway, all AVX2 heavy benchmarks tend to show around that 2x IPC vs Excavator. Far more than anything else.

It's pretty clear this will be a weak point for Zen, core for core. But given that the types of applications AVX2 is applicable to tend to scale to a high number of threads, I think the power and area savings may go at least some way towards negating the performance penalty when you're talking about 8 SMT cores.
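
As a rough way to put numbers on that trade-off, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration only: per-core AVX2 throughput is normalized to Excavator = 1.0, Skylake is taken as 2.0 (the "around 2x" above), Zen is given the projected +40% (which may well not hold for AVX2 code), clocks are equal, scaling across cores is perfect, and SMT is ignored.

```python
# Relative per-core AVX2 throughput, Excavator = 1.0.
# All values are illustrative assumptions, not measurements.
PER_CORE = {
    "Excavator": 1.0,
    "Zen (assumed +40%)": 1.4,
    "Skylake (assumed 2x)": 2.0,
}

def chip_throughput(per_core, cores):
    """Idealized chip throughput: perfect core scaling, equal clocks, no SMT."""
    return per_core * cores

zen_8c = chip_throughput(PER_CORE["Zen (assumed +40%)"], 8)
skylake_4c = chip_throughput(PER_CORE["Skylake (assumed 2x)"], 4)

print(f"8-core Zen:     {zen_8c:.1f}")
print(f"4-core Skylake: {skylake_4c:.1f}")
print(f"Ratio:          {zen_8c / skylake_4c:.2f}x")
```

Under those idealized assumptions the core count more than makes up for the per-core deficit, but real clocks, memory bandwidth, and SMT gains could move the result either way.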

naukkis · May 3, 2016

I totally missed that The Stilt's results were entirely single-threaded. With module cores you lose half of your FPU-L1 bandwidth when running only a single thread, as the FPU shares the integer cores' AGUs and caches. So with highly optimized SIMD code, module cores become L1-bandwidth starved; you really need to compare with 2 threads to get a real FPU performance estimate.

With Zen (and Intel), FPU load-store capabilities aren't split between threads.

The Stilt · May 3, 2016

naukkis said:
I totally missed that The Stilt's results were entirely single-threaded. With module cores you lose half of your FPU-L1 bandwidth when running only a single thread, as the FPU shares the integer cores' AGUs and caches. So with highly optimized SIMD code, module cores become L1-bandwidth starved; you really need to compare with 2 threads to get a real FPU performance estimate.

With Zen (and Intel), FPU load-store capabilities aren't split between threads.

In this context the performance of a whole Excavator CU is pretty irrelevant, since the projected 40% IPC improvement on Zen applies to a core, not to a CU (which is two cores, according to AMD).

naukkis · May 3, 2016

The Stilt said:
In this context the performance of a whole Excavator CU is pretty irrelevant, since the projected 40% IPC improvement on Zen applies to a core, not to a CU (which is two cores, according to AMD).

It's pretty relevant if you want to compare with Intel (this was mainly an answer to your speculation that an 8C16T Zen will lose to a 4C8T Skylake in x265 encoding; I doubt it, based on the results in this thread). As module cores have the L1 split between cores, the FPU can only stretch its legs with two threads active.

http://www.ilsistemista.net/index.p...n-whats-wrong-with-amd-bulldozer.html?start=6

With highly tuned SIMD code, FPU bandwidth is critical. Haswell doubled it compared to Ivy Bridge and can still be starved with one active thread in extreme cases. AMD's construction cores only have about Ivy Bridge-equivalent L1 bandwidth with two active threads, and just half of that with one active thread.

Yes, AMD did say that with one active thread the full FPU is dedicated to that thread, but that's not exactly the case, as only the execution resources are shared.

guskline · May 3, 2016

A question to Dresdenboy and Stilt since both of you appear to have a lot more knowledge in this area than others.

At this stage in Zen's development, with a rumored release at the very end of the year, what is the AMD development team doing with the Zen silicon? Can I assume it is "out in the field" being used by trusted testers?

What tweaks are likely to be made?

I thought it would be fun to talk about what is likely to happen with it this summer.

Madpacket · May 3, 2016

I'm sure a lot of engineering went into these specialized AVX instructions but how much use do they get on average? Outside of a few corner cases (x265 video encoding, 7zip) I don't see the big deal here for the regular consumer.

There is no benefit in being fastest in something when 95% (stat pulled out of thin air but you get the idea) of your consumers will never use it.

Maybe I'm being obtuse here but wouldn't these transistors be better spent on more cores, more cache, etc?

krumme · May 3, 2016

95%? More like 99.99% for the normal consumer.
It's absolutely a good move from a business perspective. Secondly, they probably had to reuse much of the FPU tech they had for resource reasons anyway. Lol.

Phynaz · May 3, 2016

Madpacket said:
I'm sure a lot of engineering went into these specialized AVX instructions but how much use do they get on average? Outside of a few corner cases (x265 video encoding, 7zip) I don't see the big deal here for the regular consumer.

There is no benefit in being fastest in something when 95% (stat pulled out of thin air but you get the idea) of your consumers will never use it.

Maybe I'm being obtuse here but wouldn't these transistors be better spent on more cores, more cache, etc?

Build it and they will come. In other words you need to wait sometimes years for the software to catch up with the hardware.

DrMrLordX · May 3, 2016

Server CPUs are more likely to use specialized instruction sets than CPUs sold to the general public.

deasd · May 4, 2016

No Dual and Four Cores Zen CPUs, at least initially

http://www.bitsandchips.it/english/...al-and-four-cores-zen-cpus-at-least-initially

EDIT: My feeling is that the reason is something like 'it would be quite expensive' or 'too expensive to make a quad core on this die'.

hojnikb · May 4, 2016

It makes sense. They will make an 8-core die and sell 8-core and cut-down 6-core chips. Everything else will be Bristol Ridge with 2 and 4 "cores".

tential · May 4, 2016

So 12 threads and 16 threads are what Zen will initially launch as.

So either a hexacore will be mainstream to compete with Zen, or Zen will have a core count advantage like I thought but lower performance per core.

This is exactly what I suspected Zen would launch as so I won't be too surprised if that's what we see.

deasd · May 4, 2016

hojnikb said:
It makes sense. They will make an 8-core die and sell 8-core and cut-down 6-core chips. Everything else will be Bristol Ridge with 2 and 4 "cores".

An Excavator 'core' is not a full core; 2 'cores' are equivalent to 1 Zen core + SMT.

tential said:
So either a hexacore will be mainstream to compete with Zen, or Zen will have a core count advantage like I thought but lower performance per core.

I can't see any relationship between your speculation and the 'no quad & dual' news.

The Stilt · May 4, 2016

Made a few more Excavator vs. Haswell tests.

The performance difference in X264 (which is supposed to be one of the few favorable workloads for 15h) was surprisingly large in favor of Haswell.

X264 0.148.x (20/4/2016)
Compiled: GCC 5.3 x86-64 + YASM 1.30
Settings: Slow, ME = UMH, RC = CRF 16.0, Threads = 1
Input: YUV 420P, 1920x1080, 30fps

Carrizo = 1.82fps
Haswell = 2.71fps (+48.9%)

VP9 1.5.x (Master), 3/5/2016
Compiled: GCC 5.3 x86-64 + YASM 1.30
Settings: Good, Webm, CPU-Used = 3*, End-Usage=CQ, CQ-Level=12, Target-Bitrate=9000, Threads = 1
Input: YUV 420P, 1920x1080, 30fps

* A quality control parameter, has nothing to do with the number of utilized threads or cores. Range 0-8, smaller settings are slower but produce better quality and compression.

Carrizo = 1.03fps
Haswell = 1.64fps (+59.2%)

"Prime"
Uses the GMP 6.1 library to calculate Mersenne prime numbers.
GMP was compiled as a "fat" binary, meaning it will use all available / supported instructions on each CPU (effectively a CPU runtime dispatcher, RTD).

Compiled: GCC 5.3 x86-64, generic binary (no architecture specific optimizations).

Carrizo = 327.366 seconds
Haswell = 192.817 seconds (+69.8%)
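
For anyone checking the percentages, a minimal sketch of the arithmetic (function names are mine). Note that for the fps results the gain is fast/slow, while for "Prime" the result is a run time, so the ratio is inverted: Carrizo's time divided by Haswell's.

```python
def fps_gain_pct(slow_fps, fast_fps):
    """Percentage advantage when higher is better (frames per second)."""
    return (fast_fps / slow_fps - 1.0) * 100.0

def time_gain_pct(slow_seconds, fast_seconds):
    """Percentage advantage when lower is better (run time in seconds)."""
    return (slow_seconds / fast_seconds - 1.0) * 100.0

x264_gain = fps_gain_pct(1.82, 2.71)            # Carrizo vs. Haswell, X264
vp9_gain = fps_gain_pct(1.03, 1.64)             # Carrizo vs. Haswell, VP9
prime_gain = time_gain_pct(327.366, 192.817)    # Carrizo vs. Haswell, "Prime"

for name, gain in [("X264", x264_gain), ("VP9", vp9_gain), ("Prime", prime_gain)]:
    print(f"{name}: Haswell +{gain:.1f}%")
```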

nismotigerwvu · May 4, 2016

Correct me if I'm wrong, but I thought the reason x264 encoding was seen as a strong point for the con cores was that it is a parallel integer task, which means you could throw a bunch of threads at the chips and they would chug right along through them. Sort of a best-case scenario for the "moar cores" approach. Running it single-threaded negates this benefit, and the end result is exactly as expected: the chip with much better single-threaded performance showed, well, greater single-threaded performance. At launch, the FX 8350 was right up there with the Ivy Bridge i7 chips (http://www.anandtech.com/show/6396/the-vishera-review-amd-fx8350-fx8320-fx6300-and-fx4300-tested/3) on x264 encoding, but that was all about TLP, not IPC.

The Stilt · May 4, 2016

IIRC X264 "over allocates" each thread by 1.5x. I can check what happens without any manual thread configuration, when all cores are used.
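
To illustrate what that would mean in practice, a small sketch assuming the 1.5x rule recalled above is right, i.e. that the automatic thread count is 1.5 times the number of logical CPUs, rounded down. The function is hypothetical and not part of x264 itself.

```python
def auto_thread_count(logical_cpus, factor=1.5):
    """Hypothetical helper: threads spawned if the encoder over-allocates
    by `factor` per logical CPU (rounded down, never below one)."""
    return max(1, int(logical_cpus * factor))

# Example logical CPU counts; the 4-CPU case would correspond to a
# two-CU (four "core") part.
for cpus in (1, 2, 4, 8):
    print(f"{cpus} logical CPUs -> {auto_thread_count(cpus)} threads")
```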
