New Zen microarchitecture details

KTE · Sep 5, 2016

Tuna-Fish said:
The most fine-grained example I know of is that Intel power-gates the upper half of AVX machinery when AVX isn't used a lot.

Power Gates, how? There are various methods and implementations possible.

Low Vt fine-grained also has a huge area penalty and create immense timing issues between variated voltage clusters.

Sent from HTC 10
(Opinions are own)

Abwx · Sep 5, 2016

Tuna-Fish said:
The most fine-grained example I know of is that Intel power-gates the upper half of AVX machinery when AVX isn't used a lot.

Fine grained i dont know, the delay is 14us, so at 4GHz this amount to 56 000 cyles latency, i m not a programer but i would imagine that s it s generaly more efficient to use generic SSE2 and derivatives..

Dresdenboy · Sep 5, 2016

Abwx said:
Not at all, you can stop clocking any digital circuitry for any time, the result is that (statisticaly) half of the transistors will be switched on while the other half will be switched off, the circuit will consume nothing other than the leakage residual current, as it s when the circuit switch that power is mainly consumed.

On the other hand power gating will reduce the leakage by at most 2 but this cant be applied on short periods,
so it s usefull only for caches and such parts wich are not systematicaly used fully, you couldnt power gate a part of a pipeline for instance while you can clock gate it.

That's one possible reason I see behind The Stilt's SMT switch comment: clock gating would be too fast (and needs pipeline draining anyway), while power gating would fit to count as "slow" or "long delay" (i.e. several µs, see SKL).

AMD did such power gating already with Bobcat as it seems, or at least some later cat core:
https://www.google.com/patents/US20130009693
https://www.google.com/patents/US20130009697

KTE said:
Power gating is already implemented quite heavily with XV.

Leakage current isn't an 'only' BTW. Adds up very fast. Depends on the base process characteristics how negligible it is.

Of course, power gating is not new. But it seems to be driven further into the designs. That "only" was related to the remaining guesses for the reason of a slow SMT mode switch.

BTW, I found this presentation citing examples of power gating parts of an adder, etc. by AMD:
https://www.cse.buffalo.edu/~rsridhar/cse691/Present15/Fritz_CSE691_FinalPresentation.pptx

Abwx said:
Fine grained i dont know, the delay is 14us, so at 4GHz this amount to 56 000 cyles latency, i m not a programer but i would imagine that s it s generaly more efficient to use generic SSE2 and derivatives..

These are "one time charges" for a thread, still roughly 1/1000th of a time slice. Powering off the high AVX path might allow for higher avg. frequencies, similar to AMD's FPU throttling results I mentioned. Intel might even power down other units as suggested here: http://myeventagenda.com/sessions/0B9F4191-1C29-408A-8B61-65D7520025A8/7/5#sessionID=130

Instead of SSE2, AVX-128 would be more efficient (3 operands).

CentroX · Sep 11, 2016

Wow zen is a beast.

Going to build zen + vega rig in Q1 2017. Finally i can throw out my 4820k rig.

superstition · Sep 11, 2016

If a uOP cache offers so much efficiency gain why didn't AMD use one for Piledriver or any of the construction cores? Is it a case of it seeming obvious only after the release of Sandy Bridge? Maybe it's confirmation bias but it seems like it would have been a rather obvious method of gain.

(I've read that Bulldozer has a bit of uOP caching but not to the extent of Sandy (and its derivatives) and, reportedly, Zen.)

If Piledriver were to have had faster L2 and L3 caches and a uOP cache what would the estimated improvement by for IPC?

Sonikku · Sep 11, 2016

I would totally be all over AMD if WoW wasn't so single threaded and the game did not account for the majority of my desktop games.

Dresdenboy · Sep 12, 2016

superstition said:
If a uOP cache offers so much efficiency gain why didn't AMD use one for Piledriver or any of the construction cores? Is it a case of it seeming obvious only after the release of Sandy Bridge? Maybe it's confirmation bias but it seems like it would have been a rather obvious method of gain.

(I've read that Bulldozer has a bit of uOP caching but not to the extent of Sandy (and its derivatives) and, reportedly, Zen.)

If Piledriver were to have had faster L2 and L3 caches and a uOP cache what would the estimated improvement by for IPC?

There are small loop buffers (40 uops per core) since SR. I think, there are many possible reasons, why they didn't implement a uOp cache in the BD series and later:

There was no ready and proven concept available in the design stage of BD (around 2007).
It was not part of Andy Glew's M*-architecture and he didn't recommend it, while Chuck Moore didn't know such concepts from Power either.
If a concept was available, area/power constraints omitted its application.
As it wasn't in BD, adding such a cache might have caused too much uArch rework for later generations - especially if construction cores were about to phase out with XV (design phase roughly in 2012 when SB came out).
With BD's anticipated clocks and expected leakages, a SRAM holding µOps all the time (while that code wouldn't necessarily always be hit again) might just have been too costly leakage-wise and thus not that efficient in the energy efficiency and area related metrics.

I think, a PD with those changes could indeed have a ~10-20% higher IPC with most of this caused by the L2, with the uOP$ being second. L3 could only shine if L2 wouldn't be that slow. The 2MB L2$ hit rate is high, so the L3/DDR CTRL + their prefetchers can only help with the L2$ misses.

itsmydamnation · Sep 12, 2016

Would uop cache really help bulldozer? bulldozer throughput is fine, its single thread IPC is to low, its int execution resources very low. is being able to issue more peak Mop's a cycle and a few cycles reduction in miss predict penality really give that much more performance in something that already appears to have bottlenecks all over the place?

Im really interested to see what exactly the "differential checkpoint" dot point is, are we talking about rollback on branch misses etc within the execution block? That should both save power and increase single thread IPC.

cytg111 · Sep 12, 2016

Dresdenboy · Sep 12, 2016

itsmydamnation said:
Would uop cache really help bulldozer? bulldozer throughput is fine, its single thread IPC is to low, its int execution resources very low. is being able to issue more peak Mop's a cycle and a few cycles reduction in miss predict penality really give that much more performance in something that already appears to have bottlenecks all over the place?

Im really interested to see what exactly the "differential checkpoint" dot point is, are we talking about rollback on branch misses etc within the execution block? That should both save power and increase single thread IPC.

BD with its long pipeline would benefit from shorter branch misprediction penalties in cached cases (this should be visible even with that 1C throughput) and a bit from front end congestion if each core would have it's own µOp$ (as SR has shown, separate decoders help, so this would too).

About checkpointing:
https://www.google.com/patents/US20150026437

bjt2 · Sep 12, 2016

Dresdenboy said:
BD with its long pipeline would benefit from shorter branch misprediction penalties in cached cases (this should be visible even with that 1C throughput) and a bit from front end congestion if each core would have it's own µOp$ (as SR has shown, separate decoders help, so this would too).

About checkpointing:
https://www.google.com/patents/US20150026437

Unified uop cache would help in 2 thread mode if tagged also with physical address: in this case shared code, like system code, kernel code, shared library code, which code is not duplicated in memory, aren't duplicated even in uop cache... Indeet uop cache in Zen is competitively shared if I remember well... Or at least should be to avoid waste of space...

sm625 · Sep 12, 2016

CentroX said:
Wow zen is a beast.

I have tried to read every post on this thread. But I may have missed some. Has there been any real benchmarks released or some other evidence presented to make you call it a beast? I am still waiting for a google octane v2 score.

bjt2 · Sep 12, 2016

sm625 said:
I have tried to read every post on this thread. But I may have missed some. Has there been any real benchmarks released or some other evidence presented to make you call it a beast? I am still waiting for a google octane v2 score.

Same performance in blender as BW-E 8c/16t clock to clock suffice? The 19 integer pipeline stages makes we hope for a clock greater that 3GHz...

EDIT: I have read on semi that BD pipeline length is 15, vs 19 for Zen... So we can hope at least same clock... Probabily greater since 14nm should be better than 28nm and 19 stages pipeline should clock faster...

Borealis7 · Sep 13, 2016

TPU has a headline that the first ZEN CPU would be named "AND ZEN X370".
what an unfortunate choice of model naming scheme...ZEN X...sounds like an anxiety medication Xanax.

f2bnp · Sep 13, 2016

You are incorrect, X370 is supposed to be the name for the chipset, not a CPU.

sm625 · Sep 14, 2016

bjt2 said:
Same performance in blender as BW-E 8c/16t clock to clock suffice? .

No. It kind of reminds me of this benchmark:

Oh look at that, Bulldozer was competitive with sandy bridge! Until you load up a game, then you get something like this:

Which was obviously a total disaster. A simple 10 second javascript benchmark is all it will take to tell us whether this disaster is going to repeat.

bjt2 · Sep 14, 2016

sm625 said:
No. It kind of reminds me of this benchmark:

Oh look at that, Bulldozer was competitive with sandy bridge! Until you load up a game, then you get something like this:

Which was obviously a total disaster. A simple 10 second javascript benchmark is all it will take to tell us whether this disaster is going to repeat.

These "disasters" could be due to the substantial differences between core and bulldozer... Starting from the caches (L1 write through.... Brrrrrrrrr...)...
Now Zeh has same things that core, except few (that I will highlight later):
- fast writeback L1
- little (even if is bigger that INTEL's) and dedicated L2 cache, and INCLUSIVE and finally with a 2x256 bit bus...
- Big L3, with the same clock as the cores, victim cache and with big 2x256 bit bus
- uop cache
- SMT
- big buffers and mainly shared resources for SMT

Better than intel there are:
- separate int and FP scheduler: 4+4 vs 4 total for intel
- 2 fmul +2 fadd at 128 bit versus 2 256 bit pipelines: with legacy code this is and advantage
- l0 cache for stack operations
- separate, decoupled and speculative branch prediction
- bigger L2 and L3 with centralized switches versus ring bus: 8c Zen has average hops of 1.5 hops, while 8 core INTEL has at least 2 hops of average hops... If you see the MCC INTEL cores, with 1.5 rings the mean hops increase further. 10 core has at least 2.5 hops. But consumer 10 core are implemented with 1.5 rings and 15 core dices, which have greater mean hops...

Better than AMD:
- 2.5MB/core L3 cache, vs 2mb/core
- 2x256 bit FMAC vs 1x256 fmac. (but par with non fmac code, and lose with 128 bit non fmac code)
- quadchannel for the higher core parts
- 4x256 bit memory ports vs 3x128 bit memory ports

AMD should have better SMT especially in mixed int/fp threads (4+4 vs 4 total int+fp), but INTEL should be better in memory intensive tasks...

BD sucked in games for the poor L2 and L3 cache performance... Now L2 and L3 are on par or better than INTEL's...

LTC8K6 · Sep 14, 2016

So why not bench Zen properly then? If you can bench Zen, why bench it in an ambiguous way? If it's impressive, why not impress us with the leak?

If you impress us with the leak, we might hold off on buying a new CPU.

This seems like the same road we have traveled before. Both with CPU and GPU.

Hopefully it's not.

The Stilt · Sep 14, 2016

Holding back the benchmarks really is no indicator to either direction. They might hold them back because the performance isn't as impressive as many people have expected, or because it is impressive and releasing the information of the product now would cause people to stop purchasing the current inventory. Zeppelin is at least 4 and half months away and AMD has tons of 32nm and 28nm inventory. In fact, both 32nm and 28nm parts are still produced (at least some SKUs). You don't want to turn your multi-million dollar inventory into a pumpkin over-night, by releasing the benchmarks of a better product.

LTC8K6 · Sep 14, 2016

The Stilt said:
Holding back the benchmarks really is no indicator to either direction. They might hold them back because the performance isn't as impressive as many people have expected, or because it is impressive and releasing the information of the product now would cause people to stop purchasing the current inventory. Zeppelin is at least 4 and half months away and AMD has tons of 32nm and 28nm inventory. In fact, both 32nm and 28nm parts are still produced (at least some SKUs). You don't want to turn your multi-million dollar inventory into a pumpkin over-night, by releasing the benchmarks of a better product.

Releasing a poor benchmark could just cause people to go ahead and buy the competition's product, reducing sales anyway. Is Zen really going to interfere with Vishera sales at this point? It's not the same socket, so we are talking about people who are waiting to upgrade their whole system. It seems unlikely that a good Zen bench would affect current CPU sales.

Arachnotronic · Sep 14, 2016

The Stilt said:
Holding back the benchmarks really is no indicator to either direction. They might hold them back because the performance isn't as impressive as many people have expected, or because it is impressive and releasing the information of the product now would cause people to stop purchasing the current inventory. Zeppelin is at least 4 and half months away and AMD has tons of 32nm and 28nm inventory. In fact, both 32nm and 28nm parts are still produced (at least some SKUs). You don't want to turn your multi-million dollar inventory into a pumpkin over-night, by releasing the benchmarks of a better product.

I would say any such "Osbourning" is already done, there is a lot of hype around Zen, much of it coming from AMD itself.

Abwx · Sep 14, 2016

LTC8K6 said:
So why not bench Zen properly then? If you can bench Zen, why bench it in an ambiguous way? If it's impressive, why not impress us with the leak?

.

It was benched properly, Blender is no worse than Cinebench and certainly less biaised, or should AMD use some heavily ICC compiled bench preferably, that s what you woud call a proper "benchmark"..?..

I m sure that if they had used PovRay they would had displayed even better performance, but for sure that we would have heard that s it s not a proper bench as well..

Arachnotronic · Sep 14, 2016

A proper Geekbench 4 run of Zeppelin would tell us pretty much everything we need to know about Zen. Hope one leaks soon (I don't know how much faith I have in the Naples GB4 run that we saw, score looks too low).

bjt2 · Sep 14, 2016

Arachnotronic said:
I would say any such "Osbourning" is already done, there is a lot of hype around Zen, much of it coming from AMD itself.

There is also a lot of FUD and counter-hype... 3.2GHz max even on the 4 core version, mid 2017 availability, high power drain, blender as best case and terrible on games... Who will buy a 8 core 3.2GHz that goes well in blender and poor in games? Do you remember the "leak" of the game benchmarck (Aots or something, i don't know much games)? It was terrible... Even for an A0 ES...

Abwx · Sep 14, 2016

bjt2 said:
It was terrible... Even for an A0 ES...

It wasnt, it s just that some viral marketers took a non significant subscore and presented it as a CPU bench, the actual perfs displayed in Aots point to the same thing as AMD s Blender demo but with what looks to be a restricted RAM bandwith plateform.

http://wccftech.com/amd-zen-es-benchmark/

With half the core count but 15% higher frequency than the ES Zen would match the i7 in this graph.

New Zen microarchitecture details

Senior member

Lifer

Golden Member

Senior member

Platinum Member

Lifer

Golden Member

Diamond Member

Lifer

Golden Member

Senior member

Diamond Member

Senior member

Platinum Member

Member

Diamond Member

Senior member

Lifer

Golden Member

Lifer

Lifer

Lifer

Lifer

Senior member

Lifer