New Zen microarchitecture details


KTE

Senior member
May 26, 2016
478
130
76
AMD seems to have a lot of sound designs but they are usually held back by their poor cache and memory performance. If AMD could improve that, I think it would bring them much closer to where they need to be.

I'm also worried about their chipset. Rumors say Asmedia is designing it but rumors also say there have been delays/issues. Not only does AMD have to execute their CPU well but they also have to execute their chipset well. This is an area I think AMD will be able to provide an advantage over Intel for the same price point at least in the mainstream market.

There's a lot riding on GlobalFoundries' 14nm process and so far it doesn't seem to be that great but hopefully from now till Zen launch it'll mature a bit. I can only wonder what Zen on IBM's (now GF's) 22nm FDSOI would look like...
Ignoring decode, predecode, execution, scheduling and retirement...

Fetch / Predictors -> Cache -> Memory

In my view, this is where it is make or break for the AMD Zen uarch. I don't doubt the raw execution power; the question is how effective the local trace caches are at relieving bottlenecks at every stage.

L0/1/2 crucial for DT/Mobile.
As well as L3/Mem crucial for Server/HPC.

As for process, I expect that is why Lisa has delayed mass availability to Q1 17 -- which I believe is a good move for competitiveness.

Forbes: "She’s expecting even bigger gains when the company’s newest line of high-end computer chips, dubbed “Zen,” goes on sale next year. “It’s a nice way for us to really increase our reach,” says the CEO, who is more fond of understatement than bold pronouncements."

Sent from HTC 10
(Opinions are own)
 

KTE

Senior member
May 26, 2016
478
130
76
You make the project, I'll run it on three different systems (Excavator, Sandy Bridge, Deneb).
Just run any common bench. Something like SPEC Int is a good example. I am so short of time that I can hardly reply in detail, so I mostly end up keeping this to a generalised overview.

As soon as I can, I will install and set this up. I have a Skylake, Ivybridge and Bulldozer (FX-4300) for checking.

Excavator's branch prediction rates seem like they should be better than Sandy Bridge, judging by Agner's comments.

http://www.agner.org/optimize/microarchitecture.pdf

Pages 28 & 33
Thanks. I will have a look, although I have skimmed quite a bit of that before.

If Excavator had even Sandy Bridge level caches, things would be a lot different.

BDvsSandy-Caches.png

Oh, and that's 5Ghz Bulldozer vs 4.5Ghz Sandy Bridge.
When I talk about caches, I mean L0/1/2 and other trace caches, like the predecode. That's where I think Intel ends up getting a major chunk of victories since Nehalem.

I don't agree that BD is a good or worthy competitive design, though, due to BD's inherent speed-demon mispredict penalty and the lack of future-proofing in such designs (if that is what you are implying).

Also, in Q1 2017 we're looking for a Broadwell competitor at minimum, rather than a Sandy Bridge one. Bear in mind that Intel is currently keeping performance limited due to the lack of any challenge. AMD Zen needs to account for Intel's challenges for 2017-2019, not 2012-2015 :)

Sent from HTC 10
(Opinions are own)
 

KTE

Senior member
May 26, 2016
478
130
76
If I were a betting man, I would bet that Zen has just as long a pipeline as CON (my guess is that the front end and L/S are an evolution of the CON core), so it will be interesting to see how, or if, they can reduce failed branches and the associated penalty. Will we see the much-patented retirement queue cache/trace cache, checkpointing, etc.? I expect they will have done something to alleviate the 20+ cycle branch miss penalty of a pipeline of that length.

I hope the Zen core does take from the CON design, minus bottlenecks such as FP L/S using the Int pipes.

Agena, Deneb and Thuban had NB power of 5-15W depending on the load and Mem config. I never measured (nor did Michael of LostCircuits) lower idle figures.

Sent from HTC 10
(Opinions are own)
 

KTE

Senior member
May 26, 2016
478
130
76
If it was a drop and optimize port it would not be on 32nm PDSOI but on 22nm FDSOI.

With an optimized Orochi die on FDSOI with OD+FBB/SRAM+RBB, we are looking at ~4.5 GHz as the nominal clock, with 30-40% yield aiming towards 5 GHz.

BD was a fail for competitiveness right from the start, even on paper. 7% lower IPC than Deneb but 99mm^2 bigger than SnB is no way to start a battle.

Sent from HTC 10
(Opinions are own)
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
When I talk about caches, I mean L0/1/2 and other trace caches, like the predecode. That's where I think Intel ends up getting a major chunk of victories since Nehalem.
Seeing the AIDA numbers again - Vishera does 8 128b L1 reads per clock -> strong limitation, while there are 8 L1 caches with 2 read ports each (16 transfers). This is due to the SIMD code in AIDA plus the FPU's load buffer limitations. But on a side note: if the L1D only had two 64b read ports, would we have noticed?
 

naukkis

Senior member
Jun 5, 2002
695
564
136
Seeing the AIDA numbers again - Vishera does 8 128b L1 reads per clock -> strong limitation, while there are 8 L1 caches with 2 read ports each (16 transfers). This is due to the SIMD code in AIDA plus the FPU's load buffer limitations. But on a side note: if the L1D only had two 64b read ports, would we have noticed?

Each L1D has only two 64-bit read ports, so it can read only 128 bits per thread into the FPU per clock. That limitation was discussed well before the BD launch.
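
As a quick sanity check on those figures, here is a minimal Python sketch of the arithmetic. The helper name and its parameters are made up for illustration; the port counts and widths are simply the ones quoted in this thread, not values taken from AMD documentation.

```python
# Back-of-envelope check of the L1D read-port figures quoted in this thread.
# The helper name and parameters are illustrative, not from any AMD document.

def l1_read_bits_per_clock(caches: int, ports_per_cache: int, port_bits: int) -> int:
    """Peak bits readable per clock across the given number of L1D caches."""
    return caches * ports_per_cache * port_bits

per_core = l1_read_bits_per_clock(1, 2, 64)    # two 64-bit ports in one L1D
whole_chip = l1_read_bits_per_clock(8, 2, 64)  # eight L1Ds on an 8-core part
aida_figure = 8 * 128                          # "8 128b L1 reads per clock"

print(per_core, whole_chip, whole_chip == aida_figure)
```

If the ports really are 64 bits wide, 16 port transfers per clock and eight 128b reads per clock describe the same peak, which would make the AIDA number the port limit rather than something below it.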
 

KTE

Senior member
May 26, 2016
478
130
76
Interesting. Hmm, did I miss that somehow? This will be another Zen advantage.
The biggest front-end limitation is probably the ability to fetch/decode only 16 instructions per 8 BD cores vs 32 for Intel. Even Thuban can fetch/decode more (18). With SMT's hunger to keep more instructions in flight, this will need to be rectified.



Sent from HTC 10
(Opinions are own)
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
The biggest front-end limitation is probably the ability to fetch/decode only 16 instructions per 8 BD cores vs 32 for Intel. Even Thuban can fetch/decode more (18). With SMT's hunger to keep more instructions in flight, this will need to be rectified.

Not quite how it works.

Bulldozer fetch and pick are 32 bytes/cycle, which is anywhere from 2 to 8 instructions per module per cycle.

By comparison, Sandy Bridge, Haswell, or even Skylake can do 16 bytes/cycle, or up to 4 instructions/cycle (ignoring macro-op fusion, which can add one more).

So, Bulldozer can do twice the fetching of Intel... per module. Or the same per core.
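
A rough Python sketch of that bytes-to-instructions relationship follows. The function name and the average instruction lengths are assumptions chosen for illustration; the 32B and 16B fetch widths and the four decoders are the figures quoted above, and the uop caches on the Intel side are deliberately ignored.

```python
# Rough sketch: how many instructions per cycle a fetch window can feed.
# fetch_bound and the average instruction lengths are illustrative only.

def fetch_bound(fetch_bytes: int, avg_insn_bytes: float, decode_width: int) -> float:
    """Instructions/cycle, limited by fetched bytes and by decoder count."""
    return min(fetch_bytes / avg_insn_bytes, decode_width)

for avg_len in (4.0, 8.0):
    bd_module = fetch_bound(32, avg_len, 4)   # Bulldozer module: 32B/cycle
    intel_core = fetch_bound(16, avg_len, 4)  # Intel core: 16B/cycle
    print(avg_len, bd_module, intel_core)
```

Under these assumptions both designs are decoder-bound with short instructions, and the narrower 16B window only becomes the limit once the average instruction length goes above 4 bytes.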

Decode in Bulldozer is a possible bottleneck when both cores in a module are active.

Bulldozer's front-end, per module, can decode 4 instructions, same as Intel cores. And, similar to Intel, AMD's decoders can spit out more than one uop per x86 instruction... there's some more nuance to all of this, but each Bulldozer module has approximately the same decode capabilities as an Intel Skylake core.

Zen will have this capability for every core, similar to Intel's Skylake, so the front-end should not be an issue. A possible difference may come from how each handles SMT decoding, but we'll just have to wait and see.
 

DrMrLordX

Lifer
Apr 27, 2000
21,570
10,762
136
Power9 will be using the 14nm HP FinFET process developed by IBM and now owned by GlobalFoundries. Now AMD could grow some balls and say that the WSA won't be fulfilled unless they get access to the process. I would assume that obligations to fulfill any existing contracts expire the moment a bankruptcy is declared anyway, no? So in that respect GlobalFoundries has nothing to lose.

I'm wondering if/when AMD plans to avail themselves of that process. It would seem more-suited to a desktop/workstation/server CPU than 14nm LPP anyway. Maybe it's a cost thing?
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
It seems AMD picked TSMC for Polaris 11, since the chip says made in Taiwan. If that's the case, it's all about the WSA as usual. Not that the process is going to save a poor uarch anyway, though.
 

KTE

Senior member
May 26, 2016
478
130
76
It seems AMD picked TSMC for Polaris 11, since the chip says made in Taiwan. If that's the case, it's all about the WSA as usual. Not that the process is going to save a poor uarch anyway, though.
True.

How do you know it's a poor uarch?


Bulldozer's front-end, per module, can decode 4 instructions, same as Intel cores. And, similar to Intel, AMD's decoders can spit out more than one uop per x86 instruction... there's some more nuance to all of this, but each Bulldozer module has approximately the same decode capabilities as an Intel Skylake core.

I think you're mistaken somewhat here. The 4 IPC is a best case, 'up-to', with low IPC workloads.

Fetch/Decode are both shared between 2 cores in a module. L1I line is 64B so a single fetch takes 2 cycles because it's 32B a time into IBB. An IBB per core but that dispatch window of 16B to the decode is shared. With multi-core execution, more often than not, you'll find only 1 of the fetches being decoded per module.

BD isn't a true 4-wide design due to the heavy sharing. Hence, increase the threads executing and decode bandwidth drops. It's not 4 per thread anymore but 4 per 2 cores.

I remember reading that somewhere years ago and it's stuck.... Lemme search and edit here.

Anand Lal Shimpi: http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/2

David Kanter said:
The decode phase for Bulldozer, shown in Figure 3, has been improved, but the changes are far less dramatic than for fetching. The decoding begins by inspecting the first two of the 16B windows in the IBB for a single core. In many circumstances, instructions can be taken from both windows, but there are restrictions based upon alignment, number of loads and stores, branches, and other factors which can restrict decoding to a single 16B window.
http://www.realworldtech.com/bulldozer/5/

Sent from HTC 10
(Opinions are own)
 

laamanaator

Member
Jul 15, 2015
66
10
41
It seems AMD picked TSMC for Polaris 11, since the chip says made in Taiwan. If that's the case, it's all about the WSA as usual. Not that the process is going to save a poor uarch anyway, though.

No they did not. The GPU die itself is made in GF fabs, but the assembly is done in Taiwan. The RX480 also has "Made in Taiwan" marked on its die guard (the metal thing), and it's made by GF.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
True.

How do you know it's a poor uarch?


Well, poor is a somewhat floating term, but compared to the competition, for example. While 14LPP is subpar, you can't blame it for everything. There isn't a 70-80% difference between 14LPP and 16FF+, for example.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
No they did not. The GPU die itself is made in GF fabs, but the assembly is done in Taiwan. The RX480 also has "Made in Taiwan" marked on its die guard (the metal thing), and it's made by GF.

Thanks for the clarification. I was unsure if the 470/480 said the same.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
If the Polaris dies had more room on them, they would say "Diffused in USA, Made in Taiwan". Packaged at Hu Kuo site by Amkor.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
I think you're mistaken somewhat here. The 4 IPC is a best case, 'up-to', with low IPC workloads.

Of course it is an 'up-to' - the same applies to Intel's Skylake, except it can, at times, do up-to five instructions thanks to macro-op fusion. However, Intel sometimes creates four uops from one x86 instruction, so that should effectively level the playing field in that regard.

Fetch/Decode are both shared between 2 cores in a module.

Correct, which only matters when both cores are active. One active core can use all of the front-end resources. Skylake can have two threads running on the same front-end resources as well... the fact that it is just one core means only very little.

The instruction fetch unit handles the task of interpreting a couple of idle and power state features and loads the pick buffer accordingly. If an instruction exists for both threads/cores, they will end up in different lines (adjacent, alternating) of the pick buffer which results in the decoders working on a different thread's instructions every other cycle when both cores are fully loaded with non-idle instructions.

L1I line is 64B so a single fetch takes 2 cycles because it's 32B a time into IBB.

Still 32B/cycle... also, the L1 code cache (L1I) comes before the instruction fetch, so irrelevant (unless it was unable to sustain the 32B/cycle the decoders can chew through).

An IBB per core but that dispatch window of 16B to the decode is shared.

There is a 32B bus to the decoders, not 16B. There are many issues with alignment, though, and Bulldozer actually has what - I can only assume - is an implementation bug that results in linear code on one thread maxing out at 21B/clock. That was fixed in Piledriver... and you see the impact that made :rolleyes:

With multi-core execution, more often than not, you'll find only 1 of the fetches being decoded per module.

No, not at all. The 32B pick buffer lines/entries are filled with alternating instructions when both cores are active.

The effect is that each core has all four decoders every other cycle, and averages to 16B/cycle when both cores are fully loaded.

Pick Buffer lines
[0] CORE 0 - 32B
[1] CORE 1 - 32B
[2] CORE 0 - 32B
[3] CORE 1 - 32B

However, this is a common scenario:

[0] CORE 0 - 32B
[1] CORE 0 - 32B
[2] CORE 1 - 32B (idle set, power state set)
[3] CORE 0 - 32B

And any combination thereof...
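
To put numbers on the two pick buffer layouts above, here is a small Python sketch. It assumes one 32B line is handed to the decoders per cycle and that a line flagged idle is skipped at no cost; both are simplifications of mine rather than documented behaviour, and decode_share is a made-up name.

```python
# Toy model of the pick buffer layouts listed above.
# Assumptions (not documented behaviour): one 32B line is decoded per cycle,
# and lines flagged idle are skipped without costing a cycle.

def decode_share(pick_lines, line_bytes=32):
    """Average decode bytes/cycle per core for a list of (core, idle) lines."""
    active = [core for core, idle in pick_lines if not idle]
    cycles = len(active)
    totals = {}
    for core in active:
        totals[core] = totals.get(core, 0) + line_bytes
    return {core: total / cycles for core, total in totals.items()}

both_loaded = [(0, False), (1, False), (0, False), (1, False)]
core1_idle = [(0, False), (0, False), (1, True), (0, False)]

print(decode_share(both_loaded))
print(decode_share(core1_idle))
```

With both cores loaded each one averages 16B/cycle, and with core 1 idle core 0 keeps the full 32B/cycle, which is the behaviour described above.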

BD isn't a true 4-wide design due to the heavy sharing. Hence, increase the threads executing and decode bandwidth drops. It's not 4 per thread anymore but 4 per 2 cores.

Each core is 4-wide, and the pathways are all 4-wide. You just share the front-end between two cores every other cycle. The impact of this is well studied - about 15% performance cost per core when both cores are fully loaded.

And, Zen will not have any of these issues. It has the full capabilities, and more, for every core.

And we know that Bulldozer's front-end can do better than Sandy Bridge's even with those issues; it just requires both cores to be running, as Bulldozer is ALU starved.

If you take a good close look at how Bulldozer was designed, it appears that AMD intended to be able to issue integer instructions from one thread onto both cores. This would have been a nice boost to single threaded performance, and CMT would bring a 50% gain, instead of an ~85% gain. It seems they abandoned that effort at some point during the design.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
If you take a good close look at how Bulldozer was designed, it appears that AMD intended to be able to issue integer instructions from one thread onto both cores. This would have been a nice boost to single threaded performance, and CMT would bring a 50% gain, instead of an ~85% gain. It seems they abandoned that effort at some point during the design.
Nope, CMT does not equal SIMT.

The front-end of Bulldozer -> Excavator was meant for something like Alpha's 21264 or AMD's K9(3 AGLUs(K8/10h/12h backwards compatibility) + 1 ALU).

Alpha 21264 vs AMD Bulldozer/Excavator Int core;
4 Adds (L01 + U01) vs 2 Adds(EX01)

As for Bulldozer-to-Excavator, the integer core only needs 2 decode pipes. 4 Macro-ops = 4 Computational Ops + 4 Load/Store Ops

EX01 are the computational pipes and there are two.
AG01 are the load/store pipes and there are two.
Two macro-ops are needed at best. Worst case is four when doing double micro-ops. Which is two four-operand macro-ops on two-operand EX/AGLU pipes. L01 in 21264 were three-operand AGLU pipes so LEAs were done single unit in a single cycle.

Bulldozer ideally performs about like an Alpha 21164, even though it has the physical structure of an Alpha 21264.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Nope, CMT does not equal SIMT.

The fetch and decode don't care about the internal arrangement of resources - they care about thread count. 2 == 2. AMD proportions those resources equally; I'm not entirely certain how Intel does it with HT, but they still have to keep a separation for the second thread.

The front-end of Bulldozer -> Excavator was meant for something like Alpha's 21264 or AMD's K9(3 AGLUs(K8/10h/12h backwards compatibility) + 1 ALU).

Alpha 21264 vs AMD Bulldozer/Excavator Int core;
4 Adds (L01 + U01) vs 2 Adds(EX01)

As for Bulldozer-to-Excavator, the integer core only needs 2 decode pipes. 4 Macro-ops = 4 Computational Ops + 4 Load/Store Ops

EX01 are the computational pipes and there are two.
AG01 are the load/store pipes and there are two.
Two macro-ops are needed at best. Worst case is four when doing double micro-ops. Which is two four-operand macro-ops on two-operand EX/AGLU pipes. L01 in 21264 were three-operand AGLU pipes so LEAs were done single unit in a single cycle.

Bulldozer ideally performs about like an Alpha 21164, even though it has the physical structure of an Alpha 21264.

Yes, but none of this is related to the front-end, which is the only part particularly relevant to Zen...

The fact remains that Bulldozer's front-end, even with its flaws, can stream and decode instructions fast enough to permit Intel levels of performance when those resources are thrown at a single core wide enough to execute those instructions and when not hindered by a poorly performing cache system.

Steamroller has dedicated decoders per core - and is only 6.7% faster per core over Piledriver in my testing. Steamroller also has the uop cache which Zen probably inherited, and some loop optimizations which will also be found in Zen.
 

KTE

Senior member
May 26, 2016
478
130
76
Of course it is an 'up-to' - the same applies to Intel's Skylake, except it can, at times, do up-to five instructions thanks to macro-op fusion. However, Intel sometimes creates four uops from one x86 instruction, so that should effectively level the playing field in that regard.



Correct, which only matters when both cores are active. One active core can use all of the front-end resources. Skylake can have two threads running on the same front-end resources as well... the fact that it is just one core means only very little.

The instruction fetch unit handles the task of interpreting a couple of idle and power state features and loads the pick buffer accordingly. If an instruction exists for both threads/cores, they will end up in different lines (adjacent, alternating) of the pick buffer which results in the decoders working on a different thread's instructions every other cycle when both cores are fully loaded with non-idle instructions.



Still 32B/cycle... also, the L1 code cache (L1I) comes before the instruction fetch, so irrelevant (unless it was unable to sustain the 32B/cycle the decoders can chew through).



There is a 32B bus to the decoders, not 16B. There are many issues with alignment, though, and Bulldozer actually has what - I can only assume - is an implementation bug that results in linear code on one thread maxing out at 21B/clock. That was fixed in Piledriver... and you see the impact that made :rolleyes:



No, not at all. The 32B pick buffer lines/entries are filled with alternating instructions when both cores are active.

The effect is that each core has all four decoders every other cycle, and averages to 16B/cycle when both cores are fully loaded.

Pick Buffer lines
[0] CORE 0 - 32B
[1] CORE 1 - 32B
[2] CORE 0 - 32B
[3] CORE 1 - 32B

However, this is a common scenario:

[0] CORE 0 - 32B
[1] CORE 0 - 32B
[2] CORE 1 - 32B (idle set, power state set)
[3] CORE 0 - 32B

And any combination thereof...



Each core is 4-wide, and the pathways are all 4-wide. You just share the front-end between two cores every other cycle. The impact of this is well studied - about 15% performance cost per core when both cores are fully loaded.

And, Zen will not have any of these issues. It has the full capabilities, and more, for every core.

And we know that Bulldozer's front-end can do better than Sandy Bridge's even with those issues; it just requires both cores to be running, as Bulldozer is ALU starved.

If you take a good close look at how Bulldozer was designed, it appears that AMD intended to be able to issue integer instructions from one thread onto both cores. This would have been a nice boost to single threaded performance, and CMT would bring a 50% gain, instead of an ~85% gain. It seems they abandoned that effort at some point during the design.
Thanks for your reply.

I understand how the core parts function, and we agree on fetch. But the point about decoders not losing max theoretical bandwidth as more threads are fired up seems unintuitive and contentious to me in respect to BD, and against what Anand/D.Kanter understand. I will explain in more detail as soon as I can find the time (probably during work)... :)

Sent from HTC 10
(Opinions are own)
 

.vodka

Golden Member
Dec 5, 2014
1,203
1,537
136
High performance caches as a highlight? Have they finally solved one of their major bottlenecks throughout the years?
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
High performance caches as a highlight? Have they finally solved one of their major bottlenecks throughout the years?
Yeah, wtf? I noticed it as well. To me the ppt just looks like a mess with some random technical nonsense tacked on. They could give more precise performance information, a more consistent arch description, or more specifics, e.g. latency numbers for the caches. IMO it's just not good enough. Who is the audience for this meaningless crap?