New Zen microarchitecture details

Discussion in 'CPUs and Overclocking' started by Dresdenboy, Mar 1, 2016.

  1. KTE

    KTE Senior member

    Joined:
    May 26, 2016
    Messages:
    478
    Likes Received:
    130
    Ignoring decode, predecode, execution, scheduling and retirement...

    Fetch / Predictors -> Cache -> Memory

    In my view, this is where it's make or break for the AMD Zen uarch. I don't doubt the raw execution power... but I do doubt how effective the local/trace caches at every stage will be at relieving bottlenecks.

    L0/L1/L2 are crucial for DT/mobile, while L3/memory are crucial for server/HPC.

    As for process, I expect that is why Lisa has delayed mass availability to Q1 17 -- which I believe is a good move for competitiveness.

    Forbes: "She’s expecting even bigger gains when the company’s newest line of high-end computer chips, dubbed “Zen,” goes on sale next year. “It’s a nice way for us to really increase our reach,” says the CEO, who is more fond of understatement than bold pronouncements."

    Sent from HTC 10
    (Opinions are own)
     
  2. KTE

    KTE Senior member

    Just run any common bench; something like SPECint is a good example. I'm hugely short of time to reply in detail, so I mostly end up keeping this generalised as an overview.

    As soon as I can, I will install and set this up. I have a Skylake, an Ivy Bridge and a Bulldozer (FX-4300) for checking.

    Thanks. I will have a look, although I have skimmed quite a bit of that before.

    When I talk about caches, I mean L0/1/2 and the other trace caches, like the predecode cache. That's where I think Intel has been getting a major chunk of its victories since Nehalem.

    I don't agree that BD was a good or worthy competitive design, though, due to BD's inherent speed-demon mispredict penalty, and also the lack of future-proofing in such designs (if that is what you are implying).

    Also, in Q1 2017 we're looking at a Broadwell competitor at minimum, rather than Sandy Bridge. Bear in mind that Intel is currently keeping performance limited due to the lack of any challenge. AMD Zen needs to account for Intel's challenges of 2017-2019, not 2012-2015 :)

     
    #2602 KTE, Jul 30, 2016
    Last edited: Jul 30, 2016
  3. KTE

    KTE Senior member

    I hope the Zen core does take from the 'Con' (construction-core) design, except for bottlenecks like FP loads/stores using the Int pipes.

    Agena, Deneb and Thuban had NB power of 5-15W depending on the load and Mem config. I never measured (nor did Michael of LostCircuits) lower idle figures.

     
  4. KTE

    KTE Senior member

    BD was a fail for competitiveness right from the start, even on paper: 7% lower IPC than Deneb, yet 99 mm² bigger than SnB, is no way to start a battle.

     
  5. Dresdenboy

    Dresdenboy Golden Member

    Joined:
    Jul 28, 2003
    Messages:
    1,730
    Likes Received:
    554
    Seeing the AIDA numbers again - Vishera does 8 128-bit L1 reads per clock -> a strong limitation, while there are 8 L1 caches with 2 read ports each (16 transfers). This is due to the SIMD code in AIDA plus the FPU's load-buffer limitations. But on a side note: if the L1D had only two 64-bit read ports, would we have noticed?
     
  6. naukkis

    naukkis Member

    Joined:
    Jun 5, 2002
    Messages:
    140
    Likes Received:
    37
    It has only two 64-bit read ports per L1D, so it can read only 128 bits per thread per clock into the FPU. That limitation was discussed well before the BD launch.
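    A back-of-the-envelope sketch of what those two 64-bit ports mean for bandwidth. The port count and width are from the post above; the 4 GHz clock is my own illustrative figure, not from the thread:

    ```python
    # Peak L1D read bandwidth from port configuration.
    # Assumes two 64-bit read ports per L1D (as stated above);
    # the 4 GHz clock is illustrative only.

    def l1d_read_bw_gbs(ports: int, port_bits: int, clock_ghz: float) -> float:
        """Peak L1D read bandwidth in GB/s."""
        bytes_per_cycle = ports * port_bits // 8
        return bytes_per_cycle * clock_ghz

    # Two 64-bit reads per clock = 16 B/cycle = one 128-bit SIMD load per thread.
    print(l1d_read_bw_gbs(ports=2, port_bits=64, clock_ghz=4.0))  # 64.0 GB/s
    ```

    So a single thread tops out at one 128-bit load per cycle, exactly the limit described above.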
     
  7. Dresdenboy

    Dresdenboy Golden Member

    Interesting. Hmm, did I miss that somehow? This will be another Zen advantage.
     
  8. KTE

    KTE Senior member

    The biggest front-end limitation is probably the ability to fetch/decode only 16 instructions across 8 BD cores vs. 32 for Intel. Even Thuban can fetch/decode more (18). With SMT's hunger to keep more instructions in flight, this will need to be rectified.



     
  9. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    715
    Likes Received:
    1,634
    Not quite how it works.

    Bulldozer fetch and pick are 32 bytes/cycle, which is anywhere from 2 to 8 instructions per module per cycle.

    By comparison, Sandy Bridge, Haswell, or even Skylake can do 16 bytes/cycle - or up to 4 instructions/cycle (ignoring macro-op fusion, which can add one more).

    So, Bulldozer can do twice the fetching as Intel... per module. Or the same per core.

    Decode in Bulldozer is a possible bottleneck when both cores in a module are active.

    Bulldozer's front-end, per module, can decode 4 instructions, same as Intel cores. And, similar to Intel, AMD's decoders can spit out more than one uop per x86 instruction... there's some more nuance to all of this, but each Bulldozer module has approximately the same decode capabilities as an Intel Skylake core.

    Zen will have this capability for every core, similar to Intel's Skylake, so the front-end should not be an issue. A possible difference may come from how each handles SMT decoding, but we'll just have to wait and see.
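    The per-core arithmetic behind "twice the fetching per module, or the same per core" can be sketched quickly. This is a toy calculation using only the figures quoted above, not anything from AMD or Intel documentation:

    ```python
    # Effective fetch bandwidth per core, from the figures above:
    # Bulldozer: 32 B/cycle per module, shared by 2 cores when both are active.
    # Sandy Bridge/Haswell/Skylake: 16 B/cycle per core.

    def fetch_bytes_per_core(bytes_per_cycle: int, cores_sharing: int) -> float:
        """Bytes/cycle of fetch available to each core sharing the front-end."""
        return bytes_per_cycle / cores_sharing

    bd_busy = fetch_bytes_per_core(32, cores_sharing=2)  # both module cores active
    bd_solo = fetch_bytes_per_core(32, cores_sharing=1)  # one core active
    intel   = fetch_bytes_per_core(16, cores_sharing=1)

    print(bd_busy, bd_solo, intel)  # 16.0 32.0 16.0
    ```

    With the module fully busy, each Bulldozer core sees the same 16 B/cycle as an Intel core; Bulldozer's raw fetch advantage only shows with one core active.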
     
  10. DrMrLordX

    DrMrLordX Diamond Member

    Joined:
    Apr 27, 2000
    Messages:
    9,857
    Likes Received:
    1,264
    I'm wondering if/when AMD plans to avail themselves of that process. It would seem more suited to a desktop/workstation/server CPU than 14nm LPP anyway. Maybe it's a cost thing?
     
  11. ShintaiDK

    ShintaiDK Lifer

    Joined:
    Apr 22, 2012
    Messages:
    20,395
    Likes Received:
    128
    It seems AMD picked TSMC for Polaris 11, since the chip says made in Taiwan. If that's the case, it's all about the WSA as usual. Not that the process is going to save a poor uarch anyway, though.
     
  12. KTE

    KTE Senior member

    True.

    How do you know it's a poor uarch?


    I think you're mistaken somewhat here. The 4 IPC is a best case, an 'up to'; real workloads have much lower IPC.

    Fetch and decode are both shared between the 2 cores in a module. The L1I line is 64B, so a single fetch takes 2 cycles because it's 32B at a time into the IBB. There's an IBB per core, but the 16B dispatch window into the decoders is shared. With multi-core execution, more often than not you'll find only one of the fetches being decoded per module.

    BD isn't a true 4-wide design, due to the heavy sharing. Hence, as you increase the threads executing, decode bandwidth drops. It's not 4 per thread anymore but 4 per 2 cores.

    I remember reading that somewhere years ago and it's stuck.... Lemme search and edit here.

    Anand Lal Shimpi: http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/2

    http://www.realworldtech.com/bulldozer/5/

     
    #2612 KTE, Jul 31, 2016
    Last edited: Jul 31, 2016
  13. laamanaator

    laamanaator Member

    Joined:
    Jul 15, 2015
    Messages:
    66
    Likes Received:
    10
    No, they did not. The GPU die itself is made in GF's fabs, but the assembly is done in Taiwan. The RX 480 also has "Made in Taiwan" marked on its die guard (the metal thing), and it's made by GF.
     
  14. ShintaiDK

    ShintaiDK Lifer

    Well, "poor" is a somewhat fluid term - compared to the competition, for example. While 14LPP is subpar, you can't blame it for everything. There isn't a 70-80% difference between 14LPP and 16FF+, for example.
     
  15. ShintaiDK

    ShintaiDK Lifer

    Thanks for the clarification. I was unsure if the 470/480 said the same.
     
  16. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,686
    Likes Received:
    2,921
    If the Polaris dies had more room on them, they would say "Diffused in USA, Made in Taiwan". Packaged at Hu Kuo site by Amkor.
     
  17. looncraz

    looncraz Senior member

    Of course it is an 'up to' - the same applies to Intel's Skylake, except it can, at times, do up to five instructions thanks to macro-op fusion. However, Intel sometimes creates four uops from one x86 instruction, so that should effectively level the playing field in that regard.

    Correct, which only matters when both cores are active. One active core can use all of the front-end resources. Skylake can have two threads running on the same front-end resources as well... the fact that it is just one core means very little.

    The instruction fetch unit handles the task of interpreting a couple of idle and power state features and loads the pick buffer accordingly. If an instruction exists for both threads/cores, they will end up in different lines (adjacent, alternating) of the pick buffer which results in the decoders working on a different thread's instructions every other cycle when both cores are fully loaded with non-idle instructions.

    Still 32B/cycle... also, the L1 code cache (L1I) comes before the instruction fetch, so it's irrelevant (unless it were unable to sustain the 32B/cycle the decoders can chew through).

    There is a 32B bus to the decoders, not 16B. There are many alignment issues, though, and Bulldozer actually has what - I can only assume - is an implementation bug that results in linear code on one thread maxing out at 21B/clock. That was fixed in Piledriver... and you see the impact that made :rolleyes:

    No, not at all. The 32B pick buffer lines/entries are filled with alternating instructions when both cores are active.

    The effect is that each core has all four decoders every other cycle, and averages to 16B/cycle when both cores are fully loaded.

    Pick Buffer lines
    [0] CORE 0 - 32B
    [1] CORE 1 - 32B
    [2] CORE 0 - 32B
    [3] CORE 1 - 32B

    However, this is a common scenario:

    [0] CORE 0 - 32B
    [1] CORE 0 - 32B
    [2] CORE 1 - 32B (idle set, power state set)
    [3] CORE 0 - 32B

    And any combination thereof...

    Each core is 4-wide, and the pathways are all 4-wide. You just share the front-end between two cores, every other cycle. The impact of this is well studied - about a 15% performance cost per core when both cores are fully loaded.
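    A toy round-robin model of the pick-buffer behaviour described above. The 32 B lines and the alternation are from this post; the simulation itself is just my own sketch:

    ```python
    # Model: a 4-wide decoder consumes one 32 B pick-buffer line per cycle,
    # round-robin between the active cores of a module (as described above).

    def avg_decode_bytes(cycles: int, active_cores: int) -> dict:
        """Average decode bytes/cycle seen by each active core."""
        served = {core: 0 for core in range(active_cores)}
        for cycle in range(cycles):
            served[cycle % active_cores] += 32  # this core owns the decoders
        return {core: total / cycles for core, total in served.items()}

    print(avg_decode_bytes(1000, active_cores=1))  # {0: 32.0} - full front-end
    print(avg_decode_bytes(1000, active_cores=2))  # {0: 16.0, 1: 16.0}
    ```

    With both cores loaded, each one averages 16 B/cycle - which is where "all four decoders every other cycle" comes from.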

    And, Zen will not have any of these issues. It has the full capabilities, and more, for every core.

    And we know that Bulldozer's front-end can do better than Sandy Bridge's even with those issues; it just requires both cores to be running, as Bulldozer is ALU-starved.

    If you take a good close look at how Bulldozer was designed, it appears that AMD intended to be able to issue integer instructions from one thread onto both cores. This would have been a nice boost to single-threaded performance, and CMT would then bring a 50% gain instead of an ~85% gain. It seems they abandoned that effort at some point during the design.
     
  18. NostaSeronx

    NostaSeronx Platinum Member

    Joined:
    Sep 18, 2011
    Messages:
    2,153
    Likes Received:
    194
    Nope, CMT does not equal SIMT.

    The front-end of Bulldozer through Excavator was meant for something like the Alpha 21264, or AMD's K9 (3 AGLUs (for K8/10h/12h backwards compatibility) + 1 ALU).

    Alpha 21264 vs. AMD Bulldozer/Excavator int core:
    4 adds (L01 + U01) vs. 2 adds (EX01)

    As for Bulldozer through Excavator: the integer core only needs 2 decode pipes. 4 macro-ops = 4 computational ops + 4 load/store ops.

    EX01 are the computational pipes and there are two.
    AG01 are the load/store pipes and there are two.
    Two macro-ops are needed at best; the worst case is four, when doing double micro-ops - that is, two four-operand macro-ops on two-operand EX/AGLU pipes. L01 in the 21264 were three-operand AGLU pipes, so LEAs were done in a single unit in a single cycle.

    Bulldozer ideally performs about like the Alpha 21164, even though it has the physical structure of the Alpha 21264.
     
    #2618 NostaSeronx, Jul 31, 2016
    Last edited: Jul 31, 2016
  19. looncraz

    looncraz Senior member

    The fetch and decode don't care about the internal arrangement of resources - they care about thread count, and 2 == 2. AMD apportions those resources equally; I'm not entirely certain how Intel does it with HT, but they still have to keep a separation for the second thread.

    Yes, but none of this is related to the front-end, which is the only part particularly relevant to Zen...

    The fact remains that Bulldozer's front-end, even with its flaws, can stream and decode instructions fast enough to permit Intel levels of performance - when those resources are thrown at a single core wide enough to execute those instructions, and when not hindered by a poorly performing cache system.

    Steamroller has dedicated decoders per core - and is only 6.7% faster per core than Piledriver in my testing. Steamroller also has the uop cache, which Zen probably inherited, and some loop optimizations which will also be found in Zen.
     
  20. KTE

    KTE Senior member

    Thanks for your reply.

    I understand how the core parts function, and we agree on fetch. But the point about the decoders not losing max theoretical bandwidth as more threads are fired up seems unintuitive and contentious to me with respect to BD, and against what Anand/D. Kanter understand. I will explain in more detail as soon as I can find the time (probably during work)... :)

     
  21. dooon

    dooon Member

    Joined:
    Jul 3, 2015
    Messages:
    87
    Likes Received:
    52
    #2621 dooon, Aug 11, 2016
    Last edited: Aug 13, 2016
  22. .vodka

    .vodka Golden Member

    Joined:
    Dec 5, 2014
    Messages:
    1,006
    Likes Received:
    1,028
    High performance caches as a highlight? Have they finally solved one of their major bottlenecks throughout the years?
     
  23. krumme

    krumme Diamond Member

    Joined:
    Oct 9, 2009
    Messages:
    5,565
    Likes Received:
    1,232
    It looks to me from the last PPT that Zen is like 2.8x Excavator?
    Jolly good :)
    /sarcasm
     
    #2623 krumme, Aug 11, 2016
    Last edited: Aug 11, 2016
  24. krumme

    krumme Diamond Member

    Yeah, wtf? Noticed it as well. To me the PPT just looks like a mess with some random technical nonsense tacked on. They could give more precise performance information, a more consistent arch description, or more specifics, like latency numbers for the caches. IMO it's just not good enough. Who is the audience for this meaningless crap?
     
  25. inf64

    inf64 Platinum Member

    Joined:
    Mar 11, 2011
    Messages:
    2,797
    Likes Received:
    1,095
    Well, the slide states 40% - dunno where you see 2.8x?