New Zen microarchitecture details

Discussion in 'CPUs and Overclocking' started by Dresdenboy, Mar 1, 2016.

  1. Abwx

    Abwx Diamond Member

    Joined:
    Apr 2, 2011
    Messages:
    8,246
    Likes Received:
    154
    HW's fourth ALU is in a cluster that has few execution resources. Intel's clustered execution units, each with its own dedicated ALU, can't really be compared to AMD's wider design when it comes to mixed code execution.

    But still, we have some people thinking that Zen's 4 ALUs will be barely as capable as Intel's previous-gen 3 clustered ALUs.

    Unless you can point out which flavor of 14nm LPP is used for the 480 and which one is used for Zen, that data is irrelevant. FTR there's HVT, RVT, LVT, sLVT among others...

    Even knowing the one used for Polaris would be useless if it's not the same as Zen's, and everything points to the transistors being different: Polaris likely uses HVT-RVT due to lower leakage, while Zen will use either LVT or sLVT.
     
    #2551 Abwx, Jul 26, 2016
    Last edited: Jul 26, 2016
  2. AtenRa

    AtenRa Lifer

    Joined:
    Feb 2, 2009
    Messages:
    12,228
    Likes Received:
    1,022
    I will also add that I'm expecting ZEN to use M1 layers that will improve density and energy characteristics over Polaris 10/11 but will also increase the cost of the wafer due to double patterning.
    But since ZEN will most probably be sub-200mm2 and will also be targeting high-end CPUs and thus high ASPs, it should be ok.
     
  3. NostaSeronx

    NostaSeronx Golden Member

    Joined:
    Sep 18, 2011
    Messages:
    1,607
    Likes Received:
    42
    22/20/14 nanometer Bulldozer improvements lost while Zen was implemented.

    FP256; 256-bit [Single Macro-op] Decode, 256-bit Load/Store Width, 256-bit FP/INT Vector Unit Width.

    Clustered Multithreading 2.0;
    Power -> Enhanced PWR_MGMT via Integrated Buck Regulators that manage an enhanced Resonant Clock Mesh. Think about the inductors in the RCM and the per-core AVFS modules in Excavator. (Module(Front-end, L2, FPU) would use Processor VR, cores(Core/LSU) would use their own Buck Regulators.)
    Performance -> Changes were to reduce latency, unspecified.

    Full AGLU; Inclusion of Arithmetic Data Paths in AGLU pipe.
     
    #2553 NostaSeronx, Jul 26, 2016
    Last edited: Jul 26, 2016
  4. Pilum

    Pilum Member

    Joined:
    Aug 27, 2012
    Messages:
    181
    Likes Received:
    0
    The discussion about the number of ALUs misses an important point: ALUs don't matter if you can't keep them fed. For that you need a sufficient number of Load/Store Units. Zen will retain the two LSUs from the Bulldozer family. This will probably pose a pretty hard limit on the performance of various integer algorithms. Just for reference: Intel moved to three LSUs with Haswell in 2013. While RISC systems (e.g. POWER8, Apple A9) often have only half as many LSUs as their sustained instruction throughput, on x86 you want a higher ratio due to the fused load-execute ops.

    I haven't found traces of current x64 software, but for x86 traces from the 90s, load/store ops often accounted for more than 50% of the dynamic instruction count. For SPEC2000 int-gzip, it was nearly 77%; for int-gcc 82%. These are the most extreme cases, but then the lowest percentage is about 40%.

    If current x64 code still has similar ratios, we won't see a general 40% integer IPC uplift vs. EXV, as the LSUs will pose a bottleneck for some types of integer code. For FP this is a different story, as FP stores seem to use only one FP pipeline without using an LSU, so Zen will act as having three LSUs when running FP code.
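The bottleneck argument above can be put into a rough back-of-the-envelope model. This is only a sketch under stated assumptions: every load/store op occupies one LSU slot for one cycle, nothing else limits throughput, and the 4-wide issue width is an assumed figure for illustration. The function name `ipc_bound` is hypothetical.

```python
def ipc_bound(issue_width, lsu_count, mem_fraction):
    """Upper bound on sustained IPC in a toy model.

    Assumes each memory op needs exactly one LSU slot per cycle and that
    no other resource (caches, branches, dependencies) limits throughput.
    """
    if mem_fraction <= 0:
        return float(issue_width)
    # The LSUs can retire at most lsu_count memory ops per cycle, so total
    # IPC cannot exceed lsu_count / mem_fraction.
    return min(float(issue_width), lsu_count / mem_fraction)


if __name__ == "__main__":
    # Memory-op fractions quoted in the post (90s x86 traces).
    fractions = [("lowest case", 0.40), ("typical", 0.50),
                 ("int-gzip", 0.77), ("int-gcc", 0.82)]
    for name, frac in fractions:
        two = ipc_bound(4, 2, frac)    # two LSUs
        three = ipc_bound(4, 3, frac)  # three LSUs
        print(f"{name:12s} mem={frac:.2f}  2 LSUs: {two:.2f}  3 LSUs: {three:.2f}")
```

In this model the two-LSU configuration only falls below the assumed 4-wide limit once memory ops exceed half of the instruction stream, which is the regime the quoted gzip and gcc figures fall into.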

    While the design doesn't seem competitive for a 2017 high-performance x86 product, the design choice makes sense if the execution backend was primarily intended for K12, as it would have the 1:2 ratio of LSUs to issue width which seems normal for RISC architectures.

    BTW, any news on K12? I haven't seen any news on it for a long time. Is it still mentioned by AMD, or does it appear to be canceled?
     
  5. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    712
    Likes Received:
    1,605
    That hasn't been lost on many of us here, I don't believe :biggrin:

    I think you mean AGU (address generation unit). Zen contains a number of improvements that reduce the stress on the AGUs. And you can usually generate addresses much faster than you can read the data you need from those addresses, which is why you don't see massive increases from Intel adding more AGUs and even dedicated store-data units.

    For many reasons, the actual hardware utilization of the AGUs does not match the memory op calls in the code. Frequently that code will reference the same address over and over, so the CPU contains buffers/caches which are checked prior to scheduling an AGU calculation.

    Further, some of those memory ops can be done - as they are in Zen - by ALUs. LEA (Load effective address) is one such optimized address calculation.
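To illustrate why LEA is plain arithmetic rather than a real memory access, here is a minimal sketch of what the instruction computes in 64-bit mode: base + index * scale + displacement, with no load or store performed. The function name `lea` and the constant `MASK64` are illustrative names, not anything from the thread.

```python
MASK64 = (1 << 64) - 1  # results wrap to 64 bits, as in 64-bit address arithmetic


def lea(base, index, scale, disp):
    """Compute the effective address base + index*scale + disp.

    Only arithmetic is performed; memory is never touched, which is why
    an ordinary ALU can execute this operation.
    """
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 addressing only allows scale 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK64


if __name__ == "__main__":
    # Address of element 3 in an array of 8-byte items, 16 bytes into a struct.
    print(hex(lea(0x1000, 3, 8, 16)))
    # Compilers also use LEA as cheap arithmetic, e.g. x * 5 = x + x*4.
    x = 7
    print(lea(x, x, 4, 0))
```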

    Zen has some rather meaningful improvements in this area that can be quite enlightening.

    You can see one of the discussions here:

    http://forums.anandtech.com/showthread.php?p=37905139#post37905139
     
  6. superstition

    superstition Platinum Member

    Joined:
    Feb 2, 2008
    Messages:
    2,219
    Likes Received:
    215
    I've read conflicting opinions about the value of CMT. This article, for instance, states that it doesn't even exist or "doesn't work", while another one states that there is a third type which I can't recall and treats CMT as if it is potentially just as advantageous as SMT, depending upon the workload targeted. That latter article seems to be significantly more credible, in terms of its tone but that could be misleading (tone policing fallacy). However, given the shrillness of that person's argumentation it seems suspect.

    I found this, which is interesting. It says the third type is FMT (fine-grained multithreading).

    [Image]

    So, is this a case of terminology changing? "Coarse-grained" versus "Cluster-based"? Is AMD's CMT something different?

    Is it possible to design a CPU that takes advantage of both CMT and SMT or do they step on each other's toes too much? I've read that Bulldozer does have a bit of SMT happening. This suggests a marriage in "CSMT".

    This is what I am still unclear about. If CMT theoretically works why doesn't it work in practice? Is it just because AMD's design was flawed? If so, why go to SMT now? I can't find the article at the moment but it listed pros and cons for both CMT and SMT — making the case that SMT isn't necessarily the superior option for all cases. The Stilt said AMD doesn't do large caches well and people have discussed the slowness of Bulldozer's L3 cache in particular. So, has anyone come up with a model/schematic that would show a CMT design that is quite efficient and effective?
     
    #2556 superstition, Jul 27, 2016
    Last edited: Jul 27, 2016
  7. NostaSeronx

    NostaSeronx Golden Member

    Joined:
    Sep 18, 2011
    Messages:
    1,607
    Likes Received:
    42
    Cluster-based Multithreading, Clustered Multithreading, Multiclustered Multithreading, and Chip Multithreading are all based on Simultaneous Multithreading.

    The difference is that Simultaneous Multithreading at its nominal definition has threads sharing critical resources. Critical resources are the Instruction Bus (Retirement, Scheduler, etc.), the Data Bus (Load/Store, L1d [First-Level Cache]), the Datapaths (ALUs, AGUs, PRFs), and the Control Unit (Branches, Context, etc.). The above forks of simultaneous multithreading replicate these critical resources. This replication of critical resources increases performance to the point that the design can operate as if it were using Chip Multiprocessing.

    CMT in the context of AMD is;
    Step 1. Make Simultaneous Multithreading architecture.
    Step 2. Replicate Integer/Memory pipelines and separate thread context across those pipelines.
    Step 3. Make it work.
    Step 4. Done.

    It is now a simultaneous multithreading architecture with the nominal/average performance of a standard dual-core processor. It is clustered multithreading now.

    One implementation issue is that Bulldozer isn't a fully faithful interpretation of CMT. One of the areas where this really ruined the design is the Load-Store pipeline. The design on paper had what is currently the L1d being shared between the cores. The write-coalescing would occur at the L0d -> L1 level with lower latency and higher bandwidth. This would allow the L1d to not be at a premium. Instead, it would follow the L1i cache in being placed right next to the L2 Interface.

    So, if AMD came out with an optimized Bulldozer architecture that was faithful to the original; [Using Excavator as basis and following the 8x - 1÷8 rule of caches]
    1 MB L2$ => 1x128 KB L1i$ & 1x128 KB L1d$ => 2x8 KB L0i$ & 2x8 KB L0d$ per module [This cache array implies a nice mobile/low power processor sort of like Banias/Dothan][[Possible wonder why AMD made Stoney Ridge so late in the game.]]
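The sizes listed above follow from repeatedly dividing by eight. A small sketch of that arithmetic, assuming (my reading of the post, not something it states outright) that the "2x8 KB" L0 caches are the per-core halves of a 16 KB per-module total, and that a module holds two cores. The function name `cache_levels` is hypothetical.

```python
def cache_levels(l2_kb, cores_per_module=2, ratio=8):
    """Derive L1 and L0 sizes from the L2 size using a 1/ratio step per level.

    Assumption: the L0 level is split evenly across the cores of a module,
    while L1 and L2 are shared by the whole module.
    """
    l1_kb = l2_kb // ratio              # shared L1i and L1d, each this size
    l0_total_kb = l1_kb // ratio        # total L0 capacity per module
    l0_per_core_kb = l0_total_kb // cores_per_module
    return {
        "L2_KB": l2_kb,
        "L1_KB": l1_kb,
        "L0_total_KB": l0_total_kb,
        "L0_per_core_KB": l0_per_core_kb,
    }


if __name__ == "__main__":
    # 1 MB L2 per module, as in the post.
    for level, size in cache_levels(1024).items():
        print(f"{level}: {size}")
```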

    There are also evolutions in Simultaneous Multithreading that can be used in AMD's CMT to further increase performance.
    Scalable Simultaneous Multithreading => Which would provide Front-end orientated performance increases. [Follows the trend with duplicated decoders in Steamroller]
    Clustered Simultaneous Multithreading => Which would provide FPU orientated performance increases.
     
    #2557 NostaSeronx, Jul 27, 2016
    Last edited: Jul 27, 2016
  8. superstition

    superstition Platinum Member

    Joined:
    Feb 2, 2008
    Messages:
    2,219
    Likes Received:
    215
    Thanks for the explanation. Do you think AMD made the wrong choice by not using those variations of CMT for Zen and going exclusively to SMT? Is the choice related to reducing the chance of compilers favoring Intel designs? Also, did AMD change the Load-Store pipeline in order to maximize clocks at the expense of efficiency?
     
  9. NostaSeronx

    NostaSeronx Golden Member

    Joined:
    Sep 18, 2011
    Messages:
    1,607
    Likes Received:
    42
    Yes, I do think AMD has made a bad choice. Zen is literally Bulldozer 32/28 Redux on 14nm. It has none of the proposed modifications in 22/20/14 Bulldozer.

    *Zen FPU vs 22/20/14 Hypothetical Bulldozer-derived architecture;
    -> Zen => 2 Muls, 2 Adds, or 2 FMAs // Only FP128
    -> Bulldozer => 4 Muls, 4 Adds, or 4 FMAs or any mix like 3 adds, 1 mul or 3 muls, 1 add. // FP128 mode ,, 2 Muls, 2 Adds, or 2 FMAs or 1 add/mul and 1 FMA // FP256 mode [Marketing would probably call it FlexFPU 2.0; link]

    *Zen Memory vs 22/20/14 Hypothetical Bulldozer-derived architecture;
    -> Zen => 2 threads / 2 AGUs / LSU for 2 threads [2x16B_L + 1x16B_S]
    -> Bulldozer => 2 threads / 4 AGUs / LSU0 for 1 thread [2x32B_L + 1x32B_S] + LSU1 for 1 thread [2x32B_L + 1x32B_S]

    *Zen core vs 22/20/14 Hypothetical Bulldozer-derived architecture core;
    -> Zen => 4 ALUs [+ 2 AGUs] with 6 units attached to PRF.
    -> Bulldozer => 4 ALUs (2 ALUs intermixed with AGUs) with 4 units attached to PRF
    -> The ramifications of above imply the Bulldozer design would clock higher or use lower energy.
    The only issue with compilers is with Instruction Set Enhancements. AMD will still have issues if the compiler gives them 128-bit SSE2 2-operand while Intel gets 256-bit/512-bit AVX 3-operand.
    I think the L0d->WCC->L1d arrangement was mostly lost between design proposal and design acceptance. It shows in the WCC write-through policy going not only to the L2 cache, but to the L3 and memory as well. This is primarily caused by the WCC being located in the L2 interface, which means it has access to write into the System Request Interface. The WCC was meant to write through only to a single cache that is inclusive of the cores' data caches within the module. This meant the next policy onwards would of course be write-back. Hindsight? Foresight? Nope, just negligence.
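For readers who want the load/store widths quoted in the memory comparison above as plain numbers, here is a small sketch that sums the port widths. It takes the post's notation at face value ("2x16B_L + 1x16B_S" meaning two 16-byte load ports and one 16-byte store port per LSU) and computes peak bytes per cycle only. The class and function names are hypothetical, and the Bulldozer-derived configuration is the poster's hypothetical design, not a shipped product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadStoreUnit:
    load_ports: int
    store_ports: int
    port_bytes: int  # width of each port in bytes

    def load_bytes_per_cycle(self):
        return self.load_ports * self.port_bytes

    def store_bytes_per_cycle(self):
        return self.store_ports * self.port_bytes


def peak_bytes_per_cycle(units):
    """Return (load, store) peak bytes per cycle summed over all LSUs."""
    loads = sum(u.load_bytes_per_cycle() for u in units)
    stores = sum(u.store_bytes_per_cycle() for u in units)
    return loads, stores


# Figures as quoted in the post.
zen = [LoadStoreUnit(2, 1, 16)]                # one LSU shared by 2 threads
hypothetical_bulldozer = [LoadStoreUnit(2, 1, 32),
                          LoadStoreUnit(2, 1, 32)]  # one LSU per thread

if __name__ == "__main__":
    print("Zen:", peak_bytes_per_cycle(zen))
    print("Hypothetical Bulldozer:", peak_bytes_per_cycle(hypothetical_bulldozer))
```

Peak port width says nothing about sustained throughput, which also depends on cache latency, bank conflicts and scheduling.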
     
    #2559 NostaSeronx, Jul 27, 2016
    Last edited: Jul 27, 2016
  10. KTE

    KTE Senior member

    Joined:
    May 26, 2016
    Messages:
    477
    Likes Received:
    130
    Which translates into how much performance increase exactly? (once again :))
     
  11. superstition

    superstition Platinum Member

    Joined:
    Feb 2, 2008
    Messages:
    2,219
    Likes Received:
    215
    It's hard to imagine that bright engineers would make such serious errors.

    I'm also puzzled when someone with the reputation of Jim Keller would allegedly make bad decisions for such a crucial product. Does he have anything in his track record that suggests a capacity for that? The only things I can think of:

    A) A desire to get positive press via "AMD dumped Bulldozer/CMT, hooray" sentiment

    B) The Zen design team being more familiar with SMT, so their choices are due to a limited vision

    C) We're missing information that will show that Zen is quite good.

    D) (Paranoia alert) Intel people work at AMD and are sabotaging its products.

    It's definitely interesting to see someone say something positive about CMT. All the media coverage I've seen about Zen that mentions the decision to go to SMT exclusively has been favorable.
     
  12. riggnix

    riggnix Junior Member

    Joined:
    Jul 27, 2016
    Messages:
    23
    Likes Received:
    3
    Maybe that's the whole point. Could it be related to the LP process? Maybe higher clocks aren't possible because of it, so it wouldn't help anyway?
     
  13. Erenhardt

    Erenhardt Diamond Member

    Joined:
    Dec 1, 2012
    Messages:
    3,248
    Likes Received:
    101
    Zen+
    *mindblown
     
  14. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    712
    Likes Received:
    1,605
    Zen's FPU can do 3 concurrent adds (SSEi, MMX), using the FPU0, FPU1, & FPU3 pipelines. It can do 2 SSE adds. And, maximally, 4 adds when performing MMX/SSEi and SSE or (presumably) x87 at the same time.

    It can also do 2 divisions, shuffles, comparisons, etc... vector performance should be notably better relative to other comparisons. x87 should improve as well, but that depends on AMD's microcode, decode, and scheduler setup and optimizations.

    It is definitely optimized for SMT, geared towards reducing the number of holes in execution with two threads simultaneously executing on the FPU. It should do even better than Intel's port-based solution here.

    Each thread would still only have 2 AGUs - and they'd sometimes be doing ALU ops - and they'd still be contending with another core for cache access.

    Zen compares modestly favorably here - particularly as the second thread is only expected to deliver 20~30% more total performance than just using the first thread.

    Well, and that single threaded IPC should be notably higher on Zen...

    Clock speed ramifications would depend on the PRF design - 14nm LPP should provide an assistance there.

    Zen's focus is on single threaded performance, Bulldozer is on module throughput. When a Bulldozer module can only just get the throughput of a single Sandy Bridge SMT core, but loses horribly in single threaded work-loads, you have a recipe for market-place disaster.

    Continuing with that design philosophy would have only put AMD deeper into its deficit against Intel. Zen refocuses to put priority on the single thread running on a single core - while making a second thread running on that same core a first class citizen, sharing ALL available execution resources.
     
  15. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    712
    Likes Received:
    1,605
    That is likely part of the reasoning - though the initial design predates 14nm LPP.

    AMD knows that getting 5 GHz shipping clocks will never happen with their CMT designs... the cache is their biggest problem - if they had Intel-like caches, their CMT designs could really shine.

    So, with Zen, what Jim Keller did was look at where AMD was strong, and where they were weak... and then sought to create a design that exploited AMD's strengths.

    In a construction-core module you have four ALUs, but only two can be used on one thread - if all four could be used on one thread you'd have better performance for that one thread. Allowing the other thread to also execute on those four ALUs makes total sense - you don't have to completely recreate everything, AMD already had all the IP... you could get the design done in just a couple of years that way.

    Reuse reuse reuse!

    (Some of this is assumption)

    So Zen uses the original Bulldozer's front-end, slightly modified. It uses the threading logic, decode logic, just about everything really. Then the front-end was updated with the newer prediction logic and an instruction cache for loop optimization and reduced mispredict penalties (Intel does something similar to great effect). The three pathways to the schedulers were repurposed - one for integer, one for memory, and one for the floating point.

    All of the ALUs from the module were moved to the single integer scheduler and the execution resources were more evenly distributed according to the needs of two threads, so DIV and MUL were split between two different pipelines, and only one DIV and one MUL unit included - no need to keep two of each... that just wastes power. He kept both branch units - they're even still spaced far apart.

    The AGUs were made more pure, but they kept a few abilities that enable their direct use for indirect branching, 'leave', shifts, and for streaming vector math, and other purposes. Using two fully-capable, identical AGUs made the memory scheduling easier. Adding a third may have been beneficial, but it seems there were concerns about power usage/die size/utilization and who knows what else... or perhaps it was found that the third one was not needed due to other optimizations (improved page walks, caches, etc.) so they excluded the underutilized AGU (or perhaps made those improvements to make up for it...).

    The FPU was given the same type of quick special treatment, then they started working on all the nitty gritty of making it all work together.

    And that is how the Zen was made.

    (or so I believe)
     
  16. Dresdenboy

    Dresdenboy Golden Member

    Joined:
    Jul 28, 2003
    Messages:
    1,709
    Likes Received:
    504
  17. frozentundra123456

    frozentundra123456 Diamond Member

    Joined:
    Aug 11, 2008
    Messages:
    9,516
    Likes Received:
    262
    Off topic in this forum, but that is very strange spacing for the 3 models. 470 is very close to the 480 in SP, while there is a huge gap down to the 460. I was hoping for something in the range of 1280 SP with a sub hundred watt TDP, or maybe even able to run from the PCIe slot alone.

    Looking at compute performance and SP count vs TDP, it doesn't appear markedly more efficient than the 480. Maybe 5 to slightly over 10%. Maybe in the real world it will fare better.
     
  18. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    712
    Likes Received:
    1,605
    It is strange to me as well. Less than half the SPs of the RX 470... However, the die holds exactly half (1024), which would make sense.

    The 470 being so close to the 480 is a surprise as well. If I were in charge, the RX 480 would be 2816 SPs like Hawaii, with only 1 GHz clock speeds (despite its energy efficiency). It would perform better, that's for sure.

    The RX 470 would be 2048SPs.

    And the RX 460 would be 1280, again with lower clocks.

    AMD should have learned by now to keep their clocks in the peak efficiency range... which is made easier by using a larger-than-needed GPU.

    The ROPs are all fine, though.
     
  19. dark zero

    dark zero Platinum Member

    Joined:
    Jun 2, 2015
    Messages:
    2,074
    Likes Received:
    32
    OFF Topic again: Maybe because the 460 is a cut-down chip... the full die is 1280 and the one that is about to go on sale has less than that.

    ON Topic: Ok, Zen is supposed to hit this year, but I wonder when Zen is going to hit the budget market?
    And I read that there is a Zen Lite. What is that supposed to be?
     
  20. superstition

    superstition Platinum Member

    Joined:
    Feb 2, 2008
    Messages:
    2,219
    Likes Received:
    215
    link

    Zen Lite is a pretty awful mixed metaphor.
     
  21. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,202
    Likes Received:
    1,363
    I'd say right at the launch, unless AMD manages to increase the clocks on Zeppelin significantly (by 15-20%) within ~ a month*. The yields most likely are far from great, so there should be a good supply of harvested dies with two cores per CCX and 4MB of L3 fully functional.

    Therefore it would be wise to release harvested SKUs also at the same time, just to gain momentum and to lock people on AM4 / DDR4 infra. I'm aware that AMD has previously stated that initially there wouldn't be 4C/8T models available, but I expect that they made the statement at a time when the behavior of the manufacturing process wasn't known in the full extent yet.

    *(if expecting any availability in 2016 and considering the lead time of ~ 3 months from PR status to shipping).

    Unless the clock frequencies on all SKUs improve significantly from the current (alleged) figures then I would expect:

    - 4C/8T top shelf SKU < 199$
    - 8C/16T top shelf SKU < 359$

    IMO AMD cannot ask higher than that for any Zeppelin consumer SKU. i7-5820K sells for > 369$ and it should be significantly faster in ST workloads, tie in legacy MT workloads and annihilate in those few rare ST/MT 256-bit workloads. Also it will overclock better than Zeppelin does.

    /Speculation
     
  22. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    712
    Likes Received:
    1,605
    I suspect Zen lite is just Zen without the L3 (or a minimally sized L3 if the intra-module communication requires it), and fewer SoC-like resources (PCI-e lanes and the like).

    The use of fewer memory chips is an issue - it means less bandwidth. Really makes me think that 512MB "System RAM" might be HBM. A single stack of HBM2 would provide 256GB/s of bandwidth... which looks oddly familiar to another product we know which has 2304SPs...
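The 256GB/s figure can be checked with the usual peak-bandwidth formula: bus width in bits times per-pin data rate, divided by eight. The sketch below assumes figures that are not stated in the thread: a 1024-bit interface at 2 Gbps per pin for one HBM2 stack, and a 256-bit GDDR5 bus at 8 Gbps per pin for the 2304SP card being alluded to (taken here to be the 8GB RX 480). The function name is hypothetical.

```python
def peak_bandwidth_gbs(bus_width_bits, gbps_per_pin):
    """Peak memory bandwidth in GB/s from bus width and per-pin data rate."""
    return bus_width_bits * gbps_per_pin / 8


if __name__ == "__main__":
    # Assumed figures, not taken from the thread:
    hbm2_stack = peak_bandwidth_gbs(1024, 2.0)   # one HBM2 stack
    gddr5_256bit = peak_bandwidth_gbs(256, 8.0)  # 256-bit GDDR5 at 8 Gbps
    print(f"HBM2 single stack: {hbm2_stack:.0f} GB/s")
    print(f"256-bit GDDR5:     {gddr5_256bit:.0f} GB/s")
```

Under these assumed figures both configurations come out at the same peak number, which is the coincidence the post hints at.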
     
  23. superstition

    superstition Platinum Member

    Joined:
    Feb 2, 2008
    Messages:
    2,219
    Likes Received:
    215
    Maybe Zen Lite will be 8C with 8 threads (à la i5).
     
  24. KTE

    KTE Senior member

    Joined:
    May 26, 2016
    Messages:
    477
    Likes Received:
    130
    So these are two completely separate departments.
    A design aims for certain process level characteristics... Many years later, when that design is ready, it then needs to be re-evaluated for that process in time and realistic targets set. It's a management decision whether to go ahead with what they can currently achieve or delay the design for more tweaking on a better process. Even an excellent uarch can fail with a poor process. But the public only sees them as one and the same.

    We don't really know if caches are the only major bottleneck here, but I would wager there is A LOT more holding these CMT designs back. Poor branch prediction being one.

    And significantly more cache contention/thrashing and power draw :)

    (Opinions are own)
     
    #2574 KTE, Jul 28, 2016
    Last edited: Jul 28, 2016
  25. superstition

    superstition Platinum Member

    Joined:
    Feb 2, 2008
    Messages:
    2,219
    Likes Received:
    215
    I remember reading an analysis that said one reason why Sandy beat Bulldozer so much is because Bulldozer lacked that.

    So, this is probably the analysis I recall reading (my bold added).
    Interesting statement in bold. Maybe AMD just has lacked the resources needed to make a competitive CMT-based design.