New Zen microarchitecture details

Discussion in 'CPUs and Overclocking' started by Dresdenboy, Mar 1, 2016.

  1. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    715
    Likes Received:
    1,634
    You make the project, I'll run it on three different systems (Excavator, Sandy Bridge, Deneb).

    Excavator's branch prediction rates seem like they should be better than Sandy Bridge's, judging by Agner's comments.

    http://www.agner.org/optimize/microarchitecture.pdf

    Pages 28 & 33

    Not what I'm saying at all, I'm just comparing Zen to the Construction cores. AMD does have areas where it has been stronger than Intel - and Zen looks to attempt to exploit that. Zen+ looks to double down on that strategy (if the 15% boost is to be believed).

    I wouldn't call it an oversimplification - all the things you said make a cache slow. Slow != bandwidth alone. Slow comes in terms of latency, throughput, and any combination thereof... and nearly everything is tied to the performance of the caches one way or another.

    If Excavator had even Sandy Bridge level caches, things would be a lot different.

    [​IMG]
    Oh, and that's 5 GHz Bulldozer vs 4.5 GHz Sandy Bridge.
     
    #2576 looncraz, Jul 28, 2016
    Last edited: Jul 28, 2016
  2. itsmydamnation

    itsmydamnation Golden Member

    Joined:
    Feb 6, 2011
    Messages:
    1,702
    Likes Received:
    601
    The L3 is rather irrelevant, as in Bulldozer it's just an eviction cache (anything faster than main memory is a benefit); the problem is the horrible L2. David Kanter attributes a lot of Bulldozer's issues to the L1D/L1I/L2 arrangement. I think a 10-12 cycle L2 for Zen is a reasonable expectation for a 512 KB L2.
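As a rough sketch of why L2 latency matters so much here, the usual average-memory-access-time (AMAT) model can be written out. Every number below is an assumption picked for illustration (L1 latency, miss rates, memory latency); only the 10-12 cycle target comes from the post, and the 20-cycle "slow" L2 is an assumed stand-in rather than a measured Bulldozer figure.

```python
def amat(l1_hit_cycles, l1_miss_rate, l2_hit_cycles, l2_miss_rate, mem_cycles):
    """Average memory access time (cycles) for a simple L1 -> L2 -> memory hierarchy."""
    # Every access pays the L1 latency; L1 misses pay the L2 latency;
    # L2 misses additionally pay the memory latency.
    return l1_hit_cycles + l1_miss_rate * (l2_hit_cycles + l2_miss_rate * mem_cycles)


# Hypothetical parameters, not measurements.
L1_HIT_CYCLES = 4
L1_MISS_RATE = 0.05
L2_MISS_RATE = 0.20
MEM_CYCLES = 200

slow_l2 = amat(L1_HIT_CYCLES, L1_MISS_RATE, 20, L2_MISS_RATE, MEM_CYCLES)  # assumed slow L2
fast_l2 = amat(L1_HIT_CYCLES, L1_MISS_RATE, 11, L2_MISS_RATE, MEM_CYCLES)  # middle of 10-12 cycles

print(f"AMAT with 20-cycle L2: {slow_l2:.2f} cycles")
print(f"AMAT with 11-cycle L2: {fast_l2:.2f} cycles")
```

With these made-up inputs the faster L2 lowers the average access time even though the L3 and memory are untouched, which is the point being argued. Real miss rates vary wildly by workload, so treat the absolute numbers as meaningless.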

    According to the leaks around CERN, L1D and L1I are 64 KB each, which, when coupled with the much faster L2 and probably much faster L3, will be a massive improvement for single-thread performance, though probably not much of a difference for throughput. Even if they are just 32 KB each, that should be sufficient when backed by a fast, good-sized L2 (on the assumption that the L1I is better than the current 3-way design).

    If I were a betting man, I would bet that Zen has just as long a pipeline as CON (my guess is that the front end and L/S are an evolution of the CON cores), so it will be interesting to see how, or if, they can reduce mispredicted branches and the associated penalty. Will we see the much-patented retirement queue cache/trace cache, checkpointing, etc.? I expect they will have done something to alleviate the 20+ cycle branch miss penalty of a pipeline of that length.
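To put the 20+ cycle penalty in perspective, here is a minimal back-of-the-envelope sketch of the standard CPI model with a misprediction term. The base CPI, branch fraction, and misprediction rates are assumptions chosen only for illustration; just the 20-cycle penalty is taken from the post.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Cycles per instruction once branch misprediction stalls are added."""
    # Each instruction is a branch with probability branch_fraction, and each
    # branch is mispredicted with probability mispredict_rate.
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, ideal CPI of 1.0.
BASE_CPI = 1.0
BRANCH_FRACTION = 0.20
PENALTY_CYCLES = 20  # the "20+ cycle" figure from the post

cpi_5pct = effective_cpi(BASE_CPI, BRANCH_FRACTION, 0.05, PENALTY_CYCLES)
cpi_3pct = effective_cpi(BASE_CPI, BRANCH_FRACTION, 0.03, PENALTY_CYCLES)

print(f"CPI at 5% mispredicts: {cpi_5pct:.2f}")
print(f"CPI at 3% mispredicts: {cpi_3pct:.2f}")
```

Under these assumed inputs, trimming the misprediction rate or the penalty both shrink the same product term, which is why a better predictor and a cheaper recovery path are interchangeable levers for a long pipeline.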
     
  3. coffeemonster

    coffeemonster Senior member

    Joined:
    Apr 18, 2015
    Messages:
    214
    Likes Received:
    83
    Hypothetical thought: after Kaveri, Carrizo/Excavator was designed for efficiency foremost, no? To become the CAT cores' replacement. So what would have happened if they had instead made the Kaveri arch improvements on the mature 32nm SOI and continued with an FX-8400, and later an FX-8500 with the Excavator arch improvements (perhaps without using High Density Libraries)? I just wonder what sort of CPU would have come out if the FX line had continued with all the CON core improvements on 32nm SOI, since that seems to be the node best suited to the uarch.
     
  4. NostaSeronx

    NostaSeronx Platinum Member

    Joined:
    Sep 18, 2011
    Messages:
    2,129
    Likes Received:
    187
    If it was a drop and optimize port it would not be on 32nm PDSOI but on 22nm FDSOI.

    Optimized Orochi die on FDSOI with OD+FBB/SRAM+RBB we are looking at ~4.5 GHz as nominal clock. With 30-40% yield aiming towards 5 GHz.

    So, FX-9600 (117W/8-core top SKU), FX-9400 (88W/8-core nominal SKU), FX-7400 (88W/6-core salvaged SKU), FX-5400 (88W/4-core salvaged SKU). 170-to-156.42-to-140 mm² range. (Would require an internal Northbridge/Southbridge to support AM4.)

    Optimized Gecko(SR28)/Basilisk(XV20) we would be looking at a nominal all core of ~4 GHz @ 140 watts w/ a 16-core SKU. 360-to-344.124-to-320 mm² range. (Would just need a southbridge to support AM4)

    22FDSOI has ~55% lower PVT variation than 14LPP, currently. Which means at this moment 0.3_PDK_22 is faster than 1.2_PDK_14.
     
    #2579 NostaSeronx, Jul 28, 2016
    Last edited: Jul 28, 2016
  5. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,664
    Likes Received:
    2,846
    A few weeks ago I actually tested Vishera with the L3 cache disabled. The only workloads in which I noticed any performance difference were WinRAR and 7-Zip. In WinRAR the performance was ~2% higher with L3 enabled, and in 7-Zip 3.7% higher. This was at 1866 MHz MEMCLK, so at higher memory speeds the difference would have been even lower.
     
  6. DrMrLordX

    DrMrLordX Diamond Member

    Joined:
    Apr 27, 2000
    Messages:
    9,683
    Likes Received:
    1,180
    How about Cinebench R15?
     
  7. The Stilt

    The Stilt Golden Member

    Within margin of error ±2 points.
     
  8. NostaSeronx

    NostaSeronx Platinum Member

    How about power consumption?

    L3 on vs L3 off.
     
  9. The Stilt

    The Stilt Golden Member

    Cannot be measured easily. The total NB (NB + 4x 2MB L3) worst-case power consumption on Vishera is ~15W, so it is nearly irrelevant. For a Piledriver-based APU (without GPU) the NB power consumption is ~8W.
     
  10. itsmydamnation

    itsmydamnation Golden Member

    Yeah, things with bigger, more frequently accessed data sets will see more benefit from the extra cache. Did you test any games? I would expect a bigger difference there.
     
  11. The Stilt

    The Stilt Golden Member

    Nope, just some of the most common benchmarks. I would be shocked if any other workload showed greater gains than 7-Zip.
     
  12. itsmydamnation

    itsmydamnation Golden Member

    Compress/decompress workloads are only a small subset; heavily branching code will stress prefetch, prediction, and cache far more. Back in the Phenom II days the 6 MB L3 was good for around 10-15% in games.
     
  13. Dresdenboy

    Dresdenboy Golden Member

    Joined:
    Jul 28, 2003
    Messages:
    1,730
    Likes Received:
    554
    These numbers are helpful. Is it safe to assume that a Zeppelin die might have ~6W NB power (lower for internal logic, roughly similar for off-die I/O)?

    Edit:
    @all: From here:
    OK, there is no ARM based core in this list...

    This could be a result of component sharing between both designs. The question is: are there any upcoming 16nm FinFET CPUs on the internal AMD roadmap?
     
    #2588 Dresdenboy, Jul 29, 2016
    Last edited: Jul 29, 2016
  14. KTE

    KTE Senior member

    Joined:
    May 26, 2016
    Messages:
    478
    Likes Received:
    130
    10-12 cycle L2 would be a very good move for Zen.
     
  15. BigDaveX

    BigDaveX Senior member

    Joined:
    Jun 12, 2014
    Messages:
    295
    Likes Received:
    50
    There seems to be something about these "speed demon" designs whereby they hardly gain anything from cache size increases. Back in the day, the Pentium 4 got some pretty nice performance increases when the cache was doubled in Northwood, and then when the Extreme Edition slapped on that huge (for the time) L3 cache. Then Prescott made the pipeline crazy deep, and all of a sudden we had a situation where the IPC difference between a Celeron D with 256KB of L2 cache and a Pentium 4 600-series with 2MB of L2 cache was fairly negligible.
     
  16. krumme

    krumme Diamond Member

    Joined:
    Oct 9, 2009
    Messages:
    5,536
    Likes Received:
    1,204
    Ehh, wasn't that a cache size decrease example? :)
     
  17. Phynaz

    Phynaz Lifer

    Joined:
    Mar 13, 2006
    Messages:
    10,047
    Likes Received:
    740
    Isn't there an ARM core inside Zen acting as a system agent?
     
  18. The Stilt

    The Stilt Golden Member

    The SMU should still be RISC based (LM32), however I'm not 100% sure.
     
  19. Phynaz

    Phynaz Lifer

    ^ Wikipedia

    It's the security processor I was thinking of - AMD Trust Zone.

    http://www.amd.com/en-us/press-releases/Pages/amd-strengthens-security-2012jun13.aspx
     
    #2594 Phynaz, Jul 29, 2016
    Last edited: Jul 29, 2016
  20. Dresdenboy

    Dresdenboy Golden Member

    The PSP is using an ARM Cortex-A5. David Kaplan, the guy who held the CPU design talk at CCC, is mainly involved in the PSP stuff. But Hon Hin Wong worked on CPU schedulers and the like, at a much lower level than the PSP. I think that even if K12 had been cancelled, many people had already worked on it for years, and even more on components reused across both uarchs.
     
  21. Hi-Fi Man

    Hi-Fi Man Senior member

    Joined:
    Oct 19, 2013
    Messages:
    524
    Likes Received:
    89
    AMD seems to have a lot of sound designs but they are usually held back by their poor cache and memory performance. If AMD could improve that, I think it would bring them much closer to where they need to be.

    I'm also worried about their chipset. Rumors say ASMedia is designing it, but rumors also say there have been delays/issues. Not only does AMD have to execute their CPU well, they also have to execute their chipset well. This is an area where I think AMD will be able to provide an advantage over Intel at the same price point, at least in the mainstream market.

    There's a lot riding on GlobalFoundries' 14nm process and so far it doesn't seem to be that great but hopefully from now till Zen launch it'll mature a bit. I can only wonder what Zen on IBM's (now GF's) 22nm FDSOI would look like...
     
  22. DrMrLordX

    DrMrLordX Diamond Member

    A lot of us are wondering about that. Back in 2014 it gave us POWER8 CPUs in 4.7 GHz territory, albeit at massive power draw. Those were big honkin' chips though. Considering how much GF has improved their 32nm and 28nm processes, you'd think 22nm FDSOI would have seen some improvements since the fall of 2014.
     
  23. The Stilt

    The Stilt Golden Member

    POWER9 will be using the 14nm HP FinFET process developed by IBM and now owned by GlobalFoundries. Now AMD could grow some balls and say that the WSA won't be fulfilled unless they get access to the process. I would assume that obligations to fulfill any existing contracts expire the moment a bankruptcy is declared anyway, right? So in that respect GlobalFoundries has nothing to lose.
     
  24. looncraz

    looncraz Senior member

    The "issue" that was leaked was just the standard affair about USB 3.1 - the same applies to Intel chipsets because you can only drive the signal so strong from the chip. Basically, USB 3.1 degrades over relatively short distances (and with relatively minor interference), so the traces from the chipset to the header/ports need to be short and clean - which is a serious packaging challenge.

    The solution is to either use a ~$2 driver chip to boost and filter the signal after a few inches or to do what HP did - and move the chipset right behind the USB 3.1 ports - and to keep the headers closer:
    [​IMG]


    I disagree with assertions that 14nm LPP isn't looking good - I think it is delivering handsomely. It took the poor-clocking and power hungry GCN architecture and gave it 20% better clocks and cut power usage of the GPU immensely.

    Yields are probably something of an issue at this time, but that should be resolved by the time Zen makes it to production.
     
  25. IntelUser2000

    IntelUser2000 Elite Member

    Joined:
    Oct 14, 2003
    Messages:
    5,416
    Likes Received:
    512
    Back with Pentium 4 Northwood, a shrink made the CPU drop its power use from ~90W to 54W, a big 40% drop. And they did not have to use the fancy power management techniques that are not only a requirement nowadays, but even with them don't bring the huge improvements that a "simple" shrink brought over a decade ago with Northwood.
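For anyone who wants to check the arithmetic, a tiny sketch using only the two wattage figures quoted in the post (the helper name is made up for illustration):

```python
def percent_drop(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# ~90W before the shrink, 54W after, as quoted in the post.
northwood_drop = percent_drop(90.0, 54.0)
print(f"Power reduction: {northwood_drop:.1f}%")
```

That works out to the 40% quoted, taking the ~90W figure as exactly 90W.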

    Back then, in a very general sense, we could call it a simple/straight shrink, because compared to uArch changes the effort put in was negligible. Of course, there were changes that would have been needed that we didn't know about.

    But nowadays, process improvements and even architectural overhauls bring much smaller changes, and at the same time it's much more difficult. If it was 1x effort and 1y gain, now it's 4x effort and 0.25y gain. There isn't such a thing as a simple/straight shrink anymore, not from an engineer's point of view. Remember how AMD promised massive, top-to-bottom changes for GCN4 aka GCN1.4 aka Polaris? Well, 10 years ago that kind of effort might have really brought massive improvements, but now such changes are a requirement to merely benefit from a process. It's clear that it's a fundamental limitation, like what humans are capable of (or not capable of).
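Taking the "1x effort, 1y gain" versus "4x effort, 0.25y gain" figures literally (they are the poster's rhetorical numbers, not measurements), the implied drop in return per unit of effort can be worked out in a few lines. The function name is hypothetical.

```python
def gain_per_effort(gain, effort):
    """Return on effort: how much gain each unit of effort buys."""
    return gain / effort


then_ratio = gain_per_effort(1.0, 1.0)   # "1x effort and 1y gain"
now_ratio = gain_per_effort(0.25, 4.0)   # "4x effort and 0.25y gain"
decline = then_ratio / now_ratio

print(f"Return per unit of effort then: {then_ratio}")
print(f"Return per unit of effort now:  {now_ratio}")
print(f"Decline factor: {decline:.0f}x")
```

So the claim amounts to a 16x worse return on engineering effort, if the illustrative figures are taken at face value.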

    So to blame or credit a process change alone is incorrect. Process is merely one ingredient of success, not the whole picture.
     
    #2600 IntelUser2000, Jul 30, 2016
    Last edited: Jul 30, 2016