New Zen microarchitecture details

Discussion in 'CPUs and Overclocking' started by Dresdenboy, Mar 1, 2016.

  1. leoneazzurro

    leoneazzurro Member

    Joined:
    Jul 26, 2016
    Messages:
    113
    Likes Received:
    39
    It is entirely OT here, but anyway: RX470 is probably set at a lower voltage, and its boost clock is not so different from RX480's (and when gaming, this will matter more). What AMD needs is to work out how to feed its engine efficiently - theoretical peak rates on Ellesmere hint at a lot of untapped potential - and how to use memory bandwidth and fill rate efficiently: it is amusing to see how well GTX1060 fares at high resolution despite its bandwidth disadvantage.
     
  2. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    713
    Likes Received:
    1,623
    Yes, riggnix was responding to NostaSeronx about clock-speed and power usage ramifications of Zen vs Bulldozer designs favoring Bulldozer - and riggnix stated "Maybe that's the whole point."

    And it probably was - AMD didn't expect to be able to clock higher, so the design-enabled higher clocks of the Construction cores are going to waste.

    AMD was aware that every move to a new process with the Construction cores had reduced clock speed, so they knew upcoming processes would likely follow the same pattern and that they'd need to design for IPC and forgo frequency.

    The Construction cores actually have decent branch predictors; it's just the penalty that causes problems - and the penalty is largely derived from the slow caches.

    Not so much, really. You already had many of these resources shared in a module, three schedulers acting on them, and a net 4+3+4 pipeline design. In a way, Zen will have fewer issues with the 4+2+4 design, improved buffers, and presumably improved schedulers and prediction capabilities.

    Con cores have four pipelines per core and three for the FPU, IIRC, which makes 11 pipelines per module fed by shared resources that are inferior to what will be feeding Zen's 10 pipelines.
     
  3. KTE

    KTE Senior member

    Joined:
    May 26, 2016
    Messages:
    477
    Likes Received:
    130
    Decent, but not good enough compared to Intel's. Run some branchy bench you know of and use CodeAnalyst to check the misprediction rates. I'll run the same with Intel Skylake.

    I disagree that it's just cache and the rest was equally matched to Intel.

    Slow cache is also a major oversimplification - it's the number of ways, it's how many accesses per line can be made simultaneously, it's the latency and bandwidth under cache contention.
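
    For anyone wanting a feel for what "misprediction rate" means before running a real profiler, here is a minimal Python sketch of a single 2-bit saturating counter, the textbook toy predictor. It is only an illustration: the function name `mispredict_rate` and both branch patterns are made up for this example, and it says nothing about how the actual predictors in Excavator or Skylake behave. Real numbers would still have to come from hardware counters via a tool such as CodeAnalyst.

```python
import random


def mispredict_rate(outcomes):
    """Simulate one 2-bit saturating counter over a sequence of branch outcomes.

    Counter states 0-1 predict not-taken, states 2-3 predict taken.
    Returns the fraction of branches that were mispredicted.
    """
    counter = 2  # start in "weakly taken" (arbitrary choice for this sketch)
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        if taken:
            counter = min(counter + 1, 3)
        else:
            counter = max(counter - 1, 0)
    return misses / len(outcomes)


# Loop-like branch: taken nine times, then not taken once, repeated.
loop_pattern = ([True] * 9 + [False]) * 1000

# Data-dependent branch on random data (fixed seed so runs are repeatable).
rng = random.Random(0)
random_pattern = [rng.random() < 0.5 for _ in range(10_000)]

print(f"loop-like branch:    {mispredict_rate(loop_pattern):.3f}")
print(f"random 50/50 branch: {mispredict_rate(random_pattern):.3f}")
```

    In this toy model the loop-like pattern mispredicts once per ten branches (the loop exit), while the random pattern sits near 50%, which is why branchy code on unpredictable data is the interesting test case.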

     
  4. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    713
    Likes Received:
    1,623
    You make the project, I'll run it on three different systems (Excavator, Sandy Bridge, Deneb).

    Excavator's branch prediction rates seem like they should be better than Sandy Bridge's, judging by Agner's comments.

    http://www.agner.org/optimize/microarchitecture.pdf

    Pages 28 & 33

    Not what I'm saying at all, I'm just comparing Zen to the Construction cores. AMD does have areas where it has been stronger than Intel - and Zen looks to attempt to exploit that. Zen+ looks to double down on that strategy (if the 15% boost is to be believed).

    I wouldn't call it an oversimplification - all the things you said make a cache slow. Slow != bandwidth alone. Slow comes in terms of latency, throughput, and any combination thereof... and nearly everything is tied to the performance of the caches one way or another.

    If Excavator had even Sandy Bridge level caches, things would be a lot different.

    [IMG]
    Oh, and that's 5 GHz Bulldozer vs 4.5 GHz Sandy Bridge.
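
    To put rough numbers on why cache latency is tied to nearly everything else, the standard average memory access time (AMAT) formula can be sketched in a few lines of Python. Every value below is an illustrative placeholder, not a measurement of Excavator, Sandy Bridge, or any other CPU, and the helper name `amat` exists only for this example.

```python
def amat(l1_latency, l1_miss_rate, l2_latency, l2_miss_rate, mem_latency):
    """Average memory access time in cycles for a two-level cache hierarchy.

    AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory latency)
    """
    return l1_latency + l1_miss_rate * (l2_latency + l2_miss_rate * mem_latency)


# Placeholder inputs: only the L2 hit latency differs between the two cases.
slow_l2 = amat(l1_latency=4, l1_miss_rate=0.05,
               l2_latency=20, l2_miss_rate=0.2, mem_latency=200)
fast_l2 = amat(l1_latency=4, l1_miss_rate=0.05,
               l2_latency=12, l2_miss_rate=0.2, mem_latency=200)

print(f"AMAT with 20-cycle L2: {slow_l2:.2f} cycles")
print(f"AMAT with 12-cycle L2: {fast_l2:.2f} cycles")
```

    The only change between the two calls is the L2 hit latency, yet it moves the average cost of every memory access. Any real comparison would need measured latencies and miss rates plugged in.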
     
    #2579 looncraz, Jul 28, 2016
    Last edited: Jul 28, 2016
  5. itsmydamnation

    itsmydamnation Golden Member

    Joined:
    Feb 6, 2011
    Messages:
    1,467
    Likes Received:
    415
    The L3 is rather irrelevant, as in Bulldozer it's just an eviction cache (anything faster than main memory is a benefit); it's the horrible L2. David Kanter blames a lot of Bulldozer's issues on the L1D/L1I/L2 arrangement. I think 10-12 cycles is a reasonable expectation for a 512K L2 in Zen.

    According to the leaks around CERN, L1D and L1I are 64K each, which, coupled with the much faster L2 and probably much faster L3, will be a massive improvement for single-thread perf, though probably not much of a difference for throughput. Even if they are just 32K each, that should be sufficient when backed by a fast, good-sized L2 (on the assumption the L1I is better than the current 3-way).

    If I were a betting man, I would bet that Zen has just as long a pipeline as CON (my guess is the front end and L/S are an evolution of the CON cores), so it will be interesting to see how/if they can reduce failed branches and the associated penalty. Will we see the much-patented retirement queue cache/trace cache, checkpointing, etc.? I expect they will have done something to alleviate the 20+ cycle branch miss penalty of a pipeline of that length.
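
    A back-of-the-envelope way to see why a 20+ cycle miss penalty matters: the common first-order model adds the average misprediction stall per instruction to a base CPI. The sketch below is in Python. The base CPI, branch fraction, and misprediction rates are assumed placeholder values, and `effective_cpi` is a hypothetical helper, not anything from AMD.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """First-order model: base CPI plus average misprediction stall cycles per instruction."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Placeholder workload: 20% of instructions are branches, base CPI of 0.5.
BASE_CPI = 0.5
BRANCH_FRACTION = 0.2

for rate in (0.05, 0.02):
    for penalty in (20, 15):
        cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, rate, penalty)
        print(f"mispredict rate {rate:.0%}, penalty {penalty} cycles -> CPI {cpi:.3f}")
```

    The model exposes the two levers: either predict better (lower rate) or recover faster (lower penalty). A long pipeline fixes the penalty high, so the pressure lands on the predictor and on tricks that shorten recovery.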
     
  6. coffeemonster

    coffeemonster Member

    Joined:
    Apr 18, 2015
    Messages:
    180
    Likes Received:
    66
    Hypothetical thought: after Kaveri, Carrizo/Excavator was designed for efficiency foremost, no? To become the cat cores' replacement. So what would have happened if they had instead made the Kaveri arch improvements on the mature 32nm SOI and continued with an FX-8400, and later brought the Excavator arch improvements (perhaps without using High Density Libraries?) to an FX-8500? I just wonder what sort of CPU would have come out if the FX line had continued with all the CON core improvements on 32nm SOI, since that seems to be the node best suited to the uarch.
     
  7. NostaSeronx

    NostaSeronx Golden Member

    Joined:
    Sep 18, 2011
    Messages:
    1,629
    Likes Received:
    45
    If it was a drop and optimize port it would not be on 32nm PDSOI but on 22nm FDSOI.

    With an optimized Orochi die on FDSOI with OD+FBB/SRAM+RBB, we are looking at ~4.5 GHz as the nominal clock, with 30-40% yield aiming towards 5 GHz.

    So, FX-9600 (117W/8-core Top SKU)/FX-9400 (88W/8-core Nominal SKU)/FX-7400 (88W/6-core Salvaged SKU)/FX-5400 (88W/4-core Salvaged SKU). 170-to-156.42-to-140 mm² range. (Would require internal Northbridge/Southbridge to support AM4)

    With an optimized Gecko (SR28)/Basilisk (XV20), we would be looking at a nominal all-core clock of ~4 GHz @ 140 watts w/ a 16-core SKU. 360-to-344.124-to-320 mm² range. (Would just need a southbridge to support AM4)

    22FDSOI has ~55% lower PVT variation than 14LPP, currently. Which means at this moment 0.3_PDK_22 is faster than 1.2_PDK_14.
     
    #2582 NostaSeronx, Jul 28, 2016
    Last edited: Jul 28, 2016
  8. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,216
    Likes Received:
    1,415
    A few weeks ago I actually tested Vishera with the L3 caches disabled. The only workloads I noticed any performance difference in were WinRAR and 7-Zip. In WinRAR the performance was ~2% higher with L3 enabled and in 7-Zip 3.7% higher. This was at 1866MHz MEMCLK, so at higher memory speeds the difference would have been even lower.
     
  9. DrMrLordX

    DrMrLordX Diamond Member

    Joined:
    Apr 27, 2000
    Messages:
    8,173
    Likes Received:
    411
    How about Cinebench R15?
     
  10. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,216
    Likes Received:
    1,415
    Within margin of error ±2 points.
     
  11. NostaSeronx

    NostaSeronx Golden Member

    Joined:
    Sep 18, 2011
    Messages:
    1,629
    Likes Received:
    45
    How about power consumption?

    L3 on vs L3 off.
     
  12. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,216
    Likes Received:
    1,415
    Cannot be measured easily. The total NB (NB + 4x 2MB L3) worst case power consumption on Vishera is ~15W so it is nearly irrelevant. For a Piledriver based NPU (without GPU) the NB power consumption is ~8W.
     
  13. itsmydamnation

    itsmydamnation Golden Member

    Joined:
    Feb 6, 2011
    Messages:
    1,467
    Likes Received:
    415
    Yeah, things with bigger, more often accessed data sets will see more benefit from the extra cache. Did you test any games? I would expect a bigger difference there.
     
  14. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,216
    Likes Received:
    1,415
    Nope, just some of the most common benchmarks. I would be shocked if any other workload would show greater gains than 7-Zip.
     
  15. itsmydamnation

    itsmydamnation Golden Member

    Joined:
    Feb 6, 2011
    Messages:
    1,467
    Likes Received:
    415
    Compress/decompress workloads are only a small subset; heavily branching logic code will stress prefetch, prediction, and cache far more. Back in the Phenom II days the 6MB L3 was good for around 10-15% in games.
     
  16. Dresdenboy

    Dresdenboy Golden Member

    Joined:
    Jul 28, 2003
    Messages:
    1,713
    Likes Received:
    515
    These numbers are helpful. Is it safe to assume that a Zeppelin die might have ~6W NB power (lower for internal logic, roughly similar for off-die I/O)?

    Edit:
    @all: From here:
    OK, there is no ARM based core in this list...

    This could be a result of component sharing between both designs. The question is: are there any upcoming 16nm FinFET CPUs on the internal AMD roadmap?
     
    #2591 Dresdenboy, Jul 29, 2016
    Last edited: Jul 29, 2016
  17. KTE

    KTE Senior member

    Joined:
    May 26, 2016
    Messages:
    477
    Likes Received:
    130
    10-12 cycle L2 would be a very good move for Zen.
     
  18. BigDaveX

    BigDaveX Senior member

    Joined:
    Jun 12, 2014
    Messages:
    235
    Likes Received:
    12
    There seems to be something about these "speed demon" designs whereby they hardly gain anything from cache size increases. Back in the day, the Pentium 4 got some pretty nice performance increases when the cache was doubled in Northwood, and then when the Extreme Edition slapped on that huge (for the time) L3 cache. Then Prescott made the pipeline crazy deep, and all of a sudden we had a situation where the IPC difference between a Celeron D with 256KB of L2 cache and a Pentium 4 600-series with 2MB of L2 cache was fairly negligible.
     
  19. krumme

    krumme Diamond Member

    Joined:
    Oct 9, 2009
    Messages:
    4,537
    Likes Received:
    568
    Ehh, wasn't that a cache size decrease example? :)
     
  20. Phynaz

    Phynaz Diamond Member

    Joined:
    Mar 13, 2006
    Messages:
    9,384
    Likes Received:
    261
    Isn't there an ARM core inside Zen acting as a system agent?
     
  21. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,216
    Likes Received:
    1,415
    The SMU should still be RISC based (LM32), however I'm not 100% sure.
     
  22. Phynaz

    Phynaz Diamond Member

    Joined:
    Mar 13, 2006
    Messages:
    9,384
    Likes Received:
    261
    ^ Wikipedia

    It's the security processor I was thinking of - AMD Trust Zone.

    http://www.amd.com/en-us/press-releases/Pages/amd-strengthens-security-2012jun13.aspx
     
    #2597 Phynaz, Jul 29, 2016
    Last edited: Jul 29, 2016
  23. Dresdenboy

    Dresdenboy Golden Member

    Joined:
    Jul 28, 2003
    Messages:
    1,713
    Likes Received:
    515
    The PSP is using an ARM Cortex-A5. David Kaplan, the guy who held the CPU design talk at CCC, is mainly involved in the PSP stuff. But Hon Hin Wong worked on CPU schedulers and the like, at a much lower level than the PSP. I think, even if the K12 had been cancelled, many people had already worked on it for years, and even more on components reused in both uarchs.
     
  24. Hi-Fi Man

    Hi-Fi Man Senior member

    Joined:
    Oct 19, 2013
    Messages:
    467
    Likes Received:
    31
    AMD seems to have a lot of sound designs but they are usually held back by their poor cache and memory performance. If AMD could improve that, I think it would bring them much closer to where they need to be.

    I'm also worried about their chipset. Rumors say ASMedia is designing it, but rumors also say there have been delays/issues. Not only does AMD have to execute their CPU well, they also have to execute their chipset well. This is an area where I think AMD will be able to provide an advantage over Intel at the same price point, at least in the mainstream market.

    There's a lot riding on GlobalFoundries' 14nm process and so far it doesn't seem to be that great but hopefully from now till Zen launch it'll mature a bit. I can only wonder what Zen on IBM's (now GF's) 22nm FDSOI would look like...
     
  25. DrMrLordX

    DrMrLordX Diamond Member

    Joined:
    Apr 27, 2000
    Messages:
    8,173
    Likes Received:
    411
    A lot of us are wondering about that. Back in 2014 it gave us POWER8 CPUs in the 4.7 GHz territory, albeit at massive power draw. Those were big honkin' chips though. Considering how much improvement GF has made with their 32nm and 28nm processes, you'd think 22nm FDSOI would have seen some improvements since fall of 2014.