New Zen microarchitecture details

Discussion in 'CPUs and Overclocking' started by Dresdenboy, Mar 1, 2016.

  1. krumme

    krumme Diamond Member

    Joined:
    Oct 9, 2009
    Messages:
    4,463
    Likes Received:
    535
    My one and only guess was 2.5 3.2 :)
    http://forums.anandtech.com/showthread.php?p=38267634
    No surprises here imo.
    And I expect relatively weak SMT too.
    Where I do disagree is the ST IPC assessment you make and build your perf estimates on.
    Based on some PPT slides and weak PR talk?
    I think when they are gunning for Broadwell IPC on integer (and why shouldn't they, especially with a core that wide?), that's more or less what they will get. Haswell integer ST IPC in Cinebench. FP more like Ivy Bridge.
    Efficiency will be the key. Then total system cost.
     
    #2251 krumme, Jul 17, 2016
    Last edited: Jul 17, 2016
  2. mrmt

    mrmt Diamond Member

    Joined:
    Aug 18, 2012
    Messages:
    3,976
    Likes Received:
    0
    It is not a ring, it is a bridge. But since you said that Intel's solution is suboptimal, what would the optimal solution be for you? And how is AMD's solution better than Intel's suboptimal one?
     
  3. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,202
    Likes Received:
    1,363
    On all designs since Trinity, EVT parts have existed at final clocks. Orochi B (Bulldozer) reached its final clocks with its second stepping from the launch stepping (B0 vs. B2 release). A1 stepping already reached 3.6GHz.

    If AMD plans to launch Zeppelin in 2016, they must have the final silicon available by now. Therefore most likely A0 (which was displayed by Lisa Su) is the final stepping for Zeppelin.
     
  4. FieryUP

    FieryUP Junior Member

    Joined:
    Jun 4, 2002
    Messages:
    17
    Likes Received:
    0
    Yes, it's a ring. It looks like a ring and it works like a ring. You can call it whatever you want though, it's your choice. It's suboptimal because it's got too many hops. It works best with slightly fewer cores; 24 cores is stretching the concept.

    I'm not saying AMD's solution is better. I'm only saying the whole GMI+SDP architecture was designed from the ground up to support 4 dies and a total of 32 cores, while the ring architecture that Intel uses wasn't designed for 32 cores at all, not even for 24 cores.

    To make Intel's ring architecture better, you would need to reduce the number of hops it takes to reach a core. You could, for example, connect each core not only to its 2 neighboring cores, but also to 2 more cores on the other side of the ring. A 2D mesh, and especially a 3D mesh, would be even better. AMD's solution is completely different, since it has 4 cores in a core complex, 2 core complexes on a die, and then dies are connected to each other inside the CPU package. But those dies are connected quite well (IMHO AMD did a great job on that), so what we'll need to see is what the worst latency would turn out to be.
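
    To put rough numbers on the hop-count argument, here is a minimal Python sketch that counts hops in three idealized topologies: a plain bidirectional ring, a ring with two extra links per core, and a 2D mesh. It treats the 24 cores as one simple ring and ignores bridges, cache slices and other agents, so it is not a model of any real Intel or AMD part. The function names and the chord offsets are assumptions made up for this illustration.

```python
from collections import deque


def ring(n):
    """Bidirectional ring: each node links to its two neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def ring_with_chords(n):
    """Ring plus two extra links per node to nodes near the opposite side.

    The chord offsets (n // 2 - 1 and n // 2 + 1) are an arbitrary choice
    for this sketch, not something taken from the discussion.
    """
    graph = ring(n)
    for i in range(n):
        for offset in (n // 2 - 1, n // 2 + 1):
            j = (i + offset) % n
            graph[i].add(j)
            graph[j].add(i)
    return graph


def mesh(rows, cols):
    """2D mesh without wrap-around links."""
    graph = {}
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            links = set()
            if r > 0:
                links.add(node - cols)
            if r < rows - 1:
                links.add(node + cols)
            if c > 0:
                links.add(node - 1)
            if c < cols - 1:
                links.add(node + 1)
            graph[node] = links
    return graph


def hops_from(graph, start):
    """Breadth-first search: minimum hop count from start to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def worst_case_hops(graph):
    """Largest minimum hop count between any two nodes (the diameter)."""
    return max(max(hops_from(graph, s).values()) for s in graph)


def average_hops(graph):
    """Mean minimum hop count over all ordered pairs of distinct nodes."""
    total = sum(sum(hops_from(graph, s).values()) for s in graph)
    return total / (len(graph) * (len(graph) - 1))


if __name__ == "__main__":
    topologies = [
        ("ring", ring(24)),
        ("ring + chords", ring_with_chords(24)),
        ("4x6 mesh", mesh(4, 6)),
    ]
    for name, graph in topologies:
        print(f"{name:14s} worst={worst_case_hops(graph):2d} "
              f"avg={average_hops(graph):.2f}")
```

    In this toy model a 24-node ring has a worst case of 12 hops and a 4x6 mesh has a worst case of 8, and the extra chord links bring the ring's worst case below 12. Real latency also depends on link width, clocking and arbitration, none of which is modeled here.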
     
  5. krumme

    krumme Diamond Member

    Joined:
    Oct 9, 2009
    Messages:
    4,463
    Likes Received:
    535
    #2255 krumme, Jul 17, 2016
    Last edited: Jul 17, 2016
  6. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,202
    Likes Received:
    1,363
    Isn't GMI just ganged PCI-E 3.0 lanes?
     
    #2256 The Stilt, Jul 17, 2016
    Last edited: Jul 17, 2016
  7. Arachnotronic

    Joined:
    Mar 10, 2006
    Messages:
    10,439
    Likes Received:
    837
    SKL-EP will use a mesh interconnect, rather than a ring bus. If it's like KNL, then it'll technically be a "mesh of rings."

    Anyway those architects at Intel know what they're doing.
     
  8. mrmt

    mrmt Diamond Member

    Joined:
    Aug 18, 2012
    Messages:
    3,976
    Likes Received:
    0
    How are the dies connected quite well? Did AMD provide amenities to the cores inside the package? "Connected well" is not exactly a technical term; it seems that you are just assuming AMD's solution is better because it is AMD.
     
  9. FieryUP

    FieryUP Junior Member

    Joined:
    Jun 4, 2002
    Messages:
    17
    Likes Received:
    0
    Maybe I just cannot be any more specific than that, because ... you know :) I think you can figure out the reason.

    Anyway, I'm not saying AMD's solution is better. All I'm saying is that Intel's current (and supposedly their best) solution is simply not suitable to scale up to 32 cores, while AMD's solution was designed to scale up to 32 cores. That's a big, albeit short-term, advantage on AMD's part. But it's also the least you can expect from a brand new server CPU designed from the ground up.

    So based on that, I simply assume that AMD's solution will work well with 32 cores -- maybe I'm stupid :) I'm not saying it will perform great, all I'm saying is that the way AMD connected the dies together inside the CPU package looks great on paper. Latency, especially in the worst-case scenario (when data must cross "die borders"), is crucial, but I don't have any way to test or benchmark that at this time.

    And I'm also aware of the fact that AMD showed us many-many great things on paper (slides) in the past 10 or so years, but then did "great" on underdelivery, delays, and canned many projects (Krishna, Wichita, Komodo, Kaveri 1.0 with GDDR5, the original Richland, Sepang, Macau, Terramar, Dublin, etc). It (ie. underdelivery and delays) could happen this time too, but at least on paper the 32-core Naples CPU looks very promising.
     
    #2259 FieryUP, Jul 17, 2016
    Last edited: Jul 17, 2016
  10. mrmt

    mrmt Diamond Member

    Joined:
    Aug 18, 2012
    Messages:
    3,976
    Likes Received:
    0
    I think you touched on the crux of the issue here: AMD product performance usually degrades a lot when going from paper to actual products.

    I also don't get why you are comparing the ring architecture from Intel's current chips with a future AMD product. Intel already said that they will change the current ring architecture, so whatever Zen will have (which is unimpressive IMO) will go against whatever Intel fields in the future. For example, AMD will not face today's 22-core chips but 28-core chips, so unless you expect AMD's performance to land within 10-15% of Skylake, they won't get even close to the performance crown.
     
  11. Dresdenboy

    Dresdenboy Golden Member

    Joined:
    Jul 28, 2003
    Messages:
    1,709
    Likes Received:
    504
    Did you consider any solution other than Intel's rings?

    Some speculation: With Zeppelin dies, AMD just has to connect core complexes. Everything below that granularity could be handled locally by the CCX. With that said, a mesh spanning individual cores is out. A concentrated mesh connecting the CCXs would work. Other options might be a folded torus (X, X+Y), double butterfly, butter donut, etc. Why would all these be worse than bridged ring buses?

    Projected, but missed performance, and promised, but cancelled products, are different things. But you're right, there were multiple "performance degradation" cases in the past, like with Barcelona's clock frequencies.
     
    #2261 Dresdenboy, Jul 17, 2016
    Last edited: Jul 17, 2016
  12. DrMrLordX

    DrMrLordX Diamond Member

    Joined:
    Apr 27, 2000
    Messages:
    8,072
    Likes Received:
    366
    Those clockspeeds seem low for the 4c and 8c parts. I was thinking/hoping for 3.3 GHz/4.0 GHz boost.

    32m parts at that clockspeed would do okay. I would have to know more about AVX2 market penetration before I could judge whether or not AMD can bag it on modern ISA extensions. But from all the technical data that's been released thus far, it does not look like Zen will have good 256-bit SIMD performance.

    Feel free to surprise me AMD.
     
  13. FieryUP

    FieryUP Junior Member

    Joined:
    Jun 4, 2002
    Messages:
    17
    Likes Received:
    0
    2.8/3.2 GHz are for the 4c/65W part. I too expect those clocks to go considerably higher when TDP is loosened up to 95W, _and_ AMD gets to another spin on the stepping (A1). The big question is how much better A1 would be, and when we can get to that stepping.
     
  14. NostaSeronx

    NostaSeronx Golden Member

    Joined:
    Sep 18, 2011
    Messages:
    1,607
    Likes Received:
    42
    Nope, it's Redwood 4.0 from Rambus. :whiste:
     
  15. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    712
    Likes Received:
    1,605

    This is what the simplest 16-core ring bus would look like when using quad-core modules with their own L3 (like Zen, but not based on anything I know [or think I know] about Zen...).

    [IMG: diagram of a 16-core ring bus built from four quad-core modules]

    If the purple ring is bidirectional, then you only need, max, two hops to reach the proper target module. If the L3 is inclusive, then the logical arrangement shown makes the most sense to me. Cache latencies dominate the performance aspect in this scenario, with it only ever taking just a few hops for the ring buses to do their thing.

    The green area with the yellow star is the shared L3 module cache, which contains a data bus which is pulling double-duty for inter-core data transfers. Each core in the module is connected to a simple ring bus to assert or receive communications on the ring bus.

    The easiest way to address a core in this system would be by MODULE:CORE addressing. Assuming internal ring bus communication begins at Core 1, Core A1 talking to core D4 would be the longest trip, with 10 hops. Nominal would be just four hops, however. If each hop took two cycles and the L3 cache was as slow as that on Bulldozer, worst-case inter-core latency would be ~85 cycles, with nominal being closer to 75 cycles. Best case would be 69 cycles.

    That's fairly consistent for a 16-core behemoth... and with just a very simple system. Zen will be better than this, no doubt, but the module organization has some pretty obvious advantages.
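
    The latency figures above follow a simple pattern: a fixed L3 access cost plus two cycles per hop. A minimal sketch of that arithmetic is below. The 65-cycle L3 figure and the 2-hop best case do not appear in the post; both are assumptions worked backwards from the quoted 85 and 69 cycle numbers, and the function name is made up for this illustration.

```python
# Assumption: 65 cycles of base L3 latency, back-derived from the post's
# figures (85 cycles at 10 hops, 2 cycles per hop). Not a measured value.
L3_LATENCY_CYCLES = 65

# From the post: each hop is assumed to take two cycles.
CYCLES_PER_HOP = 2


def inter_core_latency(hops, l3_cycles=L3_LATENCY_CYCLES,
                       cycles_per_hop=CYCLES_PER_HOP):
    """Estimated core-to-core latency in cycles for a given hop count."""
    return l3_cycles + hops * cycles_per_hop


if __name__ == "__main__":
    # Hop counts: 10 and 4 come from the post; 2 for the best case is an
    # assumption inferred from the quoted 69-cycle figure.
    for label, hops in [("worst", 10), ("nominal", 4), ("best", 2)]:
        print(f"{label:8s} {hops:2d} hops -> {inter_core_latency(hops)} cycles")
```

    Under these assumptions the worst case comes out at 85 cycles, the nominal case at 73 (which the post rounds to roughly 75) and the best case at 69, so the cache access dominates and the hops add comparatively little.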
     
    #2265 looncraz, Jul 18, 2016
    Last edited: Jul 18, 2016
  16. looncraz

    looncraz Senior member

    Joined:
    Sep 12, 2011
    Messages:
    712
    Likes Received:
    1,605
    You would still think that a 14nm LPP Zen CPU could have turbo clocks closer to 4 GHz.

    If Zen consumer APUs don't have turbo clocks nearer to 4 GHz while only having Haswell or lower IPC, with a 65W TDP, then Zen is basically a failure.

    I don't care about base clocks as low as 2.8 or 3 GHz - that's certainly acceptable for an 8-core 95W CPU. Turbo clocks should be much closer to 4 GHz, though, IMHO, to even be worth buying.

    Unless we are talking about $150 8-core CPUs still...
     
  17. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,202
    Likes Received:
    1,363
    I don't think there will be A) a 4C/8T 95W part or B) an A1 stepping.

    If Zeppelin were to launch in December, AMD would need to have the A1 stepping validated by September at the latest. Since we are beyond the middle of July already, chances for that are extremely slim as the current stepping appears to be A0.

    btw. Tamas?
     
    #2267 The Stilt, Jul 18, 2016
    Last edited: Jul 18, 2016
  18. FieryUP

    FieryUP Junior Member

    Joined:
    Jun 4, 2002
    Messages:
    17
    Likes Received:
    0
    It looks nice, but there's no ring in Naples ;)
     
  19. krumme

    krumme Diamond Member

    Joined:
    Oct 9, 2009
    Messages:
    4,463
    Likes Received:
    535
    The technical specs or whatever for some 2.5 torus are way over my head, but from a business perspective it makes sense for them to choose an interconnect that:

    Is easily scalable from 8 to 32
    Is easily scalable from 32 upwards (and not only from a hardware perspective) - perhaps this is a key to understanding the Zen solution for the cloud market?
    Is scalable on the hardware as well as the software side.
    Is cheap and fast to develop, meaning it leverages some standard and e.g. extends it. The Stilt's ganged PCIe lanes make sense here.
    Is known and tried in a server environment. SeaMicro IP makes sense here.
    Is fast to market and doesn't drain internal resources. Buying tech like SeaMicro is an obvious solution.
     
    #2269 krumme, Jul 18, 2016
    Last edited: Jul 18, 2016
  20. FieryUP

    FieryUP Junior Member

    Joined:
    Jun 4, 2002
    Messages:
    17
    Likes Received:
    0
    Why not? AMD already has 4-core Athlon X4 CPUs based on Kaveri and Carrizo. I'm not saying it's what the masses demand all day long, but AMD is famous for riding on niche markets, "slicing the salami" as we say around here. If they take 8-core Summit Ridge dies where there's an issue in one of the core complexes, and they disable that core complex, they can re-purpose such parts that otherwise would go to the trashcan. And such 4c/8t 95W parts could go head-to-head against i7-6700K. Of course it wouldn't have an iGPU, but most enthusiasts would have a dGPU to begin with...

    I would personally worry more about capturing the performance crown. AMD would need something to go against the i7-6950X. Not that it would bring in a lot of $$; it would be solely for the purpose of peeling off the current sticker (stigma) of "affordable, but not that high performance". AMD's golden age was when they owned the performance crown with Athlon 64 and then Athlon 64 X2.

    I agree, but I don't think a December market launch would work out anyway. I expect a December press launch (paper launch) and a January market availability. That would give more time for A1 to be validated. If AMD can push the clocks even only 200 MHz higher by going from A0 to A1, then I'd say it's worth one or two months of delays. Summit Ridge needs to work awesome for AMD to get back in the game.

    Yep :)
     
    #2270 FieryUP, Jul 18, 2016
    Last edited: Jul 18, 2016
  21. krumme

    krumme Diamond Member

    Joined:
    Oct 9, 2009
    Messages:
    4,463
    Likes Received:
    535
    The process was developed to make a profit in the all-important smartphone market.
    An A1 metal layer spin is not going to make a difference for frequency. It will take more like a year to significantly alter the process. By that time we are at Zen Plus.
    I wouldn't worry so much. Of course it's bad for desktop users who want something that can compete with even a regular i5 on ST. But for servers it's efficiency that matters, and they stand no chance there if they go even a bit outside optimal power.
    High frequency will come. It will just take a year or two.
    It's far better to have solid IPC from day one. It's a mess to alter that later on; just look at BD.
     
  22. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,202
    Likes Received:
    1,363
    To me the 95W 4C/8T appears to be pretty unlikely, since the Fmax appears to be fully limited by the manufacturing process itself and not by the power limit (based on the alleged extremely tiny delta between the base clock and max turbo). So increasing the TDP from 65W to 95W would basically make no difference in the clocks or the performance.

    Based on Polaris 10 silicon characteristics, I'd expect that AMD would gain significantly more from a more mature manufacturing process than from a new silicon revision.

    Anyway, good to see you here Tamas ;)
     
  23. FieryUP

    FieryUP Junior Member

    Joined:
    Jun 4, 2002
    Messages:
    17
    Likes Received:
    0
    Thank you :) I'm not sure what you mean by "extremely tiny delta"... 2.8/3.2 is the part in question, so a 400 MHz delta is that small? True, parts like the i7-6700 have a 600 MHz delta, but 400 MHz is not that much smaller than 600 MHz, so it's not tiny compared to it :) With the 6700K it's only 200 MHz, and with the 6800K it's similarly 400 MHz.
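
    One way to compare these deltas is to express the turbo headroom as a percentage of the base clock. The sketch below does that for the rumoured 2.8/3.2 GHz part and two of the Intel parts mentioned. The Intel base and turbo clocks are filled in from Intel's published specifications rather than from this thread, the 6800K is left out because only its delta is given here, and the function and table names are made up for this illustration.

```python
# Clocks in MHz as (base, max turbo). The Zen entry is the rumoured
# 4c/65W part discussed in the thread; the Intel entries are assumptions
# taken from Intel's public spec sheets, not from the thread.
PARTS = {
    "Zen 4c/65W (rumoured)": (2800, 3200),
    "i7-6700": (3400, 4000),
    "i7-6700K": (4000, 4200),
}


def turbo_headroom(base_mhz, turbo_mhz):
    """Return (delta in MHz, delta as a percentage of the base clock)."""
    delta = turbo_mhz - base_mhz
    return delta, delta / base_mhz * 100


if __name__ == "__main__":
    for name, (base, turbo) in PARTS.items():
        delta, percent = turbo_headroom(base, turbo)
        print(f"{name:22s} +{delta} MHz ({percent:.1f}% over base)")
```

    Measured this way, the rumoured part's 400 MHz works out to roughly 14% over base, which sits between the two Intel parts listed rather than below both.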
     
  24. richaron

    richaron Senior member

    Joined:
    Mar 27, 2012
    Messages:
    841
    Likes Received:
    61
    It's awesome to see actual knowledgeable dudes on here, and I hope you're not put off by the regular people-who-appear-to-know-what-they're-talking-about-but-actually-just-remember-stuff-to-further-their-own-agenda-and-post-more-to-win-arguments types, like I have been. Cheers for your input.

    On topic: considering AMD's priorities with semi-custom/consoles and big partners such as Apple, on top of an obviously immature GloFo 14nm, wouldn't it be accurate to assume we've seen the worst of the silicon (pushed beyond its comfort zone) with the RX 480 reference? I'd assume top-grade chips and 6 months of fab maturing would be at least a little different..
     
    #2274 richaron, Jul 18, 2016
    Last edited: Jul 18, 2016
  25. Arachnotronic

    Joined:
    Mar 10, 2006
    Messages:
    10,439
    Likes Received:
    837
    The problem here is that Intel is also focused on capturing the performance crown; AMD doesn't operate in a vacuum.