New Zen microarchitecture details

FieryUP · Jul 17, 2016

mrmt said:
It is not a ring, it is a bridge, but since you said that Intel's solution is suboptimal, what would be the optimal solution for you? And how is AMD's solution better than Intel's suboptimal solution?

Yes, it's a ring. It looks like a ring and it works like a ring. You can call it whatever you want though, it's your choice. It's suboptimal because it's got too many hops. It works best with slightly fewer cores; 24 cores is stretching the concept.

I'm not saying AMD's solution is better, I'm only saying the whole GMI+SDP architecture was designed from the ground up to support 4 dies and a total of 32 cores, while the ring architecture that Intel uses wasn't designed for 32 cores at all, not even for 24 cores.

To make Intel's ring architecture better, you would need to reduce the number of hops it takes to reach cores. You could, for example, connect each core not only to its 2 neighboring cores, but also to 2 more cores on the other side of the ring. A 2D mesh and especially a 3D mesh would be even better. AMD's solution is completely different, since it has 4 cores in a core complex, 2 core complexes on a die, and then dies are connected to each other inside the CPU package. But those dies are connected quite well (IMHO AMD did a great job on that), so what we'll need to see is what the worst latency would turn out to be.
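To put rough numbers on the hop-count argument, here is a minimal Python sketch that measures the worst-case hop count of three toy topologies with a breadth-first search. Everything in it is an illustrative assumption: a single bidirectional ring with 24 stops (real 24-core parts bridge two rings, which is not modeled), a hypothetical "chorded" ring where each stop also links to the two stops flanking its opposite (offset 11 of 24), and a 4x6 2D mesh. The function names are made up for this example.

```python
from collections import deque


def diameter(adj):
    """Worst-case shortest-path hop count between any two nodes."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # plain BFS from src
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


def ring(n, chord_offsets=()):
    """Bidirectional ring of n stops; chord_offsets adds extra +/- links (assumption)."""
    offsets = (1,) + tuple(chord_offsets)
    return {i: {(i + s * o) % n for o in offsets for s in (1, -1)}
            for i in range(n)}


def mesh(rows, cols):
    """2D mesh without wrap-around links."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            nbrs = set()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    nbrs.add((rr, cc))
            adj[(r, c)] = nbrs
    return adj


if __name__ == "__main__":
    print("plain ring, 24 stops:", diameter(ring(24)))
    print("chorded ring, 24 stops:", diameter(ring(24, (11,))))
    print("4x6 mesh:", diameter(mesh(4, 6)))
```

In this toy model the plain ring tops out at 12 hops and the mesh at 8, with the chorded ring below the plain ring. Hop count is only a proxy here; real latency also depends on link width, clocks and arbitration.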

krumme · Jul 17, 2016

What do we actually know about the GMI link besides Hans's pics of the die?
http://www.chip-architect.com/news/Zen_Summit_Ridge_First.jpg

Dresdenboy says it is "of course" SeaMicro IP:
http://forums.anandtech.com/showthread.php?p=38065492

What does that imply? And what should we expect?

The Stilt · Jul 17, 2016

Isn't GMI just ganged PCI-E 3.0 lanes?

Arachnotronic · Jul 17, 2016

FieryUP said:
Yes, it's a ring. It looks like a ring and it works like a ring. You can call it whatever you want though, it's your choice. It's suboptimal because it's got too many hops. It works best with slightly fewer cores; 24 cores is stretching the concept.

I'm not saying AMD's solution is better, I'm only saying the whole GMI+SDP architecture was designed from the ground up to support 4 dies and a total of 32 cores, while the ring architecture that Intel uses wasn't designed for 32 cores at all, not even for 24 cores.

To make Intel's ring architecture better, you would need to reduce the number of hops it takes to reach cores. You could, for example, connect each core not only to its 2 neighboring cores, but also to 2 more cores on the other side of the ring. A 2D mesh and especially a 3D mesh would be even better. AMD's solution is completely different, since it has 4 cores in a core complex, 2 core complexes on a die, and then dies are connected to each other inside the CPU package. But those dies are connected quite well (IMHO AMD did a great job on that), so what we'll need to see is what the worst latency would turn out to be.

SKL-EP will use a mesh interconnect, rather than a ring bus. If it's like KNL, then it'll technically be a "mesh of rings."

Anyway those architects at Intel know what they're doing.

mrmt · Jul 17, 2016

FieryUP said:
But those dies are connected quite well (IMHO AMD did a great job on that), so what we'll need to see is what the worst latency would turn out to be.

How are the dies connected quite well? Did AMD provide amenities to the cores inside the package? "Connected well" is not very technical; it seems that you are just assuming AMD's solution is better because it is AMD.

FieryUP · Jul 17, 2016

mrmt said:
How are the dies connected quite well? Did AMD provide amenities to the cores inside the package? "Connected well" is not very technical; it seems that you are just assuming AMD's solution is better because it is AMD.

Maybe I just cannot be any more specific than that, because ... you know 🙂 I think you can figure out the reason.

Anyway, I'm not saying AMD's solution is better. All I'm saying is that Intel's current (and supposedly their best) solution is simply not suitable to scale up to 32 cores, while AMD's solution was designed to scale up to 32 cores. That's a big, albeit short-term, advantage on AMD's part. But it's also the least you can expect from a brand new, designed-from-the-ground-up server CPU.

So based on that, I simply assume that AMD's solution will work well with 32 cores -- maybe I'm stupid 🙂 I'm not saying it will perform great, all I'm saying is that the way AMD connected the dies together inside the CPU package looks great on paper. Latency, especially in the worst-case scenario (when data must cross "die borders"), is crucial, but I don't have any way to test or benchmark that at this time.

And I'm also aware of the fact that AMD showed us many, many great things on paper (slides) in the past 10 or so years, but then did "great" on underdelivery and delays, and canned many projects (Krishna, Wichita, Komodo, Kaveri 1.0 with GDDR5, the original Richland, Sepang, Macau, Terramar, Dublin, etc). It (i.e. underdelivery and delays) could happen this time too, but at least on paper the 32-core Naples CPU looks very promising.

mrmt · Jul 17, 2016

FieryUP said:
And I'm also aware of the fact that AMD showed us many, many great things on paper (slides) in the past 10 or so years, but then did "great" on underdelivery and delays, and canned many projects (Krishna, Wichita, Komodo, Kaveri 1.0 with GDDR5, the original Richland, Sepang, Macau, Terramar, Dublin, etc). It (i.e. underdelivery and delays) could happen this time too, but at least on paper the 32-core Naples CPU looks very promising.

I think you touched on the crux of the issue here: AMD product performance usually degrades a lot when going from paper to actual products.

I also don't get why you are comparing the ring architecture from Intel's current chips with a future AMD product. Intel already said that they will change the current ring architecture, so whatever Zen will have (which is unimpressive IMO) will go up against whatever Intel fields in the future. For example, AMD will not face the 22-core chips of today, but 28-core chips, so unless you expect AMD performance to come within 10-15% of Skylake, they won't get even close to the performance crown.

Dresdenboy · Jul 17, 2016

mrmt said:
I'm not interested in magic, I'm interested in how they are technically better than Intel's ring solution, and it seems you do not have an answer; you are expecting a miracle. I'm expecting either huge latencies or very poor efficiency from AMD's solution; it is a cheapskate solution for a market that has very little tolerance for it.

Did you consider any solution that is different from Intel's rings?

Some speculation: With Zeppelin dies, AMD just has to connect core complexes. Everything below that granularity could be handled locally by the CCX. With that said, a mesh spanning individual cores is out. A concentrated mesh connecting the CCXs would work. Other options might be a folded torus (X, X+Y), double butterfly, butter donut, etc. Why would all these be worse than bridged ring buses?
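As a rough illustration of why CCX-level topologies are attractive, the sketch below compares worst-case hop counts when only 8 nodes need connecting (4 dies x 2 core complexes, per the figures mentioned earlier in the thread). The 2x4 grid layout and the function names are assumptions made for this example, not anything known about Naples. A folded torus has the same connectivity as a regular torus (the folding only evens out wire lengths), so it is modeled here as a plain wrap-around grid.

```python
from itertools import product


def ring_dist(a, b, n):
    """Shortest distance between positions a and b on a bidirectional ring of n."""
    d = abs(a - b)
    return min(d, n - d)


def worst_hops(rows, cols, wrap):
    """Worst-case hops on a rows x cols grid; wrap=True adds torus links."""
    nodes = list(product(range(rows), range(cols)))
    worst = 0
    for (r1, c1), (r2, c2) in product(nodes, nodes):
        if wrap:
            hops = ring_dist(r1, r2, rows) + ring_dist(c1, c2, cols)
        else:
            hops = abs(r1 - r2) + abs(c1 - c2)
        worst = max(worst, hops)
    return worst


if __name__ == "__main__":
    # 8 CCX-level nodes arranged three different (hypothetical) ways
    print("ring of 8:", worst_hops(1, 8, wrap=True))
    print("2x4 mesh:", worst_hops(2, 4, wrap=False))
    print("2x4 torus:", worst_hops(2, 4, wrap=True))
```

With so few nodes, every option stays within 3 to 4 hops, which is the point of handling everything below CCX granularity locally.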

mrmt said:
I think you touched on the crux of the issue here: AMD product performance usually degrades a lot when going from paper to actual products.

Projected, but missed performance, and promised, but cancelled products, are different things. But you're right, there were multiple "performance degradation" cases in the past, like with Barcelona's clock frequencies.

DrMrLordX · Jul 17, 2016

Those clockspeeds seem low for the 4c and 8c parts. I was thinking/hoping for 3.3 GHz/4.0 GHz boost.

32c parts at that clockspeed would do okay. I would have to know more about AVX2 market penetration before I could judge whether or not AMD can bag it on modern ISA extensions. But from all the technical data that's been released thus far, it does not look like Zen will have good 256-bit SIMD performance.

Feel free to surprise me AMD.

FieryUP · Jul 18, 2016

DrMrLordX said:
Those clockspeeds seem low for the 4c and 8c parts. I was thinking/hoping for 3.3 GHz/4.0 GHz boost.

32c parts at that clockspeed would do okay. I would have to know more about AVX2 market penetration before I could judge whether or not AMD can bag it on modern ISA extensions. But from all the technical data that's been released thus far, it does not look like Zen will have good 256-bit SIMD performance.

Feel free to surprise me AMD.

2.8/3.2 GHz are for the 4c/65W part. I too expect those clocks to go considerably higher when TDP is loosened up to 95W, _and_ AMD gets another spin on the stepping (A1). The big question is how much better A1 would be, and when we can get to that stepping.

NostaSeronx · Jul 18, 2016

The Stilt said:
Isn't GMI just ganged PCI-E 3.0 lanes?

Nope, it's Redwood 4.0 from Rambus. :whistle:

looncraz · Jul 18, 2016

mrmt said:
I'm not interested in magic, I'm interested in how they are technically better than Intel's ring solution, and it seems you do not have an answer; you are expecting a miracle. I'm expecting either huge latencies or very poor efficiency from AMD's solution; it is a cheapskate solution for a market that has very little tolerance for it.

This is what the simplest 16-core ring bus would look like when using quad-core modules with their own L3 (like Zen, but not based on anything I know [or think I know] about Zen...).

If the purple ring is bidirectional, then you only need, max, two hops to reach the proper target module. If the L3 is inclusive, then the logical arrangement shown makes the most sense to me. Cache latencies dominate the performance aspect in this scenario, with it only ever taking just a few hops for the ring buses to do their thing.

The green area with the yellow star is the shared L3 module cache, which contains a data bus which is pulling double-duty for inter-core data transfers. Each core in the module is connected to a simple ring bus to assert or receive communications on the ring bus.

The easiest way to address a core in this system would be by MODULE:CORE addressing. Assuming internal ring bus communication begins at Core 1, Core A1 talking to core D4 would be the longest trip, with 10 hops. Nominal would be just four hops, however. If each hop took two cycles and the L3 cache was as slow as that on Bulldozer, worst-case inter-core latency would be ~85 cycles, with nominal being closer to 75 cycles. Best case would be 69 cycles.
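The arithmetic behind those figures can be reproduced with a short Python sketch. The post does not state the L3 latency it uses; the 65-cycle base below is an assumption back-calculated from the quoted numbers (85 cycles minus 10 hops at 2 cycles each), and the 2-hop best case is likewise inferred from the 69-cycle figure. The function and constant names are made up for this example.

```python
CYCLES_PER_HOP = 2    # stated in the post
L3_BASE_CYCLES = 65   # assumption: inferred from 85 - 10 * 2, not stated in the post


def inter_core_latency(hops, l3_cycles=L3_BASE_CYCLES, per_hop=CYCLES_PER_HOP):
    """Estimated inter-core latency in cycles: fixed L3 cost plus ring hops."""
    return l3_cycles + hops * per_hop


if __name__ == "__main__":
    # hop counts: 10 and 4 come from the post, 2 is inferred from "69 cycles"
    for label, hops in (("worst", 10), ("nominal", 4), ("best", 2)):
        print(f"{label}: {hops} hops -> {inter_core_latency(hops)} cycles")
```

Under these assumptions the nominal case works out to 73 cycles, which the post rounds to "closer to 75". The spread between best and worst case is only 16 cycles because the fixed L3 cost dominates.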

That's fairly consistent for a 16-core behemoth... and with just a very simple system. Zen will be better than this, no doubt, but the module organization has some pretty obvious advantages.

looncraz · Jul 18, 2016

FieryUP said:
2.8/3.2 GHz are for the 4c/65W part. I too expect those clocks to go considerably higher when TDP is loosened up to 95W, _and_ AMD gets another spin on the stepping (A1). The big question is how much better A1 would be, and when we can get to that stepping.

You would still think that a 14nm LPP Zen CPU could have turbo clocks closer to 4 GHz.

If Zen consumer APUs don't have turbo clocks nearer to 4 GHz while only having Haswell or lower IPC, with 65W TDP, then Zen is basically a failure.

I don't care about base clocks as low as 2.8 or 3 GHz - that's certainly acceptable for an 8-core 95W CPU. Turbo clocks should be much closer to 4 GHz, though, IMHO, to even be worth buying.

Unless we are talking about $150 8-core CPUs still...

The Stilt · Jul 18, 2016

FieryUP said:
2.8/3.2 GHz are for the 4c/65W part. I too expect those clocks to go considerably higher when TDP is loosened up to 95W, _and_ AMD gets another spin on the stepping (A1). The big question is how much better A1 would be, and when we can get to that stepping.

I don't think there will be A) a 4C/8T 95W part or B) an A1 stepping.

If Zeppelin were to launch in December, AMD would need to have the A1 stepping validated by September at the latest. Since we are beyond the middle of July already, chances for that are extremely slim as the current stepping appears to be A0.

btw. Tamas?

FieryUP · Jul 18, 2016

looncraz said:
This is what the simplest 16-core ring bus would look like when using quad-core modules with their own L3 (like Zen, but not based on anything I know [or think I know] about Zen...).

It looks nice, but there's no ring in Naples 😉

krumme · Jul 18, 2016

The technical specs for something like a 2.5 torus are way over my head, but from a business perspective it makes sense for them to choose an interconnect that:

Is easily scalable from 8 to 32
Is easily scalable from 32 upwards (think beyond the hardware perspective) - perhaps this is a key to understanding the Zen solution for the cloud market?
Is scalable on the hardware side as well as the software side.
Is cheap and fast to develop, meaning it leverages some standard and e.g. extends it. The Stilt's ganged PCIe lanes make sense here.
Is known and proven in server environments. SeaMicro IP makes sense here.
Is fast to market and doesn't drain internal resources. Buying tech like SeaMicro is an obvious solution.

FieryUP · Jul 18, 2016

The Stilt said:
I don't think there will be A) a 4C/8T 95W part

Why not? AMD already has 4-core Athlon X4 CPUs based on Kaveri and Carrizo. I'm not saying it's what the masses demand all day long, but AMD is famous for riding on niche markets, "slicing the salami" as we say around here. If they take 8-core Summit Ridge dies where there's an issue in one of the core complexes, and they disable that core complex, they can re-purpose such parts that otherwise would go to the trashcan. And such 4c/8t 95W parts could go head-to-head against i7-6700K. Of course it wouldn't have an iGPU, but most enthusiasts would have a dGPU to begin with...

I would personally worry more about capturing the performance crown. AMD would need something to go against i7-6950X. Not that it would bring a lot of $$, but solely for the purpose of peeling off the current sticker (stigma) of "affordable, but not that high performance". AMD's golden age was when they owned the performance crown with Athlon 64 and then Athlon 64 X2.

If Zeppelin were to launch in December, AMD would need to have the A1 stepping validated by September at the latest. Since we are beyond the middle of July already, chances for that are extremely slim as the current stepping appears to be A0.

I agree, but I don't think a December market launch would work out anyway. I expect a December press launch (paper launch) and a January market availability. That would give more time for A1 to be validated. If AMD can push the clocks even only 200 MHz higher by going from A0 to A1, then I'd say it's worth one or two months of delays. Summit Ridge needs to work awesome for AMD to get back in the game.

btw. Tamas?

Yep 🙂

krumme · Jul 18, 2016

The process was developed to turn a profit in the all-important smartphone market.
An A1 metal-layer spin is not going to make a difference for frequency. It will take more like a year to significantly alter the process. By that time we are at Zen Plus.
I wouldn't worry so much. Of course it's bad for desktop users if you want something that can compete with even a regular i5 on single-thread. But for servers it's efficiency that matters, and they stand no chance there if they go just a bit outside optimal power.
High frequency will come. It will just take a year or two.
It's far better to have solid IPC from day one. It's a mess to alter that later on; just look at BD.

The Stilt · Jul 18, 2016

FieryUP said:
Why not? AMD already has 4-core Athlon X4 CPUs based on Kaveri and Carrizo. I'm not saying it's what the masses demand all day long, but AMD is famous for riding on niche markets, "slicing the salami" as we say around here. If they take 8-core Summit Ridge dies where there's an issue in one of the core complexes, and they disable that core complex, they can re-purpose such parts that otherwise would go to the trashcan. And such 4c/8t 95W parts could go head-to-head against i7-6700K. Of course it wouldn't have an iGPU, but most enthusiasts would have a dGPU to begin with...

I would personally worry more about capturing the performance crown. AMD would need something to go against i7-6950X. Not that it would bring a lot of $$, but solely for the purpose of peeling off the current sticker (stigma) of "affordable, but not that high performance". AMD's golden age was when they owned the performance crown with Athlon 64 and then Athlon 64 X2.

I agree, but I don't think a December market launch would work out anyway. I expect a December press launch (paper launch) and a January market availability. That would give more time for A1 to be validated. If AMD can push the clocks even only 200 MHz higher by going from A0 to A1, then I'd say it's worth one or two months of delays. Summit Ridge needs to work awesome for AMD to get back in the game.

Yep 🙂

To me the 95W 4C/8T appears to be pretty unlikely, since the Fmax appears to be fully limited by the manufacturing process itself and not by the power limit (based on the alleged extremely tiny delta between the base clock and max turbo). So increasing the TDP from 65W to 95W would basically make no difference in the clocks or the performance.

Based on Polaris 10 silicon characteristics, I'd expect that AMD would gain significantly more from a more mature manufacturing process than from a new silicon revision.

Anyway, good to see you here Tamas 😉

FieryUP · Jul 18, 2016

The Stilt said:
To me the 95W 4C/8T appears to be pretty unlikely, since the Fmax appears to be fully limited by the manufacturing process itself and not by the power limit (based on the alleged extremely tiny delta between the base clock and max turbo). So increasing the TDP from 65W to 95W would basically make no difference in the clocks or the performance.

Based on Polaris 10 silicon characteristics, I'd expect that AMD would gain significantly more from a more mature manufacturing process than from a new silicon revision.

Anyway, good to see you here Tamas 😉

Thank you 🙂 I'm not sure what you mean by "extremely tiny delta"... 2.8/3.2 is the part in question, so is a 400 MHz delta that small? True, parts like i7-6700 have a 600 MHz delta, but 400 MHz is not that much smaller than 600 MHz, it's not tiny compared to it 🙂 With 6700K it's only 200 MHz, and 6800K is similarly 400 MHz.

richaron · Jul 18, 2016

It's awesome to see actual knowledgeable dudes on here, and I hope you're not put off by the regular people-who-appear-to-know-what-they're-talking-about-but-actually-just-remember-stuff-to-further-their-own-agenda-and-post-more-to-win-arguments types, like I have been. Cheers for your input.

On topic: considering AMD's priorities with semi-custom/consoles and big partners such as Apple, on top of an obviously immature GloFo 14nm process, wouldn't it be accurate to assume we've seen the worst of the silicon (and pushed beyond its comfort zone) with the RX 480 reference? I'd assume top-grade chips and 6 months of fab maturing would be at least a little different...

Arachnotronic · Jul 18, 2016

FieryUP said:
Why not? AMD already has 4-core Athlon X4 CPUs based on Kaveri and Carrizo. I'm not saying it's what the masses demand all day long, but AMD is famous for riding on niche markets, "slicing the salami" as we say around here. If they take 8-core Summit Ridge dies where there's an issue in one of the core complexes, and they disable that core complex, they can re-purpose such parts that otherwise would go to the trashcan. And such 4c/8t 95W parts could go head-to-head against i7-6700K. Of course it wouldn't have an iGPU, but most enthusiasts would have a dGPU to begin with...

I would personally worry more about capturing the performance crown. AMD would need something to go against i7-6950X. Not that it would bring a lot of $$, but solely for the purpose of peeling off the current sticker (stigma) of "affordable, but not that high performance". AMD's golden age was when they owned the performance crown with Athlon 64 and then Athlon 64 X2.

The problem here is that Intel is also focused on capturing the performance crown; AMD doesn't operate in a vacuum.

KTE · Jul 18, 2016

It'll be A1... No chance A0 will be production silicon.

+200-400MHz with this respin is possible only if the initial silicon is problematic/clocking poorly according to the base process metrics.

Not saying that's what will happen for the clocks. Don't see anything higher than 2.8-3.0GHz 8core 95W happening.

(Opinions are own)

DrMrLordX · Jul 18, 2016

FieryUP said:
2.8/3.2 GHz are for the 4c/65W part. I too expect those clocks to go considerably higher when TDP is loosened up to 95W, _and_ AMD gets another spin on the stepping (A1). The big question is how much better A1 would be, and when we can get to that stepping.

I was more referring to the 8c part @ 95W when I mentioned 3.3 GHz/4.0 GHz. But basically I'm with Looncraz, at least the turbo should get up there . . .

If they can manage A1 within an acceptable window then, hey, great. Get that superior voltage scaling to either bring up the clockspeed or bring down power at the existing clocks. 3 GHz is no problem for a 32c part after all. For an 8c part? ehhhh

KTE said:
Not saying that's what will happen for the clocks. Don't see anything higher than 2.8-3.0GHz 8core 95W happening.

That would be pretty weak though. Market penetration on AM4 would be non-existent.

We'll find out in a few months so it's no big deal either way. Just gotta be patient . . . or wait for them ES leaks.

The Stilt · Jul 18, 2016

KTE said:
It'll be A1... No chance A0 will be production silicon.

Why is there no chance of A0 being the final stepping? The most recent design from AMD launched with the A0 revision too (Stoney Ridge).
