AMD Zeppelin codename confirmed and perhaps 32 cores per socket


NTMBK

Lifer
Nov 14, 2011
10,232
5,013
136
32 cores to address 80% of the server market, interesting.



They probably don't expect it to beat the fastest Skylake-EP/EX (up to 28 cores).

Or alternatively, they expect to lose to it in 20% of use-cases. Say, ones that make heavy use of FP vectors.
 

svenge

Senior member
Jan 21, 2006
204
1
71
Given AMD's current nose-down trajectory I'm surprised that they didn't go one further and just name it Hindenburg, as it consists primarily of hot air and will likely go down in flames.


Threadcrapping is not allowed here
Markfw900
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
Given AMD's current nose-down trajectory I'm surprised that they didn't go one further and just name it Hindenburg, as it consists primarily of hot air and will likely go down in flames.

This. Plus Zeppelins are big and slow, exactly the traits you should expect of your brand new processor.
 

PPB

Golden Member
Jul 5, 2013
1,118
168
106
This. Plus Zeppelins are big and slow, exactly the traits you should expect of your brand new processor.

From attacking the OP to now speculating about the product's performance based on its codename.

The only nose-diving I am seeing here is in the level of the forum's technical discussion.

OT: So the L3 isn't actually an LLC, die-wise. Are they probably relying on HBM for non-consumer products to do that job?
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
It is only single socket.

Zeppelin = 8 cores per die * 4 dies on interposer * 1 socket => 32 cores / 64 threads.

Dual Socket = 8 cores per die * 4 dies on interposer * 2 sockets => 64 cores / 128 threads.
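For anyone who wants to play with the numbers, the core-count arithmetic quoted above can be sketched in a few lines of Python. The function name and the default of 2 threads per core are illustrative assumptions (the latter is only implied by the 32 cores / 64 threads figure); none of this is confirmed by AMD.

```python
def package_counts(cores_per_die=8, dies_per_package=4, sockets=1, threads_per_core=2):
    """Return (cores, threads) for the rumoured layout.

    Defaults mirror the figures quoted in the post; threads_per_core=2
    is an assumption implied by "32 cores / 64 threads".
    """
    cores = cores_per_die * dies_per_package * sockets
    return cores, cores * threads_per_core


if __name__ == "__main__":
    print(package_counts())            # single socket
    print(package_counts(sockets=2))   # dual socket
```

Changing cores_per_die to 16 shows how quickly the totals move if the per-die granularity turns out to be different.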

Not like Zep will ever release. 22FDX is pretty much being shoved down AMD's throats by GlobalFoundries in the closed-off meetings. (They didn't do STM's 28nm FDSOI, and they won't do Samsung's FinFETs.)
I assume the cores/die granularity is 16, with 2 memory channels. This shouldn't be too big (~160-200 mm²), as one CU with 8 MB L3 (assumed) is ~30 mm².
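A rough way to sanity-check that die-size guess: subtract the area taken by the CUs from the total and see what is left for memory controllers, I/O and interconnect. The sketch below assumes 4 cores per CU, which is not stated in the post and is purely a placeholder; the 16 cores per die, ~30 mm² per CU and 160-200 mm² totals are the post's own assumed figures.

```python
def implied_uncore_area_mm2(total_die_mm2, cores_per_die=16, cores_per_cu=4, cu_area_mm2=30.0):
    """Area left over for everything that is not a CU.

    cores_per_cu=4 is a hypothetical value chosen for illustration;
    the other defaults are the (assumed) figures from the post.
    """
    cu_count = cores_per_die // cores_per_cu
    return total_die_mm2 - cu_count * cu_area_mm2


if __name__ == "__main__":
    for total in (160.0, 200.0):
        print(total, "mm² die ->", implied_uncore_area_mm2(total), "mm² outside the CUs")
```

Under those assumptions the CUs take 120 mm², leaving roughly 40-80 mm² for the rest of the die.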

If that was the case, AMD wouldn't keep it a secret and AMDs stock would explode.

Hypothetical scenarios stay hypothetical ;)
Ask your preferred server OEMs. They might know more. AMD might still look at the Shmoo plots.

32 cores to address 80% of the server market, interesting.

They probably don't expect it to beat the fastest Skylake-EP/EX (up to 28 cores).
Surely not at FP SIMD-heavy stuff. Without that, cores become smaller, use less power, and need lower voltage margins.
 

Azuma Hazuki

Golden Member
Jun 18, 2012
1,532
866
131
It is only single socket.

Zeppelin = 8 cores per die * 4 dies on interposer * 1 socket => 32 cores / 64 threads.

Dual Socket = 8 cores per die * 4 dies on interposer * 2 sockets => 64 cores / 128 threads.

Not like Zep will ever release. 22FDX is pretty much being shoved down AMD's throats by GlobalFoundries in the closed-off meetings. (They didn't do STM's 28nm FDSOI, and they won't do Samsung's FinFETs.)

Speaking of, what happened to those Harvester and Crane cores you mentioned a few months ago?
 

myocardia

Diamond Member
Jun 21, 2003
9,291
30
91
Are they probably relying on HBM for non-consumer products to do that job?

I hope not. Sure, it would give them better performance, but it would come at the cost of both SKU prices and availability. Not only does AMD need HBM for their higher performing GPUs, they also need it for their Zen APUs. Also, don't forget that nVidia has plans of using HBM on their Pascal GPUs, as well.

Call me crazy (everyone else does!), but I'd rather see widely available Zens that are slightly lower-performing at a lower price that makes AMD plenty of profit, than to see higher-performing Zens that are scarce, higher-priced, and making AMD less profit.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
OT: So the L3 isn't actually an LLC, die-wise. Are they probably relying on HBM for non-consumer products to do that job?
I need to check Loh's work regarding this nomenclature. The second patch talks about LLC (on die), while HBM + NVM/DDR4 might also be seen as an external two-level memory architecture. So instead of HBM = L4, it would simply hide the off-package memory's low bandwidth.
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
Or alternatively, they expect to lose to it in 20% of use-cases. Say, ones that make heavy use of FP vectors.
Good point. One can add that Keller and team didn't have the resources to develop the fancy new FP part anyway. They had to rely on a lot of prior design blocks. As long as Zen pulls a future console and desktop, the GPU can take care of the rest. Still, -address- 80% of the future server market. Makes good sense IMO to focus, unlike BD, which was supposed to be good at everything and failed at everything.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
Good point. One can add that Keller and team didn't have the resources to develop the fancy new FP part anyway. They had to rely on a lot of prior design blocks. As long as Zen pulls a future console and desktop, the GPU can take care of the rest. Still, -address- 80% of the future server market. Makes good sense IMO to focus, unlike BD, which was supposed to be good at everything and failed at everything.

Bulldozer wasn't supposed to be good at everything, it was supposed to excel at INT workloads while being reasonable on FP workloads. It also relied on more cores in order to compensate for the lower IPC.

Not that I think AMD will release a product as bad as Bulldozer, but the strategy behind it isn't anything new.
 

PPB

Golden Member
Jul 5, 2013
1,118
168
106
I hope not. Sure, it would give them better performance, but it would come at the cost of both SKU prices and availability. Not only does AMD need HBM for their higher performing GPUs, they also need it for their Zen APUs. Also, don't forget that nVidia has plans of using HBM on their Pascal GPUs, as well.

Call me crazy (everyone else does!), but I'd rather see widely available Zens that are slightly lower-performing at a lower price that makes AMD plenty of profit, than to see higher-performing Zens that are scarce, higher-priced, and making AMD less profit.
I don't think LLC is something to make big generalizations about. We have examples of products having great all-around performance without a die-wise LLC (Conroe, to me, is a good example), and products that lose little performance from not having an LLC compared to the very same die having that LLC enabled (because it was slow to begin with; think about K10/10.5 and the whole BD architecture).

AMD could very well target good-enough performance for the consumer and just keep this L3 cache shared between those 4 core blocks (making a smaller but faster L3 cache compared to BD), and have HBM as a proper, big LLC in their server or HPC products.
 

JDG1980

Golden Member
Jul 18, 2013
1,663
570
136
Bulldozer wasn't supposed to be good at everything, it was supposed to excel at INT workloads while being reasonable on FP workloads. It also relied on more cores in order to compensate for the lower IPC.

Not that I think AMD will release a product as bad as Bulldozer, but the strategy behind it isn't anything new.

Cache latency was one of the big IPC killers on Bulldozer and its progeny. AMD's presentation slides specifically indicated improved cache performance for Zen, so at least there's an internal awareness of this issue and efforts taken to fix it.

Also, it's important to distinguish between trading off FPU performance in general (which Bulldozer did) and trading off AVX performance in particular. The FPU in Bulldozer is used for a lot of stuff, much of which routinely shows up in mainstream apps; not only actual floating point, but also MMX and SSE2 integer vector operations, go through it. This is where the corner cutting really hurts. On the other hand, AVX performance is a much more acceptable sacrifice. AVX, in practice, often gives only marginal improvements over SSE2, and is used in a lot fewer applications.

It's worth pointing out that even Sandy Bridge made some compromises with AVX to save die space:

Sandy Bridge allows 256-bit AVX instructions to borrow 128-bits of the integer SIMD datapath. This minimizes the impact of AVX on the execution die area while enabling twice the FP throughput, you get two 256-bit AVX operations per clock (+ one 256-bit AVX load).

I think we can expect something similar from AMD this time around.

Also, remember that Intel didn't consider AVX-512 important enough to justify its die-space expense on the desktop and mobile SKUs. Once you get above 128 bits, you run into diminishing returns with vector operations - unless your whole code path is massively parallel, in which case you should just be running it on a GPU instead.
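To put the Sandy Bridge quote in numbers, peak vector throughput per core is just lanes per vector times vector operations issued per clock. The sketch below is a back-of-the-envelope calculation, not a benchmark: the function name is made up for illustration, and it assumes each vector operation counts as one FLOP per lane and that the 128-bit SSE case also issues two operations per clock.

```python
def peak_flops_per_cycle(vector_bits, element_bits, vector_ops_per_cycle):
    """Peak floating-point operations per cycle for one core.

    Assumes every issued vector op does one FLOP per lane
    (no fused multiply-add), which is a simplification.
    """
    lanes = vector_bits // element_bits
    return lanes * vector_ops_per_cycle


if __name__ == "__main__":
    # Two 256-bit AVX ops per clock, as in the quote above.
    print("AVX single precision:", peak_flops_per_cycle(256, 32, 2))
    print("AVX double precision:", peak_flops_per_cycle(256, 64, 2))
    # Assumed comparison point: two 128-bit SSE ops per clock.
    print("SSE single precision:", peak_flops_per_cycle(128, 32, 2))
```

The doubling only shows up when the code actually keeps 256-bit vectors full, which is the diminishing-returns point made above.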
 

JDG1980

Golden Member
Jul 18, 2013
1,663
570
136
32 cores to address 80% of the server market, interesting.

They probably don't expect it to beat the fastest Skylake-EP/EX (up to 28 cores).

If this product does make it to market, I think we'll see 32 Zen cores with IPC between Sandy Bridge and Haswell levels depending on workload, running at a clock speed probably between 2.0 and 2.5 GHz. This would fall behind 28-core Skylake, due to the latter's superior IPC, but not by a huge amount. (Of course, single-threaded and lightly-threaded apps would see the worst hit, but this is less of an issue on servers than on desktops.) On a 14nm FinFET process, Zeppelin could probably fit within a typical server CPU power envelope (130W-150W). This would require a dedicated socket, since AM4 is said to only be rated for up to 95W.
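Two quick calculations back up the estimate above: how many watts each core gets inside a given package power envelope, and how much per-core performance (IPC times clock) a 32-core part needs relative to a 28-core rival to tie on aggregate throughput. This is only a sketch; the function names are illustrative, and the break-even figure assumes perfect scaling across cores, which real server workloads will not deliver.

```python
def watts_per_core(package_watts, cores):
    """Average power budget per core, ignoring uncore and I/O power."""
    return package_watts / cores


def break_even_per_core_ratio(our_cores, rival_cores):
    """Per-core performance needed, relative to the rival, to match its
    aggregate throughput. Assumes perfect scaling with core count."""
    return rival_cores / our_cores


if __name__ == "__main__":
    for tdp in (130, 150, 95):  # server envelope from the post, plus the AM4 rating
        print(f"{tdp} W / 32 cores = {watts_per_core(tdp, 32):.2f} W per core")
    print("Break-even per-core ratio vs 28 cores:", break_even_per_core_ratio(32, 28))
```

So each of the 32 cores would need about 87.5% of a Skylake core's per-core performance to draw level, within roughly 4 to 4.7 W per core at 130-150 W.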
 

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,830
136
If the former leaked slides are real, then such a CPU with 4xDDR4 channels, HBM, Greenland GPU surely isn't aimed at HEDT. AM4 simply wouldn't support it.

Yup, nothing quad channel will work on AM4. I'm interested in knowing if a dual-channel version could be made to work on AM4, though.

Not like Zep will ever release. 22FDX is pretty much being shoved down AMD's throats by GlobalFoundries in the closed-off meetings. (They didn't do STM's 28nm FDSOI, and they won't do Samsung's FinFETs.)

. . . seriously?

Speaking of, what happened to those Harvester and Crane cores you mentioned a few months ago?

He's just keeping up with his stellar track record.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Speaking of, what happened to those Harvester and Crane cores you mentioned a few months ago?
Excavator - 28nm Advanced(Ex:SLPP//Super Low Power Plus)
Excavator+ - 20nm Advanced Next(Ex:LPMP//Low Power Mobility Plus)
Harvester(ARM) + Crane(x86) - 14nm eXtreme Mobility

then,
Excavator - 28nm Advanced
Harvester(ARM) + Crane(x86) - 28nm FDSOI

then,
Excavator - 28nm Advanced
Zen - 14nm LPP
Tunnelborer(ARM/x86) - 22FDX
---
TB => Quad-core CMT cluster with dual-core CSMT. Whether it's x86 or ARM is determined sometime around boot time via UEFI/OS integration.
 

Azuma Hazuki

Golden Member
Jun 18, 2012
1,532
866
131
Excavator - 28nm Advanced(Ex:SLPP//Super Low Power Plus)
Excavator+ - 20nm Advanced Next(Ex:LPMP//Low Power Mobility Plus)
Harvester(ARM) + Crane(x86) - 14nm eXtreme Mobility

then,
Excavator - 28nm Advanced
Harvester(ARM) + Crane(x86) - 28nm FDSOI

then,
Excavator - 28nm Advanced
Zen - 14nm LPP
Tunnelborer(ARM/x86) - 22FDX
---
TB => Quad-core CMT cluster with dual-core CSMT. Whether it's x86 or ARM is determined sometime around boot time via UEFI/OS integration.


Sorry, I'm still not following this. "Excavator+" at 20nm doesn't look like anything AMD has on its roadmaps, and it *sounds* like Stoney Ridge, except THAT'S supposed to be another 28nm Carrizo-based SoC, isn't it?

I haven't heard of 28nm anything, let alone full-depletion silicon on insulator, on their roadmap either, as far as CPUs go.

Is Harvester supposed to be their answer to Apple's Cyclone or something? And is Crane some kind of 14nm respin of Excavator to follow Stoney Ridge? Also, HOW would Tunnelborer even work?

This stuff is all over the place. Where are you getting this?
 
Mar 10, 2006
11,715
2,012
126
Sorry, I'm still not following this. "Excavator+" at 20nm doesn't look like anything AMD has on its roadmaps, and it *sounds* like Stoney Ridge, except THAT'S supposed to be another 28nm Carrizo-based SoC, isn't it?

I haven't heard of 28nm anything, let alone full-depletion silicon on insulator, on their roadmap either, as far as CPUs go.

Is Harvester supposed to be their answer to Apple's Cyclone or something? And is Crane some kind of 14nm respin of Excavator to follow Stoney Ridge? Also, HOW would Tunnelborer even work?

This stuff is all over the place. Where are you getting this?

Haven't you figured it out yet? This is all made-up garbage.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Bulldozer wasn't supposed to be good at everything, it was supposed to excel at INT workloads while being reasonable on FP workloads. It also relied on more cores in order to compensate for the lower IPC.

Not that I think AMD will release a product as bad as Bulldozer, but the strategy behind it isn't anything new.
I followed the BD arch (not the marketing slides) starting 7 years ago. And IMO before hitting the 32nm realities, they tried to put more hardware per thread into a given area, applying a sharing concept for typically underutilized blocks, while making it a speed racer design to compensate for cores certainly not designed to excel at INT workloads, especially at low thread counts. But this is one part. As it emerged, some serious bottlenecks sat in the data writing paths to L2 (WCC quickly overflowing, low L2 write B/W) and further down the memory hierarchy.
 

Azuma Hazuki

Golden Member
Jun 18, 2012
1,532
866
131
I often wonder what the architecture would have been like with double the L2 cache with more associativity, speed, and lower latency. It looks to me like a hungry and chronically underfed animal.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,764
3,131
136
I often wonder what the architecture would have been like with double the L2 cache with more associativity, speed, and lower latency. It looks to me like a hungry and chronically underfed animal.

But that's the thing: according to David Kanter the actual L2 arrays themselves are fast, so if they are fast yet latency to L2 is so high, what's wrong and how do you fix it?


One thing that will be very interesting to see with Zen is what the L1 latency will be; I wouldn't be surprised to see 3 cycles, but I expect 4.