AMD Zen 2 Based ‘Starship’ CPU to Bring 48 Cores, 96 Threads in 2018

ericlp

Diamond Member
Dec 24, 2000
6,137
225
106
It looks like AMD is just getting warmed up following the release of Ryzen processors based on the Zen architecture. With powerful processors now flooding the market, AMD is set to begin another journey towards the second half of 2017 with a massive lineup of processors that would bring a 7nm fabrication process and, wait for it, 16-, 32- and 48-core CPUs!

The leak was reported by VideoCardz, which suggested that the information was first presented back in February 2016. While we were led to believe that most of the information about AMD Zen had already been revealed, the latest leak indicates that AMD plans a Zen-based part on a 7nm fabrication process, called Starship, which will feature 48 cores.

In addition, there are two more processor families: Snowy Owl and Naples, the latter of which was already revealed earlier this year. As summarized by WCCFTech, here is a comparison table of the upcoming AMD CPUs.

Enterprise Processor | AMD Snowy Owl | AMD Naples  | AMD Starship
CPU Architecture     | Zen 1         | Zen 1       | Zen 2
Process Node         | 14nm FinFET   | 14nm FinFET | 7nm FinFET
Maximum Cores        | 16 Cores      | 32 Cores    | 48 Cores
Maximum Threads      | 32 Threads    | 64 Threads  | 96 Threads
Availability         | Q2 2017       | 2018        | 2018


Probably the most powerful among the Zen architecture-based CPUs (unless AMD still has something up its sleeve), the processor family codenamed Starship will use the latest 7nm Zen core architecture. AMD is reportedly planning to launch this processor family next year.

The Starship CPU is reportedly built from Zen 2 cores, which will be available early next year. This is believed to be the same CPU that will power high-end desktop (HEDT) parts based on AMD's Pinnacle Ridge.

With the latest 7nm FinFET process, which can deliver increased efficiency and denser designs, this AMD chip can carry a total of 48 cores, which is massive and the first of its kind. But that's not all: how about 96 threads? The Starship CPUs are believed to come in various configurations, with TDPs from 35W up to 180W. Finally, the new Starship CPUs will be branded as the new Opteron processors.
 

jpiniero

Lifer
Oct 1, 2010
16,553
7,060
136
Actually, there are rumors that AMD increased it to 64 cores (which would presumably be 4x4 CCX). Because of yields, though, the 16-core dies would probably be in short supply, so there might only be 48-core models available at launch.
 

Charlie22911

Senior member
Mar 19, 2005
614
231
116
So what configuration will this have?

4x dies with two 6 core CCX per die?

4x dies with two 8-core CCXs per die, cut down for yields, plus a future 64-core SKU?

I don’t see them simply cramming more dies into a package, due to latency and having dies with disconnected IMC needing to go through other dies to hit main memory.
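The two layouts asked about above are not the only ones that land on 48 cores. Here is a rough Python sketch that enumerates the combinations, assuming a four-die package as in the question and allowing up to two cores per CCX to be fused off for yield. The option lists and the `max_cut_per_ccx` limit are my own illustrative assumptions, not anything AMD has disclosed.

```python
from itertools import product

# Hypothetical search space: 4-core CCXs exist today; 6 and 8 are the guesses in this thread.
CORES_PER_CCX_OPTIONS = (4, 6, 8)
CCX_PER_DIE_OPTIONS = (2, 3, 4)
DIES_PER_PACKAGE = 4  # assumption taken from the "4x dies" question above


def configs_for(target_cores, max_cut_per_ccx=2):
    """Return (ccx_per_die, physical_cores_per_ccx, enabled_cores_per_ccx) tuples
    whose enabled cores add up to target_cores across the whole package."""
    found = []
    for ccx_per_die, cores_per_ccx in product(CCX_PER_DIE_OPTIONS, CORES_PER_CCX_OPTIONS):
        total_ccx = DIES_PER_PACKAGE * ccx_per_die
        if target_cores % total_ccx:
            continue  # cores would not split evenly across the CCXs
        enabled = target_cores // total_ccx
        disabled = cores_per_ccx - enabled
        if 0 <= disabled <= max_cut_per_ccx:
            found.append((ccx_per_die, cores_per_ccx, enabled))
    return found


if __name__ == "__main__":
    for target in (48, 64):
        print(target, configs_for(target))
```

Under these assumptions, both "two 6-core CCXs per die" and "two 8-core CCXs cut down to 6" show up for 48 cores, and the uncut 8-core layout is one of the ways to reach 64.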
 

LightningZ71

Platinum Member
Mar 10, 2017
2,371
2,992
136
I'm still not sold on AMD going with 2 x 8-core CCXs. It's MUCH easier to just modify the uncore to add more CCX interconnects and connect a couple more CCX units to the mesh on the die. 7nm is expected to ROUGHLY halve each linear dimension of the existing die (so 0.25 of the existing area, again, roughly). With that density, they could double the number of dies per wafer (die per wafer, not the number of living dies at the end of the manufacturing process) by halving the total die size. That still leaves roughly twice the floor space the shrunken design needs, so they can easily add two more CCX units that are just like the existing ones (obviously with core tweaks, etc.) and, with an updated, faster uncore, help deal with the additional latency inherent in having more CCX units without doing a really nasty CCX tear-up and replan. The current CCX has 6 internal interconnects. An 8-core CCX would have 28 (inside a CCX, each core can directly interact with every other core). To deal with that many interconnects, each CCX would grow significantly to accommodate the additional data and control lines required. Just adding two additional CCX units and wiring them into the uncore would be a MUCH simpler task.
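The 6 and 28 figures above follow from a fully connected CCX, where n cores need n(n-1)/2 direct links. A minimal Python sketch of that arithmetic and of the area scaling, assuming a full mesh inside each CCX as the post describes (the function names are illustrative only):

```python
def full_mesh_links(cores):
    """Direct core-to-core links when every core connects to every other core."""
    return cores * (cores - 1) // 2


def total_intra_ccx_links(ccx_count, cores_per_ccx):
    """Links inside all CCXs on a die; CCX-to-CCX traffic through the uncore is not counted."""
    return ccx_count * full_mesh_links(cores_per_ccx)


def area_scale(linear_scale):
    """Relative area after scaling both die dimensions by linear_scale."""
    return linear_scale ** 2


if __name__ == "__main__":
    print(full_mesh_links(4), full_mesh_links(8))  # current CCX vs. a hypothetical 8-core CCX
    print(total_intra_ccx_links(4, 4))             # 16 cores as four 4-core CCXs
    print(total_intra_ccx_links(2, 8))             # 16 cores as two 8-core CCXs
    print(area_scale(0.5))                         # halving each dimension
```

For the same 16 cores, four small CCXs need 24 internal links against 56 for two big ones, which is the wiring cost the post is pointing at. The sketch ignores whatever the uncore needs to connect the extra CCXs.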

One thing that won't shock me, though, is a reduction in the L3 cache per CCX from the existing 8MB in Ryzen 1XXX to the 4MB in the Raven Ridge cores. It appears not to have cost much performance, and in some areas it has made things faster. However, this may be because that die was designed with only a single CCX, so the need to carry a copy of some data from a remote CCX does not apply.

The other possibility is an expansion of the L3 cache per CCX from 8MB to 16MB. This would allow each CCX to have its local 4MB of cache plus a local copy of each other CCX's 4MB as well. It would present an effective 16MB of L3 cache to the outside world but, internally, there would be 64MB of L3. Perhaps, instead of doing a big L3 for each CCX, they could go with the RR 4MB of L3 per CCX and have the Memory Management Unit in the uncore carry a 32MB L4 cache. It could be partitioned so that half is a copy of the L3 in each CCX and half is a victim cache for everything destined for main memory.

Assuming that the SRAM cell size stays the same relative to process node and CCX size, taking 4 CCX units, reducing their L3 cache to 4MB each, and moving the freed 16MB of cache to an uncore location would be a wash in terms of die area, and it would allow more flexibility in placing the now smaller CCX units. Given that everything in the uncore is expected to shrink at roughly the same ratio, and assuming that they'd want to come in at roughly half the area of the current die, that would leave more space for a larger L4 cache, or even allow a small iGPU section to be added to those cores. No matter what, having twice as many cores and threads is going to put a major strain on the memory bus if scaling is to stay relatively constant. Something will need to be done to keep that beast fed with data. A large L4 would seem to be a necessary evil.
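To keep the cache numbers above straight, here is a small Python sketch of the two layouts being floated, assuming four CCXs per die. The function names and the tuple layout are made up for illustration, and the sketch only does capacity bookkeeping, not any real coherence protocol.

```python
CCX_COUNT = 4  # assumption: four CCXs on the die, as in the post


def replicated_l3(local_mb, ccx_count=CCX_COUNT):
    """Each CCX holds its own slice plus a copy of every other CCX's slice.
    Returns (MB per CCX, physical MB on die, unique MB visible to software)."""
    per_ccx = local_mb * ccx_count
    physical = per_ccx * ccx_count
    unique = local_mb * ccx_count
    return per_ccx, physical, unique


def shared_l4(local_mb, l4_mb, ccx_count=CCX_COUNT):
    """Small L3 per CCX plus an uncore L4 that mirrors every L3; the rest is victim cache.
    Returns (mirror MB, victim MB, physical MB on die)."""
    mirror = local_mb * ccx_count
    victim = l4_mb - mirror
    physical = local_mb * ccx_count + l4_mb
    return mirror, victim, physical


if __name__ == "__main__":
    print(replicated_l3(4))   # 16MB per CCX, 64MB physical, 16MB unique
    print(shared_l4(4, 32))   # 16MB mirror, 16MB victim, 48MB physical
```

The replicated scheme spends 64MB of SRAM to expose 16MB of unique L3, while the L4 scheme splits its 32MB evenly between the mirror and the victim cache, matching the "half and half" partition described above.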
 

Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
I apologise for lowering the tone, but I think it's awesome that AMD will be introducing Starship Enterprise CPUs, for the name alone.
Now we know why AMD changed the names of its GPU architectures to star names ;).
 

Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
If the Starship CPU has 64 cores, 128 PCIe lanes and 256 MB of L3 cache, per Canard PC Hardware's tweet, it means a single CCX will have 8 cores and 32 MB of L3 cache. The Ryzen 3000 series would then have 64 MB of L3 cache.

Absolute madness, if true.
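A quick Python sanity check of the rumoured figures, splitting 256 MB of L3 evenly across however many CCXs 64 cores imply. The even split and the helper name are assumptions for illustration; the rumour itself gives only the totals.

```python
def l3_per_ccx(total_cores, total_l3_mb, cores_per_ccx):
    """Return (number of CCXs, MB of L3 per CCX), assuming L3 is split evenly per CCX."""
    ccx_count, remainder = divmod(total_cores, cores_per_ccx)
    if remainder:
        raise ValueError("total_cores must be a multiple of cores_per_ccx")
    return ccx_count, total_l3_mb / ccx_count


if __name__ == "__main__":
    print(l3_per_ccx(64, 256, 8))  # 8-core CCX reading of the rumour
    print(l3_per_ccx(64, 256, 4))  # same totals with today's 4-core CCX
```

With 8-core CCXs that is 8 CCXs at 32 MB each; with 4-core CCXs the same totals work out to 16 CCXs at 16 MB each. Either way it is 4 MB of L3 per core, so the totals alone do not settle the CCX size.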
 

Topweasel

Diamond Member
Oct 19, 2000
5,437
1,659
136
Could be 4x4 CCX per die.
 

Charlie22911

Senior member
Mar 19, 2005
614
231
116
This is just an assumption, but I think decoupling Infinity Fabric from RAM speed would help with inter-core latency a good deal; I don't know if this is possible without a new platform, though. Each die has its own IMC, so as long as all channels are populated it shouldn't be any harder to feed a TR4/SP3 part than an AM4 part when paired with decent RAM.
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
21,057
3,549
126
Seriously the meme of Yoda going "The Core Wars have begun" is so true.

Pretty soon it's not going to be "oh, how fast is your processor?" It's going to be "how many cores are you running under that heat sink?"

Or "how big is your heat sink to keep that massive monolithic die in check?"

:T
 

jpiniero

Lifer
Oct 1, 2010
16,553
7,060
136
This right here. 7nm yields are gonna suck for just about everyone initially.

GloFo's 7 yield is rumored to be even worse than Intel's 10. They have plenty of options in how things get chopped with Matisse though. It'd make things a lot easier I imagine if they could fix the problems related to having CCX's with different enabled cores.

Have to see what happens with TSMC and Samsung, but TSMC's 7FF does look like yield is passable there.
 

nathanddrews

Graphics Cards, CPU Moderator
Aug 9, 2016
965
534
136
aigomorla said: "Seriously the meme of Yoda going "The Core Wars have begun" is so true."
I'll gladly take the credit on that one. ;)

I'm sure that as more games (even one or two AAA games) come out that can use more cores, we will most certainly start blinging (yes, blinging) our core counts. Personally, I'll take all the cores I can get to transcode UHD content on my Plex server!
 

Yotsugi

Golden Member
Oct 16, 2017
1,029
487
106
jpiniero said: "GloFo's 7 yield is rumored to be even worse than Intel's 10."
No redacted dude, it's barely entered risk production, if even.
jpiniero said: "I imagine if they could fix the problems related to having CCX's with different enabled cores."
That's impossible.
jpiniero said: "TSMC's 7FF does look like yield is passable there."
It doesn't, unless you fab peanut-sized SoCs (and AMD is not making one, so...).
Vega20 is a pipecleaner for a reason.

You have been previously warned many times,
but there is no profanity allowed in the tech forums.

AT Mod Usandthem
 

ericlp

Diamond Member
Dec 24, 2000
6,137
225
106
I wonder what the transistor count will be with 96 threads??? :)

I'm gonna go with 40 billion.
 

DrMrLordX

Lifer
Apr 27, 2000
22,741
12,732
136
Makes perfect sense for the enterprise world. It's amusing that an old 42U server rack filled with quad Opteron systems from the K8 days would be able to handle about as many threads as a 2P Starship system alone (actually less, but who's counting?). Also, if AMD can demonstrate the Starship is immune to Spectre too . . .
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,153
136
Yotsugi said: "That's impossible."
Possible, though the complexity involved would be fairly great if you want to retain some semblance of performance consistency. It's more a question of "why bother?" than "can we?".
 

thecoolnessrune

Diamond Member
Jun 8, 2005
9,673
583
126
I for one am happy to see it again. After years of 65 Watt CPUs and 90 Watt high end CPUs with 130 Watt Server CPUs popping up occasionally, I'm happy to see 170 / 180 Watt CPUs come up. We're still far away from the 220 Watt AMD FX CPU, and a moon launch away from the 500 watt configurable TDP of IBM's 24 core Power 9 CPUs, so I welcome it.
 

Charlie22911

Senior member
Mar 19, 2005
614
231
116
Something like a 4-way XCC MCM would get us to that 500W number pretty quick!

With an overclock we could also get there. I wonder if AMD will do a halo part, running it way outside its normal spec kind of like they did with the FX 9590. That'd be cool.
 

Vattila

Senior member
Oct 22, 2004
820
1,456
136
I've tried to make sense of the few roadmap and code name rumours we have so far. "Starship" was rumoured as a 48-core server chip, but by applying some common sense, I think it is more likely the name of the die, i.e. the successor to "Zeppelin" (airship). The next server chips have code names "Rome" and "Milan", as per the latest public AMD roadmap, i.e. the successors to "Naples" (Italian city). Based on this, I guess the following for the code names:

"Starship": 7LP multi-purpose die ("Zeppelin" successor), 3 CCXs (3 x 4 = 12 "Zen 2" cores)
"Rome": 7LP EPYC CPU ("Naples" successor), 4 "Starship" dies (48 "Zen 2" cores)
"Threadripper Next?": 7LP Threadripper CPU, 2 "Starship" dies (24 "Zen 2" cores)
"Matisse": 7LP Ryzen CPU ("Pinnacle Ridge" successor), 1 "Starship" die (12 "Zen 2" cores)
"Picasso": 7LP Ryzen APU ("Raven Ridge" successor), 1 "Picasso" die (?)

I have speculated that AMD's 7nm multi-purpose die will include graphics (a "GCX"; see this thread). However, there is also the rumour (from Canard PC Hardware) that next-gen EPYC will have 64 cores, requiring a building block with 16 cores (4 CCXs; see this thread). In any case, I presume AMD will increase the core count on their 7nm multi-purpose die. So, assuming "Starship" is that die, it will have 12 cores (3 CCXs) plus maybe a GCX (in my fantasy land), or 16 cores (4 CCXs) and likely no GCX. If "Starship" were to include a GCX, it would make it usable even for high-end mobile and desktop APUs, leaving "Picasso" as a cost-saving refresh of the low-end (4 cores and below).
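The core counts in the guessed lineup are just dies x CCXs per die x 4 cores per CCX. A small Python sketch that recomputes them for both the 3-CCX and the 4-CCX "Starship" die, using the speculative die counts from the list above (none of this is confirmed by AMD):

```python
CORES_PER_CCX = 4  # assumption: "Zen 2" keeps the 4-core CCX

# Dies per package, taken from the speculative list above.
LINEUP = {
    "Rome": 4,
    "Threadripper Next?": 2,
    "Matisse": 1,
}


def package_cores(dies, ccx_per_die, cores_per_ccx=CORES_PER_CCX):
    """Total cores in a package built from identical dies."""
    return dies * ccx_per_die * cores_per_ccx


def lineup_cores(ccx_per_die):
    """Core count for each product, given how many CCXs the shared die carries."""
    return {name: package_cores(dies, ccx_per_die) for name, dies in LINEUP.items()}


if __name__ == "__main__":
    print(lineup_cores(3))  # 12-core die
    print(lineup_cores(4))  # 16-core die, matching the 64-core EPYC rumour
```

With 3 CCXs per die the lineup comes out at 48/24/12 cores; with 4 CCXs per die it becomes 64/32/16, which is the variant the Canard PC Hardware rumour would require.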

In any case, 2018 is the year when the fog will slowly lift and the details about the 7nm era will emerge. It is going to be exciting!