Speculation: Ryzen 4000 series/Zen 3

Ajay · Oct 9, 2019

So, one last time on SMT-4. Here is a simulation-based chart for the Alpha architecture (this paper doesn't say which Alpha was used as the baseline; another doc I read in the late '90s did, but I can't find it). Sadly, the x and y axes are labeled in light grey. This was clearly a high-yield design, with what looks like an SMT-2 yield (throughput gain over a single thread) of 67% for integer workloads! FP workloads aren't nearly so impressive, but mixed loading is still good. The rapid falloff as the hardware thread count increases is one item that argues against going above SMT-2. In a low-yield processor like Zen 2 (~30% yield), the technical benefits of moving from SMT-2 to SMT-4 diminish beyond merit (given the duplication and enlargement of CPU resources required to get these marginal gains). Better, for now, to take advantage of shrinks, 3D stacking and other features to increase core count for maximum per-socket performance.

Link to paper (1999)

soresu · Oct 9, 2019

Ajay said:
So, one last time on SMT-4. Here is a simulation-based chart for the Alpha architecture (this paper doesn't say which Alpha was used as the baseline; another doc I read in the late '90s did, but I can't find it). Sadly, the x and y axes are labeled in light grey. This was clearly a high-yield design, with what looks like an SMT-2 yield (throughput gain over a single thread) of 67% for integer workloads! FP workloads aren't nearly so impressive, but mixed loading is still good. The rapid falloff as the hardware thread count increases is one item that argues against going above SMT-2. In a low-yield processor like Zen 2 (~30% yield), the technical benefits of moving from SMT-2 to SMT-4 diminish beyond merit (given the duplication and enlargement of CPU resources required to get these marginal gains). Better, for now, to take advantage of shrinks, 3D stacking and other features to increase core count for maximum per-socket performance.

View attachment 11765

Link to paper (1999)

As much as a person can admire what Alpha accomplished in its time, that was 20 years ago now, during which time we have had at least 4-5 major uArch changes from AMD and Intel.

Is there not a paper somewhat more recent that covers higher SMT scaling from Sun/Oracle or IBM?

soresu · Oct 9, 2019

I might add to the Apple Ax argument: they do not design for servers, hence any estimate of the power consumption/TDP of a theoretical server CPU based on Ax Storm cores must account for a NoC/fabric and the oodles of system IO provided by AMD's huge Epyc IO die.

Thunder 57 · Oct 9, 2019

Richie Rich said:
Don't be influenced by desktop CPUs. The 64c server Epyc 7742 has a base frequency of 2.25 GHz (it may boost to 2.5 GHz within TDP). Apple A13 runs at 2.66 GHz, so at server clocks the frequency is about the same, yet performance is around 50% higher for the fruit machine A13. Power consumption for A13 is around 4 W; subtract the consumption of the GPU and the 5 other idling/sleeping cores and it can be 3.5 W x 64c = 224 W (Epyc has a TDP of 225 W). Pretty comparable consumption with a massive +50% performance gain.

6xALUs is the killer feature. That is a loud alarm for Intel and AMD, and they should be losing sleep over it. So far they are lucky that this 6xALU beast is confined to the iPhone thanks to Apple management. Steve Jobs was a very challenging person, and IMHO he would have had the courage to shake up the server business with a server version of the 6xALU beast (Apple needs thousands of servers for its own cloud service too). And a cloud service lets Apple keep the HW in its own hands by selling a service instead of HW.

Zen 3 with 6xALUs will already be 3 years behind Apple in CPU technology (A11 appeared in 2017). If Zen 3 isn't a wide core, then it is a tragedy for x86, and ARM with Cortex A78 will take the server and laptop markets. Don't forget how superior archs like IBM PowerPC, Itanium, Motorola 68000 and DEC Alpha ended up: all of them were smashed by the cheap, mass-produced and thus faster-evolving dark horse called x86. Today history repeats itself, only this time the dark horse is ARM.

If it were that simple, why hasn't it been done already? I'm tired of hearing how ARM, and Apple in particular, are going to take over the world. This fight already happened in the '90s, and we know what happened. ARM has its place, but apparently it is not in the high end. Also, Itanium was not superior, it was a joke. "Let's depend on an ingenious compiler". We know how that ended, too.

Ajay · Oct 9, 2019

soresu said:
As much as a person can admire what Alpha accomplished in its time, that was 20 years ago now, during which time we have had at least 4-5 major uArch changes from AMD and Intel.

Is there not a paper somewhat more recent that covers higher SMT scaling from Sun/Oracle or IBM?

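Power 8 or 9 would be interesting; I'll look around if I can remember.

Newer x86 CPUs extract higher ILP than CPUs of 20 years ago, hence the lower SMT yield: a single thread already keeps more of the execution units busy, so a second thread has fewer idle slots to fill.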

soresu · Oct 9, 2019

Thunder 57 said:
If it were that simple, why hasn't it been done already? I'm tired of hearing how ARM, and Apple in particular, are going to take over the world. This fight already happened in the '90s, and we know what happened. ARM has its place, but apparently it is not in the high end. Also, Itanium was not superior, it was a joke. "Let's depend on an ingenious compiler". We know how that ended, too.

Despite continuous IPC improvements per gen, ARM seems to be quite focused on improving ML performance within the actual CPU cores.

The incoming Matterhorn core (likely Hercules/A78 successor from their slides yesterday) seems to have a heavy focus on ML with a doubling of GEMM performance using MatMul (possibly a pun based origin for the Matterhorn name too).

DrMrLordX · Oct 9, 2019

Richie Rich said:
Don't be influenced by desktop CPUs. Server 64c Epyc 7742 has a base frequency 2.25 GHz (may boost to 2.5 GHz within TDP). Apple A13 runs at 2.66 GHz too....

But now you're comparing a mobile chip with a server CPU that has an intricate (and brilliant) system of interconnects to make 64c work together. ARM DOES have an interconnect that can scale upward, and as Huawei has demonstrated, you CAN get 64c ARM CPUs onto the market. But those CPUs don't compete with Rome. Nobody has licensed Apple's Axx designs to produce a server CPU, either. Maybe they will, someday. Until then, the points you're making about technological progress aren't terribly valid.

Power consumption for A13 is around 4W, subtract consumption of GPU and idling/sleeping 5 more cores, it can be 3.5W x 64c = 224 W (Epyc has TDP 225W). Pretty comparable consumption with massive performance gain +50%.

Interconnect isn't free. Better jack up those power numbers.
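To make the objection concrete, here is a minimal Python sketch of the back-of-envelope math. The 3.5 W per core and 64 cores come from the quoted post; the `uncore_w` figure of 60 W is purely a hypothetical placeholder for fabric, memory controllers and IO, not a measured value for any real chip.

```python
def socket_power_watts(per_core_w, cores, uncore_w=0.0):
    """Package power = power of all cores + everything that is not a core."""
    return per_core_w * cores + uncore_w


# Cores only, exactly as in the quoted estimate.
cores_only = socket_power_watts(3.5, 64)

# Same cores plus a HYPOTHETICAL 60 W budget for interconnect and IO.
with_uncore = socket_power_watts(3.5, 64, uncore_w=60.0)

print(cores_only)   # 224.0
print(with_uncore)  # 284.0
```

Whatever the real uncore number would be, it gets added on top of the 224 W, so the "same TDP as Epyc" comparison only holds if the interconnect costs nothing.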

soresu · Oct 9, 2019

DrMrLordX said:
Nobody has licensed Apple's Axx designs to produce a server CPU, either. Maybe they will, someday.

On that day a squadron of flying pigs will salute to Lucifer's frozen backside in the seventh circle of hell.

Apple are not, and will never be the sharing type - the closest thing to sharing they ever did was giving away OpenCL, which they did not exactly chase up either.

DrMrLordX · Oct 9, 2019

soresu said:
On that day a squadron of flying pigs will salute to Lucifer's frozen backside in the seventh circle of hell.

Apple are not, and will never be the sharing type - the closest thing to sharing they ever did was giving away OpenCL, which they did not exactly chase up either.

Nah I wouldn't suggest Apple would do it for free. Anyone who wanted to make AppleServerCPU (TM) would have to pay hefty licensing fees. Apple pulled out of that sector so it's unlikely they'd see it as competition . . . at least at first anyway. If it gained traction, they could just pull the license (eventually) and interfere with their own product.

yuri69 · Oct 9, 2019

Ajay said:
Power 8 or 9 would be interesting; I'll look around if I can remember.
Newer x86 CPUs extract higher ILP than CPUs of 20 years ago, hence the lower SMT yield.

IBM's promo materials state over 30% gains when going from SMT4 to SMT8. So yeah...

soresu · Oct 9, 2019

DrMrLordX said:
Nah I wouldn't suggest Apple would do it for free. Anyone who wanted to make AppleServerCPU (TM) would have to pay hefty licensing fees. Apple pulled out of that sector so it's unlikely they'd see it as competition . . . at least at first anyway. If it gained traction, they could just pull the license (eventually) and interfere with their own product.

It's unlikely regardless, by sharing I meant basically any Apple born tech not leaving the company.

I'm not even sure it's actually possible, given that Apple is itself already using an architecture license from ARM. I've never heard of any other ARM-oriented company doing this.

Their stubborn clinging to Metal rather than accepting Vulkan is the reverse of this, they REALLY don't like using outside tech they don't control - I think the only reason they allow MoltenVK to prevail is because many of their 3rd party developers told them to shove Metal where the sun don't shine (my speculation), and MVK was the only way to save face.

soresu · Oct 9, 2019

yuri69 said:
IBM's promo materials state over 30% gains when going from SMT4 to SMT8. So yeah...

Say wut now?!

My guess would be that this applies to throughput in specific use cases, i.e. database queries, file serving, etc.: standard datacenter/server work, light on single-threaded IPC but heavy on raw throughput needs during peak times.
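A quick way to see why that kind of gain only helps throughput-bound work: a minimal Python sketch, assuming the "over 30%" figure is total core throughput when going from 4 to 8 threads. The function name is just for illustration.

```python
def per_thread_share(total_gain, threads_before, threads_after):
    """Per-thread throughput after the change, relative to before.

    total_gain is the fractional gain in whole-core throughput
    (0.30 means +30%).
    """
    return (1.0 + total_gain) * threads_before / threads_after


share = per_thread_share(0.30, threads_before=4, threads_after=8)
print(round(share, 2))  # 0.65
```

So under that assumption the core does 30% more work in total, but each individual thread runs at roughly 65% of the speed it had in SMT4 mode. That is fine for lots of independent queries and bad for anything latency sensitive.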

amd6502 · Oct 9, 2019

I think x86 and acorn kind of complement each other, and they both have their happy niches right now. I can see Apple adding much to the SoC portion (IO/uncore) and extending their core to work on PC and server. In fact, weren't they planning on ditching x86? Going homegrown would make sense in the next few years, since they seem to have a decent core. (At 2.5 GHz it would be fine for notebook and SFF desktop, and they may reach 3 GHz by the time they launch a desktop CPU.) I know almost nothing about this A12, but it seems to me that Apple is addressing the efficiency issue by having the small efficient cores do much of the computation.

I think the mobile-oriented monolithic Picasso successor might skip Zen 2 altogether and go for Zen 3, which is supposed to focus on power efficiency improvements over its predecessor. With the big 8c CCX in Zen 3 server chiplets, it seems somewhat likely they might do more than 4 cores. My best guess is 4 or 6 cores, because 8c for mobile-oriented mainstream is kind of crazy.

I think Ajay's DEC SMT graphs show really worthwhile gain from SMT2 to SMT3 for mixed code (which should be the primary objective). This is kind of why I think two non-speculative, mostly in-order threads would get good returns and be well matched (for a 4-6 ALU wide core); I'm guessing a pair of these threads might be equivalent to very roughly half to two-thirds of an SMT2 thread.

soresu · Oct 9, 2019

amd6502 said:
I think x86 and acorn kind of complement each other

You are making me feel old mentioning Acorn!

I remember my school had an Acorn PC way back in the 90s.

soresu · Oct 9, 2019

amd6502 said:
I think Ajay's DEC SMT graphs show really worthwhile gain from SMT2 to SMT3 for mixed code (which should be the primary objective).

Something to also bear in mind, compute workloads have changed since then as well as uArch's that run them.

CG rendering used to be pretty specialised, and digital video encoding with MPEG1 and 2 was still a relatively nascent (non mainstream) field in the 90s.

darkswordsman17 · Oct 10, 2019

Ajay said:
So, one last time on SMT-4. Here is a simulation-based chart for the Alpha architecture (this paper doesn't say which Alpha was used as the baseline; another doc I read in the late '90s did, but I can't find it). Sadly, the x and y axes are labeled in light grey. This was clearly a high-yield design, with what looks like an SMT-2 yield (throughput gain over a single thread) of 67% for integer workloads! FP workloads aren't nearly so impressive, but mixed loading is still good. The rapid falloff as the hardware thread count increases is one item that argues against going above SMT-2. In a low-yield processor like Zen 2 (~30% yield), the technical benefits of moving from SMT-2 to SMT-4 diminish beyond merit (given the duplication and enlargement of CPU resources required to get these marginal gains). Better, for now, to take advantage of shrinks, 3D stacking and other features to increase core count for maximum per-socket performance.

View attachment 11765

Link to paper (1999)

The interesting thing there is that going from 2 to 3 threads seems to show good gains; it's going from 3 to 4 where it really diminishes to the point of not being worth it. The rumors (at least the ones I've been talking about this whole time) were that it was pushing 3 threads instead of just 2. I don't think the rumors even talked about SMT4 (that was speculation added in the discussion that came about later).

Which I'm not claiming AMD will or won't as I have no idea (and I've been more wrong than right about AMD over the past year, I didn't think they'd make Radeon VII, I thought they'd have Zen 2 and even Navi ready to launch 1H of this year, I didn't expect the prices and board issues). Heck, I don't even know much about the specifics of this rumor, just that there supposedly were (are?) Xbox dev kits that had Zen chips in them that could do up to 3 threads per core.

I don't know enough about this stuff to have much in-depth discussion myself, but I think it's been kinda fun seeing discussions about various SMT (and CMT, and all the other stuff).

DrMrLordX said:
Nah I wouldn't suggest Apple would do it for free. Anyone who wanted to make AppleServerCPU (TM) would have to pay hefty licensing fees. Apple pulled out of that sector so it's unlikely they'd see it as competition . . . at least at first anyway. If it gained traction, they could just pull the license (eventually) and interfere with their own product.

I think Apple is actually planning on pushing versions of the new Mac Pro as racks (to be used as servers). Now that might just be since they're pushing this for A/V production stuff where they likely already have racks for various equipment, or so that well to do companies can rack mount it all up and then have employees connect in via terminal like setups. But it was an interesting aspect to the new Pro that I felt got overlooked. I think they said those would start shipping several months later.

NostaSeronx · Oct 10, 2019

Just want to post the SMT4 after investigating POWER7/POWER8/z13.

CMT in this isn't the same as 15h's, but closer to POWER7's SMT2/SMT4 and POWER9's execution slicing. However, all Zen3 cores have the same number of slices. There is also room for a lighter core with only a single slice, which can be condensed even further by removing the SMT logic.

The actual number of execution units is still up in the air, but it should allow for a 2x increase in performance compared to previous cores on legacy workloads. The purpose of the change to the sliced architecture is to increase the predictability/efficiency of SMT. The idea is to increase IPC within 50% area/80% performance; 1x8 ALUs is expensive, but 2x4 ALUs isn't. That makes 2x4 ALUs on par with or better than 1x6 ALUs, with more efficient execution happening on 2x4 ALUs.
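One rough way to see why 2x4 could be cheaper than 1x8: a toy Python sketch, assuming the forwarding/bypass cost inside a cluster grows with the square of the number of ALUs in that cluster and that traffic between clusters is free. Both are simplifying assumptions for illustration, not anything stated about Zen 3, and the function name is made up.

```python
def relative_bypass_cost(clusters, alus_per_cluster):
    """Toy cost model: each cluster pays alus_per_cluster squared.

    ASSUMPTION: full bypass inside a cluster, no cost between clusters.
    """
    return clusters * alus_per_cluster ** 2


for name, clusters, alus in [("1x4", 1, 4), ("1x6", 1, 6),
                             ("2x4", 2, 4), ("1x8", 1, 8)]:
    print(name, relative_bypass_cost(clusters, alus))
# 1x4 16
# 1x6 36
# 2x4 32
# 1x8 64
```

Under this model 2x4 costs half of 1x8 and slightly less than 1x6. A real design pays something for moving values between slices, so treat the numbers as a direction, not a measurement.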

DUV/HPC-CPP/7.5T to EUV/Mobile-CPP/6T (custom version) can give a pretty big shrink, on par with a normal node shrink; someone can calculate all the areas getting shrunk. With the addition of a second integer portion w/ L0ds, it comes out to a range of 2.4 mm2 to 2.6 mm2 for me. Also, the L0ds SLAQ portion must interconnect with each portion of the hi-low FPU, for double the 128-bit bandwidth and the same 256-bit bandwidth, plus added raw memory bandwidth with low capacity/low latency.

NTMBK · Oct 10, 2019

Yotsugi said:
Nowhere near ubiquitous enough.

What do you mean? AVX-512 has been in mainstream CPUs since Cannonlake! *snicker*

Richie Rich · Oct 10, 2019

soresu said:
On that day a squadron of flying pigs will salute to Lucifer's frozen backside in the seventh circle of hell.

Apple are not, and will never be the sharing type - the closest thing to sharing they ever did was giving away OpenCL, which they did not exactly chase up either.

Apple doesn't need to share anything in the server market: cloud and VPS can be sold as a service you connect to over the network. That goes hand in hand with Apple's paranoia, so it's not impossible. Steve Jobs had the courage to do it; Tim Cook doesn't. The first area they will likely expand into is laptops (iBook). I just wonder how long it will take the x86 world to react and evolve a 6xALU core. IMHO until next year, when the Zen 3 core delivers 6xALU and SMT4.

Yotsugi · Oct 10, 2019

Richie Rich said:
IMHO until next year, when the Zen 3 core delivers 6xALU and SMT4.

Zen3 is neither, hammer that into your head already.

Ajay · Oct 10, 2019

amd6502 said:
I think Ajay's DEC SMT graphs show really worthwhile gain from SMT2 to SMT3 for mixed code (which should be the primary objective). This is kind of why I think two non-speculative, mostly in-order threads would get good returns and be well matched (for a 4-6 ALU wide core); I'm guessing a pair of these threads might be equivalent to very roughly half to two-thirds of an SMT2 thread.

Thanks. But I think there are two issues with what you just said. One, compared to the DEC Alpha simulations, Zen's SMT-2 yield is low (about 55% lower: ~30% vs 67%). So you need to divide the SMT-2 to SMT-3 yield difference by roughly 2, which means it is less significant than it appears.
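To put rough numbers on that first point, here is a minimal Python sketch using the 67% (Alpha, integer) and ~30% (Zen 2) SMT-2 yields quoted in this thread. The 20% SMT-2 to SMT-3 step in the example is a made-up placeholder, not a value read off the chart, and scaling the step linearly by the yield ratio is itself an assumption.

```python
ALPHA_SMT2_YIELD = 0.67  # integer workloads, as read from the chart
ZEN2_SMT2_YIELD = 0.30   # rough figure quoted in the thread


def yield_ratio(low, high):
    """How much of the high-yield design's gain the low-yield one sees."""
    return low / high


def scaled_step(alpha_step_gain, low, high):
    """Scale an Alpha SMT-2 -> SMT-3 step by the yield ratio (linear guess)."""
    return alpha_step_gain * yield_ratio(low, high)


ratio = yield_ratio(ZEN2_SMT2_YIELD, ALPHA_SMT2_YIELD)
print(round(ratio, 2))      # 0.45, i.e. Zen 2 sees under half the gain
print(round(1 - ratio, 2))  # 0.55, the "55% lower" figure

# HYPOTHETICAL 20% step on the Alpha chart, scaled down for Zen 2.
print(round(scaled_step(0.20, ZEN2_SMT2_YIELD, ALPHA_SMT2_YIELD), 2))  # 0.09
```

So a step that looks like 20% on the Alpha chart would be closer to 9% on a Zen-like core under this scaling, which is the "divide by 2" argument in numbers.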

The other issue is what happens when you run 4 threads on a core, even with 6 ALUs. Since much of the front end is shared, you wind up with CMT-like performance (like Bulldozer), which is pointless. The Alpha EV8 was designed with 8 ALUs, IIRC, but that was because ILP (instruction level parallelism) was much lower back then than it is today.

Overall, it just doesn't seem to be worth it. There is still much to do in increasing performance for AMD's Zen-based architecture, and many changes coming down the line in process development (3D chips, for example). And these improvements will also bring higher ILP and IPC, resulting in lower yields for SMT in the long run. AMD has plenty of ideas to run with rather than falling back on SMT-4. So, really, let's please put this SMT-4 topic to bed: it's a non-starter (except for a silly unsubstantiated rumor on AdoredTV).

DrMrLordX · Oct 10, 2019

Richie Rich said:
IMHO until next year, when the Zen 3 core delivers 6xALU and SMT4.

There's already been a leaked video showing no SMT4 for Zen3.

Vattila · Oct 10, 2019

NTMBK said:
Of course, an active interposer could move some logic off the compute die and into the interposer, opening up all sorts of options for topology.

AMD has done a lot of research in that area (ref. research papers by Gabriel Loh), but even if they stick with the current 9-die chiplet design and topology (as now seems likely for Zen 3) and simply put the chiplets on top of a silicon interposer, there seem to be substantial power savings to be had. The recent interposer-based chiplet design by TSMC and ARM indicates 3 to 4 times the power efficiency for the chiplet interconnect:

"The inter-die interconnect consists of a Chip-on-Wafer-on-Substrate (CoWoS) interposer. More specifically, it uses TSMC’s upcoming LIPINCON interconnect architecture, which stands for Low-voltage-In-Package-INterCONnect. It is TSMC’s alternative to Intel’s AIB and upcoming MDIO chiplet interconnect, which Intel uses with its EMIB packaging technology. In that sense, LIPINCON is to CoWoS what AIB or MDIO is to EMIB. LIPINCON operates at 0.3 V and has a bandwidth of 8 Gb/s per pin and 320 GB/s total bandwidth. Bandwidth density is claimed at 1.6 Tb/s/mm2. It has an energy efficiency of 0.56 pJ/bit. For reference, AMD’s non-interposer Infinity Fabric consumes ~2 pJ/bit, while Intel has claimed as low as 0.3 pJ/bit for EMIB, and 0.5 pJ/bit for MDIO."

TSMC and Arm Show First 7nm Interposer-Based Chiplet System for HPC

TSMC and Arm have announced the industry's first 7nm chiplet system with a CoWoS interposer for HPC. It has four 4GHz Cortex-A72 cores per chiplet and uses the new LIPINCON interconnect.

www.tomshardware.com

TSMC Demonstrates A 7nm Arm-Based Chiplet Design for HPC

A look at a high-performance 7nm Arm-based chiplet architecture which was recently presented by TSMC at the 2019 VLSI Symposium.

fuse.wikichip.org
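The pJ/bit figures in the quote translate into link power like this. A minimal Python sketch: the 0.56 pJ/bit, ~2 pJ/bit and 320 GB/s numbers are from the quoted article, while running Infinity Fabric at the same 320 GB/s is only an assumption made so the two are comparable.

```python
def link_power_watts(bandwidth_gbytes_per_s, picojoules_per_bit):
    """Power needed to move data at a given rate and energy per bit."""
    bits_per_s = bandwidth_gbytes_per_s * 1e9 * 8
    return bits_per_s * picojoules_per_bit * 1e-12


lipincon = link_power_watts(320, 0.56)
# ASSUMPTION: same 320 GB/s pushed over a ~2 pJ/bit non-interposer link.
infinity_fabric = link_power_watts(320, 2.0)

print(round(lipincon, 2))                    # 1.43
print(round(infinity_fabric, 2))             # 5.12
print(round(infinity_fabric / lipincon, 2))  # 3.57
```

The ratio of about 3.6 is where the "3 to 4 times" figure comes from; the absolute watts scale with however much traffic actually crosses the link.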

itsmydamnation · Oct 10, 2019

Richie Rich said:
Apple doesn't need to share anything in the server market: cloud and VPS can be sold as a service you connect to over the network. That goes hand in hand with Apple's paranoia, so it's not impossible. Steve Jobs had the courage to do it; Tim Cook doesn't. The first area they will likely expand into is laptops (iBook). I just wonder how long it will take the x86 world to react and evolve a 6xALU core. IMHO until next year, when the Zen 3 core delivers 6xALU and SMT4.

Something like 50% of x86-64 non-SIMD operations on average contain a load or a store. The idea that there is all this free ILP just lying about, and all you need is some more ALUs, is quite frankly stupid. You need to be loading and storing more, which means you need a better front end to predict and prefetch, and better caches to have the data closer when your front end misses. What you need to do to get more ILP is to have the data in the core sooner, so go have a look at A12 vs Zen per core:

L1: 128KB (A12) vs 32KB (Zen)
L2: 8MB (A12) vs 512KB (Zen)

Because the Apple mobile SoCs are just that, they can have a small, tight interconnect with larger per-core caches; they don't have to worry about how they scale coherency/interconnect/size to 64 cores. Just look at the random latency of Zen 2 vs A12:

AnandTech Forums: Technology, Hardware, Software, and Deals

Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

www.anandtech.com

You can see how much longer a big A12 core can hold lower latency.

If you look at instruction throughput, you will also see that while A12 is 6 wide, its throughput on a per-instruction basis isn't much higher, and on many common instructions Zen 2 has slightly lower instruction latency (fpmul/mul/imul 4v3). If AMD wanted to hit the same instruction throughput as A12, they could just as easily add a little more functionality to the existing ALUs; no need to go 6 wide. One area AMD can obviously improve is DIV, but that's a latency problem, not a matter of the number of issue ports.

AnandTech Forums: Technology, Hardware, Software, and Deals

Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

www.anandtech.com

https://www.agner.org/optimize/instruction_tables.pdf

Apple is winning at IPC not by its width, but by its front end and its big low-latency caches.

Apple's internal width could just as easily be about power/clock gating, only powering up the ALUs with the more expensive complex logic when needed.
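The load/store argument can be shown with a toy port model. A minimal Python sketch, assuming every op needs one ALU slot and a fraction `mem_fraction` of ops also needs one memory port; the 0.5 fraction is the "something like 50%" from the post, and the port counts are hypothetical round numbers, not the real Zen 2 or A12 configuration.

```python
def ops_per_cycle_bound(alu_ports, mem_ports, mem_fraction):
    """Upper bound on sustained ops/cycle in a toy port model.

    ASSUMPTION: each op takes one ALU slot, and a share mem_fraction
    of all ops also takes one load/store port.
    """
    alu_bound = float(alu_ports)
    mem_bound = mem_ports / mem_fraction if mem_fraction > 0 else float("inf")
    return min(alu_bound, mem_bound)


for alus in (4, 6):
    print(alus, ops_per_cycle_bound(alus, mem_ports=2, mem_fraction=0.5))
# 4 4.0
# 6 4.0

# Only with more memory ports does the wider core pull ahead.
print(ops_per_cycle_bound(6, mem_ports=3, mem_fraction=0.5))  # 6.0
```

With two memory ports the bound stays at 4 ops/cycle whether there are 4 or 6 ALUs. It ignores cache misses and dependencies entirely, which only makes the memory side matter more.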

moinmoin · Oct 11, 2019

itsmydamnation said:
Something like 50% of x86-64 non-SIMD operations on average contain a load or a store. The idea that there is all this free ILP just lying about, and all you need is some more ALUs, is quite frankly stupid. You need to be loading and storing more, which means you need a better front end to predict and prefetch, and better caches to have the data closer when your front end misses. What you need to do to get more ILP is to have the data in the core sooner, so go have a look at A12 vs Zen per core:

L1: 128KB (A12) vs 32KB (Zen)
L2: 8MB (A12) vs 512KB (Zen)

Because the Apple mobile SoCs are just that, they can have a small, tight interconnect with larger per-core caches; they don't have to worry about how they scale coherency/interconnect/size to 64 cores. Just look at the random latency of Zen 2 vs A12:

AnandTech Forums: Technology, Hardware, Software, and Deals

Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

www.anandtech.com

You can see how much longer a big A12 core can hold lower latency.

If you look at instruction throughput, you will also see that while A12 is 6 wide, its throughput on a per-instruction basis isn't much higher, and on many common instructions Zen 2 has slightly lower instruction latency (fpmul/mul/imul 4v3). If AMD wanted to hit the same instruction throughput as A12, they could just as easily add a little more functionality to the existing ALUs; no need to go 6 wide. One area AMD can obviously improve is DIV, but that's a latency problem, not a matter of the number of issue ports.

AnandTech Forums: Technology, Hardware, Software, and Deals

Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

www.anandtech.com

https://www.agner.org/optimize/instruction_tables.pdf

Apple is winning at IPC not by its width, but by its front end and its big low-latency caches.

Apple's internal width could just as easily be about power/clock gating, only powering up the ALUs with the more expensive complex logic when needed.

Thanks, this is an excellent argument for Zen 3 being an optimization round. Some form of advanced TAGE predictor is assumed to be standard in CPUs by Intel and Apple already, and lagging AMD only introduced it in Zen 2 (and even there it was mentioned as one of the parts originally intended for Zen 3). The predictor needs to improve to be able to use the prefetcher more efficiently. A lot of the patents DisEnchantment previously mentioned in this thread are based around cache handling improvements that only make sense in conjunction with significant improvements in the predictor and prefetcher logic. This also meshes well with the leaked announcement that the two CCXes' L3$ on each CCD will be "unified" in Zen 3, which IMO in the light of the above information is not necessarily a statement about the topology but could be more about significant changes in the cache handling.
