Speculation: Ryzen 4000 series/Zen 3


Thunder 57

Platinum Member
Aug 19, 2007
2,640
3,697
136
The one way to know if something is actually true is for it to be the exact opposite of what Nosta says it is.

Where is my Tunnelboerer!!!



Hahaha that's great. Have you seen another recent one? This guy has a fetish for Bulldozer and FDSOI. Possible updates to Bulldozer include:

Micro-op queue, 2 BTB branch prediction(branch calculated per core queue), CX/AX; complex execution(iMUL/iDIV/CRC/Branch+ALUs) and address execution(AGUs+ALUs), more flexible FPU, L0 caches, etc. However, most of that stuff is more ideal on 12FDX which is to 14nm FinFET as 22FDX is to 28nm.
 
Reactions: CHADBOGA

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
Where is my Tunnelboerer!!!
Went to Intel in September 2016: "Next-Generation Processors. Floorplan/Microarchitecture co-design, Full-chip Integration and Design Planning. Scalar/Vector Execution Unit Design Lead"

Relative to everything else, it started on the 28nm FDSOI node in June 2015. "28FDS entered production in 2015" The basis of the new CMT plan came from the "500MHz 28nm prototype in 2014." The main reason to follow this development line is that it can appear in other cores. For example, consider who holds both AMD64 and an ARM micro-architectural license. It would be nice to have a single core that can do ARM64 and AMD64 natively at the same time, one that can be agnostic to any of the cores and is less than 2.8 mm2 on smaller nodes only (7nm & 6nm EUV / 5nm & 4nm EUV / 3nm EUV). ARM has the largest market share, with x86 (AMD64) in second place and PowerPC third. *cough* https://patents.google.com/patent/US7124286B2 (Software Embodiments & two names) *cough* https://patents.google.com/patent/US20060015707A1 (First lines) */cough*
 

soresu

Platinum Member
Dec 19, 2014
2,582
1,778
136
less than 2.8 mm2
At this point SRAM scaling is so bad that quoting core size is not very meaningful for a full-blown desktop processor; the SRAM ends up taking far more space. That is part of the reason I'm excited to see a shift to some form of MRAM for L2 and L3 caches (beyond the obvious power-consumption advantages of persistence).
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
At this point SRAM scaling is so bad that quoting core size is not very meaningful for a full-blown desktop processor; the SRAM ends up taking far more space. That is part of the reason I'm excited to see a shift to some form of MRAM for L2 and L3 caches (beyond the obvious power-consumption advantages of persistence).
AMD uses custom SRAM tiles, which might or might not scale better than standard 6T SRAM tiles.
Samsung EUV 6T Std SRAM => 0.0262 µm² per bit cell
TSMC DUV 6T Std SRAM => 0.027 µm² per bit cell
EUV should give an average area density increase of 1.1x going from 7.5T DUV logic tiles to 7.5T EUV logic tiles. However, TSMC quotes 1.2x logic density at the same performance for N7+, with per-customer performance optimization.
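
To put the SRAM point in numbers, here is a rough sketch of how much area the bit cells alone take for a given cache size. It assumes the commonly cited high-density 6T cell areas of 0.0262 µm² (Samsung 7nm EUV) and 0.027 µm² (TSMC 7nm DUV); the 32 MiB capacity and the function name are hypothetical, picked only for illustration. Real arrays are larger once tags, ECC, sense amps and wiring are added.

```python
# Assumed high-density 6T SRAM bit-cell areas in square micrometres.
BITCELL_AREA_UM2 = {
    "Samsung 7nm EUV": 0.0262,
    "TSMC 7nm DUV": 0.027,
}


def raw_sram_area_mm2(capacity_mib: float, bitcell_um2: float) -> float:
    """Area of the bit cells alone, ignoring tags, ECC and periphery."""
    bits = capacity_mib * 1024 * 1024 * 8
    return bits * bitcell_um2 / 1e6  # 1 mm^2 = 1e6 um^2


# Hypothetical 32 MiB cache, chosen only as an example capacity.
for node, cell_area in BITCELL_AREA_UM2.items():
    area = raw_sram_area_mm2(32, cell_area)
    print(f"{node}: about {area:.2f} mm^2 of bit cells for 32 MiB")
```

Under these assumptions the bit cells alone come to roughly 7 mm², which is already well above the "less than 2.8 mm2" core figure being discussed.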

If AMD uses 6T EUV logic tiles then logic will be even more dense. A from-scratch approach with EUV can give a smaller core than Zen2, and higher frequencies and lower power.

Going from HPC 7.5T DUV to Mobile 6T DUV is a 13% loss in performance. Moving to 6T EUV, the per-customer EUV optimization is expected to give a 3% to 5% increase, and EUV by itself a 10% increase at the same power. So the performance lost can be regained in a shrunk architecture. Since it is Family 19h and not Family 17h, there is a chance of a larger performance (frequency) increase. It can also use the extra area for better and newer units.
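
A quick check of how those percentages combine, as a sketch only. It assumes the three effects are independent and multiply together, which the post does not state; the function name is made up for illustration.

```python
def net_performance(loss_6t: float, euv_gain: float, customer_opt_gain: float) -> float:
    """Compound the relative factors (percentages multiply, they do not add)."""
    return (1 - loss_6t) * (1 + euv_gain) * (1 + customer_opt_gain)


# 13% loss from 7.5T -> 6T, 10% from EUV, 3% to 5% from per-customer optimization.
low = net_performance(0.13, 0.10, 0.03)
high = net_performance(0.13, 0.10, 0.05)
print(f"net vs 7.5T DUV: {low:.3f}x to {high:.3f}x")
```

Under that assumption the result lands between about 0.986x and 1.005x of the 7.5T DUV baseline, so the loss is roughly cancelled rather than clearly beaten.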
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
The same thing will happen again when they switch to the Nanosheet/GAA/MBCFET device type at 3nm for Zen and RDNA
Really interested in the device characteristics for '3'nm MBCFET. Hope we get a deep dive on that.
 

soresu

Platinum Member
Dec 19, 2014
2,582
1,778
136
Really interested in the device characteristics for '3'nm MBCFET. Hope we get a deep dive on that.
From what I saw from Samsung, it has something like a 20% advantage over an ISO pitch finFET in power, but a disadvantage in device density.

Samsung 3nm MBCFET could be similar density to TSMC 5nm finFET, albeit with superior power consumption at the same voltage.

A future 'forksheet' device evolution from the initial nanosheet MBCFET will supposedly deliver superior density to finFET with even better power than nanosheet, but that could likely be more than one generation of Nanosheet down the road.
 
Reactions: DarthKyrie

soresu

Platinum Member
Dec 19, 2014
2,582
1,778
136
SMT has been on desktop for over a decade now, and Windows scheduler still has trouble making sense of it. Oh, and people now know that SMT is a security nightmare, due to inherent nature of resource-sharing design. If you want to share your bathroom with your guests, you'd better be sure you have no skeletons in the medicine chest.

I'd say SMT's time has passed now that core counts are actually increasing. Back in the day when the CPUs had one or two cores SMT made sort of sense, but today AMD and Intel's time will be better spent figuring out how many actual cores they can fit in a limited space without sacrificing performance/power.

AMD already tried to make a convoluted resource-sharing scheme work and failed miserably. It's called Bulldozer.
Don't mistake SMT for Bulldozer's CMT implementation.

It's not resource sharing so much as resource optimisation - allowing for potentially idle resources to be utilised in order to increase MT perf/watt/area.

Basically it's about putting a bit more silicon into the core to more fully utilise its resources in situations not ideal for full core saturation with a single thread - not even close to an expert there and I'm pretty sure there's something about cache or branch prediction misses in there too.

The question is whether the SMT specific area investment is less than the extra area that would be required to add more cores to make up the MT shortfall if you remove it - my bet would be yes it is less.
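
That trade-off can be written down as a toy model. Everything numeric here is a placeholder: the 5% area overhead and 25% MT gain are hypothetical values, not measured figures for any real core, and the model ignores fractional cores, uncore area and scaling losses.

```python
def area_for_throughput(target: float, core_area: float,
                        smt_area_overhead: float, smt_mt_gain: float):
    """Return (area with SMT, area without SMT) needed to hit `target`
    throughput, where one non-SMT core delivers 1.0 unit of throughput."""
    cores_with_smt = target / (1 + smt_mt_gain)
    area_with = cores_with_smt * core_area * (1 + smt_area_overhead)
    area_without = target * core_area
    return area_with, area_without


# Hypothetical inputs: 5% extra area per core buys 25% more MT throughput.
with_smt, without_smt = area_for_throughput(8.0, 1.0, 0.05, 0.25)
print(f"with SMT: {with_smt:.2f} core-areas, without: {without_smt:.2f}")
```

In this model SMT wins whenever the fractional area overhead is smaller than the fractional MT gain, which is the bet being made above.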

I honestly think even CMT isn't as bad as made out by Bulldozer's failure, though AMD made the mistake of not pushing single threaded IPC in an era where application parallel threading was still far from optimal (if not non existent in many areas like web browsing) - and their final design was clearly suboptimal to whatever initial projections they had made in the concept phase.

It leaves me wondering how a larger Excavator-based chip (8+ modules) would have performed in highly threaded situations, given the perf/watt improvement over Bulldozer with relatively minor process improvements. It's hard to get a good read on hardcore MT improvements with Steamroller and Excavator, given that AMD abandoned any attempt at server chips (or enthusiast DT chips) based on them.

Either way I don't expect to see AMD make any further attempt to create a CMT based uArch due to PR blowback alone.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
From what I saw from Samsung, it has something like a 20% advantage over an ISO pitch finFET in power, but a disadvantage in device density.
Samsung's 3nm GAAE process is 35% increase in performance, 50% decrease in power, 45% smaller area compared to 7nm. It is 15% higher performance compared to 4LPP.
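
Two figures follow from those numbers with simple arithmetic, sketched below. This assumes both performance claims were measured under the same conditions, which the marketing numbers do not guarantee; the function names are hypothetical.

```python
def density_gain(area_reduction: float) -> float:
    """A 45% smaller area means 1 / (1 - 0.45) times the density."""
    return 1.0 / (1.0 - area_reduction)


def implied_ratio(perf_vs_a: float, perf_vs_b: float) -> float:
    """If X is perf_vs_a times A and perf_vs_b times B, B is this many times A."""
    return perf_vs_a / perf_vs_b


print(f"3GAAE density vs 7nm: {density_gain(0.45):.2f}x")
print(f"implied 4LPP perf vs 7nm: {implied_ratio(1.35, 1.15):.3f}x")
```

So the quoted claims would put 3GAAE at about 1.8x the density of 7nm, and would imply 4LPP sits roughly 17% above 7nm in performance, if the baselines really are comparable.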

Same design rules: 7LPP -> 5LPE = UHD option(contact op + 1-fin) + power optimization, 5LPE -> 4LPE = Improved UHD option(contact op #2 + Mx shrink), 4LPP = Improved performance

stackednanosheet.png


7LPP is "54nm CPP" and "36nm Mx"
4LPE is "54nm CPP" and "28nm M1/32nm M2"
3GAAE could be the 44nm NFET CPP and 48nm PFET CPP "5nm Nanosheet", which is where I got the figures above from. Or, it can aim for an even lower CPP like 36nm for NFET and 40nm for PFET. It isn't design rule locked with the 7nm node.
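
As a first-order way to compare those pitches, the product of CPP and metal pitch can serve as a crude footprint proxy. This is a simplification: real cell area also depends on track height, fin or sheet count and design rules, none of which are modelled here, and the function name is made up.

```python
def footprint_ratio(cpp_new_nm: float, mx_new_nm: float,
                    cpp_old_nm: float, mx_old_nm: float) -> float:
    """Ratio of CPP x metal-pitch products; below 1.0 means a smaller footprint."""
    return (cpp_new_nm * mx_new_nm) / (cpp_old_nm * mx_old_nm)


# 7LPP: 54nm CPP, 36nm Mx. 4LPE: 54nm CPP, 28nm M1 / 32nm M2.
print(f"4LPE vs 7LPP using 28nm M1: {footprint_ratio(54, 28, 54, 36):.3f}")
print(f"4LPE vs 7LPP using 32nm M2: {footprint_ratio(54, 32, 54, 36):.3f}")

# The speculated 3GAAE NFET CPPs, compared on CPP alone (metal pitch unknown).
print(f"44nm CPP vs 54nm CPP: {44 / 54:.3f}")
print(f"36nm CPP vs 54nm CPP: {36 / 54:.3f}")
```

With CPP unchanged, all of the 7LPP to 4LPE shrink in this proxy comes from the metal pitch; a lower CPP on 3GAAE would add a further reduction on top.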

Gate length scaling is back, no more wimpy gates or weird scaling for I/O FinFETs. There is also poly-bi...*cough*continuous nano-sheet width scaling in track height, rather than 4-fin/3-fin/2-fin/1-fin etc.

The only benchmarks Samsung have displayed are with same CPP:
3nm.png

With 3GAAP fixing PMOS performance with intrinsic in-plane stressors like FDSOI nodes.
 

DrMrLordX

Lifer
Apr 27, 2000
21,571
10,764
136
Either way I don't expect to see AMD make any further attempt to create a CMT based uArch due to PR blowback alone.

From an end-user's perspective, CMT had two major drawbacks:

1). FP performance relied too heavily on either Fusion/HSA (which never materialized) or SIMD (XOP, basically). If you had an XOP-optimized application, then it was "okay", but in comparison, an Intel SMT-based CPU with AVX/AVX2 could still smoke it.
2). Modules really needed to be fully-loaded with threads to perform near their peak. You needed 2 threads per module, or you were leaving performance on the table. The possible exception being XOP-optimized fp applications. But otherwise, AMD's CMT CPUs were almost entirely based on throughput.

I can't see AMD going that route again unless they could figure out how to overcome those drawbacks.
 

soresu

Platinum Member
Dec 19, 2014
2,582
1,778
136
Modules really needed to be fully-loaded with threads to perform near their peak. You needed 2 threads per module, or you were leaving performance on the table. The possible exception being XOP-optimized fp applications. But otherwise, AMD's CMT CPUs were almost entirely based on throughput
Therein lay AMD's gamble on heavily threaded apps as I mentioned earlier.

Unfortunately AMD's hope and software engineer enthusiasm didn't really mesh at the time, or perhaps certain fields just needed more time to create viable solutions (like Mozilla's SharedArrayBuffer for web multithreading, and various web worker types).

It might also be noted that the main problem with BD's throughput design was when you put it in desktops, laptops and consumer workloads - typically many server applications are extremely throughput orientated from what I have seen.

Certainly at this point many fields are much more threaded than they were 8 years ago when Bulldozer launched.

Hopefully going forward AMD has adopted the age old maxim "Hope for the best, plan for the worst".
 

soresu

Platinum Member
Dec 19, 2014
2,582
1,778
136
FP performance relied too heavily on either Fusion/HSA (which never materialized) or SIMD (XOP, basically)
I was under the impression that scalar FP/x87 is pretty redundant these days compared to SIMD.

There was some talk initially of x265 using HSA, but I suspect what little money AMD could spare to throw at them for R&D was drowned in an avalanche of green that fell from Mount Blue. To my great disappointment, they never mentioned it again even once.

Sadly I suspect if you want to go down the more unique route in computing you need to throw a lot of money at developers to get the ball rolling, that and basically do a large amount of API/framework development for them (nVidia CUDA/Optix case in point), software devs do enjoy those free lunches it seems.

Also Bulldozer did support AVX too, XOP just added instructions that had no equivalent in AVX at the time BD was designed (including FMA4 of course, which Intel clearly sold them down the river with by doing a 180 and going to FMA3).
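
For anyone unfamiliar with the FMA4/FMA3 split, the difference is in operand handling: FMA4 writes to a separate destination register, while FMA3 overwrites one of its three sources. The sketch below models only that register behaviour, using a plain dict as a stand-in register file. The register names are hypothetical, and ordinary Python floats round twice, so the single rounding of a real fused multiply-add is not modelled.

```python
def fma4(regs: dict, dst: str, a: str, b: str, c: str) -> None:
    """Four-operand form: dst = a * b + c, all three sources preserved."""
    regs[dst] = regs[a] * regs[b] + regs[c]


def fma3_231(regs: dict, dst: str, b: str, c: str) -> None:
    """Three-operand '231' form: dst = b * c + dst, so dst's old value is lost."""
    regs[dst] = regs[b] * regs[c] + regs[dst]


regs = {"x0": 0.0, "x1": 2.0, "x2": 3.0, "x3": 4.0}

fma4(regs, "x0", "x1", "x2", "x3")
print(regs)  # x0 holds the result, x1..x3 are untouched

fma3_231(regs, "x3", "x1", "x2")
print(regs)  # x3 now holds the result; its original value is gone
```

The three-operand form needs an extra register copy whenever the overwritten source is still needed later, which is the practical cost of the encoding Intel settled on.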

I think AVX2 must have addressed most of the remaining instructions if not all of it, considering XOP is deprecated now - at least that is the impression I got of why some apps used XOP at the time.
 

DrMrLordX

Lifer
Apr 27, 2000
21,571
10,764
136
typically many server applications are extremely throughput orientated from what I have seen.

Some are, some aren't. If VM response times and overprovisioning of hardware are on the table, then CMT gets back to its intrinsic disadvantages that you encounter in consumer workloads (more or less).

Certainly at this point many fields are much more threaded than they were 8 years ago when Bulldozer launched.

They are, but with the core wars you basically have the same problem. Yes, it's more likely that a 4M CMT chip will see all its resources used today . . . but AMD has been producing 8c SMT chips for over two years. Would an 8M chip on the same process from the same design team ever be as flexible in terms of resource allocation as Zen2 is today or Zen3 will be tomorrow? Likely not. When you run what is essentially an 8t workload with light SIMD on 8c/16t Zen2, you are leaving maybe 25-30% of your execution resources idle. A hypothetical 8M/16t XV in the same scenario is leaving somewhere around 45% of its execution resources idle. I remember testing that on my old Steamroller. Going to 1 thread per module resulted in ridiculous losses of performance.

If we get to the point where software developers start pushing out tons of software that demands more thread-level parallelism than CPUs can realistically provide, then maybe CMT will make more sense. That isn't happening right now.

Also Bulldozer did support AVX too, XOP just added instructions that had no equivalent in AVX at the time BD was designed (including FMA4 of course, which Intel clearly sold them down the river with by doing a 180 and going to FMA3).

XOP was just faster than AVX128 overall, I think. When properly implemented.

Take a look here:


Specifically, observe the 4.2GHz 3930k time (233.251s) vs a 4.21 GHz FX-8350 (267.329s). Both are 128-bit SIMD implementations. Those 4 PD modules are very close to 6 Sandy cores. A hypothetical 6M PD @ ~4.2GHz would have turned in a time of around 178s, which interestingly enough, is really darn close to some of the 4930k numbers on that list (which supported AVX256, but not AVX2). AMD's 2013 chip was dangerously close to Intel's 2011 chip in a very narrow set of circumstances. The potential was there. AMD was limited by process and software support.
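
The ~178s estimate is just the FX-8350 time scaled by module count, as the sketch below shows. It assumes perfectly linear scaling from 4 to 6 modules at the same clock, which is optimistic for any real chip; the function name is hypothetical.

```python
def scale_time_by_modules(time_s: float, modules_measured: int,
                          modules_target: int) -> float:
    """Estimate run time on a wider chip assuming perfectly linear scaling."""
    return time_s * modules_measured / modules_target


fx8350_time = 267.329  # 4 modules @ 4.21 GHz, from the list referenced above
i7_3930k_time = 233.251  # 6 cores @ 4.2 GHz

estimate = scale_time_by_modules(fx8350_time, 4, 6)
print(f"hypothetical 6M PD: {estimate:.1f}s")
print(f"relative to the 3930K time: {estimate / i7_3930k_time:.2f}x")
```

Any shared-resource contention, memory bandwidth limit or clock drop on a larger die would push the real number above this estimate.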

In a hypothetical alternate universe where GF iterated quickly and successfully on advanced SOI nodes and software developers supported XOP, XOP2, etc., AMD would have done much better with their con cores.

In our universe, nobody much supported FMA4 + XOP and GF wrecked AMD with dated process nodes. 2013's Haswell with AVX2 left XOP in the dust. The one narrow circumstance where XOP enabled AMD to sort-of reach parity with Intel in fp performance swung wildly in Intel's favor. The rest is history.

I think AVX2 must have addressed most of the remaining instructions if not all of it, considering XOP is deprecated now - at least that is the impression I got of why some apps used XOP at the time.

AVX2, as implemented in Haswell, annihilated anything that ever supported XOP. The difference is just night and day. If AMD ever expects to go back to CMT and tries to use SIMD to shore up an otherwise-weak fp unit, they've got to have something better than XOP under their belt. It would be interesting to see (in one of those alternate universes) what a CMT CPU that supported SVE2 would be like.
 
Reactions: lightmanek

maddie

Diamond Member
Jul 18, 2010
4,717
4,615
136
If the SoftMachines work has any benefits, and Intel seems to think so, having bought the company, then the whole CMT/SMT argument might be obsoleted soon.
 

soresu

Platinum Member
Dec 19, 2014
2,582
1,778
136
If the SoftMachines work has any benefits, and Intel seems to think so, having bought the company, then the whole CMT/SMT argument might be obsoleted soon.
The Soft Machines model seems to work similarly under the hood to the reconfigurable architecture that Microsoft is working on with E2 (EDGE ISA), but that is going off topic a bit (as is CMT but I do get carried away there).
 

soresu

Platinum Member
Dec 19, 2014
2,582
1,778
136
they've got to have something better than XOP under their belt
My reading is that XOP was not intended as a direct competitor to AVX.

AMD initially announced a larger instruction set called SSE5 before AVX was announced by Intel.

At that point the instructions in common with AVX were set aside, and the remaining ones became the XOP instruction set, so complementary if you will.

I suspect that AMD announcing SSE5 forced Intel's hand with AVX before it was fully finished cooking, and obviously they did not want AMD setting the tone with a new SSE version.

As XOP included separate instructions from AVX it could accelerate more workloads than AVX alone could, so it is unsurprising that XOP optimised software was initially performing well for AMD.

I suspect the superior scores of AVX2 on Haswell had more to do with Haswell uArch itself than AVX2 - though I think AVX2 did add other things lacking in XOP too.
 

moinmoin

Diamond Member
Jun 1, 2017
4,926
7,609
136
AMD's implementation of SMT in Zen completely replaced CMT, I see no way how they'd ever want to go back even on a purely technical level.

Look at it this way, CMT can be thought of as a very crude version of Zen's SMT where everything was statically partitioned to either one of the two supported threads except the FP unit (scheduler and 2x FMACs) that's dynamically shared. Zen's SMT now competitively shares all the core's resources, with only the uop, retire and store queues remaining as statically partitioned resources.
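
The difference between the two policies can be shown with a toy allocator for one shared queue. This is only a sketch: the 64-entry size is a hypothetical number, and the proportional split under contention is one simple arbitration choice picked for illustration, not a description of how Zen actually arbitrates.

```python
def allocate(total: int, demand_a: int, demand_b: int, policy: str):
    """Entries granted to threads A and B from one queue of `total` entries."""
    if policy == "static":
        # Each thread owns a fixed half, even when its sibling is idle.
        half = total // 2
        return min(demand_a, half), min(demand_b, half)
    if policy == "competitive":
        if demand_a + demand_b <= total:
            return demand_a, demand_b
        # Oversubscribed: split in proportion to demand (assumed arbitration).
        granted_a = total * demand_a // (demand_a + demand_b)
        return granted_a, total - granted_a
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical 64-entry queue, one busy thread and one idle thread.
print(allocate(64, 60, 0, "static"))       # the busy thread is capped at half
print(allocate(64, 60, 0, "competitive"))  # the busy thread gets what it asks for
print(allocate(64, 60, 20, "competitive")) # both busy: shared by demand
```

With one thread idle, the static policy strands half the queue, which is the same effect as the under-loaded module problem described for CMT earlier in the thread.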
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
AMD's implementation of SMT in Zen completely replaced CMT, I see no way how they'd ever want to go back even on a purely technical level.
This might help...
csmtbypass.png

I'll let people figure out what it means for themselves. Just note that it is the 32nm/28nm design that has four ALUs and four AGUs, with a 256-entry retire queue, an 80 to 96-entry scheduler queue, etc. They were always going down the path of physical/virtual from the company that shall not be named.
 

maddie

Diamond Member
Jul 18, 2010
4,717
4,615
136
The Soft Machines model seems to work similarly under the hood to the reconfigurable architecture that Microsoft is working on with E2 (EDGE ISA), but that is going off topic a bit (as is CMT but I do get carried away there).
Not really off topic as a "reconfigurable architecture" using virtual cores allows the best use of hardware, as you don't need to make upfront design decisions as to thread hardware resources needed for the future. AMD was an investor in SoftMachines for years before Intel acquired them. I assume some IP was obtained.
 

soresu

Platinum Member
Dec 19, 2014
2,582
1,778
136
Not really off topic as a "reconfigurable architecture" using virtual cores allows the best use of hardware, as you don't need to make upfront design decisions as to thread hardware resources needed for the future. AMD was an investor in SoftMachines for years before Intel acquired them. I assume some IP was obtained.
As was Samsung I think.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
AMD's implementation of SMT in Zen completely replaced CMT, I see no way how they'd ever want to go back even on a purely technical level.

Look at it this way, CMT can be thought of as a very crude version of Zen's SMT where everything was statically partitioned to either one of the two supported threads except the FP unit (scheduler and 2x FMACs) that's dynamically shared. Zen's SMT now competitively shares all the core's resources, with only the uop, retire and store queues remaining as statically partitioned resources.
What if AMD created a shared front-end for the whole CCX? This would bring some of the advantages of CMT while still using SMT for the back-end.
1) This could save some transistors and increase throughput.
2) It allows HW control over threads within the CCX. It could eliminate the crazy Windows scheduler shuffle.
 
Reactions: DarthKyrie

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
1) This could save some transistors and increase throughput.
Fewer xtors maybe, but worse throughput, and nightmarish hardware thread scheduling. As it is, the Windows thread scheduler is non-deterministic. Adding a hardware load balancing algorithm to the dispatcher would give server app developers a really tough time tuning per-thread performance. Much less of a problem for client systems, where average load levels are much lower.

Well, I guess the benefit of real time load (power, performance?) balancing would be incredible responsiveness to varying demand. Damn, now I need to think about this - I’m sure there’s been a lot of research done on this. Seems like a runtime thread scheduler could be getting pretty big, and hot, doing real-time instruction stream analysis (statistical) to slot the threads correctly for maximum ILP or minimum power. Nuts!
 

DrMrLordX

Lifer
Apr 27, 2000
21,571
10,764
136
AMD's implementation of SMT in Zen completely replaced CMT, I see no way how they'd ever want to go back even on a purely technical level.

Generally agreed. It's just so much easier to keep a core busy when there's SMT under the hood.

My reading is that XOP was not intended as a direct competitor to AVX.

All of that is possible, and somewhat academic at this point unless some version of XOP emerges in the future. Personally I'd like to see AMD throw out AVX altogether in favor of SVE2 but . . . that's unlikely to happen. Instead we're probably going to see AVX512 support in Zen3 which is not thrilling. But Intel has moved the market in that direction, so I guess AMD needs to follow.

If we're lucky, AMD will double the number of FMACs and support AVX512 through op fusion à la Zen/Zen+.