AMD Bristol/Stoney Ridge Thread

Page 62

DrMrLordX

Lifer
Apr 27, 2000
21,629
10,841
136
You are asking for the return of the CON's one-year core update cycles. With such CON cycles AMD would dig itself back into big troubles.

Honestly, at this point, I would be happy with 15 months. Or, you know, a consumer roadmap at least? Instead of silence?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
I don't think they want to spend 7 nm wafers on something like this. Hence the Zen 3 on GloFo 12 talk.
Monet gets the same cost by redesigning for 12LP+.
Zen3 from 7nm to 12LP+ => $100M+ redesign
RDNA2 from 7nm to 12LP+ => $100M+ redesign
Any 7nm additions to 12LP+ => $100M+ redesign

Mendocino gets a lower cost by reduction of masks from 7nm:
Zen2 from 7nm to 6nm => Lower mask, same node gen, lower cost
RDNA2 from 7nm to 6nm => Lower mask, same node gen, lower cost
Any 7nm additions to 6nm => Same node, same rules, lower mask count
For the same reason 22FDX is selected, TSMC's FF-RF outclasses GlobalFoundries' FF-RF; https://www.tsmc.com/english/news-events/blog-article-20210603

Redesign an older microarchitecture to better fit value BGA and essential PGA/LGA:
CPU 28nm to 22FDX = ~$40M refactored
GPU 28nm to 22FDX = ~$40M refactored
New systems IP from 28nm to 22FDX = ~$40M refactored

For HVM parts (>100,000-unit chip quantities), most of the money goes to manufacturing, not design costs.
22FDX processed wafer is on-par with 28SLP processed wafer.
Which is ~1.6x lower than 20LPM/SHP and ~3.2x lower than 14LPP/12LP/12LP+ and ~6.4x lower than 7FF/6FF.
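
A rough sketch of the ratios above, normalizing a 22FDX/28SLP processed wafer to 1.0. The dictionary and function names are made up for illustration; the multipliers are only the approximate figures quoted in this post, not foundry pricing.

```python
# Relative processed-wafer cost, normalized to 22FDX/28SLP = 1.0.
# Multipliers are the approximate figures quoted in the post (assumption:
# they apply to the whole node group listed on each line).
RELATIVE_WAFER_COST = {
    "22FDX / 28SLP": 1.0,
    "20LPM / 20SHP": 1.6,
    "14LPP / 12LP / 12LP+": 3.2,
    "7FF / 6FF": 6.4,
}


def cost_ratio(node_a: str, node_b: str) -> float:
    """How many times more expensive a node_a wafer is than a node_b wafer."""
    return RELATIVE_WAFER_COST[node_a] / RELATIVE_WAFER_COST[node_b]


for node, multiplier in RELATIVE_WAFER_COST.items():
    print(f"{node:>22}: {multiplier:.1f}x the 22FDX wafer cost")
```

Each step in the quoted chain is a doubling, so under these figures a 7FF/6FF wafer costs about 2x a 14LPP-class wafer and about 4x a 20nm-class wafer.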

So inserting on 22FDX means overall the manufacturing side which dominates costs is more feasible as a $25 APU, which can be paired with Mobo+LPDDR adding $50.

Lower cost on design, lower cost on manufacturing, as well as lower cost on sort and packaging(supply-chain stuff).

12LP+ goes from Malta to Dresden for dice-sort then to the more advanced packaging facility at Asia.
22FDX would be fabbed and sorted at Dresden and uses the more mature packaging facility at Asia.

---
There are also design implications going to 22FDX that favor planar Bulk to FDSOI rather than 7nm FinFET-Gen to 14nm FinFET-Gen.

bobcat40nm.png
Bobcat has high RVT/HVT focus, explained off by high Vdd/low leakage-power for 1.7 GHz being 1.1V(BC) to 1.4V(WC)

bulldozer32nmandsteamroller28nmc.jpeg
Bulldozer-Excavator has high RVT focus, explained off by needing low leakage-low power to high frequency-low power for HPC/Server/Client.
Centurion == 220W 4.4-5 GHz
Opteron HE 16-core = 2 Die-85W 1.8-2.3 GHz
Opteron HE 8-core = 1 Die-40W 2-2.3 GHz

jaguar28nm.png
Jaguar-Puma has increased LVT focus, explained off by high frequency at low power and freq-cap at 25W.

zen14nm.jpeg
Zen is the same as Jaguar-Puma and freq-cap at high-TDP.

A 22FDX design can have the same Vt mix as Puma/Zen, where 28nm Puma hit a frequency wall at ~2.5 GHz and Zen at ~4.1 GHz on 14nm and ~4.35 GHz on 12nm. There is no such frequency-voltage wall on 22FDX, since LVT(flipped-well)=RVT(conventional-well). This means LVT and even sLVT can increase voltage without bumping up leakage the way Puma/Zen do. Essentially, having an enhanced volt/freq scaling line relative to PDSOI negates the bulk transition issues(PDSOI -> Bulk).

22FDX has reduced gate-effects and wire-effects. Thus, allowing for increased Fmax at LV and HV scenarios. Where in 14nm/12nm both gate-wire negative effects increased thus reducing Fmax.

Case one, HV/High-TDP/Essential Desktop:
3000G 35W => 3.5 GHz Freq
A9-9430 25W => 3.2-3.5 GHz Freq
A9-9425 15W => 3.1-3.7 GHz Freq
Athlon 5370 25W = 2.2 GHz Freq
A8-7410 15W = 2.2-2.5 GHz Freq
A 22FDX design even with a small-core in that range would be able to push = >4 GHz Freq, guaranteed with CMT.

Case two, LV/Low-TDP/Value-Fanless Client:
3015e 6W => 1.2-2.3 GHz Freq
A6-9220C/A9-9420e 6W => 1.8-2.7 GHz Freq
A10 Micro-6700T 4.5W => 1.2-2.2 GHz Freq
A 22FDX design even with a small-core in that range would be able to push = >3 GHz Freq, same as above.

In case one and two, there is high probability of accessing Ti-States(hot transistors = faster transistors that consume less power). Via, fanless from 6W and high TDP from 15W/25W.

Puma = 1c at full thrash gets for dependent exec 3-cycle Mul+3-cycle Add.
Excavator = 1c at full thrash gets for dependent exec 5-cycle FMA/Mul/Add.
22FDX CMT = 1c at full thrash gets for dependent exec 5/4/3 FMA-Mul+Add at worst and 4/3/3 FMA-Mul+Add at best. In the case of low-thrash 1C can hoard 2C's worth of SIMD resources, eff = Zen/Zen+ FPU throughput.
Jaguar w/o Bridge = 0.39 mm2
Jaguar w/ FMA Bridge = 0.546 mm2 (65nm SOI Bridge FMA paper)
Jaguar w/ Bridge + Clustered Exec(one for each core in CMT) = 0.819 mm2 (CMT's 1.5x applies to the FPU as well)
22FDX shrink (FPU/FMA have biggest shrink from lib swaps/node shrinks) = 0.3276 mm2~0.4095 mm2 (8T-104CPP peak shrink is 0.4x and averages 0.5x, as well FDSOI area op(length, width, back gate))
It also pairs well with Zen3's FPU design. So, rather Zen/Zen+ 1x4x128-bit, it is more closely related to Zen3's 2x3x256-bit with 2x2x128-bit.
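
The FPU area chain above can be sketched as a simple product of factors. This is only a back-of-envelope check: the 1.4x bridge-FMA factor is not stated in the post, it is just what 0.39 -> 0.546 mm2 implies, and the function name is hypothetical.

```python
# FPU area chain; every factor is the post's own estimate.
JAGUAR_FPU_MM2 = 0.39      # Jaguar FPU without bridge FMA
BRIDGE_FMA_FACTOR = 1.4    # assumption: implied by 0.39 -> 0.546 mm2
CMT_FACTOR = 1.5           # CMT's 1.5x applied to the FPU


def fpu_area_22fdx(shrink: float) -> float:
    """Area in mm2 after adding bridge FMA, clustered exec, and a node shrink."""
    return JAGUAR_FPU_MM2 * BRIDGE_FMA_FACTOR * CMT_FACTOR * shrink


peak = fpu_area_22fdx(0.4)     # 8T-104CPP peak shrink
average = fpu_area_22fdx(0.5)  # average shrink
print(f"22FDX shared FPU: {peak:.4f} mm2 to {average:.4f} mm2")
```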

2c Zen + 2x512KB + 4MB = ~25 mm2
4c Jaguar + 2MB = ~27 mm2
2c Excavator + 1MB = ~20.2 mm2 // 4c + 2MB = ~40.4 mm2
5.5 NewCMT0 + 5.5 NewCMT1 + 12.6 2MB L2 = ~23.6 mm2
There is a bunch of L2 techniques to improve that cache, as well as interface points going from four(core BU:4*2 L1 caches) to two(module BU:2*3 L1 caches), lowering wiring complexity. As well as the alternative being the fitting of L3 14LPP tech in the L2 of 22FDX, ex: 2x512 KB 17c/12c & 2x2MB 39c/34c latency -> 2x 1 MB 17-cycle of local module, 19-cycle latency for nearby module. In turn improving on Excavator's L2 design and getting it some level parity with Zen. Where inter-module is not >190-cycle like Jaguar's L2 on inter-cluster on consoles. Using the L3 fabric on L2 fabric also allows for a spinout of a CPU design; ~23.6*4 = ~94.4 mm2. With it being 1.2W~25W per two modules and with eight modules being ~4.8W to ~100W, CPU-side. Also, negates the need of a cost-prohibitive L3 cache since low-power generally had high latency L2 to begin with; 8x1 MB farthest 27-cycle and 8x2MB farthest 29-cycle. Doing as the Cavium's Thunder, IBM's Telum, Intel's Centerton, Avoton, Denverton and rolling without an L3 cache.
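
A quick sketch of the spinout arithmetic in the paragraph above, assuming area and power scale linearly with the number of two-module clusters (fabric and I/O overhead ignored). The names are hypothetical; the inputs are the figures from this post.

```python
# Area and power scaling for the proposed CPU spinout.
NEW_CMT_MM2 = 5.5   # one new CMT module
L2_2MB_MM2 = 12.6   # 2 MB L2 shared by two modules


def cluster_area(modules: int = 2) -> float:
    """Area in mm2 of one cluster: CMT modules plus one 2 MB L2 block."""
    return modules * NEW_CMT_MM2 + L2_2MB_MM2


def scale(per_cluster: float, clusters: int) -> float:
    """Linear scaling assumption: no extra fabric or I/O overhead."""
    return per_cluster * clusters


clusters = 4  # four two-module clusters = eight modules
area = scale(cluster_area(), clusters)
power_min = scale(1.2, clusters)   # 1.2 W per two modules
power_max = scale(25.0, clusters)  # 25 W per two modules
print(f"{area:.1f} mm2, {power_min:.1f} W to {power_max:.0f} W CPU-side")
```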

Last time no L3 cache on big CPU design was the Bloodhound die:bloodhound.jpeg

This also gets around Memory being faster B/W and lower latency than L3 anyway. There is Low-latency HBM2 at 12ns/16ns versus HBM2 at 45ns. There is also low latency DRAM-IV at 13.5ns versus LPDDR4 at 42ns. If L3 is needed it is better to use DRAM instead, with hetero-memory controllers, same PHY; LL-DRAM(as well as NRAM) @ 512 MB~2 GB @ ~1x-ns/4266 MHz - 2 GHz, and LP-DRAM @ 8 GB-64 GB @ ~4x-ns/4-6.4+ GHz. This improves on Carrizo's memory controller where long read/stores go to one 64-bit bus and short read/stores go to the other.


-----
rpi4diesizelow.jpg
RPi4 is 120 mm2
~1.26 cm * ~1 cm (Stoney-AMD) vs 1.2 cm * 1 cm (Hudson-BCM)

So, getting below that is vital for AMD getting further into the pervasive/embedded/low-cost market; $170B+ vs AMD's Zen market focus of; $75B+. Also can be used to update from Atom S1200/Opteron X2000 for Supercluster computing against modern RPi4/Elkhart designs.

~25$ min (Socket/BGA APU) + ~29$ min (Mobo) + ~29$ min (8GB DDRx RAM) = ~83$ min whereas the norm is ~150$(Intel/ARM/RISC-V).

"Low-cost 3000G" outside the US is 455 USD to 699 USD on avg. Which is absurd.
 

Insert_Nickname

Diamond Member
May 6, 2012
4,971
1,691
136
Any new public roadmap beyond the obvious would be great indeed...

Just in on the rumour mill:


I do wonder what they're going to cook up on a 4nm class process. Could be a real budgetbeast...
 

moinmoin

Diamond Member
Jun 1, 2017
4,949
7,659
136
Just in on the rumour mill:


I do wonder what they're going to cook up on a 4nm class process. Could be a real budgetbeast...
Hm, Samsung over the long run replacing GloFo as the foundry used for budget parts might make sense, though I'm not sure Samsung would really like that kind of division of work with TSMC.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
I do wonder what they're going to cook up on a 4nm class process. Could be a real budgetbeast...
5LPA=4LPE
The CPP between 4LPE and N7/N6 is probably the same. However, the M1/M2/M3 pitches differ. Which means a redesign or custom-order(high cost to have custom Std. Cells/Libs) from Samsung.

Samsung's volume for EUV is less than TSMC's volume for EUV.
Two 7nm-5nm-4nm phases at 90,000 wpm and TSMC with three~six 6nm-gen phases at 200,000-300,000 wpm.

Switching away from TSMC is suicide for a chip of Van Gogh's/Mendocino's size. Where higher wafer per month equals lower manufacturing end cost.

The more likely candidate is an ARM(Samsung)+RDNA2(AMD) Exynos chip being pushed to Chromebooks, since it is already at Samsung anyway. It is also not a low-value SKU, starting at $999 USD and going higher for better SKUs.

We can actually use Snapdragon 8cx Genx/SQx => ~1500 USD avg. machine cost.
RDNA2 basically having better scaling than Adreno means 15W ARM+RDNA2 can better fit that premium role.

Samsung Exynos S5E9840 => Google Tensor S5E9845
Samsung Exynos w/ RDNA2 for Smartphones => AMD Etc. for Chromebooks (& WoA? w/ 64-bit AMD64 emulation support)
vs PowerVR CXT big solution 7.2 Gigarays vs RDNA2 6 CU @ 1.25+ = >7.5-giga triangle-style or >30-giga box-style intersection.

This is not however the budget chip we are looking for...

---
Process:
28SLP -> 28HPP -> 28SHP(6xMx)/28A(8xMx) -> 28HPA
Low ~10% cost increase steps

28SHP -> 20SHP -> 14LPP -> 12LP+
High ~30% cost increase steps

28SHP-Bhavani & 28HPA-Stoney -> 22FDX
Lower cost estimated ~20%+ lower, with higher performance at higher yield rate(less process steps, lower variation, etc).

Access of forward I/O:
Bhavani AM1 = DDR3-1866 (2014) -- one year too early for DDR4 support
Stoney AM4e(single channel AM4 motherboard concept; X-type, B-type, A-type, E-type(single-channel)*) = DDR4-2133 (2016) -- at least three years too early for DDR5 support or DDR4-3200 support
22FDX = DDR4-3200 (first two years) to DDR5-6400 (second two years);; single mask set
12FDX(re-opt-shrink) or 12FDX(new-design) = DDR5-8400 (first two years) to DDR6 (second two years);; single mask set

*Type-A:
A12-9800
A10-9700
A8-9600
A6-9500/A6-9400
^-- Dual Channel, 8x PCIe GFX, 3 DP/HDMI
Unreleased Type-E:
E4-9400 = (A9-9410)
E2-9200 = (A6-9210)
E2-9000 = (E2-9010)
^-- Single Channel, 4x PCIe GFX, 2 DP/HDMI, e-class Promontory E310 or no chipset for e-class SFF E300
 

NTMBK

Lifer
Nov 14, 2011
10,237
5,018
136
Samsung's volume for EUV is less than TSMC's volume for EUV.
Two 7nm-5nm-4nm phases at 90,000 wpm and TSMC with three~six 6nm-gen phases at 200,000-300,000 wpm.

Switching away from TSMC is suicide for a chip of Van Gogh's/Mendocino's size. Where higher wafer per month equals lower manufacturing end cost.

It's kind of irrelevant how many wafers TSMC are running, if they have already sold them all to Apple. AMD has a certain allocation of TSMC wafers and they aren't getting more. Starting up a new product line on Samsung gives them more product to sell overall.

Same idea as your FD-SOI concepts, except with a process that is closer to competitive.
 
  • Like
Reactions: Tlh97

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
It's kind of irrelevant how many wafers TSMC are running, if they have already sold them all to Apple. AMD has a certain allocation of TSMC wafers and they aren't getting more. Starting up a new product line on Samsung gives them more product to sell overall.

Same idea as your FD-SOI concepts, except with a process that is closer to competitive.
Apple isn't actually using all wafers from TSMC. AMD's certain allocation of 7nm/5nm TSMC wafers is more than they had since 2000-2017 from AMD Foundry to GlobalFoundries for any given node.

Van Gogh/Mendocino from TSMC to Samsung = Bad, adds unneeded supply-chain complexity
Samsung Exynos w/ RDNA2 already on 4LPE to AMD Chromebook on 4LPE = Good, re-uses Samsung's supply-chain but for an AMD Semi-custom.

The usage of FDSOI is ~1500 USD wafers versus FinFET ~9000 USD wafers.
9000/386 150mm2 = ~23 USD
1500/577 100mm2 = ~2.6 USD

50 - 23 = $27 Margin
25 - 2.6 = $22.4 Margin or 30 - 2.6 = $27.4
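
A minimal sketch of the per-die arithmetic above. It assumes every die on the wafer is sellable (no yield loss) and takes the dies-per-wafer counts straight from this post; the function names are made up for illustration.

```python
def die_cost(wafer_usd: float, dies_per_wafer: int) -> float:
    """Cost per die, assuming 100% of the dies are sellable."""
    return wafer_usd / dies_per_wafer


def margin(price_usd: float, wafer_usd: float, dies_per_wafer: int) -> float:
    """Selling price minus silicon cost only (no packaging, test, or design)."""
    return price_usd - die_cost(wafer_usd, dies_per_wafer)


finfet = die_cost(9000, 386)  # 150 mm2 die on a ~9000 USD FinFET wafer
fdsoi = die_cost(1500, 577)   # 100 mm2 die on a ~1500 USD FDSOI wafer
print(f"FinFET die ~${finfet:.1f}, FDSOI die ~${fdsoi:.1f}")
print(f"$50 FinFET part margin ~${margin(50, 9000, 386):.1f}")
print(f"$25 FDSOI part margin  ~${margin(25, 1500, 577):.1f}")
print(f"$30 FDSOI part margin  ~${margin(30, 1500, 577):.1f}")
```

The FinFET margin comes out slightly under the $27 quoted above because the post rounds the die cost down to ~23 USD before subtracting.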
 

Hitman928

Diamond Member
Apr 15, 2012
5,262
7,890
136
Apple isn't actually using all wafers from TSMC. AMD's certain allocation of 7nm/5nm TSMC wafers is more than they had since 2000-2017 from AMD Foundry to GlobalFoundries for any given node.

Van Gogh/Mendocino from TSMC to Samsung = Bad, adds unneeded supply-chain complexity
Samsung Exynos w/ RDNA2 already on 4LPE to AMD Chromebook on 4LPE = Good, re-uses Samsung's supply-chain but for an AMD Semi-custom.

The usage of FDSOI is ~1500 USD wafers versus FinFET ~9000 USD wafers.
9000/386 150mm2 = ~23 USD
1500/577 100mm2 = ~2.6 USD

50 - 23 = $27 Margin
25 - 2.6 = $22.4 Margin or 30 - 2.6 = $27.4

If a Chromebook uses the Samsung Exynos with RDNA2, then it's a Samsung powered Chromebook, not an AMD one. Samsung is licensing the GPU IP from AMD, it's not an SOC developed by AMD. The only thing that might make sense for AMD to use a GF process for at this point are super cheap, 3rd world country type, APUs they can sell for dirt cheap and still make a buck, or to put the Zen IOD on 12FD-SOI, but that process still isn't ready for production.
 
  • Like
Reactions: Tlh97

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
If a Chromebook uses the Samsung Exynos with RDNA2, then it's a Samsung powered Chromebook, not an AMD one. Samsung is licensing the GPU IP from AMD, it's not an SOC developed by AMD.
Say that to the Google Tensor...
The only thing that might make sense for AMD to use a GF process for at this point are super cheap, 3rd world country type, APUs they can sell for dirt cheap and still make a buck, or to put the Zen IOD on 12FD-SOI, but that process still isn't ready for production.
Even in worst-case scenarios the 22FDX APUs would still be faster than either 28nm Bhavani or 28nm Stoney cases. The purpose of switching is to get fully-depleted transistors, like FinFETs, on a process that, unlike FinFETs, actually drops in wafer price per node. AMD can't do aggressive low-cost with Zen because of this. The prices with Zen can only go up.

There is very much a market for low-cost and low-power 22FDX CPU and 22FDX GPU solutions as well.

Microserver/Dense Server => Cost-prohibitive to consumers in 2012~2014. While in 2020+ it isn't cost-prohibitive to consumers anymore.

Cluster board/rack => ~$200-300
4-5 APUs => ~$120-250 (30 best case, 50 worst case)
Included 70-120 watt PSU => 5 at sub ~6w = <30w APU power.
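
The APU line items above work out as follows; a trivial sketch with hypothetical names, using the $30 best-case and $50 worst-case per-APU prices and the sub-6W figure from this post.

```python
def apu_budget(apu_count: int, apu_price_usd: float, apu_watts: float):
    """Return (total APU cost in USD, total APU power in watts)."""
    return apu_count * apu_price_usd, apu_count * apu_watts


best_cost, _ = apu_budget(4, 30, 6)             # 4 APUs at the best-case price
worst_cost, worst_power = apu_budget(5, 50, 6)  # 5 APUs at the worst-case price
print(f"APUs: ${best_cost} to ${worst_cost}, up to {worst_power} W")
```

With five sub-6W APUs staying under 30 W in total, a 70-120 watt PSU leaves plenty of headroom for the rest of the board.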

This can be more aggressively scaled for PPD/W rather than expensive CPU+GPU with PPD metric only.

The other case is inexpensive expansive 2D/2.5D/3D gaming. Where high-end PCs aren't really needed, since there is a loss of quality in story or gameplay for quantity of graphic fidelity.

GPU arch. = Compute/W and Gaming/W but not necessarily at the same time at low $. (Split-RDNAx&CDNAx isn't feasible. It needs to do both Compute and Gaming well)
CPU arch. = GFlops/W and GOPs/W at low $.

Fam16h was optimized for Gate-last, Fam15h never removed HPC M-SPACE/Arch FE, BU, FPU. Neither design would port well to 22FDX, so a relatively ground-up design would be needed for a new Fam w/ CMT; ~$40M + ~$1500 vs ~$100M + ~$4000

$1800/538 (107 mm2 - 28SHP) = ~3.3
$1800/455 (125 mm2 - 28SHP) = ~4.0
$1500/577 (100 mm2 - 22FDX-0.8x) = ~2.6
$1500/653 (90 mm2 - 22FDX-0.7x) = ~2.3
$4000/386 (150 mm2 - Dali) = ~10.4
$4000/256 (225 mm2 - Monet) = ~15.6

Minimum Client SEP from Dali = 4.711538462
22FDX-0.8x = ~$12.25
22FDX-0.7x = ~$10.84
Monet = ~$73.5
Worst case insert at risk 22FDX SEP($3600-14nm FDSOI price @ 2015) = ~29.4 for the 100mm2-0.8x die.
Worst case for 7nm VGH=6nm MDN(FT6 version) = ~121.85
However, if it is that then we can basically look at Ryzen 5 3400G-3700U/2400G-2700U launch prices for MDN. Since, it actually provides that level of performance.
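
The SEP estimates above appear to be each die cost multiplied by the 4.711538462 figure, which also reproduces ~$49 for Dali's ~10.4 die cost. A hedged sketch of that reading, with hypothetical names and the post's rounded die costs as inputs:

```python
# Assumption: the "Minimum Client SEP from Dali" figure is used as a
# price-to-die-cost multiplier for every other die.
SEP_MULTIPLIER = 4.711538462

# Rounded per-die costs in USD, as listed in the post.
DIE_COST_USD = {
    "Dali": 10.4,
    "22FDX-0.8x": 2.6,
    "22FDX-0.7x": 2.3,
    "Monet": 15.6,
}


def estimated_sep(die_cost_usd: float) -> float:
    """Scale a die cost to an estimated suggested e-tail price."""
    return die_cost_usd * SEP_MULTIPLIER


for name, cost in DIE_COST_USD.items():
    print(f"{name:>10}: ~${estimated_sep(cost):.2f}")

# Worst case: at-risk 22FDX priced like a $3600 wafer, 577 dies of 100 mm2.
print(f"worst-case 22FDX-0.8x: ~${estimated_sep(3600 / 577):.1f}")
```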

Oversimplification:
Up to 50,000,000 consumer chips × $25(ASP) = Up to $1,250,000,000
Up to 5,000,000 consumer chips × $250(ASP) = Up to $1,250,000,000

APU: 1.2W to 25W w/ big seller being the 1.2W to 6W range. 4c/3WGP(6x1 SIMD16 FP64) from 2c/3CU(3x4 SIMD16 FP32) and 4c/2CU(2x4 SIMD16 FP32)
CPU: 1.2W to 25W w/ big seller being the 5W to 10W range. Single-die 4 modules/8 MB L2(25-cycle for all four, 19-cycle for one), and dual-die 8 modules (3x4 PCIe GFX, 1x4 PCIe GPP, 128-bit DDR)
GPU: 1.2W to 25W w/ any range being good. 6WGP(12x1 SIMD16 FP64), single GPU card through quad GPU card => 3x slots from above ~12 dGPUs avg sold.
$25-$35 per chip range;; Low-cost Essentials to Compute/W/Dollar scale out.

---
Orange square is two Jaguar cores, did it on the side placed bits like BD-XV with area concerns popping up with private big LSU/L1d.
jaguarcmt.jpg
Some placement is covered in other non-AMD designs to maximize perf/power. I threw most of the repeats in, there is a case for a single 32KB L1d and 32KB L1i. Since, L0s are present to cover latency. Because of the repeats especially in FPU/LSU/dTLB/iTLB/L1d/L1i those are larger than if in an actual design.

MI/CNN/DNN machine-learned synthesis-place-route is relatively mature
22FDX is near its equipment depreciation, no new fab so fab depreciation isn't in, 2017 + 100% depreciation, 2018 + 80% depreciation, 2019 + 60% depreciation, 2020 + 40% depreciation, 2021 + 20% depreciation, 2022 + 0% depreciation.
Next-gen I/O(DDR5, PCIe Gen5, USB4, etc. on die for cost-crossover refresh), H.266/AV1 hw decode/encode, exhaustive power/architecture design and process improvements, etc. The benefit of inserting into Si at the trailing edge.

22FDX Risk = 2016, Tsi up to 8.5
22FDX Intro-to-Volume = 2017, Tsi up to 7.5
22FDX Actual-to-Volume = 2018-2019, Tsi avg to 6~6.5
Big EDA tools support Adaptive Body Biasing = 2019
Big EDA tools improve performance by ~20% and shorten time to market by half = 2020
22FDX+ is introduced, no mentions of logic perf except for STM's roadmap being half-way between 22FDX and 12FDX = 2020

40% + 20% + 15% = Note the increase from 2017+ 22FDX ZBB [40%] to 2021+ 22FDX+ ZBB [75%], whereas BB was only [70%] in 2019+.

\\\\
12FDX returned to profiles in 2021; from 2019 until very recently, it wasn't present at all. So, it is very important to check on these:
globalfoundriesmalta.png
globalfoundriesmalta3.png
globalfoundriesmalta2.png
"The 300 mm pilot line is on track to be completed by the end of the year" -- 2021 Missouri 300mm SOI, GlobalWafers.
soifactofabfac.jpeg
Dresden/F1 = 22FDX (now) and 12FDX (soon)
Singapore/F7 = 22FDX (soon)
Malta/F8 = 22FDX (soon - 300mm Missouri) and 12FDX (soon - 300mm Missouri)
Burlington/F9 = 22FDX (soon - 200mm Missouri)

Bernin II - 2020 €25 million($28.3M) in capital expenditure + €10M($11.3M) in 2021(Shared Bernin 1) + €220M[$249M](Shared Between Bernin II and third 300mm FDSOI fab SOITEC) between 2022 through 2026.
Pasir Ris - 2020 €26 million ($29.4M) in capital expenditure + €67M($75.8M) in 2021 + €275M($311M) during 2022 through 2026.

$87.1M(2021 and shared between B1,B2, PR1) and going forward $111.5-140M(avg per year) from SOITEC
$210M(2021 and shared between 200mm/300mm), and no details going forward other than $800M in planned wafers bought by GloFo, from GlobalWafers
spent for capacity.

$800M using the 2009 cost reduced FDSOI wafer = 1,600,000 SOI wafers(if at 2009 planned costs) being planned to be bought by GlobalFoundries.
PDSOI was $1000 base, FDSOI entry was $500 back then. If the above is spread out over five years then it should be enough to supply ~27K (relative to 300mm) wafers per month(avg).
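
The wafer supply estimate above, as a small sketch. It assumes the full $800M buys FDSOI wafers at the 2009 planned $500 entry price and that deliveries are spread evenly over five years; the names are hypothetical.

```python
BUDGET_USD = 800_000_000   # planned wafer purchases by GloFo from GlobalWafers
FDSOI_WAFER_USD = 500      # 2009 planned FDSOI entry price per wafer
YEARS = 5                  # assumption: purchases spread evenly over five years


def wafers_bought(budget_usd: float, wafer_usd: float) -> float:
    """Total SOI wafers the budget covers at a flat per-wafer price."""
    return budget_usd / wafer_usd


def wafers_per_month(total_wafers: float, years: int) -> float:
    """Average monthly supply over the purchase period."""
    return total_wafers / (years * 12)


total = wafers_bought(BUDGET_USD, FDSOI_WAFER_USD)
monthly = wafers_per_month(total, YEARS)
print(f"{total:,.0f} wafers total, ~{monthly / 1000:.0f}K wafers per month")
```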

29.2K wafer starts per month at Fab 8 in 2021
+ $1B for another 12.5K wafer starts per month for 2022 and beyond.
+ sub-$10B for a double of the above with Fab 8.2/Fab8 Phase2, unknown introduction.

Since, GloFo is not going to update the roadmap I did it for them.
22fdx12fdxthewaytogo.png
What FinFET node? We never had one!
Some special customers who were eyeballing 14LPP/12LP/12LP+ on-shore are opting to move to Intel's 22FFL instead. GlobalFoundries does not currently do 22FDX in the states, so it is off-shore.
Worldwide Best-All-Around = TSMC FinFETs
United States = Intel FinFETs
Cheapest WW/US/China = Samsung(US/Korea) and SMIC(China) FinFETs
- Samsung 3nm Taylor-Austin, Texas
- TSMC 5nm Northphoenix, Arizona
- Intel, all of em, Oregon/Arizona
////

22FDX-Y1 -> 22FDX-Y2 -> 22FDX-Y3 -> 22FDX-Y4 -> Yx -> 12FDX-Y1
::
12FDX-RP -> 12FDX-IV -> 12FDX-AV -> 12FDX-BiABB -> 12FDX-PPACY+x -> 12FDX-Y1

By the time 12FDX processors exist for AMD, GlobalFoundries will probably be inserting JFIL(Japan), SRPL(China), DSA(Belgium), or REBL/MAPPER/ISFEA(USA) into 6FDX.

Chartered => 157nm or SFIL for next-gen nodes
Albany => SFIL for next-gen nodes
Both combine in a modernity and GF chooses JFIL for next-gen nodes and removes high straight-line depreciation costs.

However, looking at the EOL lineup: GlobalFoundries 28nm is dead, the only 28nm product in is AMD Embedded G-Series 1st Generation SoC, AMD Embedded G-Series CPUs which is TSMC. 28nm GF EOL at January 2021, Customers of EOL products Last-time-ship is January 2022. Thus, End User last-time-purchase is 2022 for GloFo 28nm.

AMD has completely waived purchases of 14nm/12nm instead opting to use node freedom to second source those products.
GF-28nm killed in 2021.
GF-14nm/GF-12nm killed in 2022.

Back to why 22FDX;
::ARM::
BCM2711 28nm = 1.5 GHz to 1.8 GHz Cortex-A72 (quad-core) = ~6W TDP
RK356x 22FDX = 1.8 GHz to 2.0 GHz Cortex A55 (quad-core) = 5W TDP

Cortex-A510 is within 10% of Cortex-A73, while Cortex-A73 was an ~10% improvement over A72.
RPi4 can go to RPi5 and get 4x 3 GHz A510 at lower area and RK356x can move to 4x 3 GHz A510 for better perf.

::AMD::
A6-9220C 28nm = 1.8 GHz to 2.7 GHz (dual-core) = 6W TDP
3015e 14nm = 1.2 GHz to 2.3 GHz (dual-core) = 6W TDP w/ 5 min fPPT boost to 18W and 50 min sPPT boost to 12W.

::Intel::
Pentium N6000 = 1.1 GHz to 3.3 GHz (quad-core) = 6W TDP w/ 15 second boost to 18W.
Alderlake-N = 8x Gracemont (octo-core) = 1.8 GHz to 3.4 GHz(sust. quad-core boost) and 3.0 GHz(sust. octo-core boost)
Meteorlake-N = 8x Crestmont (octo-core) = IPC(Arch+)/GHz boost from Intel 4.

AMD at the low-end at GlobalFoundries is basically crushed... big die 14nm isn't competitive enough against small die Intel 7/4 and big die 14nm is too expensive against small die 28nm/22nm node generation.

Hence, why small-die 22FDX is the clear choice going forward.
~6.2 mm2 for two Jaguar-cores on 28nm <-> ~4.0 mm2 for two Enhanced-Jaguar-cores on 16nm. The issue with this singular design is that they don't have the FPU power of the above; 2x64 or 2x128 FMA SVE for ARM and 2x256 FMA for Intel.

So, they need a small core or module that at median provide 2x128 FMA. With 28nm CMT = 4.65 mm2+arbitrary area for 2x128 FMA and 16nm-like Area-opt 22nm CMT = 3 mm2+arbitrary area for 2x128 FMA.

Process Complexity/Cost:
22FDX/22FDX+(w/ in-situ perf boosters) -> 28PolySi -> 28HKMG(SLP) -> 12FDX(w/ gen2.1 in-situ perf boosters) -> 28HKMG(28HPP/28SHP/28SHP+ w/ implant perf boosters) -> 14nm FinFET(Fin complexity, RMG complexity, MOL complexity)

Performance+Power:
28PolySi -> 28HKMG(SLP) -> 28HKMG(28HPP-SHP+) -> 14nm FinFET -> 22FDX/22FDX+ -> 12FDX

14LPP has delay variation at 1.1-up V(Ultra-high-performance) and 0.7-down V(Ultra-low-power) which requires costly implants to fix.
22FDX doesn't have these issues because of its use of in-situ boosters intrinsically.
So, across ultra-wide range workloads 22FDX comes out way cheaper and way faster.

28SLP-A9 1 GHz @ 1.1V
New Range: 28FDS-A9 1 GHz @ 0.65V
Same Range: 28FDS-A9 @ 2.3 GHz @ 1V
New Range: 28FDS-A9 3 GHz @ 1.4V
22FDX was derived from the Tri-gate competitor line(20FD: 0.9Vdd+20nmLg, 14FD: 0.8Vdd+boosters, 14FD+: 0.7Vdd+gen1.1 boosters, 22FDX: 0.65Vdd+gen1.2 boosters)... 20LPM to 20SHP is a 10% perf increase, but 20LPM to 20FD(gen1) is a 20% perf increase.
If the eQuad A9 core was ported to 22FDX, the A9 core would be 4.5 GHz @ 1.3V... which explains 28BLK and 28FDS designs at STMicro were ported to 22FDX.

Body biasing in designs:
Intel's 45nm - 2008
"Dynamic SRAM PMOS forward-body-bias (FBB) and Active-Controlled SRAM VCC in Sleep are integrated in the design to lower Active-VCCmin and Standby Leakage, respectively. FBB improves the Active-VCCmin by up to 75 mV, and Active-Controlled SRAM VCC distribution tightened by 100 mV, both of which result in further power reduction.
The 16 KB Subarray was also used as the building block in on-die 6 MB Cache for Intel Core 2 Duo CPU in 45 nm technology.

Oracle 40nm - 2010
"In addition, the design implements body-bias capability for both PMOS (VNW) and NMOS (VSB)."

Samsung 32nm - 2012
"Data from the monitors is analyzed to identify the process corner, the amount of threshold voltage skew and the on-chip variation and utilized to control body bias and supply voltage for the power planes. It effectively reduces the process window of the silicon samples and minimizes the leakage/performance impact of process variation. We can target the process corner to SS to minimize the overall leakage current and selectively apply forward body bias on the speed critical blocks, or target the processor to the FF corner and apply backward body bias on the leakage critical blocks."

Samsung 28nm - 2013
"In addition, based on the measurements from on-chip performance sensors, reverse and forward body biasing are appropriately applied to compensate against the process variations for reducing leakage, improving performance and yield."

28FDS eQuad was compared to Samsung's 28nm processor in one of the demo videos. 22FDX Rockchip uses body-biasing for A35, A53, A55 cores.

Closest processor TDPs of tier 1:
:Shrink of not-AMD MediaGX processor:
Geode range => 2.8-5.1 watts

:Bobcat w/ iGPU like MediaGX:
Geode2 => 6.4 watts
Geode2 client => 5.9W and 4.5W

:Jaguar w/ iGPU like MediaGX:
Geode3 (dual-core) => 6W
Geode3 client (quad-core) => 8W

: Puma w/ iGPU like MediaGX:
Geode3+ (quad-core) => 5W-7W
Geode3+ client (quad-core) => 4.5W

:Excavator w/ iGPU like MediaGX:
Geode4 (dual-core) => 6W-10W
Geode4 client (dual-core) => 6W
:No successor to reduce power and increase performance at similar lowering price point:

Embedded, Industrial, Mobility tier 1 (Stripped I/O for lower power) => 22FDX is ideal and allows for "Cool_AMD64" back-biasing//body-biasing.
Client, Desktop, Mobility tier 2 (Full I/O due to higher availability of power) => 22FDX can also spin up for high TDP bursts.

22FDX;
Lowest VT + FBB
Lower Mid VT + FBB
Higher Mid VT + RBB
High VT + RBB

12FDX:
Lowest VT + FBB + RBB
Lower Mid VT + FBB + RBB
Higher Mid VT + FBB + RBB
High VT + FBB + RBB

Collected a range of Mullins and Stoney => 0.9V=+70% frequency and 0.5V=-75% power
Frequency of 0.9V for 22FDX design is 3.06 to 3.74 GHz. (+500 MHz to 1.2 GHz for Puma and +300 MHz to 1 GHz for Excavator)
Frequency of 0.5V for 22FDX design is 1.71 to 2.01 GHz. (-75% the power of 28nm 0.9V)

The issue is 28nm bin range is absolutely everywhere. Whereas 22FDX should be more biased towards highest bin.
 
  • Like
Reactions: Zepp

VirtualLarry

No Lifer
Aug 25, 2001
56,339
10,044
126
Now, I just don't see how Stoney can accomplish anything here.
This, 100%. I own (or have owned) Puma(+?) laptops, as well as several Stoney Ridge, as well as Zen-based 3200U, 3020e, and 3050e laptops, and there really is no comparison, the Zen-based APUs, even the 3020e and 3050e, are in a much better league of their own. Stoney Ridge, at least in 2021, is a total bust, and needs to die a horrible, fiery, silicon death.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
This, 100%. I own (or have owned) Puma(+?) laptops, as well as several Stoney Ridge, as well as Zen-based 3200U, 3020e, and 3050e laptops, and there really is no comparison, the Zen-based APUs, even the 3020e and 3050e, are in a much better league of their own. Stoney Ridge, at least in 2021, is a total bust, and needs to die a horrible, fiery, silicon death.
It is unlikely a shrink to 22FDX would keep the Stoney name.
Puma++ and Excavator++ are unlikely since there was the 2016 through 2018 ultra-low-power CPU/GPU architecture.

Puma -> ULP CPU => Thin-OoO small CMP -> Thin-OoO small CMT
Excav -> ULP CPU => Thick-OoO big CMT -> Thin-OoO small CMT

Specifically, Family 15h "Brainiac-design Module" design going forward to a new architecture family with a much smaller "Speedracer-design Module" design. Enabled by much lower self-heating effect compared to PDSOI/FinFET and much lower switching energy/delay compared to Bulk/FinFET. Another enabler for this is the increasing LPDDRx speeds; 4.266 -> 6.4 -> 8.5, w/ LPDDR5 still not using QDR mode for 12.8-17 gigatransfers. Add as well DDRx; 2.4 -> 3.2 -> 4.8 -> 6.4 -> 8.4 on desktop as well. There is also NVMe SSDs taking care of the slowest side; "Kioxia compares its prototype to a PCIe 4.0 drive, the Kioxia CM6. That gets roughly 6,900MB/s sequential read and 4,200MB/s sequential write, which is easily crushed by the 14,000MB/s sequential read and 7,000MB/s sequential write of the PCIe 5.0 prototype."

Merlin R-series (GloFo 28nm, EOL 2021) -> Banded Kestrel R1000-series (GF 14nm)
Brown/Prairie/Crowned/LX G-series (GloFo 28nm, EOL 2021) -> G1000-series (GF 22nm)
On the embedded side.

Bristol A series (GloFo 28nm, EOL 2019?) -> Dali/Pollock Athlon-series (GF 14nm)
Stoney A/E series (GloFo 28nm, EOL 2021) -> Sempron-series (GF 22nm)
On the desktop/mobile side.
 

VirtualLarry

No Lifer
Aug 25, 2001
56,339
10,044
126
The thing that really killed Stoney, was the exceptionally poor SINGLE-CHANNEL DDR4 performance.

Both lack of DDR4 higher-speed RAM support, as well as inefficiencies, would often see the compute side (CPU cores) DRAM bandwidth-starved, and stalling out. Especially using VSR @ 2560x1440 on a 1080P laptop screen. It becomes notably slower.

A Zen 3050e-based laptop, I can use VSR to bump up a 1080P screen to 4K UHD res., and NO appreciable slowdown.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
The thing that really killed Stoney, was the exceptionally poor SINGLE-CHANNEL DDR4 performance.

Both lack of DDR4 higher-speed RAM support, as well as inefficiencies, would often see the compute side (CPU cores) DRAM bandwidth-starved, and stalling out. Especially using VSR @ 2560x1440 on a 1080P laptop screen. It becomes notably slower.

A Zen 3050e-based laptop, I can use VSR to bump up a 1080P screen to 4K UHD res., and NO appreciable slowdown.
Mind you the 128-bit DDR4 is warranted. As the Raven2 that launched went against Bristol not Stoney.

CPU-side:
Average Benchmarks AMD Athlon Silver 3050e → 100% // (6W)
Average Benchmarks AMD A12-9720P → 98% // (15W)

GPU-side:
Average Benchmarks AMD Radeon RX Vega 3 → 100% // 1 GHz (6-15W)
Average Benchmarks AMD Radeon R7 (Bristol Ridge) → 90% // 900 MHz (35W)

---
Architecture, SoC, etc for 22FDX wouldn't be as bad as the above. However, we are actually eyeballing Pollock.

A6-9220c: 1x64-bit DDR4-1866/3 CU-720 MHz
vs 3015e: 1x64-bit DDR4-1600/3 CU-600 MHz

Which isn't particularly hard to beat on an optimization scale: 1x64-bit DDR4-3200 or 1x64-bit LPDDR4-4266 and a new SoC/CPU/GPU design. Definitely not hard to push a sub-$30 APU below that which actually performs better at lower TDP on 22FDX.

Rather than being a leader on 22FDX, it is much better to be a follower for such designs;
- Lower manufacturing cost up to at least 2x lower.
- Higher performance, lower power, smaller area.
- Feature rich installation capability of third-party IP, and more access to a semi-custom SoC.
 

Insert_Nickname

Diamond Member
May 6, 2012
4,971
1,691
136
The thing that really killed Stoney, was the exceptionally poor SINGLE-CHANNEL DDR4 performance.

Having played around with a single module A6-9500, dual channel doesn't really do much for BR/Stoney. It's still dog slow. What really kills it is that it's effectively a single core w/ SuperHT™ CPU in today's environment. That's just not going to cut it.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Having played around with a single module A6-9500, dual channel doesn't really do much for BR/Stoney. It's still dog slow. What really kills it is that it's effectively a single core w/ SuperHT™ CPU in todays environment. That's just not going to cut it.
IPC is hella expensive. Bulldozer-Excavator is optimized for 4 ALU/4 FPU + 4 AGU per core, coming out at 14.48 mm2 for two cores in Excavator, in the same libs as Puma, which was optimized for 2 ALU/2 FPU + 2 AGU per core at 3.1 mm2 (one core) and 6.2 mm2 (two cores).

The OG CMT design, in which the two cores each "include, for example, a floating point add unit, a floating point multiply unit, two integer units, a branch unit, a load address generation unit, a store address generation unit, and a store data unit."

Don't forget the slowest part is the FPU, which doesn't even retain the MUL+ADD+MISC per core of Husky, Bobcat, or Jaguar. The benefit of the shared FPU, which is that it needs fewer resources to get high IPC, is completely ruined by the FMA units.

However, what really kills it is the outdated fabrics, inter-core and inter-module communication, etc.

Small module w/ shared FPU so one core/thread can hog the co-processor: 2x MUL/2x ADD throughput like Zen for one thread, and 1x MUL/1x ADD for two threads anyway.

Rather than use a wide ~192-entry + SMT + Big PRF OoO engine at lower clocks, it is better to use two ~64-entry + Small PRF OoO engines at higher clocks, since IPC beyond 2 is expensive on scalar integer parts.

3.1 mm2 (area of single core) * 1.5 (CMT2) * 0.55 (maximum area shrink) to 0.8 (minimum area shrink) => 2.56 mm2 to 3.72 mm2, if built exactly as prescribed in the CMT concept. On this path, high Fmax wire-speed and pre-built structure optimization are available for extra GHz.
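
The area estimate above is simple enough to check in a few lines. This is only a sketch of the arithmetic using the figures quoted in the post; the function and constant names are made up for the example.

```python
# Area estimate for a two-core CMT module derived from a single Puma core.
# Every input is a figure quoted in the post, not a measured value.

def cmt_module_area_mm2(core_area_mm2: float, cmt_factor: float, shrink: float) -> float:
    """Scale one core's area by the CMT factor and the node shrink factor."""
    return core_area_mm2 * cmt_factor * shrink

PUMA_CORE_MM2 = 3.1   # area of a single core
CMT2_FACTOR = 1.5     # CMT2 multiplier quoted in the post
BEST_SHRINK = 0.55    # maximum area shrink
WORST_SHRINK = 0.80   # minimum area shrink

best = cmt_module_area_mm2(PUMA_CORE_MM2, CMT2_FACTOR, BEST_SHRINK)
worst = cmt_module_area_mm2(PUMA_CORE_MM2, CMT2_FACTOR, WORST_SHRINK)
print(f"{best:.2f} mm2 to {worst:.2f} mm2")
```

That reproduces the 2.56 mm2 to 3.72 mm2 range, so a whole CMT2 module would land below the 6.2 mm2 of two Puma cores if the shrink factors hold.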

There is a big gap of 80 mm2 to 130 mm2 for a low-cost APU, which would benefit from a low-cost node like 22FDX. With the end of production of GF 28nm, a product in that generation is needed as a substitute at GloFo. There are indications that the same thing that happened to 28nm at Fab 1 will happen to 14nm at Fab 8 by 2023. That is in line with 2021 being the big year of 22FDX shipments at Fab 1, and 2023 being the big year of 45nm SOI and a potential volume start of 12nm SOI at Fab 8.
 

Shivansps

Diamond Member
Sep 11, 2013
3,855
1,518
136
Having played around with a single module A6-9500, dual channel doesn't really do much for BR/Stoney. It's still dog slow. What really kills it is that it's effectively a single core w/ SuperHT™ CPU in todays environment. That's just not going to cut it.

yeah, the A8-9600 is A LOT faster than the 9500, it feels like, you know, something you can actually use. But it still gets destroyed by a lowly 200GE/3000G. There is just no point in any of this.
 

Insert_Nickname

Diamond Member
May 6, 2012
4,971
1,691
136
yeah, the A8-9600 is A LOT faster than the 9500, it feels like, you know, something you can actually use. But it still gets destroyed by a lowly 200GE/3000G. There is just no point in any of this.

It was one of the famous boot kits. Since it was already in there, I saw no reason not to install Windows and test it before returning.

Otherwise, I wouldn't have touched it with a barge pole.
 

Asterox

Golden Member
May 15, 2012
1,026
1,775
136
yeah, the A8-9600 is A LOT faster than the 9500, it feels like, you know, something you can actually use. But it still gets destroyed by a lowly 200GE/3000G. There is just no point in any of this.

It is ok, but it is not even close to dirt cheap, as someone would expect. It is an AM4 APU, and it can be very useful for any cheap PC.



The Athlon 3000G is, lol, a much better CPU, but today at or around $100 it is, huh, too expensive. This is absurd: they are identical CPUs, but there is a very big price difference for the Athlon 3000G tray version. Athlon 300GE 564 kn = $84, Athlon 3000G 746 kn = $111.


When I used the Athlon 3000G, I paid 384 kn = $57. :mask:
 

jpiniero

Lifer
Oct 1, 2010
14,591
5,214
136
Why would an OEM use Stoney when Goldmont Plus is faster?

The tiny amount of the very low end Zen/Intel stuff released to DIY is going to be gobbled up by miners.
 

Insert_Nickname

Diamond Member
May 6, 2012
4,971
1,691
136
It is ok, but it is not even close dirt cheep as someone would expected.It is AM4 APU, and it can be very useful for any cheap PC.

I would not recommend building anything new using it, as AMD has sunset graphics drivers for everything pre-Polaris. If you can live with that, it's reasonable, but getting very long in the tooth.

I still have a similar Athlon 845 system used in the family, and I have to say it's getting to the point where it has to be replaced.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Why would an OEM use Stoney when Goldmont Plus is faster?
Puma+Excavator (Carrizo-L/Stoney) are on a cheaper node relative to Intel's low-end while competitive towards ARM's low-end. It only needs to be faster than Cortex-A55/A510/A72.

The issue is that both architecture designs are targeting different markets. Stoney replaced Bristol-L because at 25W Puma was capped at 2.5 GHz, where Excavator got 3.5 GHz at 25W and later 3.7 GHz at 15W.

Hence, Stoney's successor needs to follow up on Puma -> Excavator: faster frequency at lower power at a given TDP, while also being on a node that is, at base, cheaper than FinFETs:
[Attached image: 22fdxlowcomplexity.jpeg]

130nm Geode LX (2005) vs 65nm Brisbane (2006), for example. Bobcat, Jaguar, and Puma were leading edge and had to take the brunt of leader costs, whereas a 22FDX design at this point is a follower/trailing-edge design, where design and manufacturing costs are lower.
 

NTMBK

Lifer
Nov 14, 2011
10,237
5,018
136
Puma+Excavator (Carrizo-L/Stoney) are on a cheaper node relative to Intel's low-end while competitive towards ARM's low-end. It only needs to be faster than Cortex-A55/A510/A72.

The issue is that both architecture designs are targeting different markets. Stoney replaced Bristol-L because at 25W Puma was capped at 2.5 GHz, where Excavator got 3.5 GHz at 25W and later 3.7 GHz at 15W.

Hence, Stoney's Successor needs to follow up on Puma -> Excavator, faster frequency at lower power at a given TDP. While also being on a node at base cheaper than FinFETs:
[Attached image: attachment 54404]

AMD gave up on these ultra-low-end markets because the margins suck. It's not a good business to be in. They were desperate to sell any product because their CPU architecture sucked, so they were stuck selling these bottom of the barrel parts while consistently losing money. They now have a CPU architecture that doesn't suck, and they're now focused on high margin markets and making billions of dollars in profit.

Offering a CPU which can only be described as "better than an A55" is not a good money making strategy.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Offering a CPU which can only be described as "better than an A55" is not a good money making strategy.
Actually it is the best strategy going forward, especially with GlobalFoundries. That specific market is massive in volume of units sold. It is 1 billion units in size, compared to the faster-than-Atom market at 10 million units.

There is more than one road available going forward:
22FDX -> 12FDX -> 6FDX (standard More Moore roadmap)
22FDX -> 2-stack 22FDX -> 4-stack 22FDX (More than Moore enabled by Monolithic Inter-layer Vias)
AMD gave up on these ultra-low-end markets because the margins suck.
AMD gave up on the low end not because the margins suck, but because the low-end designs were on the same node as the high-end designs. Essentially, they were burning wafer capacity and were not very cost-efficient.

250 mm2 - Design A
125 mm2 - Design B
107 mm2 - Design C
All on GF 28nm at the same time (inserted at leader, not at follower) and at the same fab (no fab separation like previous trailing edge).

Previous trailing-edge processors at AMD were all N-2/N-1 relative: 65nm Fam 11h versus 45nm Fam 10h, and 130nm Geode LX versus 90nm/65nm Athlons.

----
1.2 GHz 14LPP w/ 2x 128-bit MUL and 2x 128-bit ADD on Zen and 1.2 GHz 22FDX w/ 2x128-bit MUL and 2x 128-bit ADD on CMT = ~same work time on both. However, the 22FDX design is more efficient and can clock 2x higher.

6W 1.2 GHz 3015e vs 5/6W 2.2/2.4 GHz 22FDX where two-threads have the same throughput heavily favors faster thin-OoO CMT.

2-core 3-watt 1.2 GHz (Jaguar/Puma) = 1.2 GHz -> 1.5x 22FDX -> 1.3x FBB-opt -> 1.1x CMT-floorplan allowing for HF => ~2.574 GHz.
Ignoring Excavator, since it doesn't have the function units or IPC width that AMD would want to finesse for ULP. Going up from Family 16h is more efficient than going down from Family 15h. It also better matches the original function units designed for 1998-CMT (2x RISC portion of K5/K6 successor) and 2004-CMT (PowerPC 2x independent integer/shared FPU).

I am also ignoring 22FDX at V3.0 PDK, 22FDX+, and the ~20% perf increase from the EDA synthesis tool. I'm only using what was promised with 22FDX-2015t2017, so we can expect more from 22FDX+-2020t2022. There is also the machine learning DFM that happened in Q4 2020.
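
The clock projection above is just the base frequency with each quoted uplift compounded on top. A quick sketch of that arithmetic follows. The labels are shorthand made up for the example, and multiplying the factors together assumes the gains stack independently, which is the post's assumption rather than anything measured.

```python
from math import prod

BASE_GHZ = 1.2  # 2-core 3-watt Jaguar/Puma clock from the post

# Uplift factors as quoted in the post; treating them as independent
# multipliers is an assumption, not a guaranteed outcome.
uplifts = {
    "22FDX": 1.5,
    "FBB-opt": 1.3,
    "CMT-floorplan": 1.1,
}

def projected_clock_ghz(base_ghz: float, factors) -> float:
    """Multiply the base clock by every uplift factor."""
    return base_ghz * prod(factors)

estimate = projected_clock_ghz(BASE_GHZ, uplifts.values())
print(f"~{estimate:.3f} GHz")
```

That gives the same ~2.574 GHz, and it makes it easy to see how sensitive the result is: drop any one factor to 1.0 and the estimate falls back toward 2 GHz.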
 