Future to Bulldozer architecture?


majord

Senior member
Jul 26, 2015
440
529
136
Dozers were not that power efficient. I think SMT wins in efficiency and single thread. Drawbacks of SMT are design complexity and possibly area. Dozer was not that great in area either, though, thanks to its large number of pipeline stages.

I think area and cheapness is still something Puma could beat Zen at. Lower performance at lower power. Sometimes you need a golf cart rather than an automobile or racecar.

Maybe a modernized Puma could beat Zen in the under 5W SoC. If they can get 40% higher freq at the same wattage it'd be close: 1.7GHz Puma 4c/4t vs ???GHz Zen 2c/4t .
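For anyone wondering where the 1.7GHz figure comes from, here is a minimal sketch of the arithmetic. It assumes the linked A10 Micro-6700T has a 1.2GHz base clock (that number is not stated in the post) and that the 40% uplift comes at unchanged power. The function name is made up for illustration.

```python
def scaled_clock_ghz(base_ghz: float, uplift_pct: float) -> float:
    """Return the clock speed after applying a percentage frequency uplift."""
    return base_ghz * (1.0 + uplift_pct / 100.0)


# Assumption: 1.2 GHz base clock for the existing 28nm Puma part.
PUMA_BASE_GHZ = 1.2
# From the post: 40% higher frequency at the same wattage.
UPLIFT_PCT = 40.0

modernized_puma_ghz = scaled_clock_ghz(PUMA_BASE_GHZ, UPLIFT_PCT)
print(f"Modernized Puma: ~{modernized_puma_ghz:.2f} GHz")
```

Under those assumptions the result rounds to the 1.7GHz quoted above. The Zen 2c/4t clock at the same wattage stays unknown, as in the post.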

http://www.cpu-world.com/CPUs/Puma/AMD-A10-Series A10 Micro-6700T.html

I still think the next atom/cat replacements should be a hybrid complex of cores, involving puma+excavator or puma+zen.


I agree Puma still has its place, and would love to see it iterated/tweaked at 14nm. Its perf/watt on 28nm bulk proves how good it is.

I think it would indeed hold quite the perf/mm2 advantage over a Zen core, and possibly as good perf/watt at the lower frequencies.
 
Reactions: VirtualLarry

amd6502

Senior member
Apr 21, 2017
971
360
136
Better yet, experiment with 22FDX instead.

AMD has an announcement mid month which might contain an update on dozer or Puma; though they may only show major products and highlights and limit news to vega and RR.
 

coffeemonster

Senior member
Apr 18, 2015
241
86
101
Better yet, experiment with 22FDX instead.

AMD has an announcement mid month which might contain an update on dozer or Puma; though they may only show major products and highlights and limit news to vega and RR.
By 'dozer' you just mean Bristol Ridge, right? I can't imagine them doing much more updating on that platform before Ryzen APUs make it completely obsolete, though. As cool as it would be to see a die shrink, AMD doesn't have much reason to bother.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
By 'dozer' you just mean Bristol Ridge, right? I can't imagine them doing much more updating on that platform before Ryzen APUs make it completely obsolete, though. As cool as it would be to see a die shrink, AMD doesn't have much reason to bother.

Well, they need something to fill the low end. From the previous roadmaps that's Bristol and the junky ultracheap Stoney. While Stoney is stupid and ultracheap, 4c/4t Bristol does this job really well in my opinion.

It would make the most sense that around 2018 they will release a native 2c Zen+ budget APU with small iGPU to entirely phase out 28nm products. Another possibility to fill the low end is the Seronx 22FDX route with modified old architecture---but this is looking pretty unlikely and would be a massive surprise.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
- 20/14 Lost Generation => 20LPM-A Excavator/14XM Unnamed Bulldozer core & 20LPM-A Leopard/14XM Margay
- 22FDX is near 14LPP without biasing and could achieve better than LPP. So, to offset that advantage use older cores and tweak for even lower power. It would be cheaper than developing a specific Ryzen SKU.
- Utilize IP that was lost, in new ways. Rather than x86 & ARM, aim for HP 15h & LP 16h simultaneously. (Maybe, even push the GCN front-end/back-end on the Bridge with GFXv9/10.)

Basically, the gist is get some of that lost revenue now rather than later. (One core doesn't fit all.)

Imagine 16h w/ FBB @ 3.2 GHz turbo (Samsung M1/M2 IPC) & 15h w/ FBB @ ~4-5 GHz Turbo (+15%-25% improved perf @ same clock)
or less with full FP256 capabilities.

Ryzen Mobile = Premium, oh look it's on FinFETs! Oh, it has SMT! It is competitive! (Big APU sized GPU)
Not-Ryzen Mobile = Move outta da way, it's on FDSOI. Oh, it does big.LITTLE! It is cheaper than contra-revenue Atom! (Small to Mid APU sized GPU)
 
Reactions: BHZ-GTR and amd6502

coffeemonster

Senior member
Apr 18, 2015
241
86
101
Well, they need something to fill the low end. From the previous roadmaps that's Bristol and the junky ultracheap Stoney. While Stoney is stupid and ultracheap, 4c/4t Bristol does this job really well in my opinion.

It would make the most sense that around 2018 they will release a native 2c Zen+ budget APU with small iGPU to entirely phase out 28nm products. Another possibility to fill the low end is the Seronx 22FDX route with modified old architecture---but this is looking pretty unlikely and would be a massive surprise.
I agree, and I think that's what they have already with current Bristol Ridge. I just don't see them making any more revisions or die shrinks, considering that to really sell any of these they have to be very low-margin parts that aren't really even advertised.


Seronx, sometimes I wish you were part of AMD's engineering team.

But since AMD never bothered to update the FX line after Piledriver, I don't have much hope they'll do much with anything not Ryzen going forward.
Why oh why was there never a Steamroller or Excavator FX 8-core?
Imagine 16h w/ FBB @ 3.2 GHz turbo (Samsung M1/M2 IPC)
Yes, if they could get a Cat core to clock to 3.2GHz efficiently, I'd love to see what it could do.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
Except a wider core at 2GHz will be more efficient than a narrow Cat core at 3.2GHz.
Frequency and EPI are better than improvements to IPC.

So, 2 GHz with RR might be a 35W part, while 16h with Fast and ULL (aka 7.5T & RBB w/ HP voltage) might be 3.2 GHz max and still be 15W with a 15h module. So, four cores @ 2 GHz, or six cores @ ~4 GHz for the HP set (2-core) & ~3.2 GHz for the LP set (4-core).

For casual/mainstream users, more than 2 ALUs is overkill. That allows Bulldozer architectures to aim for clock rates and extreme efficiency, while replacing or being companions to traditional LP cores.
 

NTMBK

Lifer
Nov 14, 2011
10,269
5,129
136
Just in case anybody was in doubt: a leaked AMD roadmap confirms that they're completely replacing all construction cores with Zen. https://videocardz.com/69428/amd-snowy-owl-naples-starship-grey-hawk-river-hawk-great-horned-owl

AMD-Enterprise-CPU-2015-2019-Roadmap_1-1.jpg


AMD-Banded-Kestrel-SOC-Block-Diagram-1-1140x633.jpg


("Brown Falcon" is the embedded version of Stoney Ridge. Even in 4-15W, Zen is replacing Excavator.)
 
Reactions: krumme

amd6502

Senior member
Apr 21, 2017
971
360
136
Nice catch NTMBK

Amazing to think that Zen can scale down to 4W.
Eventually... the TDP wattage box for 2c/4t Banded Kestrel reads 15.

Only for next gen 7nm "River Hawk" does it read 4-15.

I thought Zen+ was going to be on 14nm, and was hoping their first 2c native APU die would be Zen+ 14nm.
 

NTMBK

Lifer
Nov 14, 2011
10,269
5,129
136
Nice catch NTMBK


Eventually... the TDP wattage box for 2c/4t Banded Kestrel reads 15.

Only for next gen 7nm "River Hawk" does it read 4-15.

I thought Zen+ was going to be on 14nm, and was hoping their first 2c native APU die would be Zen+ 14nm.

It's listed as 4-15W on that second slide.
 
Reactions: amd6502

jpiniero

Lifer
Oct 1, 2010
14,820
5,432
136
If they are, then investors had better jump ship now while the stock price is still reasonably high.

Bulldozer wasn't necessarily a bad idea, it's just that the implementation was friggin awful and they never got to the full endgame with the GPU replacing the FPU, etc...

That being said I think he's misinterpreting what Intel is doing with Sapphire Rapids, at least according to the leaks. I think SR is more like Power in that it's very wide but with the vector units gone/separated out. How it works in practice I guess we will have to see.
 
Reactions: amd6502

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
Not so much the future of 15h/Bulldozer. More the future of Bulldozer as the module.

- Take Excavator module.
- Take Jaguar cluster.
- Remove L2 unit from Jaguar cluster. Place power controller of Jaguar cores and L1<->L2 interface into a new modularized unit.
- Remove cores, LD/ST, and FPU from XV module. Keep and modify XV front-end and cache unit/L2. Particularly: split 2x4-wide into 4x2-wide decode, I$/BPU/L2_CU considerations, etc.
- Place full Jaguar cores into void space. Call the new interface unit the mid-end of the module; L1<->L2 becomes L0<->L1. Make a new 256KB L1d cache shared between all Jag cores. Existing L1i/L1d to be shrunk from 32KB to 16KB and be called L0i/L0d. Do things with fetch/decode/rename to better optimize for macro-ops->micro-ops. Do things with LD/ST to interface with mid-end better.
- Optimize for clock speed (RVT/LVT stuff & FO4), since that is like what, <23 stages?
* To enhance even further, push L0i into front-mid-end with a Global renamer and boom, reverse multithreading.
* While it would lose 15h execution compatibility. It would not lose 16h execution compatibility.
* Have the mid-end have a bypass between cores. So, a core can directly write to another core via ring or crossbar. Mid-end should also have instruction & data coherency tables(look-ups/RAM/etc). Something like that should allow for a flat space across cores. It also reduces the energy to write to the L1, then to another L0.
* Power controller in the mid end can reduce power controller complexity from module level to the core level. (2-lv power controller)
* Global renamer above can house its own FP Decode/Rename component. Thus allowing the FPUs to be FlexFPU-like while still being housed in independent cores. With the mid-end PMU, a core would only need to have its FPU/LSU components active to be controlled.
* Ideally, it would be best to arrange Pipe 0 and Pipe 1 of the FPUs into: P1-128-bit Floating Point FMAC and P0-128-bit Integer FMAC.

Call it Caracal or something. Also, AMD, put it on 22FDX. I want those sub-10 picosecond RO/FO3/FO4 delays asap!! (Not even FinFETs or Nanowires can do those!)

For those who noticed that the module was and is an exoskeleton, raise your virtual hand.
 

majord

Senior member
Jul 26, 2015
440
529
136
Steamroller and Excavator became quite 'unbalanced' designs, in an attempt to gain throughput. It's quite clear now that the dual 4-way decoders per module are incredibly wasteful. The (already mentioned) various bottlenecks and design issues elsewhere in the pipeline meant the uplift was lower than expected, and these contributed nothing to ST performance.

To offset this extra area/power, other compromises had to be made: e.g. the FPU was reduced to 3 pipelines from 4, and for Excavator the L2 cache was halved and Fmax lowered in exchange for better perf/watt and density at lower clocks.

Now think about this last one for a minute. Applying high density, LOW frequency orientated physical design to a fundamentally high frequency, high power uArch.

Essentially, by the end of its life, the physical and deep-rooted architectural elements of the core completely contradicted each other, and the architecture as a whole became a big patch-up job. As such, there was and is never any future for them.

That said, I think the challenge of getting such a fundamentally troubled architecture, on a two-generation-old process, to actually be semi-competitive in the mobile (notebook) market drove an incredible amount of innovation at the power management, core layout and SoC design level, which, now that it's been applied to a competent and more suitable uArch (Zen), is really paying off.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
1. the FPU was reduced to 3 pipeline from 4, and for excavator L2 cache was halved, and Fmax lowered in exchange for better perf/watt and density at lower clocks..

2. Applying high density LOW frequency orientated physical design to a fundamentally high frequency, high power uArch.
1. Even though the number of pipes was reduced from 4 to 3, no functionality was lost.
http://i.imgur.com/8PHmV5l.png
The L2 was halved and latency was cut. The Fmax was lowered because the Vmax and leakage efficiency of PDSOI were lost by going bulk. PDSOI at stock was pretty horrible, *hint*FX-9370/FX-9590*hint* (manufacturer-certified working overclocks). Bulk at stock was better. Since the majority of people didn't buy it for overclocking, it was most successful in mobile/pre-built markets, which makes the move to bulk a good, conscious move. ((22FDX wasn't available anytime for Bristol/Stoney... so got to wait for 2018 for the refreshes in January.))

2. High density libs do not necessarily mean low frequency. Yes, there is a relative performance drop, but "relative" can mean <1% absolute perf loss.
http://i.imgur.com/CYx8hvD.jpg
^- 28HPC+ Standard Libs @ TSMC

http://images.anandtech.com/doci/8995/7 - Low Power Graphics.png
http://images.anandtech.com/doci/8995/6 - High Density Design.png
http://images.anandtech.com/doci/10436/Slide 9 - Power Frequency curve with libraries.png
^-- Looking at this, the majority of the high density selection is above that of Steamroller's high speed selection, which, to the right, shows that Excavator is faster than a legacy Steamroller design.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
Steamroller and Excavator became quite 'unbalanced' designs, in an attempt to gain throughput. It's quite clear now that the dual 4-way decoders per module are incredibly wasteful. The (already mentioned) various bottlenecks and design issues elsewhere in the pipeline meant the uplift was lower than expected, and these contributed nothing to ST performance.

I suspect that moving to dual decoders greatly increased the energy budget for Kaveri. If there had been an octacore Steamroller, it would either have had to clock way lower than an FX-8350, or consume 50% more at the same base clock (4GHz), that is, a TDP of almost 190W!
Compare the top-end Richland and Kaveri A10s: same base clocks, yet Richland has a 65W TDP versus the A10-7850K's 95W. http://www.cpu-world.com/CPUs/Bulldozer/AMD-A10-Series A10-6700 - AD6700OKA44HL.html
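The "almost 190W" estimate can be reproduced with a quick back-of-the-envelope sketch. The two APU TDPs are the ones quoted above; the 125W FX-8350 TDP is its published rating and is an assumption here, since the post does not state it. Scaling a CPU-only part by an APU TDP ratio is a crude proxy (APU TDP also covers the iGPU), so treat the output as illustrative only.

```python
# TDP figures in watts.
RICHLAND_A10_6700_TDP_W = 65.0   # from the post
KAVERI_A10_7850K_TDP_W = 95.0    # from the post
FX_8350_TDP_W = 125.0            # assumption: published TDP, not stated in the post


def scale_tdp(base_tdp_w: float, factor: float) -> float:
    """Scale a TDP by a multiplicative factor."""
    return base_tdp_w * factor


# Observed Kaveri/Richland TDP ratio at the same base clock.
apu_ratio = KAVERI_A10_7850K_TDP_W / RICHLAND_A10_6700_TDP_W

# The post rounds that ratio up to "50% more".
rounded_estimate_w = scale_tdp(FX_8350_TDP_W, 1.5)
# Using the unrounded ratio instead gives a slightly lower figure.
ratio_estimate_w = scale_tdp(FX_8350_TDP_W, apu_ratio)

print(f"Kaveri/Richland TDP ratio: {apu_ratio:.2f}x")
print(f"Octacore Steamroller at +50%: {rounded_estimate_w:.1f} W")
print(f"Octacore Steamroller at the raw ratio: {ratio_estimate_w:.1f} W")
```

Either way the hypothetical octacore lands in the 180-190W range at FX-8350 clocks, which is the point being made.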

And yes, most importantly, this did nothing to help Single thread performance!

In fact, possibly the opposite: the flagship Kaveri A10-7850K could only hit a 4GHz turbo frequency (versus the A10-6700's 4.3GHz): http://www.cpu-world.com/CPUs/Bulldozer/AMD-A10-Series A10-7850K.html

Surprisingly, Zen also has only a single 4-issue decoder (per core and pair of threads), but it has a µop cache ( https://upload.wikimedia.org/wikipe...cture.svg/576px-Zen_microarchitecture.svg.png ) and can dispatch 6 ops.

I suspect that adding one more ALU (or at least allowing one of the AGUs to be an AGU/ALU-lite) would have helped much more than getting rid of the shared decode.

Richland quad APUs suffered more from single thread ability than lack of multi thread. (If multithread was a problem, they could likely also have offered hexacore APUs.)
 

Insert_Nickname

Diamond Member
May 6, 2012
4,971
1,692
136
Richland quad APUs suffered more from single thread ability than lack of multi thread. (If multithread was a problem, they could likely also have offered hexacore APUs.)

If you keep Trinity/Richland above 4GHz, single thread performance is respectable. Preferably you'll want to run at or above the maximum turbo bin of the 6800K (4.4GHz), then it is decent. Not great, but decent. But you'll pay with a steep step up in power consumption.

I've just side-graded my HTPC from a 6800K to an 845, and it can match the 6800K's performance. Even exceed it in some cases. But at much lower wattage.
 
Reactions: amd6502

amd6502

Senior member
Apr 21, 2017
971
360
136
Yes, Richland was essentially the same (if not better) in single thread. It clocked much faster than Kaveri. Richland vs Godavari is pretty even in single thread. https://browser.primatelabs.com/geekbench3/compare/2947131?baseline=7032950 (it's possible the Richland was OC'd by ~200MHz vs the 7870K, which seems to run at stock). Linuxferret has a large number of interesting dozer family benchmark results: https://browser.primatelabs.com/user/26626

I meant to say that the dozer family in general lacked single thread more than multithread, and that this should have been at the top of their priority list when working on improvements from Trinity. Instead, as majord said above, they worked on total throughput. One selling point AMD marketing pitched was more even threads (vs SMT). This was a mistake: they partially abandoned their model of CMT with the dual decoders, and so gave up much of the efficiency advantage of CMT in the first place. Because of this, Steamroller would have been pretty bad in servers.

I think it must have been a decision based on time and budget limitations. Designing a wider decoder would have taken much too long, so they went with what was available to them and just stuck a xeroxed copy of the decoder in there. If Piledriver had been fixed with a proper front end (e.g. https://upload.wikimedia.org/wikipe...cture.svg/576px-Zen_microarchitecture.svg.png), and the capability of a third ALU, it'd have been much more competitive. With leapfrogging design teams on Zen, such rough revisions based on time limitations likely won't be repeated anymore.
 

Insert_Nickname

Diamond Member
May 6, 2012
4,971
1,692
136
If Piledriver had been fixed with a proper front end (eg https://upload.wikimedia.org/wikipe...cture.svg/576px-Zen_microarchitecture.svg.png), and the capability of a third ALU, it'd have been much more competitive. With leapfrogging design teams on Zen, such rough revisions based on time limitation likely won't be repeated anymore.

I've always thought the Bulldozer family suffered far more from the shared front end than from the shared decoder. After all, if you can't feed the decoders, what is the point of having them in the first place?

BD would probably have been better than it is with a wider front end, better branch predictors and better caches. You can see some of that in Excavator, which has much improved single thread performance over the previous cores and doesn't suffer the "module penalty" at all. If only it could clock higher.

But all of this is pointless speculation, because the BD family got Conroe'd overnight by Zen... but they're fun to play with... :)
 
Reactions: amd6502