Question: Was it the tick or the tock that was the problem, or something else?


NostaSeronx

Diamond Member
Sep 18, 2011
Ok, it looks to me that most of the items in that list are architecture/AMD-related. I've read that Bulldozer had long pipelines (like the P4) to enable higher frequencies, so was the process node to blame for the underwhelming frequencies?
The pipeline stage increase is no different from the increase from Core 2 (14 stages) -> Nehalem (16 stages) -> Sandy Bridge (19 stages from fetch / 14 stages from the uop cache).

The increase in pipeline depth for Bulldozer (Zambezi) was to reduce power at 1.3 V and increase frequency at 0.8 V, relative to Husky (Llano).
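To make that trade concrete, here's a toy calculation (my own illustration, not AMD numbers): dynamic power goes roughly as C*V^2*f, so the extra pipeline latches cost a little capacitance but buy either a lower voltage at the same clock or a higher clock at the same voltage. The +10% capacitance, the 1.15 V figure, and the +15% clock below are assumptions; only the 1.3 V / 0.8 V endpoints come from the post.

```python
# Toy model: dynamic power ~ C * V^2 * f (all values relative to a
# shallow-pipeline baseline). The +10% latch capacitance, the 1.15 V
# "deeper pipeline meets the same clock" value, and the +15% clock gain
# are illustrative assumptions only.

def rel_dynamic_power(c_rel, volts, f_rel):
    return c_rel * volts ** 2 * f_rel

# High-voltage point: same clock, deeper pipeline allows a lower supply.
shallow_hi = rel_dynamic_power(1.00, 1.30, 1.00)
deep_hi    = rel_dynamic_power(1.10, 1.15, 1.00)
print(f"deep pipeline at the 1.3 V point: {deep_hi / shallow_hi:.0%} of baseline power")

# Low-voltage point: supply pinned at 0.8 V, so the shorter per-stage
# delay is spent on clock instead.
shallow_lo = rel_dynamic_power(1.00, 0.80, 1.00)
deep_lo    = rel_dynamic_power(1.10, 0.80, 1.15)
print(f"deep pipeline at the 0.8 V point: {deep_lo / shallow_lo:.0%} power for +15% clock")
```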

HP-team (CTO of Cores: FW -> DM (interim)):
K9, 65nm PDSOI = 5 GHz target frequency
K10 (proto-Bulldozer), 45nm PDSOI = 9 GHz target frequency (quad-processor/quad-core/octo-cluster)
Bulldozer, 32nm PDSOI = 3.5 GHz target frequency (quad-processor/octo-core)

LP-team (CTO of Cores: PH):
The unreleased Bulldozer, 45nm PDSOI = 2 GHz target frequency (octo-processor/octo-core)
The unreleased Bobcat, 65nm PDSOI = 1 GHz target frequency (single-core). Same node as K8L (Lion core, Fam 11h), which is why the Bobcat IEEE paper says: "The Bobcat core uses about one-third the area the earlier K8 architecture would have used if implemented in the same fabrication process."

The unreleased Bulldozer comes to close to 2/3rd the area of Husky.
Husky = 9.69 mm2 core area
Husky 2-core = 19.36 mm2 core area (however, more of the care is put on L2 size than on this)
The unreleased Bulldozer on 32nm would be 14.52 mm2 if it were based on Husky, which it isn't; it was instead based on a higher-performance-target Bobcat. 40nm bulk -> 32nm PDSOI can cover the high-performance cost-add, giving roughly 7.35 mm2 of general area for the unreleased, fully mobile-focused/high-core-count Bulldozer.
ILP efficiency (OoO) with 16 registers and a single L1 memory, or TLP efficiency (CMT/SMT) with 2x16 registers and a single L1 memory.

The leap in size basically comes down to the out-of-order resources; look at them to see why the released Bulldozer is so large:
- Bobcat = 56-entry Retire
- Husky = 84-entry Retire
- Bulldozer (module/processor-level resource) = 256-entry Retire, which is comparable to Zen 3 on 7nm with its 256-entry Retire.
If the retire is bigger, then all the other OoO resources are bigger too.

Also, newer designs can use fewer OoO resources for the same performance.
Cortex-A65/Neoverse-E1 = 40-entry OoO with SMT2, with slightly better capabilities from its 2x64-bit ALUs and 2x64-bit FPUs.
E1 on 7nm = 2.5 GHz/183 mW/0.46 mm2 with 128KB L2.
If the MT-core design comes back in ARMv9, it is a candidate for Cluster-based Multithreading if pushed toward Cortex-A76/N1 performance.

The switch to this mindset ("The Importance of Being Low Power in High Performance Computing", the era when some AMD box packages carried the green leaf or green circle logo) began in 2005, with the focus that HE/EE parts would replace SE parts. Hence why the low-power-focused Bulldozer came first (2005/2007), and why Cluster-based Multithreading used a pure green tone.
 
Jul 27, 2020
The chips were not lying; really the OS was, as they didn't want to make a distinction for end users.
I would really prefer if Windows Task Manager had an option to show greater than 100% utilization for a core instead of showing extra virtual cores. That way one could easily see how much extra performance SMT is providing to the chip.

HT matters a lot on 2-core and 4-core processors, where turning off HT makes things feel considerably slower.
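Something like this would show what I mean. It's a rough, untested sketch (Linux-only, since I don't know a clean way to pull the sibling pairs out of Windows) using the standard sysfs topology files plus psutil to sum both hardware threads of each physical core into one number that can read above 100%.

```python
# Rough sketch (Linux-only, untested): report per-physical-core utilization
# as the sum of its SMT sibling threads, so a busy core can read as >100%.
import glob
from collections import defaultdict

import psutil  # third-party: pip install psutil


def package_and_core(cpu_dir):
    """Identify which physical core a logical CPU's sysfs directory belongs to."""
    with open(f"{cpu_dir}/topology/physical_package_id") as f:
        package = int(f.read())
    with open(f"{cpu_dir}/topology/core_id") as f:
        core = int(f.read())
    return package, core


# Group logical CPU ids by (package, core), i.e. collect SMT siblings.
siblings = defaultdict(list)
for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
    logical_id = int(cpu_dir.rsplit("cpu", 1)[1])
    siblings[package_and_core(cpu_dir)].append(logical_id)

# Sample one second of per-logical-CPU utilization, then sum the siblings.
percpu = psutil.cpu_percent(interval=1.0, percpu=True)
for (package, core), cpus in sorted(siblings.items()):
    combined = sum(percpu[c] for c in cpus)
    print(f"package {package} core {core:2d} (threads {sorted(cpus)}): {combined:5.1f}%")
```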
 
Jul 27, 2020
K9, 65nm PDSOI = 5 GHz target frequency
K10 (proto-Bulldozer), 45nm PDSOI = 9 GHz target frequency (quad-processor/quad-core/octo-cluster)
What were they smoking back then? 5 GHz and 9 GHz target frequencies with THOSE process technologies???
 

NostaSeronx

Diamond Member
Sep 18, 2011
What were they smoking back then? 5 GHz and 9 GHz target frequencies with THOSE process technologies???
It was feasible to do it. We technically have some sort of proof.

POWER6 = 65nm PDSOI, 14-stage pipeline with reduced FO4, Fmax > 6 GHz
POWER7 = 45nm PDSOI, more pipeline stages with increased FO4; a switch from a higher-frequency to a lower-power design.
Gutting the OoO and FPU increases while keeping FO4 stable wouldn't have been impossible.
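Rough numbers behind that, if anyone wants them: cycle time is basically (FO4s per cycle) x (FO4 inverter delay). The ~13 FO4 cycle for POWER6 is the commonly published figure; the ~13 ps FO4 delay for 65nm SOI is my own ballpark assumption, so treat the output as order-of-magnitude only.

```python
# Back-of-the-envelope clock estimate from cycle time expressed in FO4.
# FO4_DELAY_PS is an assumed ballpark for 65nm SOI, not a measured value.
FO4_DELAY_PS = 13.0

def fmax_ghz(fo4_per_cycle, fo4_delay_ps=FO4_DELAY_PS):
    cycle_time_ps = fo4_per_cycle * fo4_delay_ps
    return 1000.0 / cycle_time_ps  # a 1000 ps period == 1 GHz

print(f"~13 FO4/cycle (POWER6-like): {fmax_ghz(13):.1f} GHz")   # ~5.9 GHz
print(f"~20 FO4/cycle (conventional): {fmax_ghz(20):.1f} GHz")  # ~3.8 GHz
```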
 

Gideon

Platinum Member
Nov 27, 2007
Phenom (10h) was based on Opteron (K8) which was based on Athlon (K7) from 1999. Carrying this stuff for another 5 years surely looked wrong to AMD. So they went with Bulldozer instead...
Oh absolutely, I have no illusions about that. I only mentioned it as the discussion was about whether the node itself was the major thing to blame.

I'd even extrapolate some more. By Zen 6's launch in 2026 the Zen uarch will be 9 years old. While the situation is entirely different performance-wise, it's not hard to see signs of the architecture slowly getting "too long in the tooth" for Zen too.

Not because of performance, but other signs seem to be there:

  • Mike Clark mentioned that for Zen 5 they decided to remove complexity from Zen 4's frontend (NOP fusion, etc.) to get it shipped in time ... and it still took 22 months. On top of that, it seems they couldn't yet get the ultra-complex uop cache + clustered decoder combination to work for a single thread (though most of the plumbing seems to be there).
  • Jim Keller has also stated in multiple interviews that architectures that are iterated on for too long get complex and hard to improve upon. Yet managers really don't want engineers to start from scratch, as that means the "next thing" won't be faster at everything, or rather that they deliberately have to decide what to let go / make slow (every ground-up redesign actually has plenty of tradeoffs; it's just that if they are well chosen, the end user never stumbles on them). There is also the obvious risk of ending up with something like the P4 or Bulldozer or whatever the last Samsung core was: wasting resources and bloating up area for no actual gain.

Obviously Zen 5 is a bigger departure than Zen 1 - Zen 4, but all in all it's still very evolutionary (same number of stages, similar layout, just a bit wider). It doesn't seem to be the kind of redesign that K10 -> Bulldozer or Bulldozer -> Zen were.

As a software developer it's easy to see similar patterns in software systems. Engineers call for little else than a "ground-up rewrite" of complex old systems, but it's actually very hard to do well. You have to distill the actual requirements for the system (not easy when it has lots of features and a wide client base) and decide what to let go. If you want "everything but new and better", the "clean and beautiful rewrite" very quickly becomes as bloated as the old system.

It's hard to do rewrites and you have to be ready to accept compromises. But if done well, you can extract huge gains from many subsequent generations. This pattern is easily observable for:

  • Zen 1 -> Zen 4 (~20% total ST uplift every gen, lots of other improvements; see the quick compounding check after this list)
  • Intel's new Atom cores (30% IPC growth each gen)
  • Qualcomm Oryon cores
  • Apple cores of old
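Quick compounding check on that first bullet (my arithmetic, nothing official): three ~20% generational steps from Zen 1 to Zen 4 multiply out to roughly +73% single-thread overall.

```python
# Compounding ~20% ST per generation over Zen 1 -> 2 -> 3 -> 4 (three steps).
per_gen_uplift = 1.20
steps = 3
print(f"cumulative ST uplift: {per_gen_uplift ** steps - 1:.0%}")  # ~73%
```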

All in all, I'm quite convinced AMD has at least some totally new designs in the oven. The earliest we could see them is in the Zen 7 timeframe (unless something "Skunk Works"-level secret is planned for new/other markets).
 

lakedude

Platinum Member
Mar 14, 2009
I would really prefer if Windows Task Manager had an option to show greater than 100% utilization for a core instead of showing extra virtual cores.
IDK about that. You don't really ever get over 100%.

A CPU is way faster than RAM and storage, so it is sitting around waiting to be fed. Hyperthreading is just a way to keep a CPU busier. Even with hyperthreading I expect there are still plenty of wasted clock cycles in typical workloads. Some small loop that fits entirely in cache might keep a CPU busy...
 
Jul 27, 2020
IDK about that. You don't really ever get over 100%.
Don't know about current CPUs, but I've seen bad web pages push cores to almost 100% usage on Ivy Bridge and Haswell, and even 4 cores of Ice Lake can feel sluggish without HT. By going above 100%, I mean adding the utilization of the real and virtual core: if both threads are at something like 55% and 51%, that would be 106%. I don't expect that to go above 150% unless it's some crazy targeted code written to benefit from SMT.
 

yuri69

Senior member
Jul 16, 2013
Oh absolutely, I have no illusions about that. I only mentioned it as the discussion was about whether the node itself was the major thing to blame.

I'd even extrapolate some more. By Zen 6's launch in 2026 the Zen uarch will be 9 years old. While the situation is entirely different performance-wise, it's not hard to see signs of the architecture slowly getting "too long in the tooth" for Zen too.

...

All in all, I'm quite convinced AMD has at least some totally new designs in the oven. The earliest we could see them is in the Zen 7 timeframe (unless something "Skunk Works"-level secret is planned for new/other markets).
Yea, Zen 5 is a rather large rework - it was surely a project to create a base for the next 5 years. Right now it's an imbalanced design, but it might have potential.

That complexity issue you mentioned might affect the launch cadence, which has slowed down significantly since Zen 3. Zen 6 is reportedly going to be really late. I can't even imagine Zen 7's timeframe and its design goals.
 
Jul 27, 2020
Apparently they think they are in no danger from Apple and WinARM SoCs. Well, AMD is allegedly doing the latter with the Sound Wave APU, so I guess they want to see how that fares before turning the throttle all the way up on x86 once again.
 

eek2121

Diamond Member
Aug 2, 2005
CMT as a concept isn’t bad. AMD just didn’t have a good design. I’d really love to see a good implementation. Maybe one day…