Question Was it the tick or the tock that was the problem or something else?

lakedude · Dec 4, 2024

Bulldozer was recently mentioned in "leading edge nodes" and I got to wondering if the issues with Bulldozer were more architecture (AMD) or process node (Global) related. Then I got to wondering about other failures like the recent Raptor Lake issues.

I remember Global not keeping up so my guess is Global was at least partly to blame for Bulldozer being lackluster. Would Bulldozer have been awesome on a better node?

Seems to me that RL could have been a good product (at least for certain use cases) but they just pushed too hard. If they had just dialed it back a notch RL would have been fine (except for the oxidation issues).

What do y'all think?

NostaSeronx · Dec 4, 2024

It was the switch away from Cluster-based Multithreading to Chip-level Multithreading.

Zen5 as-is is basically the flipped(int<->fpu) design of the prior design to release Bulldozer design.
Dual front-end <-- (Steamroller+Excavator -- etc)
Dual Execution(2*FMUL/FMA + FADD) clusters + Memory(2 StD/IntD) cluster
Shared Integer Unit.

Hitman928 · Dec 4, 2024

lakedude said:
Bulldozer was recently mentioned in "leading edge nodes" and I got to wondering if the issues with Bulldozer were more architecture (AMD) or process node (Global) related. Then I got to wondering about other failures as well like the recent Raptor Lake issues as well.

I remember Global not keeping up so my guess is Global was at least partly to blame for Bulldozer being lackluster. Would Bulldozer have been awesome on a better node?

Seems to me that RL could have been a good product (at least for certain use cases) but they just pushed too hard. If they had just dialed it back a notch RL would have been fine (except for the oxidation issues).

What do y'all think?

lakedude · Dec 4, 2024

Thanks for the replies.

Nosta your reply is a little too far in the weeds for me to understand your meaning. Can you dumb it down a little?

Hitman your reply is a bit the other way, short on details, care to elaborate?

igor_kavinski · Dec 4, 2024

ChatGPT summary of weaknesses:

Weaknesses of AMD's Bulldozer Architecture:

Performance Regression: Bulldozer CPUs were slower than their predecessor, Phenom II, in single-core performance and only matched multi-core performance despite having more cores.

Power Consumption: Bulldozer processors were noted for running hot and consuming more power, which was detrimental to overall efficiency.

Outdated Design: By 2010, Bulldozer was based on an outdated architecture that struggled to compete with Intel's more advanced designs, such as Sandy Bridge.

Complexity and Ambitious Goals: The architecture aimed for ambitious improvements in both single-threaded and multi-threaded performance but faced challenges in execution, particularly in maintaining clock frequencies.

Branch Predictor Limitations: While the branch predictor was overhauled, it was still outperformed by Intel's Sandy Bridge, which had a more advanced and faster predictor.

Latency Issues: Bulldozer's design introduced higher latencies in certain operations, particularly in branch prediction and load/store operations.

Resource Allocation: The architecture's resource allocation for integer execution was relatively light, leading to potential bottlenecks when handling multiple instructions.

Dual-Threading Inefficiencies: Although designed for simultaneous multi-threading (SMT), Bulldozer's sharing of certain resources (like the FPU) often resulted in performance penalties rather than improvements.

Memory Access Penalties: The load/store unit's performance was hampered by higher latencies, especially when handling misaligned memory accesses compared to competitors.

Lack of Optimization for Single-Threaded Performance: The focus on multi-threading came at the cost of single-threaded performance, which remained subpar compared to Intel's offerings.

Poor Software Support for New Instructions: The introduction of new instruction sets (like AVX) was limited by a lack of widespread software support, hampering the effective use of these features.

Ineffective Handling of Cache Misses: The architecture struggled with instruction fetches from L2 caches, leading to performance degradation during cache misses.

By addressing these weaknesses, AMD aimed to create a more competitive architecture, but Bulldozer's shortcomings ultimately hindered its market performance.

And more:

Memory Bottlenecks:

Memory advances lag behind CPU speed, leading to significant cache latency issues.

The architecture's reliance on a triple-level cache hierarchy does not adequately overcome memory bottlenecks.

Cache Limitations:

The L1D cache size was reduced to 16 KB from the intended 64 KB, resulting in a significant decrease in cache capacity and higher L1D miss rates.

The L1D's latency increased from 3 to 4 cycles, making it slower than its predecessor (K10).

Despite advanced features like 4-way associativity and way-prediction, the small capacity results in poor hit rates and high miss rates per instruction.

Write Bandwidth Issues:

The L1D is a write-through cache, leading to poor write bandwidth performance (just over 10 bytes per cycle) compared to competitors.

The Write Coalescing Cache (WCC) has limited capacity (4 KB), which is inadequate for handling two threads effectively, resulting in lower overall write performance.

L2 and L3 Cache Deficiencies:

The L2 cache, while larger than K10's, suffers from higher latencies and does not effectively compensate for the weaknesses in the L1D.

The L3 cache has high latency (over 18 ns) and is not as efficient as Intel's designs, leading to poor performance in multithreaded scenarios.

High Latency in Core Communication:

Contested lock cmpxchg latency between threads in different modules is significantly high (over 200 ns), creating performance bottlenecks.

Core-to-core communication incurs additional latency, affecting performance in multi-threaded applications.

Power Efficiency Challenges:

Bulldozer is less power-efficient than competing architectures, partly due to the complexity of the design and the challenges associated with the 32 nm process node.

Reduced Execution and Reordering Capacity:

Execution resources and reordering buffers are smaller compared to Sandy Bridge, limiting performance, especially in single-threaded applications.

Lower reordering capacity combined with higher cache latency makes it difficult to extract parallelism, particularly during L1 misses.

Overall Performance Deficit:

Bulldozer's single-threaded performance significantly lags behind Sandy Bridge, which results from a combination of architectural weaknesses and Intel's advancements.

Multithreaded performance is competitive, but the design struggles with non-parallel workloads, leading to inefficiencies.

Design Compromises Due to Process Node Challenges:

Transitioning to the 32 nm process required compromises in design, affecting transistor density and overall efficiency.

Issues with the new process node resulted in a need to switch to less efficient 8T SRAM for certain caches, impacting performance.

Lack of Optimization for Power Consumption:

The simultaneous push for performance and power efficiency resulted in suboptimal designs, with difficulties in identifying significant power waste sources.

These weaknesses collectively hinder the Bulldozer architecture's ability to compete effectively with Intel's offerings, particularly in single-threaded scenarios and overall efficiency.

Source: https://chipsandcheese.com/p/bulldozer-amds-crash-modernization-caching-and-conclusion

Hitman928 · Dec 4, 2024

lakedude said:
Thanks for the replies.

Nosta your reply is a little too far in the weeds for me to understand your meaning. Can you dumb it down a little?

Hitman your reply is a bit the other way, short on details, care to elaborate?

Architecture wise, it was a server first design that sounded good in theory, but in the real world, failed miserably. AMD being stuck on a worse process exacerbated the issue. I would not be surprised at all if the architecture was strongly influenced by the fact that AMD engineers knew they would be constantly behind Intel on process (i.e., stuck on significantly less dense processes) as it attempted to give the best multi-core integer performance in the smallest area possible.

desrever · Dec 4, 2024

They somehow thought they could clock bulldozer at 6ghz without blowing up power, which would have made it a lot better. They thought PDSOI processes could handle it which it obviously didn't in retrospect.

Both the design and the process was at fault. If they refined what they did in Phenom II on 32nm, they could have had a better CPU than Bulldozer.

Gideon · Dec 4, 2024

desrever said:
They somehow thought they could clock bulldozer at 6ghz without blowing up power, which would have made it a lot better. They thought PDSOI processes could handle it which it obviously didn't in retrospect.

Both the design and the process was at fault. If they refined what they did in Phenom II on 32nm, they could have had a better CPU than Bulldozer.

The worst part about this is they actually went through the trouble of porting Phenom II to 32nm with Llano (though without L3):

The AMD Llano Notebook Review: Competing in the Mobile Market

www.anandtech.com

They even implemented proper turbo:

The AMD Llano Notebook Review: Competing in the Mobile Market

www.anandtech.com

Cosidering that AMDs 32nm, despite it's faults, was a pretty good shrink, they could have easily fit an 8-core thuban successor (with loads of L3) in the same transistor budget as Bulldozer.

And well, cosidering a 6-core Thuban managed to often beat the 8-core Bulldozer should tell all there is to tell about the architecture ...

Fallen Kell · Dec 4, 2024

The biggest issues with bulldozer were a combination of architecture and then software/driver/kernel support in operating systems to properly use it. The problems all stem from two CPU threads sharing a single FPU and the way L1 cache was segmented. This created all kinds of bottlenecks because operating systems did not know how to properly schedule the workload across the various threads to avoid heavy floating point load on two threads that had a shared FPU when bulldozer released.

In reality bulldozer really needed some additional internal scheduling monitoring and automated tuning to re-allocate the various threads when it detected high contention between the shared resources. If it had that, I think it would have been much better. On the same note, if the OS support for things that we now have existed (like P cores vs E cores, etc.), it also may have been able to function a lot better than it did. But those things were simply not there and as a result, once you started running multiple programs or multi-threaded programs that needed lots of floating point operations, the performance was just horrible.

desrever · Dec 4, 2024

Fallen Kell said:
The biggest issues with bulldozer were a combination of architecture and then software/driver/kernel support in operating systems to properly use it. The problems all stem from two CPU threads sharing a single FPU and the way L1 cache was segmented. This created all kinds of bottlenecks because operating systems did not know how to properly schedule the workload across the various threads to avoid heavy floating point load on two threads that had a shared FPU when bulldozer released.

In reality bulldozer really needed some additional internal scheduling monitoring and automated tuning to re-allocate the various threads when it detected high contention between the shared resources. If it had that, I think it would have been much better. On the same note, if the OS support for things that we now have existed (like P cores vs E cores, etc.), it also may have been able to function a lot better than it did. But those things were simply not there and as a result, once you started running multiple programs or multi-threaded programs that needed lots of floating point operations, the performance was just horrible.

Even in INT workloads Bulldozer wasn't good. The shared FPU was not the main issue imo. They did make OS change to schedule the CMT threads like SMT threads and it "fixed" most of that issue.

IPC was just so bad and there wasn't much AMD could do about it on that arch. There were a lot of bottlencks that got somewhat fixed by steamroller but even then, it was just not able to even remotely compete with Intel. Steamroller's INT IPC was probably still lower than Sandybridge.

Gideon · Dec 5, 2024

desrever said:
Steamroller's INT IPC was probably still lower than Sandybridge.

Probably? It wasn't even consistently above Core2 Duo or Phenom II, it just clocked higher. than the latter. Though by Steamroller it was at least close. Piledriver still wasn't quite where Phenom II was (IPC wise):

The Vishera Review: AMD FX-8350, FX-8320, FX-6300 and FX-4300 Tested

www.anandtech.com

Sandy bridge was loads faster. AMD claimed 40% IPC growth for Zen 1 (from Piledriver desktop). I remember discussions that it would have meant only Ivy Bridge level IPC. Luckily it was 52% instead, which was damn close to Skylake instead.

But thst shows just how far it was from Sandy Bridge IPC wise

NostaSeronx · Dec 5, 2024

lakedude said:
Nosta your reply is a little too far in the weeds for me to understand your meaning. Can you dumb it down a little?

It only gets to be more complex I think.

As example, Alpha 21164 has a single Ebox. Where 21264 has two Eboxes. Optimizing for Cluster-based Multithreading can allow a single core CMT2-tuned 21264 to replace a dual-core 21164. Cluster-based Multithreading is just Simultaneous Multithreading but for clusters rather than monolithic units.

Dual-core = 2x area increase for 1.7x performance increase from second thread.
CMT2 (Clustered execution core) = 1.5x area increase for 1.8x performance increase from second thread. Speed-up is from less components duplicated and lower distance overhead for MT.
Application of Cluster-based Multithreading can be switched to single-thread as a Clustered Microarchitecure w/o Multithreading.

Application of Chip-level Multithreading can not be reserved to a single-thread as it is separate units. This thus leans more into Dual-core in area and performance. As it has more components duplicated and has more distance overhead for MT.
~~~~

~~~~
Zen lineage has been transitioning over to being Bulldozer-like given the 2005 cluster-based multithreading and 2007 bulldozer slides. Which were not in the 2009+ Bulldozer design as it switched away from Cluster-based Multithreading to Chip-level Multithreading.

Zen3~5 core has flipped the clustered components from integer to floating point. Where the Integer component is shared and the FPU is clustered.

Bulldozer released (Chip-level Multithreading);
2x Retire
2x Integer Scheduler
2x Integer/Memory Execution
with a shared monolithic SMT2 FPU unit.

Bulldozer unreleased (Cluster-based Multithreading);
1x Retire
3x 2Integer/1Memory Scheduler
2x Integer Execution
1x Memory Execution
with a shared monolithic SMT2 FPU unit.

Zen5 released;
1x FP Retire
3x FP Schedulers
2x FP/SIMD Execution
1x Store/Convert Execution
with a shared monolithic SMT2 Integer unit.
// Front-end for Zen5 is basically a continuation of the front-end of Bulldozer ~ Steamroller. Where there is two fetches to two picks to two decodes.
~~~~

////\\\\
It was likely that Bulldozer Gen3 would have returned to being a cluster-based multithreading processor. By how the units were being smooshed together from BD Gen1 to BD Gen2.

There is also Zen5 having the correct integer scheduler layout for clustering.

1x Integer Execution Scheduler to 6 ALU (1x PRF 240-entry) to
2x Integer Execution Scheduler to 4 ALUs (2x PRF >128-entry) with shared memory unit AGU/LSU.

As they did the front-end, the floating point unit, it is likely to be the integer part next to be clustered.

It is purely AMD-sided that let Bulldozer launch as a chip-level multithreading part. Rather than keeping to the cluster-based multithreading part.

RTX · Dec 5, 2024

So was AMD licensing IBM's nodes for bulldozer/derivatives too?

Hitman928 · Dec 5, 2024

RTX said:
View attachment 112660
So was AMD licensing IBM's nodes for bulldozer/derivatives too?

AMD was part of an IBM consortium type of thing where multiple companies shared process development knowledge. It was driven by IBM but multiple companies were involved. I don’t remember all the details now, but yes, the only way AMD/GF got as far as they did with process nodes was by being a part of that group. IBM still does process research but it’s not the same as it was back then.

igor_kavinski · Dec 5, 2024

RTX said:
View attachment 112660

I would love to read the full interview. URL please, if possible. These behind the scenes juicy things are more exciting than the processors themselves. So much intrigue

RTX · Dec 5, 2024

What's that bit about Schumer about?

NostaSeronx · Dec 5, 2024

RTX said:
So was AMD licensing IBM's nodes for bulldozer/derivatives too?

They were collaborating on the nodes. With several partnerships going on 1998 to 2003. AMD/Motorola -> AMD/UMC/Infineon -> AMD/IBM.

"According to the agreement, AMD and IBM will be able to use the jointly developed technologies to manufacture products in their own chip fabrication facilities and in conjunction with selected manufacturing partners." - January 15, 2003

However, AMD's side of the partnership didn't actually appear. As it was during the phat Tsi/BOX-era of ETSOI.

- 2005, 32nm node ETSOI, Tsi(Tch)=18nm

With STMicroelectronics basically finally getting FDSOI to fruition;
"STMicroelectronics (NYSE: STM) and IBM (NYSE: IBM) today announced that the two companies have signed an agreement to collaborate on the development of next-generation process technology - the “recipe” that is used in semiconductor development and manufacturing." - July 24, 2007

-2010, 32nm node UTBB FDSOI

Where GlobalFoundries officially pops backup in 2011 for the 28nm/22nm to 20nm[14nm]/14nm[10nm] node stuff.

90nm PDSOI = IBM/AMD/Toshiba/Sony
65nm PDSOI = IBM/AMD/Toshiba
45nm PDSOI = IBM/AMD
32nm PDSOI = IBM/AMD/Freescale, 32nm PDSOI plus (accustomed to AMD) = GlobalFoundries
22nm PDSOI = IBM
22nm FDSOI = IBM/STMicroelectronics/CEA-Leti/SOITEC/Toshiba/Renesas/GlobalFoundries

RTX · Dec 5, 2024

What happened to FreeScale? They made a rectangular 67nm hfin but only 2.68x aspect ratio. Is 8x AR / 72nm hfin possible?

Hitman928 · Dec 5, 2024

RTX said:
What happened to FreeScale? They made a rectangular 67nm hfin but only 2.68x aspect ratio. Is 8x AR / 72nm hfin possible?

View attachment 112675

They got bought by NXP. I'm guessing any advanced node development was stopped but you can look to see if NXP is doing anything in that area.

yuri69 · Dec 5, 2024

Gideon said:
The worst part about this is they actually went through the trouble of porting Phenom II to 32nm with Llano (though without L3):

Cosidering that AMDs 32nm, despite it's faults, was a pretty good shrink, they could have easily fit an 8-core thuban successor (with loads of L3) in the same transistor budget as Bulldozer.

And well, cosidering a 6-core Thuban managed to often beat the 8-core Bulldozer should tell all there is to tell about the architecture ...

The problem with extending the life of Phenom was its accumulated age. AMD needed a base architecture designed for the predicted 2010-2015 workloads - this means stuff like AVX, FMA, and who knows what else.

Phenom (10h) was based on Opteron (K8) which was based on Athlon (K7) from 1999. Carrying this stuff for another 5 years surely looked wrong to AMD. So they went with Bulldozer instead...

Sure, extending the 6c 10h to 8 cores might work for a year or two. But it would be the same stuff as the 6 cores were - just a stopgap.

---

Btw that ChatGPT summary got it pretty well: a radical departure from existing stuff, totally strange cache sizes, horrible latency everywhere, odd penalties at multiple places, missed frequency targets, etc.

lakedude · Dec 5, 2024

yuri69 said:
Btw that ChatGPT summary got it pretty well: a radical departure from existing stuff, totally strange cache sizes, horrible latency everywhere, odd penalties at multiple places, missed frequency targets, etc.

Ok, it looks to me that most of the items in that list are architecture/AMD related. I've read that Bulldozer had long pipelines (like the P4) to enable higher frequencies so was the process node to blame for the underwhelming frequencies?

desrever · Dec 5, 2024

lakedude said:
Ok, it looks to me that most of the items in that list are architecture/AMD related. I've read that Bulldozer had long pipelines (like the P4) to enable higher frequencies so was the process node to blame for the underwhelming frequencies?

the process node was like a year late and not able to clock as high as they expected

igor_kavinski · Dec 5, 2024

lakedude said:
I've read that Bulldozer had long pipelines (like the P4) to enable higher frequencies

They did get to 5 GHz first with FX-9590 so mission accomplished compared to Intel whose promises of a 10 GHz P4 never materialized.

lakedude · Dec 5, 2024

NostaSeronx said:
It only gets to be more complex I think.

Thanks for taking the time to explain all that. I'm still struggling to understand it all but I'm not giving up.

Prior to your post I had no idea multithreading was so complicated. All I knew was that hyperthreading was basically lying to the OS and telling it there were twice as many cores as there really are and I assumed all the different multithreading terms were just different terminology for the same thing.

Fallen Kell · Dec 5, 2024

lakedude said:
Thanks for taking the time to explain all that. I'm still struggling to understand it all but I'm not giving up.

Prior to your post I had no idea multithreading was so complicated. All I knew was that hyperthreading was basically lying to the OS and telling it there were twice as many cores as there really are and I assumed all the different multithreading terms were just different terminology for the same thing.

Yeah, all the different types have different reasons and results. But even hyperthreading appearing as 2 cores is just how OS's interpreted it and represented it to the users. The chips were not lying, really the OS was as they didn't want to make a distinction to end users. That said, under most work loads, hyperthreading typically provided 50-60% additional throughput of the overall unit than when disabling hyperthreading. Intel was correct in that threads stuck in IO wait were indeed beginning to bottleneck cores that could otherwise be working on other things if only it didn't have to pay the heavy tax of clearing the entire pipeline of the existing thread to load another thread.

Question Was it the tick or the tock that was the problem or something else?

Platinum Member

Diamond Member

Diamond Member

Platinum Member

Lifer

Diamond Member

Senior member

Platinum Member

Diamond Member

Senior member

Platinum Member

Diamond Member

Member

Diamond Member

Lifer

Member

Diamond Member

Member

Diamond Member

Senior member

Platinum Member

Senior member

Lifer

Platinum Member

Attachments

Diamond Member