
News [Register] AMD agrees to cough up $35-a-chip payout over eight-core Bulldozer advertising fiasco

Feb 4, 2009
29,814
10,355
136
Yea, it's super crappy it's limited to the eight "core" models. And yes, cardinal rule indeed. Never do that unless you want to be free tech support for the life of that build. Last time I did that was 6-7 years ago for my brother, but that actually worked out very well. Never any problems thankfully. He's fairly computer competent though.
Yeah, I have a friend who is pretty competent but a little scared to do his own build. I told him to read up and I'll gladly supervise; however, I'm not touching anything. It's all on him.
 

AMDK11

Member
Jul 15, 2019
40
28
51
AMD's documentation describes the two threads on a Bulldozer module as being implemented with clustered multithreading (CMT), i.e. two integer clusters per module.

Core K10
L1-I 64KB 2-Way
3 x86 decoders
3ALU / 3AGU
L1-D 64KB 2-Way
FPU 128bit

Bulldozer module (CMT core)
L1-I 64KB 2-Way
4 x86 decoders
2x 2ALU, 2AGU (CMT)
2x L1-D 16KB 4-Way (CMT)
FPU 2x 128bit (SMT)

Nehalem SMT core
L1-I 32KB 4-Way
4 x86 decoders
3ALFPU, 2AGU
FPU 128bit
L1-D 32KB 8-Way

SandyBridge SMT core
L1-I 32KB 8-Way
4 x86 decoders
3ALFPU, 2AGU
FPU 256bit
L1-D 32KB 8-Way

Haswell SMT core
L1-I 32KB 8-Way
4 x86 decoders
3ALFPU, 1ALU, 3AGU
FPU 2x 256bit
L1-D 32KB 8-Way

Zen SMT core
L1-I 64KB 4-Way
4 x86 decoders
4ALU, 2AGU
L1-D 32KB 8-Way
FPU 2x 128bit

Zen2 SMT core
L1-I 32KB 8-Way
4 x86 decoders
4ALU, 3AGU
L1-D 32KB 8-Way
FPU 2x 256bit

SunnyCove SMT core
L1-I 32KB 8-Way
5 x86 decoders
3ALFPU, 1ALU, 4AGU
FPU 512bit (2x 256bit, + 512bit ICL-SP?)
L1-D 48KB 12-Way

To me, the Bulldozer module (and its successors) is a CMT core that was AMD's answer to SMT cores, except that a single thread has access to only one integer cluster.

Instead of building a core with a wide integer block (e.g. 3-4 ALU, 2-3 AGU and a single 32-64KB 4-8-way L1-D) with SMT, AMD developed a smaller, simpler integer design with 2 ALU, 2 AGU and a 16KB 4-way L1-D, then doubled it, and so a competitor to Intel's SMT core was created with little effort. Unfortunately, it had a devastating effect on single-thread performance.

Maybe AMD was blocked by patents that Intel later released, or maybe there was a problem with the engineers? Maybe both? I don't know.

The BD / PD module is equivalent to an SMT core, but with a different approach to multithreading. If it had been advertised as 4 modules (CMT cores) / 8 threads, I think there would have been no problem.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,254
794
136
Bulldozer module (CMT core)
L1-I 64KB 2-Way
4 x86 decoders
2x 2ALU, 2AGU (CMT)
2x L1-D 16KB 4-Way (CMT)
FPU 2x 128bit (SMT)
Bulldozer module = CMT module
Multithreaded Branch Predictor -> Vertical Multithreaded AMD64 Fetch/1x 64 KB 2-way L1i -> Per-core Instruction Byte Buffer -> Vertical Multithreaded AMD64 Decode/Dispatch -> Per-core Macro-op Retire Queue -> Per-core Macro-op Dispatch -> Per-core Macro-op to Micro-op Scheduler -> Per-core 2ALU+2AGU or per-unit 2xFMAC+2xPALU -> Per-core LSU -> Per-core 16KB 4-way L1d -> 4KB, 4-way WCC and 64B 4-entry WCB -> L2 cache.

The FPU is 4x 128-bit
PADD can be executed on MAL[P2 | P3] => 2-cycle latency
PMUL, PMADD can be executed on MMA[P0] => 5-cycle latency(PMUL)/4-cycle latency(PMADD)
FMUL, FADD, FMADD can be executed on FMA[P0 | P1] => 6-cycle latency(FMUL, FADD, FMADD)
Four 128-bit pipes; 2x FMACs(1x has the Packed Multiply&Multiply-add), 2x PALUs.

In a non-CMT design, only the P0/P3 pipes would be present.

Husky -> 84-entry OoO window + 42-entry FPU scheduler + 3x 128-bit pipes + 120-entry PRF <- 32nm PDSOI
Bulldozer -> 256-entry OoO window(from both cores) + 64-entry FPU scheduler + 4x 128-bit pipes + 160-entry PRF <- 32nm PDSOI
Bobcat -> 56-entry OoO window + 18-entry FPU scheduler + 2x 64-bit pipes + 88-entry PRF <- 40nm bulk
Jaguar -> 64-entry OoO window + 18-entry FPU scheduler + 2x 128-bit pipes + 72-entry PRF <- 28nm bulk
Zen -> 192-entry OoO window + 36-entry FPU scheduler + 4x 128-bit pipes + 160-entry PRF<- 14nm bulk
Zen2 is the only one with the rename width of the rename part labeled;
224-entry OoO Window + 64-entry NSQ + 36-entry FPU scheduler + 4x 256-bit pipes + 160-entry PRF <- 7nm bulk
 

AMDK11

Member
Jul 15, 2019
40
28
51
Maybe I generalized too much and was imprecise. I meant AVX bandwidth. Bulldozer performs 2x 128bit per clock cycle and Zen 2 performs 2x 256bit, even though each has four FPU pipelines:
Bulldozer module 4x 128bit - SIMD 2x128bit per cycle
Zen core 4x 128bit - SIMD 2x128bit per cycle
Zen2 core 4x 256bit - SIMD 2x256bit per cycle

Intel has three FPU pipelines integrated with the ALUs:
Nehalem core 3x 128bit - SIMD 1x128bit per cycle
SandyBridge core 3x 256bit - SIMD 1x256bit per cycle
Haswell core 3x 256bit - SIMD 2x256bit per cycle
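
To make the per-cycle figures above easier to compare, here is a minimal Python sketch that simply multiplies SIMD operations per cycle by their width. The numbers are the ones listed in this post, not measurements; the dictionary and function names are made up for illustration.

```python
# Peak SIMD bits per cycle = (SIMD ops issued per cycle) * (width of each op in bits).
# The figures are copied from the post above; all names here are illustrative only.
SIMD_PER_CYCLE = {
    "Bulldozer module": (2, 128),
    "Zen core": (2, 128),
    "Zen2 core": (2, 256),
    "Nehalem core": (1, 128),
    "SandyBridge core": (1, 256),
    "Haswell core": (2, 256),
}


def peak_simd_bits(ops_per_cycle: int, width_bits: int) -> int:
    """Return the peak number of SIMD bits processed per clock cycle."""
    return ops_per_cycle * width_bits


def peak_table() -> dict[str, int]:
    """Build a name -> peak bits/cycle mapping from SIMD_PER_CYCLE."""
    return {name: peak_simd_bits(*spec) for name, spec in SIMD_PER_CYCLE.items()}


if __name__ == "__main__":
    for name, bits in peak_table().items():
        print(f"{name:18s} {bits:4d} bits/cycle")
```

Keep in mind that the Bulldozer figure is per module, so with both threads active those 2x 128bit are shared between two threads.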

However, the fact is that the Bulldozer Module (4-way x86 decoder) was the successor to the K10 core (3-way x86 decoder) and a competitor of the Intel SMT core (4-way x86 decoder).

Only high clock speeds saved single-thread performance, together with having more resources for multithreading than K10. This does not change the fact that IPC on a single thread has dropped dramatically.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,254
794
136
This does not change the fact that IPC on a single thread has dropped dramatically.
Technically, ILP per core against Greyhound-Husky is higher/more efficient. With both cores as logical threads the ILP efficiency is still better than what is expected with Zen/Zen2 in SMT-mode. In general, the server/HPC workloads have FX(95w) operating ~2.7x faster than Phenom II x6(125W).
 

SPBHM

Diamond Member
Sep 12, 2012
4,994
351
126
How is this really different than one company selling a CPU with a "proper" core but much lower IPC?
well, it would still be accurate to call a slow core a core,


it looks to me like most people agree that 1 BD module is not equal to 2 cores, both technically and in terms of performance

I don't think that people got scammed or anything, 1 module performs mostly like 2 cores, it's just that the single thread performance per module was already low, and then you have added to that some penalty from shared resources...

companies have to be careful with this sort of thing and I do think AMD made a mistake,
 

AMDK11

Member
Jul 15, 2019
40
28
51
Technically, ILP per core against Greyhound-Husky is higher/more efficient. With both cores as logical threads the ILP efficiency is still better than what is expected with Zen/Zen2 in SMT-mode. In general, the server/HPC workloads have FX(95w) operating ~2.7x faster than Phenom II x6(125W).
Wherever I see IPC tests of a BD module vs. a Zen core, single-thread performance is much lower. Even with two active integer clusters, the module struggles against a Zen core with SMT turned off. With SMT enabled on the Zen core, the module is still much less efficient.
Maybe thread scaling within the module is better, but a single thread is so much less efficient that a second cluster is not enough to overtake the SMT Zen core.
I have a PC with FX 6300 (3 Piledriver modules) and a PC with Haswell-E. I used to do module performance tests with active CMT and the module is less efficient than the SMT core. Theory is theory but reality is brutal.
 

Insert_Nickname

Diamond Member
May 6, 2012
4,062
638
126
Wherever I see IPC tests of a BD module vs. a Zen core, single-thread performance is much lower. Even with two active integer clusters, the module struggles against a Zen core with SMT turned off. With SMT enabled on the Zen core, the module is still much less efficient.
Maybe thread scaling within the module is better, but a single thread is so much less efficient that a second cluster is not enough to overtake the SMT Zen core.
I have a PC with FX 6300 (3 Piledriver modules) and a PC with Haswell-E. I used to do module performance tests with active CMT and the module is less efficient than the SMT core. Theory is theory but reality is brutal.
It would have helped BD if a single thread had access to all of a module's resources. This is what has made Zen great: apart from the architecture improvements, a single thread now has access to all of a Zen core's resources.

The way it is, a single thread can only access one of the integer clusters at a time.
 

AMDK11

Member
Jul 15, 2019
40
28
51
Yes, that's right. A single thread has access to only one integer cluster, which is the biggest problem with CMT. I wanted to point out that even a full module with two integer clusters struggles badly against an SMT core.

Not only that, the module with two active integer clusters often struggles even against a core with a wide integer block and SMT turned off.

No matter how you look at the Bulldozer module, one thing is certain, a single integer cluster is much simpler in design than wide. And here, among other things, I would see the reason for AMD's transition to this exotic microarchitecture.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,254
794
136
-> 48B predecode at most for Greyhound/Husky; 2 full decode windows
-> 256B predecode(core0/core1) or 512B predecode(core0+core1) at most for Bulldozer; 8 full decode windows per core
-> 160B(SMT) or 320B(no-SMT) at most for Zen; 5 full decode windows per thread, or 10 full decode windows per core.
"In SMT mode each thread has 10 dedicated IBQ entries."

-> 72/84 inflight macro-instructions for Greyhound/Husky
-> 128 inflight macro-instructions(core0/core1) or 256 inflight macro-instructions(core0+core1) for Bulldozer
-> 96/112(SMT) inflight per thread or 192/224(no-SMT) inflight per core at most for Zen1/Zen2
"224 micro ops or 112 per thread in SMT mode."

The most efficient mode overall for Zen/Zen2 is no-SMT, as there is a significantly larger decode window + inflight window for thread 0 without thread 1. In Bulldozer, however, it is when both cores are running.
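
As a rough illustration of the numbers above, here is a small Python sketch of how much predecode and inflight window a single thread sees with one or two threads active. It assumes a very simplified model: the Zen (SMT) totals are split between active threads, while each Bulldozer (CMT) core always keeps its own half. The class and function names are hypothetical, and the figures are the ones quoted in this post (Zen1 values for the inflight window).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrontEnd:
    name: str
    predecode_bytes: int    # total across the whole core/module
    inflight: int           # total inflight window across the whole core/module
    statically_split: bool  # True = CMT-style private halves, False = SMT-style sharing


# Figures quoted in the post above (assumed correct, not verified here).
BULLDOZER = FrontEnd("Bulldozer module", 512, 256, statically_split=True)
ZEN = FrontEnd("Zen core", 320, 192, statically_split=False)


def per_thread(fe: FrontEnd, active_threads: int) -> tuple[int, int]:
    """Return (predecode bytes, inflight entries) available to one thread."""
    if active_threads not in (1, 2):
        raise ValueError("this toy model only covers 1 or 2 threads")
    # CMT: each core owns half whether or not its sibling is busy.
    # SMT: a lone thread gets everything, two threads split it.
    share = 2 if fe.statically_split else active_threads
    return fe.predecode_bytes // share, fe.inflight // share


if __name__ == "__main__":
    for fe in (BULLDOZER, ZEN):
        for n in (1, 2):
            print(fe.name, n, "thread(s):", per_thread(fe, n))
```

In this model the per-thread window in Bulldozer does not change when the second thread starts, while in Zen it halves, which is the distinction being drawn here. It says nothing about execution width, so it is not a performance prediction.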
 

NTMBK

Diamond Member
Nov 14, 2011
9,249
2,641
136
-> 48B predecode at most for Greyhound/Husky; 2 full decode windows
-> 256B predecode(core0/core1) or 512B predecode(core0+core1) at most for Bulldozer; 8 full decode windows per core
-> 160B(SMT) or 320B(no-SMT) at most for Zen; 5 full decode windows per thread, or 10 full decode windows per core.
"In SMT mode each thread has 10 dedicated IBQ entries."

-> 72/84 inflight macro-instructions for Greyhound/Husky
-> 128 inflight macro-instructions(core0/core1) or 256 inflight macro-instructions(core0+core1) for Bulldozer
-> 96/112(SMT) inflight per thread or 192/224(no-SMT) inflight per core at most for Zen1/Zen2
"224 micro ops or 112 per thread in SMT mode."

The most efficient mode overall for Zen/Zen2 is no-SMT, as there is a significantly larger decode window + inflight window for thread 0 without thread 1.
Yes, and decode window is all that matters :rolleyes: There will inevitably be some long latency instruction chains that your OoO machinery can't completely hide, at which point you have underutilized execution resources. That's where SMT comes in, and gives you some "free" performance.
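
A toy Python sketch of that idea, under heavy assumptions: a single shared issue slot, and each thread described by a made-up trace where `X` is a cycle in which it wants to issue and `.` is a stall cycle (e.g. waiting on memory) that elapses regardless of what the other thread does. The function name and trace format are invented for illustration and do not model any real core.

```python
def run_smt(traces: list[str]) -> int:
    """Cycles needed to finish all traces when they share one issue slot.

    'X' = the thread wants to issue this step, '.' = the thread is stalled.
    Stalls of different threads overlap in time; only one 'X' issues per cycle.
    """
    n = len(traces)
    pos = [0] * n
    cycles = 0
    while any(pos[i] < len(traces[i]) for i in range(n)):
        issued = False
        for offset in range(n):
            i = (cycles + offset) % n  # rotate priority each cycle
            if pos[i] >= len(traces[i]):
                continue  # this thread has finished
            if traces[i][pos[i]] == ".":
                pos[i] += 1  # stall cycle elapses on its own
            elif not issued:
                pos[i] += 1  # this thread takes the issue slot
                issued = True
            # otherwise: wants to issue but the slot is taken, so it waits
        cycles += 1
    return cycles


if __name__ == "__main__":
    stall_heavy = ["X...X...", "X...X..."]
    compute_bound = ["XXXX", "XXXX"]
    for traces in (stall_heavy, compute_bound):
        back_to_back = sum(len(t) for t in traces)
        print(traces, "one after another:", back_to_back, "SMT:", run_smt(traces))
```

With stall-heavy traces the two threads overlap almost entirely, while two compute-bound traces gain nothing from sharing the slot, so the "free" performance depends on how often the threads actually stall.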
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,254
794
136
Yes, and decode window is all that matters :rolleyes: There will inevitably be some long latency instruction chains that your OoO machinery can't completely hide, at which point you have underutilized execution resources. That's where SMT comes in, and gives you some "free" performance.
That comes with the "benefit" of having slower threads overall. The free "performance" is only visible to the hardware. To the operating system it is a degradation in performance. Compared to SMT, CMT gets free performance without the degradation, as there is no competition for, or partitioning of, critical general-purpose resources. The FPU always gets more performance from the broader task-level parallelism optimizations. Thus operating systems/software prefer a shared FPU over a private FPU.

If an 8-core Zen on 14LPP had been the successor of a 16-core 20LPM design of the same caliber as Excavator, then Zen wouldn't be praised; it would be critically panned as a flop.

20LPM is a bubble node *a required node to be successful*. Fab 1 >50K wafer starts and Fab 8 >50K wafer starts. 20LPM to 14FDS/12FDX for the Fab 1 option, and 20LPM to 14XM/14LPP/12LP for the Fab 8 option. So, when the price of production of 20LPM became costly it would have moved to 14FDS/12FDX. Ryzen 8-core (top-end) => $499 for Q1 2017 and FX 16-core (top-end) => $598 for Q3 2015. By the time of Ryzen, the price of FX would have depreciated as production matured. Repeating the cost-effectiveness present of Stoney vs Raven2.
 

Maxima1

Diamond Member
Jan 15, 2013
3,310
648
126
well, it would still be accurate to call a slow core a core,


it looks to me like most people agree that 1 BD module is not equal to 2 cores, both technically and in terms of performance

I don't think that people got scammed or anything, 1 module performs mostly like 2 cores, it's just that the single thread performance per module was already low, and then you have added to that some penalty from shared resources...

companies have to be careful with this sort of thing and I do think AMD made a mistake,
CMT is not equivalent to hyperthreading, so forcing it to be advertised as 4 cores / 8 threads biases perception towards Intel. You could introduce "module" as a new concept, I suppose, but people have had "cores" imprinted for many years as the thing to look for, so that would unfairly impose new advertising costs on AMD and sink its multithreading perception (8 to 4, 4 to 2...).

I don't see much difference between a company making small cores and one having piss-poor big cores on the market, as the outcome is similar: "real cores" are not equivalent to each other regardless. Some would undoubtedly wet themselves over multithreading potential if the price per core is lower and overlook the single-threaded performance (which people already do). The real problem is people not doing research. I have no idea why anyone would have bought an FX frankly.
 

Insert_Nickname

Diamond Member
May 6, 2012
4,062
638
126
No matter how you look at the Bulldozer module, one thing is certain, a single integer cluster is much simpler in design than wide. And here, among other things, I would see the reason for AMD's transition to this exotic microarchitecture.
I seem to remember the reasoning being to get as much integer performance as possible for a given area, with floating point being very much a secondary concern, since AMD was betting on offloading that to the GPU.

Essentially BD was designed for a future that never happened.

I have no idea why anyone would have bought an FX frankly.
If you could tolerate the power usage and cooling requirements, they made competent video encoding/transcoding boxes on a budget. Especially the E variety.
 

Phynaz

Lifer
Mar 13, 2006
10,143
817
126
Deadline to file is Jan 2nd.


Huge settlement.


Merged with existing thread.

I'm sure your only motivation for posting this reminder was out of concern for Bulldozer owners, but this deadline reminder can be posted here instead of creating a new thread.


AT Mod Usandthem
 

Shmee

Memory and Storage, Graphics Cards
Super Moderator
Sep 13, 2008
5,059
849
126
Wow that is a decent amount of money for not even having to sell the chip!
 

Kenmitch

Diamond Member
Oct 10, 1999
8,453
2,179
136
Wow that is a decent amount of money for not even having to sell the chip!
Did you mean buy the chip?

No proof of purchase = Internet scumbags will file multiple claims, punishing those who really bought them with lower payout(s).
 

Shmee

Memory and Storage, Graphics Cards
Super Moderator
Sep 13, 2008
5,059
849
126
No, what I mean is that if you have a chip, you get $300 just for filing and you don't even have to give up the chip.
 

Charlie22911

Senior member
Mar 19, 2005
611
227
116
Up to $300? Wow, that’s steep.
I’m not going to file though, I knew what I was purchasing and it did just fine at the tasks I purchased it for; I’ve said before it was the most stable system I ever built. I just don’t think my conscience would be totally clear with that in mind.
 

Shmee

Memory and Storage, Graphics Cards
Super Moderator
Sep 13, 2008
5,059
849
126
Yikes, I guess they don't tell you how much you get then till you actually get it. I guess they really mean, up to $300, but possibly much less.
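
For anyone wondering how "up to $300, but possibly much less" could work out, here is a minimal Python sketch. It assumes a fixed fund split evenly across all claimed chips with a per-chip cap, which is only a guess at the mechanism, not the actual settlement terms. The fund size and claim counts are made-up placeholders, and the function name is hypothetical.

```python
def payout_per_chip(fund: float, chips_claimed: int, cap: float = 300.0) -> float:
    """Even split of a fixed fund across claimed chips, limited by a per-chip cap.

    Assumption: pro-rata split with a cap. The real settlement terms may differ.
    """
    if chips_claimed <= 0:
        raise ValueError("chips_claimed must be positive")
    return min(cap, fund / chips_claimed)


if __name__ == "__main__":
    hypothetical_fund = 1_000_000.0  # placeholder, NOT the real settlement amount
    for claims in (2_000, 20_000, 40_000):
        share = payout_per_chip(hypothetical_fund, claims)
        print(f"{claims:6d} chips claimed: ${share:.2f} per chip")
```

Under that assumption every extra claim, legitimate or not, lowers the per-chip amount once the cap stops applying, which is the concern raised above about claims without proof of purchase.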
 
