YABulldozerT: AMD FX Processor Prices Lower Than Expected


NostaSeronx

Diamond Member
Sep 18, 2011
3,809
1,289
136
Instruction Fetch Window => 2 x 32B, aka 2 x 256-bit
4-Way Decoder => Fastpath Single => 4 x 16B(4 Macro-ops), Fastpath Double => 4 x 32B(8 Macro-ops)
Integer Cluster Scheduler => 2 x 40-Entry, 40 Macro-ops(Core A) 40 Macro-ops(Core B) (2 x 5KB L0i)
Floating Point Coprocessor Scheduler => 2 x 30-Entry, 30 Macro-ops(Core A) 30 Macro-ops(Core B) (1 x 7.5KB L0i split into two so 3.75KB L0is)
Fetch/Dispatch/Retire per core => 4 Macro-ops, 2 from the Integer Cluster and 2 from the Floating Point Coprocessor

Any points you want to fix Dresdenboy?

(If we add them up using the same math we can come up with the L0i$ differences)
(L0i$ is 17.5KB for AMD Bulldozer
and
L0i$ is 6.75KB for Intel Sandy Bridge)
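
For what it's worth, here is a minimal sketch of how those two totals appear to be derived. The 128-bytes-per-scheduler-entry figure and the 54-entry Sandy Bridge scheduler are my assumptions, chosen only because they reproduce the numbers quoted above; neither comes from AMD or Intel documentation.

```python
# Reverse-engineering the "L0i$" figures quoted above.
# Assumption (mine): each scheduler entry is counted as 128 bytes,
# since that is the per-entry size that reproduces both quoted totals.
ENTRY_BYTES = 128

# Bulldozer module: 2 x 40-entry integer schedulers + 2 x 30-entry FP halves
bd_entries = 2 * 40 + 2 * 30
bd_l0i_kb = bd_entries * ENTRY_BYTES / 1024      # -> 17.5

# Sandy Bridge: 54-entry unified scheduler (assumed here)
sb_entries = 54
sb_l0i_kb = sb_entries * ENTRY_BYTES / 1024      # -> 6.75

print(bd_l0i_kb, sb_l0i_kb)                      # 17.5 6.75
```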
 

inf64

Diamond Member
Mar 11, 2011
3,884
4,691
136
I want you to answer me this,

A single core is 100% = 100pts

1: SMT will give you 25%, how many pts is that ???
It's 125pts for a single core running 2 threads via SMT. Easy. A 2-core design that runs 4 threads via SMT is thus double that, so 250pts. Look below ;).

If I do the math your way, that gives me 50pts (200 * 0.25 = 50)
No, you are confusing things. 200pts is a dual-core CMP design with none of the MT techniques. SMT gives you 25% over that, so it is 200 x 1.25 = 250.

AMD worded their slide not as 80% over a single thread but as 80% of the CMP design. Notice how this changes things. 80% of something is "baseline x 0.8", while 25% over something (the benefit of SMT) is "baseline x 1.25".

Does that mean I have negative performance?? Remember, 100pts is a single core and I can't have lower than 100pts with two cores ;)
No, it doesn't mean you have negative performance. Each "thread" is weaker though, since you have 250pts via 4 threads vs 200pts via 2 threads and no SMT. Each of the 4 threads in the SMT design is worth 62.5pts (250/4) as opposed to 100pts in a CMP design with no SMT (200/2).

In the case of AMD's "8C" chip you will have 4 of these modules, so 4 x 160 = 640pts. A hypothetical CMP Bulldozer 8C chip would be 800pts with perfect scaling. 640pts is exactly 80% of 800pts: 800pts : 100% = 640pts : x => x = 640 x 100 / 800 = 80%.
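
To make the bookkeeping explicit, here is a small sketch of the point model above (single core = 100pts, SMT = +25% over the baseline it is applied to, CMT module = 80% of a 2-core CMP). It is just the arithmetic from this post, nothing measured:

```python
SINGLE_CORE = 100                       # 1 core, 1 thread = 100pts

# SMT: +25% over whatever baseline it is applied to
smt_1c = SINGLE_CORE * 1.25             # 125pts: 1 core, 2 threads
smt_2c = 2 * SINGLE_CORE * 1.25         # 250pts: 2-core CMP + SMT, 4 threads

# CMT: one Bulldozer module (2 "cores") = 80% of a 2-core CMP design
cmt_module = 2 * SINGLE_CORE * 0.80     # 160pts

# "8C" Zambezi = 4 modules vs. a hypothetical 8-core CMP chip
zambezi_8c = 4 * cmt_module             # 640pts
cmp_8c = 8 * SINGLE_CORE                # 800pts
print(zambezi_8c / cmp_8c)              # 0.8 -> exactly 80% of the CMP design
```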

Speculation for MT workloads (assume a base clock of ~3.33-3.4GHz for the Intel CPUs):
In the case of Nehalem you have a 4C chip, so 4 x 125pts = 500pts in an MT workload if we assume 25% as the norm for SMT speedup (which it is not; it varies wildly and is not consistent). Note that each point is not the same for Zambezi and Nehalem, since the designs are radically different and don't have exactly the same IPC. To convert Zambezi's points we need a relationship between Zambezi's and Nehalem's IPC (very difficult to model for a million reasons, but for the sake of argument let's try). If we assume that Zambezi's single core has 85% of Nehalem's IPC on average and runs at exactly the same clock:
160 x 0.85 x 4 = 544pts. Nehalem sits at 500pts in MT apps. What about SB? SB is around 11.4% faster than Nehalem. Zambezi has a 3.6GHz base clock, SB 3.4GHz. The Turbo effect will be approximately the same for both in MT apps, so it will cancel out any advantage each has over the other (so disregard Turbo). The base clock ratio between Zambezi and SB is 3.6/3.4 ≈ 1.06.
Nehalem: 500pts
SB: 500 x 1.114 = 557pts
Zambezi: 544 x 3.6/3.4 = 576pts
X6: 434pts (500pts x 165.8/191.4)

The 8C Zambezi @ 3.6GHz ends up just slightly faster than the SB 2600K in MT apps. It's also 15% and 32% faster than the top Nehalem and X6. Each Zambezi "thread" in an MT workload is slightly slower than each X6 thread: 576/8 = 72 vs 434/6 = 72.3. Zambezi is also running at a somewhat higher clock (3.6GHz vs 3.3GHz).
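
Putting the whole speculation in one place, here is the same arithmetic as a sketch. All inputs are the assumptions stated above (25% SMT uplift, 85% relative IPC, the listed clocks, and the 165.8/191.4 ratio quoted in the post, whose origin is not stated); none of this is measured data:

```python
# Speculative MT scoring from the post above; every factor is an assumption.
core = 100

nehalem = 4 * core * 1.25                              # 4C/8T, 25% SMT uplift -> 500
sandy_bridge = nehalem * 1.114                         # ~11.4% faster than Nehalem -> ~557
zambezi = 4 * (2 * core * 0.80) * 0.85 * (3.6 / 3.4)   # 4 modules, 85% IPC, clock ratio -> ~576
x6 = nehalem * 165.8 / 191.4                           # Phenom II X6, ratio from the post -> ~433

for name, pts in [("Nehalem", nehalem), ("Sandy Bridge", sandy_bridge),
                  ("Zambezi 8C", zambezi), ("Phenom II X6", x6)]:
    print(f"{name:13s} {pts:6.0f} pts")
```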
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
No, you are confusing things. 200pts is a dual-core CMP design with none of the MT techniques. SMT gives you 25% over that, so it is 200 x 1.25 = 250.

A problem with such comparisons is that people compare real-world measured speedups (x1.25 with Hyper-Threading) against theoretical speedups (x2.0 with 2-way CMP).
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,809
1,289
136
inf64

CMP = 200% x 1 = 200% for two cores
CMT = 200% x 0.8 = 160% for two cores in Cluster Multithreading
SMT = 200% x 0.59 = 118% for one core in Hyperthreading executing two threads

160% + 160% + 160% + 160% => 6.4
vs
118% + 118% + 118% + 118% => 4.72
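
As a quick sketch, the same figures expressed as whole-chip multipliers over one un-shared core (just the arithmetic from this post, using the 0.8 and 0.59 per-core factors above):

```python
# Whole-chip scaling relative to one core (= 1.0), per the factors above.
cmt_module = 2 * 0.80        # 1.60: one Bulldozer module (2 clustered cores)
smt_core   = 2 * 0.59        # 1.18: one Hyper-Threaded core running 2 threads

print(4 * cmt_module)        # 6.4  -> 4-module "8C" CMT chip
print(4 * smt_core)          # 4.72 -> 4-core, 8-thread SMT chip
```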
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
inf64

CMP = 200% x 1 = 200% for two cores
CMT = 200% x 0.8 = 160% for two cores in Cluster Multithreading
SMT = 200% x 0.59 = 118% for one core in Hyperthreading executing two threads

160% + 160% + 160% + 160% => 6.4
vs
118% + 118% + 118% + 118% => 4.72

These numbers make sense to me now.

And the downside of HT scaling is that it can actually slow down the performance with some apps.

But "4.72" MP is roughly what I get in Cinebench for my 2600K w/HT enabled.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136

You said, single core = 100pts

Dual core(CMP) = 200pts

Now, I said that SMT will give you 25% more performance over a single core, and I asked how much that 25% is as a percentage of the CMP.

That means: how much is that 25% out of 200pts? It is the same thing AMD said for CMT, that it has 80% of the CMP, and you said 200 * 0.8 = 160pts.

If we do the math 200*0.25 = 50pts

What is that 50pts ?? That is 25% out of 200pts

How much is that over a single core (100pts)??? 1.25x or 125pts



Now let's do 80% again:

200 * 0.8 = 160pts. That is 80% out of 200pts.

How much is that over a single core??? 1.8x or 180pts

SMT (single core + HT) will be 125pts (25% over a single core), 1.25x
CMT will be 180pts (80% over a single core), 1.80x
CMP will be 200pts (100% over a single core), 2.0x
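
To make the disagreement explicit: the argument is over whether "80%" is read as a fraction *of* the dual-core CMP score or as a gain *over* a single core. A minimal sketch of both readings (just restating the two positions in the thread, not endorsing either):

```python
single = 100
cmp_dual = 2 * single                     # 200pts
smt_single = single * 1.25                # 125pts -- both sides agree on this

# Reading 1 (inf64 / the AMD slide): CMT module = 80% OF the dual-core CMP score
cmt_of_cmp = cmp_dual * 0.80              # 160pts

# Reading 2 (this post): CMT = an 80% gain OVER a single core
cmt_over_single = single * (1 + 0.80)     # 180pts

print(cmt_of_cmp, cmt_over_single)        # 160 vs 180 -- the disputed gap
```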
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
inf64

CMP = 200% x 1 = 200% for two cores
CMT = 200% x 0.8 = 160% for two cores in Cluster Multithreading
SMT = 200% x 0.59 = 118% for one core in Hyperthreading executing two threads

Since you take SMT as 18%, why do you multiply by 0.59 and not 0.18???

SMT with 18% would be 200% x 0.18 = 36% :confused:

Confused??? I know.

36% out of 200% = 18%; 18 out of 100 ;)
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,809
1,289
136
Since you take SMT as 18%, why do you multiply by 0.59 and not 0.18???

SMT with 18% would be 200% x 0.18 = 36% :confused:

Confused??? I know.

36% out of 200% = 18%; 18 out of 100 ;)

118% is the average advantage of having two threads executing on one of Intel's SMT cores over just having 1 thread, which is 100%.

200% is the average advantage of having dedicated components for two independent cores over having one dedicated independent core, which is 100%.

160% is the average advantage of having shared components for two independent cores in AMD's compute units over just having 1 core, which is 100%.

If a single SMT core w/o Hyperthreading is 100%,
and w/ Hyperthreading you see an average 18% improvement with two threads executing at once,
then you just add them: 100% + 18% = 118%.
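
In other words, the 0.59 is just that 118% expressed as a fraction of the 200% two-core figure; a one-liner sketch:

```python
smt_core = 100 + 18          # 118%: one core + 18% Hyper-Threading gain
cmp_pair = 200               # 200%: two full cores
print(smt_core / cmp_pair)   # 0.59 -- the factor used in "200% x 0.59 = 118%"
```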

Deciphering 160%

10% from each core is lost from having a shared front-end
10% from each core is lost from having a shared floating point coprocessor

But these might be out-of-proportion numbers; until release these numbers can't be verified.

Sharing cache doesn't make the CPU lose any performance (it probably makes it gain more).

One thing I learned from previous AMD press other than Phenom I

They lie hardcore

Any improvements are usually halved

Any setbacks are usually doubled to quadrupled

Any averages are inconsistent
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Instruction Fetch Window => 2 x 32B, aka 2 x 256-bit
4-Way Decoder => Fastpath Single => 4 x 16B(4 Macro-ops), Fastpath Double => 4 x 32B(8 Macro-ops)
Integer Cluster Scheduler => 2 x 40-Entry, 40 Macro-ops(Core A) 40 Macro-ops(Core B) (2 x 5KB L0i)
Floating Point Coprocessor Scheduler => 2 x 30-Entry, 30 Macro-ops(Core A) 30 Macro-ops(Core B) (1 x 7.5KB L0i split into two so 3.75KB L0is)
Fetch/Dispatch/Retire per core => 4 Macro-ops, 2 from the Integer Cluster and 2 from the Floating Point Coprocessor

Any points you want to fix Dresdenboy?

(If we add them up using the same math we can come up with the L0i$ differences)
(L0i$ is 17.5KB for AMD Bulldozer
and
L0i$ is 6.75KB for Intel Sandy Bridge)

You should avoid these L0i$ metrics, since mops and µops packets are of different size (likely larger) than the x86 ops. You could simply count the single or simple decoded x86 ops as a common base. Making the calculation simpler overall ;) And for general code, the x86 instructions which are being decoded into simple (ex) or complex/fused (ex+mem) ops should roughly match between SB and BD.

Double decoded ops use 2 slots, the rate is 2 per cycle.

To me the FP scheduler doesn't look like it is split up between threads. If there are only FP ops coming from one thread, they should be able to fill up all 60 slots.

Since retirement is in order it likely retires instructions in the same order as they were found in the x86 code stream.
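
A rough sketch of what counting decoded x86 ops as the common base could look like, including the "double-decoded ops use 2 slots, at a rate of 2 per cycle" rule from above. The instruction mix and helper name are invented for illustration, and the cycle accounting is a deliberate simplification, not a real decoder model:

```python
# Hypothetical op-counting sketch: compare front-ends by decoded x86 ops
# rather than by speculative "L0i$" byte sizes. The mix below is made up.
def slots_and_cycles(instructions):
    """instructions: list of decode classes, 'single' or 'double'."""
    slots = 0
    cycles = 0.0
    for kind in instructions:
        if kind == "single":
            slots += 1          # Fastpath Single -> 1 macro-op slot
            cycles += 1 / 4     # up to 4 simple decodes per cycle (assumed 4-way)
        else:
            slots += 2          # Fastpath Double -> 2 macro-op slots
            cycles += 1 / 2     # rate of 2 double-decoded ops per cycle
    return slots, cycles

mix = ["single"] * 12 + ["double"] * 4      # invented example mix
print(slots_and_cycles(mix))                # (20, 5.0) -> 20 slots over ~5 cycles
```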
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,809
1,289
136
You should avoid these L0i$ metrics, since mops and µops packets are of different size (likely larger) than the x86 ops. You could simply count the single or simple decoded x86 ops as a common base. Making the calculation simpler overall ;) And for general code, the x86 instructions which are being decoded into simple (ex) or complex/fused (ex+mem) ops should roughly match between SB and BD.

My understanding

AMD64 instructions -> Complex (64-bit) or Simple (32-bit) Macro-ops -> Complex or Simple Micro-ops

Macro-op = 1 integer op or 1 FP op + 1 load, 1 store, or 1 load/store
So Fastpath Single can make 1 macro-op and Fastpath Double makes 2 macro-ops

The schedulers then queue up and decode, then feed the execution units with micro-ops, which can also be either complex 64-bit or simple 32-bit

Micro-op = 1 integer op or 1 FP op or 1 load or 1 store or 1 load/store

AMD64 instructions aren't fixed-length, but macro-ops and micro-ops are

and after reading about how schedulers work, the entries are for macro-ops, full length and half length

and a decode fits 128-bit in Fastpath Single, and double that with Fastpath Double
(The fusion of ops doesn't occur here but at the micro-op level. K8 has had macro-op fusion, and this is the first time we will have micro-op fusion, aka macro-op -> macro-op)
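
A tiny data-model sketch of the hierarchy described above (AMD64 instruction -> macro-op(s) -> micro-ops). It only encodes this post's own description, which is itself speculation, so the names and fields are illustrative rather than anything from AMD documentation:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MicroOp:
    kind: str                     # "int", "fp", "load", "store" or "load/store"

@dataclass
class MacroOp:
    # Per the description above: 1 integer or FP op, optionally fused
    # with a load, a store, or a load/store.
    compute: str                  # "int" or "fp"
    mem: Optional[str] = None     # None, "load", "store", "load/store"

    def crack(self) -> List[MicroOp]:
        """Split into micro-ops for the execution units."""
        uops = [MicroOp(self.compute)]
        if self.mem:
            uops.append(MicroOp(self.mem))
        return uops

def decode(fastpath_double: bool, compute: str, mem: Optional[str]) -> List[MacroOp]:
    """Fastpath Single emits 1 macro-op, Fastpath Double emits 2 (per the post)."""
    count = 2 if fastpath_double else 1
    return [MacroOp(compute, mem) for _ in range(count)]

# Example: a Fastpath Single "add reg, [mem]"-style instruction
mops = decode(False, "int", "load")
print([u.kind for m in mops for u in m.crack()])   # ['int', 'load']
```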

Double decoded ops use 2 slots, the rate is 2 per cycle.

Which can either take two cycles to fill those two slots or do it from one AMD64 instruction fetch

To me the FP scheduler doesn't look like it is split up between threads. If there are only FP ops coming from one thread, they should be able to fill up all 60 slots.

When both cores are used it's halved, while in dedicated-core mode it has full access, I think

Since retirement is in order it likely retires instructions in the same order as they were found in the x86 code stream.

I think it is for coherency

ahh thx, all the slides ;)

Seems like these slides say it's a perfect upgrade for first-gen i7 users and Phenom II users
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Do x264, POV-Ray, and 7-Zip use FP or integer??

Does wPrime use integer or FP??
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,809
1,289
136
Do x264, POV-Ray, and 7-Zip use FP or integer??

Does wPrime use integer or FP??

x264: integer
7-Zip: integer

I think
POV-Ray might be FP
wPrime I am leaning towards integer

and of course these applications aren't using XOP (an integer performance increase) or FMA (an integer (shared XOP instruction) / floating-point performance increase)
 

Crap Daddy

Senior member
May 6, 2011
610
0
0
Yes, and they even brag about it! Watch all the slides: in gaming they compare the FX to the 980X, and in multithreaded tests to the 2500/2600. It's a joke. Look at the Cinebench test. It's the same score as the X6 1100; the 2600K scores 6.89 at stock speed.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
I'm sorry, but something is wrong in that slide if it shows the FX-8150.

It seems from that graph that the FX-8150 is between the 2500K and 2600K in multithreaded apps like x264, POV-Ray and 7-Zip.

From the AnandTech Sandy Bridge review:

http://www.anandtech.com/show/4083/...-core-i7-2600k-i5-2500k-core-i3-2100-tested/1

Phenom II X6 1100T is between 2500K and 2600K [chart: 35043.png]

Phenom II X6 1100T is between 2500K and 2600K [chart: 35032.png]

Phenom II X6 1100T is between 2500K and 2600K [chart: 35044.png]


So, can anyone tell me what more BD brings to the table than Phenom II in multithreading, if the slide shows the performance of the FX-8150??

:confused::confused::confused::confused:
 

classy

Lifer
Oct 12, 1999
15,219
1
81
If that is the performance, that would confirm what MM and Kyle both said. Outside of the Cinebench score, and I don't know about the Deus score, that processor is pretty good. They used 19 x 10, which is a commonly used res now. I would also assume, because it is a new arch, that some software may even need a patch to recognize it. We'll see, but those scores look pretty good.
 

996GT2

Diamond Member
Jun 23, 2005
5,212
0
76
I'm sorry, but something is wrong in that slide if it shows the FX-8150.

It seems from that graph that the FX-8150 is between the 2500K and 2600K in multithreaded apps like x264, POV-Ray and 7-Zip.

From the AnandTech Sandy Bridge review:

http://www.anandtech.com/show/4083/...-core-i7-2600k-i5-2500k-core-i3-2100-tested/1

Phenom II X6 1100T is between 2500K and 2600K [chart: 35043.png]

Phenom II X6 1100T is between 2500K and 2600K [chart: 35032.png]

Phenom II X6 1100T is between 2500K and 2600K [chart: 35044.png]

So, can anyone tell me what more BD brings to the table than Phenom II in multithreading, if the slide shows the performance of the FX-8150??

:confused::confused::confused::confused:

BD is also priced in between the 2500K and 2600K, so it makes perfect sense.
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
Yes, and they even brag about it! Watch all the slides: in gaming they compare the FX to the 980X, and in multithreaded tests to the 2500/2600. It's a joke. Look at the Cinebench test. It's the same score as the X6 1100; the 2600K scores 6.89 at stock speed.

Look again.
Civ 5, H.A.W.X, and F1 2010
are all games that scale with more cores.

Strangely enough,
there are no changes from Gulftown to Sandy Bridge in Lost Planet
....but Bulldozer beat them Oo