YABulldozerT: AMD FX Processor Prices Lower Than Expected


Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
You may be mistaken here. Hasn't AMD said the FP unit can be used by only one thread at a time? In other words, the FP unit can't be 'shared' by the INT units.
The integer cores have no direct control over what the FPU executes. The main point where the FPU handles only one thread during a cycle is when instructions are being dispatched to it (together with those instructions going to the int core belonging to this thread). But after they've been received, the FPU is handling instructions of both threads simultaneously, which means it can execute ops of both threads simultaneously. Only retiring the FP ops seems to be one thread per cycle again. Even AMD calls the FPU's execution mode "SMT":
[Attached image: post-10072-1282727205.jpg]


Check his posting history, and after that use the neat forum option that has a list :).
I might try it out. But otherwise I might miss some entertaining postings ;)

So the final question I have is this "80%" versus "180%" number. AMD slides clearly only state "80%"...and 80% x 2 = 1.6

So is the performance scaling in going from 1 thread in one module to having two threads in one module going to merely be 1x -> 1.6x for applications which would have otherwise scaled perfectly on a CMP architecture 1x -> 2x?
According to AMD, this is with reduced area and power, leaving room for interpretation. So the CMP arch might consume more power to achieve that 2x mark.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
When we have two threads in CMT, the two threads use the same single shared Front End (but they don't share the Integer execution units) and because of the single Shared Front End, the IPC will degrade per core.

Don't forget the shared L2$ and the two tiny *write-through* L1D$s; I'm pretty sure they will hurt multithreaded code at least as much as the shared front end.


Floating Point

Because the FP unit is shared between the two cores, either one core can use the entire FP unit or the two cores can split it (half each). With a single thread, that thread can take advantage of the entire FP unit and will have higher IPC (per core); with two threads, they split the FP execution units, so each core will have lower IPC.

Here again the L1D$/L2$ bandwidth will be a strong limiter, besides the shared front end and FPU (FP code is typically more bandwidth hungry than int code, far more so with AVX-256), and there will probably be a very large IPC discrepancy between 1 and 2 active (FP-intensive) threads per module.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
So what? Didn't your wife do the same before C2D launch? Yes, C2D launch was spectacular and in the end you were right. Why can't you give them a chance.. and do the talking after BD launches?

My wife wasn't employed by Intel as head of server marketing. Do you see the difference? Sure, she/we had inside info, plus the fact that I was already using Dothan at 3GHz+.

Nothing has changed for us. Bob has 3 different yet-to-be-released CPUs in the shop right now for debugging.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
It's more like 1.8-1.9x over single core in a dual core CMP chip with no SMT. Now that we have acknowledged that, we can go back to that slide from AMD. They did say 80% of CMP design and this

[1.8,1.9] * 80% = [1.44,1.52]

It's quite common to get [1.2,1.3] speedups with 2 threads per core on Sandy Bridge, so the difference between SMT and CMT isn't as big as some PR people want us to think.
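The interval arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the numbers quoted in this post; the helper name `scale_interval` is made up for illustration, and the SMT range is the rough figure mentioned above, not a measurement.

```python
def scale_interval(low, high, factor):
    """Scale both ends of a speedup range by a constant factor."""
    return (round(low * factor, 2), round(high * factor, 2))

# Assumed inputs, taken from the posts above (not measured data):
cmp_speedup = (1.8, 1.9)   # dual-core CMP over single core
cmt_fraction = 0.80        # AMD's "80% of CMP" figure
smt_speedup = (1.2, 1.3)   # rough 2-threads-per-core range quoted for Sandy Bridge

cmt_speedup = scale_interval(*cmp_speedup, cmt_fraction)
print("CMT estimate:", cmt_speedup)
print("SMT estimate:", smt_speedup)
```

Under this reading, the CMT estimate sits above the quoted SMT range, but by a fairly small margin.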
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Well first of all, no application out there scales perfectly with more cores. It's more like 1.8-1.9x over single core in a dual core CMP chip with no SMT. Now that we have acknowledged that, we can go back to that slide from AMD. They did say 80% of CMP design, and this means 1.6x uplift versus ~1.9x for CMP.

Now in AMD's case, the FP workloads are probably the ones that pull the average figure down the most. Without FP in the mix (shared FlexFP), you would probably have something like conventional CMP scaling. But due to the shared nature of the big floating point unit, the average is down to 80%.

edit: I see AtenRa posted similar post :).

I don't know why you'd say that. Embarrassingly parallel coarse-grained apps exist in real life; they are not hypothetical gedanken experiments.

Getting 2x thread scaling out of 2 threads is not impossible, it merely represents the extreme case boundary condition.

That said, I do see the point: given that the "80%" statement refers to an "average", the number would be higher when the application is more embarrassingly parallel and lower when it is not. So we should compare the "80%" number to a CMP architecture that, on average, is not going to reach a 2x speedup either.

So the "1.9x" for CMP is reasonable; unfortunately we don't know the composition of the test suite.

0.8 x 1.9 = 1.52x

That starts to look even less favorable for the CMT model compared with hyperthreading :confused:

According to AMD, this is with reduced area and power, leaving room for interpretation. So the CMP arch might consume more power to achieve that 2x mark.

Sure, this is a given, otherwise why do CMT instead of CMP in the first place?
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
And when exactly did you see me doing that? When I mentioned, that BD's IPC would be constant compared to older archs?
http://forums.anandtech.com/showthread.php?p=32263642&highlight=#post32263642

I think that whatever I write will be wrong in your eyes ;)

Actually, nothing is further from the truth. I enjoy your research; you do it well. But we interpret what you find differently. These forums are for debate and information, as well as for helping others with problems they have. I don't think these forums were ever intended to be a marketing tool. I got banned many times here butting heads with a now-proven marketer. The sad part is that 3 people were using my login. I was away a couple of weeks back, and when I got back I found Bob had used my account again. I got a little angry over it, as he has his own account, so he agreed to stay off mine.
 

podspi

Golden Member
Jan 11, 2011
1,965
71
91
Didn't JFAMD say that each thread would operate at 90% of what it would if each module were a core? Hence the 1.8 instead of 1.6?

If the module design really ends up adding only ~50% of another core's throughput, clearly it wasn't implemented correctly, or it is just a bad idea in the first place.
 

exar333

Diamond Member
Feb 7, 2004
8,518
8
91
Just to add,

CMT is not about higher IPC (Performance) but smaller die size and lower power usage. They compromise and have a little IPC penalty when we have two threads in CMT vs CMP but at a smaller die size (shared Front end etc) and lower power usage.

Those were the goals of the Bulldozer architecture, smaller die size and lower power usage for CMP characteristics.

Power usage doesn't appear to be very good based on available SKUs. Time will tell if the power usage is acceptable at high clocks or not from 4-6-8 cores.
 

inf64

Diamond Member
Mar 11, 2011
3,698
4,018
136
Didn't JFAMD say that each thread would operate at 90% of what it would if each module were a core? Hence the 1.8 instead of 1.6?

If the module design really ends up adding only ~50% of another core's throughput, clearly it wasn't implemented correctly, or it is just a bad idea in the first place.
According to AMD, the second integer core adds ~12% to the module die area for an average of 1.6x performance (int and FP). This is compared to a CMP-type design that has a Bulldozer core as its base.
SMT in Intel CPUs adds 5% of die area for around 25% performance uplift on average.
So we have for perf./mm^2:
- in AMD's case: 160/112 = 1.43
- in Intel's case: 125/105 = 1.19

Higher numbers are better, of course, since you want as small a die as possible and as high a perf. uplift as possible. The two numbers above are not directly comparable, of course, but they are there to show that AMD gets a higher perf./mm^2 ratio on its own core via the CMT approach than Intel does on its own core via SMT.
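For anyone who wants to plug in their own figures, here is a minimal sketch of the ratio used above. The function name `perf_per_area` is just an illustrative helper, and the inputs are the percentages quoted in this post, not independent measurements.

```python
def perf_per_area(perf_pct, area_pct):
    """Relative performance divided by relative die area (both vs. a 100% baseline)."""
    return perf_pct / area_pct

# Figures quoted in the post (assumed, not measured here):
cmt_ratio = perf_per_area(160, 112)  # second integer core: +12% area, 1.6x perf
smt_ratio = perf_per_area(125, 105)  # SMT: +5% area, ~25% uplift

print(f"CMT: {cmt_ratio:.2f}")
print(f"SMT: {smt_ratio:.2f}")
```

Each ratio is relative to a different baseline core, so the comparison only says how well each vendor's own multithreading approach pays for its area.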
 

Zstream

Diamond Member
Oct 24, 2005
3,396
277
136
According to AMD, the second integer core adds ~12% to the module die area for an average of 1.6x performance (int and FP). This is compared to a CMP-type design that has a Bulldozer core as its base.
SMT in Intel CPUs adds 5% of die area for around 25% performance uplift on average.
So we have for perf./mm^2:
- in AMD's case: 160/112 = 1.43
- in Intel's case: 125/105 = 1.19

Higher numbers are better, of course, since you want as small a die as possible and as high a perf. uplift as possible. The two numbers above are not directly comparable, of course, but they are there to show that AMD gets a higher perf./mm^2 ratio on its own core via the CMT approach than Intel does on its own core via SMT.

Uh no. HT does not add 25% on average.
 

AtenRa

Lifer
Feb 2, 2009
14,001
3,357
136
A single core will have 100%

SMT gives at most 25% more performance, so it is 100+25 = 125 or 1.25x

CMP has 100% more theoretical performance over a single core, so it is 100+100 = 200 or 2.0x (double the performance of the single core)


80% performance of the CMP means 80 out of 100 (CMP = 100%), or 180 over 200, or 1.8x, not 1.6x

The part that people overlook here is that the baseline for what counts as 100% has changed. In the beginning we had the single core as 100%, and then the CMP is our new 100% baseline. ;)
 

inf64

Diamond Member
Mar 11, 2011
3,698
4,018
136
80% performance of the CMP means 80 out of 100 (CMP = 100%), or 180 over 200, or 1.8x, not 1.6x

The part that people overlook here is that the baseline for what counts as 100% has changed. In the beginning we had the single core as 100%, and then the CMP is our new 100% baseline. ;)
100 pts: 1 core in a CMP Bulldozer design
200 pts: 2 cores in a CMP Bulldozer design with assumed perfect thread scaling
Now, 80% of 200pts is how much? That's right, 80% of 200pts is 200x0.8=160, not 180. Also 80% of 100pts is 80pts, very simple.
 

drizek

Golden Member
Jul 7, 2005
1,410
0
71
Power usage doesn't appear to be very good based on available SKUs. Time will tell if the power usage is acceptable at high clocks or not from 4-6-8 cores.

Doesn't matter for desktop users. The only thing that matters is idle power consumption, and at least theoretically, that should be quite a bit better than Phenom, both from what we know of Llano and what we know of Bulldozer's architecture.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
As some already know, someone (very attention hungry and provocative) posted some AMD slides, which seem to be real and show a Cinebench 11.5 score of 5.95 on a 3.6GHz FX, likely no turbo due to full FPU load, and of 7.8 at 4.8GHz OC, citing 30% (actually 31%) more performance for a +33% overclock. The CB screenshot used there showing a Xeon system can be found here: http://www.maxon.net/de/downloads/cinebench.html
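The "30% (actually 31%)" correction is easy to check. A quick sketch using only the scores and clocks quoted above (the helper `percent_gain` is a name made up for this example):

```python
def percent_gain(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0

# Numbers as quoted from the leaked slides (unverified):
score_gain = percent_gain(7.8, 5.95)  # Cinebench 11.5 score, 4.8GHz OC vs 3.6GHz
clock_gain = percent_gain(4.8, 3.6)   # clock increase

# Fraction of the clock increase that shows up in the score.
scaling_efficiency = score_gain / clock_gain

print(f"score +{score_gain:.1f}%, clock +{clock_gain:.1f}%")
print(f"scaling efficiency: {scaling_efficiency:.2f}")
```

So, if the slide numbers hold, a bit over nine tenths of the clock increase turns into Cinebench score.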

So as some already pointed out, this score is barely higher than that of a Ph II X6. It seems CB was used because it scales nicely with frequency (which is mentioned as a selling point). The score itself is nothing to brag about. But why is it not higher? This might have some simple reasons:
  • CB's code is likely heavily optimized for Intel CPUs (and maybe with a code path for AMD fam. 10h) since render times are critical
  • FP throughput is high, likely preventing any turbo mode
  • CB doesn't use FMA, so only half the max. throughput on FX
  • on those 4 FPUs on an FX, CB could achieve a max. throughput of 4 FPUs x 2 FMAC units x 2 DP ops per FMAC x 3.6 GHz = 57.6 GFLOPS
  • on X6 the max. throughput is 79.2 GFLOPS (6*2*2*3.3) (but CB's code doesn't contain FP ops only)

So judging from FP throughput alone, CB on FX should reach only about 3/4 of the X6 score. But the SMT-like execution of FP code and some other architectural changes might have helped it not to fall behind.
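The peak-throughput estimate in the list above can be written out as a small sketch. The function `peak_gflops` and its parameter names are illustrative only; the unit counts and clocks are the ones assumed in this post (no FMA, 2 DP ops per 128-bit pipe).

```python
def peak_gflops(fp_units, pipes_per_unit, dp_ops_per_pipe, clock_ghz):
    """Theoretical peak double-precision GFLOPS without FMA."""
    return fp_units * pipes_per_unit * dp_ops_per_pipe * clock_ghz

# Assumptions taken from the post:
fx_peak = peak_gflops(4, 2, 2, 3.6)  # FX: 4 shared FPUs, 2 FMAC pipes each, 3.6 GHz
x6_peak = peak_gflops(6, 2, 2, 3.3)  # Phenom II X6: 6 cores, 2 FP pipes each, 3.3 GHz

ratio = fx_peak / x6_peak

print(f"FX: {fx_peak:.1f} GFLOPS, X6: {x6_peak:.1f} GFLOPS, ratio: {ratio:.2f}")
```

The ratio comes out a little under 3/4, which is where the "3/4 of the X6 score" estimate comes from.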
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Cinebench also doesn't do AVX!

It is mainly SSE2

So the two selling points of Bulldozer aren't used in Cinema4D, so why use it?
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
I can accept that the strengths of Bulldozer will not be brought to bear in real-world apps until those apps are compiled to be Bulldozer-aware; this is true of Intel's chips as well, every time they increment the SSE revision and add more instructions to the ISA.

The question is: when can we expect real-world apps that are Bulldozer-aware to become available?
 

Googer

Lifer
Nov 11, 2004
12,571
4
81
AMD FX-8120 for $205 with overclocking is going to be a sweet CPU for content creation users and heavy mathematical computation work!

Now an 8 core CPU w/ 4.0ghz Turbo (!) priced below a 2500k pretty much seals the deal in my eyes that it will not be as competitive in 1-4 threaded apps (esp. not in overclocked vs. overclocked case), but pull away in 5-8 threaded tasks.

I have to say the marketing force is strong with this one: 8 cores for $200 and change. If this was an Apple product, it would sell for $400+. :biggrin:

I'm just as excited as you are, but I suggest not jumping the gun. We have yet to see a single benchmark for this processor. And with Ivy Bridge still looming, no one knows who the victor of this battle for CPU supremacy will be.
 

OCGuy

Lifer
Jul 12, 2000
27,227
36
91
Honestly I can't imagine sinking a single penny into BD, when SB-E and IB are right "around the corner."

Unless of course I was just looking to play around with new hardware, which is always fun.
 

AtenRa

Lifer
Feb 2, 2009
14,001
3,357
136
100 pts: 1 core in a CMP Bulldozer design
200 pts: 2 cores in a CMP Bulldozer design with assumed perfect thread scaling
Now, 80% of 200pts is how much? That's right, 80% of 200pts is 200x0.8=160, not 180. Also 80% of 100pts is 80pts, very simple.

I want you to answer me this:

A single core is 100% = 100pts

1: SMT will give you 25%; how many pts is that???


If I do the math your way, that gives me 50pts (200 * 0.25 = 50)

What does that mean??

Does it mean that I have negative performance?? Remember, 100pts is a single core, and I can't have lower than 100pts with two cores ;)

Single core = 100pts (that is your base)
Dual core = 200pts (that is your new 100%)

100pts is not 50% but 0%

125pts is 25%

180pts is 80%

;)
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
180pts is 80%

I think it is an unanswered question, "80% of what?"

Is it 80% of what a dual-core CMP design would have produced? (in which case 80% of 200pts is 160pts)

Or is it 80% of what a second thread on a CMP design would have produced on top of the performance of a single thread, in which case 100pts + 80% of 100pts = 180pts? (This is a really, really odd scaling metric, btw. I hope you can recognize how awkward it is to phrase scaling like this from a computer science standpoint, and how that lowers the likelihood that this is how AMD intended to frame the performance picture.)
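To make the two readings concrete, here is a small sketch. Both function names are made up for this example, and the 100pts single-core baseline and perfect 2x CMP scaling are the assumptions used in the posts above, not anything AMD has confirmed.

```python
SINGLE_CORE_PTS = 100.0  # assumed baseline: one thread on one core

def fraction_of_dual_core(fraction, cmp_scaling=2.0):
    """Reading 1: the module delivers `fraction` of a full dual-core CMP."""
    return SINGLE_CORE_PTS * cmp_scaling * fraction

def fraction_of_second_thread(fraction):
    """Reading 2: the second thread adds `fraction` of a full extra core."""
    return SINGLE_CORE_PTS + SINGLE_CORE_PTS * fraction

reading_1 = fraction_of_dual_core(0.80)      # 80% of 200pts
reading_2 = fraction_of_second_thread(0.80)  # 100pts + 80% of 100pts

print(f"80% of dual-core CMP:      {reading_1:.0f} pts")
print(f"80% of the second thread:  {reading_2:.0f} pts")
```

The first reading gives 160pts and the second gives 180pts, so the whole disagreement in this thread comes down to which baseline the "80%" refers to.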
 

PreferLinux

Senior member
Dec 29, 2010
420
0
0
I think it is an unanswered question, "80% of what?"

Is it 80% of what a dual-core CMP design would have produced? (in which case 80% of 200pts is 160pts)

Or is it 80% of what a second thread on a CMP design would have produced on top of the performance of a single thread, in which case 100pts + 80% of 100pts = 180pts? (This is a really, really odd scaling metric, btw. I hope you can recognize how awkward it is to phrase scaling like this from a computer science standpoint, and how that lowers the likelihood that this is how AMD intended to frame the performance picture.)
The second one is what JF AMD seems to say, I believe. See the post earlier in the thread (I think it was this one) where someone quoted him from another forum.