Observations with an FX-8350

AtenRa · Jan 16, 2013

Jovec said:
Perhaps you meant to say compare it to Phenom or Thuban (i.e. a non-CMT CPU)?

As consumers, apples-to-apples comparisons matter less than real-world performance. The question is how much of that 80-90% is due to the FX's higher clocks? Presumably someone has compared a Phenom II quad to a Vishera 2M at the same clock speed with turbo off.

No,

You have to compare 8x single Bulldozer Cores (Half Module) against 8 CMT Bulldozer Cores (4x Modules)

An example is to use 2x Modules CMT (4 Threads) against 4x Modules (4 Threads) No CMT.

This is the only way to see how the CMT design works.

AMDs Bulldozer CMT Scaling

Comparing 8 K10 cores (8 threads) against 4 Bulldozer modules (8 threads) is not an apples-to-apples comparison.

jvroig · Jan 16, 2013

Idontcare said:
I was expecting the FX-8350 to do rather well in this app; what I was not expecting was that it is basically on par with a 6yr old QX6700 in IPC. Blows my mind.

It blows my mind too, but for the opposite reason: The FX 8350 actually has equal IPC to a Core2Quad? Because even the first batch of Denebs were supposedly slightly slower clock for clock compared to C2Q, and Bulldozer was far slower than Deneb clock for clock, but Piledriver improved it a wee bit - but to my understanding, still not at Deneb levels. And yet, here it is well within Core2Quad levels or slightly better. What is it with gaussian that makes Piledriver suddenly look Deneb/C2Q level in single-thread performance? How would you characterize the computations or various other processing gaussian does?

itsmydamnation · Jan 16, 2013

It's all about instruction throughput. When you just have one thread on a module, PD beats hound (Llano) in general IPC. Once you start running two threads per module you become decode/instruction limited in a big way, and about 1/2 the IPC gain from Bulldozer disappears.

SPBHM · Jan 16, 2013

jvroig said:
It blows my mind too, but for the opposite reason: The FX 8350 actually has equal IPC to a Core2Quad? Because even the first batch of Denebs were supposedly slightly slower clock for clock compared to C2Q, and Bulldozer was far slower than Deneb clock for clock, but Piledriver improved it a wee bit - but to my understanding, still not at Deneb levels. And yet, here it is well within Core2Quad levels or slightly better. What is it with gaussian that makes Piledriver suddenly look Deneb/C2Q level in single-thread performance? How would you characterize the computations or various other processing gaussian does?

take a look here
QX6850 = Q6600 at 3GHz, Q9650 runs at the same clock (it's the 45nm Core 2 with 12MB L2) and PII 940 = 3GHz (but using DDR2), 945 uses DDR3
http://www.hardware.fr/articles/778-14/moyenne.html

but it's really related to what software you use...
I'm sure a q9550 overclocked would be interesting to compare...

jvroig · Jan 16, 2013

itsmydamnation said:
It's all about instruction throughput. When you just have one thread on a module, PD beats hound (Llano) in general IPC. Once you start running two threads per module you become decode/instruction limited in a big way, and about 1/2 the IPC gain from Bulldozer disappears.

But we've already seen so many single-thread and clock-for-clock comparisons made since BD appeared. That can't be it alone. The consensus was clear - clock for clock, far slower than Deneb. We've had review site upon review site test this, and they weren't testing MT applications when they said this. They were using single-threaded ones of course.

It has to be Gaussian's specific mix of operations that either plays well with FX, or doesn't play well with C2Q, such that the net effect is they offer similar IPC.

AtenRa · Jan 16, 2013

As I have said so many times, IPC is not constant on the same processor. IPC is 99% application dependent.

grimpr said:
Great job IDC, although that old version of Gaussian you're using is most probably running x87 code, which modern AMD processors are very bad at (you can use AMD's excellent CodeXL to profile apps). Your multitasking Gaussian graph shows what this chip is capable of: it's a multitasking throughput monster for the price.

Well it could be the reason.

jvroig · Jan 16, 2013

SPBHM said:
take a look here
QX6850 = Q6600 at 3GHz, Q9650 runs at the same clock (it's the 45nm Core 2 with 12MB L2) and PII 940 = 3GHz (but using DDR2), 945 uses DDR3
http://www.hardware.fr/articles/778-14/moyenne.html

I'm sorry, I may have misunderstood your point, but I think what you are trying to say is that your graph shows Deneb actually has better IPC than a QX6700, because the QX6700 isn't the highest IPC family of C2Q, not the C2Q family that Deneb was compared to when released? That would make sense, because that would mean IDC's result in gaussian still correspond to the general finding that IPC in BD and PD are still < Deneb/Thuban. Did I understand you right?

jvroig · Jan 16, 2013

AtenRa said:
As I have said so many times, IPC is not constant on the same processor. IPC is 99% application dependent.

Of course, this does not need to be said. In the real world, however, applications follow a general trend, so you will find, for example, very few examples of a Phenom II having more IPC than Nehalem, Sandy Bridge or Ivy Bridge. I do not know of any, but the possibility exists if someone wanted to optimize solely for Phenom II, but that would be crazy.

And this is just as true with BD/PD. The trend found is that BD performs much lower clock-for-clock than Deneb (let's talk ST to avoid useless CMT arguments).

[This is why I then asked if it is really Gaussian's particular mix of computations or processing, or something else. I think, as per SPBHM above, it is something else, namely that the QX6700 isn't the higher-than-Deneb IPC arch I thought it was.]

Furthermore, the saying "IPC is 99% application dependent" is not really used to imply that, in some impossibly lopsided optimized software, a processor that has lower ST performance in 99% of other apps will somehow magically have much greater ST performance in another. What the rule of thumb you quote actually points to is, quite literally, that instructions per cycle depend on how optimized the application is. Even if a particular arch could execute 3 instructions per cycle, whether it actually does so every cycle depends on whether the app can supply the processor with enough instructions to keep it full all the time. Meaning, the same processor running app A may reach an average of 2 instructions per cycle, while in app B it may only reach an average of 1.3 (the processor is doing more waiting and thumb-twiddling). That's what that saying is for: comparing the same processor on different apps, not comparing different processors on the same app.
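To put numbers on the "same processor, different apps" point, here is a minimal sketch of how an average IPC figure falls out of two counters. The instruction and cycle counts are invented purely to reproduce the 2.0 and 1.3 averages used as the example in the post; they are not measurements from any chip.

```python
def average_ipc(instructions_retired: int, cycles: int) -> float:
    """Average instructions per cycle over a whole run."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions_retired / cycles

# Hypothetical counter readings for ONE processor running two different apps.
# The values are made up to match the 2.0 vs 1.3 example in the post.
app_a = average_ipc(instructions_retired=2_000_000_000, cycles=1_000_000_000)
app_b = average_ipc(instructions_retired=1_300_000_000, cycles=1_000_000_000)

print(f"app A: {app_a:.2f} IPC")
print(f"app B: {app_b:.2f} IPC")
```

Same hardware, same peak width, different average IPC, simply because app B leaves the pipeline waiting more often.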

AtenRa · Jan 16, 2013

That's why I said "in the same processor": IPC will fluctuate depending on the application being run, every time, on the same processor.

Also, IPC is frequency dependent to a lesser degree: the same processor doesn't have the same IPC (same application) at 2GHz and at 4GHz.

jvroig · Jan 16, 2013

Ok, it seems we are in agreement. I was just confused because I thought you brought it up as an answer to my query regarding gaussian when it doesn't apply to my question at all.

AtenRa said:
Also, IPC is frequency dependent to a lesser degree: the same processor doesn't have the same IPC (same application) at 2GHz and at 4GHz.

It will almost always go down as the clocks go higher (I don't mean starting from 1MHz, I mean from whatever your stock is, as that's already a reasonably significant clock speed), because we aren't increasing the performance of every single part of the CPU when we overclock the "core" (as opposed to "uncore", or even when core/uncore is unified such as in Ivy - although rumor has it it will be decoupled again in Haswell and consequently OC headroom will increase? Sorry, I digress).

Most significant: even if the execution units, for example, become fast enough to accommodate more instructions per second (not talking of cycles here) and could theoretically get 20% more work done (20% OC, for example), they won't get the chance to realize all of that 20% increase. The registers and caches are still the same size, can still only hold the same amount of data, and still suffer the same hit rate, and no matter how much you OC you can't remove cache misses. (If you were to be really pessimistic about it, say for an arch with very little of this infra to begin with (cache starved), you could say you are just making the cache misses happen faster and more frequently.)

Scaling well with clock speed means the balance of transistors spent on the caches (sizes + latencies) is favorable to the higher frequency, and that is a design goal (not just caches, I just chose them as the most obvious and most easily understood/appreciated example). Archs with barely enough of them to begin with scale poorly (Phenom I probably, but it also suffered from poor headroom in the first place, so it doesn't really matter, as overclocking the original Phenom BEs was rather pointless).
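A rough way to see this in numbers is a toy model. It assumes (my assumption, not something measured in this thread) that each cache miss stalls the core for a fixed time in nanoseconds, because memory does not speed up with the core clock. Every miss then costs more cycles at a higher clock, so average IPC drops. All parameter values are hypothetical and chosen only to show the trend.

```python
def effective_ipc(freq_ghz, core_cpi=0.5, misses_per_instr=0.001, miss_latency_ns=60.0):
    """Toy model: IPC when each cache miss stalls for a fixed wall-clock time.

    core_cpi, misses_per_instr and miss_latency_ns are hypothetical values,
    not figures for any real CPU.
    """
    # A fixed latency in ns costs more cycles as the clock rises (ns * GHz = cycles).
    stall_cycles_per_instr = misses_per_instr * miss_latency_ns * freq_ghz
    return 1.0 / (core_cpi + stall_cycles_per_instr)

def speedup(freq_from_ghz, freq_to_ghz, **model):
    """Performance ratio, where performance = IPC * frequency."""
    perf_from = effective_ipc(freq_from_ghz, **model) * freq_from_ghz
    perf_to = effective_ipc(freq_to_ghz, **model) * freq_to_ghz
    return perf_to / perf_from

for f in (2.0, 4.0, 4.8):
    print(f"{f:.1f} GHz: IPC = {effective_ipc(f):.3f}")
print(f"4.0 -> 4.8 GHz (20% OC) speedup: {speedup(4.0, 4.8):.3f}x")
```

In this model a 20% overclock always yields less than 20% more work, and the gap widens as `misses_per_instr` grows, which is the "cache starved" case described in the post.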

AtenRa · Jan 16, 2013

jvroig said:
Of course, this does not need to be said. In the real world, however, applications follow a general trend, so you will find, for example, very few examples of a Phenom II having more IPC than Nehalem, Sandy Bridge or Ivy Bridge. I do not know of any, but the possibility exists if someone wanted to optimize solely for Phenom II, but that would be crazy.

And this is just as true with BD/PD. The trend found is that BD performs much lower clock-for-clock than Deneb (let's talk ST to avoid useless CMT arguments).

There are applications where Bulldozer has the same or higher IPC than Deneb, and if we use SIMD, Bulldozer has higher IPC than Nehalem and sometimes even Sandy Bridge.
But in legacy code, 99% of the time BD/PD has lower IPC than Deneb.

http://atenra.*********/2012/02/01/amd’s-bulldozer-cmt-scaling/

Just to show you how IPC is application (instruction) dependent

And how frequency dependent

dastral · Jan 16, 2013

hence the VRMS heatsink will be forced at an equivalent 40°C ambient temp that will add to the 50°C due to the VRM dissipation, the whole thing will reach untouchable temp at high loading.

I find this very misleading...
Pushing 40°C air on something that was at 60°C (with 0 active cooling at 25° Ambient) will not increase their temps.
TopDown coolers ALWAYS produce better results with VRM than Tower coolers because "forced hot air" cools much better than "passive case cooling".
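The argument can be sketched with the usual steady-state relation: heatsink temperature = air temperature + rise over air, where the rise shrinks when airflow lowers the thermal resistance. The 60°C at 25°C ambient figure is from the post; the resistance ratio is a hypothetical parameter, since nobody in the thread measured it.

```python
def vrm_temp_c(air_temp_c: float, passive_rise_k: float, resistance_ratio: float = 1.0) -> float:
    """Steady-state heatsink temperature: air temperature plus the rise over air.

    resistance_ratio = (thermal resistance with airflow) / (resistance with none).
    1.0 means no airflow. Its value for any real board is unknown here.
    """
    return air_temp_c + passive_rise_k * resistance_ratio

# 35 K rise over ambient, from the "60°C at 25°C ambient, no active cooling" figure.
passive_rise = 60.0 - 25.0

# Break-even: the ratio below which 40°C forced air beats 25°C still air.
break_even = (60.0 - 40.0) / passive_rise

print(f"break-even resistance ratio: {break_even:.2f}")
print(f"hypothetical ratio of 0.4: {vrm_temp_c(40.0, passive_rise, 0.4):.1f} °C")
```

So under this simple model the claim holds whenever the airflow cuts the heatsink's thermal resistance to less than roughly 57% of its passive value; whether a given top-down cooler achieves that is not something this sketch can tell you.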

inf64 · Jan 16, 2013

What should also blow your mind is the fact that "8C" Piledriver has 4 256-bit (2x128-bit) FP units while C2Q has 4x2x128-bit FP units (and 3x128-bit if you count SSE, not pure FP throughput). So from the point of view of available resources "per core", or better yet "per thread", C2Q has 8 128-bit FP execution ports available to 4 cores, so 2 FP execution ports per core or per thread, and 12 128-bit SSE execution ports available to 4 cores, so 3 SSE execution ports per thread. The "8C", or better yet "8T", Piledriver chip has 8 128-bit FMAC (for fmul or fadd) execution ports available to 8 cores, so 1 128-bit FP exec. port per thread, and similarly 8 128-bit SSE execution ports available to 8 cores, so 1 SSE exec. port per core or thread.

As can be seen, C2Q has 2x or even 3x more FP/SSE resources per core (or thread, if you will) and still performs the same. I hope you can see how futile it is to compare these 2 radically different designs, and one can argue that the PD design actually has much better execution efficiency, as its units are doing more work per cycle to compensate for the lack of exec. hardware vs the Core design 😉.
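The per-thread bookkeeping in this post is easy to lose track of, so here it is as a small sketch. The port counts are the ones stated in the post and are taken at face value, not independently verified; the `Chip` class is just a hypothetical container for them.

```python
from dataclasses import dataclass

@dataclass
class Chip:
    name: str
    threads: int
    fp_ports_128: int   # 128-bit FP execution ports on the whole chip
    sse_ports_128: int  # 128-bit SSE execution ports on the whole chip

    def fp_per_thread(self) -> float:
        return self.fp_ports_128 / self.threads

    def sse_per_thread(self) -> float:
        return self.sse_ports_128 / self.threads

# Port counts as stated in the post (not independently verified).
c2q = Chip("Core 2 Quad", threads=4, fp_ports_128=8, sse_ports_128=12)
pd = Chip('"8C" Piledriver', threads=8, fp_ports_128=8, sse_ports_128=8)

for chip in (c2q, pd):
    print(f"{chip.name}: {chip.fp_per_thread():.0f} FP, "
          f"{chip.sse_per_thread():.0f} SSE ports per thread")
```

That gives the 2x (FP) and 3x (SSE) per-thread advantage for C2Q that the post refers to.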

Idontcare · Jan 16, 2013

dkm777 said:
I just wanted to chime in about VRM cooling affecting overclocks. Even though I don't have an FX (undecided), but when I lowered the fans on my Cogage Arrow so that some air would flow over the large VRM heatsink I could reduce both the CPU-NB voltage (which reduced temps dramatically) and core voltage. I left my PC folding for the day and if I don't find that it has crashed then I think we can confirm the theory that certain Asus AMD motherboards need better cooling for the VRMs (I have a Sabertooth 990FX).

😱 Wow, very cool to hear this :thumbsup: thanks for taking the time to weigh in and confirm the observation (or rather, mine confirms yours 😉)

That at least suggests it isn't entirely a fluke or a one-off situation.

I have one of those IR guns that measures the surface temperature of objects, I think I'll explore this a little more and see just what kind of temperature delta I am getting with the extra cooling :hmm:

jvroig said:
It blows my mind too, but for the opposite reason: The FX 8350 actually has equal IPC to a Core2Quad? Because even the first batch of Denebs were supposedly slightly slower clock for clock compared to C2Q, and Bulldozer was far slower than Deneb clock for clock, but Piledriver improved it a wee bit - but to my understanding, still not at Deneb levels. And yet, here it is well within Core2Quad levels or slightly better. What is it with gaussian that makes Piledriver suddenly look Deneb/C2Q level in single-thread performance? How would you characterize the computations or various other processing gaussian does?

This particular app has traditionally strongly favored AMD's microarchitectures, starting with the K7 Athlon. I suspect what we are seeing here is that AMD continued to hold this commanding IPC lead over Intel all the way up through Thuban, and that the lead was so large that even with the step back in IPC that Piledriver might be relative to Thuban, it still isn't enough of a step back to erase its IPC lead over the Kentsfield microarchitecture.

AtenRa said:
No,

You have to compare 8x single Bulldozer Cores (Half Module) against 8 CMT Bulldozer Cores (4x Modules)

An example is to use 2x Modules CMT (4 Threads) against 4x Modules (4 Threads) No CMT.

This is the only way to see how the CMT design works.

AMD’s Bulldozer CMT Scaling

Comparing 8 K10 cores (8 threads) against 4 Bulldozer modules (8 threads) is not an apples-to-apples comparison.

I have some data on that for this app.

At 4GHz, loading two modules with one thread each results in a compute time per thread of 330 seconds.

Loading both of those threads onto one module results in a compute time per thread of 376 seconds.

In this application the "CMT tax" is quite small, 330/376 = 0.88x, or roughly 14% loss (1/0.88) in scaling efficiency for the FX-8350.

Compare this to the "HT tax" on the 3770k, at 4GHz loading two physical cores results in a compute time per thread of 218 seconds. (ridiculously faster than the FX8350)

But load both threads onto a physical + virtual core pairing and the compute time per thread balloons to 397 seconds.

In this application the "HT tax" is quite large, 218/397 = 0.55x, or roughly 82% loss (1/0.55) in scaling efficiency for the 3770k 😱

(100% loss in scaling efficiency would be tantamount to the performance you would expect in putting two threads onto just one single-threaded core)
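For anyone who wants to check the arithmetic, here is a minimal sketch using only the four timings quoted in this post. "Loss" is defined as the extra time per thread when the two threads share a module or core, so 100% corresponds to the two-threads-on-one-single-threaded-core case. The function name is mine.

```python
def scaling(t_separate_s: float, t_shared_s: float):
    """Return (relative speed, fractional loss) when two threads share one unit.

    t_separate_s: seconds per thread with each thread on its own module/core
    t_shared_s:   seconds per thread with both threads on one module/core
    """
    relative_speed = t_separate_s / t_shared_s
    # Extra time per thread; 1.0 (100%) means no gain over a single-threaded core.
    loss = t_shared_s / t_separate_s - 1.0
    return relative_speed, loss

# Timings quoted in the post (4 GHz, seconds per thread).
results = {
    "FX-8350 (CMT)": scaling(330, 376),
    "3770k (HT)": scaling(218, 397),
}
for name, (speed, loss) in results.items():
    print(f"{name}: {speed:.2f}x speed, {loss:.0%} loss")
```

This reproduces the 0.88x / 14% figures for the FX-8350 and the 0.55x / 82% figures for the 3770k.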

edit: I see the powers that be have managed to torpedo your URL's again 🙁 given enough time I have every confidence we'll eventually find ourselves censoring anandtech.com too and righteously proclaiming "mission accomplished" over the spammers in the process 😀

AtenRa · Jan 16, 2013

Idontcare said:
I have some data on that for this app.

At 4GHz, loading two modules with one thread each results in a compute time per thread of 330 seconds.

Loading both of those threads onto one module results in a compute time per thread of 376 seconds.

In this application the "CMT tax" is quite small, 330/376 = 0.88x, or roughly 14% loss (1/0.88) in scaling efficiency for the FX-8350.

Compare this to the "HT tax" on the 3770k, at 4GHz loading two physical cores results in a compute time per thread of 218 seconds. (ridiculously faster than the FX8350)

But load both threads onto a physical + virtual core pairing and the compute time per thread balloons to 397 seconds.

In this application the "HT tax" is quite large, 218/397 = 0.55x, or roughly 82% loss (1/0.55) in scaling efficiency for the 3770k 😱

(100% loss in scaling efficiency would be tantamount to the performance you would expect in putting two threads onto just one single-threaded core)

edit: I see the powers that be have managed to torpedo your URL's again 🙁 given enough time I have every confidence we'll eventually find ourselves censoring anandtech.com too and righteously proclaiming "mission accomplished" over the spammers in the process 😀

So lets see,

FX8350 two threads single Module = 376 secs
Core i7 single core + HT = 397 secs

Now, since one BD module is almost the same size as one SB core (at 32nm), from the above measurements we can conclude (for that application) that as far as throughput goes, BD is better than SB.

People call the BD architecture all kinds of names, from inefficient to "sucks", but they have never understood that BD was designed for multitasking/multithreading and throughput, not high IPC.
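Expressed as throughput, the comparison looks like the sketch below. It uses only the two per-thread times from Idontcare's measurements and takes the "roughly equal area" premise as given rather than checking it; the function name is mine.

```python
def jobs_per_hour(threads: int, seconds_per_thread: float) -> float:
    """Completed jobs per hour when `threads` jobs run concurrently."""
    return threads * 3600.0 / seconds_per_thread

# Per-thread times from the measurements quoted in this post (two threads each, 4 GHz).
bd_module = jobs_per_hour(2, 376)   # one FX-8350 module, both cores loaded
i7_core_ht = jobs_per_hour(2, 397)  # one 3770k core with Hyper-Threading

print(f"FX-8350 module: {bd_module:.1f} jobs/hour")
print(f"3770k core+HT:  {i7_core_ht:.1f} jobs/hour")
print(f"ratio: {bd_module / i7_core_ht:.3f}")
```

The module comes out ahead by the ratio 397/376, so a few percent, in this one application.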

Edit: well, I have to find a way to invade those torpedoes 😉

mrmt · Jan 16, 2013

Idontcare said:
Loading both of those threads onto one module results in a compute time per thread of 376 seconds.

(...)

But load both threads onto a physical + virtual core pairing and the compute time per thread balloons to 397 seconds.

IDC,

Would you have power consumption data for those measurements?

grimpr · Jan 16, 2013

AtenRa said:
People call the BD architecture all kinds of names, from inefficient to "sucks", but they have never understood that BD was designed for multitasking/multithreading and throughput, not high IPC.

This. The more you push the CPU running multiple concurrent programs, the better it behaves. I wonder what will happen when they dump the shared decoder in SR and switch to a dedicated decoder in each integer core; this alone should catapult performance in single-threaded apps.

jvroig · Jan 16, 2013

AtenRa said:
People call the BD architecture all kinds of names, from inefficient to "sucks", but they have never understood that BD was designed for multitasking/multithreading and throughput, not high IPC.

It's not that they don't understand. It's that they don't care because it doesn't matter - single thread performance is what matters more - it improves performance of lightly threaded apps, and improves performance of highly threaded apps as well, simply because you had more performance to start with.

"Throughput per die area" or "Throughput per core size", as you are proposing, is not a useful metric for anybody, not even for server guys. Performance/watt is. Pure performance is. Performance/dollar is. In all of these three metrics, in most of the scenarios (single-thread, lightly-threaded, multi-threaded, only exception is probably embarrassingly parallel), Intel CPUs (and sometimes, even Deneb/Thuban) score higher than BD.

You see the problem? You can invent metrics where AMD wins, but unless it is something really solid like pure performance, performance/watt, or performance/dollar, most people will continue not giving two bleeps about the BD arch, and continue to say it sucks.

Idontcare · Jan 16, 2013

AtenRa said:
So lets see,

FX8350 two threads single Module = 376 secs
Core i7 single core + HT = 397 secs

Now, since one BD module is almost the same size as one SB core (at 32nm), from the above measurements we can conclude (for that application) that as far as throughput goes, BD is better than SB.

People call the BD architecture all kinds of names, from inefficient to "sucks", but they have never understood that BD was designed for multitasking/multithreading and throughput, not high IPC.

Edit: well, I have to find a way to invade those torpedoes 😉

Yeah that is what shows up in this graph:

[Image: Gaussian 98 A.7 multi-tasking single-threaded performance scaling]

Once you start multi-tasking the app, the work you can get done in a given amount of time heavily favors (in this specific app) the CMT approach that AMD adopted.

I think it is also worthwhile to point out that this is one of those apps for which even the oft-misquoted Moore's Law "performance doubles every 2 yrs" mantra actually pans out.

From 1998 to 2012 we'd expect performance to now be 128x that of what it was in 1998, which the 4GHz FX-8350 basically provides.
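The 128x figure is just seven doublings. A one-function sketch of that arithmetic, using the (misquoted) two-year doubling period from the post:

```python
def expected_speedup(start_year: int, end_year: int, doubling_period_years: float = 2.0) -> float:
    """Speedup implied by performance doubling every `doubling_period_years`."""
    doublings = (end_year - start_year) / doubling_period_years
    return 2.0 ** doublings

# 14 years at one doubling every 2 years = 7 doublings.
print(expected_speedup(1998, 2012))
```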

"Confirmation bias, a game everyone plays only until such time that they are ahead" 😛 😉

mrmt said:
IDC,

Would you have power consumption data for those measurements?

Unfortunately I do not have the specific results, they were right in-line with the power numbers I had previously recorded when running CinebenchR11.5 so I didn't bother to record them in detail for the G98A.7 benching.

(basically ~165W for the 4GHz IB and ~265W for the 4GHz PD, platform power usage at the wall sans LCD power)

There is no question, thanks to the comparison being between a 32nm processor and a 22nm processor, the 22nm processor yields fantastic performance/watt results even in lopsided cases like this one where the performance crown goes to the 32nm processor.

Homeles · Jan 16, 2013

AtenRa said:
You have to compare it against a 8 (legacy) core based Bulldozer to see how it performs. Most of the time, CMT performs at 80-90% of a legacy Multi Processor(MP) design.

80-90% is still less than 100%, so I'm still correct.

guskline · Jan 16, 2013

IDC: Went ahead and added active cooling to the VRM on my Sabertooth 990FX mb. Basically, I had an extra 120 mm Corsair high-performance fan with a 4-pin connector that I mounted above the VRM/NB area. I used special velcro that you can buy at Radio Shack and attached the fan at 2 corners to the top of 2 metal housing units at the rear of the mb. It dropped the temp about 2 degrees at idle. I'll play with it a while to see what effect it really has.

grimpr · Jan 16, 2013

Homeles said:
80-90% is still less than 100%, so I'm still correct.

That's what AMD was saying all along too about CMT.

AtenRa · Jan 16, 2013

jvroig said:
It's not that they don't understand. It's that they don't care because it doesn't matter - single thread performance is what matters more - it improves performance of lightly threaded apps, and improves performance of highly threaded apps as well, simply because you had more performance to start with.

Depends who we are talking to. People using single-threaded apps will not care, people who don't multitask will not care, and until recently people gaming would not care. But the Bulldozer architecture was not designed for those people or those apps. Single-thread performance and higher IPC have been the dominant force on the desktop until now. Things are starting to change with Intel's HT and AMD's CMT.
It is no coincidence that Haswell is getting wider, with two more ports in the execution engine (ports 6 and 7). IPC (legacy code) will not get much higher than 5-10%, BUT I'm expecting core throughput and performance with HT to gain much more than 10% because of the wider execution core vs SB/IB.

jvroig said:
"Throughput per die area" or "Throughput per core size", as you are proposing, is not a useful metric for anybody, not even for server guys. Performance/watt is. Pure performance is. Performance/dollar is. In all of these three metrics, in most of the scenarios (single-thread, lightly-threaded, multi-threaded, only exception is probably embarrassingly parallel), Intel CPUs (and sometimes, even Deneb/Thuban) score higher than BD.

I only wanted to show that at the same real estate, multithread/multitask performance is almost the same: two different architectures performing the same using the same die area. Yes, the BD architecture doesn't have the same high IPC as SB/IB, but you really have to acknowledge the fact that BD's integer core inside the module has 2/3 the throughput of the SB/IB integer execution engine. Yes, it doesn't concern the majority of people, but from an engineering and architectural point of view it changes the way you look at it.

jvroig said:
You see the problem? You can invent metrics where AMD wins, but unless it is something really solid like pure performance, performance/watt, or performance/dollar, most people will continue not giving two bleeps about the BD arch, and continue to say it sucks.

Performance efficiency is very good for a 315mm2 1.2B transistor processor. The only problem is that those processors are not efficient (performance, power usage) for desktop use. Much like the Intel Core i7 3820.
The Intel Core i7 3820 became obsolete for desktop the moment the Ivy Bridge Core i5 and Core i7 were released. Even before IB, there was no point in paying more for the Core i7 3820 (mobo included) over the Core i7 2600K for desktop use.

Homeles · Jan 16, 2013

grimpr said:
Thats what AMD was saying all along too about CMT.

Until they produce a product capable of delivering that, it doesn't matter. Still, FPU throughput will suffer.
