Bulldozer has better IPC when run 4C/4M?

mosox · Oct 14, 2011

The OS scheduler should let you tweak your CPU cores how you want. I mean while in Windows, one can say "I don't need 8 cores for this, lemme disable 4 of them". Is this feasible? Can the OS be made to cope with this kind of changes while running it? Like buttons in the task manager, disable this, enable that.

jvroig · Oct 14, 2011

dawp said:
So what are the chances of AMD releasing an optimizer like they did with the dual core when they first came out?

Ha! I remember that, AMD's Dual Core Optimizer. I think I needed one for my X2 5000+, but I can't be sure if I actually needed it or I just ended up using it erroneously. Come to think of it, I think I actually had to uninstall it eventually because I determined it started some funky behavior in a few Steam games I had. It's quite some time ago I can't remember.

Anyhoo, if they do come out with a "Bulldozer Optimizer" and Zambezi becomes salvageable, that would be good news.

ydnas7 · Oct 14, 2011

there are also turbo boost implications for 4m4c vs 2m4c

Ferzerp · Oct 14, 2011

People getting excited about this seem to be forgetting one thing. All the evidence is not pointing to any magical scheduler fixable issue. They only get better single thread performance when the other part of the module is actually *shut off*. In no situation has that been replicated with it enabled, but idle.

bryanW1995 · Oct 14, 2011

Kevmanw430 said:
See, this is the problem. BD just came out, so there is no software optimization, GF is having issues with 32nm, and core scheduling should be 1,3,5,7,2,4,6,8, not 1-8. If all of these things were resolved, BD might actually have a chance..

Why can't they just "trick" the software by making the "primary" core in a module get treated like a "real" core in, say, a 2600k? Say name them core 1,2,3,4, and then the extra/CMT/ht cores become 5/6/7/8? Or would they lose optimizations because in some cases it's more efficient to have the workload on a single core instead of space across 2 separate ones?

edit: And Anand's article said that IPC was about 7% less than ph2, so if when using <= 4 cores you can get that 20% bump then you'd be looking at about 13% greater IPC than ph 2. What is that, somewhere between penryn and nehalem? Not great, but BD at least clocks higher than nehalem it would at least make BD something worth considering for AMD fans.

remyat · Oct 14, 2011

What about "harvested" FX-41xx? are they 2m/4c? someone "lucky" will get a 4m/4c with 4 non-adjacent integrer units disabled? and FX-61xx, 4m/6c, 3m/6c? lol funny benchs "disclaimer: YMMV"

And about Bulldozer, I'm hoping they solve the power comsuption issue because if you use your pc a lot, even getting the processor for free could end costing you more money than an equal or better SB.

Idontcare · Oct 14, 2011

jvroig said:
Ha! I remember that, AMD's Dual Core Optimizer. I think I needed one for my X2 5000+, but I can't be sure if I actually needed it or I just ended up using it erroneously. Come to think of it, I think I actually had to uninstall it eventually because I determined it started some funky behavior in a few Steam games I had. It's quite some time ago I can't remember.

Anyhoo, if they do come out with a "Bulldozer Optimizer" and Zambezi becomes salvageable, that would be good news.

I wonder if it would at all be cost and performance effective to have an analogous situation to how we handle NAND flash wear-leveling algorithms where the addressed bit has not bearing on the physical location, it is mapped by a controller.

For CPU's, the addressed "core" is entirely virtual, the CPU itself manages core loadings such that performance is always maximized (modules are loaded with one thread with utilization rate of 100%) or such that power-consumption is minimized (modules are fully loaded before threads are allowed to go to be allocated to the next available module) and so on.

Make it OS agnostic, akin to how the controllers work on modern SSD's.

aigomorla · Oct 14, 2011

Kevmanw430 said:
Which, again, makes perfect sense. But, the fact is, if we can get Windows to schedule threads in the right order, BD might not be such a stinker... Who knows.

no it more points to the cores being like intel's "hyper threaded cores".

That means, its not a real octocore to any person who understands cpu's.

Its 1 physical core trying to split into 2 active threads... which is why u see better performance by disabling the threads, and pushing the physical cores.

Dont believe me? Run Intel's LinX with HT ON, and then OFF.
You see increase in flops.

zlejedi said:
So have anyone actually run strictly 4 threaded code on 8 core BD with 4M/8C and 4M/4C scenario ?

Because 4M/4C being faster than 2M/4C is discovery worthy of captain obvious award.

If 4M/4C would be faster than 4M/8C in any scenario then it would prove that indeed there's some kind of problem with scheduling.

actually this would be a good test..

Test the flops at 8, then 4, and then 1.

And see how it scales..

Kevmanw430 · Oct 14, 2011

aigomorla said:
no it more points to the cores being like intel's "hyper threaded cores".

That means, its not a real octocore to any person who understands cpu's.

Its 1 physical core trying to split into 2 active threads... which is why u see better performance by disabling the threads, and pushing the physical cores.

Dont believe me? Run Intel's LinX with HT ON, and then OFF.
You see increase in flops.

I know its not quite a real octocore. It has 8 IPU's, but only 4 FPU's shared between 2 FPU's. It is NOT 1 physical core trying to split two active threads, that's HT. Point is, while you do gain some single threaded increase from disabling HT, you get much more from not sharing that FPU. Maybe AMD can release that "Bulldozer Enchancer.)

AtenRa · Oct 15, 2011

Dresdenboy said:
And then there is inefficiently decodable code blocking important shared front end resources.

Hello,

Do you mean inside the shared Decoder of the module(MacroOps) or in the application ??

crazylocha · Oct 15, 2011

Aigo is almost there. (Salud mi amigo)

not about "normal sceduling"
There aren't enough terms to describe Intels original doubling speed optimizations. They were done in thirds. Go figure. Intel would return to the unoriginal divide by four, drop one for overhead. Remember the original dx33? Cheat was it quadropled speed, dropped one crank, for "engineering overhead", and add four memory sticks, dropping one for higher speed thresholds.
Start to sound like a mantra?

Your primary schedular is actually unit zero. Then BD core 1 becomes odd man out. Start by sceduling core 2 (really #3 by scedule), then skip to core 5, (BD core 4), and so on. Once you get out of base 10 thinking, you start getting closer to the truth. Its base 3 thinking. Look at other results. Enumerate by thirds. Not tenths. Hence the need for different thinking. Go back to when Intel first did three stick memory and figure out why. Headroom for speed (in a very basic way). A quad controller will be more effective if its only controlling 3 channels. Will have buffer flows looking like genius level when run on simpler levels. Run modern threading on older spacial sceduling and you will swamp it like first year entry programmers first try at hyperthreading.

Break its thermal threshold when scheduling. Core 2, then 5, on to 8, back to 3. Which quads of the base root are you waking on charge and which have applied heat from neighboring cores? Wanna bet will make difference? 95% of programmers don't have clue. They take easiest quickest route, instead of providing leeway. Try running around the shortcomings of Intels compilers (study Dan Corbit for lifelong lessons) or Amd's for fairness sake, and write them yourselves. Not MS's, or anybody else's. Figure it out by yourself. Your, close Aigomorla.

Not about 4 cores vs. 8. Which 4 cores? Why not 5? Or more importantly, which # out of the 8? Doesn't it depend on how its sceduled and WHAT is sceduled where (think layers of issues of importance) in a semi stacked level. So you popped stack on a virtual pile, which gets the cooled off integer integrator vs. Running parallel threading on other three base cores. So one "BD core is smokin" from lots of heavy floatpoint, why would bury it with threading between its "twin pair" while its cooling back off. Too many ppl don't bother thinking about how it actually goes through. Instead of having a conveinient "cooling off period" to lower tdp's, they have to throw kitchen sink at the toilet.

I can look at certain results knowing how will play out. Keep throwing water at candle, will eventaully wesr down the base until it lowers sides and puts out the wick at top.
Run 12 threads at same 6/8 core, will still bottleneck somewhere. Put some organization to how and where they get run and does it make a difference? Can you increase performance by crossing partial threading to opposite sides so buffers clear before rescheduling? Why not run an L2 buffer overflow to mirrors cache and let it flow back in same L3 stacking profile? How many programmers used to individual cores think like twin pairs and allow its trancendence?

A whole new genre of tweaks are yet to come. How is another animal.
Will they think out of 15 year old box? Many of these guys, thats most of their "sentient aware" lives. Can they learn from pasts and project forward and find someone elses unsaid truths? I don't have one in my hands yet... Will wait for "new" prices to come down first. Then will trash a few kernals until I see what makes it sing. Will tickling middle toe be more effective than smashing that big toe with three pound hammer? Maybe. Prolly not lol. Still have a Thunderbird desktop cpu that never was made to fit in laptop (mine was some kind of demo) that was "tickled" to dore more than laptop or desktop was meant to. Why? Bios dev thought out of box. He passed buck to me to figure out how to not make it tone deaf, for 7 years it sang operas while its sisters tried to figure out how to queen pawns. Move order theory hadn't got past negative progressions, until nul move theory gained steam. Skip a cell of root moves to advance a negative? Novel. Out of box even. Killed a 6 month continuos data feed to play in it. Found why Thunderbird died quick death when dual core oppositional stance on same integer line made huge difference scheduling both sides of same sine curve with expectant integer eval. Bi-directional doesn't work on bad single fpu without true "fpu co-processor" to shed excess null voids overlapping on single stacks. Parity is not strong suit of two dimensional exuecution of three or four demensional thinking (hard messy lesson).

Take my cell phone for ex. Dual core A9 Dx2. Droidfish running single threaded single core found matein 6 moves slower than my 6yr old nephew. Give it multi-threading and finds it faster then me. Contact dev and all of sudden, he turns on core awareness, and it smokes four times as fast. Why? Should have doubled only. 6 different iterations of theorums changing next level downs, causing greater gains through negative parsings. Each layer shed the slough of unlikely returnd through raw trail. Each built on each others failures that were sucesses. (CCC archives, Dr. Bob, Ed Screoder, etc. 20+ years ago) taught about posititive failures if you can figure out pattern of cascading leaks via different chipsets. They unleashed the compiling monster that became Dan Corbit. Tweak it depending on looping thresholds and you can surpass the norm with triviality. Some high paid twit (now, not then, lol) decided better to keep his long term prospects open by not tweaking his boss off with a better idea than his. Maybe he will let his pass off go as little noticed as possible. Can your recompile of his ss2 sub dep go better without his letdowns? Most of you that have read thus far can guess that might be yes. Depends on what your motivations are. Hence why I find more satisfaction in the Linux circles. They aren't waiting for some mega giant corp to come up with a better driver. Go do it differently. Will be happy to Alpha/Beta for ya if have a spec you need run.

OCGuy · Oct 15, 2011

dawp said:
So what are the chances of AMD releasing an optimizer like they did with the dual core when they first came out?

BD was so delayed, that hopes for some magic program to make it better are pie-in-the-sky.

aigomorla · Oct 15, 2011

i thought we lost drivers for CPU back in windows XP.

wahdangun · Oct 15, 2011

i think its a time for CPU to get driver because it will be more complex and the different architecture between intel and AMD will make it hard for software developer.

to be honest i think why AMD want to make the bulldozer what it is know, is to make a more efficient chip because they can shutdown the inactive module. but its seems this strategy backfired.

Idontcare · Oct 15, 2011

aigomorla said:
no it more points to the cores being like intel's "hyper threaded cores".

That means, its not a real octocore to any person who understands cpu's.

Its 1 physical core trying to split into 2 active threads... which is why u see better performance by disabling the threads, and pushing the physical cores.

Dont believe me? Run Intel's LinX with HT ON, and then OFF.
You see increase in flops.

actually this would be a good test..

Test the flops at 8, then 4, and then 1.

And see how it scales..

Tests were ran long ago, years IIRC:

Blastman · Oct 16, 2011

Vesku said:
Yeah some strange scenarios where the cores can't be fed properly going on in certain tests.

That's what I was thinking might be a problem after reading Anand's write-up on the BD architecture

Anandtech
Since fetch and decode hardware is shared per module, and AMD counts each module as two cores, given an equivalent number of cores the old Phenom II actually offers a higher peak instruction fetch/decode rate than the FX. The theory is obviously that the situations where you're fetch/decode bound are infrequent enough to justify the sharing of hardware. AMD is correct for the most part. Many instructions can take multiple cycles to decode, and by switching between threads each cycle the pipelined front end hardware can be more efficiently utilized. It's only in unusually bursty situations where the front end can become a limit.

So I gather that this little 4M/4C vs 2M/4C exercise shows that a single front end can't adequately feed the 2 cores in a module to attain an improved IPC over the previous generation Phenom II.

While that appears to be bad news, the fact that the IPC improves significantly over the previous generation when that front end bottleneck is removed (4M/4C mode) bodes good for AMD. It means that they have designed a significantly faster IPC core over the Phenom II when it can be fed properly.

What's the solution? I don't know.

It's possible they can redo the front end and remove that bottleneck, but that would probably mean increasing the transistor count on a module that is already getting unmanageably large die size wise. Maybe one solution would be to just

- remove 1 core from the module
- retain most or all of the FPU
- add hyperthreading (core is now 4 wide issue like Intel's processors)
- redo the L3 (smaller and faster like the on SB compared to Nehalem)

and they should have a CPU very competitive with Intel. Who knows, maybe such a radical re-design would take too much time, but if the current module problems can't be "fixed" they are going to be stuck with huge die size/transistor counts where cores are being underutilized because they can't be fed properly. It's not a good utilization of transistors and will make their die sizes too big to be competitive with Intel.

Accord99 · Oct 16, 2011

If you look at their full BD review and compare the scores with the Phenom II 980, the 980 will still beat the scores of the 4C/4M in nearly all of the benches.

http://udteam.tistory.com/440

aviat72 · Oct 16, 2011

BD is like a graduate school research project. You need to run simulations to figure out which combo works well for which load and what not.

To make their engineering effort worth it, AMD should publish the internal architecture specifications in the form an architecture level simulator with knobs for architectural variations (using 2 cores versus 4 cores, increasing the size of the cache, not sharing the cache for high priority threads etc.)

They should also publish an open source compiler for BD architecture.

I think after thousands of graduate students have simulated BD for their course projects, AMD engineers will FINALLY to figure out what the optimal configuration should be.

I am sure all MB manufacturers must be thanking their stars that AMD kept BD pin compatible and all it needed was a BIOS change. Imagine the hit they would have taken if AMD had asked them to support a new socket, with new boards and what not. Perhaps that by itself was a give-away of what was coming.

soccerballtux · Oct 16, 2011

Idontcare said:
Tests were ran long ago, years IIRC:

it would be great if we could fill in the plot with some more samples. Could someone rerun that in increments of 0.25 threads?

soccerballtux · Oct 16, 2011

Accord99 said:
If you look at their full BD review and compare the scores with the Phenom II 980, the 980 will still beat the scores of the 4C/4M in nearly all of the benches.

http://udteam.tistory.com/440

wasn't there something about Ph2 being a 3 wide decoder and the BD core (not module) being 2 wide, this being done because AMD found the 3rd was rarely utilized? That would explain why the BD only ties and loses to Ph2

intangir · Oct 16, 2011

soccerballtux said:
wasn't there something about Ph2 being a 3 wide decoder and the BD core (not module) being 2 wide, this being done because AMD found the 3rd was rarely utilized? That would explain why the BD only ties and loses to Ph2

You're thinking of ALU/AGUs. The ALUs execute integer instructions, the AGUs calculate memory addresses. AMD went from 3 of each in K8/10h to 2 of each in Bulldozer, but supposedly made the AGUs independent of the ALUs.
http://groups.google.com/group/comp...a93e7ec42eb?lnk=gst&q=k8+agu#a7ce9a93e7ec42eb

Mitch Alsup said:
> The third AGU was never used, waste of die area and heat.

The issue was that the 3rd unit was used a lot, only to run into the
dual-only ported DataCache. This caused sequencing issues.

> The third ALU is of more concern, Intel will standardize benchmarks to
> make this look bad, even though I know it was used 1% on average.

So what else is new.

It sounds like the 3rd AGU in K8 should've provided a benefit, but most of the time could not because there were only two ports to the data cache. The 3rd ALU was only used ~1% of the time.

Blitzvogel · Oct 16, 2011

From a module count perspective, BD doesn't seem all that bad, but why does it have so many transistors? All the on board cache?

Bulldozer has better IPC when run 4C/4M?

Senior member

Platinum Member

Member

Diamond Member

Lifer

Member

Elite Member

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member

Senior member

Lifer

Member

Lifer

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member

Golden Member

Elite Member

Golden Member

Platinum Member

Member

Lifer

Lifer

Member

Platinum Member