Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

jones377 · Apr 15, 2011

HW2050Plus said:
Except for Fritz Chess, it is quite obvious that the counts get lower the more you access main memory; you are then limited mostly by memory access, which reduces absolute IPC through memory stalls.

Whereas synthetic benchmarks obviously run in L1 cache only.

As I said the absolute IPC is more a function of the application than the CPU. It differs between 0.2 and 3.0 so by a factor of 15 on the same CPU.

What effect do you think improved load/store reordering, a bigger instruction scheduling window, improved prefetchers, bigger caches, etc. will have on the measured IPC in real-life applications such as these? They won't reduce the stalls happening on the CPU at all? Maybe AMD should have added 8 decoders and 6 ALUs instead, huh?
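
For illustration, the usual textbook way to put numbers on this argument is to add the average memory-stall cycles per instruction to the core's best-case CPI. A minimal Python sketch follows; the function name `effective_ipc` and all the numbers (miss rates, a 200-cycle miss penalty) are assumptions chosen for illustration, not measurements of any CPU discussed here.

```python
def effective_ipc(base_ipc, misses_per_instr, miss_penalty_cycles):
    """IPC after adding average memory-stall cycles per instruction."""
    base_cpi = 1.0 / base_ipc                             # best-case cycles per instruction
    stall_cpi = misses_per_instr * miss_penalty_cycles    # average stall cycles per instruction
    return 1.0 / (base_cpi + stall_cpi)


# Hypothetical workloads on the same hypothetical core (peak IPC of 3.0).
cache_resident = effective_ipc(3.0, 0.00, 200)    # synthetic loop that stays in L1
memory_bound = effective_ipc(3.0, 0.02, 200)      # 2 misses to DRAM per 100 instructions
better_prefetch = effective_ipc(3.0, 0.01, 200)   # same code, half the misses

print(f"cache resident:  {cache_resident:.2f} IPC")
print(f"memory bound:    {memory_bound:.2f} IPC")
print(f"better prefetch: {better_prefetch:.2f} IPC")
```

With these made-up inputs the same core lands anywhere from roughly 0.2 to 3.0 IPC depending on the workload, and halving the miss rate (a stand-in for better prefetching or bigger caches) nearly doubles the memory-bound figure.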

Idontcare · Apr 15, 2011

HW2050Plus said:
Except for Fritz Chess, it is quite obvious that the counts get lower the more you access main memory; you are then limited mostly by memory access, which reduces absolute IPC through memory stalls.

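Yeah Fritz Chess is a crazy memory-speed dependent bench/app:
http://i272.photobucket.com/albums/jj163/idontcare_photo_bucket/FritzChessMemoryBandwidthScaling.png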

^ that's with a lowly 2.4GHz core clock, the dependence (slope of the line) on memory bottleneck for this app would be all the higher if the clocks were 4GHz.
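
A rough way to see why a faster core leans harder on memory: DRAM latency is fixed in nanoseconds, so raising the clock shrinks only the compute part of the runtime. The sketch below assumes no overlap between compute and memory stalls, and every number in it (instruction count, IPC, miss count, 60 ns latency) is hypothetical, not taken from the Fritz Chess chart.

```python
def memory_share(instructions, base_ipc, clock_hz, misses, miss_latency_s):
    """Fraction of total runtime spent waiting on memory (no overlap assumed)."""
    compute_s = instructions / (base_ipc * clock_hz)   # shrinks as the clock rises
    memory_s = misses * miss_latency_s                 # fixed in wall-clock time
    return memory_s / (compute_s + memory_s)


# Hypothetical workload; only the core clock differs between the two runs.
workload = dict(instructions=1e9, base_ipc=2.0, misses=5e6, miss_latency_s=60e-9)

share_2_4ghz = memory_share(clock_hz=2.4e9, **workload)
share_4ghz = memory_share(clock_hz=4.0e9, **workload)

print(f"memory share at 2.4GHz: {share_2_4ghz:.0%}")
print(f"memory share at 4.0GHz: {share_4ghz:.0%}")
```

Under these assumptions the memory share of runtime grows from about 59% to about 71% when the clock goes from 2.4GHz to 4GHz, so the same change in memory speed moves the result more on the faster core.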

Abwx · Apr 15, 2011

HW2050Plus said:
I really have no idea why they call it AGLU. But according to the optimization manual it appears that the AGUs are only MicroOp assists for address calculations for CALL/LEA/Load-Execute/MOV running on EX0/EX1. It looks more like a helper to avoid further increasing instruction latencies on those instructions.

According to the said manual, section 2.10.2:

"The AGLUs contain a simple ALU to execute arithmetic and logical operations and generate effective address".

Tuna-Fish · Apr 16, 2011

HW2050Plus said:
Exactly the bold marked statement is something which will not happen. I really have no idea why they call it AGLU. But according to the optimization manual it appears that the AGUs are only MicroOp assists for address calculations for CALL/LEA/Load-Execute/MOV running on EX0/EX1. It looks more like a helper to avoid further increasing instruction latencies on those instructions.

The uop pipes shown in section B.2 are obviously wrong -- notably, according to that table almost nothing, including movs to and from memory, uses the AGUs. It would be completely and utterly ridiculous to have two extra pipes with a total of 4 load ports from the registers that only ever get used in call/lea. Nobody would ever make that kind of design -- I find it much more likely that the table in B.2 is simply wrong.

videoclone · Apr 19, 2011

Well, with just over a month to go until the e3expo release announcement of desktop Bulldozer parts, it's very strange we don't have any real leaked performance numbers from Asia. I would assume that with so little time left we would have a few parts floating around for testing! It's very suspicious! Hope it's to hide greatness, not shame!

Joseph F · Apr 19, 2011

Idontcare said:
I agree with both you and Joseph F.

To be sure the K8 was NOT merely a K7 with an integrated memory controller.

In the same respect though I think Joseph F was actually intending for the comment to be a compliment to the underlying microarchitecture itself.

The Athlon formed the basis of a strong pedigree of successive architectures that built upon the foundation that the K7 formed. That's something to be proud of IMO.

Yes, what I was trying to say is that K7 is an awesome architecture, and the fact that it's still going strong today (in a heavily modified form) is proof of that. However, it is time for a change, and hopefully BD will be AMD's next K7 Athlon.

Joseph F · Apr 19, 2011

RobertPters77 said:
They were hypothetical algebraic equations based on the information released. If at the same clock speed BD does 50% more 'throughput' with 33% more cores, then define throughput.

Performance per clock is what matters most. I don't care what anyone says. If CPU 'A' does 100 ops at 1GHz while CPU 'B' does 50 ops at the same speed, then I'll take CPU A. The productivity boost speaks for itself. Of course price is a concern, but in the professional space all that matters is getting things done faster.

What if, say, CPU A completes 100 MFLOPS at 1GHz and CPU B completes 150 MFLOPS at 2GHz, while they cost the same and have the same heat/power consumption?
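
The arithmetic behind the two hypotheticals in this exchange can be written out directly. This is only a sketch: the helper names are made up, and treating "33% more cores" as a core-count ratio of 4/3 is an assumption.

```python
def per_clock(mflops, clock_ghz):
    """Work done per GHz of clock, a crude stand-in for per-clock performance."""
    return mflops / clock_ghz


def per_core_gain(throughput_ratio, core_ratio):
    """Per-core throughput ratio implied by a total gain spread over more cores."""
    return throughput_ratio / core_ratio


# CPU A vs. CPU B from the post above.
a_per_clock = per_clock(100.0, 1.0)   # MFLOPS per GHz for CPU A
b_per_clock = per_clock(150.0, 2.0)   # MFLOPS per GHz for CPU B
b_speedup = 150.0 / 100.0             # what the user actually experiences

# "50% more throughput with 33% more cores" at the same clock speed.
bd_per_core = per_core_gain(1.5, 4.0 / 3.0)

print(f"A: {a_per_clock:.0f} MFLOPS/GHz, B: {b_per_clock:.0f} MFLOPS/GHz")
print(f"B finishes {b_speedup:.2f}x the work of A per second")
print(f"implied per-core gain: {bd_per_core:.3f}x")
```

CPU B does less work per clock (75 vs. 100 MFLOPS per GHz) yet gets 1.5x as much done per second, and 50% more throughput from a third more cores works out to about 12.5% more per core at the same clock.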

podspi · Apr 19, 2011

videoclone said:
Well, with just over a month to go until the e3expo release announcement of desktop Bulldozer parts, it's very strange we don't have any real leaked performance numbers from Asia. I would assume that with so little time left we would have a few parts floating around for testing! It's very suspicious! Hope it's to hide greatness, not shame!

AMD should win an award. Good or bad performance, we have no idea 😱

Topweasel · Apr 19, 2011

Joseph F said:
What if, say, CPU A completes 100 MFLOPS at 1GHz and CPU B completes 150 MFLOPS at 2GHz, while they cost the same and have the same heat/power consumption?

That is something I have been trying to explain to him for the last month. What matters is what gets the most things done in the most programs in the shortest amount of time. Not which has the best IPC, clock speed, L2 cache, L3 cache, faster memory, and so on. It's the best combination of all of the above that matters. It's what made the Core 2 Duo such a contender. It wasn't the better CPU in most of those categories, but it was close enough in some that its lead in others made it a good CPU. The K10 arch has suffered by being too far behind in some and not far enough ahead in others.

BladeVenom · Apr 19, 2011

Idontcare said:
Yeah Fritz Chess is a crazy memory-speed dependent bench/app:
http://i272.photobucket.com/albums/jj163/idontcare_photo_bucket/FritzChessMemoryBandwidthScaling.png

^ that's with a lowly 2.4GHz core clock, the dependence (slope of the line) on memory bottleneck for this app would be all the higher if the clocks were 4GHz.

That's because of hash tables, correct?

exar333 · Apr 19, 2011

podspi said:
AMD should win an award. Good or bad performance, we have no idea 😱

The 'keep a lid on it' award. 🙂

Idontcare · Apr 19, 2011

BladeVenom said:
That's because of hash tables, correct?

I really don't know, never thought about it to be honest.

HW2050Plus · Apr 19, 2011

Topweasel said:
That is something I have been trying to explain to him for the last month. What matters is what gets the most things done in the most programs in the shortest amount of time. Not which has the best IPC, clock speed, L2 cache, L3 cache, faster memory, and so on. It's the best combination of all of the above that matters. It's what made the Core 2 Duo such a contender. It wasn't the better CPU in most of those categories, but it was close enough in some that its lead in others made it a good CPU. The K10 arch has suffered by being too far behind in some and not far enough ahead in others.

Are you kidding? Core 2 is superior to K10.5 in practically every design aspect:
* Execution width (4 vs. 3)
* cache latency
* memory bandwidth/latency (since Nehalem)
* instruction latency
* prefetch
* branch prediction
* scheduler depth
* L/S reordering
* SMT
* Micro-op cache (since Sandy Bridge)
and so on.

And regarding these capabilities Bulldozer is a step back (exec width, cache latency, inst. latency) and a step forward (prefetch, branch prediction, L/S reordering) compared to K10.5. Altogether a step back, though by the high clock design they can surpass the K10.5. But compared to the latest Core2 incarnations it's way behind.

The big issue is that they did not achieve their design goals (perf/die size), which means a simple (enhanced or not) 8-core Llano would be roughly equal. Okay, Bulldozer could turn out to be a TDP wonder - we have to see; that would enable some other options. So again AMD has to sell their dies cheaper than Intel. In the mid/long term I doubt that AMD can survive this. On the other hand, Bulldozer may be revised in a version 2 to fix many of the issues.

And you should especially consider that exactly when Intel increased ALU width from 2 to 3 by adding issue port 5, AMD lost the performance crown.
Now AMD even drops its width from K8-K10.5 of 3 to Bulldozer of 2 and you think you can expect a performance increase? Overall you will get an increase, of course, because of high clock and CMT.

podspi said:
AMD should win an award. Good or bad performance, we have no idea 😱

Oh we have a very good idea. The question is only if this is considered as good or bad. And that mainly depends on price (die size/yield) and TDP. As we can expect not to get information about price and TDP before release we have to wait until then.

Topweasel · Apr 19, 2011

HW2050Plus said:
Are you kidding? Core 2 is superior to K10.5 in practically every design aspect:
* Execution width (4 vs. 3)
* cache latency
* memory bandwidth/latency (since Nehalem)
* instruction latency
* prefetch
* branch prediction
* scheduler depth
* L/S reordering
* SMT
* Micro-op cache (since Sandy Bridge)
and so on.

And regarding these capabilities Bulldozer is a step back (exec width, cache latency, inst. latency) and a step forward (prefetch, branch prediction, L/S reordering) compared to K10.5. Altogether a step back, though by the high clock design they can surpass the K10.5. But compared to the latest Core2 incarnations it's way behind.

The big issue is that they did not achieve their design goals (perf/die size), which means a simple (enhanced or not) 8-core Llano would be roughly equal. Okay, Bulldozer could turn out to be a TDP wonder - we have to see; that would enable some other options. So again AMD has to sell their dies cheaper than Intel. In the mid/long term I doubt that AMD can survive this. On the other hand, Bulldozer may be revised in a version 2 to fix many of the issues.

And you should especially consider that exactly when Intel increased ALU width from 2 to 3 by adding issue port 5, AMD lost the performance crown.
Now AMD even drops its width from K8-K10.5 of 3 to Bulldozer of 2 and you think you can expect a performance increase? Overall you will get an increase, of course, because of high clock and CMT.

Oh we have a very good idea. The question is only if this is considered as good or bad. And that mainly depends on price (die size/yield) and TDP. As we can expect not to get information about price and TDP before release we have to wait until then.

I don't quite understand where you're coming from, except that I said it was a good challenger? Wasn't it? I mean, they pretty much traded hits back and forth between launches until Nehalem was launched. But you seem hell-bent on proving that BD is not just uncompetitive with current CPUs, but is actually slower than any CPU launched by AMD in the last 4 years. It seems like the people in the GW crowd, always looking solely at examples that prove the theory while almost completely ignoring or tossing aside as insignificant anything that doesn't support it. I am comfortable admitting that the Core 2 Duo was a better solution at the time. In fact, I think one of the reasons they held back a bit and allowed it to be a little more even than it should have been early on is that Intel was on its best manners because of the raids, the charges in Japan and the EU, and overall worry about what was going to happen here.

We are a month or two from knowing the true answer, but if you are sure, ask yourself: why would AMD spend billions developing a CPU that runs slower than their previous unit? Remember that there was a K9 at one point, and it was dumped because they had issues with its overall performance. Why not ride 10.5 and Llano hard instead of using this same architecture for future Llano options? Think about that, maybe reanalyze your performance numbers and clock speed estimates, and you might have a better idea. Will it beat SB? I have no idea. Will it beat K10.5? It would have to. Otherwise it would die, like the 300 series from Nvidia.

Abwx · Apr 19, 2011

HW2050Plus said:
Now AMD even drops its width from K8-K10.5 of 3 to Bulldozer of 2.

This assumption is not true, yet all your twisted logic is built upon this erroneous point...

BD, as pointed out by AMD, is 4-issue wide for each integer core.

How they manage to do it is still unknown, but as I already posted, the optimisation manual says explicitly that the AGLUs perform not only address generation but also logical and arithmetic operations...

RobertPters77 · Apr 19, 2011

I really really hope BD succeeds. Both as a consumer and as a shareholder (just bought 100 shares of AMD stock). Reading all those Intel success stories makes me really peeved that I didn't buy their stock in 2008 when it was at $15 a share.

Fox5 · Apr 19, 2011

Phenom wasn't really an improvement over Athlon X2 when it came out. It lost so much in clock speed (due to being quad core), that it wasn't really outperforming the best Athlon X2s, per core. It could happen again, I suppose, but bad luck on AMD's part!

videoclone · Apr 19, 2011

But they are also releasing a quad-core Bulldozer, plus 6- and 8-core parts.

At the end of the day the 4-core will be faster simply due to the die shrink alone! Clocks will go up along with performance thanks to 32nm.

When Bulldozer shrinks to 22nm, and I hope they get it done sooner rather than later, we will really start to see some jumps in performance.

I hope they get to that level with 32nm, but if they don't, the 22nm shrink will be the thing that really shows off the advantage AMD's design approach has over Intel's...

16 cores at the same TDP and GHz as an 8-core 32nm chip today, anyone?

drizek · Apr 19, 2011

Good point about the 4-core being fast. Everybody forgets about the quad core, but Sandy Bridge only has 4 cores, and if AMD has good IPC and good clocks on these things, even the quad dozer should make at least a good value competitor to Sandy B.

I never really thought about how much of an upgrade the 8-core will be for me. Going from 3 cores at 3.6GHz to 8 cores at 4GHz+(hopefully) will be a pretty major upgrade, probably the single biggest jump in (raw) performance ever for me. Of course, it is also probably the least useful jump in performance ever, but I can't help it.

videoclone · Apr 20, 2011

A lot of people do say that 8 cores is overkill, but if you think back ONLY 3 years ago people were saying the same about 4 😀

gfody · Apr 20, 2011

GloFo has the more cost-effective tech for 32-28nm, no? So even if the new BD architecture doesn't totally crush SB, it ought to be more affordable for 32nm parts? If that's the case I expect nothing short of complete dominance in the server market.

Absolute performance crown on the desktop is not likely to come from a server chip so I wouldn't get my hopes up.

Joseph F · Apr 20, 2011

Topweasel said:
That is something I have been trying to explain to him for the last month. What matters is what gets the most things done in the most programs in the shortest amount of time. Not which has the best IPC, clock speed, L2 cache, L3 cache, faster memory, and so on. It's the best combination of all of the above that matters. It's what made the Core 2 Duo such a contender. It wasn't the better CPU in most of those categories, but it was close enough in some that its lead in others made it a good CPU. The K10 arch has suffered by being too far behind in some and not far enough ahead in others.

I guess some people are diehard Intel fanboys that just want to stick with their brand. And that's totally fine as it's their money and I expect and hope that they spend it on what makes them happy the most; whether it be Intel, AMD or VIA.

RobertPters77 said:
I really really hope BD succeeds. Both as a consumer and as a shareholder (just bought 100 shares of AMD stock). Reading all those Intel success stories makes me really peeved that I didn't buy their stock in 2008 when it was at $15 a share.

👍

itsmydamnation · Apr 20, 2011

HW2050Plus said:
And regarding these capabilities Bulldozer is a step back (exec width, cache latency, inst. latency) and a step forward (prefetch, branch prediction, L/S reordering) compared to K10.5. Altogether a step back, though by the high clock design they can surpass the K10.5. But compared to the latest Core2 incarnations it's way behind.

If you ignore op fusion,
you have no idea how well the latency of L2 can be hidden. SB and Bulldozer have the same L1D latency, and Bulldozer can have a very high number of outstanding requests to L2 as well (23, I think).
Also, AMD instructions can be more complex than Intel's, so latency without context doesn't mean much.
Also, aren't the pipelines decoupled in BD vs. K10h, allowing high utilization of int resources per clock?
Prefetch and branching no longer mess each other up.
More load/store throughput per core per clock.

I'm not seeing a lot to justify your position.

edit: on the core width thing, redpriest on semiaccurate has said that each core also has 4 return buses.
edit2: L2 not L1D

JFAMD · Apr 20, 2011

videoclone said:
16 cores at the same TDP and GHz as an 8-core 32nm chip today, anyone?

16 cores is the same TDP as the 12-core Opteron today.

zebrax2 · Apr 20, 2011

JFAMD said:
16 cores is the same TDP as the 12-core Opteron today.

He was talking about the possible outcome when manufactured at 22nm.
