Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

IntelUser2000 · Apr 8, 2011

HW2050Plus: If I make a WAG, the increased latency is part of a trade-off. I'm guessing that is to reach higher clocks. If that results in the part clocking higher than otherwise (even 5-10%), it would be worth it, since higher clocks help everything, not just the specific instructions. Everything is always a trade-off.

On the topic of execution resources: performance is not bound by the execution units alone, but they are important. The info about the ALUs and AGUs is a positive thing.

Riek · Apr 8, 2011

IntelUser2000 said:
HW2050Plus: If I make a WAG, the increased latency is part of a trade-off. I'm guessing that is to reach higher clocks. If that results in the part clocking higher than otherwise(even 5-10%), it would be worth it, since higher clocks help everything, not just the specific instructions. Everything is always a trade off.

The topic of execution resources: The performance is not bound only by the execution units alone but are important. The info about ALUs and AGUs are a positive thing.

But according to the instruction latencies, the AGLU can only do LEA (partial) and CALL (partial), so either that table is incomplete or I'm reading it wrong. (Only the EX0 and EX1 pipes can do an add, for example.)

HW2050Plus · Apr 8, 2011

IntelUser2000 said:
HW2050Plus: If I make a WAG, the increased latency is part of a trade-off. I'm guessing that is to reach higher clocks. If that results in the part clocking higher than otherwise(even 5-10%), it would be worth it, since higher clocks help everything, not just the specific instructions. Everything is always a trade off.

The topic of execution resources: The performance is not bound only by the execution units alone but are important. The info about ALUs and AGUs are a positive thing.

In general I agree. But for those two in particular there is no reason for, and no trade-off needed for, high frequency. Integer SSE is not slower than on Phenom, but they carried this old disadvantage relative to Intel CPUs over to Bulldozer. Nothing is lost there, but a necessary (and easy) gain is missing.

With fdiv it is even stranger, as the equivalent divps is fast. So it has nothing to do with the high-frequency design, and nothing to do with die space either, as the fast floating-point division unit is present in Bulldozer; very present, as it has even been doubled. But when the fdiv instruction is executed it uses another division unit. That means it is not only slower but also a waste of die space. Maybe not much die space, as it is a slow unit, but since several fast units are already present it is, as I say, quite strange.

And yes, performance is not directly equivalent to execution resources, but they are a huge factor. Of course decoders, prediction of branches and accesses, cache hierarchy and sizes, memory bandwidth, etc. play an important role in performance. But it is fair to say that this 3->4 wide execution helps, as it also helped Intel Core CPUs (also 3->4, plus the additional port 5) to improve performance significantly.

How well, or not, all this works together we will see when real benchmarks come out. There is a lot of potential and some very good solutions, e.g. the new AGLU feature, but as I mentioned there are also two flaws in the design: an old one already known from Phenom & Co. (integer SSE) and a new one (fdiv).

Tuna-Fish · Apr 8, 2011

Riek said:
But according to the instruction latencies the AGLU can only do LEA (partial) and CALL (partial) so either that table is incomplete or i'm reading it wrong. (only EX0 and EX1 pipes can do add for example)

The table is not just incomplete, it is clearly wrong in many respects. For example, it shows no memory operations requiring any AGU pipes. I think it's better to just ignore that table for now.

Tuna-Fish · Apr 8, 2011

HW2050Plus said:
With fdiv it is even more strange, as equivalent divps is fast.

fdiv and divps are not equivalent -- fdiv has a minimum of 64 bits of precision in the mantissa.

It's entirely possible that fdiv starts with the div for doubles, and then adds a round of Newton-Raphson or Goldschmidt.
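
To illustrate the idea (a sketch only, not a claim about what Bulldozer's hardware actually does): one Newton-Raphson step on a reciprocal roughly doubles the number of correct bits, so an estimate good to double precision (53-bit mantissa) could in principle be pushed past the 64 bits fdiv needs. The Python below uses the decimal module as a stand-in for wider hardware arithmetic. The function name refine_reciprocal and the choice of b = 3.0 are purely illustrative.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # work well beyond double precision


def refine_reciprocal(b: float, steps: int = 1) -> Decimal:
    """Refine a double-precision estimate of 1/b with Newton-Raphson steps."""
    d = Decimal(b)        # exact decimal value of the double b
    r = Decimal(1.0 / b)  # starting estimate, good to roughly 53 bits
    for _ in range(steps):
        # Newton-Raphson iteration for f(r) = 1/r - b
        r = r * (2 - d * r)
    return r


b = 3.0
exact = Decimal(1) / Decimal(b)                   # reference at 50 digits
err_start = abs(Decimal(1.0 / b) - exact)         # error of the double estimate
err_refined = abs(refine_reciprocal(b) - exact)   # error after one step
print(f"before: {err_start:.3E}  after: {err_refined:.3E}")
```

The error after the step is about b times the square of the starting error, which is why a single extra round would be enough to cover the gap between a double result and an extended-precision one.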

IntelUser2000 · Apr 8, 2011

HW2050Plus said:
In general I agree. But in particular for those two there is no reason nor tradeoff necessary for high frequency. For integer SSE it is not slower as Phenom - but they took this old disadvantage compared to Intel CPUs over to Bulldozer. There is no loss there but a necessary (and easy) gain missing.

We can't know, because we're not the designers working on Bulldozer. Only they know why they did it. Even the best designs have flaws. We can only congratulate them for the final result.

Ok, it might sound like I'm dismissing the results too easily, but there's not much point in talking about it. Numbers are numbers.

HW2050Plus · Apr 8, 2011

Riek said:
But according to the instruction latencies the AGLU can only do LEA (partial) and CALL (partial) so either that table is incomplete or i'm reading it wrong. (only EX0 and EX1 pipes can do add for example)

You are right, the description in chapter 2.10 does not fit the instruction latencies, according to which the AGLU would not even be capable of doing an lea on its own, nor a call. Even worse, it appears that they need to execute in both EX0/1 and the AGLU. However, for memory operations the table does not state the use of the AGLU to generate the address (e.g. add rdx, [qword ptr] z). So there are mistakes in the latency table anyway, and the throughput indicator is missing as well.

Another obstacle: the description on page 36 sounds especially strange. It looks as if this document is not final. A lot of values are also still missing. Let's hope this resolves to a "simple ALU capability", because otherwise it would look more like a two-pipeline design, which would be really bad.

HW2050Plus · Apr 8, 2011

Tuna-Fish said:
It's entirely possible that fdiv starts with the div for doubles, and then adds a round of Newton-Raphson or Goldschmidt.

Okay, in my opinion that is the only explanation that makes sense: use the fast divider (of divps) first and get the additional bits some other way.

Edrick · Apr 8, 2011

It will be interesting looking back at this thread come June when we may finally see some real life BD products/benchmarks.

Tuna-Fish · Apr 8, 2011

Yes, and the other one. I already know that I have been horribly wrong at least once.

ShadowVVL · Apr 10, 2011

I forget, was Llano for desktop launching with BD, or before or after BD?

I was hoping it would launch before or with BD. I want to see if Llano will be any good before getting BD.

BTRY B 529th FA BN · Apr 10, 2011

exciting times ahead!

tweakboy · Apr 10, 2011

Let's give it up. It's not going to be 50 percent faster; Intel Ivy Bridge will prove that.

Markfw · Apr 10, 2011

tweakboy said:
Let's give it up. It's not going to be 50 percent faster; Intel Ivy Bridge will prove that.

This thread has nothing to do with Ivy Bridge. The title is about I7 and Phenom II

podspi · Apr 10, 2011

Edrick said:
It will be interesting looking back at this thread come June when we may finally see some real life BD products/benchmarks.

Yea, I also really can't wait to read some of the more technical writeups of the architecture. It looks like some portions of it may be quite novel.

ShadowVVL said:
I forget was llano for desktop launching with bd ,before or after bd?

I was hoping it would launch before or with bd. want to see if llano will be any good before getting bd.

I don't think the market segments are going to intersect much. Llano's magic is that it is going to be cheaper for OEMs (and thus consumers) to buy Llano than an Athlon II X4 + 5570/5670. Boy, it is going to be embarrassing if my discrete graphics card is slower than the new IGP D:

tweakboy said:
Let's give it up. It's not going to be 50 percent faster; Intel Ivy Bridge will prove that.

Yes, IB is going to be so fast it actually makes other processors slower after the fact :awe:

HW2050Plus · Apr 10, 2011

With the latest information from the optimization guide, my expectations for Bulldozer are back to where they were in the beginning, after Hot Chips. No, it is even worse.

I thought about the decoders of Bulldozer. Let's look at the module. The decoders can decode 4 ops/cycle. Now, AMD claims they have 12 pipelines that want to be fed by that. How can this work? The new and quite strange information in the optimization guide, plus other published information, gives the clue. The 4 decoders each decode one or two micro-ops. E.g. add rdx, rax is decoded into one µop, and add rdx, [qword ptr] z is decoded into two µops, one for the memory access and one for the add.

If we eliminate the micro-op intermezzo, the result is that each Bulldozer core can do 2 macro-ops/cycle plus the separate vector scheduler.

So we can safely assume that a Bulldozer core is slower than a PII/Llano core. By exactly what margin remains to be determined. To compensate for these slower cores we have a high-frequency design. However, this design in turn reduces the per-clock performance of the Bulldozer core. On the other hand, the cores can be clocked higher, e.g. at 4.5 GHz.

And we got CMT for Bulldozer, which gives ~80% speedup.

So let's see what we have with an 8 core Bulldozer compared to a current Phenom II in integer:

BDPerf = PIIPerf * 0.8 // -Reduction in core capability, +Core Improvements
BDPerf = PIIPerf * 0.8 * 1.2 // + Higher clock (4.5 GHz), - cost of high freq. design
BDPerf = PIIPerf * 0.8 * 1.2 * 1.8 // CMT
results in:
BDPerf = PIIPerf * 1.728
which means a Bulldozer is 1.7 to 1.8 times as fast as a Phenom II
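
Restating that estimate as a short calculation, in case anyone wants to plug in their own guesses. All three factors are the estimates from this post, not measurements, and the variable names are only for illustration.

```python
# Rough integer-throughput estimate: 8-core Bulldozer vs. Phenom II.
# Every factor here is a forum estimate, not a measured value.
core_factor = 0.8   # weaker core, net of core improvements
clock_factor = 1.2  # higher clock (4.5 GHz), net of high-frequency design cost
cmt_factor = 1.8    # CMT scaling (~80% from the second core in a module)

bd_vs_pii = core_factor * clock_factor * cmt_factor
print(f"BDPerf = PIIPerf * {bd_vs_pii:.3f}")
```

Because the factors multiply, the result is quite sensitive to each guess: changing the core factor from 0.8 to 0.9 alone moves the total from about 1.73 to about 1.94.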

Sounds good so far and able to beat a current Sandy Bridge.

But I assume that Bulldozer is actually a failure.

Why?

Now let's look at performance relative to die space utilization.

4 BD modules = 120 mm²
8 MB L3 cache = 60 mm²
Uncore = 100 mm²
~280 mm² in total

So let's see what Llano could do with that die space:
8 Llano cores = 80 mm²
12 MB L3 cache = 90 mm²
Uncore = 100 mm²
~270 mm²

So regarding die space consumption Llano has an advantage.

But how is it with performance?

Llano8CPerf = 1.05 * PII4CPerf // better clock due to 32 nm
Llano8CPerf = 1.9 * 1.05 * PII4CPerf // +double core count, -memory bandwidth
final result:
Llano8CPerf = 2 * PII4CPerf

This means Llano would offer more performance in less die space.
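
Putting the area and performance guesses side by side as performance per mm². This is only a sketch: every number is an estimate from this post, and it assumes both performance figures are relative to the same 4-core Phenom II baseline, which the post does not state explicitly.

```python
# Die-area estimates in mm^2 (forum guesses, not measured die sizes).
bd_area = 120 + 60 + 100     # 4 BD modules + 8 MB L3 + uncore
llano_area = 80 + 90 + 100   # 8 Llano cores + 12 MB L3 + uncore

# Performance relative to a Phenom II.
# ASSUMPTION: both estimates share the same 4-core Phenom II baseline.
bd_perf = 0.8 * 1.2 * 1.8    # core capability * clock * CMT
llano_perf = 1.9 * 1.05      # doubled core count * 32 nm clock gain

bd_density = bd_perf / bd_area
llano_density = llano_perf / llano_area
ratio = llano_density / bd_density

print(f"Bulldozer: {bd_perf:.2f}x on {bd_area} mm^2")
print(f"Llano 8C:  {llano_perf:.2f}x on {llano_area} mm^2")
print(f"Llano perf/mm^2 advantage: {ratio:.2f}x")
```

With these inputs the hypothetical 8-core Llano comes out about 1.2x ahead per mm², but the conclusion depends entirely on the 0.8 core factor and the area guesses.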

Bulldozer features FMA, AVX, SSE4.1, SSE4.2, AES. Bulldozer's float performance is even worse than that of a Llano part, integer SSE is much slower, and float SSE might be the same.

I am very afraid that Bulldozer was a step backwards. It just consumes way too much die space for what it offers. Sure, it is faster by delivering more cores with CMT, but you could have had all this without investing so much R&D in that design.

CMT was a design failure. If you look at core sizes, CMT is a great idea, but it obviously does not work out. SMT would be much better, and it would work even better on an AMD core because the decoder bandwidth is higher.

If you compare die space consumption of Sandy Bridge and Bulldozer then you can only cry. Sandy Bridge is so smart in this regard, with even a GPU on it. If you drop that, you get an 8C/16T Sandy Bridge in the same die space as a 4M/8C Bulldozer.

The real problem is that with CMT AMD can only provide few execution resources, because otherwise they immediately run into decoder stalls. So there is also a problem with fixing the performance. Because programs are linear, it would be more costly to further increase decoding power than to move away from CMT.

So what should a Bulldozer Version 2 look like to fix the upcoming problems of AMD?

1.) Consideration: how to fix CMT.
The only advantage I would see in CMT would be the sharing of a vector unit by two cores. So the options are either to move away from CMT or to extend it to two decoder units and 2 I-caches. But then adding another vector unit would cost little more and would remove all the special handling needed because of CMT.
2.) Implement SMT (costs little/gains a lot)
3.) Fix integer SSE
4.) Add an ALU unit/pipe to the integer core (costs about nothing, gains like hell)
5.) Reconsider the high-frequency design: is it really worth it?
6.) Get the abnormally high uncore die consumption fixed.

tweakboy · Apr 10, 2011

podspi said:
Yea, I also really can't wait to read some of the more technical writeups of the architecture. It looks like some portions of it may be quite novel.

I don't think the market segments are going to intersect much. Llano's magic is that it is going to be cheaper for OEMs (and thus consumers) to buy Llano than an Athlon II X4 + 5570/5670. Boy it is going to be embarrassing if my discrete graphics card is slower than new IGP D:

Yes, IB is going to be so fast it actually makes other processors slower after the fact :awe:

Wow I didn't know that.. :0

Idontcare · Apr 10, 2011

HW2050Plus said:
So we can safely assume that a Bulldozer core is slower than a PII/Llano core.

AMD (John F) has repeatedly stated in no uncertain terms that IPC for a bulldozer core is higher than a Thuban core.

tweakboy · Apr 10, 2011

Markfw900 said:
This thread has nothing to do with Ivy Bridge. The title is about I7 and Phenom II

Sorry conversation went there.

Markfw · Apr 10, 2011

tweakboy said:
Sorry conversation went there.

They were Joking....

Don't post while on meds please.

tweakboy · Apr 10, 2011

Markfw900 said:
They were Joking....

Don't post while on meds please.

What do you mean, don't post when you're on meds???

Should I take that as a joke or an insult or what?

nice rig btw.
thx gb

Markfw · Apr 10, 2011

tweakboy said:
What do you mean dont post when your on meds ???

Should I take that as a joke or a insult or what.

nice rig btw.
thx gb

In the past you have said that you are on medication. Seeing the ridiculous posts you have made recently, I assumed you were on your meds again. If I am wrong, then I will stick to just facts.

tweakboy · Apr 10, 2011

The medication is permanent and helps me feel good.

Everything else you said is rubbish. bg

SickBeast · Apr 10, 2011

Does "gb" stand for "glass bong"? I'm just asking. Is it medicinal marijuana?

SickBeast · Apr 10, 2011

By the way, I just want to say that this thread is great. It's gone on so many tangents, yet it has been interesting and entertaining still.

I think that when there is a lack of legitimate rumors out there, this thread is going to get sidetracked.
