Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

Status
Not open for further replies.

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
HW2050Plus: If I make a WAG, the increased latency is part of a trade-off. I'm guessing that is to reach higher clocks. If that results in the part clocking higher than otherwise (even 5-10%), it would be worth it, since higher clocks help everything, not just the specific instructions. Everything is always a trade-off.

On the topic of execution resources: performance is not bound by the execution units alone, but they are important. The info about ALUs and AGUs is a positive thing.
 

Riek

Senior member
Dec 16, 2008
409
15
76
HW2050Plus: If I make a WAG, the increased latency is part of a trade-off. I'm guessing that is to reach higher clocks. If that results in the part clocking higher than otherwise (even 5-10%), it would be worth it, since higher clocks help everything, not just the specific instructions. Everything is always a trade-off.

On the topic of execution resources: performance is not bound by the execution units alone, but they are important. The info about ALUs and AGUs is a positive thing.

But according to the instruction latencies the AGLU can only do LEA (partial) and CALL (partial), so either that table is incomplete or I'm reading it wrong. (Only the EX0 and EX1 pipes can do an add, for example.)
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
HW2050Plus: If I make a WAG, the increased latency is part of a trade-off. I'm guessing that is to reach higher clocks. If that results in the part clocking higher than otherwise (even 5-10%), it would be worth it, since higher clocks help everything, not just the specific instructions. Everything is always a trade-off.

On the topic of execution resources: performance is not bound by the execution units alone, but they are important. The info about ALUs and AGUs is a positive thing.
In general I agree. But for those two in particular, no trade-off for high frequency is necessary. For integer SSE it is not slower than Phenom, but they carried this old disadvantage compared to Intel CPUs over to Bulldozer. There is no loss there, but a necessary (and easy) gain is missing.

With fdiv it is even more strange, as equivalent divps is fast. So it has nothing to do with the high-frequency design, and nothing to do with die space either, as the fast floating-point division unit is present in Bulldozer; very present, as it has even been doubled. But when the fdiv instruction is executed, it uses another division unit. That means it is not only slower but also a waste of die space. Maybe not much die space, as it is a slow unit, but since several fast units are already present it is, as I say, quite strange.

And yes, performance is not directly equivalent to execution resources, but they are a huge factor. Of course decoders, prediction of branches and accesses, cache hierarchy and sizes, memory bandwidth, etc. play an important role in performance. But it is clear that this 3->4 wide execution helps, as it also helped Intel Core CPUs (also 3->4, plus the additional port 5) to significantly improve performance.

How well, or not, all this works together we will see when real benchmarks come out. There is a lot of potential and some very good solutions, e.g. the new AGLU feature, but as I mentioned also two flaws in the design: an old one already known from Phenom & Co. (integer SSE) and a new one (fdiv).
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,667
2,537
136
But according to the instruction latencies the AGLU can only do LEA (partial) and CALL (partial), so either that table is incomplete or I'm reading it wrong. (Only the EX0 and EX1 pipes can do an add, for example.)

The table is not just incomplete, it is clearly wrong in many respects. For example, according to it no memory operations require any AGU pipes. I think it's better to just ignore that table for now.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,667
2,537
136
With fdiv it is even more strange, as equivalent divps is fast.

fdiv and divps are not equivalent -- fdiv has a minimum of 64 bits of precision in the mantissa.

It's entirely possible that fdiv starts with the div for doubles, and then adds a round's worth of Newton-Raphson or Goldschmidt.
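
To illustrate the idea only (this is not a claim about how Bulldozer's divider actually works), here is a minimal Python sketch of one Newton-Raphson refinement step on a reciprocal. The coarse seed, rounded to about three significant digits, is a hypothetical stand-in for "the result of a lower-precision divider"; the function names are made up for this example.

```python
def refine_reciprocal(d, x):
    """One Newton-Raphson step for f(x) = 1/x - d, i.e. x' = x * (2 - d*x).

    Each step roughly squares the relative error of the reciprocal estimate,
    which is why a single extra round can add many bits of precision.
    """
    return x * (2.0 - d * x)


def divide(n, d, steps=1):
    """Compute n / d from a deliberately coarse reciprocal of d.

    The seed (1/d rounded to 3 significant digits) is an assumption standing
    in for a fast, lower-precision hardware result.
    """
    x = float(f"{1.0 / d:.3g}")  # coarse starting estimate
    for _ in range(steps):
        x = refine_reciprocal(d, x)
    return n * x


if __name__ == "__main__":
    exact = 1.0 / 3.0
    for steps in range(3):
        approx = divide(1.0, 3.0, steps)
        print(steps, approx, abs(approx - exact))
```

The point of the sketch is just the error behaviour: every round reuses multiply and subtract hardware, so getting from a fast lower-precision quotient to a wider one costs extra latency rather than a second full divider.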
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
In general I agree. But for those two in particular, no trade-off for high frequency is necessary. For integer SSE it is not slower than Phenom, but they carried this old disadvantage compared to Intel CPUs over to Bulldozer. There is no loss there, but a necessary (and easy) gain is missing.

We can't know. Because we're not designers working on Bulldozer. Only they'll know why they did. Who knows why they did? Even the best designs have flaws. We can only congratulate them for the final result.

Ok, it might sound like I'm dismissing the results too easily, but there's not much point in talking about it. Numbers are numbers.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
But according to the instruction latencies the AGLU can only do LEA (partial) and CALL (partial), so either that table is incomplete or I'm reading it wrong. (Only the EX0 and EX1 pipes can do an add, for example.)
You are right, the description in chapter 2.10 does not fit with the instruction latencies, where the AGLU would not even be capable of doing a lea on its own, nor a call. Even worse, it appears they need to execute in both EX0/1 and the AGLU. However, for memory operations it does not state the use of the AGLU to generate the address (e.g. add rdx, [qword ptr] z). Therefore there are mistakes in the latency table anyway, and the throughput indicator is missing as well.

Another obstacle: the description on page 36 sounds especially strange. It looks as if this document is not final. Also, a lot of values are still missing. Let's hope that this resolves to a "simple ALU capability", because otherwise it would look more like a two-pipeline design, which would be really bad.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
It's entirely possible that fdiv starts with the div for doubles, and then adds a round's worth of Newton-Raphson or Goldschmidt.
Okay, in my opinion that is the only explanation that makes sense: use the fast divider (of divps) first and get the additional bits some other way.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
It will be interesting looking back at this thread come June when we may finally see some real life BD products/benchmarks.
 

ShadowVVL

Senior member
May 1, 2010
758
0
71
I forget, was Llano for desktop launching with BD, or before or after BD?

I was hoping it would launch before or with BD. I want to see if Llano will be any good before getting BD.
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
It will be interesting looking back at this thread come June when we may finally see some real life BD products/benchmarks.

Yea, I also really can't wait to read some of the more technical writeups of the architecture. It looks like some portions of it may be quite novel.

I forget, was Llano for desktop launching with BD, or before or after BD?

I was hoping it would launch before or with BD. I want to see if Llano will be any good before getting BD.

I don't think the market segments are going to intersect much. Llano's magic is that it is going to be cheaper for OEMs (and thus consumers) to buy Llano than an Athlon II X4 + 5570/5670. Boy it is going to be embarrassing if my discrete graphics card is slower than new IGP D:

Let's give it up. It's not going to be 50 percent faster; Intel Ivy Bridge will prove that.

Yes, IB is going to be so fast it actually makes other processors slower after the fact :awe:
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
With the latest information from the optimization guide, my expectations for Bulldozer are back to what I had in the beginning after Hot Chips. No, it is even worse.

I thought about the decoders of Bulldozer. Let's look at the module. The decoders can decode 4 ops/cycle. Now, AMD claims there are 12 pipelines that want to be fed by that. How can this work? The new and quite strange information in the optimization guide, plus other published information, gives the clue. Each of the 4 decoders decodes an instruction into one or two micro-ops. E.g. add rdx, rax is decoded into one µop, and add rdx, [qword ptr] z is decoded into two µops, one for the memory access and one for the add.

If we eliminate the micro-op intermezzo, the result is that each Bulldozer core can do 2 macro-ops/cycle, plus the separate vector scheduler.

So we can safely assume that a Bulldozer core is slower than a PII/Llano core. By what margin exactly remains to be determined. To compensate for these slower cores we have a high-frequency design. However, this design in turn reduces the per-clock performance of the Bulldozer core. On the other hand, the cores can be clocked higher, e.g. at 4.5 GHz.

And we got CMT for Bulldozer, which gives ~80% speedup.

So let's see what we have with an 8 core Bulldozer compared to a current Phenom II in integer:

BDPerf = PIIPerf * 0.8 // -Reduction in core capability, +Core Improvements
BDPerf = PIIPerf * 0.8 * 1.2 // + Higher clock (4.5 GHz), - cost of high freq. design
BDPerf = PIIPerf * 0.8 * 1.2 * 1.8 // CMT
results in:
BDPerf = PIIPerf * 1.728
which means a Bulldozer is 1.7 to 1.8 times faster than a Phenom II
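
For anyone who wants to play with the guesses, the chain of factors above can be written as a tiny Python sketch. The three default values are the estimates from this post, not measurements, and the function and parameter names are made up for illustration.

```python
def bulldozer_estimate(core_factor=0.8, clock_factor=1.2, cmt_factor=1.8):
    """Relative integer performance of an 8-core Bulldozer vs. a Phenom II.

    core_factor:  guessed net effect of reduced core capability and core improvements
    clock_factor: guessed net effect of higher clocks minus the cost of the
                  high-frequency design
    cmt_factor:   guessed CMT speedup (~80%)
    All three are the post's assumptions, not measured values.
    """
    return core_factor * clock_factor * cmt_factor


if __name__ == "__main__":
    print(f"BDPerf = PIIPerf * {bulldozer_estimate():.3f}")
    # Changing a single guess shows how sensitive the result is to it.
    print(f"with core_factor=0.7: {bulldozer_estimate(core_factor=0.7):.3f}")
```

Since the result is a plain product, a 10% error in any one guess moves the final figure by the same 10%.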

Sounds good so far and able to beat a current Sandy Bridge.

But I assume that Bulldozer is actually a failure.

Why?

Now let's look at performance relative to die space utilization.

4 BD module = 120 mm²
8 MB L3 cache = 60 mm²
Uncore = 100 mm²
~280 mm² in total

So let's see what Llano could do with that die space:
8 Llano cores = 80 mm²
12 MB L3 cache = 90 mm²
Uncore = 100 mm²
~270 mm²

So regarding die space consumption Llano has an advantage.

But how is it with performance?

Llano8CPerf = 1.05 * PII4CPerf // better clock due to 32 nm
Llano8CPerf = 1.9 * 1.05 * PII4CPerf // +double core count, -memory bandwidth
final result:
Llano8CPerf = 2 * PII4CPerf

That means Llano will offer more performance on less die space.
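
A rough way to put both estimates side by side is performance per mm². The Python sketch below only uses the guessed numbers from this post (the 0.8 * 1.2 * 1.8 Bulldozer factor, the 1.9 * 1.05 Llano factor, and the two die-area breakdowns) and assumes both are relative to the same Phenom II baseline; the names are hypothetical.

```python
def perf_per_mm2(relative_perf, die_area_mm2):
    """Relative performance (Phenom II = 1.0) divided by die area in mm^2."""
    return relative_perf / die_area_mm2


# Bulldozer: 4 modules + 8 MB L3 + uncore (area guesses from the post)
BD_AREA = 120 + 60 + 100          # ~280 mm^2
BD_PERF = 0.8 * 1.2 * 1.8         # guessed factor vs. Phenom II

# Hypothetical 8-core Llano: 8 cores + 12 MB L3 + uncore (area guesses from the post)
LLANO_AREA = 80 + 90 + 100        # ~270 mm^2
LLANO_PERF = 1.9 * 1.05           # guessed factor vs. Phenom II

bd_density = perf_per_mm2(BD_PERF, BD_AREA)
llano_density = perf_per_mm2(LLANO_PERF, LLANO_AREA)
advantage = llano_density / bd_density

if __name__ == "__main__":
    print(f"Bulldozer: {bd_density:.5f} per mm^2")
    print(f"Llano 8C:  {llano_density:.5f} per mm^2")
    print(f"Llano advantage: {advantage:.2f}x")
```

With these inputs the hypothetical 8-core Llano comes out roughly a fifth ahead per mm², but every input here is a guess, so the gap could easily shrink or flip with real die measurements.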

Bulldozer features FMA, AVX, SSE4.1, SSE4.2, AES. Bulldozer's float performance is even worse than that of a Llano part, integer SSE is much slower, and float SSE might be the same.

I am very afraid that Bulldozer was a step backwards. It just consumes way too much die space for what it offers. Sure, it is faster by delivering more cores with CMT, but you could have had all this without investing so much R&D into that design.

CMT was a design failure. If you look at core sizes, CMT is a great idea, but it obviously does not work out. SMT would be much better, and it would work even better on an AMD core because the decoder bandwidth is higher.

If you compare the die space consumption of Sandy Bridge and Bulldozer, you can only cry. Sandy Bridge is so smart in this regard, even with a GPU on it. If you drop that, you get an 8C/16T Sandy Bridge in the same die space as a 4M/8C Bulldozer.

The real problem is that with CMT AMD can only provide few execution resources, because otherwise they immediately run into decoder stalls. So there is also a problem with fixing the performance. Because of the linearity of a program, it would be more costly to further increase decoding power than to get away from CMT.

So what should a Bulldozer Version 2 look like to fix the upcoming problems of AMD?

1.) Consideration: how to fix CMT.
The only advantage I would see in CMT would be the sharing of a vector unit by two cores. So the options are either to get away from CMT or to extend it to two decoder units and 2 I-caches. But then adding another vector unit would cost only a little more and would remove all the special handling needed because of CMT.
2.) Implement SMT (costs little/gains a lot)
3.) Fix integer SSE
4.) Add an ALU unit/pipe to the integer core (costs about nothing, gains like hell)
5.) Reconsider the high-frequency design. Is it really worth it?
6.) Get the abnormally high uncore die consumption fixed.
 

tweakboy

Diamond Member
Jan 3, 2010
9,517
2
81
www.hammiestudios.com
Yea, I also really can't wait to read some of the more technical writeups of the architecture. It looks like some portions of it may be quite novel.



I don't think the market segments are going to intersect much. Llano's magic is that it is going to be cheaper for OEMs (and thus consumers) to buy Llano than an Athlon II X4 + 5570/5670. Boy it is going to be embarrassing if my discrete graphics card is slower than new IGP D:



Yes, IB is going to be so fast it actually makes other processors slower after the fact :awe:


Wow I didn't know that.. :0
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,254
16,110
136
What do you mean dont post when your on meds ???


Should I take that as a joke or a insult or what.

nice rig btw.
thx gb

In the past you have said that you are on medication. Seeing the ridiculous posts you have made recently, I assumed you were on your meds again. If I am wrong, then I will stick to just facts.
 

SickBeast

Lifer
Jul 21, 2000
14,377
19
81
By the way, I just want to say that this thread is great. It's gone on so many tangents, yet it has been interesting and entertaining still.

I think that when there is a lack of legitimate rumors out there, this thread is going to get sidetracked.
 