AMD Bulldozer Dual-Interlagos Benchmarks On Linux


podspi

Golden Member
Jan 11, 2011
1,965
71
91
(edit) Note that it's not all roses -- they pay a high price for hunting higher clocks, with many instruction latencies going up. Based on released compiler latency numbers, they've ditched many pieces of special-purpose hardware, for example the floating point divider, and cache access became slower.

I thought a hardware divider was one of the things they added to the Husky cores? Or was that the integer one?
 

GlacierFreeze

Golden Member
May 23, 2005
1,125
1
0
No wonder some won't ever be happy with AMD; crazy-high and unrealistic expectations will always lead to let-downs. Near-5GHz turbo? Come on... The same people will say it sucks if it doesn't meet their baseless, made-up expectations. Sheesh lol.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,324
1,462
136
I thought a hardware divider was one of the things they added to the Husky cores? Or was that the integer one?

I don't know anything about that.

But the published FDIV latency is 42 cycles (compared to 19 in Phenom and 14 in SB...), which is 7 consecutive 6-cycle FMAs, which is "coincidentally" the number of operations needed for a full-precision software divide when you have FMA and a single extra rounding mode. I take this to mean that there is no hardware FDIV in BD.

FSQRT also got a huge increase, 35 => 52; however, this is still faster than any software FSQRT I know of, so I assume they have some special-purpose hardware left for it.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,324
1,462
136
No wonder some won't ever be happy with AMD; crazy-high and unrealistic expectations will always lead to let-downs. Near-5GHz turbo? Come on... The same people will say it sucks if it doesn't meet their baseless, made-up expectations. Sheesh lol.

So, what frequency do you expect the processors to hit, and what do you base this analysis on? Remember that one cycle in BD is 17 FO4, or 22% shorter than in Phenom.
 

ElFenix

Elite Member
Super Moderator
Mar 20, 2000
102,414
8,356
126
So, what frequency do you expect the processors to hit, and what do you base this analysis on? Remember that one cycle in BD is 17 FO4, or 22% shorter than in Phenom.

i wish i had any clue what you were talking about, because it sounds interesting
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
This Intel system scales better than linearly compared to the desktop part, e.g. the 6-core/12T scoring 61 seconds. So there is a memory limitation or HD limitation. The Interlagos streaming result really sucked, and with a terabyte-class desktop HD one might expect this to be a limiting factor as well.

Core i7 970 (6 cores, 2.67GHz)
With HT: 61.92
Without HT: 70.59

4x Xeon X7550 (32 cores, 2.0GHz)
13.47

Assuming the X7550 system has Hyperthreading enabled, linear scaling of the Core i7 970 down to 2GHz would give 82.7 seconds.

Assuming it then scales linearly to 32 cores, it would score 15.5 seconds, slightly slower than the 13.47 it actually gets. The scores aren't that far off expectations.

vs Magny Cours

2-socket Opteron 6168 (1.9GHz, 24 cores): 30.61
4-socket Opteron 6168 (1.9GHz, 48 cores): 15.43
2-socket Bulldozer (1.8GHz, 32 cores): 25.97

Hypothetical 24-core 1.9GHz Bulldozer: 32.8 (7% difference)
Hypothetical 48-core 1.9GHz Bulldozer: 16.4 (6% difference)

I don't see any real outlier here, considering per-core peak FP capabilities are the same between the two. A benchmark that actually stresses other parts of the subsystem might show advantages for Interlagos.
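
If anyone wants to check that back-of-envelope math, here's a quick Python sketch (the helper name is mine, and it assumes perfectly linear scaling with clock and core count, which no real system achieves):

def scale_time(base_time, base_clock, base_cores, new_clock, new_cores):
    # Render time under ideal linear scaling: time is inversely
    # proportional to clock speed and to core count.
    return base_time * (base_clock / new_clock) * (base_cores / new_cores)

# Core i7 970 with HT: 61.92 s at 2.67 GHz, 6 cores.
print(scale_time(61.92, 2.67, 6, 2.0, 6))   # ~82.7 s at 2.0 GHz
print(scale_time(61.92, 2.67, 6, 2.0, 32))  # ~15.5 s, vs 13.47 s actual for 4x X7550

# Bulldozer hypotheticals, scaled from the 32-core 1.8 GHz result (25.97 s):
print(scale_time(25.97, 1.8, 32, 1.9, 24))  # ~32.8 s vs 30.61 s (about 7% off)
print(scale_time(25.97, 1.8, 32, 1.9, 48))  # ~16.4 s vs 15.43 s (about 6% off)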
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,324
1,462
136
i wish i had any clue what you were talking about, because it sounds interesting

Ignoring heat and power, how fast a processor can clock is basically a measure of two things -- firstly, how quickly the physical transistors can switch and how quickly signals move through wires, and secondly, how many consecutive transistor switches and how much signal transit needs to happen in series on the longest path in the processor before a clock cycle is complete.

These things are basically independent, making it useful to have a standard idealized metric against which to compare both how long it takes for logic to complete and how quickly a certain silicon implementation can compute things. This metric is FO4: if I say that a logic device has a latency of 3 FO4, it means that regardless of the process where it's implemented, it will take as long as 3 consecutive NOT gates, each driving a load of 4 other NOT gates implemented on the same process. So a device with a logic length of 10 FO4 can be expected to clock twice as fast as another device on the same process with an FO4 of 20.

The rumor is (and I don't think there is any confirmation beyond "yes, it's a speed demon") that BD has an FO4 length of 17, compared to a known 22 for Phenom. So, looking only at transistor delay, BD should run at ~30% higher clock speed when implemented on the same process. Also, the GF 32nm SOI process is supposedly much better than their earlier processes, meaning the gain over Phenom should be higher than that.

This all was for the case where you are not limited by power. When you are, the gains should be more conservative, especially with AMD fitting so much more logic into a single chip. However, for the single-threaded case, the conservative baseline guess should be "whatever Phenom is clocking now" * 1.3 for architecture * roughly 1.1 for process = Holy ****.

disclaimers: FO4 is a much hairier metric than the simplified view I gave here -- wire delay and transistor switching time do not shrink in unison, so comparisons across too many process generations aren't very reliable. A more in-depth look at FO4 can be found at RWT.
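
To put those numbers in one place, a trivial Python sketch of the same arithmetic (the 3.7GHz baseline is the fastest current Phenom II; the 1.1 process factor is a rough guess, not a known figure):

fo4_phenom = 22    # known
fo4_bd     = 17    # rumored

# On the same process, achievable clock scales inversely with FO4 count.
arch_gain = fo4_phenom / fo4_bd           # ~1.29x, the "30% higher clock"
process_gain = 1.1                        # rough guess for GF 32nm SOI

baseline_ghz = 3.7                        # Phenom II X4 980 BE
print(baseline_ghz * arch_gain * process_gain)  # ~5.3 GHz, ignoring power limits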
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
So a device with a logic length of 10 FO4 can be expected to clock twice as fast as another device on the same process with an FO4 of 20.

FO4 is a process-normalized metric, so 17 FO4 is an absolute value. 17 means it's 22/17 = ~29% faster than the previous chips. For all we care, they could have used a 0.25µ process to achieve a 17 delay. But it doesn't matter. Whatever they did got 17 in the end.

The fastest chip in that regard still remains the 90nm Prescott, with a 12 FO4 delay.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
Quite an interesting result. This is the first benchmark showing bad results for Bulldozer. Not good, but it's unclear whether this is indicative.

Slower per core, per clock than Magny-Cours is really not good. Does anybody have more information about C-Ray? Does it use the x87 FPU or SSE?

Anyway, any new benchmark is welcome; maybe we'll see some more, but currently we only see new ones from Interlagos.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Ignoring heat and power, how fast a processor can clock is basically a measure of two things -- firstly, how quickly the physical transistors can switch and how quickly signals move through wires, and secondly, how many consecutive transistor switches and how much signal transit needs to happen in series on the longest path in the processor before a clock cycle is complete.

These things are basically independent, making it useful to have a standard idealized metric against which to compare both how long it takes for logic to complete and how quickly a certain silicon implementation can compute things. This metric is FO4: if I say that a logic device has a latency of 3 FO4, it means that regardless of the process where it's implemented, it will take as long as 3 consecutive NOT gates, each driving a load of 4 other NOT gates implemented on the same process. So a device with a logic length of 10 FO4 can be expected to clock twice as fast as another device on the same process with an FO4 of 20.

The rumor is (and I don't think there is any confirmation beyond "yes, it's a speed demon") that BD has an FO4 length of 17, compared to a known 22 for Phenom. So, looking only at transistor delay, BD should run at ~30% higher clock speed when implemented on the same process. Also, the GF 32nm SOI process is supposedly much better than their earlier processes, meaning the gain over Phenom should be higher than that.

This all was for the case where you are not limited by power. When you are, the gains should be more conservative, especially with AMD fitting so much more logic into a single chip. However, for the single-threaded case, the conservative baseline guess should be "whatever Phenom is clocking now" * 1.3 for architecture * roughly 1.1 for process = Holy ****.

disclaimers: FO4 is a much hairier metric than the simplified view I gave here -- wire delay and transistor switching time do not shrink in unison, so comparisons across too many process generations aren't very reliable. A more in-depth look at FO4 can be found at RWT.

Awesome post :thumbsup:
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Yes. That's sort of what I was attempting to say, but apparently I'm horrible at it.

I got what you were saying, but that might be because you were talking my language to begin with. I think IntelUser may have distilled and transcoded your statement into something more readily digestible by the larger audience here, which is a win for everyone. :D

That said, I still say you made a great post :thumbsup:

And IntelUser, a 12 FO4 is monstrously fast. I bet if they had actually put Netburst on a 32nm node they would have well exceeded 10GHz while keeping TDPs under 140W. Not that it would have outperformed Sandy Bridge, but just saying I don't think Intel was as clueless about what they were aiming to accomplish as some would have us believe.
 

podspi

Golden Member
Jan 11, 2011
1,965
71
91
And IntelUser, a 12 FO4 is monstrously fast. I bet if they had actually put Netburst on a 32nm node they would have well exceeded 10GHz while keeping TDPs under 140W. Not that it would have outperformed Sandy Bridge, but just saying I don't think Intel was as clueless about what they were aiming to accomplish as some would have us believe.


Yes, I think they were just overconfident in how fast they would be able to ramp up clockspeeds, and AMD's higher-IPC CPUs just flew past them.

Does anybody know Sandy Bridge's FO4? I'm just curious and can't seem to find it anywhere...
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
And IntelUser, a 12 FO4 is monstrously fast. I bet if they had actually put Netburst on a 32nm node they would have well exceeded 10GHz while keeping TDPs under 140W. Not that it would have outperformed Sandy Bridge, but just saying I don't think Intel was as clueless about what they were aiming to accomplish as some would have us believe.
Monstrously fast, but only on paper. The only thing you get is a high clock rate and low performance. To get that maybe-70% boost in raw clock they had to increase latencies by 200-400%. Just have a look at the instruction latencies in the appendix of the Intel optimization guide and look for a family_F processor. It is just shocking how slow Prescott was.

I mean, nothing is won by letting three clock cycles pass to do one thing rather than doing it in one clock cycle at a slightly lower clock rate.

For BD this reduction comes with only one penalty: as far as I understood an AMD engineer describing it, they removed the ability to use another pipe's result in the same cycle. Not a big feature to drop.

There is an IBM study out saying that 17 FO4 is the ideal for OoO processors. I don't know exactly how many FO4 the Intel Core architecture has; likely in the range of 18-20, which is also close to the optimum. Even 22 was not bad for Phenom, though it is questionable whether the design was optimal for that "large" an FO4 count.

So the clock is determined by two things, the process FO4 gate delay and the FO4 count, where

clock = 1 / (FO4count * processFO4gatedelay)

As you can see, the FO4 count directly pushes the clock, as Tuna-Fish wrote, and the process will add something on top. On the other hand, we have a new process that is just ramping up, so the starting clocks are difficult to estimate; but when the process is mature we will see BD running at >4.5 GHz. It is also interesting how much of this clock gain AMD will spend on Turbo.
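
To put numbers on that formula, a quick Python sketch (the per-FO4 gate delay below is a made-up placeholder; the real figure for GF's 32nm SOI process is not public):

def clock_ghz(fo4_count, gate_delay_ps):
    # clock = 1 / (FO4count * processFO4gatedelay), converted from
    # picoseconds per cycle to GHz.
    return 1000.0 / (fo4_count * gate_delay_ps)

gate_delay_ps = 13.0                 # hypothetical placeholder value
print(clock_ghz(17, gate_delay_ps))  # BD:     ~4.5 GHz
print(clock_ghz(22, gate_delay_ps))  # Phenom: ~3.5 GHz on the same process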

Anyway, regarding the topic: from this leaked result it looks as if AMD would need the clock advantage to get superior performance.

I don't know anything about that.

But the published FDIV latency is 42 cycles (compared to 19 in Phenom and 14 in SB...), which is 7 consecutive 6-cycle FMAs, which is "coincidentally" the number of operations needed for a full-precision software divide when you have FMA and a single extra rounding mode. I take this to mean that there is no hardware FDIV in BD.

FSQRT also got a huge increase, 35 => 52; however, this is still faster than any software FSQRT I know of, so I assume they have some special-purpose hardware left for it.
Those are not published latencies. I have my doubts; it is too early in my opinion, and I will believe this only once those latencies are published by AMD, not taken from compiler definitions. However, it is of course not a good sign to see them in compiler definitions.

Especially those two values, the latencies just do not match: FSQRT at 52 and FDIV at 42? I mean, SQRT 50% slower but the much more important DIV 120% slower? That does not make any sense to me.

And regarding the hardware divider: there is no real hardware divider; the hardware does the same thing a software divider would do. Basically you have the option of a hardware speedup with a 2- or 4-bit-per-cycle divider (Phenom has a 4-bit one). A pure software divider would take at least 53 cycles for the mantissa alone, and only with special instructions. Anyway, this would make no sense, since the part that consumes the most die space is not the hardware divider but the (real) hardware multiplier. I don't think there are big savings in using a 1-bit divider instead of a 4-bit one.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,324
1,462
136
And regarding the hardware divider: there is no real hardware divider; the hardware does the same thing a software divider would do. Basically you have the option of a hardware speedup with a 2- or 4-bit-per-cycle divider (Phenom has a 4-bit one). A pure software divider would take at least 53 cycles for the mantissa alone, and only with special instructions.

No, if you have a real pipelined FMA with 55 bits of mantissa precision, you can do software double-precision FDIV in 7 consecutive ops, or 42 cycles at 6 cycles/FMA. Publication. You can do crazy things with proper FMA.

That 42 cycles for FDIV comes from a patch committed to gcc by an AMD employee. It might be incorrect, but for now it's the closest thing I have to real information on the subject.

Those latencies for FDIV vs FSQRT don't seem to make any sense from a consumer viewpoint, but as FDIV at 42 is literally free for them, it might well be that the only reason FSQRT isn't more expensive is that there is no gain in making it more expensive. In any case, they seem to be optimizing the best case pretty heavily at the expense of the less-used instructions.
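
For anyone curious what a software FDIV even looks like: the paper's exact 7-op sequence needs true fused multiply-adds and the extra rounding mode, but the general shape -- refine a rough reciprocal of the divisor, then multiply by the dividend -- is easy to demonstrate. A toy Python sketch (plain doubles, every step rounded, so it only illustrates the convergence, not correctly-rounded IEEE division):

def soft_div(a, b):
    # Newton-Raphson reciprocal refinement: x <- x + x*(1 - b*x).
    # Each iteration doubles the number of correct bits; in hardware the
    # residual and the update would each be a single FMA.
    x = (1.0 / b) * (1.0 + 2.0 ** -12)  # simulate a rough ~12-bit seed
    for _ in range(3):                  # 12 -> 24 -> 48 -> full precision
        e = 1.0 - b * x                 # residual
        x = x + x * e                   # refinement
    return a * x                        # final multiply

print(soft_div(355.0, 113.0))  # ~3.1415929...
print(355.0 / 113.0)           # hardware divide, for comparison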
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,324
1,462
136
a 12 FO4 is monstrously fast.

Remember that the ALUs on those monsters were double-pumped -- effectively 6 FO4. They used dynamic logic, but I literally cannot imagine how it's possible to do a 32-bit add, with all the overhead, in 6 FO4. Crazy.

Wasn't gate leakage the reason they couldn't ramp clock speed, overlooked in predictions because it was never a major factor before 90nm? And doesn't high-k gate material really help against gate leakage?

I second the wish to see the P4 on a modern process. Or even just the ALUs, because damn, that's just crazy.
 

bryanW1995

Lifer
May 22, 2007
11,144
32
91
The original A64s and Opterons launched at 1.8GHz. It must be symbolic, of course.

no it is in solidarity with their agena release and its "simulated performance" of 3.0 or whatever the number was. of course, the 16 core cpus are still 6 months out, they have until oct 1 to start producing them without being late. of more urgent concern to most of us here is how the 8 core cpus will perform.
 

Soleron

Senior member
May 10, 2009
337
0
71
We know that 16-core BD will be at least 2.3GHz because there's a public announcement of a company using them.
 

Mopetar

Diamond Member
Jan 31, 2011
7,797
5,899
136
We know that 16-core BD will be at least 2.3GHz because there's a public announcement of a company using them.

Really, we have no idea what they will top out at. Just because there are test results for a 1.8 GHz CPU doesn't mean that will be the only one that they release. That said, the numbers are still somewhat disappointing, even at 1.8 GHz.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
If Bulldozer stays at 1.8GHz, it will come in between an i7 920 and an i7 950 in terms of gaming performance.

If Bulldozer is at 2.4GHz, it will nip at the heels of the 2500K in gaming performance and barely beat it in multi-threaded applications.

If Bulldozer is at 2.8GHz, it will beat the 2500K by 5-10% in gaming and 15% in heavily threaded applications.

The 8-core Bulldozer, I mean.

If that were true, it would mean low-thread-count performance is much better than Sandy Bridge's, or that the clock speeds at those thread counts are really high. The first of which is doubtful.
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
Ignoring heat and power, how fast a processor can clock is basically a measure of two things -- firstly, how quickly the physical transistors can switch and how quickly signals move through wires, and secondly, how many consecutive transistor switches and how much signal transit needs to happen in series on the longest path in the processor before a clock cycle is complete.

These things are basically independent, making it useful to have a standard idealized metric against which to compare both how long it takes for logic to complete and how quickly a certain silicon implementation can compute things. This metric is FO4: if I say that a logic device has a latency of 3 FO4, it means that regardless of the process where it's implemented, it will take as long as 3 consecutive NOT gates, each driving a load of 4 other NOT gates implemented on the same process. So a device with a logic length of 10 FO4 can be expected to clock twice as fast as another device on the same process with an FO4 of 20.

The rumor is (and I don't think there is any confirmation beyond "yes, it's a speed demon") that BD has an FO4 length of 17, compared to a known 22 for Phenom. So, looking only at transistor delay, BD should run at ~30% higher clock speed when implemented on the same process. Also, the GF 32nm SOI process is supposedly much better than their earlier processes, meaning the gain over Phenom should be higher than that.

This all was for the case where you are not limited by power. When you are, the gains should be more conservative, especially with AMD fitting so much more logic into a single chip. However, for the single-threaded case, the conservative baseline guess should be "whatever Phenom is clocking now" * 1.3 for architecture * roughly 1.1 for process = Holy ****.

disclaimers: FO4 is a much hairier metric than the simplified view I gave here -- wire delay and transistor switching time do not shrink in unison, so comparisons across too many process generations aren't very reliable. A more in-depth look at FO4 can be found at RWT.


+1 for that :) nice read.



However, for the single-threaded case, the conservative baseline guess should be "whatever Phenom is clocking now" * 1.3 for architecture * roughly 1.1 for process = Holy ****.


So since there is a 3.7GHz Phenom II X4 980 BE....

3.7GHz x 1.30 (30% increase from architecture) x 1.10 (process) = ~5.3GHz?

Lmao, that sounds too good to be true... however, if it turns out you're right... man oh man... a stock 5GHz CPU would be totally nuts (even if it was only a 4- or 8-core version).
 

StinkyPinky

Diamond Member
Jul 6, 2002
6,761
777
126
5GHz 8-core at stock? You guys are in dreamland. Not unless AMD went forward in time and brought back some 16nm CPUs from 2015.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
No, if you have a real pipelined FMA with 55 bits of mantissa precision, you can do software double-precision FDIV in 7 consecutive ops, or 42 cycles at 6 cycles/FMA. Publication. You can do crazy things with proper FMA.

That 42 cycles for FDIV comes from a patch committed to gcc by an AMD employee. It might be incorrect, but for now it's the closest thing I have to real information on the subject.

Those latencies for FDIV vs FSQRT don't seem to make any sense from a consumer viewpoint, but as FDIV at 42 is literally free for them, it might well be that the only reason FSQRT isn't more expensive is that there is no gain in making it more expensive. In any case, they seem to be optimizing the best case pretty heavily at the expense of the less-used instructions.
The double mantissa is 52 bits, not 55, btw. 7 ops at 6 cycles each don't cover it: you also have to calculate the exponent and do the rounding, plus FP exception handling and NaN and +/- INF handling. And the patented algorithm has even more disadvantages, as it needs the multiplier, meaning you are fully blocking your pipeline during a div. It would be a real mess to do it the way it is described in that patent, and it would need a complex decoder to do the microcode lookup, etc. Just forget the patent: if you assume that the cycle counts in the gcc list are correct, it just can't be done using the algorithm in the patent.

And in the Open64 compiler they have for Orochi: 25 cycles for FDIV, 35 for divxsd and 30 cycles for fsqrt.

I think both are just incorrect and we will have to wait for the official documentation (AMD Optimization Guide).
 

Riek

Senior member
Dec 16, 2008
409
14
76
Core i7 970 (6 cores, 2.67GHz)
With HT: 61.92
Without HT: 70.59

4x Xeon X7550 (32 cores, 2.0GHz)
13.47

Assuming the X7550 system has Hyperthreading enabled, linear scaling of the Core i7 970 down to 2GHz would give 82.7 seconds.

Assuming it then scales linearly to 32 cores, it would score 15.5 seconds, slightly slower than the 13.47 it actually gets. The scores aren't that far off expectations.

vs Magny Cours

2-socket Opteron 6168 (1.9GHz, 24 cores): 30.61
4-socket Opteron 6168 (1.9GHz, 48 cores): 15.43
2-socket Bulldozer (1.8GHz, 32 cores): 25.97

Hypothetical 24-core 1.9GHz Bulldozer: 32.8 (7% difference)
Hypothetical 48-core 1.9GHz Bulldozer: 16.4 (6% difference)

I don't see any real outlier here, considering per-core peak FP capabilities are the same between the two. A benchmark that actually stresses other parts of the subsystem might show advantages for Interlagos.

According to Intel, the 970 runs @ 3.2GHz?
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,324
1,462
136
The double mantissa is 52 bits, not 55, btw.

I know that. With a 52-bit mantissa, you need 8 consecutive ops. If you add 3 bits to that (which you already kind of have with the 64-bit mantissa), you can do the division in 7 ops.

7 ops at 6 cycles each don't cover it: you also have to calculate the exponent and do the rounding, plus FP exception handling and NaN and +/- INF handling.

I'm not making it up. The algorithm in the paper I posted does the correct exponent and rounding at the same time as the mantissa. NaN, inf and exception handling are done by trapping into microcode even today. I told you FMA is crazy: it can be used for things I would never come up with on my own.
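
A rough way to see where op counts like 7 or 8 come from (generic Newton-Raphson accounting in Python, not the actual scheme from the paper):

def nr_fma_ops(seed_bits, target_bits):
    # Two dependent FMAs per Newton-Raphson iteration (residual + update),
    # plus one final op to form the quotient. Quadratic convergence doubles
    # the number of correct bits each iteration.
    iters, bits = 0, seed_bits
    while bits < target_bits:
        bits *= 2
        iters += 1
    return 2 * iters + 1

print(nr_fma_ops(12, 53))  # with an x86-style ~12-bit reciprocal seed: 7 ops

The 52- vs 55-bit point is about whether the intermediates keep enough precision for the final rounding, which this crude count ignores.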

And the patented algorithm has even more disadvantages, as it needs the multiplier, meaning you are fully blocking your pipeline during a div.
BD has two FMA pipelines -- no matter what, doing an FP op blocks a multiplier, whether you use it or not. The other one is still available.

It would be a real mess to do it the way it is described in that patent, and it would need a complex decoder to do the microcode lookup, etc.
Yes, microcode is a disadvantage, but all the pieces needed are already in the processors.

Just forget the patent: if you assume that the cycle counts in the gcc list are correct, it just can't be done using the algorithm in the patent.
Yes, it can. I just ran through the algorithm in my head -- it really does get 52 correct bits in 7 consecutive ops, when you start with the reciprocal approximation that's already in the processors.

And in the Open64 compiler they have for Orochi: 25 cycles for FDIV, 35 for divxsd and 30 cycles for fsqrt.
If I'm using svn blame correctly, those numbers are from 2009. If nothing else, the gcc numbers are more current.
I think both are just incorrect and we will have to wait for the official documentation (AMD Optimization Guide).

I agree that we have very little information, and what little there is largely contradicts itself. However, I'd still be willing to bet money that FDIV in BD is software.