AMD Bulldozer Dual-Interlagos Benchmarks On Linux


Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
But the published FDIV latency is 42 cycles (compared to 19 in Phenom and 14 in SB...), which is 7 consecutive 6-cycle FMAs, which is `coincidentally` the number of operations needed for a full-precision software divide when you have FMA and a single extra rounding mode. I take this to mean that there is no hardware FDIV in BD.

FSQRT also got a huge increase, 35 => 52, however, this is faster than any software FSQRT I know, so I assume they have some special-purpose hardware left for it.

These are x87 latencies with at least 80-bit precision. As I posted here, double-precision SSE division has a latency of 27 cycles. What's not written there is that single precision has an even lower latency of 24 cycles (as one would expect). The same goes for SQRT, where DP lands at 38 cycles and SP at 29 cycles.

According to a patent, the FMA units might have a forwarding path without rounding, which would provide the result after 5 cycles. That would surely help in an implementation like Goldschmidt's division algorithm.
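
For readers unfamiliar with it, Goldschmidt division scales numerator and denominator by the same factor over and over until the denominator converges to 1, which is why a fast multiply-add path matters. Below is a minimal Python sketch of the idea, not AMD's actual implementation: the name `goldschmidt_divide`, the linear seed `48/17 - 32/17*d` and the iteration count are illustrative assumptions, and Python floats round after every step, unlike the unrounded forwarding path the patent describes.

```python
import math


def goldschmidt_divide(n, d, iterations=4):
    """Approximate n / d for d > 0 using Goldschmidt iteration (illustrative sketch)."""
    if d <= 0:
        raise ValueError("this sketch only handles positive divisors")

    # Scale the divisor into [0.5, 1) and apply the same power of two to n,
    # so the quotient n / d is unchanged.
    mantissa, exponent = math.frexp(d)  # d == mantissa * 2**exponent
    n = math.ldexp(n, -exponent)
    d = mantissa

    # Assumed seed: a linear estimate of 1/d on [0.5, 1), error at most 1/17.
    seed = 48.0 / 17.0 - (32.0 / 17.0) * d
    n *= seed
    d *= seed  # d is now close to 1

    # Each pass multiplies n and d by f = 2 - d.
    # If d = 1 - e, then d * f = 1 - e**2, so the error squares every pass.
    for _ in range(iterations):
        f = 2.0 - d
        n *= f
        d *= f

    return n  # d has converged to 1, so n now holds the quotient


print(goldschmidt_divide(1.0, 3.0))
```

Each pass is one subtract-from-two followed by two independent multiplies, which is the kind of work a pipelined FMA unit handles well.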
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Remember that the ALUs on those monsters were double-pumped -- or 6 FO4. They used dynamic logic, but I literally cannot imagine how it's possible to make a 32-bit add with all the overhead in 6 FO4. Crazy.

Here's an explanation:
The processor does ALU operations with an effective latency of one-half of a clock cycle. It does this operation in a sequence of three fast clock cycles (the fast clock runs at 2x the main clock rate) as shown in Figure 7. In the first fast clock cycle, the low order 16-bits are computed and are immediately available to feed the low 16-bits of a dependent operation the very next fast clock cycle. The high-order 16 bits are processed in the next fast cycle, using the carry out just generated by the low 16-bit operation. This upper 16-bit result will be available to the next dependent operation exactly when needed. This is called a staggered add. The ALU flags are processed in the third fast cycle. This staggered add means that only a 16-bit adder and its input muxes need to be completed in a fast clock cycle. The low order 16 bits are needed at one time in order to begin the access of the L1 data cache when used as an address input.
from http://www.eecg.toronto.edu/~moshovos/ACA05/read/Pentium4ArchITJ.pdf
More here:
http://ctho.ath.cx/toread/forclass/18-722/logicfamilies/Deleganes05.pdf
http://www.chip-architect.com/news/2000_11_07_Intel_Adder.html
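
The staggered add described in the quoted Intel paper can be mimicked in software to make the data flow concrete. This is a behavioural sketch only: the name `staggered_add32` and the choice of carry and zero as the flags are assumptions for illustration, and nothing here models timing or circuit detail.

```python
MASK16 = 0xFFFF


def staggered_add32(a, b):
    """Add two 32-bit values as two 16-bit halves plus a flag step."""
    # Fast cycle 1: add the low 16 bits. A dependent operation could
    # already start on this half in the next fast cycle.
    low_sum = (a & MASK16) + (b & MASK16)
    low = low_sum & MASK16
    carry = low_sum >> 16

    # Fast cycle 2: add the high 16 bits, using the carry from cycle 1.
    high_sum = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    high = high_sum & MASK16

    # Fast cycle 3: compute flags from the completed result.
    result = (high << 16) | low
    carry_out = high_sum >> 16
    zero = result == 0

    return result, carry_out, zero


result, carry_out, zero = staggered_add32(0x0000FFFF, 0x00000001)
print(hex(result), carry_out, zero)
```

Only a 16-bit adder is exercised in each step, which is the point the paper makes about what has to fit in one fast clock cycle.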

Wasn't the reason they couldn't ramp clockspeed gate leakage? Which was overlooked in predictions because it was never a major factor before 90nm? And doesn't high-K gate material really help against gate leakage?

I second the wish to see P4 on a modern process. Or even just the ALUs, because damn, that's just crazy.
Maybe we'll see something loosely similar in a future CPU. ;)
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
5GHz 8-core at stock? You guys are in dream land. Not unless AMD went forward in time and brought back some 16nm CPUs from 2015.

Looking at what a 6-core Thuban 1090T can do on 45nm without the benefits of HK/MG, I agree a 5GHz 8-core Zambezi may seem lofty, but it is not without reason.

The big difference, and maybe people are forgetting to account for this, is that the 1090T is made possible because the underlying 45nm process tech has been refined for 2 yrs since introduction and the Thuban itself is a tweaked/optimized PhII.

So I don't think we can use the 1090T as a starting point for determining where Bulldozer will start out, but it would seem reasonable to use it as a good estimate of where Bulldozer could end up on 32nm after 2+ yrs of process optimizations combined with maybe a few major stepping updates.

(Thuban today is E0-stepping, that's a lot of tweaks to the layout to get it up to 3.3GHz clocks and within the 125W TDP)

I think a 3.5GHz Zambezi at time-zero release on an immature 32nm process, with only the first few stepping respins under their belt, would be reasonable.
 

Dribble

Platinum Member
Aug 9, 2005
2,076
611
136
Why is everyone still so sure Bulldozer is going to come out at 3.5 or more after we have just seen an engineering sample doing 1.8?

I am sure you could do some study of how close engineering samples are to final speeds, using engineering samples from previous generations of chips. I haven't done that, but from what I remember of previous chips (more Intel than AMD) they were mostly either at or very close to the final released speed. I doubt you'll find many where the speed doubled, particularly this close to release.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Why is everyone still so sure Bulldozer is going to come out at 3.5 or more after we have just seen an engineering sample doing 1.8?
There were "2.3GHz" Interlagos chips mentioned in supercomputer upgrade plans, scheduled for June. The 3.5GHz likely comes from the ISSCC paper.

I am sure you could do some study of how close engineering samples are to final speeds, using engineering samples from previous generations of chips. I haven't done that, but from what I remember of previous chips (more Intel than AMD) they were mostly either at or very close to the final released speed. I doubt you'll find many where the speed doubled, particularly this close to release.
K8 samples before launch were mostly running at 800MHz. So that kind of analysis might be useless, since it depends heavily on AMD's ES policy. We're not observing trees here ;)
 

Zstream

Diamond Member
Oct 24, 2005
3,396
277
136
There were "2.3GHz" Interlagos chips mentioned in supercomputer upgrade plans, scheduled for June. The 3.5GHz likely comes from the ISSCC paper.

I doubt you'll see a 3.5GHz Bulldozer. Maybe with turbo enabled, but not at base speed.

Can we start a Bulldozer prediction thread? Just with predictions :)
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
Maybe we'll see something loosely similar in a future CPU. ;)
Already there and old by now. IBM Power 6: 5 GHz base clock at 65 nm (2007)!

Though I don't know if they used that type of ALU; I think they used a regular one. However, Power 6 is not OoO. I am not sure, but afaik Power 6 also has 12 FO4.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Why is everyone still so sure Bulldozer is going to come out at 3.5 or more after we have just seen an engineering sample doing 1.8?

Let's not forget most SB samples we saw early on were running at 2.0GHz.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Already there and old by now. IBM Power 6: 5 GHz base clock at 65 nm (2007)!

Though I don't know if they used that type of ALU; I think they used a regular one. However, Power 6 is not OoO. I am not sure, but afaik Power 6 also has 12 FO4.

Yea, Power6 was dual core.

And the new IBM z196 chip runs at 5.2GHz on 45nm, and it is a quad core.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
These are x87 latencies with at least 80-bit precision. As I posted here, double-precision SSE division has a latency of 27 cycles. What's not written there is that single precision has an even lower latency of 24 cycles (as one would expect). The same goes for SQRT, where DP lands at 38 cycles and SP at 29 cycles.

According to a patent, the FMA units might have a forwarding path without rounding, which would provide the result after 5 cycles. That would surely help in an implementation like Goldschmidt's division algorithm.

Here's a public source for the latency numbers:
http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01883.html

For example:
+(define_insn_reservation "bdver1_ssediv_double" 27
 

Abwx

Lifer
Apr 2, 2011
10,847
3,297
136
I think a 3.5GHz Zambezi at time-zero release on an immature 32nm process, with only the first few stepping respins under their belt, would be reasonable.

Turbo Core will boost all cores by 500MHz, and by as much as 1GHz when fewer cores are active, so there is no need for a high base frequency to ensure very high single-thread performance.
Btw, I'd like to point out that if FP is involved, single-thread performance will surely be better than on SB even with no big frequency boost...
Often, when some aficionados point to single-thread performance, they are in fact talking about integer-op performance....
 

gevorg

Diamond Member
Nov 3, 2004
5,075
1
0
Even if Bulldozer can do 3.5GHz, it's very unlikely to have it at launch. AMD needs some headroom for future updates/revisions; they cannot just reveal all their cards at once.
 

JFAMD

Senior member
May 16, 2009
565
0
0
I am not saying what the clock speed is, but you always ship with the highest clock speeds that you can hit.

All of the future updates come from process/design enhancements. One would be a fool to hold back when bringing a new product to market.
 

Ares1214

Senior member
Sep 12, 2010
268
0
0
I am not saying what the clock speed is, but you always ship with the highest clock speeds that you can hit.

All of the future updates come from process/design enhancements. One would be a fool to hold back when bringing a new product to market.

I agree with this. I mean, look back at the GTX 460. I think Nvidia regrets releasing it at such low clock speeds. If it had had maybe 15-20% higher clock speeds, it would have made the 6870 look much less attractive. Same thing here. Furthermore, first impressions are everything. If they blow SB away but leave themselves little headroom, then that is better than waiting 6 months to increase the stock clocks by 300-400 MHz. The most important thing is how well it does against SB at maximum OC. If it can only hit 4.3-4.4 GHz on a good OC, then it will need to have much better performance at stock clocks.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
My thinking would be in regards to the Turbo clock. I like the idea and would want C&C to still be able to run. Will they include a way to "overclock" the TDP and set the headroom to 175 watts, for example?
 

ElFenix

Elite Member
Super Moderator
Mar 20, 2000
102,414
8,356
126
My thinking would be in regards to the Turbo clock. I like the idea and would want C&C to still be able to run. Will they include a way to "overclock" the TDP and set the headroom to 175 watts, for example?

I doubt the motherboards would be able to handle that.
 

podspi

Golden Member
Jan 11, 2011
1,965
71
91
I doubt the motherboards would be able to handle that.

You bring up an excellent point. If BD is designed to hit up against its TDP all the time, then unless TurboCORE disables itself when not on stock settings, we could see a lot of blown motherboards (a similar thing happened when Thuban was released; there is still an ancient thread going on about it over at HardOCP).
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
You bring up an excellent point. If BD is designed to hit up against its TDP all the time, then unless TurboCORE disables itself when not on stock settings, we could see a lot of blown motherboards (a similar thing happened when Thuban was released; there is still an ancient thread going on about it over at HardOCP).

If I read it carefully enough, it's not that it would run at turbo all the time. Rather, if for example it noticed that 2 cores were doing work and those two cores were near max CPU, it would ramp up those two cores using the spare room in the TDP.

We already do similar stuff in overclocking (even when not increasing voltage, power usage is increased). So if we found out a CPU could run at 5GHz, let's say at 180W of heat output, and it was stable, wouldn't you rather that it only ran at that when it really needed to? Then it would also give that extra power to other cores on a more distributed load (like 4 cores at 4.5GHz and 6 cores at 4.0GHz) and so on.
 

Abwx

Lifer
Apr 2, 2011
10,847
3,297
136
So if we found out a CPU could run at 5GHz, let's say at 180W of heat output, and it was stable, wouldn't you rather that it only ran at that when it really needed to? Then it would also give that extra power to other cores on a more distributed load (like 4 cores at 4.5GHz and 6 cores at 4.0GHz) and so on.

The top-bin BD is rated at 125W.
I would be surprised if the peak power reached 180W.
Moreover, with such a power drain, the VRMs must be seriously beefed up and correctly cooled, increasing the costs.
BD will surely be a more or less costly CPU, so a few more $ wouldn't change the total platform cost by a significant margin, but then reliability would be problematic, and AMD can't risk a wave of burning MBs....
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
I doubt the motherboards would be able to handle that.

Can't speak to AMD mobos, but people take their Intel CPUs up over 300W power consumption with extreme cooling.

I ran my QX6700 @4GHz for a couple of years on vapor-phase cooling; it used more than 300W on an ASUS P5E WS Pro mobo (not a cheap mobo in its time).

I know some of the lower-power AMD boards burned up with the 140W Phenoms, but I'd think the top-end AMD mobos would be designed to handle 300W loads from the enthusiast overclockers out there. Just a guess though, I haven't read enough about it to know.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Just to check again. Are you sure that wasn't system power?

http://www.tomshardware.com/reviews/overclock-core-i7,2268-10.html

It was system power; more specifically, it was system power at full load less system power at idle (sans the power for the vapoLS & LCDs in both cases).

Look at these tests, and notice how easy it is to push the power consumption around. With my QX6700 I could change the CPU's power consumption by 150W at full load just by moving Vcc within a standard range of usable Vcc values at 2GHz.

[Image: VccversusPowerConsumption.gif]


^ Now that is at 2.6 GHz and 2.0 GHz. To operate at 4GHz fully loaded required 1.6V (just shy of it IIRC).

I don't have the data here but IIRC the system load (same system that generated the results above) was pulling around 600W.

The more I think about it, I think I concluded my CPU had to be doing around 350W at the socket once I factored in PSU efficiency.

Those guys who push with LN2 at >6GHz are probably pushing 600W or more through their CPU.
 

Abwx

Lifer
Apr 2, 2011
10,847
3,297
136

Since this is a cubic function, I assume that it is composed of a product of three first-degree functions.
The first is the current needed to charge/discharge the parasitic capacitance, which increases linearly with frequency.
The second should be the cross-conduction losses of the push-pull CMOS pairs, which also increase linearly with frequency.
And third, I can only see the necessary voltage increase as the factor that raises the final degree to 3...
Is that plausible?
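
One common way to arrive at a cubic is the textbook dynamic-power relation P = C * V^2 * f combined with a supply voltage that has to rise roughly in step with frequency. The sketch below is a toy model under that assumption: `c_eff` and `v_per_ghz` are made-up constants, leakage and any voltage floor are ignored, and it is not fitted to the measurements in the graph.

```python
def dynamic_power(freq_ghz, vcc, c_eff=1.0):
    """Textbook CMOS switching power, P = C * V^2 * f, in arbitrary units."""
    return c_eff * vcc ** 2 * freq_ghz


def vcc_needed(freq_ghz, v_per_ghz=0.4):
    """Hypothetical assumption: required Vcc is proportional to frequency."""
    return v_per_ghz * freq_ghz


def power_at(freq_ghz):
    """Power when Vcc is raised just enough to reach the given frequency."""
    return dynamic_power(freq_ghz, vcc_needed(freq_ghz))


# With V proportional to f, P = C * (k*f)^2 * f = C * k^2 * f^3,
# so doubling the clock multiplies power by 2**3.
ratio = power_at(4.0) / power_at(2.0)
print(f"power ratio for 2x clock: {ratio:.2f}")
```

In this picture the frequency contributes one linear factor directly and the voltage contributes the other two through the V^2 term, rather than each loss mechanism contributing its own factor of f.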
 