Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

JSt0rm · May 2, 2011

June 7th for the new ones? Ok I will wait to see how these stack up.

nonameo · May 2, 2011

JSt0rm said:
June 7th for the new ones? Ok I will wait to see how these stack up.

I'm surprised we haven't gotten much out of leaks yet... I mean, come on. However, no hyped up statements from AMD so far either; all I've heard so far is better IPC than Phenom II and 50% more performance with 33% more cores. (note: the title of this thread is LOL. I think that's pretty much been debunked, right?)

Really, I'm more looking forward to Llano though. I think AMD has more money to make there, I just hope they sell all the chips they possibly can (well, I mean... I want them to make money). They need it... AMD needs to grow to compete better with Intel.
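As a rough sanity check on the "50% more performance with 33% more cores" figure mentioned above, the implied per-core gain can be worked out directly. This is only a sketch: it assumes the 33% refers to going from 12 to 16 cores (a ratio of 4/3) and that throughput scales linearly with core count, neither of which is stated in this thread. The function name is made up for illustration.

```python
def implied_per_core_gain(throughput_ratio, core_ratio):
    """Return the per-core throughput ratio implied by a total speedup.

    Assumes throughput scales linearly with core count (an idealisation).
    """
    return throughput_ratio / core_ratio


# Assumption: "33% more cores" means 16 cores versus 12, i.e. a ratio of 4/3.
gain = implied_per_core_gain(1.5, 16 / 12)
print(f"Implied per-core gain: {(gain - 1) * 100:.1f}%")
```

Under those assumptions the claim works out to roughly 12.5% more throughput per core, which is a long way from the 50% in the thread title.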

Dresdenboy · May 2, 2011

Gundark said:
I wonder if JFAMD could clear this up, can BD do 4 macroOps per core or per module? Or, is this a secret?
As I recall, AMD does not promote modules but cores.

It can do both

In recent papers the Bulldozer designers call those ops "Cops" (complex ops, equivalent of an ALU/FP micro-op + a mem op [load/store/load+store]).

Decode: 4 Cops/cycle/module or up to 5 in case of branch fusion (IIRC branch op has to be in last place then)
Issue: 2 ALU ops + 2 AGLU ops per cycle per core plus 4 FP/SIMD ops per cycle per module in the FPU (belonging to both threads)

Bulldozer's decode unit extracts and decodes up to four x86 instructions per cycle from raw instruction bytes. The decode pipeline converts x86 instructions into Cops that can directly execute on the functional units.

The scheduler picks and schedules four Cops per cycle to the execution units out of order.

on FPU:

AMD designed the Bulldozer FPU to deliver industry-leading performance on HPC, multimedia, and gaming applications. The primary means of achieving such performance is a four-wide, two-way, multithreaded, fully out-of-order FPU, combined with two 128-bit FMAC units supported by a 128-bit high-bandwidth load/store subsystem.

Source: Michael Butler, Leslie Barnes, Debjit Das Sarma, Bob Gelinas, "Bulldozer: An Approach to Multithreaded Compute Performance," IEEE Micro, pp. 6-15, March/April, 2011

Here are some links related to Chuck Moore's comments at Financial Analyst Day 2010, where he mentioned the issue rate of 4 "instructions" per cycle per core and the same bandwidth for decode:
http://citavia.blog.de/2010/04/22/p...architecture-as-speculated-8429143/#c12914412

David Kanter's article on BD gives more details if the software optimization manual is too cryptic.
http://realworldtech.com/page.cfm?ArticleID=RWT082610181333

podspi · May 2, 2011

Dresdenboy said:
It can do both In recent papers the Bulldozer designers call those ops "Cops" (complex ops, equivalent of a ALU/FP micro-op + a mem op [load/store/load+store]).

Decode: 4 Cops/cycle/module or up to 5 in case of branch fusion (IIRC branch op has to be in last place then)
Issue: 2 ALU ops + 2 AGLU ops per cycle per core plus 4 FP/SIMD ops per cycle per module in the FPU (belonging to both threads)

Does this mean the theoretical max-throughput of a module is 4 "cops" a cycle? (Incidentally, so would a core's).

I think this should be enough. I'm doing some massive (100gb+) archiving right now with Winrar, and according to CodeAnalyst average IPC is 0.35 on my Thuban.

This sounds to me like synthetic benchmarks may not react well to Bulldozer, but real world performance should be ... (well, who knows) ...

Tuna-Fish · May 2, 2011

podspi said:
Does this mean the theoretical max-throughput of a module is 4 "cops" a cycle?

4+ whatever you gain from fusing ops. 6 avg for two threads both doing tight loops with two branches that have tests right before them?
like:

add
mov
test
jnz (never really taken)
dec
jnz start

(where no internal dependencies)
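The counting behind that "6 avg" figure can be sketched with a toy model. It assumes, as the example above does, that a flag-setting instruction immediately followed by a conditional branch is fused into a single Cop; which instruction pairs actually fuse on real hardware is not confirmed anywhere in this thread, and the mnemonic sets and function name below are illustrative only.

```python
# Assumption: these instructions can fuse with a directly following branch.
FLAG_SETTERS = {"test", "cmp", "dec"}
CONDITIONAL_BRANCHES = {"jnz", "jz", "jne", "je"}


def count_cops(mnemonics):
    """Count decode slots used, fusing flag-setter + branch pairs into one."""
    cops = 0
    i = 0
    while i < len(mnemonics):
        fusible = (
            i + 1 < len(mnemonics)
            and mnemonics[i] in FLAG_SETTERS
            and mnemonics[i + 1] in CONDITIONAL_BRANCHES
        )
        i += 2 if fusible else 1  # a fused pair consumes two instructions
        cops += 1                 # but only one decode slot
    return cops


loop_body = ["add", "mov", "test", "jnz", "dec", "jnz"]
cops = count_cops(loop_body)
print(f"{len(loop_body)} instructions -> {cops} Cops")
```

In this model the six-instruction loop body occupies four decode slots, so a decoder sustaining 4 Cops per cycle would retire six x86 instructions per cycle on that loop.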

I think this should be enough. I'm doing some massive (100gb+) archiving right now with Winrar, and according to CodeAnalyst average IPC is 0.35 on my Thuban.

Note that for a large part of that, the CPU was twiddling its proverbial thumbs waiting for memory (or even disk...). To improve on that, you'd have to either decrease memory latency (not really possible), or increase IPC during the sections where the processor is actually doing something. Even if the average IPC is <1, upping the peak IPC can help.

As I said before, decode is special in this, because when the processor stalls on a data dependency, the decoders can keep working. Data misses tend to inconveniently happen at the beginning of functions, so they often cannot be hidden well with OOOE. L1 miss = 8 extra decoded instructions. L2 miss = ~40 extra decoded instructions. L3/memory miss = decode until all queues are full.
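To illustrate the point about peak versus average IPC, here is a toy model in which a program spends some cycles retiring instructions at a peak rate and the rest stalled on memory. All numbers are hypothetical and were picked only to land in the same ballpark as the 0.35 reported above; real pipelines overlap work and stalls, which this ignores.

```python
def average_ipc(instructions, peak_ipc, stall_cycles):
    """Average IPC when busy cycles run at peak_ipc and stalls retire nothing."""
    busy_cycles = instructions / peak_ipc
    return instructions / (busy_cycles + stall_cycles)


# Hypothetical workload: one million instructions, 2.4 million stall cycles.
INSTRUCTIONS = 1_000_000
STALL_CYCLES = 2_400_000

for peak in (2.0, 3.0, 4.0):
    avg = average_ipc(INSTRUCTIONS, peak, STALL_CYCLES)
    print(f"peak IPC {peak:.0f} -> average IPC {avg:.3f}")
```

Even with the stall time held fixed, raising the peak rate shortens the busy portion and nudges the average up, although the stalls still dominate.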

drizek · May 2, 2011

AtenRa said:
You lost me there, what do you mean ??

Intel started building quad cores at 65nm, and they have so far been shrinking them with every generation. Now they are selling very small, cheap to manufacture chips because they don't have any real competition forcing them to build more expensive chips.

On the other hand, AMD has pretty much transitioned completely to selling big 6-core chips.

In the graphics world, AMD was doing a lot better than nvidia. They could sell cheap, really nice $100 cards (3850, 4830, 5770) while nvidia was selling huge, power hungry cards and doing a very bad job of it. Same story again with the 9500/9700, where nvidia responded with the huge 5xxxx cards that were a total failure until the 57/5900 series, whereas ATI could get away with just the 9600xt/9800xt.

And again in the A64 days, where Intel was desperate and they started making enormous chips with ridiculous amounts of L3, and then doing the Pentium D.

I don't know what the cause/effect here is, but I think you can find some pretty good correlation between how well a company is doing and how big their chips are.

AtenRa · May 2, 2011

So you're saying AMD is desperate and that's why Bulldozer's die size is bigger than Intel's Sandy Bridge??

So die size is related to how desperate the company is?? I guess then NVIDIA is the most desperate company, because of the GF110 die size of 520mm2

Phynaz · May 2, 2011

drizek said:
On the other hand, AMD has pretty much transitioned completely to selling big 6-core chips.

I'll bet you the share of the X6 chips in AMD's product mix is in the single digits.

Far, far away from being "completely transitioned".

Martimus · May 2, 2011

AtenRa said:
So you're saying AMD is desperate and that's why Bulldozer's die size is bigger than Intel's Sandy Bridge??

So die size is related to how desperate the company is?? I guess then NVIDIA is the most desperate company, because of the GF110 die size of 520mm2

What is the Bulldozer die size? And what is the source of that information?

AtenRa · May 2, 2011

What is the Bulldozer die size? And what is the source of that information?

http://translate.googleusercontent....le.com&usg=ALkJrhhGTDZvNA4ijJjksX-jo2gfajc2wg

Edit: Sorry the link URL was wrong

drizek · May 2, 2011

Phynaz said:
I'll bet you the share of the X6 chips in AMD's product mix is in the single digits.

Far, far away from being "completely transitioned".

For the Phenoms I meant. Are they even producing Phenom II non-X6 chips anymore? I thought they were selling some gimped quad Thubans to OEMs.

Of course, most of their chips are going to be Athlon IIs at this point, but I was referring to the top end/enthusiast line.

So you're saying AMD is desperate and that's why Bulldozer's die size is bigger than Intel's Sandy Bridge??

Well...

1. Yes, AMD is desperate. I think we can all agree on that.
2. I don't think they are desperate with Bulldozer necessarily. Zambezi is smaller than Barcelona.
3. It is more that Intel is very confident and safe right now that SB is as small as it is, rather than AMD being particularly desperate. Zambezi is "normal sized", SB is small.

I guess then NVIDIA is the most desperate company, because of the GF110 die size of 520mm2

Yes and no.

GF110 is a special case since it is a Fermi card. It is not competing directly with AMD's top end GPUs. I think that for gamers it is better to get two GF104 cards rather than one GF110. I personally would never buy a GF110 over an AMD or SLI setup, for instance. It does a bunch of stuff that I don't need as a gamer.

Basically, if you have an efficient, high performance architecture, you get an opportunity to provide the same or better performance as your competitor at a lower production price. This gives you the ability to make a tidy profit for the next 6-12 months. Your competitor will try to retake the performance crown by gluing two of their chips together, or gluing some cache on them. These chips are probably not very economical, but from the competitor's point of view they can at least give consumers the impression that they are at the same level as you, and bring prices down to cut into your profits.

JSt0rm · May 2, 2011

Well, let's hope Bulldozer is competitive in some area. AMD deserves to at least have me wait to see if they can compete. My X2 4200 has been flawless for a long time.

drizek · May 2, 2011

Some stuff about GF104, and big dies, from semiaccurate

As a humorous aside, both [Cypress and GF104] are made on the same process, TSMC’s 40nm, and literally at the same fab. AMD managed to cram 2.15 billion transistors into 334mm^2, about 6.44 million transistors per mm^2. GF104 has 1.95 billion transistors in 367mm^2, about 5.31 million transistors per mm^2. This means AMD’s Evergreen architecture is over 20% more space efficient than GF104 while delivering much more raw performance and vastly more performance per watt. When SemiAccurate teases Nvidia’s layout and physical design teams, it is for a reason.

http://semiaccurate.com/2010/07/21/gf104gtx460-has-huge-die/
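The density figures in that quote are easy to re-derive from the transistor counts and die areas it gives. The snippet below only re-checks the arithmetic; the function name is made up for illustration, and the comparison says nothing about whether transistors per mm^2 is a fair efficiency metric.

```python
def density_millions_per_mm2(transistors_billions, die_area_mm2):
    """Transistor density in millions of transistors per square millimetre."""
    return transistors_billions * 1000 / die_area_mm2


# Figures as quoted from the SemiAccurate article.
cypress = density_millions_per_mm2(2.15, 334)
gf104 = density_millions_per_mm2(1.95, 367)
advantage = cypress / gf104 - 1

print(f"Cypress: {cypress:.2f} M/mm^2")
print(f"GF104:   {gf104:.2f} M/mm^2")
print(f"Density advantage: {advantage * 100:.1f}%")
```

The quoted 6.44 and 5.31 million transistors per mm^2 both check out, and the ratio comes to a little over 21%, consistent with "over 20%".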

Mopetar · May 2, 2011

drizek said:
I don't know what the cause/effect here is, but I think you can find some pretty good correlation between how well a company is doing and how big their chips are.

If there is any, it's only because a smaller die allows you to fit more processors on each wafer and is less susceptible to process defects. Those can lead to improved profits.
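Both effects can be put into rough numbers with two textbook approximations: a gross dies-per-wafer estimate with an edge-loss term, and a simple Poisson yield model. This is a sketch under stated assumptions, not how any foundry actually prices wafers: the 300 mm wafer size and the defect density are hypothetical, and the two die areas are simply the smallest and largest figures quoted in this thread.

```python
import math


def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    """Approximate die count: wafer area / die area minus an edge-loss term."""
    radius = wafer_diameter_mm / 2
    whole = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(whole - edge_loss)


def poisson_yield(die_area_mm2, defects_per_mm2):
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-die_area_mm2 * defects_per_mm2)


DEFECT_DENSITY = 0.002  # hypothetical defects per mm^2, for illustration only

for area in (170.0, 520.0):
    dies = gross_dies_per_wafer(area)
    good = dies * poisson_yield(area, DEFECT_DENSITY)
    print(f"{area:.0f} mm^2: {dies} gross dies, about {good:.0f} good dies")
```

The small die wins twice in this model: more candidates fit on the wafer, and a larger share of them come out defect-free.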

Of course a small die size doesn't mean you're making a more efficient chip or that you'll be wildly profitable. If you make a really powerful chip on a big die that beats what the competition has, you'll be able to sell it at a higher price and be more profitable. If you make an underperforming chip on a small die, it's not going to sell well.

As JFAMD has pointed out several times, consumers don't care about the die size. No one makes purchasing decisions based on die size. They look at raw performance, performance/price, performance/watt, or some other more complicated metric.

Tuna-Fish · May 2, 2011

semiaccurate said:
AMD managed to cram 2.15 billion transistors into 334mm^2, about 6.44 million transistors per mm^2. GF104 has 1.95 billion transistors in 367mm^2, about 5.31 million transistors per mm^2. This means AMD's Evergreen architecture is over 20% more space efficient than GF104

This quote is just plain wrong. Different kinds of circuits just take more space per transistor, and it's irrelevant if you can get the same performance with more/less sram.

This doesn't change the point about AMD being much more efficient on this product cycle -- HD6850, which falls between 460 and its respin 560 in performance, takes this to absurdity, as it's only 255mm^2, or closest in size to NVidia GTX 550, which is 238mm^2, and competes closest with AMD's 5770, which in turn is only 170mm^2.

AMD is basically countering every NV card except the very top ones with the model that is one cost-tier below its opponent.

drizek · May 2, 2011

Mopetar said:
As JFAMD has pointed out several times, consumers don't care about the die size. No one makes purchasing decisions based on die size. They look at raw performance, performance/price, performance/watt, or some other more complicated metric.

Yes, which is why I bought a GTX 460 instead of an AMD card. Mostly it was performance/price, but it was also just that it had lower idle power draw, and I got a specific card which was said (and proved) to be essentially silent when idle.

SA says the same,

The problem is simple for Nvidia, the economics of this part don't work out, the underlying architecture is wrong, so the resultant parts start out with an uphill battle. This is a problem for Nvidia, not for the end user. If the GTX460 is priced at a loss, the consumer shouldn't care, they get a deal, and that is the end of it. Retail buyers rarely care if the part is making a profit for the manufacturer.

So again, as an enthusiast, I find this whole discussion interesting because it tells us a lot about what is going on at the companies, both in terms of their engineering and in terms of their financing. It doesn't really affect my purchasing decisions.

There is one exception though, and that's the environmental angle. Making wafers is quite resource intensive, and with all else being equal, I generally try and support companies who minimize their environmental impact by being more efficient in their manufacturing.

videoclone · May 2, 2011

A 4-core Bulldozer is smaller than a 4-core Sandy Bridge.

I don't understand what all this talk about die size is. YES, the 8-core Bulldozer is larger than the 4-core Sandy Bridge, but umm, it has twice as many REAL CPU cores, so it should be.

At the end of the day AMD has the smaller CPU per core than Intel, and so will have more room to improve on that design later on, either by tacking on a GPU or more cores.

Overall it's a better option than Intel's "let's stick more FULL CPUs next to each other and try to shrink the process node as fast as possible so we can stay ahead."

drizek · May 2, 2011

videoclone said:
A 4-core Bulldozer is smaller than a 4-core Sandy Bridge.

This entire discussion only becomes relevant if an 8-core BD is slower than a 4-core SB.

daveybrat · May 2, 2011

drizek said:
This entire discussion only becomes relevant if an 8-core BD is slower than a 4-core SB.

I can't see an 8-core BD being slower than a SB in highly threaded applications. SB might still be faster in apps and games that don't utilize more than 4 cores.

Accord99 · May 2, 2011

videoclone said:
I don't understand what all this talk about die size is. YES, the 8-core Bulldozer is larger than the 4-core Sandy Bridge, but umm, it has twice as many REAL CPU cores, so it should be.

It depends on how real those Bulldozer cores are; right now a Sandy Bridge core is getting close to the throughput of two existing AMD cores at the same frequency.

http://www.anandtech.com/bench/Product/289?vs=85

A 3.1GHz dual-core Sandy Bridge has roughly the same throughput as a 2.7 GHz Phenom II X4. But if the number of CPU-heavy threads drops below 4, the 2100 pulls increasingly ahead, highlighting the superiority of Intel's approach. More cores help some things, more powerful cores help everything.

At the end of the day AMD has the smaller CPU per core than Intel, and so will have more room to improve on that design later on, either by tacking on a GPU or more cores.

The CPU core size advantage isn't that significant: at 45nm it took AMD 346 mm^2 to match the throughput of a 263 mm^2 Nehalem.

Rezist · May 2, 2011

videoclone said:
A 4-core Bulldozer is smaller than a 4-core Sandy Bridge.

I don't understand what all this talk about die size is. YES, the 8-core Bulldozer is larger than the 4-core Sandy Bridge, but umm, it has twice as many REAL CPU cores, so it should be.

At the end of the day AMD has the smaller CPU per core than Intel, and so will have more room to improve on that design later on, either by tacking on a GPU or more cores.

Overall it's a better option than Intel's "let's stick more FULL CPUs next to each other and try to shrink the process node as fast as possible so we can stay ahead."

Is it though? I mean, the chip in the image posted above, advertised as an "8-core", is actually a 4-core chip.....

drizek · May 2, 2011

Look on the bottom left. Core 0, Core 1, next to each other in the same module.

gdansk · May 2, 2011

Rezist said:
Is it though? I mean, the chip in the image posted above, advertised as an "8-core", is actually a 4-core chip.....

If you're talking about modules, there are quite clearly two integer cores on each module.

Mopetar · May 2, 2011

Accord99 said:
It depends on how real those Bulldozer cores are; right now a Sandy Bridge core is getting close to the throughput of two existing AMD cores at the same frequency.

http://www.anandtech.com/bench/Product/289?vs=85

Why not actually find two chips closer in clock speed?

Let's take the i3 2100 against the 3.2 GHz x4 (955), both because they are closer in clock speed, and approximately the same price (The i3 is $5 cheaper) on Newegg. I can't find a price for the 910, but the 810 which has almost identical performance is listed for a shade above $90 at TigerDirect, or about $35 cheaper than either the 2100 or 955. It's also worth noting the disparity in die sizes. We should have a better idea how well a Phenom-like core will perform once Llano is released.

I don't dispute that the Intel chip has better performance, but given that AMD is going to be releasing two new architectures shortly, the comparison is a bit disingenuous, given the different processes used to manufacture the chips, among other things.

drizek · May 2, 2011

Unless AMD can make an octal core that overclocks to 4.5GHz+, there is no need to use equivalent clocks in these comparisons.
