Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

Page 70
Status
Not open for further replies.

nonameo

Diamond Member
Mar 13, 2006
5,902
2
76
June 7th for the new ones? Ok I will wait to see how these stack up.

I'm surprised we haven't gotten much out of leaks yet... I mean, come on. However, there have been no hyped-up statements from AMD so far either; all I've heard is better IPC than Phenom II and 50% more performance with 33% more cores. (Note: the title of this thread is LOL. I think that's pretty much been debunked, right?)
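As a rough sanity check on the "50% more performance with 33% more cores" figure, here is a minimal sketch of the implied per-core gain. It assumes throughput scales linearly with core count, and the 12 -> 16 core counts are only an illustrative way to get a 33% increase, not something stated above.

```python
def per_core_speedup(total_speedup: float, core_ratio: float) -> float:
    """Per-core throughput gain implied by a total gain and a core-count ratio.

    Assumes perfect linear scaling with core count (a simplification).
    """
    return total_speedup / core_ratio


# 50% more performance with 33% more cores (16 vs 12 is an assumed example).
gain = per_core_speedup(1.50, 16 / 12)
print(f"implied per-core gain: {gain:.3f}x")
```

Under those assumptions the claim works out to roughly 12-13% more throughput per core, which also folds in any clock speed difference and is not a pure IPC number.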

Really, I'm more looking forward to Llano though. I think AMD has more money to make there; I just hope they sell all the chips they possibly can (well, I mean... I want them to make money :p). They need it... AMD needs to grow to compete better with Intel.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I wonder if JFAMD could clear this up, can BD do 4 macroOps per core or per module? Or, is this a secret?
As I recall, AMD does not promote modules but cores.

It can do both ;) In recent papers the Bulldozer designers call those ops "Cops" (complex ops, equivalent of an ALU/FP micro-op + a mem op [load/store/load+store]).

Decode: 4 Cops/cycle/module or up to 5 in case of branch fusion (IIRC branch op has to be in last place then)
Issue: 2 ALU ops + 2 AGLU ops per cycle per core plus 4 FP/SIMD ops per cycle per module in the FPU (belonging to both threads)

Bulldozer’s decode unit extracts and decodes up to four x86 instructions per cycle from raw instruction bytes. The decode pipeline converts x86 instructions into Cops that can directly execute on the functional units.

The scheduler picks and schedules four Cops per cycle to the execution units out of order.

on FPU:
AMD designed the Bulldozer FPU to deliver industry-leading performance on HPC, multimedia, and gaming applications. The primary means of achieving such performance is a four-wide, two-way, multithreaded, fully out-of-order FPU, combined with two 128-bit FMAC units supported by a 128-bit high-bandwidth load/store subsystem.
Source: Michael Butler, Leslie Barnes, Debjit Das Sarma, Bob Gelinas, "Bulldozer: An Approach to Multithreaded Compute Performance," IEEE Micro, pp. 6-15, March/April, 2011

Here are some links related to Chuck Moore's comments at Financial Analyst Day 2010, where he mentioned the issue rate of 4 "instructions" per cycle per core, as well as the same decode bandwidth:
http://citavia.blog.de/2010/04/22/p...architecture-as-speculated-8429143/#c12914412

David Kanter's article on BD gives more details if the software optimization manual is too cryptic.
http://realworldtech.com/page.cfm?ArticleID=RWT082610181333
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
It can do both ;) In recent papers the Bulldozer designers call those ops "Cops" (complex ops, equivalent of an ALU/FP micro-op + a mem op [load/store/load+store]).

Decode: 4 Cops/cycle/module or up to 5 in case of branch fusion (IIRC branch op has to be in last place then)
Issue: 2 ALU ops + 2 AGLU ops per cycle per core plus 4 FP/SIMD ops per cycle per module in the FPU (belonging to both threads)


Does this mean the theoretical max-throughput of a module is 4 "cops" a cycle? (Incidentally, so would a core's).

I think this should be enough. I'm doing some massive (100gb+) archiving right now with Winrar, and according to CodeAnalyst average IPC is 0.35 on my Thuban.

This sounds to me like synthetic benchmarks may not react well to Bulldozer :eek: -- but real world performance should be ...(well who knows)...
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,684
2,572
136
Does this mean the theoretical max-throughput of a module is 4 "cops" a cycle?
4, plus whatever you gain from fusing ops. Maybe 6 on average for two threads both running tight loops with two branches that each have a test right before them?
Like:

add
mov
test
jnz (never really taken)
dec
jnz start

(assuming no internal dependencies)
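To make the counting in that loop concrete, here is a small sketch that tallies decoded ops when a flag-setting instruction fuses with the conditional jump right after it. The set of fusible instructions (including `dec`) and the absence of any per-cycle limit on fusions are assumptions that follow the example above, not confirmed Bulldozer behaviour.

```python
# Hypothetical fusion model: a flag-setting op immediately followed by a
# conditional jump decodes as a single op. Which ops can fuse is an assumption.
CONDITIONAL_JUMPS = frozenset({"jz", "jnz", "je", "jne", "jl", "jg"})


def count_cops(instructions, fusible=frozenset({"test", "cmp", "dec"})):
    """Count decoded ops for a straight-line instruction list under that model."""
    cops = 0
    i = 0
    while i < len(instructions):
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if instructions[i] in fusible and nxt in CONDITIONAL_JUMPS:
            i += 2  # the pair counts as one fused op
        else:
            i += 1
        cops += 1
    return cops


loop = ["add", "mov", "test", "jnz", "dec", "jnz"]
cops = count_cops(loop)
print(len(loop), "x86 instructions ->", cops, "decoded ops")
```

With this model the six-instruction loop fits in four decoded ops, which is where the figure of 6 instructions per cycle at a decode rate of 4 comes from. If only one fusion per cycle is possible, the ceiling would be 5 instead.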

I think this should be enough. I'm doing some massive (100gb+) archiving right now with Winrar, and according to CodeAnalyst average IPC is 0.35 on my Thuban.
Note that for a large part of that time, the CPU was twiddling its proverbial thumbs waiting for memory (or even disk...). To improve on that, you'd have to either decrease memory latency (not really possible) or increase IPC during the sections where the processor is actually doing something. Even if the average IPC is <1, upping the peak IPC can help.
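A toy model of that last point, with made-up numbers: total cycles are the busy cycles (instructions divided by the IPC achieved while not stalled) plus a fixed number of stall cycles. The function name and all figures are hypothetical and only illustrate the arithmetic.

```python
def average_ipc(instructions: float, peak_ipc: float, stall_cycles: float) -> float:
    """Average IPC when execution alternates between busy and stalled cycles.

    peak_ipc is the IPC achieved while the core is actually doing work;
    stall_cycles are cycles spent waiting on memory or disk.
    """
    busy_cycles = instructions / peak_ipc
    return instructions / (busy_cycles + stall_cycles)


# Hypothetical workload: 1000 instructions and 2000 stall cycles.
before = average_ipc(1000, peak_ipc=1.0, stall_cycles=2000)
after = average_ipc(1000, peak_ipc=2.0, stall_cycles=2000)
print(f"{before:.3f} -> {after:.3f}")
```

In this example, doubling the IPC of the busy sections lifts the average from about 0.33 to 0.40 even though the stall time is unchanged. The gain is real but well short of 2x, because the stalls dominate.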

As I said before, decode is special here: when the processor stalls on a data dependency, the decoders can keep working. And since data misses tend to inconveniently happen at the beginning of functions, they often can't be hidden well with OOOE. L1 miss = 8 extra decoded instructions. L2 miss = ~40 extra decoded instructions. L3/memory miss = decode until all queues are full.
 

drizek

Golden Member
Jul 7, 2005
1,410
0
71
You lost me there, what do you mean ??

Intel started building quad cores at 65nm, and they have so far been shrinking them with every generation. Now they are selling very small, cheap to manufacture chips because they don't have any real competition forcing them to build more expensive chips.

On the other hand, AMD has pretty much transitioned completely to selling big 6-core chips.

In the graphics world, AMD was doing a lot better than nvidia. They could sell cheap, really nice $100 cards (3850, 4830, 5770) while nvidia was selling huge, power-hungry cards and doing a very bad job of it. Same story again with the 9500/9700, where nvidia responded with the huge 5xxx cards that were a total failure until the 57/5900 series, whereas ATI could get away with just the 9600xt/9800xt.

And again in the A64 days, where Intel was desperate and they started making enormous chips with ridiculous amounts of L3, and then doing the Pentium D.

I don't know what the cause/effect here is, but I think you can find some pretty good correlation between how well a company is doing and how big their chips are.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
So you're saying AMD is desperate and that's why Bulldozer's die size is bigger than Intel's Sandy Bridge??

So die size is related to how desperate the company is?? I guess then NVIDIA is the most desperate company, because of GF110's die size of 520mm^2 :p
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
On the other hand, AMD has pretty much transitioned completely to selling big 6-core chips.

I'll bet you the share of the X6 chips in AMD's product mix is in the single digits.

Far, far away from being "completely transitioned".
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
So you're saying AMD is desperate and that's why Bulldozer's die size is bigger than Intel's Sandy Bridge??

So die size is related to how desperate the company is?? I guess then NVIDIA is the most desperate company, because of GF110's die size of 520mm^2 :p

What is the Bulldozer die size? And what is the source of that information?
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
What is the Bulldozer die size? And what is the source of that information?

30327896.jpg


http://translate.googleusercontent....le.com&usg=ALkJrhhGTDZvNA4ijJjksX-jo2gfajc2wg

Edit: Sorry the link URL was wrong
 

drizek

Golden Member
Jul 7, 2005
1,410
0
71
I'll bet you the share of the X6 chips in AMD's product mix is in the single digits.

Far, far away from being "completely transitioned".

For the Phenoms I meant. Are they even producing Phenom II non-X6 chips anymore? I thought they were selling some gimped quad Thubans to OEMs.

Of course, most of their chips are going to be Athlon IIs at this point, but I was referring to the top end/enthusiast line.

So you're saying AMD is desperate and that's why Bulldozer's die size is bigger than Intel's Sandy Bridge??

Well...

1. Yes, AMD is desperate. I think we can all agree on that.
2. I don't think they are desperate with Bulldozer necessarily. Zambezi is smaller than Barcelona.
3. It is more that Intel is very confident and safe right now that SB is as small as it is, rather than AMD being particularly desperate. Zambezi is "normal sized", SB is small.

I guess then NVIDIA is the most desperate company, because of GF110's die size of 520mm^2

Yes and no.

GF110 is a special case since it is a Fermi card. It is not competing directly with AMD's top-end GPUs. I think that for gamers it is better to get two GF104 cards rather than one GF110. I personally would never buy a GF110 over an AMD or SLI setup, for instance. It does a bunch of stuff that I don't need as a gamer.

Basically, if you have an efficient, high-performance architecture, you get an opportunity to provide the same or better performance as your competitor at a lower production cost. This gives you the ability to make a tidy profit for the next 6-12 months. Your competitor will try to retake the performance crown by gluing two of their chips together, or gluing some cache onto them. These chips are probably not very economical, but from the competitor's point of view they can at least give consumers the impression that they are at the same level as you, and bring prices down to cut into your profits.
 

JSt0rm

Lifer
Sep 5, 2000
27,399
3,948
126
Well, let's hope Bulldozer is competitive in some area. AMD deserves to at least have me wait to see if they can compete. My X2 4200 has been flawless for a long time.
 

drizek

Golden Member
Jul 7, 2005
1,410
0
71
Some stuff about GF104, and big dies, from semiaccurate

As a humorous aside, both [Cypress and GF104] are made on the same process, TSMC’s 40nm, and literally at the same fab. AMD managed to cram 2.15 billion transistors into 334mm^2, about 6.44 million transistors per mm^2. GF104 has 1.95 billion transistors in 367mm^2, about 5.31 million transistors per mm^2. This means AMD’s Evergreen architecture is over 20% more space efficient than GF104 while delivering much more raw performance and vastly more performance per watt. When SemiAccurate teases Nvidia’s layout and physical design teams, it is for a reason.

http://semiaccurate.com/2010/07/21/gf104gtx460-has-huge-die/
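For anyone who wants to check the quoted density numbers, this short sketch just redoes the division using the transistor counts and die areas from the quote. The helper name is made up, and the result says nothing about whether density is a fair measure of efficiency.

```python
def density_m_per_mm2(transistors: float, area_mm2: float) -> float:
    """Transistor density in millions of transistors per square millimetre."""
    return transistors / area_mm2 / 1e6


# Figures as given in the SemiAccurate quote.
cypress = density_m_per_mm2(2.15e9, 334)
gf104 = density_m_per_mm2(1.95e9, 367)
print(f"Cypress: {cypress:.2f} M/mm^2, GF104: {gf104:.2f} M/mm^2")
print(f"density ratio: {cypress / gf104:.3f}")
```

The two densities round to the 6.44 and 5.31 million per mm^2 in the quote, and the ratio comes out a little over 1.2, consistent with the "over 20%" figure.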
 

Mopetar

Diamond Member
Jan 31, 2011
8,510
7,766
136
I don't know what the cause/effect here is, but I think you can find some pretty good correlation between how well a company is doing and how big their chips are.

If there is any, it's only because a smaller die allows you to fit more processors on each wafer and is less susceptible to process defects. Those can lead to improved profits.
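Both effects can be sketched with two textbook approximations: a dies-per-wafer estimate that subtracts an edge-loss term, and a Poisson yield model. The 300mm wafer, the two die areas and the defect density below are hypothetical values chosen only for illustration, not figures for any real chip or process.

```python
import math


def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Common approximation: wafer area / die area, minus an edge-loss term."""
    radius = wafer_diameter_mm / 2
    gross = math.pi * radius**2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(gross - edge_loss)


def poisson_yield(die_area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of dies with zero defects under a simple Poisson model."""
    return math.exp(-die_area_mm2 * defects_per_mm2)


DEFECT_DENSITY = 0.002  # defects per mm^2, a made-up value
good = {}
for area in (150, 300):  # hypothetical small and large dies, in mm^2
    good[area] = dies_per_wafer(300, area) * poisson_yield(area, DEFECT_DENSITY)
    print(f"{area} mm^2: ~{good[area]:.0f} good dies per 300mm wafer")
```

In this model, halving the die area more than doubles the number of good dies per wafer, because the smaller die gains on both the geometry and the yield side.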

Of course a small die size doesn't mean you're making a more efficient chip or that you'll be wildly profitable. If you make a really powerful chip on a big die that beats what the competition has, you'll be able to sell it at a higher price and be more profitable. If you make an underperforming chip on a small die, it's not going to sell well.

As JFAMD has pointed out several times, consumers don't care about the die size. No one makes purchasing decisions based on die size. They look at raw performance, performance/price, performance/watt, or some other more complicated metric.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,684
2,572
136
semiaccurate said:
AMD managed to cram 2.15 billion transistors into 334mm^2, about 6.44 million transistors per mm^2. GF104 has 1.95 billion transistors in 367mm^2, about 5.31 million transistors per mm^2. This means AMD’s Evergreen architecture is over 20% more space efficient than GF104

This quote is just plain wrong. Different kinds of circuits just take more space per transistor, and it's irrelevant if you can get the same performance with more/less sram.

This doesn't change the point about AMD being much more efficient on this product cycle -- the HD6850, which falls between the 460 and its respin 560 in performance, takes this to absurdity, as it's only 255mm^2, or closest in size to the NVidia GTX 550, which is 238mm^2 and competes closest with AMD's 5770, which in turn is only 170mm^2.

AMD is basically countering every NV card except the very top ones with the model that is one cost tier below its opponent.
 

drizek

Golden Member
Jul 7, 2005
1,410
0
71
As JFAMD has pointed out several times, consumers don't care about the die size. No one makes purchasing decisions based on die size. They look at raw performance, performance/price, performance/watt, or some other more complicated metric.

Yes, which is why I bought a GTX 460 instead of an AMD card. Mostly it was performance/price, but it was also just that it had lower idle power draw, and I got a specific card which was said (and proved) to be essentially silent when idle.

SA says the same,

The problem is simple for Nvidia, the economics of this part don’t work out, the underlying architecture is wrong, so the resultant parts start out with an uphill battle. This is a problem for Nvidia, not for the end user. If the GTX460 is priced at a loss, the consumer shouldn’t care, they get a deal, and that is the end of it. Retail buyers rarely care if the part is making a profit for the manufacturer.

So again, as an enthusiast, I find this whole discussion interesting because it tells us a lot about what is going on at the companies, both in terms of their engineering and in terms of their financing. It doesn't really affect my purchasing decisions.

There is one exception though, and that's the environmental angle. Making wafers is quite resource intensive, and with all else being equal, I generally try and support companies who minimize their environmental impact by being more efficient in their manufacturing.
 

videoclone

Golden Member
Jun 5, 2003
1,465
0
0
A 4-core Bulldozer is smaller than a 4-core Sandy Bridge.

I don't understand what all this talk about die size is about. YES, the 8-core Bulldozer is larger than the 4-core Sandy Bridge, but umm, it has twice as many REAL CPU cores, so it should be.

At the end of the day, AMD has a smaller CPU per core than Intel and so will have more room to improve on that design later on, either by tacking on a GPU or more cores.

Overall it's a better option than Intel's "let's stick more FULL CPUs next to each other and try to shrink the process node as fast as possible so we can stay ahead."
 

daveybrat

Elite Member
Super Moderator
Jan 31, 2000
5,822
1,036
126
This entire discussion only becomes relevant if an 8-core BD is slower than a 4-core SB.

I can't see an 8-core BD being slower than a SB in highly threaded applications. SB might still be faster in apps and games that don't utilize more than 4 cores.
 

Accord99

Platinum Member
Jul 2, 2001
2,259
172
106
I don't understand what all this talk about die size is about. YES, the 8-core Bulldozer is larger than the 4-core Sandy Bridge, but umm, it has twice as many REAL CPU cores, so it should be.
It depends on how real those Bulldozer cores are; right now a Sandy Bridge core is getting close to the throughput of two existing AMD cores at the same frequency.

http://www.anandtech.com/bench/Product/289?vs=85

A 3.1GHz dual-core Sandy Bridge has roughly the same throughput as a 2.7GHz Phenom II X4. But if the number of CPU-heavy threads drops below 4, the 2100 pulls increasingly ahead, highlighting the superiority of Intel's approach. More cores help some things; more powerful cores help everything.

At the end of the day, AMD has a smaller CPU per core than Intel and so will have more room to improve on that design later on, either by tacking on a GPU or more cores.
The CPU core size advantage isn't that significant: at 45nm it took AMD 346mm^2 to match the throughput of a 263mm^2 Nehalem.
 

Rezist

Senior member
Jun 20, 2009
726
0
71
A 4-core Bulldozer is smaller than a 4-core Sandy Bridge.

I don't understand what all this talk about die size is about. YES, the 8-core Bulldozer is larger than the 4-core Sandy Bridge, but umm, it has twice as many REAL CPU cores, so it should be.

At the end of the day, AMD has a smaller CPU per core than Intel and so will have more room to improve on that design later on, either by tacking on a GPU or more cores.

Overall it's a better option than Intel's "let's stick more FULL CPUs next to each other and try to shrink the process node as fast as possible so we can stay ahead."


Is it though? I mean, the chip in the image posted above, advertised as an "8-core", is actually a 4-core chip...
 

Mopetar

Diamond Member
Jan 31, 2011
8,510
7,766
136
It depends on how real those Bulldozer cores are; right now a Sandy Bridge core is getting close to the throughput of two existing AMD cores at the same frequency.

http://www.anandtech.com/bench/Product/289?vs=85

Why not actually find two chips closer in clock speed?

Let's take the i3 2100 against the 3.2 GHz x4 (955), both because they are closer in clock speed, and approximately the same price (The i3 is $5 cheaper) on Newegg. I can't find a price for the 910, but the 810 which has almost identical performance is listed for a shade above $90 at TigerDirect, or about $35 cheaper than either the 2100 or 955. It's also worth noting the disparity in die sizes. We should have a better idea how well a Phenom-like core will perform once Llano is released.

I don't dispute that the Intel chip has better performance, but given that AMD is going to be releasing two new architectures shortly, the comparison is a bit disingenuous, given the different processes used to manufacture the chips, among other things.
 

drizek

Golden Member
Jul 7, 2005
1,410
0
71
Unless AMD can make an octal core that overclocks to 4.5GHz+, there is no need to use equivalent clocks in these comparisons.
 