Differences between Nvidia/ATI architecture, can anyone explain?

Kakkoii

Senior member
Jun 5, 2009
379
0
0
HD4870: 55nm - 956 Million transistors

Die Size = 256mm²
Bus-w = 256 bit
Shaders = 800
TMU's = 40
ROP's = 16


GTX285: 55nm - 1.4 Billion transistors

Die Size = 470mm²
Bus-w = 512 bit
Shaders = 240
TMU's = 80
ROP's = 32



This is something I'm confused about...

NV's chip has 30% of the shaders ATI's has, but double the number of ROPs and TMUs. How is it that it can perform around the same as the 4870 with far fewer shaders, while at the same time having a much bigger die than the 4870?

Is there something different about how NV's shaders work that makes them take up more room?
Or do ROPs and/or TMUs just take up a lot of die space?


edit: Interesting read for comparing:

NVIDIA GT200 GPU and Architecture Analysis
http://www.beyond3d.com/content/reviews/51/1

AMD R600 GPU and Architecture Analysis
http://www.beyond3d.com/content/reviews/16
 

Keysplayr

Elite Member
Jan 16, 2003
21,219
56
91
That's just it. Two completely different architectures really cannot be compared to one another. It's like comparing a banana to an apple.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Some non-zero percentage of those 1.4B xtors are there to enable CUDA to run more efficiently than it would have otherwise, presumably at some non-zero detriment to the GPU performance the rest of those xtors are there for.

I'd also be surprised if ATI's circuits that they label ROP or TMU are the same functionality as an NV circuit which they label as a ROP or TMU.

There's plenty of experts here though, so I'll just let them weigh in and answer your question in detail.

For starters though, check out this AT article detailing ATI's architecture (and the link therein for NV's).
 

Kakkoii

Senior member
Jun 5, 2009
379
0
0
Originally posted by: Idontcare
Some non-zero percentage of those 1.4B xtors are there to enable CUDA to run more efficiently than it would have otherwise, presumably at some non-zero detriment to the GPU performance the rest of those xtors are there for.

I'd also be surprised if ATI's circuits that they label ROP or TMU are the same functionality as an NV circuit which they label as a ROP or TMU.

There's plenty of experts here though, so I'll just let them weigh in and answer your question in detail.

For starters though, check out this AT article detailing ATI's architecture (and the link therein for NV's).

Thanks for the link. The AT article seems to put things in an easier-to-understand way than the two articles I linked... They're a mental overload lol.
 

OCGuy

Lifer
Jul 12, 2000
27,224
37
91
Originally posted by: Keysplayr
That's just it. Two completely different architectures really cannot be compared to one another. It's like comparing a banana to an apple.

This.

It is like asking why the 2010 Camaro has 100+ more HP than the 2010 Mustang GT, but is only 0.2-0.3 seconds faster 0-60.

Anyone can twist the data to fit their conclusion.
 

Kakkoii

Senior member
Jun 5, 2009
379
0
0
Originally posted by: OCguy
Originally posted by: Keysplayr
That's just it. Two completely different architectures really cannot be compared to one another. It's like comparing a banana to an apple.

This.

It is like asking why the 2010 Camaro has 100+ more HP than the 2010 Mustang GT, but is only 0.2-0.3 seconds faster 0-60.

Anyone can twist the data to fit their conclusion.

I'm not asking for a twist of data to fit a conclusion. Just cold hard facts about the differences in their architecture that would at least help explain the size to performance ratio differences between the two.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: Kakkoii
The AT article seems to put things in an easier to understand way

That's what's great about Anand articles, they are digestible and yet they contain a lot of wonderful knowledge.
 

MODEL3

Senior member
Jul 22, 2009
528
0
0
there are many reasons:

1. ATI & NV count transistors differently (ATI's counting method results in a higher transistor number; maybe NV doesn't count the caches, just a guess, i don't know...)

Just an example (i can go all the way back to 2002 GPUs like the 9700, it doesn't change anything...):

RV770: 956 million transistors, 260mm²
G92b: 753 million transistors, around 270mm²

2.
RV770 has 160 shader processors (at core speed) that can each issue up to 5 instructions! (that means 1 in the worst case, 5 in the best case, all of this theoretically)

GT200 has 240 shader processors (at 2.25X core speed) that can each issue up to 2 instructions (this can't be achieved all the time; NV says the GT200 architecture can approach 1.5X)

So a 750MHz RV770 is 1.2 TFLOPS best case

and a 650MHz GTX285 is 1.06 TFLOPS best case again

The thing is that the minimum perf. for the RV770 SP architecture is way lower than the minimum of the GTX285 SP architecture!
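Those peak figures can be reproduced from the published specs with the standard formula peak = ALUs × FLOPs per ALU per clock × clock. A quick sketch, assuming the usual 2008-era counting (2 FLOPs per ATI ALU for a MAD, 3 FLOPs per NV SP counting the co-issued MUL, and the GTX285's ~1476MHz shader clock):

```python
def peak_gflops(alus, flops_per_alu_per_clock, clock_ghz):
    """Theoretical peak throughput = ALUs x FLOPs/clock x clock (GHz)."""
    return alus * flops_per_alu_per_clock * clock_ghz

# HD 4870: 800 ALUs (160 5-wide SPs), MAD = 2 FLOPs, 750 MHz core clock
rv770 = peak_gflops(800, 2, 0.750)   # 1200 GFLOPS

# GTX 285: 240 SPs, MAD + co-issued MUL = 3 FLOPs, ~1476 MHz shader clock
gt200 = peak_gflops(240, 3, 1.476)   # ~1063 GFLOPS

print(rv770, gt200)
```

This is only the best-case arithmetic rate; as the rest of the post argues, how much of it is reachable differs a lot between the two designs.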

In my opinion a fairly representative analogy for NV/ATI SPs (2008 drivers, 2008 games) is the following:

128 of Nvidia's shader processors (at 2.5X core speed, like G92b) are equal in most cases to 640 of ATI's shader processors (128 5-way, at core speed)!

ATI has already been using the 5-way design for 2.5 years; with time (if ATI keeps this), as developers get more and more familiar with ATI's 5-way tech and ATI gets better at the compiler level, maybe in the future the ratio will change to something like:

128 of Nvidia's shader processors (at 2.5X core speed, like G92b) equal in most cases 480 of ATI's shader processors (96 5-way, at core speed)! (but this is just my opinion!)

For the above equation (128/640), depending on the game or the synthetic benchmark either ATI or NV can be victorious (usually in synthetic benchmarks ATI is way faster...)

EDIT*
SPs are just one part of the GPU design (ROPs & TMUs are also important...)
Of course with time SPs are getting more and more important...
The thing is to have a balanced design in each period, in order to suit the games of the product's lifetime.

 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: MODEL3
there are many reasons:

1. ATI & NV count transistors differently (ATI's counting method results in a higher transistor number; maybe NV doesn't count the caches, just a guess, i don't know...)

Just an example (i can go all the way back to 2002 GPUs like the 9700, it doesn't change anything...):

RV770: 956 million transistors, 260mm²
G92b: 753 million transistors, around 270mm²

They don't really do this though, do they? Most designers don't undersell the xtor count that went into creating their chips. Why would they? Xtor counts are a pretty easy metric to generate data for, it's not like you have to pay someone a full-time salary to come up with the xtor count.

Are you thinking this is the case because you are attempting to rationalize the xtor density delta between the two? (I'm presuming that is why you used that data for your example)

Xtor density tells you very little when comparing two different architectures; there are too many unknowns in that equation. For the same architecture implemented on two or more different process generations it can tell you something, but only if the architecture is practically identical across the nodes.

Understand xtors are planar 2D components. They have a width and a length. Both are variable for any given node.

When we say a process node features an Lgmin of 40nm, for instance, that doesn't mean all xtors implemented in that process node are going to have an Lg of 40nm, it just means they can't be any smaller than that. In any given circuit in the design the xtor's Lg can be microns if need be to serve the purpose of the circuit. And the width is equally variable, ranging from nm's to microns depending on the purpose of the xtor in the circuit.

So comparing xtor density between two different architectures really doesn't tell you anything of relevance when attempting to make such a comparison; too many unknowns in your set of linear equations resulting in no unique solutions (i.e. the sky is the limit when it comes to rationalizing why the xtor densities are different).
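The density gap itself is easy to put numbers on, even if (per the above) those numbers don't explain anything. A quick sketch using the figures quoted in this thread:

```python
def density(xtors_millions, die_mm2):
    """Transistor density in millions of transistors per mm^2."""
    return xtors_millions / die_mm2

# Figures quoted upthread (all 55nm parts)
rv770 = density(956, 260)    # ~3.7 M xtors/mm^2
g92b = density(753, 270)     # ~2.8 M xtors/mm^2
gt200b = density(1400, 470)  # ~3.0 M xtors/mm^2

print(round(rv770, 1), round(g92b, 1), round(gt200b, 1))
```

A ratio like this bundles together process tuning, circuit sizing, SRAM fraction, and clock targets, which is exactly why it has no unique interpretation.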
 

MODEL3

Senior member
Jul 22, 2009
528
0
0
Originally posted by: Idontcare

Are you thinking this is the case because you are attempting to rationalize the xtor density delta between the two? (I'm presuming that is why you used that data for your example)

Exactly, i was just trying to rationalize the difference.

Let me explain a little more why i tried to rationalize it that way.

I can understand that on the same process, xtors have variable width and length, and that when comparing two different architectures there are too many unknown factors in the equation!

But for me the situation is a little strange, let me explain what i mean:

When we compare different architectures, the bigger the chips, the bigger the unknown factors in that equation, so by that scaling logic it can lead to bigger differences! (i think)

I took the lowest chips per generation (the ones that have very small die sizes) in order to try to lower just a little bit the impact of the different architectures.

And i also went all the way back to 2002 (comparing these low die-size chips) in order to find architectures that are much simpler than today's, again in my effort to try to lower just a little bit the impact of the different architectures.

What i found strange is that in every freakin year (the last 7 years), NV has a worse transistor count/die size ratio even for these low-end parts, even for much simpler designs than today's!

Don't you think that this is strange to happen every freakin year? (i am talking about the same process of course)

I guess the logical thing would be for NV to have a higher transistor count/die size ratio sometimes?

Like you said, there are too many unknown factors in the equation, so probably the analysis i made is meaningless, but i have a feeling that something fishy is going on here (i mean with the way they count the transistors, 7 freakin years even for low-end, much simpler designs? Out of the conspiracy book of course.... :laugh: )

 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: MODEL3
Originally posted by: Idontcare

Are you thinking this is the case because you are attempting to rationalize the xtor density delta between the two? (I'm presuming that is why you used that data for your example)

Exactly, i was just trying to rationalize the difference.

Let me explain a little more why i tried to rationalize it that way.

I can understand that on the same process, xtors have variable width and length, and that when comparing two different architectures there are too many unknown factors in the equation!

But for me the situation is a little strange, let me explain what i mean:

When we compare different architectures, the bigger the chips, the bigger the unknown factors in that equation, so by that scaling logic it can lead to bigger differences! (i think)

I took the lowest chips per generation (the ones that have very small die sizes) in order to try to lower just a little bit the impact of the different architectures.

And i also went all the way back to 2002 (comparing these low die-size chips) in order to find architectures that are much simpler than today's, again in my effort to try to lower just a little bit the impact of the different architectures.

What i found strange is that in every freakin year (the last 7 years), NV has a worse transistor count/die size ratio even for these low-end parts, even for much simpler designs than today's!

Don't you think that this is strange to happen every freakin year? (i am talking about the same process of course)

I guess the logical thing would be for NV to have a higher transistor count/die size ratio sometimes?

Like you said, there are too many unknown factors in the equation, so probably the analysis i made is meaningless, but i have a feeling that something fishy is going on here (i mean with the way they count the transistors, 7 freakin years even for low-end, much simpler designs? Out of the conspiracy book of course.... :laugh: )

Yeah, your analysis is tripping you up because it is way too oversimplified, even though you tried to minimize the differences.

Changes in xtor density have nothing to do with an architecture being simpler to you and me. First we'd need to come to an agreement as to what makes one architecture simpler or more complex than another, and the answer is not xtor count.

Also we have to deconvolve the interaction between xtor density, clockspeed, and power consumption for some circuits. For example the xtor density in SRAM has a direct impact on the SRAM's clockspeed at equivalent Vcc.

I can't begin to even remotely overstate just how complicated and sophisticated the engineering tradeoffs are here when creating an architecture designed for manufacture on a given node.

To attempt to divine anything remotely meaningful from the tea leaves of a xtor density statistic is really an exercise in futility in my opinion unless you make so many assumptions regarding clockspeed/power-consumption/Vcc/cost/timeline-risk/R&D budget/etc that you really are just creating whatever answer you wanted to arrive at from the beginning anyways.

Here's an exercise to assist you in arriving at this conclusion with your own hands: compare all the relevant xtor density, clockspeed, power-consumption, voltage, etc. stats you can get your hands on for a 45nm Yorkfield versus a 45nm Nehalem. In this case you've eliminated the process-node unknowns, because it is the same underlying process technology at work, but you do have two differing architectures to compare.

Engineering ICs is about tradeoffs. Tradeoffs in performance, tradeoffs in power consumption, tradeoffs in cost to manufacture, tradeoffs in the risk to the timeline for time to market, etc. When looking at xtor density, which is an amalgam representing all of these tradeoffs in an undecipherable combination, we truly have no way of determining what tradeoffs were made and why.

We might get lucky and have an engineer in our midst who worked directly on the project and is willing to talk about it, but otherwise we really just have to settle for resigning ourselves to being ignorant of the details and thus not really having any confidence in whatever conclusions we would like to draw from the superficial comparisons.
 

MODEL3

Senior member
Jul 22, 2009
528
0
0
Originally posted by: alyarb
are you saying density is going down?

No, i just said that if you divide (on the same process) the transistor count by the die size per chip,
NV always has the lower ratio! (this is a fact)

Like IDC said, i tried to rationalize why this is happening! (I said that maybe NV is not counting something in their designs, and i gave caches only as a possible example)

Like IDC said there are too many unknown factors in the equation, so this doesn't mean anything concrete.

 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: MODEL3
Originally posted by: alyarb
are you saying density is going down?

No, i just said that if you divide (on the same process) the transistor count by the die size per chip,
NV always has the lower ratio! (this is a fact)

Like IDC said, i tried to rationalize why this is happening! (I said that maybe NV is not counting something in their designs, and i gave caches only as a possible example)

Like IDC said there are too many unknown factors in the equation, so this doesn't mean anything concrete.

But you can work to divine interesting boundary conditions regarding the process tech and/or architecture if you isolate one or the other set of unknowns.

Working within the same architecture but looking at how it scaled between successive nodes is a fun exercise:

CELL B.E. 90nm, 65nm, and 45nm

EE and GS 250nm to 90nm

Nvidia Die and SM Comparison 65nm, 55nm 40nm

X2 MLC vs. X3 MLC at 56nm

65nm Barcelona vs 45nm Shanghai

We just have to be mindful of the fact that we have no insight into the budgetary constraints and timeline constraints that went on behind the scenes for creation of each of these products, to say these results represent a necessary limitation imposed by either the process tech or the architecture would be a false notion on our behalf.

Also for the OP, in hunting down the links above I came across this graphic which highlights the architecture comparisons between GT200 and RV770 to another level of detail.

edit: fixed shanghai/barcelona link, thanks alyarb :thumbsup:
 

alyarb

Platinum Member
Jan 25, 2009
2,425
0
76
you didn't copy/paste the proper graph at the end for shanghai. but kaigai_7.jpg is an excellent chart
 

Hard Ball

Senior member
Jul 3, 2005
594
0
0
Originally posted by: MODEL3
there are many reasons:

agree with most of what you said, except these:

2.
RV770 has 160 shader processors (at core speed) that each can issue up to 5 instructions! (that means 1 in the worst case, 5 at best case, all these theoritically)

GT200 has 240 shader processors (at 2,25X core speed) that each can issue up to 2 instructions (thise can't be achieved all the time, NV is saying that the GT200 architecture can aproach 1,5X)

These are not really valid comparisons.

RV770's individual SPs (for more accuracy, let's call them pipelines, not fully fledged microprocessors; nor are the G80/82/200 SPs) are not individually scheduled; rather, as the instruction stream enters the "SIMD core"'s local instruction-memory store, the instructions are already largely statically arranged for each of the 5-wide VLIW pipelines horizontally (in terms of the instructions scheduled to be issued each cycle) as well as vertically (the ordering of successive cycles of LIWs, arranged by taking into account dependencies and the pipeline length, since there is no bypass/forward mechanism in a single SP, as far as I know). In a way, R600/RV670/RV770 "SIMD cores" are coarse-grained SIMD superimposed on fine-grained VLIW, and the instruction store has both of these statically arranged parallelisms encoded in the instruction stream prior to execution of individual "wavefronts" (really just SPMD's individual streams) or whatever they call them these days.

The G200's SPs (again, let's call them pipelines) are simply single-issue pipelines containing a high number of stages to reach a certain cycle-time target (higher clocks), and have separate shared pipelines within an SPMD cluster (which NV calls SMP or something cryptic) for certain special fixed-function logic that can also execute most other instructions (not too unlike the transcendental units on the RV770). This arrangement is actually similar to the way one of the upcoming general-purpose designs will take shape (clustered ALU/AGUs and a shared FP/SIMD pipeline). There is a limited amount of dynamic instruction scheduling going on for an individual pipeline, outside of the normal warp arrangement; and each pipeline acts more or less as a 1.25-wide dynamic superscalar, within the constraints of SPMD, of course.

There is no real sense in which NV can reach 2.0 or even 1.5 instructions/cycle on the cluster on a per-pipeline basis. Perhaps one or two of these in a single SPMD cluster can get close to 2.0 a fraction of the time, but taking all the pipelines into account, it's going to be less than 1.25 in any real usage.
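The gap between ATI's best and worst case that falls out of this static scheduling can be shown with a toy model: a 5-wide VLIW pipeline finishes work at a rate set by how many independent ops the compiler could pack into each bundle. This is a deliberate simplification (the function and numbers here are illustrative, not from either vendor's compiler):

```python
import math

def vliw_cycles(total_ops, packable_per_bundle, width=5):
    """Toy model: cycles to run total_ops on a `width`-wide VLIW pipeline
    when the compiler can statically pack `packable_per_bundle` independent
    ops into each bundle (capped at the machine width)."""
    slots = min(width, packable_per_bundle)
    return math.ceil(total_ops / slots)

best = vliw_cycles(100, 5)   # fully independent ops: all 5 slots used -> 20 cycles
worst = vliw_cycles(100, 1)  # one dependent chain: 1 slot per cycle -> 100 cycles
print(best, worst)
```

So slot utilization swings between 20% and 100% purely on the parallelism the compiler can find, whereas a scalar pipeline's utilization band is much narrower.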

The thing is that the minimum perf. for the RV770 SP architecture is way lower than the minimum of the GTX285 SP architecture!

This statement does not make much real sense, in a theoretical or practical way. Minimum performance on either of these microarchitectures is going to be heavily dependent on the amount of parallelism existing within any program, as it would be for any SPMD or VLIW machine, and each can be as low as one per cycle per instruction stream; which would be 30 per cycle for G200 and 10 per cycle for RV770, somewhere in the 1-8% of theoretical performance range. But I doubt that's what you really mean; maybe I'm not understanding your definition of "minimum".


Anyways, the biggest difference in the delta of transistor count/performance ratio between these two microarchitectures seems to be the length of the pipelines and the complexity of the multiported register files that have to meet certain frequency targets. NV essentially paid more for higher core clocks. There are obviously other factors such as the size and functionality of the texture units, raster, local/global data store, width of the mem controller, etc., etc.

Edit: actually 30, not 24, had a momentary lapse of basic arithmetic
 

BFG10K

Lifer
Aug 14, 2000
22,709
3,007
126
Originally posted by: Idontcare

They don't really do this though, do they? Most designers don't undersell the xtor count that went into creating their chips. Why would they?
I'm not entirely sure, but I vaguely remember something about one vendor counting the VRAM and/or cache transistors, but the other wasn't. Or something like that.
 

BFG10K

Lifer
Aug 14, 2000
22,709
3,007
126
Originally posted by: Hard Ball

The G200's SPs (again, let's call them pipelines) are simply single-issue pipelines containing a high number of stages to reach a certain cycle-time target (higher clocks),
There is no real sense in which NV can reach 2.0 or even 1.5 instructions/cycle on the cluster on a per-pipeline basis. Perhaps one or two of these in a single SPMD cluster can get close to 2.0 a fraction of the time, but taking all the pipelines into account, it's going to be less than 1.25 in any real usage.
Are you sure about that? Because that wasn't my understanding and the B3D article appears to agree with me. Firstly, on this page they state the SPs are capable of dual-issue:

http://www.beyond3d.com/content/reviews/51/3

Secondly, even with single issue, the fact that the shader clock runs at least twice as fast as the rest of the chip means the SPs can process at least two instructions in the same time as the rest of the units can process one. So in that sense it would be 2x per clock.
 

Hard Ball

Senior member
Jul 3, 2005
594
0
0
Originally posted by: BFG10K
Originally posted by: Hard Ball

The G200's SPs (again, let's call them pipelines) are simply single-issue pipelines containing a high number of stages to reach a certain cycle-time target (higher clocks),
There is no real sense in which NV can reach 2.0 or even 1.5 instructions/cycle on the cluster on a per-pipeline basis. Perhaps one or two of these in a single SPMD cluster can get close to 2.0 a fraction of the time, but taking all the pipelines into account, it's going to be less than 1.25 in any real usage.
Are you sure about that? Because that wasn't my understanding and the B3D article appears to agree with me. Firstly, on this page they state the SPs are capable of dual-issue:

http://www.beyond3d.com/content/reviews/51/3

Dual-issue? Certainly, but issue refers to the number of sets of control signals (each set associated with a single instruction) that can be sent from the ICU of a processor pipeline to FU resources, and secondarily the one or more data sources (depending on the max number of sources specified by the ISA) that can be obtained from the read ports of the register files. Each pipeline can certainly issue two instructions, otherwise the "SP"s would only ever be able to complete at most one instruction per cycle.

The question is whether the execution resources are available at the time, which depends on the instruction type. For the specific types the B3D article talks about (MAD, MUL), there is exactly the resource to do two extra instructions for the SPMD cluster, which is the "another per-SM block of computation units" that B3D is referring to, although in a very vague way that can easily be misinterpreted by someone who's not in computer architecture.

Secondly, even with single issue, the fact that the shader clock runs at least twice as fast as the rest of the chip means the SPs can process at least two instructions in the same time as the rest of the units can process one. So in that sense it would be 2x per clock.

Yes, the core clock is higher; that's completely irrelevant: the instruction store is contained within each SPMD cluster, and is fetched, decoded, and scheduled within the same clock domain where instructions are executed, as is the local data store. So I'm not sure how the "SPs can process at least two instructions in the same time as the rest of the units can process one" bit is relevant to the current discussion at all; it certainly has nothing to do with the issue, execution, or retire width of the SPs or the SPMD clusters.
 

Keysplayr

Elite Member
Jan 16, 2003
21,219
56
91
Originally posted by: Idontcare
Some non-zero percentage of those 1.4B xtors are there to enable CUDA to run more efficiently than it would have otherwise, presumably at some non-zero detriment to the GPU performance the rest of those xtors are there for.

I'd also be surprised if ATI's circuits that they label ROP or TMU are the same functionality as an NV circuit which they label as a ROP or TMU.

There's plenty of experts here though, so I'll just let them weigh in and answer your question in detail.

For starters though, check out this AT article detailing ATI's architecture (and the link therein for NV's).

Approximately 20%, or ~280 million, of the transistor budget in the GT200 was dedicated to GPGPU usage in addition to the existing shader architecture and cannot be utilized for graphics rendering. Conversely, the entire shader domain can be used for GPGPU work. From the horse's mouth (an engineer at Nvidia). So this would technically bring us down to about 1.12 billion transistors.
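If that ~20% figure is right, the arithmetic works out as follows (the 20% share is the estimate quoted above, not an official breakdown):

```python
GT200_XTORS = 1.4e9   # GTX285 total transistor count
GPGPU_SHARE = 0.20    # share dedicated to GPGPU-only logic (estimate quoted above)

gpgpu_xtors = GT200_XTORS * GPGPU_SHARE     # ~280 million
graphics_xtors = GT200_XTORS - gpgpu_xtors  # ~1.12 billion

print(f"GPGPU-only: {gpgpu_xtors/1e6:.0f}M, graphics: {graphics_xtors/1e9:.2f}B")
```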

Now consider the transistor budget for a 512-bit Memory Bus. I don't have a figure for this budget, so I'll leave it to an educated guess by someone else.

As far as die size goes, as IDC mentioned in other threads, this has much to do with the transistor density in the design. AMD may have a greater density per sq. mm. than Nvidia does. IDC could undoubtedly explain this better than most.




 

yacoub

Golden Member
May 24, 2005
1,991
14
81
Originally posted by: OCguy
Originally posted by: Keysplayr
That's just it. Two completely different architectures really cannot be compared to one another. It's like comparing a banana to an apple.

This.

It is like asking why the 2010 Camaro has 100+ more HP than the 2010 Mustang GT, but is only 0.2-0.3 seconds faster 0-60.

Anyone can twist the data to fit their conclusion.

No, that's actually a lot more straightforward. Using the same driver, it comes down to gearing, weight, and traction. :)
But i get what you mean ;)

Back to GPUs though, it sure would be nice to see ATI widen their memory bus. That would really make their cards stand out performance-wise.
 

evolucion8

Platinum Member
Jun 17, 2005
2,867
3
81
Originally posted by: yacoub
No, that's actually a lot more straightforward. Using the same driver, it comes down to gearing, weight, and traction. :)
But i get what you mean ;)

But what if there's a "driver" issue? If the driver is drunk he may crash the car; if the driver found out that his wife is cheating on him, the car may not respond or will perform slowly. :laugh:

Back to GPUs though, it sure would be nice to see ATI widen their bus bandwidth. That would really make their cards stand out performance-wise.

While the Radeon HD 4870 wasn't entirely bandwidth-starved, it barely benefited from the additional bandwidth available. If they can get GDDR5 faster than 4GHz, I don't see how they would need a 512-bit bus, especially with their small-die approach.
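For reference, peak memory bandwidth is effective data rate × bus width / 8, which is why a 256-bit bus with GDDR5 keeps pace with a 512-bit GDDR3 bus. A sketch with the stock clocks of the two cards in this thread (900MHz quad-pumped GDDR5 on the HD 4870, 1242MHz double-pumped GDDR3 on the GTX 285):

```python
def bandwidth_gb_s(effective_mts, bus_bits):
    """Peak bandwidth in GB/s: data rate (MT/s) x bus width in bytes."""
    return effective_mts * (bus_bits / 8) / 1000

hd4870 = bandwidth_gb_s(3600, 256)  # 900 MHz GDDR5, x4 data rate -> 115.2 GB/s
gtx285 = bandwidth_gb_s(2484, 512)  # 1242 MHz GDDR3, x2 data rate -> ~159 GB/s
print(hd4870, gtx285)
```

GDDR5's quadrupled data rate is exactly what let ATI halve the bus width (and the perimeter pad area that goes with it) on a small die.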
 

MODEL3

Senior member
Jul 22, 2009
528
0
0
Originally posted by: Hard Ball
Originally posted by: MODEL3
there are many reasons:

agree with most of what you said, except these:

2.
RV770 has 160 shader processors (at core speed) that can each issue up to 5 instructions! (that means 1 in the worst case, 5 in the best case, all of this theoretically)

GT200 has 240 shader processors (at 2.25X core speed) that can each issue up to 2 instructions (this can't be achieved all the time; NV says the GT200 architecture can approach 1.5X)

These are not really valid comparisons.

RV770's individual SPs (for more accuracy, let's call them pipelines, not fully fledged microprocessors; nor are the G80/82/200 SPs) are not individually scheduled; rather, as the instruction stream enters the "SIMD core"'s local instruction-memory store, the instructions are already largely statically arranged for each of the 5-wide VLIW pipelines horizontally (in terms of the instructions scheduled to be issued each cycle) as well as vertically (the ordering of successive cycles of LIWs, arranged by taking into account dependencies and the pipeline length, since there is no bypass/forward mechanism in a single SP, as far as I know). In a way, R600/RV670/RV770 "SIMD cores" are coarse-grained SIMD superimposed on fine-grained VLIW, and the instruction store has both of these statically arranged parallelisms encoded in the instruction stream prior to execution of individual "wavefronts" (really just SPMD's individual streams) or whatever they call them these days.

The G200's SPs (again, let's call them pipelines) are simply single-issue pipelines containing a high number of stages to reach a certain cycle-time target (higher clocks), and have separate shared pipelines within an SPMD cluster (which NV calls SMP or something cryptic) for certain special fixed-function logic that can also execute most other instructions (not too unlike the transcendental units on the RV770). This arrangement is actually similar to the way one of the upcoming general-purpose designs will take shape (clustered ALU/AGUs and a shared FP/SIMD pipeline). There is a limited amount of dynamic instruction scheduling going on for an individual pipeline, outside of the normal warp arrangement; and each pipeline acts more or less as a 1.25-wide dynamic superscalar, within the constraints of SPMD, of course.

There is no real sense in which NV can reach 2.0 or even 1.5 instructions/cycle on the cluster on a per-pipeline basis. Perhaps one or two of these in a single SPMD cluster can get close to 2.0 a fraction of the time, but taking all the pipelines into account, it's going to be less than 1.25 in any real usage.

I agree with your analysis, if you are talking about applications like games only.

I said in my original post about synthetic benchmarks too (i am not talking about 3DMark/Vantage..., i am talking about benchmarks like Shadermark, or other custom benchmarks...)

Yes, in real usage in games, the 1.25X or less figure is what i thought also, but in my post i was also talking about synthetic benchmarks...

Originally posted by: MODEL3
For the above equation (128/640), depending on the game or the synthetic benchmark either ATI or NV can be victorious (usually in synthetic benchmarks ATI is way faster...)

Nvidia says that their architecture can approach 1.5X; i guess NV is talking about the figure that some tests in synthetic benchmarks like Shadermark (or other custom benchmarks...) can approach...


Originally posted by: Hard Ball
Originally posted by: MODEL3
The thing is that the minimum perf. for the RV770 SP architecture is way lower than the minimum of the GTX285 SP architecture!

This statement does not make much real sense, in a theoretical or practical way. Minimum performance on either of these microarchitectures is going to be heavily dependent on the amount of parallelism existing within any program, as it would be for any SPMD or VLIW machine, and each can be as low as one per cycle per instruction stream; which would be 30 per cycle for G200 and 10 per cycle for RV770, somewhere in the 1-8% of theoretical performance range. But I doubt that's what you really mean; maybe I'm not understanding your definition of "minimum".


Anyways, the biggest difference in the delta of transistor count/performance ratio between these two microarchitectures seems to be the length of the pipelines and the complexity of the multiported register files that have to meet certain frequency targets. NV essentially paid more for higher core clocks. There are obviously other factors such as the size and functionality of the texture units, raster, local/global data store, width of the mem controller, etc., etc.

Edit: actually 30, not 24, had a momentary lapse of basic arithmetic

What do you mean that it does not make sense on a theoretical level?

There are already synthetic benchmarks that have some specific shader-only tests that can show much better results for a 4770 (640) relative to a GTS250 (128),

and there are other specific shader-only tests in synthetic benchmarks that can show worse results for a 4770 (640) relative to a GTS250 (128). (but like i said, for the majority of these specific shader-only tests ATI can show better results...)

As you are saying: "Minimum performance on either of these microarchitectures is going to be heavily dependent on the amount of parallelism existing within any program"!

So for these specific shader-only tests, maybe the amount of parallelism present is leading to those performance results...