Hard Ball
Senior member
Originally posted by: MODEL3
Originally posted by: Hard Ball
Originally posted by: MODEL3
there are many reasons:
I agree with most of what you said, except these:
2.
RV770 has 160 shader processors (at core speed) that each can issue up to 5 instructions! (that means 1 in the worst case, 5 in the best case, all of this theoretically)
GT200 has 240 shader processors (at 2.25X core speed) that each can issue up to 2 instructions (this can't be achieved all the time; NV is saying that the GT200 architecture can approach 1.5X)
These are not really valid comparisons.
RV770's individual SPs (for more accuracy, let's call them pipelines, not fully-fledged microprocessors; nor are the G80/82/200 SPs) are not individually scheduled. By the time the instruction stream enters the "SIMD core"'s local instruction-memory store, the instructions are already largely statically arranged for each of the 5-wide VLIW pipelines, both horizontally (in terms of the instructions scheduled to be issued each cycle) and vertically (the ordering of successive cycles of LIWs, arranged by taking into account dependencies and the pipeline length, since there is no bypass/forward mechanism within a single SP, as far as I know). In a way, the R600/RV670/RV770 "SIMD cores" are coarse-grained SIMD superimposed on fine-grained VLIW, and the instruction store has both of these statically arranged parallelisms encoded in the instruction stream prior to the execution of individual "wavefronts" (really just SPMD's individual streams), or whatever they call them these days.
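To make the "statically arranged" part concrete, here is a toy sketch of how a compiler might pack a dependency-annotated instruction list into 5-wide VLIW bundles before the hardware ever sees it. This is my own minimal list-scheduling illustration, not ATI's actual compiler algorithm; the function name and data layout are made up for the example.

```python
def pack_vliw(instrs, width=5):
    """Greedily pack instructions into VLIW bundles of at most `width` slots.

    instrs: list of (name, set_of_dependency_names).
    An instruction may only be placed in a bundle once all of its
    dependencies have completed in an earlier bundle -- the "vertical"
    arrangement; filling up to 5 slots per bundle is the "horizontal" one.
    """
    bundles, done = [], set()
    remaining = list(instrs)
    while remaining:
        issued = []
        for name, deps in remaining:
            if len(issued) < width and deps <= done:
                issued.append((name, deps))
        if not issued:
            raise ValueError("dependency cycle")
        for item in issued:
            remaining.remove(item)
        done |= {n for n, _ in issued}
        bundles.append([n for n, _ in issued])
    return bundles

# A dependent chain a -> b -> c serializes into three 1-wide bundles
# (the worst case from the quoted post)...
print(pack_vliw([("a", set()), ("b", {"a"}), ("c", {"b"})]))
# ...while six independent ops fill one 5-wide bundle plus one leftover.
print(pack_vliw([("i%d" % k, set()) for k in range(6)]))
```

This is exactly why the quoted "1 in the worst case, 5 in the best case" range exists: the achieved width is fixed at compile time by the dependency structure, not discovered dynamically by the hardware.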
The G200's SPs (again, let's call them pipelines) are simply single-issue pipelines containing a high number of stages to reach a certain cycle-time target (higher clocks), and within an SPMD cluster (which NV calls an SM, or something similarly cryptic) they share separate pipelines for certain special fixed-function logic that can also execute most other instructions (not unlike the transcendental units on the RV770). This arrangement is actually similar to the shape one of the upcoming general-purpose designs will take (clustered ALUs/AGUs and a shared FP/SIMD pipeline). There is a limited amount of dynamic instruction scheduling going on for an individual pipeline, outside of the normal warp arrangement; each pipeline acts more or less as a 1.25-wide dynamic superscalar, within the constraints of SPMD, of course.
There is no real sense in which NV can reach 2.0 or even 1.5 instructions/cycle on the cluster on a per-pipeline basis. Perhaps one or two pipelines in a single SPMD cluster can get close to 2.0 a fraction of the time, but taking into account all the pipelines, it's going to be less than 1.25 in any real usage.
I agree with your analysis, if you are talking about applications like games only.
I said in my original post that I meant synthetic benchmarks also (I am not talking about 3DMark/Vantage...; I am talking about benchmarks like ShaderMark, or other custom benchmarks...)
Yes, in real usage in games, the figure of less than or around 1.25X is what I thought also, but in my post I was talking about synthetic benchmarks as well...
No, that's not what I was talking about. It's not about applications or benchmarks, but about the limits placed on computational throughput by the microarchitecture of the "SM", as NV calls it (it would be much more transparent to call it an SPMD cluster), by the programming model used for this and other NV designs since G80, and by the interaction of the two. Essentially, it refers to the limits of the microarchitecture itself: the highest computational throughput that would be possible given an ideal instruction stream through the SPMD cluster.
Clearly I have not said enough to help you understand what's going on. Maybe this needs to be done over multiple posts, if we need to get some more fundamental things right about microarchitectures.
I see where you are getting the "dual issue" that you referred to; it's not exactly what I thought originally, but closely related. The "dual issue" that NV referred to in their PR blurbs, which I just read (I work in computer architecture, so I usually only read technical documents, not public presentations or press releases), is not really ISSUE at all; it refers to the way that an SPMD program can be scheduled on the cluster.
An SPMD cluster (an SM in NV's vocabulary) on G80 as well as G200 issues instructions in groups of threads tied to the same SPMD instruction stream, which NV calls a warp; usually a minimum of 6 such streams of 32 threads each are scheduled on a single SM through fine-grained multithreading (a thread/stream switch every cycle). The local instruction store (not exactly an I-cache, since it can be explicitly manipulated by the global scheduler of the overall compute core) contains these SPMD streams; the 32 threads in a single stream are directed down a single control path (no predication in the way that Larrabee does branches), which means that they execute the same instruction i in the stream in lockstep from the perspective of the program, even if they are on temporally different cycles in hardware. The cluster's scheduler runs at the full core clock (what NV calls the shader domain; the non-shader domain is largely the non-programmable portion of the GPU), as does each of the execution units in the 8 regular pipelines as well as the SFUs in the cluster.
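Two of the numbers above can be sketched as a toy model (my own illustration, using only the figures quoted in this post, not anything from NV documentation): a 32-thread warp executed 8 lanes at a time, and 6 warps rotated through the scheduler every cycle.

```python
WARP_SIZE = 32   # threads per warp, per the post
LANES = 8        # regular pipelines in one cluster
NUM_WARPS = 6    # minimum streams resident on one SM, per the post

def cycles_per_warp_instruction(warp_size=WARP_SIZE, lanes=LANES):
    # One logical SPMD instruction occupies the lanes for ceil(32/8) = 4
    # hardware cycles; the threads stay in lockstep logically even though
    # they execute on temporally different cycles.
    return -(-warp_size // lanes)  # ceiling division

def warp_schedule(num_warps=NUM_WARPS, cycles=12):
    # Fine-grained multithreading: switch to the next warp every cycle,
    # so one warp's pipeline latency is hidden behind the others.
    return [c % num_warps for c in range(cycles)]

print(cycles_per_warp_instruction())  # 4
print(warp_schedule())  # [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
```

The round-robin here is deliberately simplistic; a real scheduler would also skip warps stalled on memory, but the latency-hiding idea is the same.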
However, the issue ports from the scheduler in the cluster, let's call them 0 and 1, cannot both be issued to on the same cycle, but need to be scheduled on alternate cycles. This is probably done to simplify the way the programming model interacts with the hardware, in principle for the same reason that Intel Larrabee's ring bus can schedule data between a particular pair of nodes only on alternate cycles; there is another reason we will get to shortly. One of the ports is capable of sending a single SPMD instruction to the 8 regular pipelines (remember, the 32 threads are logically in lockstep, so a single cycle certainly provides enough work for the 8 FUs to get the same instruction from the same warp). The alternate cycle is restricted to issuing to the SFUs in the cluster, each of which contains 4 FUs capable of executing FMUL instructions, plus other logic that does more complex fixed functions (transcendentals, interpolation, reciprocals, etc.).
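The alternating-port restriction can be made concrete with a small trace. This is a toy model matching the description above (the port names and stream encoding are mine, not a verified NV design): port 0 (the 8 regular pipelines) may accept an instruction only on even cycles, port 1 (the SFUs) only on odd cycles.

```python
def issue_trace(stream, cycles):
    """Trace which instruction issues each cycle under the alternate-port rule.

    stream: in-order list of ('SP', op) or ('SFU', op) pairs.
    The SP port may issue on even cycles, the SFU port on odd cycles;
    an instruction for the wrong port's cycle simply waits, and the
    scheduler records a stall.
    """
    trace, pending = [], list(stream)
    for cyc in range(cycles):
        port = 'SP' if cyc % 2 == 0 else 'SFU'
        if pending and pending[0][0] == port:
            trace.append((cyc, pending.pop(0)[1]))
        else:
            trace.append((cyc, 'stall'))
    return trace

# An alternating MAD/FMUL stream keeps a port busy every cycle...
print(issue_trace([('SP', 'MAD'), ('SFU', 'FMUL'),
                   ('SP', 'MAD'), ('SFU', 'FMUL')], 4))
# ...while a MAD-only stream stalls on every odd (SFU) cycle.
print(issue_trace([('SP', 'MAD')] * 4, 8))
```

This is the whole trick behind the "dual issue" marketing: the gain only materializes when the compiled stream happens to alternate between SP-class and SFU-class instructions.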
Some simple instructions (mostly ALU instructions) can complete in one cycle on any of the FUs within the regular pipelines. The more complex instructions that the regular pipelines (SPs in NV speak) are capable of, such as multiply-add (and these are the majority case), require two cycles in the pipeline to complete; so do the relatively complex FMUL instructions in each of the SFU's function blocks. This is the other reason why, while the scheduler itself runs at the full core clock, the issue ports to the pipelines and the SFUs only accept instructions on alternate cycles (you can look at it as the issue ports running at half clock, although physically that's not the case). When the SPMD stream has instructions that can take advantage of this alternate-clock, alternate-port scheduling (which NV mistakenly calls "dual issue"), such as the instruction stream MAD, FMUL, MAD, FMUL, MAD, FMUL, ..., it is necessarily the case that 16 instructions can be executed in 2 cycles; so for the sake of the simplicity of the scheduler (and since in the majority of cases an FU can only handle one instruction per 2 cycles), the microarchitecture uses this type of scheduling algorithm in the clusters.
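The "16 instructions in 2 cycles" figure follows from a back-of-envelope count. Treat the FU counts below as this post's assumptions rather than verified specs (in particular, 2 SFU blocks of 4 FMUL-capable FUs each is my reading of the description above):

```python
REGULAR_FUS = 8        # the 8 regular pipelines ("SPs") in one cluster
SFU_BLOCKS = 2         # SFU blocks per cluster (assumed)
FMUL_FUS_PER_SFU = 4   # FMUL-capable FUs inside each SFU block
CYCLES_PER_INSTR = 2   # MAD on an SP and FMUL on an SFU both take 2 cycles

# 8 + 2*4 = 16 FUs, each retiring one 2-cycle instruction, gives
# 16 instructions every 2 cycles for the ideal MAD/FMUL-interleaved stream.
total_fus = REGULAR_FUS + SFU_BLOCKS * FMUL_FUS_PER_SFU
instrs_per_two_cycles = total_fus * 2 // CYCLES_PER_INSTR
print(total_fus, instrs_per_two_cycles)  # 16 16
```

Any stream that cannot keep the SFU port fed on the odd cycles falls short of this ceiling, which is why the sub-1.25 per-pipeline figure from earlier is the realistic number.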
So what NV terms "dual issue", which you probably got from their public presentations and which has probably been reposted on a few internet sources, is not related to dual issue in the traditional superscalar sense at all; it is only "dual" with respect to the restriction on the rate of the SPMD issue ports inside a cluster. I hope I'm making things a little clearer. I can delve into more details of how such microarchitectures work, but that might not be necessary, and might be beyond the scope of this forum.
This post is already too long; I'll answer the rest of yours in another one.