Hard Ball
Senior member
Originally posted by: MODEL3
Originally posted by: Hard Ball
Originally posted by: MODEL3
there are many reasons:
I agree with most of what you said, except these:
2.
RV770 has 160 shader processors (at core speed) that each can issue up to 5 instructions! (that means 1 in the worst case, 5 in the best case, all of this theoretically)
GT200 has 240 shader processors (at 2.25X core speed) that each can issue up to 2 instructions (this can't be achieved all the time; NV is saying that the GT200 architecture can approach 1.5X)
These are not really valid comparisons.
RV770's individual SPs (for more accuracy, let's call them pipelines, not fully-fledged microprocessors; nor are the G80/82/200 SPs) are not individually scheduled. By the time the instruction stream enters the "SIMD core"'s local instruction-memory store, the instructions are already largely statically arranged for each of the 5-wide VLIW pipelines, both horizontally (in terms of the instructions scheduled to be issued each cycle) and vertically (the ordering of successive cycles of LIWs, arranged by taking into account dependencies and the pipeline length, since there is no bypass/forward mechanism within a single SP, as far as I know). In a way, the R600/RV670/RV770 "SIMD cores" are coarse-grained SIMD superimposed on fine-grained VLIW, and the instruction store has both of these statically arranged parallelisms encoded in the instruction stream prior to the execution of individual "wavefronts" (really just SPMD's individual streams), or whatever they call them these days.
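To make the "statically arranged" part concrete, here is a toy sketch of how a compiler might pack a dependency-annotated instruction list into 5-wide VLIW bundles before the hardware ever sees it. This is my own minimal list-scheduling illustration, not ATI's actual compiler algorithm; the function name and data layout are made up for the example.

```python
def pack_vliw(instrs, width=5):
    """Greedily pack instructions into VLIW bundles of at most `width` slots.

    instrs: list of (name, set_of_dependency_names).
    An instruction may only be placed in a bundle once all of its
    dependencies have completed in an earlier bundle -- the "vertical"
    arrangement; filling up to 5 slots per bundle is the "horizontal" one.
    """
    bundles, done = [], set()
    remaining = list(instrs)
    while remaining:
        issued = []
        for name, deps in remaining:
            if len(issued) < width and deps <= done:
                issued.append((name, deps))
        if not issued:
            raise ValueError("dependency cycle")
        for item in issued:
            remaining.remove(item)
        done |= {n for n, _ in issued}
        bundles.append([n for n, _ in issued])
    return bundles

# A dependent chain a -> b -> c serializes into three 1-wide bundles
# (the worst case from the quoted post)...
print(pack_vliw([("a", set()), ("b", {"a"}), ("c", {"b"})]))
# ...while six independent ops fill one 5-wide bundle plus one leftover.
print(pack_vliw([("i%d" % k, set()) for k in range(6)]))
```

This is exactly why the quoted "1 in the worst case, 5 in the best case" range exists: the achieved width is fixed at compile time by the dependency structure, not discovered dynamically by the hardware.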
The G200's SPs (again, let's call them pipelines) are simply single-issue pipelines containing a high number of stages to reach a certain cycle-time target (higher clocks), and within an SPMD cluster (which NV calls an SM, or something similarly cryptic) they share separate pipelines for certain special fixed-function logic that can also execute most other instructions (not unlike the transcendental units on the RV770). This arrangement is actually similar to the shape one of the upcoming general-purpose designs will take (clustered ALUs/AGUs and a shared FP/SIMD pipeline). There is a limited amount of dynamic instruction scheduling going on for an individual pipeline, outside of the normal warp arrangement; each pipeline acts more or less as a 1.25-wide dynamic superscalar, within the constraints of SPMD, of course.
There is no real sense in which NV can reach 2.0 or even 1.5 instructions/cycle on the cluster on a per-pipeline basis. Perhaps one or two pipelines in a single SPMD cluster can get close to 2.0 a fraction of the time, but taking into account all the pipelines, it's going to be less than 1.25 in any real usage.
I agree with your analysis, if you are talking about applications like games only.
I said in my original post that I meant synthetic benchmarks also (I am not talking about 3DMark/Vantage...; I am talking about benchmarks like ShaderMark, or other custom benchmarks...)
Yes, in real usage in games, the figure of less than or around 1.25X is what I thought also, but in my post I was talking about synthetic benchmarks as well...
No, that's not what I was talking about. It's not about applications or benchmarks, but about the limits placed on computational throughput by the microarchitecture of the "SM", as NV calls it (it would be much more transparent to call it an SPMD cluster), by the programming model used for this and other NV designs since G80, and by the interaction of the two. Essentially, it refers to the limits of the microarchitecture itself: the highest computational throughput that would be possible given an ideal instruction stream through the SPMD cluster.
Clearly I have not said enough to help you understand what's going on. Maybe this needs to be done over multiple posts, if we need to get some more fundamental things right about microarchitectures.
I see where you are getting the "dual issue" that you referred to; it's not exactly what I thought originally, but closely related. The "dual issue" that NV referred to in their PR blurbs, which I just read (I work in computer architecture, so I usually only read technical documents, not public presentations or press releases), is not really ISSUE at all; it refers to the way that an SPMD program can be scheduled on the cluster.
An SPMD cluster (an SM in NV's vocabulary) on G80 as well as G200 issues instructions in groups of threads tied to the same SPMD instruction stream, which NV calls a warp; usually a minimum of 6 such streams of 32 threads each are scheduled on a single SM through fine-grained multithreading (a thread/stream switch every cycle). The local instruction store (not exactly an I-cache, since it can be explicitly manipulated by the global scheduler of the overall compute core) contains these SPMD streams; the 32 threads in a single stream are directed down a single control path (no predication in the way that Larrabee does branches), which means that they execute the same instruction i in the stream in lockstep from the perspective of the program, even if they are on temporally different cycles in hardware. The cluster's scheduler runs at the full core clock (what NV calls the shader domain; the non-shader domain is largely the non-programmable portion of the GPU), as does each of the execution units in the 8 regular pipelines as well as the SFUs in the cluster.
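Two of the numbers above can be sketched as a toy model (my own illustration, using only the figures quoted in this post, not anything from NV documentation): a 32-thread warp executed 8 lanes at a time, and 6 warps rotated through the scheduler every cycle.

```python
WARP_SIZE = 32   # threads per warp, per the post
LANES = 8        # regular pipelines in one cluster
NUM_WARPS = 6    # minimum streams resident on one SM, per the post

def cycles_per_warp_instruction(warp_size=WARP_SIZE, lanes=LANES):
    # One logical SPMD instruction occupies the lanes for ceil(32/8) = 4
    # hardware cycles; the threads stay in lockstep logically even though
    # they execute on temporally different cycles.
    return -(-warp_size // lanes)  # ceiling division

def warp_schedule(num_warps=NUM_WARPS, cycles=12):
    # Fine-grained multithreading: switch to the next warp every cycle,
    # so one warp's pipeline latency is hidden behind the others.
    return [c % num_warps for c in range(cycles)]

print(cycles_per_warp_instruction())  # 4
print(warp_schedule())  # [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
```

The round-robin here is deliberately simplistic; a real scheduler would also skip warps stalled on memory, but the latency-hiding idea is the same.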
However, the issue ports from the scheduler in the cluster, let's call them 0 and 1, cannot both be issued to on the same cycle, but need to be scheduled on alternate cycles. This is probably done to simplify the way the programming model interacts with the hardware, in principle for the same reason that Intel Larrabee's ring bus can schedule data between a particular pair of nodes only on alternate cycles; there is another reason we will get to shortly. One of the ports is capable of sending a single SPMD instruction to the 8 regular pipelines (remember, the 32 threads are logically in lockstep, so a single cycle certainly provides enough work for the 8 FUs to get the same instruction from the same warp). The alternate cycle is restricted to issuing to the SFUs in the cluster, each of which contains 4 FUs capable of executing FMUL instructions, plus other logic that does more complex fixed functions (transcendentals, interpolation, reciprocals, etc.).
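The alternating-port restriction can be made concrete with a small trace. This is a toy model matching the description above (the port names and stream encoding are mine, not a verified NV design): port 0 (the 8 regular pipelines) may accept an instruction only on even cycles, port 1 (the SFUs) only on odd cycles.

```python
def issue_trace(stream, cycles):
    """Trace which instruction issues each cycle under the alternate-port rule.

    stream: in-order list of ('SP', op) or ('SFU', op) pairs.
    The SP port may issue on even cycles, the SFU port on odd cycles;
    an instruction for the wrong port's cycle simply waits, and the
    scheduler records a stall.
    """
    trace, pending = [], list(stream)
    for cyc in range(cycles):
        port = 'SP' if cyc % 2 == 0 else 'SFU'
        if pending and pending[0][0] == port:
            trace.append((cyc, pending.pop(0)[1]))
        else:
            trace.append((cyc, 'stall'))
    return trace

# An alternating MAD/FMUL stream keeps a port busy every cycle...
print(issue_trace([('SP', 'MAD'), ('SFU', 'FMUL'),
                   ('SP', 'MAD'), ('SFU', 'FMUL')], 4))
# ...while a MAD-only stream stalls on every odd (SFU) cycle.
print(issue_trace([('SP', 'MAD')] * 4, 8))
```

This is the whole trick behind the "dual issue" marketing: the gain only materializes when the compiled stream happens to alternate between SP-class and SFU-class instructions.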
Some simple instructions (mostly ALU instructions) can complete in one cycle on any of the FUs within the regular pipelines. The more complex instructions that the regular pipelines (SPs in NV speak) are capable of, such as multiply-add (and these are the majority case), require two cycles in the pipeline to complete; so do the relatively complex FMUL instructions in each of the SFU's function blocks. This is the other reason why, while the scheduler itself runs at the full core clock, the issue ports to the pipelines and the SFUs only accept instructions on alternate cycles (you can look at it as the issue ports running at half clock, although physically that's not the case). When the SPMD stream has instructions that can take advantage of this alternate-clock, alternate-port scheduling (which NV mistakenly calls "dual issue"), such as the instruction stream MAD, FMUL, MAD, FMUL, MAD, FMUL, ..., it is necessarily the case that 16 instructions can be executed in 2 cycles; so for the sake of the simplicity of the scheduler (and since in the majority of cases an FU can only handle one instruction per 2 cycles), the microarchitecture uses this type of scheduling algorithm in the clusters.
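The "16 instructions in 2 cycles" figure follows from a back-of-envelope count. Treat the FU counts below as this post's assumptions rather than verified specs (in particular, 2 SFU blocks of 4 FMUL-capable FUs each is my reading of the description above):

```python
REGULAR_FUS = 8        # the 8 regular pipelines ("SPs") in one cluster
SFU_BLOCKS = 2         # SFU blocks per cluster (assumed)
FMUL_FUS_PER_SFU = 4   # FMUL-capable FUs inside each SFU block
CYCLES_PER_INSTR = 2   # MAD on an SP and FMUL on an SFU both take 2 cycles

# 8 + 2*4 = 16 FUs, each retiring one 2-cycle instruction, gives
# 16 instructions every 2 cycles for the ideal MAD/FMUL-interleaved stream.
total_fus = REGULAR_FUS + SFU_BLOCKS * FMUL_FUS_PER_SFU
instrs_per_two_cycles = total_fus * 2 // CYCLES_PER_INSTR
print(total_fus, instrs_per_two_cycles)  # 16 16
```

Any stream that cannot keep the SFU port fed on the odd cycles falls short of this ceiling, which is why the sub-1.25 per-pipeline figure from earlier is the realistic number.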
So what NV terms "dual issue", which you probably got from their public presentations and which has probably been reposted on a few internet sources, is not related to dual issue in the traditional superscalar sense at all; it is only "dual" with respect to the restriction on the rate of the SPMD issue ports inside a cluster. I hope I'm making things a little clearer. I can delve into more details of how such microarchitectures work, but that might not be necessary, and might be beyond the scope of this forum.
This post is already too long; I'll answer the rest of yours in another one.