Originally posted by: dunno99
Hrm, seems like everyone is arguing over semantics. I guess I'll take a shot at this, but don't flame me if I get it wrong....
Reading chapter 30 of GPU Gems 2 (available for free on nVidia's developer site...or Google for a direct link), it says that the NV45 (GF 6800) has exactly 16 pixel pipelines (and 6 vertex, but we don't care about that right now).
However, the 16 pipelines are divided into units of four (which is also why all the NV4x cards have multiples of 4 pixel pipelines), and fragments are processed in adjacent groups of 4 at a time, all from the same primitive. What I think this implies is that, at primitive borders, the pipeline units are less than fully efficient...as in, each unit may process fewer than 4 pixels (I'm willing to assume that the hardware is smart enough to align the pixels such that it will result in one less "batch" per horizontal scan, if possible). So this means that it isn't really a full 16-pipeline GPU (although the performance penalty is probably minimal, and it can probably make up for it if derivatives are used, compared to other cards).
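To put a rough number on the border-condition loss, here is a minimal sketch. It assumes the simplified picture above (groups of 4 adjacent pixels along one scanline), which is my reading and not necessarily how the hardware actually forms its groups. The function names and the `aligned` switch are made up for illustration.

```python
def groups_touched(x0, x1, width=4, aligned=True):
    """Count the width-pixel groups needed for covered pixels x0 .. x1-1.

    aligned=True  : groups sit at fixed multiples of `width` on screen.
    aligned=False : groups are packed starting at the first covered pixel
                    (the "smart alignment" guessed at in the post).
    """
    n = x1 - x0
    if n <= 0:
        return 0
    if aligned:
        return (x1 - 1) // width - x0 // width + 1
    return -(-n // width)  # ceiling division


def utilization(x0, x1, width=4, aligned=True):
    """Fraction of pipeline slots doing useful work on this span."""
    g = groups_touched(x0, x1, width, aligned)
    return (x1 - x0) / (g * width) if g else 0.0


if __name__ == "__main__":
    # A 6-pixel span starting at x=3 on one scanline.
    print(groups_touched(3, 9, aligned=True), utilization(3, 9, aligned=True))
    print(groups_touched(3, 9, aligned=False), utilization(3, 9, aligned=False))
```

For that 6-pixel span, fixed alignment needs three groups and packing needs two, which is the "one less batch per scan" case. Long spans approach full utilization either way, so the penalty mostly shows up with small triangles.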
Furthermore, each pixel pipeline has two fp32 shader units (ALUs, I presume) in series, of which the first shader unit can have its result substituted by a texture fetch instead. Both shader units process instructions in parallel per clock (assuming no hazards or dependencies). From the looks of it, both shader units are full ALUs. Since they probably both can do vec4, vec3 + scalar, or vec2 + vec2, this would mean that, at most, 4 instructions (two co-issued per shader unit) are processed in parallel at each tick of the clock. This is why latency hiding is especially important, and, I suppose, why each texture fetch should be followed by either more texture fetches or non-dependent instructions. (Note: Because of the deep pipeline structure of these GPUs, branching is basically done via a brute force approach. I believe this is the reason why the NV4x GPUs can only perform 4 nested if/else statements, because by taking all branches, 4 nested branches would equate to 2^4 = 16 different paths...but I'm guessing here.)
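The two bits of arithmetic in that paragraph, written out as a quick sketch. The default values are the post's guesses (two shader units, two co-issued instructions each, all branch paths taken), not confirmed hardware figures, and the function names are just for illustration.

```python
def peak_instructions_per_clock(shader_units=2, coissue_per_unit=2):
    """Upper bound on instructions issued per pipeline per clock,
    assuming no hazards or dependencies."""
    return shader_units * coissue_per_unit


def paths_if_all_branches_taken(nesting_depth):
    """Number of distinct paths when every nested if/else is evaluated
    on both sides (the brute-force guess in the post)."""
    return 2 ** nesting_depth


if __name__ == "__main__":
    per_pipe = peak_instructions_per_clock()
    print(per_pipe)                        # per pipeline
    print(16 * per_pipe)                   # across 16 pipelines
    print(paths_if_all_branches_taken(4))  # 4 nested if/else levels
```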
On the other hand, I'm guessing that the X1900 will have 16 separate pipelines (although they might be able to work together), each able to process 3 separate fragments in parallel. This seems to me to be like the fragment "units" above (I'm guessing the two companies are using the terminology a little differently). So each pipeline is a unit itself, and each unit is composed of 3 individual fragment processors (each being able to work on one fragment at a time). The 16 texture units would mean that, each cycle, only 16 of these 48 fragment processors will get to retrieve data from memory. Given that NV45 has 16 texture units and 16 pixel processors with two full ALUs each, that means a 16*2:16 = 2:1 ALU-to-texture ratio. On the other hand, ATi has 16 texture units and 48 pixel processors with either 1 or 1.5 ALUs each (I don't know which)...this would translate to a 48*1:16 = 3:1 or 48*1.5:16 = 4.5:1 ratio.
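The ratios above can be checked with a few lines. This is only the post's arithmetic, using its guessed unit counts (including the uncertain 1 vs. 1.5 ALUs per processor); `alu_to_tex_ratio` is a made-up helper name.

```python
from fractions import Fraction


def alu_to_tex_ratio(pixel_processors, alus_per_processor, texture_units):
    """Total ALUs divided by texture units, kept as an exact fraction."""
    return Fraction(pixel_processors) * Fraction(alus_per_processor) / texture_units


if __name__ == "__main__":
    print(alu_to_tex_ratio(16, 2, 16))               # NV45 figures from the post
    print(alu_to_tex_ratio(48, 1, 16))               # X1900 guess, 1 ALU each
    print(alu_to_tex_ratio(48, Fraction(3, 2), 16))  # X1900 guess, 1.5 ALUs each
```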
If anyone notices anything wrong, feel free to correct, not flame.