Originally posted by: VIAN
A pipeline, at its simplest, is a way of doing a few things at once: namely fetching data, executing on it, and storing the result, all at the same time. An assembly line is a good example. You can divide those 3 basic steps into many smaller stages, which tends to lower your instructions per clock (IPC). I'm not exactly sure how these divisions are made, however. To get a rough figure for performance, multiply IPC by cycles per second (Hz). IPC usually isn't known to regular people, but it is related in some way to how many stages are in the pipeline.
That was just one pipeline. To get more performance, you can add more pipelines. A pipeline can be thought of as a chip in its own right, so a multi-pipeline chip can be thought of as a multi-core chip, which is what CPUs are starting to do now.
The article looks at some of the stages, or things that are performed within the pipeline. It gets very technical, so it probably wasn't written for most people to understand.
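To put numbers on the rough estimate in the quoted post (IPC multiplied by clock speed), here is a minimal sketch. The function name and both example chips are made up for illustration and are not real parts.

```python
def instructions_per_second(ipc: float, clock_hz: float) -> float:
    """Rough throughput estimate: instructions per clock times clocks per second."""
    return ipc * clock_hz


# Two hypothetical chips (assumed numbers, not measurements):
# a deeply pipelined, high-clock design and a shorter-pipeline, lower-clock one.
deep_pipeline = instructions_per_second(ipc=1.0, clock_hz=3.0e9)
short_pipeline = instructions_per_second(ipc=1.5, clock_hz=2.0e9)

print(f"deep pipeline:  {deep_pipeline:.2e} instructions/s")
print(f"short pipeline: {short_pipeline:.2e} instructions/s")
```

With these made-up numbers the two chips come out the same, which is why clock speed on its own doesn't tell you much.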
Yes, I know about pipelines. What I'm stuck on is what they are composed of and how the units work together to get out a shaded pixel (for example). Basically, the acronyms (ALU, FPU, NRM, etc.) and how each unit actually does its individual job within the pipeline are what I'm not getting. Also "dual issue" and "co-issue", along with how the shader and texture units are utilized (beyond the fact that they shade a pixel; more how they do it).
Also, this paragraph:
"The traditional co-issued pipeline can work as a single vector4 unit, or as separated vector3 and scalar units. NV40's pipelines can also work in a vector2 + vector2 configuration. Dual- and co-issue combined, an NV40 pipe can execute up to four instructions, while having a single all-purpose arithmetic unit only. For explanatory reasons, we will continue to talk about Unit 1 and 2."
As well as all of this:
"From top to bottom: Two different sources can be used as inputs for the pixel pipeline (the rasterizer or the pipeline loopback). A crossbar transmits the required values to the appropriate interpolators in Unit 1. We don't know how many interpolators are implemented in the hardware. (Shader Model 3.0 logically offers 10 interpolators, instead of 8 in version 2.0 / 2.X). Shader Unit 1 has an SFU built in (yellow), and four multiply channels (shown in blue-green). A dedicated unit for texture operations follows (orange).
Actually, the special functions RCP and RSQ are two different units in the hardware, but to keep it simple, we abstract them to a single unit we call SFU#1. Now, the whole shader unit can execute up to two instructions per clock, either SFU+MUL3 ("3" stands for up to three components) or SFU+TEX. If only MUL is needed, up to two independent MULs can be executed, but still any clock cycle is limited to four data channels at maximum, meaning MUL2 and MUL3 cannot be computed in a single cycle. Since some data paths are shared, any unit can also just hand over the data, which means this unit is effectively blocked for this cycle. The result of the special functions RCP and RSQ can be used as input for any of the four multiply-channels.
The TEX unit needs texture coordinates as input. Because this unit does not have access to the input registers, Shader Unit 1 has to channel them through. The input for the TEX operations goes through the MUL channels. It is possible to modify the input with the MUL operation right before the TEX operation in the same clock. In most cases, at least the scalar special function can still be used while a TEX operation is performed.
Subsequently, TEX calculates the LOD (level of detail). Both, coordinates and LOD, are transmitted then to the TMU located near the memory controller (which is not included in our picture). The TMU performs the actual sampling of the texture and returns the sample to an input register for Unit 2. Because any texture sampling comes with a latency of some clocks, the pipeline executes instructions for other quads meanwhile."
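One way to read the issue rules in the quoted passage is as a small checker for which instruction combinations fit into a single clock of Shader Unit 1. This is only a sketch of my reading of the article, not how the hardware decides anything. The names `Instr` and `matches_listed_pairing` are invented, and the check only accepts the pairings the article spells out (SFU+MUL3, SFU+TEX, and two MULs within four channels). It does not model the MUL that feeds TEX in the same clock, or the shared data paths.

```python
from typing import List, Tuple

# (operation, number of components); a hypothetical representation for this sketch.
Instr = Tuple[str, int]

MAX_CHANNELS = 4   # four multiply channels per clock, per the article
MAX_ISSUE = 2      # at most two instructions per clock in Shader Unit 1
KNOWN_OPS = {"SFU", "MUL", "TEX"}


def matches_listed_pairing(instrs: List[Instr]) -> bool:
    """Return True if the instructions match a combination the article lists."""
    if not 1 <= len(instrs) <= MAX_ISSUE:
        return False
    if any(op not in KNOWN_OPS or width < 1 for op, width in instrs):
        return False

    ops = sorted(op for op, _ in instrs)
    mul_channels = sum(width for op, width in instrs if op == "MUL")

    if len(instrs) == 1:
        # A single instruction just has to fit in the four channels.
        return mul_channels <= MAX_CHANNELS
    if ops == ["MUL", "MUL"]:
        # Two independent MULs share the four channels.
        return mul_channels <= MAX_CHANNELS
    if ops == ["MUL", "SFU"]:
        # SFU + MUL3: the MUL may use up to three components.
        return mul_channels <= 3
    if ops == ["SFU", "TEX"]:
        return True
    return False  # anything else is not modeled here


print(matches_listed_pairing([("SFU", 1), ("MUL", 3)]))
print(matches_listed_pairing([("MUL", 2), ("MUL", 3)]))
```

The second call is the article's own example of something that does not fit: MUL2 plus MUL3 would need five channels, and only four exist.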
And finally, on a side note, I was wondering why more pipelines aren't just added to graphics cards to beat the competition. I know they are individual units, unlike in processors, where it's really one long "pipe" with many stages (or so I assume). But do they need a certain clock speed to gain speed from more pipes, and is that what limits it?