When programming GPUs, we typically schedule many thousands of threads, and we can further organize those threads into tiles of threads.
Aside: These concepts also exist in other programming models: in HLSL they are called “threads” and “thread groups”, in CUDA “CUDA threads” and “thread blocks”, and in OpenCL “work items” and “work groups”. But we’ll stick with the C++ AMP terms of “threads” and “tiles (of threads)”.
From a correctness perspective, and in terms of the programming model’s concepts, that is the end of the story.
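For concreteness, here is a minimal C++ AMP sketch of those two concepts (the function name, data size assumption, and tile size of 256 are my own choices for illustration):

```cpp
#include <amp.h>
using namespace concurrency;

// Schedule one thread per element, organized in tiles of 256 threads.
// For brevity, n is assumed to be a multiple of 256.
void double_all(float* data, int n) {
    array_view<float, 1> av(n, data);
    parallel_for_each(av.extent.tile<256>(),
        [=](tiled_index<256> tidx) restrict(amp)
    {
        // tidx.global = this thread's position among all n threads;
        // tidx.local  = its position within its 256-thread tile.
        av[tidx.global] *= 2.0f;
    });
}
```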
The hardware scheduling unit
However, from a performance perspective, it is worth knowing that the hardware groups threads into an additional scheduling unit, which NVIDIA hardware calls a “warp” and AMD hardware calls a “wavefront”; other hardware, not on the market at the time of writing, will probably call it something else. If I’d had my way, they would be called a “team” of threads, but I lost that battle.
A “warp” (or “wavefront”) is the most basic unit of scheduling on an NVIDIA (or AMD) GPU. Equivalent definitions include: “the smallest executable unit of code”, “a group that processes a single instruction over all of its threads at the same time”, and “the minimum size of the data processed in SIMD fashion”.
A “warp” currently consists of 32 threads on NVIDIA hardware; a “wavefront” currently consists of 64 threads on AMD hardware. Each vendor may decide to change that, since this whole concept is an implementation detail, and new hardware vendors may come up with other sizes.
Note that on CPU hardware, this most basic level of parallelism is often expressed as a “vector width” (for example, when using the SSE instructions on Intel and AMD processors). The vector width is characterized by its total number of bits, which you can populate with, for example, a given number of floats or a given number of doubles. Current CPU vector widths are narrower than those of GPU hardware.
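To make the CPU analogy concrete, here is a tiny sketch using SSE intrinsics (my own example): a 128-bit SSE register holds four floats, and a single instruction operates on all four lanes at once, much like a warp/wavefront does across its threads, just at a much smaller width.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// One 128-bit register holds 4 floats; _mm_add_ps performs
// 4 additions with a single instruction.
void add4(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);     // load 4 floats from a
    __m128 vb = _mm_loadu_ps(b);     // load 4 floats from b
    __m128 vr = _mm_add_ps(va, vb);  // 4 lane-wise additions at once
    _mm_storeu_ps(out, vr);          // store 4 results
}
```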
So, without going to the undesirable extreme of tying your implementation to a specific card, a specific family of cards, or a specific hardware vendor’s cards, how can you easily use this information?
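One simple, portable tactic (my own suggestion, not something mandated by C++ AMP) is to pick a tile size that is a multiple of 64: such tiles divide evenly into warps of 32 and wavefronts of 64 alike, so no tile leaves a partially filled scheduling unit on either vendor’s current hardware. A sketch:

```cpp
#include <amp.h>
using namespace concurrency;

// 64 is a multiple of the current warp size (32) and equal to the
// current wavefront size (64), so tiles of this size fill the
// hardware scheduling unit exactly on both vendors' current cards.
static const int tile_size = 64;  // hypothetical choice for illustration
static_assert(tile_size % 64 == 0, "keep tiles warp/wavefront friendly");

void increment_all(const array_view<float, 1>& av) {
    // Assumes av.extent is evenly divisible by tile_size;
    // real code would pad() or handle the remainder.
    parallel_for_each(av.extent.tile<tile_size>(),
        [=](tiled_index<tile_size> tidx) restrict(amp)
    {
        av[tidx.global] += 1.0f;
    });
}
```

Another tactic is the subject of the next section.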
Avoid having diverged warps/wavefronts
Note: below, every occurrence of the term “warp” can be replaced with “wavefront” without changing the meaning of the text; I am just using the shorter of the two terms.
All the threads in a warp execute the same instruction in lock-step; the only difference is the data each thread operates on. So if your code does anything that prevents all the threads in a warp from executing the same instruction, some threads in the warp will be diverged during the execution of that instruction, and you’d be leaving some compute power on the table.
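Here is a sketch of my own contrasting the two cases in C++ AMP. In the first branch, even and odd threads sit side by side in the same warp, so the hardware must execute both paths, masking off part of the warp each time. In the second, the condition is uniform across each aligned group of 64 threads, so on today’s 32- and 64-wide hardware every thread in a warp/wavefront takes the same path:

```cpp
#include <amp.h>
using namespace concurrency;

// Assumes av.extent is a multiple of the 64-thread tile size.
void divergence_demo(const array_view<float, 1>& av) {
    parallel_for_each(av.extent.tile<64>(),
        [=](tiled_index<64> tidx) restrict(amp)
    {
        int i = tidx.global[0];

        // Diverged: neighboring threads disagree, so the warp runs
        // both branches, with half its threads masked off each time.
        if (i % 2 == 0) av[i] += 1.0f;
        else            av[i] -= 1.0f;

        // Not diverged (on current 32/64-wide hardware): the condition
        // is uniform across each aligned group of 64 threads, so every
        // thread in a warp/wavefront takes the same branch.
        if ((i / 64) % 2 == 0) av[i] *= 2.0f;
        else                   av[i] *= 0.5f;
    });
}
```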