What you describe sounds like GPGPU-related stuff. You ignore how the rasterizer fits into this story. The question was: can you force the order of the threads generated by the rasterizer? If so, how?
Intel and NVIDIA do this by guaranteeing the triangle order from the rasterizer, and then having a sort of critical section inside the pixel shader to make sure that the per-pixel operations of each triangle are performed in order as well.
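To make the two-part idea concrete, here is a rough CPU-side model in Python. It is not shader code and says nothing about how any real hardware implements it. The names `render_ordered` and `start_order` are made up for this sketch, threads stand in for pixel-shader invocations, and a single global condition variable stands in for what would be per-pixel synchronisation.

```python
import threading
from collections import defaultdict

def render_ordered(triangles, start_order):
    """Toy model: triangles is a list of pixel sets, indexed by submission order.
    start_order is the (possibly scrambled) order in which worker threads start."""
    # "Rasterizer" part: for every pixel, remember the covering triangles
    # in submission order.
    queue = defaultdict(list)
    for tri_id, pixels in enumerate(triangles):
        for p in pixels:
            queue[p].append(tri_id)

    position = defaultdict(int)   # per pixel: index of the next triangle allowed in
    log = defaultdict(list)       # per pixel: order in which triangles actually wrote
    cond = threading.Condition()  # assumption: one global lock, for simplicity

    def shade(tri_id):
        for p in sorted(triangles[tri_id]):
            with cond:
                # "Critical section" part: wait until this triangle is next
                # in line for pixel p, then do the read-modify-write.
                cond.wait_for(lambda: queue[p][position[p]] == tri_id)
                log[p].append(tri_id)
                position[p] += 1
                cond.notify_all()

    threads = [threading.Thread(target=shade, args=(t,)) for t in start_order]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(log)
```

Even when the threads are started in reverse order, every pixel in this model sees its writes in submission order, because the per-pixel queue was fixed before any shading started. Without that queue, the lock alone would only give mutual exclusion, not ordering.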
If the rasterizer does not know about ROV, it may try to be smart, and triangles might overtake each other. For example, say triangles 0-4 are queued on one cluster and triangles 5-8 on another. Or say triangles 0, 2, 4, etc. are queued on one cluster and triangles 1, 3, 5, etc. on another, and triangles 0, 2, 4 take longer to render than 1, 3, 5. There are many kinds of scenarios where triangle order cannot be solved by just a critical section inside the shader.
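The queueing scenarios above can be put into a small back-of-the-envelope model. This is only a sketch under made-up assumptions: `arrival_order` is a hypothetical helper, each cluster works through its queue strictly one triangle at a time, the per-triangle costs are invented numbers, and a plain lock is assumed to admit triangles in the order they arrive.

```python
def arrival_order(clusters, cost):
    """clusters: list of queues, each a list of triangle ids in queue order.
    cost: dict mapping triangle id to an invented number of time units.
    Returns the order in which triangles reach a shared pixel."""
    events = []
    for queue in clusters:
        t = 0
        for tri in queue:
            t += cost[tri]            # cluster finishes this triangle at time t
            events.append((t, tri))
    # First come, first served; ties are broken by triangle id.
    return [tri for _, tri in sorted(events)]

# Scenario 1: triangles 0-4 on one cluster, 5-8 on another, equal cost.
split = arrival_order([[0, 1, 2, 3, 4], [5, 6, 7, 8]],
                      {i: 1 for i in range(9)})

# Scenario 2: even triangles on one cluster and slow, odd ones on another and fast.
interleaved = arrival_order([[0, 2, 4], [1, 3, 5]],
                            {0: 4, 2: 4, 4: 4, 1: 1, 3: 1, 5: 1})
```

In this model the first scenario has triangle 5 arriving before triangle 1, and the second has all of 1, 3, 5 arriving before triangle 0. A critical section in the shader would faithfully serialize those arrivals, in the wrong order.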
If this is possible with GCN/Mantle, I'd like to see some detailed code explaining how to set up both the rasterizer and the pixel shaders for that. Then we can see how efficient it will be. The most naive solution would just serialize all triangles, making it extremely slow. The critical section is what makes it very efficient, since it only slows things down when pixels actually overlap.
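The difference between the two approaches can be illustrated with a toy cost model. Everything here is an assumption for the sake of the example: `finish_times` and `serialized_time` are hypothetical helpers, the machine has unlimited parallel units, costs are invented, and a triangle is delayed as a whole by any earlier triangle it shares a pixel with (which is pessimistic, since only the overlapping pixels would really have to wait).

```python
def finish_times(triangles, cost):
    """triangles: list of pixel sets in submission order.
    cost: list of invented time units per triangle.
    A triangle only waits for earlier triangles that share a pixel with it."""
    finish = []
    for i, pixels in enumerate(triangles):
        blocked_until = max(
            (finish[j] for j in range(i) if triangles[j] & pixels),
            default=0,
        )
        finish.append(blocked_until + cost[i])
    return finish

def serialized_time(cost):
    """Naive approach: every triangle waits for all earlier ones."""
    return sum(cost)

# Four triangles; only the first and the last overlap, at pixel (0, 0).
tris = [{(0, 0)}, {(5, 5)}, {(9, 9)}, {(0, 0), (1, 0)}]
cost = [2, 2, 2, 2]
ordered_total = max(finish_times(tris, cost))
naive_total = serialized_time(cost)
```

With these numbers the overlap-only ordering finishes in 4 time units against 8 for full serialization, and the gap grows with the number of triangles that do not overlap. It is a model of the argument, not a measurement of any GPU.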