I have always suspected that something internal isn't keeping everything fed. Compare a 5870 and a 5770: the 5870 is basically two 5770s on one die. Twice the SPs, twice the ROPs, twice the TMUs, twice the memory bandwidth, same clock speed. Yet two 5770s in CF were typically faster than a single 5870, even with the CF inefficiencies factored in. You would expect the 5870 to be the faster part, since it has essentially twice the specs and no CF 'tax' to deal with.
After the 2900 series, AMD got rid of the ring bus and went to a hub. I wonder if that is the bottleneck, and the hub cannot always feed the SPs properly. Just a thought, because as you remove SPs, the performance hit is often quite a bit smaller, percentage-wise, than the percentage of SPs lost. That has been the case for a few generations now.
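To put rough numbers on that last point, here is a quick sketch of the comparison I mean. The function name and the SP counts and frame rates below are made up purely for illustration (they are not benchmark results), and it assumes both parts run at the same clocks so that SP count is the only difference.

```python
def sp_scaling(full_sps, cut_sps, full_fps, cut_fps):
    """Return (fraction of SPs removed, fraction of performance lost)."""
    sp_loss = 1 - cut_sps / full_sps
    perf_loss = 1 - cut_fps / full_fps
    return sp_loss, perf_loss


# Hypothetical example: a part with 10% of its SPs disabled, same clocks.
# The frame rates are invented for illustration, not measured.
sp_loss, perf_loss = sp_scaling(full_sps=1600, cut_sps=1440,
                                full_fps=60.0, cut_fps=57.0)

print(f"SPs removed: {sp_loss:.1%}, performance lost: {perf_loss:.1%}")
```

If the SPs were the limiting factor, the two percentages should be close. When the performance loss comes out well below the SP loss, as in this made-up case, it suggests the full chip's SPs were not all being kept busy in the first place.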