<<
Systemlogic.net talked about how there was a supercomputer called the Cray MTA that had no data caches but used an extreme form of multithreading called fine-grained multithreading to overcome memory latency. It had 128 register sets so it could support 128 threads! >>
Actually, the Cray MTA uses a hybrid form of multithreading. If there are 128 threads (or more), it rotates through them, giving each thread one cycle of processing time. This style of computing is known as FGM (I called it FMT in the article, but since writing it I've seen it referred to as FGM; likewise, I've seen CMT called CGM elsewhere... I'm guessing I should add a glossary to that article). However, should there be fewer than 128 threads, it gives each thread a few cycles before rotating to the next. Reread page 7 if you're still confused.
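If it helps, here's a rough Python sketch of the difference between the two rotation policies. The 128 register sets are from the article; the quantum length and the software-loop framing are purely my own illustration, not how the MTA hardware actually works:

    # Toy sketch of the two thread-rotation policies described above.
    # The 128 register sets are real; everything else is illustrative.

    REGISTER_SETS = 128

    def fine_grained(threads, cycles):
        """Pure FGM: a different thread gets the pipeline every cycle."""
        return [threads[c % len(threads)] for c in range(cycles)]

    def hybrid(threads, cycles, quantum=4):
        """Hybrid scheme: with fewer threads than register sets, each thread
        keeps the pipeline for a few consecutive cycles before rotating."""
        if len(threads) >= REGISTER_SETS:
            return fine_grained(threads, cycles)  # degenerates to pure FGM
        return [threads[(c // quantum) % len(threads)] for c in range(cycles)]

    print(fine_grained(["T0", "T1", "T2"], 9))
    # ['T0', 'T1', 'T2', 'T0', 'T1', 'T2', 'T0', 'T1', 'T2']
    print(hybrid(["T0", "T1", "T2"], 9))
    # ['T0', 'T0', 'T0', 'T0', 'T1', 'T1', 'T1', 'T1', 'T2']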
<<
I don't really know for certain but I would think that the extra circuitry would add very little to the die size. The bigger cost may be in development effort. This would be a pretty radical new design feature in the world of graphics. The graphics company would probably try to hire some engineers that had worked on CMT processors in the past. >>
I don't know how radical an idea it is: from an academic perspective, this stuff is pretty old. Even SMT is old. It may only now be making its way into commercial processors, so I don't think there are that many engineers who have worked on CMT (CGM) in commercial processors (which isn't to say they haven't worked on such designs, just that not all of them may have seen the light of day). Also, it should be noted that only IBM's Northstar processors have made commercial use of pure CMT (as Cray's approach is a massively hybrid scheme).
<<
Given the big potential performance increase and the fact that adding more pipelines issues isn't going to do a whole lot at this stage in graphics development, I think CMT would be very cost effective. >>
I don't think you're really right here. Graphical processing is EXTREMELY (or, as I've heard some engineers put it, "embarrassingly") parallel. Regular code isn't nearly so parallel. The types of code best suited to hardware optimizations such as CMT are those that tend to have irregular access patterns and therefore miss in the cache a lot. The Northstar PPC chip from IBM actually switches on an L1 cache miss, not an L2 as I indicated in the article; however, at the time my only information about a full-blown CMT design was based on the MAJC architecture, which itself has never made it to market and likely never will (interestingly, at one point I heard that Sun was trying to turn this architecture into a graphics chip architecture).
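To illustrate the switch-on-miss idea (and nothing more: the miss rate and miss latency below are numbers I made up, not Northstar figures), here's a toy model in Python:

    import random

    # Toy model of coarse-grained multithreading (CMT/CGM): the core runs one
    # thread until it takes an L1 miss, then swaps in another ready thread so
    # the pipeline isn't idle for the whole miss latency.  Numbers are made up.

    L1_MISS_LATENCY = 20      # cycles the stalled thread must wait
    MISS_RATE = 0.05          # chance any given instruction misses in L1

    def run_cmt(num_threads, instructions_per_thread, seed=1):
        random.seed(seed)
        remaining = [instructions_per_thread] * num_threads
        stalled_until = [0] * num_threads
        cycle, current = 0, 0
        while any(r > 0 for r in remaining):
            ready = [t for t in range(num_threads)
                     if remaining[t] > 0 and stalled_until[t] <= cycle]
            if not ready:                    # every thread is waiting on memory
                cycle += 1
                continue
            if current not in ready:         # switch to some other ready thread
                current = ready[0]
            remaining[current] -= 1          # issue one instruction this cycle
            if random.random() < MISS_RATE:  # L1 miss: park this thread
                stalled_until[current] = cycle + L1_MISS_LATENCY
            cycle += 1
        return cycle

    print("1 thread :", run_cmt(1, 1000), "cycles")   # miss latency fully exposed
    print("4 threads:", run_cmt(4, 250), "cycles")    # misses hidden by other threads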
Graphics cards can use up all their pixel pipelines with relative ease, so the problem with making a graphics chip issue "wider" is not finding more data to munch on; rather, it is figuring out whether the ability to push more pixels through the chip each cycle is worth the added die space, given that the required bandwidth goes up as well. It's a performance maximization / cost minimization issue.
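A back-of-the-envelope calculation makes the point; the clock rate and bytes-per-pixel figures here are round numbers I'm picking for illustration, not the specs of any particular chip:

    # Very rough bandwidth estimate for a wider pixel pipeline.  All values
    # are illustrative round numbers, not specs of any real chip.

    clock_hz        = 200e6   # 200 MHz core clock
    bytes_per_pixel = 8       # e.g. colour read/write + Z read/write

    for pipelines in (2, 4, 8):
        pixels_per_sec = pipelines * clock_hz
        bandwidth_gb   = pixels_per_sec * bytes_per_pixel / 1e9
        print(f"{pipelines} pipelines -> ~{bandwidth_gb:.1f} GB/s of memory traffic")

    # Doubling the number of pipelines roughly doubles the memory bandwidth
    # needed to keep them fed, which is the cost/benefit question above.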
For general-purpose microprocessors, on the other hand, especially integer-based ones, the problem is similar, but because the code is not so parallel, the bandwidth required is not so high, and because the code "jumps" around a lot more, the benefits of reducing latency become much more apparent than they do on graphics chips. Graphics processing is like a lot of scientific FP code: both can exploit a lot of instruction-level parallelism, but they tend to be bandwidth limited, which also limits the usefulness of adding more functional units. Integer apps, by contrast, generally need low latency so that they can issue the next instruction, as they tend to be more data-dependent than graphics/FP code.
While I'm not a graphics buff, I don't (right now) see a reason why graphics chips in particular should move to a processing paradigm that allows fast swapping of instruction streams, as they tend (as far as I understand it) to be bandwidth bound anyway.
For further proof, look at the Kryo: they chose a mode of rendering that radically reduces the amount of bandwidth needed. If you took an Nvidia chip with the same amount of bandwidth, the Kryo would utterly destroy it. I'd love to be shown otherwise ('cause I need to learn more), but I don't understand why graphics chips would be more latency bound than bandwidth bound.
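To put a made-up number on the Kryo point: a chip that resolves overdraw on-chip writes each visible pixel to external memory only once, while an immediate-mode renderer also writes the pixels that end up hidden, so framebuffer traffic scales roughly with depth complexity. The frame size and depth complexity below are assumptions for illustration only:

    # Toy illustration of why resolving overdraw on chip saves bandwidth.
    # Frame size and depth complexity are made-up round numbers.

    width, height    = 1024, 768
    bytes_per_pixel  = 4        # 32-bit colour
    depth_complexity = 3        # each screen pixel covered by ~3 triangles

    frame_pixels = width * height
    immediate_mode_bytes = frame_pixels * depth_complexity * bytes_per_pixel
    on_chip_resolve_bytes = frame_pixels * bytes_per_pixel   # each pixel written once

    print("immediate-mode framebuffer traffic:", immediate_mode_bytes / 1e6, "MB/frame")
    print("on-chip overdraw removal:          ", on_chip_resolve_bytes / 1e6, "MB/frame")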