Crazy idea to get around slow video card memory problem

zephyrprime

Diamond Member
Feb 18, 2001
I was reading over at systemlogic.net about simultaneous multithreading and I ended up reading about coarse-grained multithreading too.

Coarse-grained multithreading (CMT) is where a processor has multiple sets of registers, and whenever some sort of blocking situation occurs, e.g. a cache miss, it switches to a different register set and starts executing a different thread! While thread 2 is running, thread 1's memory request is filled. In this way, memory latencies are alleviated.
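
To make the idea concrete, here's a toy simulation of it in Python. It's just a sketch: the latency, burst length, thread count, and amount of work are all made-up numbers, not anything measured from real hardware.

# Toy model of coarse-grained multithreading (CMT): the "core" runs one
# thread until it stalls on a memory access, then either waits (no
# switching) or swaps in another ready register set (CMT).
# All numbers are made up for illustration.

MEMORY_LATENCY = 20   # cycles a thread is blocked after a "cache miss"
WORK_PER_MISS  = 5    # cycles of useful work a thread does between misses
NUM_THREADS    = 4    # hardware register sets / thread contexts
TOTAL_WORK     = 40   # cycles of work each thread must complete

def simulate(switch_on_miss):
    ready_at = [0] * NUM_THREADS   # cycle at which each thread's data arrives
    done     = [0] * NUM_THREADS   # work completed per thread
    cycle, current = 0, 0
    while min(done) < TOTAL_WORK:
        if done[current] >= TOTAL_WORK or ready_at[current] > cycle:
            if switch_on_miss:
                # CMT: pick any thread whose memory request has already returned
                ready = [t for t in range(NUM_THREADS)
                         if done[t] < TOTAL_WORK and ready_at[t] <= cycle]
                if ready:
                    current = ready[0]
                else:
                    # every unfinished thread is waiting on memory; stall
                    cycle = min(ready_at[t] for t in range(NUM_THREADS)
                                if done[t] < TOTAL_WORK)
                    continue
            else:
                if done[current] >= TOTAL_WORK:
                    current += 1              # run the threads back to back
                    continue
                cycle = ready_at[current]     # single context: just wait
                continue
        # do a burst of work, then issue another memory request
        cycle += WORK_PER_MISS
        done[current] += WORK_PER_MISS
        ready_at[current] = cycle + MEMORY_LATENCY
    return cycle

print("no switching :", simulate(False), "cycles")
print("CMT switching:", simulate(True),  "cycles")

With four contexts and these made-up numbers, the switching version covers most of each 20-cycle stall and finishes in roughly a quarter of the cycles of the single-context run.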

Systemlogic.net talked about how there was a supercomputer called the Cray MTA that had no data caches but used an extreme form of multithreading called fine-grained multithreading to overcome memory latency. It had 128 register sets so it could support 128 threads!

So I was thinking about how everyone says that modern video cards are bandwidth limited, and maybe CMT could be very beneficial to video cards. In my analysis, I'm assuming that video cards are really bandwidth AND latency limited and not merely bandwidth limited.
 

zephyrprime

Diamond Member
Feb 18, 2001
I don't really know for certain, but I would think that the extra circuitry would add very little to the die size. The bigger cost may be in development effort. This would be a pretty radical new design feature in the world of graphics. The graphics company would probably try to hire some engineers who had worked on CMT processors in the past. However, it should be noted that similarly difficult things have been done in the past in the world of graphics. A while ago, transform and lighting were added to graphics processors, and I think that's more difficult to implement than CMT. Intel and AMD are working on simultaneous multithreading for their processors.

Given the big potential performance increase and the fact that adding more pipelines or issue width isn't going to do a whole lot at this stage in graphics development, I think CMT would be very cost effective.
 

BurntKooshie

Diamond Member
Oct 9, 1999


<< Systemlogic.net talked about how there was a supercomputer called the Cray MTA that had no data caches but used an extreme form of multithreading called fine-grained multithreading to overcome memory latency. It had 128 register sets so it could support 128 threads! >>



Actually, the Cray MTA uses a hybrid form of multithreading. If there are 128 threads (or more), it will rotate through each thread, giving each thread one cycle of processing time. This style of computing is known as FGM (I called it FMT in the article, but since writing it I've seen it referred to as FGM -- likewise, I saw CMT known as CGM elsewhere.....I'm guessing I should add a glossary to that article). However, should there be fewer than 128 threads, it will then give each thread a few cycles, and then rotate through. Reread page 7 if you're still confused.
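
If it helps, here's roughly how I understand that scheduling policy, written out as a little Python sketch. This is just my reading of the scheme, and the burst length for the under-subscribed case is a number I made up:

# Toy sketch of the hybrid scheduling described above (my reading of the
# Cray MTA approach, not vendor documentation).  With a full complement
# of 128 threads it issues from a different thread every cycle; with
# fewer threads it gives each one a short burst before rotating.

HARDWARE_CONTEXTS = 128
BURST_WHEN_UNDERSUBSCRIBED = 4   # assumed burst length, purely illustrative

def next_thread(current, num_threads, cycles_on_current):
    """Return (thread to issue from next, cycles spent on it so far)."""
    if num_threads >= HARDWARE_CONTEXTS:
        # fine-grained: one cycle per thread, round-robin
        return (current + 1) % num_threads, 1
    if cycles_on_current < BURST_WHEN_UNDERSUBSCRIBED:
        # stay on the same thread for a few more cycles
        return current, cycles_on_current + 1
    return (current + 1) % num_threads, 1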



<< I don't really know for certain, but I would think that the extra circuitry would add very little to the die size. The bigger cost may be in development effort. This would be a pretty radical new design feature in the world of graphics. The graphics company would probably try to hire some engineers who had worked on CMT processors in the past. >>



I don't know how radical an idea it is, as, from an academic perspective, this stuff is pretty old. Even SMT is old. It may only now be making its way to commercial processors, so I don't think there are that many engineers who have worked on CMT (CGM) in commercial processors (which isn't to say they haven't worked on them; rather, not all of those designs may have seen the light of day). Also, it should be noted that only IBM's Northstar processors have made commercial use of pure CMT (as Cray's approach is a massively hybrid scheme).



<< Given the big potential performance increase and the fact that adding more pipelines or issue width isn't going to do a whole lot at this stage in graphics development, I think CMT would be very cost effective. >>



I don't think you're really right here. Graphical processing is EXTREMELY (or, as I've heard some engineers put it, "embarrassingly") parallel. Regular code isn't nearly so parallel. The types of code that are best suited for hardware optimizations (such as CMT) are those that tend to have irregular access patterns and end up having cache misses a lot. The Northstar PPC chip from IBM actually switches on an L1 cache miss, not an L2 as I indicated in the article....however at the time, my only information concerning a full-blown CMT design was based on the MAJC architecture, which itself has never made it to market and likely never will (interestingly, at one point I heard that Sun was trying to turn this architecture into a graphics chip architecture).

Graphics cards can keep all the pixel pipelines busy with relative ease, so the problem with making a graphics chip "wider" is not finding more data to munch on; rather, it is deciding whether the ability to issue more pixels through the chip each cycle is worth the added die space, given that the required bandwidth goes up as well. It's a performance maximization / cost minimization issue.
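
To put some rough numbers on that (all of them made-up round figures, not any particular card):

# Back-of-envelope arithmetic for the "wider chip needs more bandwidth"
# point above.  Every figure here is an assumption for illustration.

CLOCK_HZ        = 200e6   # assumed core clock
PIPELINES       = 4       # pixels written per clock
BYTES_PER_PIXEL = 4       # 32-bit colour write
TEXEL_BYTES     = 8       # assumed texture-read traffic per pixel
Z_BYTES         = 8       # assumed Z read + write per pixel

per_pixel = BYTES_PER_PIXEL + TEXEL_BYTES + Z_BYTES
bandwidth = CLOCK_HZ * PIPELINES * per_pixel
print(f"required bandwidth ~ {bandwidth / 1e9:.1f} GB/s")
# Doubling PIPELINES doubles this figure, which is why adding pipelines
# without more memory bandwidth quickly stops paying off.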

For general-purpose microprocessors, on the other hand, especially integer-based ones, the problem is similar, but because the code is not so parallel, the bandwidth required is not so high, and because the code "jumps" a lot more, the benefits of reducing latency become much more apparent than they do on graphics chips. Graphics processing is like a lot of scientific FP code: both can exploit a lot of instruction-level parallelism, but they tend to be bandwidth limited, which also limits the usefulness of adding in more functional units. Generally, integer apps require low latency so that they can issue another instruction, as they tend to be data-dependent compared to graphics/FP code.

While I'm not a graphics buff, I don't (right now) see a reason why they should, in particular, move to a processing paradigm that allows fast swapping of instruction streams, as they (as far as I understand it) tend to be bandwidth bound anyway.

For further proof, look at the Kyro: they chose a mode of rendering that radically reduces the amount of bandwidth needed. If you took an Nvidia chip with the same amount of bandwidth, the Kyro would utterly destroy it. I'd love to be shown otherwise ('cause I need to learn more :D), but I don't understand why graphics chips would be more latency bound than bandwidth bound.
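
Rough arithmetic for why a tile-based approach like that saves so much external bandwidth (the numbers below are assumptions on my part, and I'm ignoring texture traffic entirely):

# Crude per-frame bandwidth comparison: immediate-mode rendering vs. a
# tile-based renderer that keeps Z and intermediate colour on chip.
# All values are made-up assumptions.

PIXELS      = 1024 * 768
OVERDRAW    = 3        # assumed average depth complexity
COLOR_BYTES = 4
Z_BYTES     = 4

# Immediate mode: every overdrawn fragment hits external memory for a
# Z read, and roughly as many write Z and colour back out.
immediate = PIXELS * OVERDRAW * (Z_BYTES + Z_BYTES + COLOR_BYTES)

# Tile-based: Z and intermediate colour live in on-chip tile memory;
# only the final colour of each visible pixel reaches external memory.
tile_based = PIXELS * COLOR_BYTES

print(f"immediate-mode : {immediate / 1e6:.1f} MB per frame")
print(f"tile-based     : {tile_based / 1e6:.1f} MB per frame")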
 

zephyrprime

Diamond Member
Feb 18, 2001


<< I don't understand why graphics chips would be more latency bound than bandwidth bound. >>


That may very well be true. I can't be sure. But I would think that since the rendering of a single pixel requires as many calculations as it does, there would be a significant number of cycles between memory accesses as the pixel is rendered.

Actually, come to think of it, I think that there are significant amounts of time between memory accesses, but that this is hidden using the 4-6 pipelines that graphics cards have. Hmmm, as you said, rendering is extremely parallel in nature, so I expect that the pipelines would have very high utilization. Even if there were latency issues, they would be hidden by so many pipelines, and the bottleneck would then be mostly bandwidth. Each individual pipeline would of course be latency bound, but the chip as a whole would be bandwidth bound.
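
Quick back-of-the-envelope version of that (the latency and work figures are guesses on my part, not measurements):

# Rough "how many contexts does it take to hide the latency" arithmetic.
# Both numbers are assumptions for illustration.

MEMORY_LATENCY_CYCLES = 20   # assumed time for a texture/Z fetch to return
WORK_BETWEEN_FETCHES  = 5    # assumed compute cycles per pixel between fetches

# To keep the execution units busy, you need enough independent pixels
# (pipelines or thread contexts) in flight to cover the stall:
contexts_needed = 1 + MEMORY_LATENCY_CYCLES // WORK_BETWEEN_FETCHES
print(contexts_needed)   # -> 5, roughly the 4-6 pipelines mentioned above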

Nuts, I guess my idea isn't so hot after all. However, I suppose that CMT (CGM?) could be used to reduce the number of pipelines while retaining a particular level of performance.