Note that I'm not trying to bash either architecture here, and I may have no clue what I'm talking about (since I don't have any hard information on hand). I merely want to compare the texture systems of the two architectures.
So if the information on the R600's TMUs is correct (I know, I know, it's Fudzilla...), then here's a simple calculation:
Assuming we have a shader program whose only work is fetching textures [1], that these textures are 32-bit, and that the TMUs can have infinite outstanding transactions (which actually isn't much of a problem even if they're not infinite...they just need some minimum number of outstanding transactions), then we have:
(for the sake of simplifying the calculations, we're going to ignore the difference between 1000 and 1024 for memory bases)
R600XT: 757 MHz * 512 bits (ring bus) = 387.6 Gbits/s = 48.4 GB/s (remember we're using the core frequency, since this is the TMU, not the actual memory).
G80GTX: 1350 MHz * 256 bits (32 TMUs * 8 bits [2] per TMU) = 345.6 Gbits/s = 43.2 GB/s.
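For anyone who wants to check the arithmetic, here's the raw calculation in Python (the clock and bus-width figures are the rumored/assumed numbers from above, and the helper function name is mine):

```python
# TMU-side fetch bandwidth, computed from the core clock (not the memory clock),
# using 1000-based units as stated in the text.

def fetch_bandwidth_gb(core_mhz, bus_bits):
    """Peak fetch bandwidth in GB/s for a given core clock and fetch width."""
    return core_mhz * 1e6 * bus_bits / 8 / 1e9

r600xt = fetch_bandwidth_gb(757, 512)   # 512-bit ring bus
g80gtx = fetch_bandwidth_gb(1350, 256)  # 32 TMUs * 8 bits each

print(f"R600XT: {r600xt:.1f} GB/s")  # ~48.4 GB/s
print(f"G80GTX: {g80gtx:.1f} GB/s")  # 43.2 GB/s
```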
On initial inspection, the R600XT has superior fetch bandwidth (although significantly lower than what the memory itself offers). However, note that most opaque polygons only use 3 of the 4 components, which lowers the R600's effective memory bandwidth to 36.3 GB/s, or 12.1 Gsamples/s. Furthermore, if the textures are FP16 or FP32, the effective number of samples decreases by 50% or 75% (to 6.05 or 3.025 Gsamples/s, respectively).
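The same effective-bandwidth reasoning in code (3 of 4 channels useful, sample size scaling with texture format; the function name and parameters are mine):

```python
def effective_samples_gs(raw_gb_s, components=3, bytes_per_component=1):
    """Effective Gsamples/s when only `components` of 4 channels are useful."""
    usable_gb_s = raw_gb_s * components / 4          # the 4th channel is wasted
    bytes_per_sample = components * bytes_per_component
    return usable_gb_s / bytes_per_sample

raw = 48.4  # R600XT TMU-side bandwidth from above, GB/s

print(effective_samples_gs(raw, 3, 1))  # 8-bit/channel: 12.1 Gsamples/s
print(effective_samples_gs(raw, 3, 2))  # FP16: 6.05 Gsamples/s
print(effective_samples_gs(raw, 3, 4))  # FP32: 3.025 Gsamples/s
```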
However, on the G80, barring architectural limitations the public doesn't know about, the full 43.2 GB/s can be used without any wastage, due to its scalar nature. And if my suspicion in [2] is correct, then the number of texture samples per second would remain the same for FP16, or even FP32, on the G80 (this assumes the memory crossbar is fast or fat enough to pump the data back to the TMUs). In other words, there may be no significant cost overhead to using FP16 on the G80 architecture.
This may also explain why the R600 is rumored to be unresponsive to memory overclocks -- since it's not bandwidth bound. (Although the higher frequencies help with memory latency, which may be why overclocking results in a tad of speedup.)
The only saving grace is if the R600's ring bus is fully duplex/bidirectional: assuming the Ring Stops are smart enough to split the data and send it down both rings, the R600's theoretical bandwidth usage can go up to 96.9 GB/s. But if I assume this, I may also need to assume that the G80 architecture supports 16-bit TMUs or bidirectionality.
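Both doubling scenarios fall out of the same multiplication (and note that the doubled G80 figure lands exactly on the GTX's memory bandwidth, which is what footnote [2] is getting at):

```python
# One-way fetch bandwidths from the calculations above, GB/s.
r600_one_way = 48.448
g80_one_way = 43.2

# If the ring bus is fully bidirectional, the R600's usable bandwidth doubles.
print(round(r600_one_way * 2, 1))  # 96.9 GB/s

# If the G80's TMUs fetch 16 bits (or the path is bidirectional), its doubles too.
print(round(g80_one_way * 2, 1))   # 86.4 GB/s -- exactly the GTX memory bandwidth
```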
If anyone sees any problems with my simple analysis, or knows that I have factual errors (please point me to the proper white papers or documentation), please don't hesitate to speak up. Thanks!
Footnotes:
[1] I'm assuming that caching is not an issue here, since it applies equally to both architectures.
[2] I'm wondering if this is the correct figure. I'm inclined to think it's actually 16 bits or higher, since doubling the figure (86.4 GB/s) matches exactly the memory bandwidth of the GTX. However, someone needs to test whether the doubling comes from bidirectionality or from actual 16-bit fetches (one way to do this is to saturate the bus with memory reads using 8-bit or 16-bit fetches).