TMU comparison on R600 vs G80

dunno99

Note that I'm not trying to bash either architecture here, and I may have no clue what I'm talking about (since I don't have any hard information on hand). I merely want to compare the texture systems of the two architectures.

So if the information on the R600's TMUs is correct (I know, I know, it's Fudzilla...), then here's a simple calculation:

Assume we have a shader program that does nothing but fetch textures [1], that these textures are 32-bit, and that the TMUs can have unlimited outstanding transactions (which actually isn't much of a stretch even if they can't; they just need some minimum number of outstanding transactions). Then we have:

(to simplify the calculations, we're going to ignore the difference between base-1000 and base-1024 units)

R600XT: 757 MHz * 512 bits (ring bus) = 387.6 Gbits/s = 48.4 GB/s (remember we're using the core frequency, since this is the TMU, not the actual memory).

G80GTX: 1350 MHz * 256 bits (32 TMUs * 8 bits [2] per TMU) = 345.6 Gbits/s = 43.2 GB/s.

On first inspection, we see that the R600XT has superior fetch bandwidth (although significantly lower than what's offered by the memory itself). However, note that most opaque polygons only use 3 of the 4 components, which lowers the R600's effective bandwidth to 36.3 GB/s, or 12.1 Gsamples/s. Furthermore, if the textures are FP16 or FP32, the effective sample rate drops by another 50% or 75% (6.05 or 3.025 Gsamples/s, respectively).
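For anyone who wants to poke at the assumptions, here's the back-of-envelope math as a quick Python sketch (the clocks and bus widths are the rumored/public figures above; the 3-of-4-component and FP16/FP32 scalings are my own assumptions):

```python
# Theoretical TMU-side fetch bandwidth, decimal units as simplified above.
def fetch_bandwidth_gb(clock_mhz, bus_bits):
    """GB/s for a given clock (MHz) and bus width (bits)."""
    return clock_mhz * 1e6 * bus_bits / 8 / 1e9

r600_xt = fetch_bandwidth_gb(757, 512)    # ring bus at core clock: ~48.4 GB/s
g80_gtx = fetch_bandwidth_gb(1350, 256)   # 32 TMUs * 8 bits at shader clock: ~43.2 GB/s

# Opaque polygons typically use 3 of 4 components, so a vector-style
# 32-bit fetch wastes a quarter of each transaction on the R600:
r600_effective = r600_xt * 3 / 4          # ~36.3 GB/s
r600_samples = r600_effective / 3         # ~12.1 Gsamples/s at 3 useful bytes/sample

# FP16 / FP32 textures halve / quarter the sample rate:
print(r600_samples / 2, r600_samples / 4) # ~6.05 and ~3.03 Gsamples/s
```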

The G80, on the other hand, barring architectural limitations the public doesn't know about, can use the full 43.2 GB/s without any waste, thanks to its scalar nature. And if my suspicion in [2] is correct, the number of texture samples per second would remain the same for FP16, or even FP32, on the G80 (assuming the memory crossbar is fast or wide enough to pump the data back to the TMUs). In other words, there may be no significant overhead to using FP16 on the G80 architecture.

This may also explain why the R600 is rumored to be unresponsive to memory overclocks -- since it's not bandwidth bound. (Although higher frequencies help with memory latency, which may be why overclocking still yields a bit of speedup.)

The only saving grace would be if the R600's ring bus is fully duplex/bidirectional: assuming the ring stops are smart enough to split the data and send it down both rings, the R600's theoretical bandwidth usage could go up to 96.9 GB/s. But if I assume this, I may also need to assume that the G80 supports 16-bit TMUs or bidirectionality.

If anyone sees any problems with my simple analysis, or spots factual errors (please point me to the proper white papers or documentation), don't hesitate to speak up. Thanks!

Footnotes:

[1] I'm assuming that caching is not an issue here, since it applies equally to both architectures.

[2] I'm wondering if this is the correct figure. I'm inclined to think it's actually 16 bits or higher, since doubling the figure (to 86.4 GB/s) exactly matches the GTX's memory bandwidth. Someone would need to test whether the doubling comes from bidirectionality or from actual 16-bit fetches (one way to do this is to saturate the bus with 8-bit and 16-bit memory reads).
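And to spell out the coincidence in [2] (the 16-bit TMU width is pure speculation on my part; the 900 MHz GDDR3 on a 384-bit bus is the GTX's public memory spec):

```python
# If each of the 32 TMUs fetched 16 bits per shader clock...
tmu_side = 1350e6 * 32 * 16 / 8 / 1e9    # 86.4 GB/s
# ...it would exactly match the GTX's DRAM bandwidth:
dram_side = 900e6 * 2 * 384 / 8 / 1e9    # 86.4 GB/s (900 MHz GDDR3, DDR, 384 bits)
print(tmu_side, dram_side)
```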
 

dunno99

Originally posted by: rmed64
GTX is 320 bit

That's the bus between the 6 ROP partitions and the memory chips, not (necessarily) between the ROPs and the TFs/TMUs. Besides, the GTX is 384 bits, not 320.
 

Munky

I'm not sure if the numbers you derived are comparable. I have a few points to make:

1. The shaders of the GTX are clocked at 1350 MHz (I'm assuming that's what you're referring to), but the rest of the GPU, including the texture units, is clocked at 575 MHz.

2. I see no mention of how many TMUs the R600 has (rumors point to 16); how do you derive texturing bandwidth from the memory bandwidth?

3. We don't know the architectural details and capabilities of the R600's TMUs.

4. Why does the G80 supposedly not take a performance hit when doing FP16 and FP32 texture fetches?

5. How do scalar shaders have any benefit in situations where performance is limited by texture units and/or bandwidth?
 

zephyrprime

Ok, I don't want to be a dick, but this analysis isn't so meaningful. Texture samples are usually 32-bit; they're simple 24-bit + 8-bit-alpha bitmaps. The FP16 and FP32 stuff is usually just intermediate data, which is unaccounted for in this analysis. Also, any texture unit can access any memory channel in the G80 with equal ease because of the crossbar memory switch. Shaders can access memory, but they typically don't access it very much compared to the texture mapping units. Lastly, I'm sorry to tell you this, but calculations of this sort aren't very valid because of the caches.

http://www.techreport.com/reviews/2006q4/geforce-8800/block-g80.gif

I do agree with you on one thing, though. The R600 seems to have a lot of memory bandwidth paired with a relatively small amount of texture power, which is definitely pretty strange. Your comparison of the relative bandwidth limitations between the G80 and R600 is interesting. It's really looking like the R600 was a poorly-thought-out and poorly balanced design. It's definitely not the miracle design people have been looking for since the X800.

(whoops. I said "shader" when I meant to say "texture")
 

blckgrffn

Originally posted by: zephyrprime
Ok, I don't want to be a dick, but this analysis isn't so meaningful. Texture samples are usually 32-bit; they're simple 24-bit + 8-bit-alpha bitmaps. The FP16 and FP32 stuff is usually just intermediate data. Also, any texture unit can access any memory channel with equal ease because of the crossbar memory switch. Shaders can access memory, but they typically don't access it very much compared to the texture mapping units. Lastly, I'm sorry to tell you this, but calculations of this sort aren't very valid because of the caches.

http://www.techreport.com/reviews/2006q4/geforce-8800/block-g80.gif

I do agree with you on one thing, though. The R600 seems to have a lot of memory bandwidth paired with a relatively small amount of shader power. This is definitely pretty strange. Your comparison of the relative bandwidth limitations between the G80 and R600 is interesting. It's really looking like the R600 was a poorly-thought-out and poorly balanced design. It's definitely not the miracle design people have been looking for since the X800.

And for the flagship card, they threw in even more memory bandwidth. The way things look right now, we might have been better off with a cheaper, lower-power 256-bit memory interface. Maybe the numbers wouldn't have looked as promising, but the design would likely have been more balanced.

Nat
 

dunno99

Originally posted by: munky
I'm not sure if the numbers you derived are comparable. I have a few points to make:

1. The shaders of the GTX are clocked at 1350 MHz (I'm assuming that's what you're referring to), but the rest of the GPU, including the texture units, is clocked at 575 MHz.

2. I see no mention of how many TMUs the R600 has (rumors point to 16); how do you derive texturing bandwidth from the memory bandwidth?

3. We don't know the architectural details and capabilities of the R600's TMUs.

4. Why does the G80 supposedly not take a performance hit when doing FP16 and FP32 texture fetches?

5. How do scalar shaders have any benefit in situations where performance is limited by texture units and/or bandwidth?

Hrm...I'm going to refer to the diagram that zephyrprime posted in the post after yours:

http://www.techreport.com/reviews/2006q4/geforce-8800/block-g80.gif

1. As I kinda noted, I'm assuming quite a bit of stuff. All the reports I've gathered seem to indicate that the "shaders" run at 1350 MHz. Now, what counts as part of the shader? According to the diagram, only the L2 cache and the memory fetch controllers appear to sit outside it; the rest seems to belong to the shaders themselves. So I would assume the L1 and the TMUs/TFs are clocked at the same frequency as the computational shader units, rather than with the rest of the chip. This division seems only logical given the wiring and signalling in the diagram. Of course, I have no proof to show, but it doesn't look like anyone else does either.

2. Well, I wasn't actually referring to the R600's TMUs; I was referring to the ring bus width. That may be the actual limitation rather than the TMUs themselves: no matter how the TMUs or the ring bus are set up, we seem to be limited to 512 bits. Although, on second inspection of the ring bus architecture, it would seem that depending on the ring stop topology at any given time, more than 512 bits may be accessible.

3. That is correct, which is why I'm assuming. =) However, it would seem logical that these TMUs are 32 bits wide, since the numbers kinda "add up" (i.e. 32 * 16 = 512).

4. I guess this comes back to the hardware paradigm. nVidia decided to split the "old school" vector-style units into scalar units, and it would only seem logical that their TMUs/TFs follow the same principle. Since I have no reference for the crossbar connecting the shader blocks to the memory controller blocks, I have to assume something plausible, and the only remaining clue is how the frequencies are set up. Given those, the TMUs/TFs would need to be 16 bits wide to "consume" the bandwidth offered by the memory bus. And given the frequency and bandwidth numbers I worked out in my first post, it would seem a little too coincidental if it wasn't intentionally set up that way.

5. Scalar shaders don't, but in traditional chip design (as opposed to ungated or asynchronous design), memory alignment usually comes into play here. Given the design philosophy nVidia's engineers have taken (scalar rather than vector units), the TMUs/TFs should logically be scalar as well, in which case texture lookups can be tailored to be smaller than the usual 32-bit alignment. In ATi's case, since they are still using a vector design, it would seem logical that their TMUs are vector in nature too; that is, every texture request is 32 bits long. For 4-component texturing, that's perfect, but for FP16+ or fewer than 4 components, the efficiency drops.
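To make point 5 concrete, here's a toy model of that vector-fetch argument (entirely my own construction from the "every request is 32 bits" assumption, not vendor data):

```python
# Toy model: a vector TMU always issues fixed 32-bit requests that carry a
# full 4-component texel slot; larger texels just take multiple requests.
def vector_fetch_stats(components_used, bits_per_component):
    texel_bits = 4 * bits_per_component       # the slot always holds 4 components
    fetches = max(1, texel_bits // 32)        # 32-bit requests per texel
    useful_bits = components_used * bits_per_component
    return fetches, useful_bits / (fetches * 32)

print(vector_fetch_stats(4, 8))    # RGBA8:  (1, 1.00) -- perfect
print(vector_fetch_stats(3, 8))    # RGB8:   (1, 0.75) -- alpha slot wasted
print(vector_fetch_stats(4, 16))   # FP16:   (2, 1.00) -- half the sample rate
print(vector_fetch_stats(4, 32))   # FP32:   (4, 1.00) -- quarter the sample rate
```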
 

dunno99

Originally posted by: zephyrprime
Texture samples are usually 32bit. They're simple 24bit+8bit alpha bitmaps.

Actually, having been in game development for a while: texture samples are usually 24 bits. Alpha tends to be a sticky issue, so developers avoid using a lot of it (there will still be some in the significant parts). So with regard to vector loading of textures, I suppose the alpha channel is just wasted.

The FP16 and FP32 stuff is usually just intermediate data.

FP16 and FP32 aren't intermediate data. If I specify a texture with 32 bits per channel (128 bits per texel for 4 components), the channels stay 32 bits. I think what you're referring to is shader precision; nowadays the shader computational units all run at full 32-bit precision internally.

Also, any texture unit can access any memory channel with equal ease because of the crossbar memory switch.

Yes, any texture unit can access any of the memory channels, which is why I was pointing at the R600's bus width rather than at how the ring bus interacts with the memory modules.

Shaders can access memory but they typically don't access memory very much compared to texture mapping units.

Shaders are actually the ones requesting the textures to be accessed. The TMUs are only responsible for fetching the textures.

Lastly, I'm sorry to tell you this but any calculations of this sort isn't so valid because of the caches.

Yes, the caches do play an important role. This is exactly why developers usually sort their scene objects by texture, to get better cache coherency. However, only so many objects share the same textures, so beyond a certain point the texture cache can't help anymore (the lookups will all be misses). So in the worst-case scenario (which happens a lot), the texture cache becomes useless unless textures are prefetched.
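As a trivial illustration of that sorting trick (the names here are made up for the example):

```python
# Group draw calls by texture so consecutive fetches stay cache-warm.
draw_calls = [
    {"texture": "rock.dds",  "mesh": "boulder_01"},
    {"texture": "grass.dds", "mesh": "lawn_03"},
    {"texture": "rock.dds",  "mesh": "cliff_07"},
]
draw_calls.sort(key=lambda d: d["texture"])  # the rock draws now run back-to-back
```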

I do agree with you on one thing though. The R600 seems to have a lot of memory bandwidth paired with a relatively small amount of shader power.

Actually, it seems like the R600 has a lot of shader power, but relatively little texture bandwidth. If one looks at the minimum assumption, the R600 has a 5:1 or 4:1 shader:TMU allocation, while the GTX has a 4:1 or better.

It's really looking like the R600 was a poorly thought out and poorly balanced design. It's definitely not the miracle design people have been looking for since the x800.

But yeah, it seems like there's some bottleneck for the R600 somewhere. =( This is exactly what we should figure out. =)
 

Munky

Not sure if this was part of your calculations, but FP16 textures store 16 bits of info for each color component (RGBA). FP32 textures have 32 bits for each component.

Also, according to this page the texture units are clocked at the base frequency of the gpu.
 

dunno99

Originally posted by: munky
Not sure if this was part of your calculations, but FP16 textures store 16 bits of info for each color component (RGBA). FP32 textures have 32 bits for each component.

Also, according to this page the texture units are clocked at the base frequency of the gpu.

Oh nice, thanks! So I will need to revise the TMU data rate to 575 MHz * 8 thread processing clusters * 4 pixels/cluster * 4 bytes/pixel (since they used the term "pixel" rather than "sample"). That works out to 73.6 GB/s of usable texturing bandwidth, which should also leave some room for writing to textures. I will also need to withdraw my comments about FP16+.

This new figure is higher than the 8-bits-per-component figure from my earlier posts (43.2 GB/s), but lower than the now-defunct assumption of 86.4 GB/s for FP16 transfers. With this updated figure, the G80 should beat ATi's minimum usable bandwidth hands down, with or without the potential waste of reading the alpha channel. Now I'm really beginning to think that the R600's problem lies in ring bus contention.
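Updating the earlier sketch with these numbers (still assuming "pixel" means a 32-bit RGBA sample):

```python
# Texture units at the 575 MHz base clock, per the page linked above:
g80_texturing = 575e6 * 8 * 4 * 4 / 1e9  # 8 clusters * 4 pixels * 4 bytes = 73.6 GB/s
print(g80_texturing)  # between the earlier 43.2 GB/s and 86.4 GB/s estimates
```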
 

TanisHalfElven

Originally posted by: dunno99
Originally posted by: munky
Not sure if this was part of your calculations, but FP16 textures store 16 bits of info for each color component (RGBA). FP32 textures have 32 bits for each component.

Also, according to this page the texture units are clocked at the base frequency of the gpu.

Oh nice, thanks! So I will need to revise the TMU data rate to 575 MHz * 8 thread processing clusters * 4 pixels/cluster * 4 bytes/pixel (since they used the term "pixel" rather than "sample"). That works out to 73.6 GB/s of usable texturing bandwidth, which should also leave some room for writing to textures. I will also need to withdraw my comments about FP16+.

This new figure is higher than the 8-bits-per-component figure from my earlier posts (43.2 GB/s), but lower than the now-defunct assumption of 86.4 GB/s for FP16 transfers. With this updated figure, the G80 should beat ATi's minimum usable bandwidth hands down, with or without the potential waste of reading the alpha channel. Now I'm really beginning to think that the R600's problem lies in ring bus contention.

I am not sure, but I think the R600's ring bus is 1024 bits.
 

blckgrffn

Originally posted by: tanishalfelven
Originally posted by: dunno99
Originally posted by: munky
Not sure if this was part of your calculations, but FP16 textures store 16 bits of info for each color component (RGBA). FP32 textures have 32 bits for each component.

Also, according to this page the texture units are clocked at the base frequency of the gpu.

Oh nice, thanks! So I will need to revise the TMU data rate to 575 MHz * 8 thread processing clusters * 4 pixels/cluster * 4 bytes/pixel (since they used the term "pixel" rather than "sample"). That works out to 73.6 GB/s of usable texturing bandwidth, which should also leave some room for writing to textures. I will also need to withdraw my comments about FP16+.

This new figure is higher than the 8-bits-per-component figure from my earlier posts (43.2 GB/s), but lower than the now-defunct assumption of 86.4 GB/s for FP16 transfers. With this updated figure, the G80 should beat ATi's minimum usable bandwidth hands down, with or without the potential waste of reading the alpha channel. Now I'm really beginning to think that the R600's problem lies in ring bus contention.

I am not sure, but I think the R600's ring bus is 1024 bits.

I remember this in the marketing material, too...
 

dunno99

It's two one-way 512-bit buses, each running in the opposite direction of the other. However, they supposedly take the shortest route, and given how textures are usually accessed, the data all comes from one memory bank and fans out to all the shaders, which limits the bus to an effective 512 bits.
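In numbers, under that single-hot-bank assumption:

```python
# Two counter-rotating 512-bit rings, but shortest-path routing from a
# single hot memory bank keeps the traffic on one ring at a time:
single_source = 757e6 * 512 / 8 / 1e9   # ~48.4 GB/s, as in the opening post
both_rings = 757e6 * 1024 / 8 / 1e9     # ~96.9 GB/s only with balanced traffic
print(single_source, both_rings)
```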