From a dev over at B3D forums:
"All games use texture compression. A common material can for example contain two BC3 (DXT5) textures + one BC5 (BC5 = two DXT5 independent alpha channels). That kind of setup can store 10 channels. For example: RGB color, normal vector (2d tangent space), roughness, specular, ambient occlusion, opacity and height (for simple parallax mapping). That is 3 bytes per pixel. Multiply by 1920x1080 and you get 6.2 MB. That 50 MB figure for material sampling could be true for CGI offline renderers, but not for current generation interactive games.
We have been using deferred rendering since the DX9 era (2007). I fully agree that the g-buffers can get really fat. The most optimal layout (depth + two 8888 g-buffers) barely fits into the Xbox 360's 10 MB of EDRAM, and even then it's slightly sub-HD (1152x720).
You don't need to store position data in the g-buffer at all, since you can reconstruct it from the interpolated camera view vector and the pixel depth (a single multiply-add instruction). Normals are also often stored in 2D, and the third component is reconstructed (for example by using a Lambert azimuthal equal-area projection). Albedo color in the g-buffer does not need to be HDR, because the g-buffer only contains the data sampled from the DXT-compressed material in the [0,1] range (not the final lit result).
A typical high-end game could, for example, have a D24S8 depth buffer (4 bytes) + four 11-11-10 g-buffers (4 bytes each). That's 20 bytes per pixel. If you also have screen-space shadow masks (one 8888 32-bit texture holds masks for four lights) and on average have 12 lights per pixel (three mask textures, so 12 more bytes), you need to fetch 32 bytes per pixel in the deferred lighting shader. One x86 cache line can hold two of these input pixels during the lighting pass. The access pattern is fully linear, so you never miss L1. The "standard" lighting output buffer format is 16f-16f-16f-16f (half-float HDR, 8 bytes per pixel). All the new x86 CPUs have the CVT16/F16C instruction set, so they can convert a 32-bit float vector to a 16-bit float vector in a single instruction. Eight output pixels fit in a single x86 cache line, and again the access pattern is fully linear (we never miss L1, because of the prefetchers). Of course, you wouldn't even want to pollute the L1 cache with output pixels; you'd use streaming stores instead (as you can easily generate whole 64-byte lines one at a time).
The GPU execution cycles are split roughly like this in current-generation games (using our latest game as an example):
- 25% shadow map rendering (we have fully dynamic lighting)
- 25% object rendering to g-buffers
- 30% deferred lighting
- 15% post processing
- 5% others (like virtual texture updates, etc)
Deferred lighting and post processing are pure 2D passes and easy to prefetch perfectly (all fetched data in L1 + streaming stores). In total those passes take 45% of the frame time. Shadow map rendering doesn't do any texture sampling; it just writes/reads the depth buffer. A 16-bit depth buffer (half float) is enough for shadow maps. A 2048x2048 shadow map at 2 bytes per pixel is 8 MB. The whole shadow map fits nicely inside the 15 MB Sandy Bridge-E L3 cache (Haswell-E will likely have even bigger caches), so there are no L3 misses at all in shadow map rendering. Tiled shadow map rendering (256 kB tiles = 512x256 pixels) fits completely inside the Sandy/Ivy/Haswell L2 cache of the CPU cores (and thus all cores can nicely work in parallel without fighting over the shared L3). Shadow map vertex shaders are easy to run efficiently on AVX/FMA (just a 1:1 mapping, nothing fancy, 100% L1 cache hits). So basically the only thing that doesn't suit CPU rendering that well is the object rendering to the g-buffers. And that's only 25% of the whole frame... and even that can be quite elegantly solved (by using a tile-based renderer). I'll write a follow-up on that when I have some more time.
"
Look at how badly g-buffers have sucked anyway, as devs still can't/won't use good-quality AA with them without errors.
I see no excuse to still use ROPs... they suck because they're wasteful, not more efficient, and not programmable. I wish someone would loan me some money so I could direct a team to make a GPU. Once I had enough profits from that, I'd direct a team to make a monitor that's actually good for gaming (and even movies). Maybe I'd partner with an existing corporation, maybe I wouldn't.
Anyway, here's what I'd do with the GPU whose design I'd direct:
I'd attempt to kill off DVI/HDMI by putting only DisplayPort connectors on the back of the card and leaving DVI support out of the display controller and video logic integrated into the GPU.
I'd attempt to give open source a boost by not making it a DX part... I'd have the driver team create an excellent wrapper. That would weaken an already-weakening Microsoft, and future games would be free of fixed feature sets that are behind the times anyway.
I'd end lossy texture compression by having lossless map compression in the texture units and no hardware support for S3TC and its successors... S3TC formats would have to be emulated.
I'd pay Avalanche to port the simulated water in JC2 over to OpenCL or whatever they'd choose.
Once people saw how much better the IQ was, they wouldn't care about the lost performance in the few games released around the same time as the GPU I'd like to make.
I can't believe Intel is pursuing the iGPU... I'd like to make the iGPU cease to be feasible.
What do you think about it?
I know I'll never have the money to do this... I can dream, right?
"All games use texture compression. A common material can for example contain two BC3 (DXT5) textures + one BC5 (BC5 = two DXT5 independent alpha channels). That kind of setup can store 10 channels. For example: RGB color, normal vector (2d tangent space), roughness, specular, ambient occlusion, opacity and height (for simple parallax mapping). That is 3 bytes per pixel. Multiply by 1920x1080 and you get 6.2 MB. That 50 MB figure for material sampling could be true for CGI offline renderers, but not for current generation interactive games.
We have been using deferred rendering since DX9 era (2007). I fully agree that the g-buffers can get really fat. The most optimal layout (depth + two 8888 g-buffers) barely fits to Xbox 360s 10 MB EDRAM, and it's slightly sub-hd (1152x720).
You don't need to store position data into g-buffer at all, since you can reconstruct that by using interpolated camera view vector and pixel depth (single multiply-add instruction). Normals are also often stored in 2d, and the third component is reconstructed (for example by using lambert azimuth equal area projection). Albedo color in g-buffer does not need to be in HDR, because g-buffer only contains the data sampled from the DXT-compressed material in [0,1] range (not the final lighted result).
A typical high end game could for example have D24S8 depth buffer (4 bytes) + four 11-11-10 g-buffers (4 bytes each). That's 20 bytes per pixel. If you also have screen space shadow masks (8888 32 bit texture contains four lights) and on average have 12 lights for each pixel, you need to fetch 32 bytes for each pixel in the deferred lighting shader. One x86 cache line can hold two of these input pixels during lighting pass. The access pattern is fully linear, so you never miss L1. The "standard" lighting output buffer format is 16f-16f-16f-16f (half float HDR, 8 bytes per pixel). All the new x86 CPUs have CVT16 instruction set, so they can convert a 32bit float vector to a 16bit float vector in a single instruction. Eight output pixels fit in a single x86 cache line, and again the address pattern is fully linear (we never miss L1, because of prefetchers). Of course you wouldn't even want to pollute the L1 cache with output pixels and use streaming stores instead (as you can easily generate whole 64 byte lines one at a time).
The GPU execution cycles are split roughly like this is current generation games (using our latest game as a example):
- 25% shadow map rendering (we have fully dynamic lighting)
- 25% object rendering to g-buffers
- 30% deferred lighting
- 15% post processing
- 5% others (like virtual texture updates, etc)
Deferred lighting and post processing are pure 2d passes and easy to prefect perfectly (all fetched data in L1 + streaming stores). In total those passes take 45% of the frame time. Shadow map rendering doesn't do any texture sampling, it just writes/reads depth buffer. 16 bit depth buffer (half float) is enough for shadow maps. 2048x2048 shadow map at 2 bytes per pixel is 8 MB. The whole shadow map fits nicely inside the 15 MB Sandy Bridge E cache (Haswell E will likely have even bigger caches). So, no L3 misses at all in shadow map rendering. Tiled shadow map rendering (256 kB tiles = 512x256 pixels) fits completely inside Sandy/Ivy/Haswell L2 cache of the CPU cores (and thus all cores can nicely work on parallel without fighting to share the L3). Shadow map vertex shaders are easy to run efficiently on AVX/FMA (just 1:1 mapping, nothing fancy, 100% L1 cache hit). So basically the only thing that doesn't suit CPU rendering that well is the object rendering to g-buffers. And that's only 25% of the whole frame... and even that can be quite elegantly solved (by using a tile based renderer). I'll write a follow up on that when I have some more time

Look at how bad g buffers have sucked anyway as devs still can't/won't use good quality AA with it without any errors.
I see no excuse to still use ROPs... they suck because they're wasteful, not more efficient, and not programmable. I wish someone would loan me some money so I could direct a team to make a GPU. Once I had enough profits from that, then I'd go with directing a team to make a monitor that's actually good for gaming (and even movies). Maybe I'd partner with an existing corporation, maybe I wouldn't.
Anyway, what I'd do with the GPU I'd direct the design would be:
attempt to crash DVI/HDMI by having only display ports on the back of the card and not even having DVI support in the display controller and video logic integrated into the GPU.
I'd attempt to give OpenSource a boost by not making it a DX part... I'd have the driver team create an excellent wrapper. That would weaken an already weakening Microsoft and future games would be free of fixed feature sets that are behind the times anyway.
I'd end lossy texture compression by having lossless map compression in the texture units and no hardware support for S3TC and its successors... S3TC formats would have to be emulated.
I'd pay Avalanche to port the simulated water in JC2 over to OpenCL or whatever they'd choose.
Once people saw how much better the IQ was, they wouldn't care about the lost performance in a few games released around the time as the GPU I'd like to make would.
I can't believe intel is pursuing the iGPU... I'd like to make the iGPU cease to be feasible.
What do you think about it?
I know I'll never have the money to do this... I can dream, right?
Last edited: