Software blending is as fast as ROPs.

Anarchist420

Diamond Member
Feb 13, 2010
8,645
0
76
www.facebook.com
From a dev over at B3D forums:
"All games use texture compression. A common material can for example contain two BC3 (DXT5) textures + one BC5 (BC5 = two DXT5 independent alpha channels). That kind of setup can store 10 channels. For example: RGB color, normal vector (2d tangent space), roughness, specular, ambient occlusion, opacity and height (for simple parallax mapping). That is 3 bytes per pixel. Multiply by 1920x1080 and you get 6.2 MB. That 50 MB figure for material sampling could be true for CGI offline renderers, but not for current generation interactive games.

We have been using deferred rendering since the DX9 era (2007). I fully agree that the g-buffers can get really fat. The most optimal layout (depth + two 8888 g-buffers) barely fits into the Xbox 360's 10 MB of EDRAM, and it's slightly sub-HD (1152x720).

You don't need to store position data in the g-buffer at all, since you can reconstruct it from the interpolated camera view vector and the pixel depth (a single multiply-add instruction). Normals are also often stored in 2D, and the third component is reconstructed (for example by using a Lambert azimuthal equal-area projection). The albedo color in the g-buffer does not need to be HDR, because the g-buffer only contains the data sampled from the DXT-compressed material in [0,1] range (not the final lit result).

A typical high end game could for example have D24S8 depth buffer (4 bytes) + four 11-11-10 g-buffers (4 bytes each). That's 20 bytes per pixel. If you also have screen space shadow masks (8888 32 bit texture contains four lights) and on average have 12 lights for each pixel, you need to fetch 32 bytes for each pixel in the deferred lighting shader. One x86 cache line can hold two of these input pixels during lighting pass. The access pattern is fully linear, so you never miss L1. The "standard" lighting output buffer format is 16f-16f-16f-16f (half float HDR, 8 bytes per pixel). All the new x86 CPUs have CVT16 instruction set, so they can convert a 32bit float vector to a 16bit float vector in a single instruction. Eight output pixels fit in a single x86 cache line, and again the address pattern is fully linear (we never miss L1, because of prefetchers). Of course you wouldn't even want to pollute the L1 cache with output pixels and use streaming stores instead (as you can easily generate whole 64 byte lines one at a time).

The GPU execution cycles are split roughly like this in current-generation games (using our latest game as an example):
- 25% shadow map rendering (we have fully dynamic lighting)
- 25% object rendering to g-buffers
- 30% deferred lighting
- 15% post processing
- 5% others (like virtual texture updates, etc)

Deferred lighting and post processing are pure 2D passes and easy to prefetch perfectly (all fetched data in L1 + streaming stores). In total those passes take 45% of the frame time. Shadow map rendering doesn't do any texture sampling; it just writes/reads the depth buffer. A 16-bit depth buffer (half float) is enough for shadow maps. A 2048x2048 shadow map at 2 bytes per pixel is 8 MB. The whole shadow map fits nicely inside the 15 MB Sandy Bridge-E cache (Haswell-E will likely have even bigger caches), so there are no L3 misses at all in shadow map rendering. Tiled shadow map rendering (256 kB tiles = 512x256 pixels) fits completely inside the L2 cache of Sandy/Ivy/Haswell CPU cores (and thus all cores can work nicely in parallel without fighting over the shared L3). Shadow map vertex shaders are easy to run efficiently on AVX/FMA (just a 1:1 mapping, nothing fancy, 100% L1 cache hits). So basically the only thing that doesn't suit CPU rendering that well is the object rendering to g-buffers. And that's only 25% of the whole frame... and even that can be quite elegantly solved (by using a tile-based renderer). I'll write a follow-up on that when I have some more time.
"

Look at how badly g-buffers have sucked anyway: devs still can't/won't use good-quality AA with them without errors.

I see no excuse to still use ROPs... they suck because they're wasteful, no more efficient, and not programmable. I wish someone would loan me some money so I could direct a team to make a GPU. Once I had enough profit from that, I'd direct a team to make a monitor that's actually good for gaming (and even movies). Maybe I'd partner with an existing corporation, maybe I wouldn't.

Anyway, here's what I'd do with the GPU whose design I'd direct:
- Attempt to kill off DVI/HDMI by putting only DisplayPort connectors on the back of the card and leaving DVI support out of the display controller and video logic integrated into the GPU.
- Attempt to give open source a boost by not making it a DX part... I'd have the driver team create an excellent wrapper. That would weaken an already weakening Microsoft, and future games would be free of fixed feature sets that are behind the times anyway.
- End lossy texture compression by putting lossless map compression in the texture units and offering no hardware support for S3TC and its successors... S3TC formats would have to be emulated (see the decode sketch after this list).
- Pay Avalanche to port the simulated water in JC2 over to OpenCL or whatever they'd choose.
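
To make the S3TC point concrete: the formats are simple enough that a fully programmable part could decode them in software. Here is a minimal BC1 (DXT1) block decoder as a sketch of what that emulation would involve (the names are just for illustration):

```cpp
#include <cstdint>
#include <array>

// One decoded texel (RGBA8).
struct RGBA8 { uint8_t r, g, b, a; };

// Expand a 16-bit RGB565 endpoint to 8 bits per channel.
static RGBA8 expand565(uint16_t c) {
    uint8_t r = (c >> 11) & 0x1F, g = (c >> 5) & 0x3F, b = c & 0x1F;
    return { uint8_t((r << 3) | (r >> 2)),
             uint8_t((g << 2) | (g >> 4)),
             uint8_t((b << 3) | (b >> 2)), 255 };
}

// Decode one 8-byte BC1 (DXT1) block into 16 RGBA8 texels (row-major 4x4).
// A software S3TC path would do this per fetch or decompress whole tiles.
std::array<RGBA8, 16> decodeBC1(const uint8_t block[8]) {
    uint16_t c0 = uint16_t(block[0] | (block[1] << 8));
    uint16_t c1 = uint16_t(block[2] | (block[3] << 8));
    RGBA8 p[4] = { expand565(c0), expand565(c1), {}, {} };
    auto lerp = [](RGBA8 a, RGBA8 b, int wa, int wb, int d) {
        return RGBA8{ uint8_t((a.r * wa + b.r * wb) / d),
                      uint8_t((a.g * wa + b.g * wb) / d),
                      uint8_t((a.b * wa + b.b * wb) / d), 255 };
    };
    if (c0 > c1) {                       // 4-color mode
        p[2] = lerp(p[0], p[1], 2, 1, 3);
        p[3] = lerp(p[0], p[1], 1, 2, 3);
    } else {                             // 3-color + transparent-black mode
        p[2] = lerp(p[0], p[1], 1, 1, 2);
        p[3] = { 0, 0, 0, 0 };
    }
    std::array<RGBA8, 16> out{};
    for (int t = 0; t < 16; ++t) {
        int idx = (block[4 + t / 4] >> ((t % 4) * 2)) & 0x3;  // 2-bit palette index per texel
        out[t] = p[idx];
    }
    return out;
}
```

BC3 and BC5 layer the same idea with interpolated alpha blocks on top.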

Once people saw how much better the IQ was, they wouldn't care about the lost performance in the few games released around the same time as the GPU I'd like to make.

I can't believe Intel is pursuing the iGPU... I'd like to make the iGPU cease to be feasible.

What do you think about it?

I know I'll never have the money to do this... I can dream, right?
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
How about a Kickstarter to prove it produces better image quality?

If you can show your algorithms and approach are tangibly better, then you should be able to show off the API in software and show the images, even if you don't have a hardware-capable renderer.
 

jimhsu

Senior member
Mar 22, 2009
705
0
76
I'd be concerned about CPU usage ... how parallelizable exactly is software blending?
 

Anarchist420

Diamond Member
Feb 13, 2010
8,645
0
76
www.facebook.com
I'd be concerned about CPU usage ... how parallelizable exactly is software blending?
Blending could also be done by an add-in board with CUDA/GCN cores (but no ROPs), texture units, and a display controller.

Alternatively, motherboards could be made with two independent CPU sockets, if no add-in board is desired and if temps are a concern.
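
As a rough illustration of how parallelizable blending is: with standard source-over alpha blending, each output pixel depends only on its own source and destination values, so the loop can be chunked across CPU cores or mapped one thread per pixel on CUDA/GCN-style hardware (ordering only matters per pixel, across overlapping draws). A minimal sketch, not any particular GPU's blend unit:

```cpp
#include <cstdint>
#include <cstddef>

// Standard "source over destination" blend on packed 8-bit RGBA pixels:
//   out.rgb = src.rgb * src.a + dst.rgb * (1 - src.a)
//   out.a   = src.a + dst.a * (1 - src.a)
// Integer approximation with /255. Each output pixel touches only the matching
// source/destination pixel, so the range can be split across threads with no
// synchronization.
void blendSrcOver(const uint32_t* src, uint32_t* dst, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        uint32_t s = src[i], d = dst[i];
        uint32_t sa = (s >> 24) & 0xFF, da = (d >> 24) & 0xFF, ia = 255 - sa;
        uint32_t r = (((s >> 0)  & 0xFF) * sa + ((d >> 0)  & 0xFF) * ia) / 255;
        uint32_t g = (((s >> 8)  & 0xFF) * sa + ((d >> 8)  & 0xFF) * ia) / 255;
        uint32_t b = (((s >> 16) & 0xFF) * sa + ((d >> 16) & 0xFF) * ia) / 255;
        uint32_t a = sa + (da * ia) / 255;
        dst[i] = (a << 24) | (b << 16) | (g << 8) | r;
    }
}
```

In practice, vectorizing this with SSE/AVX and tiling to stay in cache is where the real work, and the comparison against ROP hardware, lies.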
 

Obsoleet

Platinum Member
Oct 2, 2007
2,181
1
0
I like everything you said OP. That is the direction things need to move and I'd buy that card.
 

BFG10K

Lifer
Aug 14, 2000
22,709
3,000
126
I like everything you said OP.
The whole thing seems like a fantasy. I didn't see the developer mention AA even once. AMD's 2xxx/3xxx parts used shader blending and AA performance was absolutely abysmal as a result.

Also, Intel tried the same thing with Larrabee and it was an epic fail. If Intel can't make their own hardware competitive after pouring billions of dollars into it, what chance does a 3rd-party developer have?
 

Anarchist420

Diamond Member
Feb 13, 2010
8,645
0
76
www.facebook.com
The whole thing seems like a fantasy. I didn't see the developer mention AA even once. AMD's 2xxx/3xxx parts used shader blending and AA performance was absolutely abysmal as a result.
Didn't they have a bad shader architecture? I thought I heard that they did, although I could be wrong.

Also, this dev isn't the only one at B3D who thinks that programmable blending vs. hardware blending isn't a black-and-white issue... it really isn't. Hardware functions are not always better quality and better performance than software. I do acknowledge that most people think those who believe in software blending are crazy, but I think something revolutionary needs to take place... nv and AMD have been holding up too many good things for years (with mostly me-too features) and we don't need the iGPUs in our future either.
Also, Intel tried the same thing with Larrabee and it was an epic fail. If Intel can't make their own hardware competitive after pouring billions of dollars into it, what chance does a 3rd-party developer have?
Perhaps they didn't focus enough on their drivers with optimizing the code and all. If they tried it again with some sincere effort, then I think they'd get it right. 2/3 the performance in current apps plus full programmability for the future is better than ROPs with a million formats that go out of date so easily.

I like everything you said OP. That is the direction things need to move and I'd buy that card.
Thanks.:)
 

BenSkywalker

Diamond Member
Oct 9, 1999
9,140
67
91
Michael Abrash (look it up ;) ) tried to help Intel use an approach along the lines of what the poster you quoted is talking about. He had every bit of the rendering pipeline figured out, down to assembly-level optimizations, with perfect control for the developers at every step of the rendering pipeline. They showed off a demo, and it certainly did eliminate all of the rendering errors we have gotten used to, at a mind-blowing ~10 fps in a game that wasn't quite a decade old. It wasn't a 30% performance hit; dedicated hardware was thirty *times* faster.

The problem is, and you will quickly realize this when you start digging deeper and deeper, that every single step in the real-time 3D rendering pipeline has all sorts of errors that we are dealing with and working around at all times. The Z-buffer can be replaced with something far more accurate; texture sampling is the same story; why use compressed textures that are lossy anyway... the list goes on and on.

Among the relatively recent optimizations, since the DX8/DX9 timeframe we have moved to early Z reject and to optimized texture sampling/filtering, both of which are inferior to what came before them when looked at from a clean rendering perspective. It is very simple to say we should get rid of them and that we would have a better-rendered engine if we did. In isolation I could likely create test cases for both showing that neither has a catastrophic performance hit; then of course someone would come along, pull back the curtain, and shut them both off, and there would be a massive performance problem crippling the cache/latency/bandwidth available on any currently available GPU.

At some point you have to analyze how much performance you want to dedicate to a given rendering feature. You are going to have to pay, and pay heavily, if you want to get rid of approximations. It isn't a few percentage points here and there; it is orders of magnitude. I used offline rendering engines for many years, where you measured hours or days per frame, and to be perfectly honest they didn't look all *that* much better than what we can do in real time now with our hacks. Yes, they were quite a bit cleaner, but certainly not worth the performance hit.
 

Anarchist420

Diamond Member
Feb 13, 2010
8,645
0
76
www.facebook.com
Michael Abrash (look it up) tried to help Intel use an approach along the lines of what the poster you quoted is talking about. He had every bit of the rendering pipeline figured out, down to assembly-level optimizations, with perfect control for the developers at every step of the rendering pipeline. They showed off a demo, and it certainly did eliminate all of the rendering errors we have gotten used to, at a mind-blowing ~10 fps in a game that wasn't quite a decade old. It wasn't a 30% performance hit; dedicated hardware was thirty *times* faster.
It depends on the application.

Perhaps Intel's approach wasn't the best.

At some point you have to analyze how much performance you want to dedicate to a given rendering feature. You are going to have to pay, and pay heavily, if you want to get rid of approximations.
I'm okay with approximations at times, but they choose the wrong ones, ones biased toward the performance side.

I think it is ridiculous that 32-bit fixed-point (FX) logarithmic z-buffers aren't used more often, for example. If things were programmed in software from the start, developers would use the best approach. It was also ridiculous that 32-bit z-buffers weren't even required until DX10. Microsoft was terribly behind the times.
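
For reference, one common logarithmic depth formulation, quantized to a 32-bit fixed-point value (the exact mapping and the constant C vary between engines; this is only a sketch of the idea):

```cpp
#include <cstdint>
#include <cmath>

// Map a view-space distance w in (0, far] to a 32-bit fixed-point logarithmic
// depth value. Compared to the usual hyperbolic 1/z mapping, precision is
// spread far more evenly across the depth range. C trades precision near the
// camera against precision in the distance; 1.0 is a common starting point.
uint32_t logDepth32(float w, float far_plane, float C = 1.0f) {
    float d = std::log(C * w + 1.0f) / std::log(C * far_plane + 1.0f);  // normalized to [0, 1]
    if (d < 0.0f) d = 0.0f;
    if (d > 1.0f) d = 1.0f;
    return uint32_t(d * 4294967295.0);  // quantize to the full 32-bit fixed-point range
}
```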

When the IHVs won't even give choices, then I have serious problems.

All of that said, I still think programmable blending is feasible:)
 

BenSkywalker

Diamond Member
Oct 9, 1999
9,140
67
91
Perhaps Intel's approach wasn't the best.

Abrash was working on it; the reason I mentioned that in my post is that he is likely the best low-level optimizer in the gaming industry. If *his* approach was a catastrophic failure with billions backing his technique, it just isn't going to work.

I think it is ridiculous that 32-bit fixed-point (FX) logarithmic z-buffers aren't used more often, for example. If things were programmed in software from the start, developers would use the best approach. It was also ridiculous that 32-bit z-buffers weren't even required until DX10. Microsoft was terribly behind the times.

Dump the Z-buffer altogether and go with the painter's algorithm. If you want accuracy, then that is the approach we should be taking. You will deal with a 90% performance hit, but it is far more accurate.
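
(For reference, the painter's algorithm just sorts primitives back to front and draws them in that order with no per-pixel depth test; a bare sketch, with the rasterizer stubbed out:)

```cpp
#include <algorithm>
#include <vector>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 v[3]; };

// Stub: a real renderer would rasterize the triangle into the framebuffer here.
void drawTriangle(const Triangle&) {}

// Painter's algorithm: sort far-to-near (larger view-space z assumed farther)
// and draw in that order, letting nearer triangles overwrite farther ones.
// No z-buffer needed, but intersecting or cyclically overlapping triangles
// must be split, and every covered pixel is shaded even if later overdrawn.
void paintersDraw(std::vector<Triangle> tris) {
    auto depth = [](const Triangle& t) {
        return (t.v[0].z + t.v[1].z + t.v[2].z) / 3.0f;  // centroid depth as sort key
    };
    std::sort(tris.begin(), tris.end(),
              [&](const Triangle& a, const Triangle& b) { return depth(a) > depth(b); });
    for (const Triangle& t : tris) drawTriangle(t);
}
```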

When the IHVs won't even give choices, then I have serious problems.

So you are volunteering to spend billions of dollars of your own money to release a product that is 3% as fast as the competition? IHVs don't give us the choice because people wouldn't buy it. It has been done before: look up Irix, Evans & Sutherland, or Glint. We had the choice to buy extremely accurate, extremely slow hardware, and people weren't willing.

All of that said, I still think programmable blending is feasible

I was using it for some time prior to the original Voodoo coming out. It is feasible, and slow as hell :)
 

Anarchist420

Diamond Member
Feb 13, 2010
8,645
0
76
www.facebook.com
Abrash was working on it; the reason I mentioned that in my post is that he is likely the best low-level optimizer in the gaming industry. If *his* approach was a catastrophic failure with billions backing his technique, it just isn't going to work.
Perhaps Intel didn't do their part in making the actual processor. Perhaps they were wasting those billions... just because someone is spending billions doesn't mean the results will be good.

Never forget that the CPU used to do everything but the pixels and the scrolling... remember VGA graphics? :) The CPU was doing the audio, the AI, the sprites, the graphics planes, and the special graphics effects. Some 2D DOS games required just as much processing power, relative to the hardware of the time, as 3D games do today. Now, with a CPU and a GPGPU with CUDA cores, texture units, cache, an IMC, and a display controller, there will be plenty of processing power.
 

BFG10K

Lifer
Aug 14, 2000
22,709
3,000
126
Didn't they have a bad shader architecture?
It was competitive in terms of FLOPs, but the parts absolutely tanked with AA. nVidia enjoyed a 2x–3x performance advantage with AA.

It took the 4000 series for AMD to get back in the game, and their powerful ROPs were directly responsible, especially with 8xMSAA.

And as bad as the shader performance was, it’s still hardware compared to a CPU. With the CPU performing ROP functions you’d be looking at 1% of the performance, if even that.

Hardware functions are not always better quality and better performance than software. I do acknowledge that most people think those who believe in software blending are crazy, but I think something revolutionary needs to take place...
There’s nothing revolutionary about software rendering. It’s prehistoric and died in 1996 when a simple hardware rasterizer was able to smash the finest hand-written assembly while looking better at the same time.

If you want to go back to that glory, go play software Quake and Unreal with your pixelated sprite graphics.

nv and AMD have been holding up too many good things for years (with mostly me-too features) and we don't need the iGPUs in our future either.
Holding things up? The GPU industry has never been in better shape thanks to their rapid development cycles and fierce competition.

Perhaps they didn't focus enough on their drivers with optimizing the code and all. If they tried it again with some sincere effort, then I think they'd get it right.
LMAO.

2/3 the performance in current apps plus full programmability for the future is better than ROPs with a million formats that go out of date so easily.
You aren’t going to get 2/3 the performance - not even close. Software ROPs done on the CPU would be about 100 times slower than a GPU, and I’m being generous to the CPU.

Even with shader-based MSAA resolve you're looking at 33%-50% of the original ROP performance (i.e. unacceptably poor), as per AMD's 2000/3000 series.
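
(For context, a shader-based resolve simply averages each pixel's MSAA samples in shader code instead of in the fixed-function path; a single-channel box-filter sketch:)

```cpp
#include <cstddef>

// Box-filter resolve of an N-sample MSAA buffer into a single-sample buffer.
// Layout assumed here: pixel i's samples occupy
// samples[i*sampleCount .. i*sampleCount + sampleCount - 1], one channel only
// for brevity; a real resolve does the same averaging per color channel.
void resolveMSAA(const float* samples, float* resolved,
                 std::size_t pixelCount, int sampleCount) {
    for (std::size_t i = 0; i < pixelCount; ++i) {
        float sum = 0.0f;
        for (int s = 0; s < sampleCount; ++s)
            sum += samples[i * sampleCount + s];
        resolved[i] = sum / float(sampleCount);
    }
}
```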

But..but..your solution is flexible for emulation! Whatever that even means. :awe:

When the IHVs won't even give choices, then I have serious problems.
You have choices. Load up the DirectX software rasterizer and run your games under software mode.

Feel free to enjoy 640x480 at 10-15 FPS with bilinear filtering and no AA. That’s how fast my i5 2500K ran a five-year-old game when I tried it, a game with very light system requirements at release, I might add.

Meanwhile my GTX680 ran the same game at hundreds of FPS at 2560x1600, 16xAF and 8xMSAA.

Enjoy your "flexibility" while I'll enjoy reality.

Some 2D DOS games required just as much processing power, relative to the hardware of the time, as 3D games do today.
Uh no, not even close. That’s just wrong on multiple levels. CPU loads alone now are much bigger thanks to AI, physics, scripting, etc, even completely ignoring the graphics side of things.

Secondly, drawing 256 color unfiltered pixels at VGA resolutions is one thing. Rendering today’s scenes at HD resolutions with millions of polygons, FP formats, and hundreds of pixel shader instructions per pixel is something totally different.

Now, with a CPU and a GPGPU with CUDA cores, texture units, cache, an IMC, and a display controller, there will be plenty of processing power.
But texture units (et al.) are hardware features. I thought you wanted to punish the evil nVidia/AMD? So go ahead, punish them by running Pong and telling us how good the CPU is for graphics.
 

Anarchist420

Diamond Member
Feb 13, 2010
8,645
0
76
www.facebook.com
But texture units (et al.) are hardware features. I thought you wanted to punish the evil nVidia/AMD? So go ahead, punish them by running Pong and telling us how good the CPU is for graphics.
I knew that texture units are hardware features.

And why would I want to "punish" nv/AMD? I think they need some competition. Competition is not punishment.

Also, I wasn't clear about what I meant... I think that CUDA/GCN (or an architecture like them) is flexible enough for programmable blending. I'm sorry for not being clear. It's impossible to get enough performance from just one processor, as the CPU alone surely can't do everything at a reasonable speed.
 

Red Hawk

Diamond Member
Jan 1, 2011
3,266
169
106
If you're using GCN/CUDA for blending, isn't that by definition a type of hardware acceleration rather than doing it in software?

The point remains, though. This is essentially what Intel tried with Larrabee. It was such a failure they didn't even bother releasing it publicly. If a company with the technical expertise, business experience, and available capital of Intel, along with a respected engineer like Abrash, can't do it, then it's a safe bet that this just really isn't possible. Or it may be possible, but the performance wouldn't justify the effort put into the research. Handwaving those facts away by saying they didn't make "a sincere effort" is just a weird brand of wishful thinking.
 