Software blending is as fast as ROPs.

12-07-2012, 02:19 PM   #1
Anarchist420 (Diamond Member)

From a dev over at B3D forums:
"All games use texture compression. A common material can for example contain two BC3 (DXT5) textures + one BC5 (BC5 = two DXT5 independent alpha channels). That kind of setup can store 10 channels. For example: RGB color, normal vector (2d tangent space), roughness, specular, ambient occlusion, opacity and height (for simple parallax mapping). That is 3 bytes per pixel. Multiply by 1920x1080 and you get 6.2 MB. That 50 MB figure for material sampling could be true for CGI offline renderers, but not for current generation interactive games.

We have been using deferred rendering since the DX9 era (2007). I fully agree that the g-buffers can get really fat. The optimal layout (depth + two 8888 g-buffers) barely fits into the Xbox 360's 10 MB EDRAM, and that's at slightly sub-HD (1152x720).

You don't need to store position data in the g-buffer at all, since you can reconstruct it using the interpolated camera view vector and the pixel depth (a single multiply-add instruction). Normals are also often stored in 2D, and the third component is reconstructed (for example by using a Lambert azimuthal equal-area projection). Albedo color in the g-buffer does not need to be in HDR, because the g-buffer only contains the data sampled from the DXT-compressed material in [0,1] range (not the final lit result).

A typical high end game could for example have D24S8 depth buffer (4 bytes) + four 11-11-10 g-buffers (4 bytes each). That's 20 bytes per pixel. If you also have screen space shadow masks (8888 32 bit texture contains four lights) and on average have 12 lights for each pixel, you need to fetch 32 bytes for each pixel in the deferred lighting shader. One x86 cache line can hold two of these input pixels during lighting pass. The access pattern is fully linear, so you never miss L1. The "standard" lighting output buffer format is 16f-16f-16f-16f (half float HDR, 8 bytes per pixel). All the new x86 CPUs have CVT16 instruction set, so they can convert a 32bit float vector to a 16bit float vector in a single instruction. Eight output pixels fit in a single x86 cache line, and again the address pattern is fully linear (we never miss L1, because of prefetchers). Of course you wouldn't even want to pollute the L1 cache with output pixels and use streaming stores instead (as you can easily generate whole 64 byte lines one at a time).

The GPU execution cycles are split roughly like this in current generation games (using our latest game as an example):
- 25% shadow map rendering (we have fully dynamic lighting)
- 25% object rendering to g-buffers
- 30% deferred lighting
- 15% post processing
- 5% others (like virtual texture updates, etc)

Deferred lighting and post processing are pure 2D passes and easy to prefetch perfectly (all fetched data in L1 + streaming stores). In total those passes take 45% of the frame time. Shadow map rendering doesn't do any texture sampling, it just writes/reads the depth buffer. A 16 bit depth buffer (half float) is enough for shadow maps. A 2048x2048 shadow map at 2 bytes per pixel is 8 MB. The whole shadow map fits nicely inside the 15 MB Sandy Bridge E cache (Haswell E will likely have even bigger caches). So, no L3 misses at all in shadow map rendering. Tiled shadow map rendering (256 kB tiles = 512x256 pixels) fits completely inside the Sandy/Ivy/Haswell L2 cache of the CPU cores (and thus all cores can nicely work in parallel without fighting over the shared L3). Shadow map vertex shaders are easy to run efficiently on AVX/FMA (just 1:1 mapping, nothing fancy, 100% L1 cache hits). So basically the only thing that doesn't suit CPU rendering that well is the object rendering to g-buffers. And that's only 25% of the whole frame... and even that can be quite elegantly solved (by using a tile based renderer). I'll write a follow-up on that when I have some more time."
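
For concreteness, here is a quick sanity check of the byte budgets quoted above. The 1080p resolution, the per-pixel formats, and the 2048x2048 shadow map come from the quote; the little program itself is just arithmetic, not anything from the developer's engine.

Code:
#include <cstdio>

int main()
{
    const double pixels = 1920.0 * 1080.0;

    const double material_bytes = pixels * 3;           // BC3 + BC3 + BC5 = 3 bytes/pixel
    const double gbuffer_bytes  = pixels * (4 + 4 * 4); // D24S8 + four 11-11-10 targets = 20 bytes/pixel
    const double shadow_bytes   = 2048.0 * 2048.0 * 2;  // 16-bit depth shadow map

    std::printf("materials: %.1f MB\n", material_bytes / 1e6); // ~6.2 MB, as quoted
    std::printf("g-buffer:  %.1f MB\n", gbuffer_bytes  / 1e6); // ~41.5 MB at 20 bytes/pixel
    std::printf("shadowmap: %.1f MB\n", shadow_bytes   / 1e6); // ~8.4 MB (8 MiB), fits a 15 MB L3
    return 0;
}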
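The "no position data in the g-buffer" point boils down to one multiply-add per component. A minimal sketch (the types and names are mine, not the developer's):

Code:
struct Vec3 { float x, y, z; };

// Rebuild a pixel's position from the interpolated camera view ray and the
// sampled linear depth: position = camera + ray * depth, i.e. a single
// multiply-add per component, so no position needs to be stored.
inline Vec3 reconstruct_position(const Vec3& camera_pos, const Vec3& view_ray, float linear_depth)
{
    return { camera_pos.x + view_ray.x * linear_depth,
             camera_pos.y + view_ray.y * linear_depth,
             camera_pos.z + view_ray.z * linear_depth };
}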
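And the CVT16 (shipped as F16C) plus streaming-store claim for the lighting output looks roughly like this in C++ intrinsics. This is a minimal sketch, assuming an even pixel count, a 16-byte-aligned destination, and a compiler invocation such as -mavx -mf16c; it is not code from the quoted engine.

Code:
#include <immintrin.h>
#include <cstddef>

// Convert an RGBA32F lighting result to RGBA16F (half float HDR) and write it
// with non-temporal stores so the output never pollutes the L1/L2 caches.
void store_half_float_hdr(const float* src, std::size_t pixel_count, void* dst16)
{
    __m128i* out = static_cast<__m128i*>(dst16);          // must be 16-byte aligned
    for (std::size_t i = 0; i < pixel_count; i += 2) {    // 2 pixels = 8 floats per iteration
        __m256  f32 = _mm256_loadu_ps(src + i * 4);       // two RGBA32F pixels
        __m128i f16 = _mm256_cvtps_ph(f32, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC); // 8 floats -> 8 halfs
        _mm_stream_si128(out++, f16);                     // streaming store, bypasses the cache
    }
    _mm_sfence();                                         // order the non-temporal stores
}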

Look at how problematic g-buffers have been anyway; devs still can't or won't use good-quality AA with them without errors.

I see no excuse to still use ROPs... they're wasteful, not more efficient, and not programmable. I wish someone would loan me some money so I could direct a team to make a GPU. Once I had enough profits from that, I'd direct a team to make a monitor that's actually good for gaming (and even movies). Maybe I'd partner with an existing corporation, maybe I wouldn't.

Anyway, here's what I'd do with the GPU whose design I'd direct:
- Attempt to kill off DVI/HDMI by putting only DisplayPort outputs on the back of the card and not even including DVI support in the display controller and video logic integrated into the GPU.
- Give open source a boost by not making it a DX part... I'd have the driver team create an excellent wrapper. That would weaken an already-weakening Microsoft, and future games would be free of fixed feature sets that are behind the times anyway.
- End lossy texture compression by putting lossless map compression in the texture units and dropping hardware support for S3TC and its successors... S3TC formats would have to be emulated.
- Pay Avalanche to port the simulated water in JC2 over to OpenCL or whatever they'd choose.

Once people saw how much better the IQ was, they wouldn't care about the lost performance in the few games released around the same time as the GPU I'd like to make.

I can't believe Intel is pursuing the iGPU... I'd like to make the iGPU cease to be feasible.

What do you think about it?

I know I'll never have the money to do this... I can dream, right?

Last edited by Anarchist420; 12-07-2012 at 02:23 PM.

12-08-2012, 09:45 AM   #2
BrightCandle (Diamond Member)

How about a Kickstarter to prove it produces better image quality?

If you can show your algorithms and approach are tangibly better, then you should be able to show off the API in software and show the images, even if you don't have a hardware-capable renderer.
__________________
i7 3930k @4.4, 2xMSI GTX 680, 16GB Corsair 2133 RAM, Crucial m4 500GB, Soundblaster Z
Custom watercooled by 2x MCR 320 and 1 MCR 480
Zowie Evo CL EC2, Corsair K70, Asus Rog Swift PG278Q

12-08-2012, 12:54 PM   #3
Anarchist420 (Diamond Member)

Quote:
Originally Posted by BrightCandle View Post
How about a Kickstarter to prove it produces better image quality?
It can provide better image quality than ROPs since it isn't limited to hardwired blending... software blending is more customizable.
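
For illustration, here is the kind of blend operator that fixed-function ROPs cannot express but straightforward software can: a per-channel conditional "overlay" blend. This is a minimal CPU-side sketch with made-up buffer names, not a claim about any particular engine.

Code:
#include <algorithm>
#include <cstddef>
#include <vector>

struct RGBA { float r, g, b, a; };

// Conditional per-channel operator: it does not fit the fixed-function
// src*F_src + dst*F_dst blend equation that ROP hardware implements.
static float overlay(float dst, float src)
{
    return dst < 0.5f ? 2.0f * src * dst
                      : 1.0f - 2.0f * (1.0f - src) * (1.0f - dst);
}

void blend_overlay(std::vector<RGBA>& framebuffer, const std::vector<RGBA>& fragments)
{
    for (std::size_t i = 0; i < framebuffer.size(); ++i) {
        const RGBA& s = fragments[i];
        RGBA&       d = framebuffer[i];
        d.r = overlay(d.r, s.r);
        d.g = overlay(d.g, s.g);
        d.b = overlay(d.b, s.b);
        d.a = std::max(d.a, s.a);   // keep the more opaque coverage value
    }
}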

12-08-2012, 01:08 PM   #4
jimhsu (Senior Member)

I'd be concerned about CPU usage ... how parallelizable exactly is software blending?

12-08-2012, 02:41 PM   #5
Anarchist420 (Diamond Member)

Quote:
Originally Posted by jimhsu View Post
I'd be concerned about CPU usage ... how parallelizable exactly is software blending?
Blending could also be done by an add-in board with CUDA/GCN cores (but no ROPs), texture units, and a display controller.

Alternatively, motherboards could be made with two independent CPU sockets if no add-in board is desired and temps are a concern.

12-08-2012, 02:49 PM   #6
Obsoleet (Platinum Member)

I like everything you said, OP. That is the direction things need to move in, and I'd buy that card.
__________________
Intel C2Q 9450@3ghz | Intel X25-M G2 160GB | MSI Radeon 5870 (latest WHQLs)
Ubuntu 14.04 + Win7Pro | 8GB Mushkin XP2-6400 (4-4-4-12) | Lian Li PC-A05NB
Asus P5Q-E (P45 / ICH10R) | Corsair HX650 | Asus U3S6 | Asus VS278Q-P + 2x Dell P2210s
+ Samsung PN50B650 | External Seagate GoFlex 1.5TB USB 3.0 + External LiteOn IHES208 BR


Anandtech forums on trial: The corruption runs deep.

12-08-2012, 10:04 PM   #7
BFG10K (Lifer)

Quote:
Originally Posted by Obsoleet View Post
I like everything you said OP.
The whole thing seems like a fantasy. I didn't see the developer mention AA even once. AMD's 2xxx/3xxx parts used shader blending and AA performance was absolutely abysmal as a result.

Also, Intel tried the same thing with Larrabee and it was an epic fail. If Intel can't make their own hardware competitive after pouring billions of dollars into it, what chance does a 3rd party developer have?
__________________
4790K | Titan | 16GB DDR3-1600 | Z97-K | 128GB Samsung 830 | 960GB Crucial M500 | 1TB VelociRaptor | X-Fi XtremeMusic | Seasonic X 560W | Fractal Arc R2 Midi | 30" HP LP3065

Last edited by BFG10K; 12-08-2012 at 10:06 PM.

12-09-2012, 07:05 AM   #8
Anarchist420 (Diamond Member)

Quote:
Originally Posted by BFG10K View Post
The whole thing seems like a fantasy. I didn't see the developer mention AA even once. AMD's 2xxx/3xxx parts used shader blending and AA performance was absolutely abysmal as a result.
Didn't they have a bad shader architecture? I thought I heard that they did, although I could be wrong.

Also, this dev isn't the only one at B3D who thinks programmable blending vs. hardware blending isn't a black-and-white issue... it really isn't. Hardware functions are not always better than software in both quality and performance. I do acknowledge that most people think those who believe in software blending are crazy, but I think something revolutionary needs to take place... NV and AMD have been holding up too many good things for years (with mostly me-too features), and we don't need iGPUs in our future either.
Quote:
Originally Posted by BFG10K View Post
Also Intel tried the same thing with Larrabee and it was an epic fail. If Intel can't make their own hardware competitive after pouring billions of dollars at it, what chance does a 3rd party developer have?
Perhaps they didn't focus enough on their drivers and on optimizing the code. If they tried it again with some sincere effort, I think they'd get it right. Two-thirds of the performance in current apps plus full programmability for the future is better than ROPs with a million formats that go out of date so easily.

Quote:
Originally Posted by Obsoleet View Post
I like everything you said OP. That is the direction things need to move and I'd buy that card.
Thanks.

Last edited by Anarchist420; 12-09-2012 at 07:12 AM.

12-09-2012, 07:54 AM   #9
BenSkywalker (Elite Member)

Michael Abrash (look him up) tried to help Intel use an approach along the lines of what the poster you quoted is talking about. He had every bit of the rendering pipeline figured out, down to assembly-level optimizations, with perfect control for developers at every step of the rendering pipeline. They showed off a demo, and it certainly did eliminate all of the rendering errors we have gotten used to, at a mind-blowing ~10 fps in a game that wasn't quite a decade old. It wasn't a 30% performance hit; dedicated hardware was thirty *times* faster.

The problem is, and you will quickly realize this when you start digging deeper and deeper, that every single step in the real-time 3D rendering pipeline has all sorts of errors that we are dealing with and working around at all times. The Z buffer can be replaced with something far more accurate; texture sampling is the same story; why use compressed textures that are lossy anyway... the list goes on and on.

Among relatively recent optimizations: since the DX8/DX9 timeframe we have moved to early Z reject and optimized texture sampling/filtering, both of which are inferior to what came before them when looked at from a clean rendering perspective. It is very simple to say we should get rid of them and that we would have a better rendering engine if we did so. In isolation I could likely create test cases for both showing that neither has a catastrophic performance hit - until, of course, someone came along, pulled back the curtain, and shut them both off, at which point there would be a massive performance issue crippling the cache/latency/bandwidth available on any currently available GPU.

At some point you have to analyze how much performance you want to dedicate to a given rendering feature. You are going to have to pay, and pay heavily, if you want to get rid of approximations. It isn't a few percentage points here and there; it is orders of magnitude. I used offline rendering engines for many years, where you measured hours or days per frame, and to be perfectly honest they didn't look all *that* much better than what we can do in real time now with our hacks. Yes, they were quite a bit cleaner, but certainly not worth the performance hit.

12-09-2012, 08:42 AM   #10
Anarchist420 (Diamond Member)

Quote:
Originally Posted by BenSkywalker View Post
Michael Abrash(look it up ) tried to help Intel use an approach along the lines of what the poster you quoted is talking about. He had every bit of the rendering pipeline figured out, down to assembly level optimizations with perfect control for the developers at every step of the rendering pipeline. They showed off a demo, and it certainly did eliminate all of the rendering errors that we have gotten used to at a mind blowing ~10fps on a game that wasn't quite a decade old. It wasn't a 30% performance hit, dedicated hardware was thirty *times* faster.
It depends on the application.

Perhaps Intel's approach wasn't the best.

Quote:
Originally Posted by BenSkywalker View Post
At some point you have to analyze how much performance you want to dedicate to a given rendering feature. You are going to have to pay, and pay heavily, if you want to get rid of approximations.
I'm okay with approximations at times, but they choose the wrong ones, biased toward the performance side.

I think it is ridiculous that 32-bit fixed-point logarithmic Z buffers aren't used more often, for example. If things were programmed in software from the start, developers would use the best approach. It was also ridiculous that 32-bit Z buffers weren't even required until DX10. Microsoft was terribly behind the times.
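
As an illustration of the idea, a 32-bit fixed-point logarithmic depth encoding can look like the sketch below; the exact mapping and constants are assumptions for the example, not a spec the post refers to.

Code:
#include <algorithm>
#include <cmath>
#include <cstdint>

// Map view-space depth w in [near, far] to a 32-bit fixed-point value with a
// logarithmic distribution, giving roughly constant *relative* precision
// instead of piling all the precision near the camera like 1/z does.
uint32_t encode_log_depth(float w, float near_plane, float far_plane)
{
    float d = std::log(w / near_plane) / std::log(far_plane / near_plane);
    d = std::min(std::max(d, 0.0f), 1.0f);
    return static_cast<uint32_t>(d * 4294967295.0); // full 32-bit fixed-point range
}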

When the IHVs won't even give choices, then I have serious problems.

All of that said, I still think programmable blending is feasible.

12-09-2012, 09:28 PM   #11
BenSkywalker (Elite Member)

Quote:
Perhaps intel's approach wasn't the best.
Abrash was working on it; the reason I mentioned that in my post is that he is likely the best low-level optimizer in the gaming industry. If *his* approach was a catastrophic failure with billions backing his technique, it just isn't going to work.

Quote:
I think it is ridiculous that 32 bit FX logarithmic z buffers aren't used more often, for example. If things are originally programmed in software, then they'll use the best. It was also ridiculous that 32 bit z buffers weren't even required until DX10. Microsoft was terribly behind the times.
Dump the Z buffer altogether and go with the painter's algorithm. If you want accuracy, then that is the approach we should be taking. You will deal with a 90% performance hit, but it is far more accurate.
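
For readers unfamiliar with it, the painter's algorithm is simply back-to-front sorting and overdraw, with no depth buffer at all. A minimal sketch follows (the Triangle type and rasterize stub are placeholders); it also hints at the catch: intersecting or cyclically overlapping triangles must be split before a sort order even exists.

Code:
#include <algorithm>
#include <vector>

struct Triangle { float view_depth; /* vertices, material, ... */ };

void rasterize(const Triangle&) { /* placeholder: writes every covered pixel */ }

// Painter's algorithm: sort back-to-front and draw in order; nearer triangles
// simply overwrite farther ones, so no Z buffer is needed.
void draw_painters(std::vector<Triangle>& tris)
{
    std::sort(tris.begin(), tris.end(),
              [](const Triangle& a, const Triangle& b) { return a.view_depth > b.view_depth; });
    for (const Triangle& t : tris)
        rasterize(t);              // heavy overdraw: every triangle is fully shaded
}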

Quote:
When the IHVs won't even give choices, then I have serious problems.
So you are volunteering to spend billions of dollars of your own money to release a product that is 3% as fast as the competition? IHVs don't give us the choice because people wouldn't buy it. It has been done before; look up Irix, Evans & Sutherland, or Glint. We had the choice to buy extremely accurate, extremely slow hardware, and people weren't willing to.

Quote:
All of that said, I still think programmable blending is feasible
I was using it for some time prior to the original Voodoo coming out. It is feasible, and slow as hell.

12-09-2012, 09:43 PM   #12
Anarchist420 (Diamond Member)

Quote:
Originally Posted by BenSkywalker View Post
Abrash was working on it, the reason I mentioned that in my post is he is likely the best low level optimizer in the gaming industry. If *his* approach was a catastrophic failure with billions backing his technique, it just isn't going to work.
Perhaps Intel didn't do their part in making the actual processor. Perhaps they were wasting billions... just because someone is spending billions doesn't mean the results will be good.

Never forget that the CPU used to do everything but the pixels and the scrolling... remember VGA graphics? The CPU was doing audio, AI, the sprites, the graphics planes, and the special graphics effects. Some 2D DOS games required just as much processing power, for their time, as 3D games do today. Now, with a CPU plus a GPGPU with CUDA cores, texture units, cache, an IMC, and a display controller, there will be plenty of processing power.

12-10-2012, 02:54 AM   #13
BFG10K (Lifer)

Quote:
Originally Posted by Anarchist420 View Post
Didn't they have a bad shader architecture?
It was competitive in terms of FLOPs, but the parts absolutely tanked with AA. nVidia enjoyed a 2x-3x performance advantage with AA.

It took the 4000 series for AMD to get back in the game, and their powerful ROPs were directly responsible, especially with 8xMSAA.

And as bad as the shader performance was, it's still hardware compared to a CPU. With the CPU performing ROP functions you'd be looking at 1% of the performance, if even that.

Quote:
Hardware functions are not always better quality and better performance than software. I do acknowledge that most people think those who believe in software blending are crazy, but I think something revolutionary needs to take place...
There's nothing revolutionary about software rendering. It's prehistoric and died in 1996 when a simple hardware rasterizer was able to smash the finest hand-written assembly while looking better at the same time.

If you want to go back to that glory, go play software Quake and Unreal with your pixelated sprite graphics.

Quote:
nv and AMD have been holding up too many good things for years (with mostly me-too features) and we don't need the iGPUs in our future either.
Holding things up? The GPU industry has never been in better shape thanks to their rapid development cycles and fierce competition.

Quote:
Perhaps they didn't focus enough on their drivers with optimizing the code and all. If they tried it again with some sincere effort, then I think they'd get it right.
LMAO.

Quote:
2/3 the performance in current apps plus full programmability for the future is better than ROPs with a million formats that go out of date so easily.
You aren't going to get 2/3 the performance - not even close. Software ROPs done on the CPU would be about 100 times slower than a GPU, and I'm being generous to the CPU.

Even with shader-based MSAA resolve you're looking at 33%-50% of the original ROP performance (i.e. unacceptably poor), as per AMD's 2000/3000 series.

But..but..your solution is flexible for emulation! Whatever that even means.

Quote:
When the IHVs won't even give choices, then I have serious problems.
You have choices. Load up the DirectX software rasterizer and run your games under software mode.

Feel free to enjoy 640x480 at 10-15 FPS with bilinear filtering and no AA. That's how fast my i5 2500K ran a five-year-old game when I tried it, a game with very light system requirements at release, I might add.

Meanwhile my GTX680 ran the same game at hundreds of FPS at 2560x1600, 16xAF and 8xMSAA.

Enjoy your "flexibility" while I'll enjoy reality.
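
For anyone who wants to repeat that experiment, one way to run a Direct3D 11 application on the CPU is to request the WARP (software) driver at device creation. A minimal sketch, with error handling stripped; it only creates and releases the device:

Code:
#include <d3d11.h>
#pragma comment(lib, "d3d11.lib")

// Create a Direct3D 11 device on WARP, the CPU software rasterizer,
// instead of the hardware GPU driver.
int main()
{
    ID3D11Device*        device  = nullptr;
    ID3D11DeviceContext* context = nullptr;
    D3D_FEATURE_LEVEL    level   = {};

    HRESULT hr = D3D11CreateDevice(
        nullptr,                  // default adapter
        D3D_DRIVER_TYPE_WARP,     // software rasterizer
        nullptr, 0,               // no software module, no creation flags
        nullptr, 0,               // default feature levels
        D3D11_SDK_VERSION,
        &device, &level, &context);

    if (SUCCEEDED(hr)) { context->Release(); device->Release(); }
    return SUCCEEDED(hr) ? 0 : 1;
}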

Quote:
Some 2D DOS games required just as much processing power for back then as 3D games do today.
Uh no, not even close. That's just wrong on multiple levels. CPU loads alone are much bigger now thanks to AI, physics, scripting, etc., even completely ignoring the graphics side of things.

Secondly, drawing 256-color unfiltered pixels at VGA resolutions is one thing. Rendering today's scenes at HD resolutions with millions of polygons, FP formats, and hundreds of pixel shader instructions per pixel is something totally different.

Quote:
Now with a CPU and a GPGPU with cuda cores, texture units, cache, an IMC, and display controller there will be plenty of processing power.
But texture units (et al.) are hardware features. I thought you wanted to punish the evil nVidia/AMD? So go ahead, punish them by running Pong and telling us how good the CPU is for graphics.

12-10-2012, 08:57 AM   #14
Anarchist420 (Diamond Member)

Quote:
Originally Posted by BFG10K View Post
But texture units (et al.) are hardware features. I thought you wanted to punish the evil nVidia/AMD? So go ahead, punish them by running Pong and telling us how good the CPU is for graphics.
I knew that texture units are hardware features.

And why would I want to "punish" nv/AMD? I think they need some competition. Competition is not punishment.

Also, I wasn't clear about what I meant... I think that CUDA/GCN (or an architecture like them) is flexible enough for programmable blending. I'm sorry for not being clear. It's impossible to get enough performance from just one processor; the CPU alone surely can't do everything at a reasonable speed.

12-10-2012, 10:35 AM   #15
Red Hawk (Platinum Member)

If you're using GCN/CUDA for blending, isn't that by definition a type of hardware acceleration rather than doing it in software?

The point remains, though. This is essentially what Intel tried with Larrabee. It was such a failure they didn't even bother releasing it publicly. If a company with the technical and business experience and available capital of Intel, along with a respected engineer like Abrash, can't do it, then it's a safe bet that this just really isn't possible. Or it may be possible, but the performance wouldn't justify the effort put into the research. Handwaving those facts away by saying they didn't make "a sincere effort" is just a weird brand of wishful thinking.
__________________
Desktop: Thermaltake V-4 Black case | Gigabyte GA-Z68AP-D3 | Core i5 2500k @ 4 GHz | ASUS Radeon HD 7870 DirectCU II 2 GB @ 1110 MHz | 8 GB G.Skill DDR3 RAM 1333 MHz | 120GB OCZ Vertex 3 SSD & Western Digital 500 GB HDD | Antec 650w PSU | Acer 1080p 60 Hz 21.5'' | Windows 8.1 Professional
Laptop: ASUS K52Jr-X5 | Core i3-350m @ 2.26 GHz| Mobility Radeon HD 5470 1 GB @ 750 MHz 4 GB DDR3 RAM | 90 GB OCZ Agility 3 SSD | 1366x768 15.6'' | Windows 8.1 Professional

Last edited by Red Hawk; 12-10-2012 at 10:43 AM.