Would it be wise for DX12 to ditch ROPs and Z/stencil fixed-function units?

Anarchist420

Diamond Member
Feb 13, 2010
I've been wondering about that. I think DX12 hardware should emulate the color and depth buffers in the shaders. That way, they could go with something like 2x this gen's shader power for the next gen. They would still need TMUs, but other than that it would be completely programmable, and everything could be handled at the driver level.
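
Roughly what I mean, as a minimal sketch (a hypothetical CUDA-style kernel I made up to illustrate the idea, not any real driver code): the Z test and blend that the ROPs do in fixed hardware would just become ordinary shader code reading and writing plain buffers in memory.

Code:
#include <cuda_runtime.h>

// Hypothetical "software ROP": per-pixel depth test + alpha blend done in a
// regular kernel instead of fixed-function hardware. One thread per shaded pixel.
// Note: real ROPs also guarantee blending happens in primitive order; doing that
// in software would need extra synchronization on top of this.
__global__ void shadeAndBlend(float4* colorBuf, float* depthBuf,
                              const float4* srcColor, const float* srcDepth,
                              int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;

    float z = srcDepth[i];
    if (z >= depthBuf[i]) return;            // emulated Z test (LESS)

    float4 src = srcColor[i];
    float4 dst = colorBuf[i];
    float a = src.w;
    // emulated SRC_ALPHA / ONE_MINUS_SRC_ALPHA blend
    colorBuf[i] = make_float4(src.x * a + dst.x * (1.0f - a),
                              src.y * a + dst.y * (1.0f - a),
                              src.z * a + dst.z * (1.0f - a),
                              1.0f);
    depthBuf[i] = z;                          // emulated Z write
}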

It would probably require more on-die cache and a very large, high-bandwidth integrated memory controller, but I think emulation is the way to go.

I'm thinking 22 nm is the largest process this could be done at.

Your thoughts?
 

Ben90

Platinum Member
Jun 14, 2009
I have no idea about anything you said, but hooray for a topic that's not "should I get this card or that card". Hopefully the answer isn't just a straight-up no and we can learn some stuff.
 

Phynaz

Lifer
Mar 13, 2006
In general, fixed function hardware runs faster than programmable hardware. So in this case eliminating the ROPs - even if it could be done - and running their functions on the shader hardware would probably result in a performance drop.

Was it the ATI 3000 series that eliminated the AA hardware and did AA in the shaders? Maybe it was the 2900. I just remember it didn't go too well, and the next generation with proper hardware AA was much improved.
 

Ben90

Platinum Member
Jun 14, 2009
I'm not too knowledgeable about the software side of things, but isn't DX basically just a list of standards the hardware companies need to support? I know it's more for the software side, but it's up to the hardware manufacturers to decide how they want to implement stuff.

I think if Nvidia/ATi decided it was faster to do what you mentioned, all of their cards would be doing that stuff in shaders already.
 

evolucion8

Platinum Member
Jun 17, 2005
In general, fixed function hardware runs faster than programmable hardware. So in this case eliminating the ROPs - even if it could be done - and running their functions on the shader hardware would probably result in a performance drop.

Was it the ATI 3000 series that eliminated the AA hardware and did AA in the shaders? Maybe it was the 2900. I just remember it didn't go too well, and the next generation with proper hardware AA was much improved.

The HD 2x00 and HD 3x00 series did the anti-aliasing resolve on the shaders; I think it was because of a bug in the ROPs.

I'm not too knowledgeable about the software side of things, but isn't DX basically just a list of standards the hardware companies need to support? I know it's more for the software side, but it's up to the hardware manufacturers to decide how they want to implement stuff.

I think if Nvidia/ATi decided it was faster to do what you mentioned, all of their cards would be doing that stuff in shaders already.

ATi did something like that: the HD 5x00 series moved the TMUs' interpolators to the shaders. The HD 4x00 series was never able to fully reach its theoretical texture power because it was interpolator-limited (40 texture units with only 32 interpolators).
 

bryanW1995

Lifer
May 22, 2007
In general, fixed function hardware runs faster than programmable hardware. So in this case eliminating the ROPs - even if it could be done - and running their functions on the shader hardware would probably result in a performance drop.

Was it the ATI 3000 series that eliminated the AA hardware and did AA in the shaders? Maybe it was the 2900. I just remember it didn't go too well, and the next generation with proper hardware AA was much improved.

The 2900 did AA in the shaders. The 3800 was largely based on the 2900 (think GF104 vs. GF100), and it did this as well. A major reason for the performance jump in the 48x0 was that AMD realized it was not a good move and went back to hardware AA.
 

ViRGE

Elite Member, Moderator Emeritus
Oct 9, 1999
In general, fixed function hardware runs faster than programmable hardware. So in this case eliminating the ROPs - even if it could be done - and running their functions on the shader hardware would probably result in a performance drop.
This is probably the best answer you're going to get. ROPs are very, very good at what they do and running their functions on programmable hardware would indeed be much slower. They also serve a dual-purpose of moving rasterization (the most RAM-bandwidth hungry operation in rendering) closer to RAM, which is why you see them coupled with memory controllers in both AMD and NVIDIA designs.
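
To put some rough, back-of-the-envelope numbers on it (my own assumptions, not measured figures): an alpha-blended pixel with a Z test is roughly a 4-byte color read, a 4-byte color write and a 4-byte depth read, plus a possible depth write. At 1920x1080, 60 fps and a modest 4x overdraw, that's already about 1920 x 1080 x 60 x 4 x 12 bytes ≈ 6 GB/s of raw framebuffer traffic before MSAA or any texturing - exactly the kind of streaming access the ROPs' color/Z compression and their placement next to the memory controllers are designed to handle.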

It's a bit off topic, but from what I understand ROPs were a big problem in Intel's initial Larrabee design. Intel initially wanted to go with a purely programmable design - no geometry setup, no tessellator, no ROPs. Ultimately this wasn't a workable approach, and it was one of the reasons Larrabee was first delayed, as they needed to work ROPs into their design. If ROPs can ever be removed, it will probably be Intel who does it first.
 

Anarchist420

Diamond Member
Feb 13, 2010
I knew fixed function was a lot faster (emulation is never faster), but the reason I advocate going to all shaders + TMUs is that it would get rid of a lot of headaches and compatibility issues.

It won't happen with DX12, though, because too many people would be pissed off about going backwards in performance. I wouldn't be, because of the benefits of programmability.

I personally thought Larrabee was a great idea and that they should've just released it, because it was all programmable (except for the TMUs), despite the fact that it would've had slower performance. There are so many things Intel could've done at the driver level that it would've been worth it IMO.
 

Scali

Banned
Dec 3, 2004
I knew fixed function was a lot faster (emulation is never faster), but the reason I advocate going to all shaders + TMUs is that it would get rid of a lot of headaches and compatibility issues.

What headaches and compatibility issues would those be?

It won't happen with DX12, though, because too many people would be pissed off about going backwards in performance. I wouldn't be, because of the benefits of programmability.

You already have full programmability. Ever heard of GPGPU?
Look at nVidia's Design Garage demo, for example: a raytracer implemented in CUDA. It bypasses the fixed-function hardware completely, yet it still manages to render a pretty picture.
So you can do this today, if you want.
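
Just to illustrate what "bypassing the fixed-function hardware" looks like (a minimal made-up kernel, not Design Garage's actual code): a CUDA kernel can shade pixels straight into an ordinary buffer without the rasterizer, the ROPs or the depth hardware being involved at all.

Code:
#include <cuda_runtime.h>

// Minimal compute-only "renderer": one thread per pixel, intersect an
// orthographic ray with a sphere at the origin and write the shaded color
// directly to a plain memory buffer. No raster, no ROPs, no Z hardware.
__global__ void traceSphere(uchar4* image, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Map the pixel to [-1, 1] screen space
    float u = 2.0f * x / width - 1.0f;
    float v = 2.0f * y / height - 1.0f;

    // Ray straight along +Z against a sphere of radius 0.8
    float r2 = 0.8f * 0.8f;
    float d2 = u * u + v * v;

    unsigned char shade = 0;
    if (d2 < r2) {
        // The normal's Z component doubles as simple head-on lighting
        shade = (unsigned char)(255.0f * sqrtf(r2 - d2) / 0.8f);
    }
    image[y * width + x] = make_uchar4(shade, shade, shade, 255);
}

You'd still copy the buffer somewhere to display it, but no part of the traditional pipeline touched those pixels.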
 

Scali

Banned
Dec 3, 2004
Can we run a complete game with such techniques at playable framerates? I doubt it.

Not with raytracing, perhaps (although there are some CPU-based raytracing games out there; GPUs should be able to do that better).
But with other, more efficient rendering methods, yeah sure, you can get playable framerates.
 

jvroig

Platinum Member
Nov 4, 2009
Can we run a complete game with such techniques at playable framerates? I doubt it.
It can be done; it just wouldn't have the performance, that much was clear. That's not surprising, and it's probably not going to be much slower than what the OP is thinking of: getting rid of all the fixed-function hardware.

So: yes, it can be done today; no, it won't have enough performance to compete with the cards out today; yes, that's about what would happen if all the fixed-function hardware were removed; and no, nobody does it that way (skipping the fixed-function hardware, or removing it from the design entirely) for a reason - performance.

The very reason for having dedicated hardware is that it is immensely faster than software. So whenever some task becomes important enough, it gets some hardware dedicated to it. It happens everywhere: video cards evolved into graphics processors (dumb framebuffers became what they are today), Intel now has CPUs with AES-NI, and there are expensive networking products with dedicated hardware inside that handles the TCP three-way SYN / SYN-ACK / ACK handshake and buffering to spare the server's CPU the processing work.
 

Anarchist420

Diamond Member
Feb 13, 2010
The performance will be good enough when the manufacturing process can handle it. If one were going with a 10 nm process, then it would definitely be worth trying, because you could add more shaders and/or increase the clock speeds. 1024 CUDA cores + 32 TMUs + the appropriate amount of L2 cache + a 1024-bit ring-bus memory controller at 1.5 GHz would be fast enough. The problem is that you can't get 1024 CUDA cores at 1.5 GHz on a 40 nm process, and probably not even at 22 nm.
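
For rough scale (my own numbers, counting an FMA as 2 FLOPs): 1024 CUDA cores x 2 FLOPs/clock x 1.5 GHz ≈ 3.1 TFLOPS of single-precision shader throughput, a bit more than double a 40 nm GTX 480 (480 cores x 2 x 1.4 GHz ≈ 1.35 TFLOPS). Whether that doubling would cover the cost of doing the ROPs' work in software is the open question.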
 

dguy6789

Diamond Member
Dec 9, 2002
Fast enough for what? Today's games? Tomorrow's games? What are you basing this on? How are you accounting for the severe performance loss of losing ROPs when you say X cuda cores will be enough?
 

bryanW1995

Lifer
May 22, 2007
This is probably the best answer you're going to get. ROPs are very, very good at what they do and running their functions on programmable hardware would indeed be much slower. They also serve a dual-purpose of moving rasterization (the most RAM-bandwidth hungry operation in rendering) closer to RAM, which is why you see them coupled with memory controllers in both AMD and NVIDIA designs.

It's a bit off topic, but from what I understand ROPs were a big problem in Intel's initial Larrabee design. Intel initially wanted to go with a purely programmable design - no geometry setup, no tessellator, no ROPs. Ultimately this wasn't a workable approach, and it was one of the reasons Larrabee was first delayed, as they needed to work ROPs into their design. If ROPs can ever be removed, it will probably be Intel who does it first.

Speaking of Larrabee, has Intel just completely scrapped that program, or are they working on Larrabee II?
 

jvroig

Platinum Member
Nov 4, 2009
They're still working on it, as per the news when Larrabee I was cancelled.
 

Scali

Banned
Dec 3, 2004
Larrabee I wasn't completely cancelled though.
It just wasn't marketed as a retail product. They did build a series of Larrabee units for internal development and trusted partners.
Intel Senior Fellow and CTO, Justin Rattner, showed Larrabee could crack the 1 TeraFLOP mark at SC09 using a standard HPC benchmark (SGEMM 4Kx4K calculation).

So it's not THAT bad.