FERMI Architecture Analysis

busydude · Oct 30, 2010

Recently, there was an article published on Beyond3D which discusses the GF100 and compares it to Cypress, a GTX 470 and a 5870 to be precise.

I am quoting what I think is the most interesting part of the article:

Triangle setup is one of the questions that has had our attention for a long stretch of time, ever since Fermi got initially announced. Prior to actually having hardware on hand, we relied on help from a friend (thanks Dean!) to run some custom tests on a GTX 480 to see what's what. Armed with the knowledge gained from that early experience, we iterated a few times until we reached a satisfactory test setup, and a moderately satisfactory level of knowledge.

What we do rendering wise is deceivingly simple: we render a screen filling mesh (ResolutionX x ResolutionY sized) that we tessellate based on the triangle area we desire to check. The mesh lives in the Oxy plane in object space, and gets orthographically projected into screen-space (thus we ensure that we're actually getting the screen-space triangle area that we want). It's nicely uniform, with all triangles being congruent.

There are two possible test cases that we use the mesh for: we either fully tessellate on the host side, at mesh instantiation, which basically means subdividing it into congruent quads (area for 1 quad equals two times the desired triangle area, since there are two triangles per quad) and then building the index list. In this case we also pass the mesh through a D3DX10Mesh->Optimize step.

The second case involves using GPU tessellation, which means that on the host side we coarsely subdivide the screen filling mesh into quads (area for 1 quad = (2 * i * i) * desired triangle area, where i represents the tessellation factor), and then finely tessellate the quads on the GPU, using the adequate tessellation factor (we use quad patches since they produce a nice, even pattern). The process is a bit more involved, but we doubt you're all that interested in knowing more.

For each case there are two sub-cases, namely rendering only with Z enabled, thus getting only depth interpolated across the tris during setup, or rendering with a varying number of attributes interpolated across the tris, which get returned in the pixel shader. We use instancing to draw the mesh multiple time with minimal overhead, and then rely on the D3D query mechanism to get data back about the rendering process (depth test is set to always). With that said, let's first look at what happens when we tessellate on the host side (we've also thrown in an 8800GT for a glimpse at how things used to be):

First shock : Fermi looks as if it's doing slightly under 1 triangle per clock, or, wording it differently, reaching less than 25% efficiency (the fact that G8x/G9x are 2 clocks per triangle architecture should hardly surprise anyone by now). The Z-only case is likely to be the closest we can get to isolating the setup part (VP transform, coverage determination, interpolant calculation), since attribute interpolation is simpler and has a diminished impact.

Even so, larger triangle sizes are impacted by it as the bottleneck moves towards that end of the process, and no buffering is infinite (also, triangles start crossing screen tiles, thus inducing further inefficiencies as work gets replicated). Cypress handles itself decently by comparison, but it's quite interesting to look at how the two architectures handle interpolating progressively more attributes: Slimer has no preference with regards to either interpolating 4 Float attributes or using the system value mechanism, namely SV_Position (also 4 Floats), whereas Cypress dislikes it.

This hints at Cypress having to go through a slow path in this particular case (or rather, it hints at the existence of a small dedicated cache/buffer for system values, that gets thrashed with larger triangle sizes as the bottleneck shifts towards the last stage of rasterisation).

It also appears that Slimer is marginally better with attribute interpolation overall. All of the above is an important datapoint since an en-vogue theory was that small triangles kill Cypress and empower Slimer this doesn't seem to be the case. But maybe not all triangles are the same:

Hello, parallel setup, welcome! This is considerably more in touch with what marketing slides left and right have been showing between 1.8 and 2.1 triangles per cycle is pretty nifty, if still a bit far from the theoretical count of 4. Data routing has its cost, and the more data that needs to be re-shuffled, the lesser achieved performance, so for extremely fat vertices/control points, even less parallelism can be achieved.

When we look at the competition, it's even better: up to 2.57 times faster, because not only does Slimer speed up, but also Cypress slows down. There's a currently running meme about Cypress taking 3 clocks per tessellated triangle - this is incorrect in an absolute sense, albeit we can generate that scenario quite easily, as we can do a bit better than what you're seeing (note we've reached up to ~600 MTris/s by using triangular patches, thus trimming down the per control point data, all else being equal).

ATI's problem is primarily one of data-flow: they try to keep some data in shared memory (as far as we can see, they try to keep HS-DS pairs resident on the same SIMD, with hull shaders being significantly more expensive then domain shaders) but data to and from the tessellator needs to go through the GDS. There's also the need to serialise access to the tessellator, since it's an unique resource, coupled with a final aspect we'll deal with when looking at math throughput.

Given all this, fatter control points (our control points are as skinny as possible) or heavy math in the HS (there's no explicit math in ours, but there's some implicit tessellation factor massaging and addressing math) hurt Cypress comparatively more than they hurt Slimer - and now you know how the 3 clocks per triangle scenarios come into being, a combination of the two aforementioned factors.

Getting back to the main course, the question remains: why does Slimer need tessellation to expose its parallel setup capability? You may be thinking something along the lines of "bah, you guys are stupid, it's painfully obvious that the mesh data-set was too large, and caused vertex cache trashing/excessive fetches from VRAM". Whilst this line of reasoning is not without merit, we did struggle to be as un-stupid as possible we tried rendering just a few tris of given size, but the result was well within noise margins.

In fact, we struggled with many potential theories, until a fortuitous encounter with a Quadro made the truth painfully obvious: product differentiation, a somewhat en vogue term over at NVIDIA, it seems. The Quadro, in spite of being pretty much the same hardware (this is a signal to all those that believe there's magical hardware in the Quadro because it's more expensive engage rant mode!), is quite happy doing full speed setup on the untessellated plebs.

We can only imagine that this seemed like a good idea to someone. Sure, there's a finite probability that traditional Quadro customers, who are quite corporate and quite fond of extensive support amongst other things, would suddenly turn into full blown hardware ricers, give up all perks that come with the high Quadro price, buy cheap consumer hardware and use that instead.

Capping is done in drivers, by inducing artificial delays during the post viewport transform reordering (mind you this hasn't yet been confirmed by NVIDIA, but our own educated conclusion). Amusingly enough, Teslas get the cap too, in spite of qualifying for the "really fucking expensive" category as well. We'll refrain of arguing for or against the decision, since there are points to be made coming from either angle, and just stop at reporting it. That means it's time for maths!

Link

Dribble · Oct 30, 2010

Interesting article, interesting points above, ty for sharing.

busydude · Oct 31, 2010

Dribble said:
Interesting article, interesting points above, ty for sharing.

Welcome. By looking at that tessellation graphs, its clear why AMD was stressing on using ~16 pixels/triangle, Fermi architecture just blows Cypress out of the water when triangles get smaller. This just proves what Scali was saying all along.

Petey! · Nov 1, 2010

General cliffs for the tl;dr?

busydude · Nov 1, 2010

Petey! said:
General cliffs for the tl;dr?

If you look at images #3 and #4 it is clear that Fermi Architecture is superior to Cypress in the sense that as triangle area decreases, Fermi scales better.

AMD, in their recent conference for launch of Barts mentioned using tessellation adaptively and not using brute force. They specified using average of ~16 pixels per triangle.. where you will notice that both Cypress and Fermi perform similarly.

But Scali argues that 16 pixel per triangle limit is just a PR stunt and is not the right wat to use tessellation, that limit was mentioned because Cypress performs poorly at lower pixels/triangle.

bryanW1995 · Nov 2, 2010

actually, 16 pixels is fine for now, but in the indefinite "future" nvidia's approach will be better. if amd was pushing more into the professional market they would need more of this approach, but for this gen at least amd's implementation is ok. however, if you want your gtx 480 or 6970 to last 7 years then you might be in trouble

thilanliyan · Nov 2, 2010

busydude said:
This just proves what Scali was saying all along.

Nobody said Scali was wrong about Fermi's tessellation performance, just that his emphasis on tessellation performance is not entirely correct (RIGHT NOW). It probably will be important in future though, and by then we will have new cards anyway.

Keysplayr · Nov 2, 2010

thilanliyan said:
Nobody said Scali was wrong about Fermi's tessellation performance, just that his emphasis on tessellation performance is not entirely correct (RIGHT NOW). It probably will be important in future though, and by then we will have new cards anyway.

Which means?

-Slacker- · Nov 2, 2010

Why exactly do we need less than 16 pixels per triangle again? Who would spot the difference?

BFG10K · Nov 2, 2010

Keysplayr said:
Which means?

It means even the fastest tessellation part right now (GTX480) slideshows in current games when said feature is enabled.

Scali · Nov 2, 2010

thilanliyan said:
Nobody said Scali was wrong about Fermi's tessellation performance, just that his emphasis on tessellation performance is not entirely correct (RIGHT NOW). It probably will be important in future though, and by then we will have new cards anyway.

You realize that this 'emphasis' is merely your subjective interpretation of my words...

Lonbjerg · Nov 2, 2010

BFG10K said:
It means even the fastest tessellation part right now (GTX480) slideshows in current games when said feature is enabled.

Oh realy...care to back that statmenet up with data?

Scali · Nov 2, 2010

busydude said:
This just proves what Scali was saying all along.

Yup, even their explanation of the bottleneck:
"ATI's problem is primarily one of data-flow
...
There's also the need to serialise access to the tessellator, since it's an unique resource"

This has been common knowledge for a while, but it's nice to see other sites backing it up. Perhaps people who weren't willing to take my word for it, will accept these facts now.

Scali · Nov 2, 2010

-Slacker- said:
Why exactly do we need less than 16 pixels per triangle again? Who would spot the difference?

Well, a few weeks ago, Huddy was arguing for 4 pixels per triangle. Seems that the triangles get larger all the time, with AMD:
http://www.kitguru.net/components/g...e-constant-smell-of-burning-bridges-says-amd/

Richard Huddy said:
To be intelligent, a triangle needs to be more than 4 pixels big for tessellation to make sense.

Why we need smaller triangles is obvious: better image quality.
If we want full Pixar RenderMan quality, we need to go down to subpixel level (with the advantage that you can optimize your rasterizer for that case).
And yes, you WILL spot the difference. 16 pixel triangles are actually still quite large.

Scali · Nov 2, 2010

busydude said:
But Scali argues that 16 pixel per triangle limit is just a PR stunt and is not the right wat to use tessellation, that limit was mentioned because Cypress performs poorly at lower pixels/triangle.

No, what I said was: the size of the triangles is not important. It's the throughput.
And that is exactly what Beyond3D says as well.
Generating extra triangles is expensive on the serial architecture of the Radeon. Even if these are larger than 16 pixels, because the size of the triangles is only relevent to the rasterizer, while they are demonstrating that there is a throughput bottleneck in the tessellator.
As you can see from the first chart, the Radeon outperforms the GeForce at every triangle size, with more or less the same margin... when tessellation is not enabled. So the Radeon has the advantage in triangle size.

Arkadrel · Nov 2, 2010

Scali...

I think youve made it clear how superior you find nvidias tessellator.

Thread-crapping and personal attacks are not acceptable in VC&G.

This is a technical forum, please conduct yourself accordingly.

If a thread has info that is of little interest to you then just stay out of the thread.

If a post has info that is of little interest to you then refrain from posting negative comments regarding the poster.

Moderator Idontcare

Lonbjerg · Nov 2, 2010

Arkadrel said:
Scali...

I think youve made it clear how superior you find nvidias tessellator.

I fail to see the relevance of this post?

Genx87 · Nov 2, 2010

Arkadrel said:
Scali...

I think youve made it clear how superior you find nvidias tessellator.

I think he has made it clear with more proof how superior Nvidia's tesselator is.

Scali · Nov 2, 2010

Yea, this is just a technical discussion, where some synthetic tests have delivered a number of facts about the performance of each architecture.
I don't think he really understood the issue, because I have just said that that Radeon has a better triangle setup/rasterizer than Fermi (although according to Beyond3D, this is an artificial limit, and the Quadro series drivers let the triangle setup run at full speed).

It's ironic though... Radeons are actually BETTER at rendering triangles of < 16 pixels than GeForces.
*Woosh* <-- the sound of Richard Huddy's credibility fleeting.

Arkadrel · Nov 2, 2010

"You are beating a dead horse when you insist on talking about something that cannot be changed ... Beating a dead horse is an action that has no purpose, because no matter how hard or how long you beat a dead horse, it is not going to get up and run ... To repeatedly bring up a particular topic with no chance of affecting the outcome is beating a dead horse."

1) Amds currently tessellator cannot be changed (it ll happend in cards to come)

2) "an action that has no purpose, because no matter how hard or how long you beat a dead horse, it is not going to get up and run"

We wont see 1pixle tessellation anytime soon because of consol ports/amd parts not doing it well/dx9 cards marketshare/people running windows xp. Yet you make it sound like amd is doing something wrong with their tessellator. Its sufficent for current needs in gameing, by the time that changes people will have new cards.

3) "To repeatedly bring up a particular topic with no chance of affecting the outcome is beating a dead horse". I dont think your bringing up this 1pixle level tessellation will effect how amd or nvidia design cards. Nor will it make game developers design games that has it (currently).

Scali · Nov 2, 2010

I haven't said a thing about 1-pixel tessellation. Don't twist the facts to suit your own agenda.
Also, don't post if you're not interested in the discussion. Posting remarks like "beating a dead horse" isn't very constructive. Apparently there are plenty of people who are still willing to discuss this topic, and this thread provides some new information and technical facts on the topic.

BFG10K · Nov 2, 2010

Lonbjerg said:
Oh realy...care to back that statmenet up with data?

You mean aside from owning a GTX480 and benchmarking and/or playing several tessellation titles from start to finish?

Tessellation in games currently falls into two categories:

A small performance hit for essentially zero image quality gain, usually only visible in still screenshots in rare situations (e.g. Stalker).
A massive performance hit with a questionable gain with in-game image quality (e.g. Metro 2003’s performance almost cut in half at 2560x1600, the resolution I play the game at).

I disable tessellation in any game that supports it on my GTX480. At this time, tessellation performance is a non-factor because I’d do exactly same when gaming on a Radeon. Synthetic tessellation benchmarks also mean squat to that equation.

The same thing applies to hardware PhysX, but I don’t want to go off topic with that one.

Lonbjerg · Nov 2, 2010

BFG10K said:
You mean aside from owning a GTX480 and benchmarking and/or playing several tessellation titles from start to finish?

Tessellation in games currently falls into two categories:

A small performance hit for essentially zero image quality gain, usually only visible in still screenshots in rare situations (e.g. Stalker).

A massive performance hit with a questionable gain with in-game image quality (e.g. Metro 2003s performance almost cut in half at 2560x1600, the resolution I play the game at).

I disable tessellation in any game that supports it on my GTX480. At this time, tessellation performance is a non-factor because Id do exactly same when gaming on a Radeon. Synthetic tessellation benchmarks also mean squat to that equation.

The same thing applies to hardware PhysX, but I dont want to go off topic with that one.

So you got nothing?

BFG10K · Nov 2, 2010

Lonbjerg said:
So you got nothing?

I have the hardware and the games. What have you got?

tviceman · Nov 2, 2010

BFG10K said:
It means even the fastest tessellation part right now (GTX480) slideshows in current games when said feature is enabled.

I don't think tessellation alone is the cause of Metro2033's slowdowns. It does incur a performance hit turning tessellation on, but it isn't game breaking. There are other factors at work with that game which contribute heavily to low frame rates. Civilization 5 does not slide show with tessellation and with what yesterday's HAWX 2 benchmark demonstrated, there will not be any significant slow downs either with this game (releasing in two weeks).

FERMI Architecture Analysis

Diamond Member

Platinum Member

Diamond Member

Senior member

Diamond Member

Lifer

Lifer

Elite Member

Golden Member

Lifer

Banned

Diamond Member

Banned

Banned

Banned

Diamond Member

Diamond Member

Lifer

Banned

Diamond Member

Banned

Lifer

Diamond Member

Lifer

Diamond Member