Question: 'Ampere'/Next-gen gaming uarch speculation thread

Ottonomous

Senior member
May 15, 2014
559
292
136
How much of a gain is the Samsung 7nm EUV process expected to provide?
How will the RTX components be scaled/developed?
Any major architectural enhancements expected?
Will VRAM be bumped to 16/12/12 for the top three?
Will there be further fragmentation in the lineup? (Keeping Turing at cheaper prices, while offering 'beefed up RTX' options at the top?)
Will the top card be capable of >4K60, or at least 4K90?
Would Nvidia ever consider an HBM implementation in the gaming lineup?
Will Nvidia introduce new proprietary technologies again?

Sorry if imprudent/uncalled for, just interested in the forum members' thoughts.
 

ozzy702

Golden Member
Nov 1, 2011
1,151
530
136
The rumor I'm hearing is that a 20GB 3080 model is still real and we may know more in October. Also, G6X is quite a bit more expensive (~$14-20 per GB), so if the 20GB model is coming, I'd expect it to be $100-$200 more than the 10GB 3080.


If it's actually a $200 premium with no other Ti-style increase, that's a no for me. Better to just get the 3080 now, sell it a year to 18 months down the road, and pick up something more powerful with more memory. I doubt 10GB is going to be a bottleneck even at 4K anytime soon.
 

MrTeal

Diamond Member
Dec 7, 2003
3,614
1,816
136
If it's actually a $200 premium with no other Ti-style increase, that's a no for me. Better to just get the 3080 now, sell it a year to 18 months down the road, and pick up something more powerful with more memory. I doubt 10GB is going to be a bottleneck even at 4K anytime soon.
Yeah, that wouldn't be worth the cost. You'd be better off just pocketing the $200 and putting it towards a 4080 or equivalent in a couple years when 10GB might start becoming restrictive.
 

MrTeal

Diamond Member
Dec 7, 2003
3,614
1,816
136

From the sounds of it, getting a block for an FE 3080 might be an issue. Apparently EK is working on one, release date TBD.
Might need to look at a partner board with the reference design instead of an FE from Nvidia.
 

eddman

Senior member
Dec 28, 2010
239
87
101
OK, I have created an Ampere SM approximation to show the difference with the new CUDA core partitions.

Inside the red square is the new CUDA core datapath, which can execute either 16x FP32 or 16x INT32 per cycle.
With the addition of the second 16x FP32 datapath, each partition can now execute 16x FP32 + 16x FP32 per cycle, or 16x FP32 + 16x INT32.
So an SM can now do 128x FP32 versus 64x FP32 before, and that's the doubled throughput they get in Ampere.

Ampere GA102 SM Approximation

[image: SM-GA102-2.png]

Hardwareluxx made a similar diagram, too:

[image: QHKNrOX.png]
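To put rough numbers on it, here's a quick Python sketch of that issue model (my own simplification of the diagram, not anything official):

```python
# Simplified per-clock issue model of one Ampere GA102 SM (4 partitions).
# Each partition has two 16-wide datapaths:
#   datapath A: FP32 only
#   datapath B: FP32 or INT32 (one or the other in a given clock)
PARTITIONS_PER_SM = 4
LANES_PER_DATAPATH = 16

def sm_issue(b_runs_int: bool) -> tuple[int, int]:
    """Return (FP32 ops, INT32 ops) issued per clock by one SM."""
    fp32 = LANES_PER_DATAPATH            # datapath A always runs FP32
    int32 = LANES_PER_DATAPATH if b_runs_int else 0
    if not b_runs_int:
        fp32 += LANES_PER_DATAPATH       # datapath B joins in on FP32
    return (fp32 * PARTITIONS_PER_SM, int32 * PARTITIONS_PER_SM)

print(sm_issue(b_runs_int=False))  # (128, 0): the doubled FP32 case
print(sm_issue(b_runs_int=True))   # (64, 64): same FP32 rate as Turing
```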


Nvidia has stated this:

"To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock."

You'd have to excuse me if the following questions seem dumb. I don't know nearly enough about this subject at this level.

1. If it can utilize both datapaths at the same time, why can't it do FP+FP+INT?
2. Is it unable to utilize both the FP and INT cores located in the same datapath at the same time? If so, why?
3. If 2 is true, and considering it can use both datapaths, why not put all FP cores in one datapath and all INT cores in the other?
4. For point 3, if it's somehow not possible to put so many FP cores in one path, and considering the GPU can utilize more than one datapath at the same time, why not make three paths: two each with an equal number of FP cores, and a third with just INT?
 
Last edited:

aigomorla

CPU, Cases & Cooling Mod | PC Gaming Mod | Elite Member
Super Moderator
Sep 28, 2005
20,894
3,247
126
Seems to me that in this one game it's 45-50% better for the 3080 over the 2080 Ti.

No offense meant when I say this...
but everyone knows Doom is really, really well coded compared to other traditional AAA titles.
Hence it really does not show off much.

I'd really like to see it with something we know breaks hardware, like Minecraft with ray tracing enabled (not joking); Final Fantasy XV would also do.
 

Hitman928

Diamond Member
Apr 15, 2012
6,186
10,693
136
1. If it can utilize both datapaths at the same time, why can't it do FP+FP+INT?

One datapath is capable of FP only. The other datapath is capable of FP or INT, so you can't do FP+FP+INT all at the same time.

2. Is it unable to utilize both the FP and INT cores located in the same datapath at the same time? If so, why?

Correct, as answered for point 1. The FP and INT compute cores aren't all that is necessary to do calculations. There are registers, dispatch units, etc. that are required to actually get the correct data to and from the FP and INT units. The second datapath, which can do INT or FP, shares some of that circuitry to save on space and power.

3. If 2 is true, and considering it can use both datapaths, why not put all FP cores in one datapath and all INT cores in the other?

That is what Turing did. Turing could do 1 INT + 1 FP. This is the next evolution, where there is an additional datapath, but only for FP calculations, which allows for either 1 FP + 1 FP or, alternatively, 1 FP + 1 INT.

4. For point 3, if it's somehow not possible to put so many FP cores in one path, and considering the GPU can utilize more than one datapath at the same time, why not make three paths: two each with an equal number of FP cores, and a third with just INT?

You could do that, but everything in engineering is a balancing game. You have size and power considerations, register pressure considerations, bandwidth considerations, etc. Basically, Nvidia looked at what calculation mix modern games use, what mix they think games will use in the near future, and planned the architecture accordingly. More specifically, Nvidia looked at the calculation mix they think Ampere will be asked to perform for modern games and determined that it would be roughly 2/3 FP and 1/3 INT. So this was their solution for tweaking Turing into the most balanced and efficient architecture for that compute mix. Hopefully this makes sense.

Edit: the 2/3 and 1/3 numbers are for illustrative purposes only. The actual mix could be 3/4 and 1/4, or even 9/10 and 1/10. The point is that Nvidia sees the compute needs as being FP heavy, but INT compute is still very much needed, so why have a dedicated INT path when it will just sit there waiting for work the majority of the time?
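To make that concrete, here's a toy Python model of a single partition (my own illustration; the FP/INT mixes are made up, per the edit above):

```python
def turing_cycles(n_fp: int, n_int: int) -> float:
    # Turing partition: one 16-wide FP path + one 16-wide INT path running
    # concurrently, but FP work can never borrow the INT path.
    return max(n_fp, n_int) / 16

def ampere_cycles(n_fp: int, n_int: int) -> float:
    # Ampere partition: path A is 16-wide FP-only, path B is 16-wide FP-or-INT.
    # INT is stuck on path B; FP spills onto whatever issue slots B has left.
    return max((n_fp + n_int) / 32, n_int / 16)

ops = 9600
for fp_frac in (0.5, 2 / 3, 0.9, 1.0):
    fp = round(ops * fp_frac)
    it = ops - fp
    print(f"FP mix {fp_frac:.0%}: Ampere is "
          f"{turing_cycles(fp, it) / ampere_cycles(fp, it):.2f}x Turing")
# 50% -> 1.00x, 67% -> 1.33x, 90% -> 1.80x, 100% -> 2.00x (per partition, per clock)
```

The more FP-heavy the mix, the closer you get to the marketing 2x; at a 50/50 mix the extra FP path buys you nothing.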
 
Last edited:

Saylick

Diamond Member
Sep 10, 2012
3,531
7,858
136
Nvidia has stated this:

"To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock."

You'd have to excuse me if the following questions seem dumb. I don't know nearly enough about this subject.

1. If it can utilize both datapaths at the same time, why can't it do FP+FP+INT?
2. Is it unable to utilize both the FP and INT cores located in the same datapath at the same time? If so, why?
3. If 2 is true, and considering it can use both datapaths, why not put all FP cores in one datapath and all INT cores in the other?
4. For point 3, if it's somehow not possible to put so many FP cores in one path, and considering the GPU can utilize more than one datapath at the same time, why not make three paths: two each with an equal number of FP cores, and a third with just INT?
I think others (or even this link) can better answer your questions directly, but I wanted to at least drop some diagrams to help illustrate what's going on behind the scenes.
[image: SMMrecolored_575px.png]


Here's a cool comparison over the years:
[image: 2019-07-21-image-2-p.webp]
 

DDH

Member
May 30, 2015
168
168
111
One datapath is capable of FP only. The other datapath is capable of FP or INT, so you can't do FP+FP+INT all at the same time.



Correct, as answered for point 1. The FP and INT compute cores aren't all that is necessary to do calculations. There are registers, dispatch units, etc. that are required to actually get the correct data to and from the FP and INT units. The second datapath, which can do INT or FP, shares some of that circuitry to save on space and power.



That is what Turing did. Turing could do 1 INT or 1 FP. This is the next evolution, where there is an additional datapath, but only for FP calculations, which allows for either 1 FP + 1 FP or, alternatively, 1 FP + 1 INT.



You could do that, but everything in engineering is a balancing game. You have size and power considerations, register pressure considerations, bandwidth considerations, etc. Basically, Nvidia looked at what calculation mix modern games use, what mix they think games will use in the near future, and planned the architecture accordingly. More specifically, Nvidia looked at the calculation mix they think Ampere will be asked to perform for modern games and determined that it would be roughly 2/3 FP and 1/3 INT. So this was their solution for tweaking Turing into the most balanced and efficient architecture for that compute mix. Hopefully this makes sense.
Are you sure Turing wasn't capable of 1 FP + 1 INT?
 

uzzi38

Platinum Member
Oct 16, 2019
2,705
6,427
146
Turing could do both simultaneously if the software was written for it.

You know, that's what Async Compute is. It's leveraging both pipelines at the same time.

EDIT: Okay, so that was a mistake. Fair enough.
 

Konan

Senior member
Jul 28, 2017
360
291
106
No offense meant when I say this...
but everyone knows Doom is really, really well coded compared to other traditional AAA titles.
Hence it really does not show off much.

I'd really like to see it with something we know breaks hardware, like Minecraft with ray tracing enabled (not joking); Final Fantasy XV would also do.

None taken, it's just one game and I fully agree :)
 

Konan

Senior member
Jul 28, 2017
360
291
106
The Reddit Nvidia Q&A had this interesting statement from an Nvidia rep.

Ray tracing denoising shaders are good examples that might benefit greatly from doubling FP32 throughput.

So much for needing the Tensor cores for this.

DLSS 2.0 requires Tensor cores
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
Unless NVIDIA provides the number of chips actually shipped and sold, everybody will spin this just as they want, and none of them will be either right or wrong. It will be Schrödinger's supply and demand. Exactly how it always has been, no matter which brand or vendor we're talking about.
 

eddman

Senior member
Dec 28, 2010
239
87
101
One datapath is capable of FP only. The other datapath is capable of FP or INT, so you can't do FP+FP+INT all at the same time.
So this is why the 30 series cards' gaming performance is so much lower than their max theoretical compute performance.

This is the next evolution, where there is an additional datapath, but only for FP calculations, which allows for either 1 FP + 1 FP or, alternatively, 1 FP + 1 INT.
So they did it specifically to enable FP+FP. I suppose this is probably aimed at scientific calculations, etc.? Are there any desktop use cases where this could come into play?

You could do that, but everything in engineering is a balancing game. You have size and power considerations, register pressure considerations, bandwidth considerations, etc.
Is it possible to guess how much bigger the die would've been if they implemented three paths?

I'm trying to imagine how much higher the gaming performance would be with three paths. IINM games tend to be FP heavy; the jump in performance would probably still be substantial, no?
 
Last edited:

Hitman928

Diamond Member
Apr 15, 2012
6,186
10,693
136
So this is why the 30 series cards' gaming performance is so much lower than their max theoretical compute performance.

A big reason, yes. The quoted FLOPS assume you are doing FP+FP the entire time, which is not realistic for gaming. I haven't studied all of Nvidia's marketing slides, but usually there are additional FLOPS quotes where they count tensor FLOPS and everything else together as well, which is also unrealistic.

So they did it specifically to enable FP+FP. I suppose this is probably aimed at scientific calculations, etc.? Are there any desktop use cases where this could come into play?

Pure compute should benefit, obviously, but I'm sure gaming will benefit as well because, as I said, games tend to be FP heavy, so you'll get a decent speed boost out of it. Nowhere near 2x, mind you, but it should be a noticeable boost.

Is it possible to guess how much bigger the die would've been if they implemented three paths?

Not really, not without knowing the size of each SM and each datapath inside it, which I'm pretty sure no one outside of Nvidia can tell you, but if someone can, I'd be happy to hear it.

I'm trying to imagine how much higher the gaming performance would be with three paths. IINM games tend to be FP heavy; the jump in performance would probably still be substantial, no?

Depends on a lot of things. If we wave away power and size requirements, then most likely you could get a very noticeable speed bump. However, we do have size and power requirements, so the real question is what you'd have to sacrifice in order to put that third path in. I'm guessing that at 8 nm, you would do more harm than good putting in a third path.
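For a sense of scale, here's the napkin math behind the quoted number (3080 core count and boost clock from the spec sheet; the "effective" line reuses the illustrative 2/3 FP mix from my earlier post):

```python
# RTX 3080 paper math: 8704 CUDA cores x 2 FLOPs/clock (FMA) x ~1.71 GHz boost.
cores, boost_ghz = 8704, 1.71
peak_tflops = cores * 2 * boost_ghz / 1000
print(f"quoted peak : {peak_tflops:.1f} TFLOPS")  # ~29.8, assumes FP+FP every clock

# With a 2/3 FP : 1/3 INT instruction mix (illustrative, not a measurement),
# every lane can stay busy but only 2/3 of the issued ops are FP32:
fp_frac = 2 / 3
print(f"effective FP: {peak_tflops * fp_frac:.1f} TFLOPS")  # ~19.8 under this mix
```

The gap between those two numbers is exactly the "FP+FP the entire time" assumption.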
 

dr1337

Senior member
May 25, 2020
417
691
136
So this is why the 30 series cards' gaming performance is so much lower than their max theoretical compute performance.

Are there any desktop use cases where this could come into play?
Well, with the benchmarks we have now, it would seem that the 3070 with its "5888 CUDA cores" is indeed faster than the 4352 cores of the current 2080 Ti. Though until third-party benchmarks come out, it's not going to be exactly clear how well the new design scales in games and typical desktop applications. If we take the Ampere reveal at face value, then objectively it would seem that they have truly doubled compute performance.

TFLOPS have never been a direct proxy for gaming performance between architectures. If you can really squeeze the performance out of Ampere like Nvidia claims, then this generation has really good potential for rendering and mining.
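As a rough sketch of why those core counts still line up the way they do (a toy model on my part; the 2/3 FP mix is illustrative and the boost clocks are approximate):

```python
# FP32-throughput comparison, 3070 vs 2080 Ti, under an assumed 2/3 FP mix.
mix_fp = 2 / 3  # assumed FP share of issued ops

# 2080 Ti: 4352 FP32 cores plus *separate* INT32 cores, ~1.545 GHz boost.
# At this mix the INT path keeps up, so all 4352 FP lanes stay busy.
turing_fp_per_s = 4352 * 1.545e9

# 3070: 5888 "CUDA cores" = 2944 FP-only lanes + 2944 FP-or-INT lanes,
# ~1.73 GHz boost. INT work takes 1/3 of total issue, leaving 2/3 for FP.
ampere_fp_per_s = 5888 * mix_fp * 1.73e9

print(f"3070 vs 2080 Ti FP32 throughput: {ampere_fp_per_s / turing_fp_per_s:.2f}x")
# ~1.01x -- consistent with the "3070 is about a 2080 Ti" claim in this toy model
```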
 

Tup3x

Golden Member
Dec 31, 2016
1,086
1,084
136
If it's actually a $200 premium with no other Ti-style increase, that's a no for me. Better to just get the 3080 now, sell it a year to 18 months down the road, and pick up something more powerful with more memory. I doubt 10GB is going to be a bottleneck even at 4K anytime soon.
I game at 1440p, so I'm pretty sure it will be just fine for quite some time. At this point I just want to upgrade - an RTX 3080 would be an absolutely massive upgrade, and it would still offer 2 GB more VRAM. Heck, I don't care what comes out after.
 

maddie

Diamond Member
Jul 18, 2010
4,881
4,951
136
One datapath is capable of FP only. The other datapath is capable of FP or INT, so you can't do FP+FP+INT all at the same time.



Correct, as answered for point 1. The FP and INT compute cores aren't all that is necessary to do calculations. There are registers, dispatch units, etc. that are required to actually get the correct data to and from the FP and INT units. The second datapath, which can do INT or FP, shares some of that circuitry to save on space and power.



That is what Turing did. Turing could do 1 INT + 1 FP. This is the next evolution, where there is an additional datapath, but only for FP calculations, which allows for either 1 FP + 1 FP or, alternatively, 1 FP + 1 INT.



You could do that, but everything in engineering is a balancing game. You have size and power considerations, register pressure considerations, bandwidth considerations, etc. Basically, Nvidia looked at what calculation mix modern games use, what mix they think games will use in the near future, and planned the architecture accordingly. More specifically, Nvidia looked at the calculation mix they think Ampere will be asked to perform for modern games and determined that it would be roughly 2/3 FP and 1/3 INT. So this was their solution for tweaking Turing into the most balanced and efficient architecture for that compute mix. Hopefully this makes sense.

Edit: the 2/3 and 1/3 numbers are for illustrative purposes only. The actual mix could be 3/4 and 1/4, or even 9/10 and 1/10. The point is that Nvidia sees the compute needs as being FP heavy, but INT compute is still very much needed, so why have a dedicated INT path when it will just sit there waiting for work the majority of the time?
Yep. It seems that Nvidia is allowing for some flexibility in the INT/FP instruction ratio, as the fixed ratio they used for the 2xxx series can only rarely be optimal for any one game. This way they get closer to the best possible.
 

aigomorla

CPU, Cases & Cooling Mod | PC Gaming Mod | Elite Member
Super Moderator
Sep 28, 2005
20,894
3,247
126
[image: WNEZTt9.jpg]


That looks... wrong. :grin:

AHAHAHA... *cough* omg, you made me cough out a lung from laughing so hard...

I seriously can't think of ANY ITX case that could handle that... which defeats the whole point of ITX...
 

aigomorla

CPU, Cases & Cooling Mod | PC Gaming Mod | Elite Member
Super Moderator
Sep 28, 2005
20,894
3,247
126
Why the hell do they still have SLI connectors on the cards when SLI is practically dead and no one is dumb enough to do SLI?