Question 'Ampere'/Next-gen gaming uarch speculation thread


Ottonomous

Senior member
May 15, 2014
559
292
136
How much gain is the Samsung 7nm EUV process expected to provide?
How will the RTX components be scaled/developed?
Any major architectural enhancements expected?
Will VRAM be bumped to 16/12/12 for the top three?
Will there be further fragmentation in the lineup? (Keeping Turing at cheaper prices, while offering 'beefed up RTX' options at the top?)
Will the top card be capable of >60 fps at 4K, or at least 90?
Would Nvidia ever consider an HBM implementation in the gaming lineup?
Will Nvidia introduce new proprietary technologies again?

Sorry if imprudent/uncalled for, just interested in the forum members' thoughts.
 

sze5003

Lifer
Aug 18, 2012
14,181
625
126
Way out of my price range but I agree.
What I don’t understand is why it's so difficult to get more memory on cards lately.
Seems like an 8GB card is rare and 10GB seems like a weird number.
Why is it so difficult to have 8GB be the gaming normal and 16GB be the halo product?
I understand there may not be a use case for that memory but the market appears to want it.
Well, one of the reasons for leaving out such things and creating a big gap between the cards is so they can charge the prices they want. Compared to a 1080 Ti it's only 1 GB less, though, so is that really a big deal?

I guess that depends on the person and how they game or want to use the card.
 

Martimus

Diamond Member
Apr 24, 2007
4,488
152
106
Way out of my price range but I agree.
What I don’t understand is why it's so difficult to get more memory on cards lately.
Seems like an 8GB card is rare and 10GB seems like a weird number.
Why is it so difficult to have 8GB be the gaming normal and 16GB be the halo product?
I understand there may not be a use case for that memory but the market appears to want it.
It has to do with the GDDR6X memory they are using. It currently has only half the per-chip capacity of GDDR6. This will likely be rectified as time goes on, but it made it difficult for Nvidia to put large memory sizes on their new GPUs. Also, Nvidia usually differentiates their professional products from their gaming products by memory capacity, so that is likely another reason.
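To put numbers on it, here's a rough sketch of how bus width and per-chip density constrain the configurations (chip densities are the ones known to be shipping around launch; treat this as illustration, not an official breakdown):

```python
# Each GDDR6/GDDR6X chip sits on a 32-bit channel, so bus width fixes the
# chip count and per-chip density fixes the total VRAM.

def total_vram_gb(bus_width_bits, gb_per_chip, chips_per_channel=1):
    channels = bus_width_bits // 32
    return channels * chips_per_channel * gb_per_chip

# RTX 3080: 320-bit bus, 1 GB (8 Gb) GDDR6X chips -> 10 GB
print(total_vram_gb(320, 1))      # 10
# Same bus with 2 GB GDDR6X chips (not available yet) -> 20 GB
print(total_vram_gb(320, 2))      # 20
# RTX 3090: 384-bit bus, 1 GB chips in clamshell (2 per channel) -> 24 GB
print(total_vram_gb(384, 1, 2))   # 24
```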
 

CakeMonster

Golden Member
Nov 22, 2012
1,389
496
136
Everything points to there being room for another card between the 10GB 3080 and the 24GB 3090. Maybe the performance gap is a bit too narrow (remains to be seen), but that has not stopped them in the past. So yeah, count on it.

HOWEVER, be careful about basing your purchasing decision on getting something that is not confirmed to exist and does not have a launch date. For every post I read of someone saying they need the upgrade but have decided to wait for the "3080Ti", I worry about them not having considered the whole picture from a price/value perspective, considering how long they may have to wait.

Take the 2080Ti: not a good value price/performance-wise, BUT you had the top card for 2 years, which sweetened the deal a whole lot. The problem was you couldn't predict that beforehand, as there was always a yearly upgrade before 2018. If there is a 2 year gap now, waiting for a 3080Ti that launches in 6 months is a good deal. If there is a 1 year gap, it would not be a very good deal. If a 3080Ti launches in 12 months, with a 2 year gap for the next top card, eh, I guess it's still decent, but that's 12 months of lost performance if the slowness of your current card is already bothering you.

I'm NOT saying "just buy it", I don't want to get accused of that. But be aware of the value implications of staying with your old hardware while betting on something we don't even know will exist, without even knowing the time frame of its value.
 

Konan

Senior member
Jul 28, 2017
360
291
106
Rumor I'm hearing is that a 20GB 3080 model is still real, and we may know more in October. Also, G6X is quite a bit more expensive (~$14-20/GB), so if the 20GB model is coming, I'd expect it to be $100-$200 more than the 10GB 3080.
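Quick back-of-envelope on those numbers (the $14-20/GB is the rumored range above, not a confirmed price):

```python
# Added memory cost for a hypothetical 20GB 3080 at the rumored $/GB range.
extra_gb = 20 - 10
for dollars_per_gb in (14, 20):
    print(f"${dollars_per_gb}/GB -> ~${extra_gb * dollars_per_gb} extra memory cost")
# $14/GB -> ~$140 extra memory cost
# $20/GB -> ~$200 extra memory cost
```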
 

ozzy702

Golden Member
Nov 1, 2011
1,151
530
136
Rumor I'm hearing is that a 20GB 3080 model is still real, and we may know more in October. Also, G6X is quite a bit more expensive (~$14-20/GB), so if the 20GB model is coming, I'd expect it to be $100-$200 more than the 10GB 3080.


If it's actually a $200 premium with no other Ti-style increase, that's a no for me. Better to just get the 3080 now, then sell it a year to 18 months down the road and pick up something more powerful with more memory. I doubt 10GB is going to be a bottleneck, even at 4K, anytime soon.
 
  • Like
Reactions: tviceman

MrTeal

Diamond Member
Dec 7, 2003
3,568
1,696
136
If it's actually a $200 premium with no other Ti-style increase, that's a no for me. Better to just get the 3080 now, then sell it a year to 18 months down the road and pick up something more powerful with more memory. I doubt 10GB is going to be a bottleneck, even at 4K, anytime soon.
Yeah, that wouldn't be worth the cost. You'd be better off just pocketing the $200 and putting it towards a 4080 or equivalent in a couple years when 10GB might start becoming restrictive.
 

MrTeal

Diamond Member
Dec 7, 2003
3,568
1,696
136

From the sounds of it, getting a block for an FE 3080 might be an issue. Apparently EK is working on one, release date TBD.
Might need to look at a partner board with the reference design instead of an FE from Nvidia.
 

eddman

Senior member
Dec 28, 2010
239
87
101
OK, I have created an Ampere SM approximation to show the difference in the new CUDA core partitions.

Inside the red square is the new CUDA core partition, which can execute 16x FP32 or 16x INT32 per cycle.
With the addition of the second 16x FP32 datapath, each partition can now execute 16x FP32 + 16x FP32 per cycle, or 16x FP32 + 16x INT32.
So now they can do 128x FP32 versus 64x FP32 per SM, and that's the doubled throughput they get in Ampere.
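A toy calculation of where that doubling comes from (assuming 4 partitions per SM and 16-wide datapaths, as in the diagrams below):

```python
# Per-SM FP32 throughput, Turing vs. Ampere (toy model, not official numbers).
PARTITIONS = 4   # SM partitions (processing blocks)
LANES = 16       # FP32 lanes per datapath

fp32_paths = {"Turing": 1, "Ampere": 2}  # FP32-capable datapaths per partition
for arch, paths in fp32_paths.items():
    print(f"{arch}: {PARTITIONS * LANES * paths} FP32 ops/clock per SM")
# Turing: 64 FP32 ops/clock per SM
# Ampere: 128 FP32 ops/clock per SM
# Note: Ampere only hits 128 when the shared datapath has no INT32 work.
```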

Ampere GA102 SM Approximation

[diagram]

Hardwareluxx also made a similar diagram:

[diagram]


Nvidia has stated this:

"To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock."

You'll have to excuse me if the following questions seem dumb; I don't know nearly enough about this subject at this level.

1. If it can utilize both datapaths at the same time, why can't it do FP+FP+INT?
2. Is it unable to utilize both FP and INT cores located in the same datapath at the same time? If so, why?
3. If 2 is true, and considering it can use both datapaths, why not put all FP cores in one datapath, and all INT cores in the other one?
4. For point 3, if it's somehow not possible to put so many FP cores in one path, and considering the GPU can utilize more than one datapath at the same time, then why not make three paths: two with an equal number of FP cores each, and a third with just INT?
 
Last edited:
  • Like
Reactions: Kirito

aigomorla

CPU, Cases & Cooling Mod | PC Gaming Mod | Elite Member
Super Moderator
Sep 28, 2005
20,841
3,189
126
Seems to me in this one game it's 45-50% better for the 3080 over the 2080 Ti

No offense meant when I say this...
but everyone knows Doom is really, really well coded compared to other traditional AAA titles.
Hence it really does not show off much.

I'd really like to see it with something that we know breaks hardware, like Minecraft with ray tracing enabled (not joking); Final Fantasy 15 would also do.
 

Hitman928

Diamond Member
Apr 15, 2012
5,232
7,773
136
1. If it can utilize both datapaths at the same time, why can't it do FP+FP+INT?

One datapath is capable of FP. The other datapath is capable of FP or INT, so you can't do FP+FP+INT all at the same time.

2. Is it unable to utilize both FP and INT cores located in the same data path at the same time? If so, why?

Correct, as answered for point 1. The FP and INT compute cores aren't all that's necessary to do calculations. There are registers, dispatch units, etc. that are required to actually get the correct data to and from the FP and INT units. The second datapath that can do INT or FP will share some of that circuitry to save on space and power.

3. If 2 is true, and considering it can use both datapaths, why not put all FP cores in one datapath, and all INT cores in the other one?

That is what Turing did. Turing could do 1 INT + 1 FP. This is the next evolution, where there is an additional datapath, but only for FP calculations, which allows for either 1 FP + 1 FP or alternatively 1 FP + 1 INT.

4. For point 3, if it's somehow not possible to put so many FP cores in one path, considering the GPU can utilize more than one datapath at the same time, then why not make three paths: two with an equal number of FP cores each, and a third with just INT?

You could do that but everything in engineering is a balancing game. You have size and power considerations, register pressure considerations, bandwidth considerations, etc. Basically Nvidia looked at what calculation mix modern games use, what mix they think games will use in the near future, and planned the architecture accordingly. More specifically, Nvidia looked at the calculation mix they think Ampere will be asked to perform in modern games and determined that it would be roughly 2/3 FP and 1/3 INT. So this was their solution to tweak Turing into the most balanced and efficient architecture for that compute mix. Hopefully this makes sense.

Edit: the 2/3 and 1/3 numbers are for illustrative purposes only. The actual mix could be 3/4 and 1/4, or even 9/10 and 1/10. The point is that Nvidia sees the compute needs as being FP heavy, but INT compute is still very much needed, so why have a dedicated INT path when it will just sit there waiting for work the majority of the time?
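To make that concrete, here's a toy steady-state model of per-partition issue rates for a given FP:INT mix (my own sketch; the mixes are illustrative, as noted above):

```python
# Instructions issued per clock, per SM partition, for a workload that is
# a fraction f FP32 and (1 - f) INT32 (steady state, ignoring stalls).

def turing_ipc(f):
    # Dedicated FP pipe + dedicated INT pipe: the single pipe serving the
    # more common instruction type becomes the bottleneck.
    return 1 / max(f, 1 - f)

def ampere_ipc(f):
    # FP pipe + shared FP/INT pipe: FP-heavy mixes can keep both pipes full.
    return 2.0 if f >= 0.5 else 1 / (1 - f)

for f in (0.50, 2 / 3, 0.90):
    print(f"FP share {f:.2f}: Turing {turing_ipc(f):.2f}, Ampere {ampere_ipc(f):.2f}")
# FP share 0.50: Turing 2.00, Ampere 2.00
# FP share 0.67: Turing 1.50, Ampere 2.00
# FP share 0.90: Turing 1.11, Ampere 2.00
```

The more FP-heavy the mix, the longer Turing's dedicated INT pipe sits idle, and the bigger Ampere's win.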
 
Last edited:

Saylick

Diamond Member
Sep 10, 2012
3,121
6,280
136
Nvidia has stated this:

"To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock."

You'll have to excuse me if the following questions seem dumb; I don't know nearly enough about this subject.

1. If it can utilize both datapaths at the same time, why can't it do FP+FP+INT?
2. Is it unable to utilize both FP and INT cores located in the same datapath at the same time? If so, why?
3. If 2 is true, and considering it can use both datapaths, why not put all FP cores in one datapath, and all INT cores in the other one?
4. For point 3, if it's somehow not possible to put so many FP cores in one path, and considering the GPU can utilize more than one datapath at the same time, then why not make three paths: two with an equal number of FP cores each, and a third with just INT?
I think others can better answer your questions directly, or even this link, but I wanted to at least drop some diagrams to help illustrate what's going on behind the scenes.
[SM diagram]


Here's a cool comparison over the years:
[diagram]
 

DDH

Member
May 30, 2015
168
168
111
One datapath is capable of FP. The other datapath is capable of FP or INT, so you can't do FP+FP+INT all at the same time.



Correct, as answered for point 1. The FP and INT compute cores aren't all that's necessary to do calculations. There are registers, dispatch units, etc. that are required to actually get the correct data to and from the FP and INT units. The second datapath that can do INT or FP will share some of that circuitry to save on space and power.



That is what Turing did. Turing could do 1 INT or 1 FP. This is the next evolution, where there is an additional datapath, but only for FP calculations, which allows for either 1 FP + 1 FP or alternatively 1 FP + 1 INT.



You could do that but everything in engineering is a balancing game. You have size and power considerations, register pressure considerations, bandwidth considerations, etc. Basically Nvidia looked at what calculation mix modern games use, what mix they think games will use in the near future, and planned the architecture accordingly. More specifically, Nvidia looked at the calculation mix they think Ampere will be asked to perform in modern games and determined that it would be roughly 2/3 FP and 1/3 INT. So this was their solution to tweak Turing into the most balanced and efficient architecture for that compute mix. Hopefully this makes sense.
Are you sure Turing wasn't capable of 1 FP + 1 INT?
 

uzzi38

Platinum Member
Oct 16, 2019
2,607
5,821
146
Turing could do both simultaneously if the software was written for it.

You know, that's what Async Compute is. It's leveraging both pipelines at the same time.

EDIT: Okay, so that was a mistake. Fair enough.
 

Konan

Senior member
Jul 28, 2017
360
291
106
No offense meant when I say this...
but everyone knows Doom is really, really well coded compared to other traditional AAA titles.
Hence it really does not show off much.

I'd really like to see it with something that we know breaks hardware, like Minecraft with ray tracing enabled (not joking); Final Fantasy 15 would also do.

None taken, it is just one game and I fully agree :)
 

Konan

Senior member
Jul 28, 2017
360
291
106
The Reddit Nvidia Q&A had this interesting statement from an Nvidia rep.

Ray tracing denoising shaders are good examples that might benefit greatly from doubling FP32 throughput.

So much for needing the Tensor cores for this.

DLSS 2.0 requires Tensor cores, though.
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
Unless NVIDIA provides the number of chips actually shipped and sold, everybody will spin this just as they want, and none of them will be either right or wrong. It will be Schrödinger's supply and demand. Exactly how it has always been, no matter which brand or vendor we're talking about.
 

eddman

Senior member
Dec 28, 2010
239
87
101
One datapath is capable of FP. The other datapath is capable of FP or INT, so you can't do FP+FP+INT all at the same time.
So this is why 30 series cards' gaming performance is so much lower than the max theoretical compute performance.

This is the next evolution, where there is an additional datapath, but only for FP calculations, which allows for either 1 FP + 1 FP or alternatively 1 FP + 1 INT.
So they did it to specifically enable FP+FP. I suppose this is probably aimed at scientific calculations, etc? Are there any desktop use cases where this could come into play?

You could do that but everything in engineering is a balancing game. You have size and power considerations, register pressure considerations, bandwidth considerations, etc.
Is it possible to guess how much bigger the die would've been if they implemented three paths?

I'm trying to imagine how much higher the gaming performance would be with three paths. IINM games tend to be FP heavy; the jump in performance would probably still be substantial, no?
 
Last edited:

Hitman928

Diamond Member
Apr 15, 2012
5,232
7,773
136
So this is why 30 series cards' gaming performance is so much lower than the max theoretical compute performance.

A big reason, yes. The quoted FLOPS figure assumes you are doing FP+FP the entire time, which is not realistic for gaming. I haven't studied all of Nvidia's marketing slides, but usually there are additional FLOPS quotes where they count tensor FLOPS and everything else together as well, which is also unrealistic.
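For example (3080 core count and boost clock from the spec sheet; the FP share on the shared datapath is purely illustrative):

```python
cores, boost_ghz = 8704, 1.71   # RTX 3080 spec-sheet figures

# Marketing peak assumes every core does an FP32 FMA (2 ops) every clock.
peak_tflops = cores * 2 * boost_ghz / 1000
print(f"Peak:      {peak_tflops:.1f} TFLOPS")   # ~29.8

# If the shared datapath spends, say, 1/3 of its time on INT32 (made-up mix),
# half the "cores" are always FP32 and the other half are FP32 only 2/3 of
# the time, so the achievable FP32 rate drops accordingly.
fp_share_shared = 2 / 3
effective_tflops = (cores / 2) * (1 + fp_share_shared) * 2 * boost_ghz / 1000
print(f"Effective: {effective_tflops:.1f} TFLOPS")   # ~24.8
```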

So they did it to specifically enable FP+FP. I suppose this is probably aimed at scientific calculations, etc? Are there any desktop use cases where this could come into play?

Pure compute should benefit obviously, but I'm sure gaming will benefit as well because, as I said, games tend to be FP heavy, so you'll get a decent speed boost out of it. Nowhere near 2x, mind you, but it should be a noticeable boost.

Is it possible to guess how much bigger the die would've been if they implemented three paths?

Not really, not without knowing the size of each CU and each datapath inside the CU, which I'm pretty sure no one outside of Nvidia can tell you, but if someone can, I'd be happy to hear it.

I'm trying to imagine how much higher the gaming performance would be with three paths. IINM games tend to be FP heavy; the jump in performance would probably still be substantial, no?

Depends on a lot of things. If we wave away power and size requirements, then most likely you could get a very noticeable speed bump. However, we do have size and power requirements, so the real question is what you have to sacrifice in order to put that third path in. I'm guessing at 8 nm, you would do more harm than good putting in a third path.
 

dr1337

Senior member
May 25, 2020
329
547
106
So this is why 30 series cards' gaming performance is so much lower than the max theoretical compute performance.

Are there any desktop use cases where this could come into play?
Well, with the benchmarks we have now, it would seem that the 3070 with "5888 CUDA cores" is indeed faster than the 4352 cores of the current 2080 Ti. Though until 3rd party benchmarks come out, it's not going to be exactly clear how much the new design scales in games and typical desktop applications. If we take the Ampere reveal at face value, then objectively it would seem they have truly doubled compute performance.

TFLOPS have never been a direct comparison for gaming performance between architectures. If you can really squeeze the performance out of Ampere like Nvidia claims, then this generation has really good potential for rendering and mining.
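The paper math shows exactly that (spec-sheet boost clocks; real clocks vary by card and load):

```python
# Peak FP32 TFLOPS = cores * 2 (FMA) * boost clock in GHz / 1000.
cards = {
    "RTX 2080 Ti": (4352, 1.545),
    "RTX 3070":    (5888, 1.725),
}
for name, (cores, ghz) in cards.items():
    print(f"{name}: {cores * 2 * ghz / 1000:.1f} TFLOPS")
# RTX 2080 Ti: 13.4 TFLOPS
# RTX 3070:    20.3 TFLOPS
```

If the 3070 only matches or slightly beats the 2080 Ti in games, that's ~50% more paper TFLOPS for similar gaming performance.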
 

Tup3x

Senior member
Dec 31, 2016
955
938
136
If it's actually a $200 premium with no other Ti-style increase, that's a no for me. Better to just get the 3080 now, then sell it a year to 18 months down the road and pick up something more powerful with more memory. I doubt 10GB is going to be a bottleneck, even at 4K, anytime soon.
I game at 1440p so I'm pretty sure it will be just fine for quite some time. At this point I just want to upgrade - an RTX 3080 would be an absolutely massive upgrade, and it would still offer 2 GB more VRAM. Heck, I don't care what comes out after.
 

maddie

Diamond Member
Jul 18, 2010
4,738
4,667
136
One datapath is capable of FP. The other datapath is capable of FP or INT, so you can't do FP+FP+INT all at the same time.



Correct, as answered for point 1. The FP and INT compute cores aren't all that's necessary to do calculations. There are registers, dispatch units, etc. that are required to actually get the correct data to and from the FP and INT units. The second datapath that can do INT or FP will share some of that circuitry to save on space and power.



That is what Turing did. Turing could do 1 INT + 1 FP. This is the next evolution, where there is an additional datapath, but only for FP calculations, which allows for either 1 FP + 1 FP or alternatively 1 FP + 1 INT.



You could do that but everything in engineering is a balancing game. You have size and power considerations, register pressure considerations, bandwidth considerations, etc. Basically Nvidia looked at what calculation mix modern games use, what mix they think games will use in the near future, and planned the architecture accordingly. More specifically, Nvidia looked at the calculation mix they think Ampere will be asked to perform in modern games and determined that it would be roughly 2/3 FP and 1/3 INT. So this was their solution to tweak Turing into the most balanced and efficient architecture for that compute mix. Hopefully this makes sense.

Edit: the 2/3 and 1/3 numbers are for illustrative purposes only. The actual mix could be 3/4 and 1/4, or even 9/10 and 1/10. The point is that Nvidia sees the compute needs as being FP heavy, but INT compute is still very much needed, so why have a dedicated INT path when it will just sit there waiting for work the majority of the time?
Yep. It seems that Nvidia is allowing for some flexibility in the INT/FP instruction ratio, as the fixed ratio they used for the 2xxx series could only rarely be optimal for any one game. This way they get closer to the best possible mix.