Question 'Ampere'/Next-gen gaming uarch speculation thread


Ottonomous

Senior member
May 15, 2014
559
292
136
How much gain is the Samsung 7nm EUV process expected to provide?
How will the RTX components be scaled/developed?
Any major architectural enhancements expected?
Will VRAM be bumped to 16/12/12 for the top three?
Will there be further fragmentation in the lineup? (Keeping Turing at cheaper prices, while offering 'beefed up RTX' options at the top?)
Will the top card be capable of >60 fps at 4K, or at least 90?
Would Nvidia ever consider an HBM implementation in the gaming lineup?
Will Nvidia introduce new proprietary technologies again?

Sorry if imprudent/uncalled for, just interested in the forum members' thoughts.
 

sze5003

Lifer
Aug 18, 2012
14,181
625
126
Way out of my price range but I agree.
What I don’t understand is why it's so difficult to get more memory on cards lately.
Seems like an 8GB card is rare and 10GB seems like a weird number.
Why is it so difficult to have 8GB be the gaming normal and 16GB be the halo product?
I understand there may not be a use case for that memory but the market appears to want it.
Well, one of the reasons for leaving out such things and creating a big gap between the cards is so they can charge the prices they want. Compared to a 1080 Ti it's only 1 GB less, though, so is that really a big deal?

I guess that depends on the person and how they game or want to use the card.
 

Martimus

Diamond Member
Apr 24, 2007
4,488
152
106
Way out of my price range but I agree.
What I don’t understand is why it's so difficult to get more memory on cards lately.
Seems like an 8GB card is rare and 10GB seems like a weird number.
Why is it so difficult to have 8GB be the gaming normal and 16GB be the halo product?
I understand there may not be a use case for that memory but the market appears to want it.
It has to do with the GDDR6X memory they are using. It currently has only half the per-chip capacity of GDDR6. This will likely be rectified as time goes on, but it made it difficult for Nvidia to put large memory sizes on their new GPUs. Also, Nvidia usually differentiates their professional products from their gaming products by memory capacity, so that is likely another reason.
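To put numbers on it, here's a rough sketch of how bus width and per-chip density constrain the configurations (chip densities are the ones known to be shipping around launch; treat this as illustration, not an official breakdown):

```python
# Each GDDR6/GDDR6X chip sits on a 32-bit channel, so bus width fixes the
# chip count and per-chip density fixes the total VRAM.

def total_vram_gb(bus_width_bits, gb_per_chip, chips_per_channel=1):
    channels = bus_width_bits // 32
    return channels * chips_per_channel * gb_per_chip

# RTX 3080: 320-bit bus, 1 GB (8 Gb) GDDR6X chips -> 10 GB
print(total_vram_gb(320, 1))      # 10
# Same bus with 2 GB GDDR6X chips (not available yet) -> 20 GB
print(total_vram_gb(320, 2))      # 20
# RTX 3090: 384-bit bus, 1 GB chips in clamshell (2 per channel) -> 24 GB
print(total_vram_gb(384, 1, 2))   # 24
```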
 

CakeMonster

Golden Member
Nov 22, 2012
1,389
496
136
Everything points to there being room for another card between the 10GB 3080 and the 24GB 3090. Maybe the performance gap is a bit too narrow (remains to be seen), but that has not stopped them in the past. So yeah, count on it.

HOWEVER, be careful about basing your purchasing decision on getting something that is not confirmed to exist and does not have a launch date. For every post I read of someone saying they need the upgrade but have decided to wait for the "3080Ti", I worry about them not having considered the whole picture from a price/value perspective, considering how long they may have to wait.

Take the 2080Ti: not a good value price/performance-wise, BUT you had the top card for 2 years, which sweetened the deal a whole lot. The problem was you couldn't predict that beforehand, as there was always a yearly upgrade before 2018. If there is a 2 year gap now, waiting for a 3080Ti that launches in 6 months is a good deal. If there is a 1 year gap, it would not be a very good deal. If a 3080Ti launches in 12 months, with a 2 year gap for the next top card, eh, I guess it's still decent, but that's 12 months of lost performance if the slowness of your current card is already bothering you.

I'm NOT saying "just buy it", I don't want to get accused of that. But be aware of the value implications of staying with your old hardware while betting on something we don't even know will exist, without even knowing the time frame of its value.
 

Konan

Senior member
Jul 28, 2017
360
291
106
Rumor I'm hearing is that a 20GB 3080 model is still real, and we may know more in October. Also, G6X is quite a bit more expensive (~$14-20/GB), so if the 20GB model is coming, I'd expect it to be $100-$200 more than the 10GB 3080.
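Quick back-of-envelope on those numbers (the $14-20/GB is the rumored range above, not a confirmed price):

```python
# Added memory cost for a hypothetical 20GB 3080 at the rumored $/GB range.
extra_gb = 20 - 10
for dollars_per_gb in (14, 20):
    print(f"${dollars_per_gb}/GB -> ~${extra_gb * dollars_per_gb} extra memory cost")
# $14/GB -> ~$140 extra memory cost
# $20/GB -> ~$200 extra memory cost
```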
 

ozzy702

Golden Member
Nov 1, 2011
1,151
530
136
Rumor I'm hearing is that a 20GB 3080 model is still real, and we may know more in October. Also, G6X is quite a bit more expensive (~$14-20/GB), so if the 20GB model is coming, I'd expect it to be $100-$200 more than the 10GB 3080.


If it's actually a $200 premium with no other Ti-style increase, that's a no for me. Better to just get the 3080 now, then sell it a year to 18 months down the road and pick up something more powerful with more memory. I doubt 10GB is going to be a bottleneck, even at 4K, anytime soon.
 
  • Like
Reactions: tviceman

MrTeal

Diamond Member
Dec 7, 2003
3,568
1,696
136
If it's actually a $200 premium with no other Ti-style increase, that's a no for me. Better to just get the 3080 now, then sell it a year to 18 months down the road and pick up something more powerful with more memory. I doubt 10GB is going to be a bottleneck, even at 4K, anytime soon.
Yeah, that wouldn't be worth the cost. You'd be better off just pocketing the $200 and putting it towards a 4080 or equivalent in a couple years when 10GB might start becoming restrictive.
 

MrTeal

Diamond Member
Dec 7, 2003
3,568
1,696
136

From the sounds of it, getting a block for an FE 3080 might be an issue. Apparently EK is working on one, release date TBD.
Might need to look at a partner board with the reference design instead of an FE from Nvidia.
 

eddman

Senior member
Dec 28, 2010
239
87
101
OK, I have created an Ampere SM approximation to show the difference in the new CUDA core partitions.

Inside the red square is the new CUDA core partition, which can execute 16x FP32 or 16x INT32 per cycle.
With the addition of the second 16x FP32 datapath, each partition can now execute 16x FP32 + 16x FP32 per cycle, or 16x FP32 + 16x INT32.
So now they can do 128x FP32 versus 64x FP32 per SM, and that's the doubled throughput they get in Ampere.
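A toy calculation of where that doubling comes from (assuming 4 partitions per SM and 16-wide datapaths, as in the diagrams below):

```python
# Per-SM FP32 throughput, Turing vs. Ampere (toy model, not official numbers).
PARTITIONS = 4   # SM partitions (processing blocks)
LANES = 16       # FP32 lanes per datapath

fp32_paths = {"Turing": 1, "Ampere": 2}  # FP32-capable datapaths per partition
for arch, paths in fp32_paths.items():
    print(f"{arch}: {PARTITIONS * LANES * paths} FP32 ops/clock per SM")
# Turing: 64 FP32 ops/clock per SM
# Ampere: 128 FP32 ops/clock per SM
# Note: Ampere only hits 128 when the shared datapath has no INT32 work.
```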

Ampere GA102 SM Approximation

[diagram]

Hardwareluxx also made a similar diagram:

[diagram]


Nvidia has stated this:

"To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock."

You'll have to excuse me if the following questions seem dumb; I don't know nearly enough about this subject at this level.

1. If it can utilize both datapaths at the same time, why can't it do FP+FP+INT?
2. Is it unable to utilize both FP and INT cores located in the same datapath at the same time? If so, why?
3. If 2 is true, and considering it can use both datapaths, why not put all FP cores in one datapath, and all INT cores in the other one?
4. For point 3, if it's somehow not possible to put so many FP cores in one path, and considering the GPU can utilize more than one datapath at the same time, then why not make three paths: two with an equal number of FP cores each, and a third with just INT?
 
Last edited:
  • Like
Reactions: Kirito

aigomorla

CPU, Cases & Cooling Mod | PC Gaming Mod | Elite Member
Super Moderator
Sep 28, 2005
20,841
3,189
126
Seems to me in this one game it's 45-50% better for the 3080 over the 2080 Ti

No offense meant when I say this...
but everyone knows Doom is really, really well coded compared to other traditional AAA titles.
Hence it really does not show off much.

I'd really like to see it with something that we know breaks hardware, like Minecraft with ray tracing enabled (not joking); Final Fantasy 15 would also do.
 

Hitman928

Diamond Member
Apr 15, 2012
5,232
7,773
136
1. If it can utilize both datapaths at the same time, why can't it do FP+FP+INT?

One datapath is capable of FP. The other datapath is capable of FP or INT, so you can't do FP+FP+INT all at the same time.

2. Is it unable to utilize both FP and INT cores located in the same data path at the same time? If so, why?

Correct, as answered for point 1. The FP and INT compute cores aren't all that's necessary to do calculations. There are registers, dispatch units, etc. that are required to actually get the correct data to and from the FP and INT units. The second datapath that can do INT or FP will share some of that circuitry to save on space and power.

3. If 2 is true, and considering it can use both datapaths, why not put all FP cores in one datapath, and all INT cores in the other one?

That is what Turing did. Turing could do 1 INT + 1 FP. This is the next evolution, where there is an additional datapath, but only for FP calculations, which allows for either 1 FP + 1 FP or alternatively 1 FP + 1 INT.

4. For point 3, if it's somehow not possible to put so many FP cores in one path, considering the GPU can utilize more than one datapath at the same time, then why not make three paths: two with an equal number of FP cores each, and a third with just INT?

You could do that but everything in engineering is a balancing game. You have size and power considerations, register pressure considerations, bandwidth considerations, etc. Basically Nvidia looked at what calculation mix modern games use, what mix they think games will use in the near future, and planned the architecture accordingly. More specifically, Nvidia looked at the calculation mix they think Ampere will be asked to perform in modern games and determined that it would be roughly 2/3 FP and 1/3 INT. So this was their solution to tweak Turing into the most balanced and efficient architecture for that compute mix. Hopefully this makes sense.

Edit: the 2/3 and 1/3 numbers are for illustrative purposes only. The actual mix could be 3/4 and 1/4, or even 9/10 and 1/10. The point is that Nvidia sees the compute needs as being FP heavy, but INT compute is still very much needed, so why have a dedicated INT path when it will just sit there waiting for work the majority of the time?
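To make that concrete, here's a toy steady-state model of per-partition issue rates for a given FP:INT mix (my own sketch; the mixes are illustrative, as noted above):

```python
# Instructions issued per clock, per SM partition, for a workload that is
# a fraction f FP32 and (1 - f) INT32 (steady state, ignoring stalls).

def turing_ipc(f):
    # Dedicated FP pipe + dedicated INT pipe: the single pipe serving the
    # more common instruction type becomes the bottleneck.
    return 1 / max(f, 1 - f)

def ampere_ipc(f):
    # FP pipe + shared FP/INT pipe: FP-heavy mixes can keep both pipes full.
    return 2.0 if f >= 0.5 else 1 / (1 - f)

for f in (0.50, 2 / 3, 0.90):
    print(f"FP share {f:.2f}: Turing {turing_ipc(f):.2f}, Ampere {ampere_ipc(f):.2f}")
# FP share 0.50: Turing 2.00, Ampere 2.00
# FP share 0.67: Turing 1.50, Ampere 2.00
# FP share 0.90: Turing 1.11, Ampere 2.00
```

The more FP-heavy the mix, the longer Turing's dedicated INT pipe sits idle, and the bigger Ampere's win.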
 
Last edited:

Saylick

Diamond Member
Sep 10, 2012
3,121
6,280
136
Nvidia has stated this:

"To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock."

You'll have to excuse me if the following questions seem dumb; I don't know nearly enough about this subject.

1. If it can utilize both datapaths at the same time, why can't it do FP+FP+INT?
2. Is it unable to utilize both FP and INT cores located in the same datapath at the same time? If so, why?
3. If 2 is true, and considering it can use both datapaths, why not put all FP cores in one datapath, and all INT cores in the other one?
4. For point 3, if it's somehow not possible to put so many FP cores in one path, and considering the GPU can utilize more than one datapath at the same time, then why not make three paths: two with an equal number of FP cores each, and a third with just INT?
I think others can better answer your questions directly, or even this link, but I wanted to at least drop some diagrams to help illustrate what's going on behind the scenes.
[SM diagram]


Here's a cool comparison over the years:
[diagram]
 

DDH

Member
May 30, 2015
168
168
111
One datapath is capable of FP. The other datapath is capable of FP or INT, so you can't do FP+FP+INT all at the same time.



Correct, as answered for point 1. The FP and INT compute cores aren't all that's necessary to do calculations. There are registers, dispatch units, etc. that are required to actually get the correct data to and from the FP and INT units. The second datapath that can do INT or FP will share some of that circuitry to save on space and power.



That is what Turing did. Turing could do 1 INT or 1 FP. This is the next evolution, where there is an additional datapath, but only for FP calculations, which allows for either 1 FP + 1 FP or alternatively 1 FP + 1 INT.



You could do that but everything in engineering is a balancing game. You have size and power considerations, register pressure considerations, bandwidth considerations, etc. Basically Nvidia looked at what calculation mix modern games use, what mix they think games will use in the near future, and planned the architecture accordingly. More specifically, Nvidia looked at the calculation mix they think Ampere will be asked to perform in modern games and determined that it would be roughly 2/3 FP and 1/3 INT. So this was their solution to tweak Turing into the most balanced and efficient architecture for that compute mix. Hopefully this makes sense.
Are you sure Turing wasn't capable of 1 FP + 1 INT?
 

uzzi38

Platinum Member
Oct 16, 2019
2,607
5,821
146
Turing could do both simultaneously if the software was written for it.

You know, that's what Async Compute is. It's leveraging both pipelines at the same time.

EDIT: Okay, so that was a mistake. Fair enough.
 

Konan

Senior member
Jul 28, 2017
360
291
106
No offense meant when I say this...
but everyone knows Doom is really, really well coded compared to other traditional AAA titles.
Hence it really does not show off much.

I'd really like to see it with something that we know breaks hardware, like Minecraft with ray tracing enabled (not joking); Final Fantasy 15 would also do.

None taken, it is just one game and I fully agree :)
 

Konan

Senior member
Jul 28, 2017
360
291
106
The Reddit Nvidia Q&A had this interesting statement from an Nvidia rep.

Ray tracing denoising shaders are good examples that might benefit greatly from doubling FP32 throughput.

So much for needing the Tensor cores for this.

DLSS 2.0 requires Tensor cores, though.
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
Unless NVIDIA provides the number of chips actually shipped and sold, everybody will spin this just as they want, and none of them will be either right or wrong. It will be Schrödinger's supply and demand. Exactly how it has always been, no matter which brand or vendor we're talking about.
 

eddman

Senior member
Dec 28, 2010
239
87
101
One datapath is capable of FP. The other datapath is capable of FP or INT, so you can't do FP+FP+INT all at the same time.
So this is why 30 series cards' gaming performance is so much lower than the max theoretical compute performance.

This is the next evolution, where there is an additional datapath, but only for FP calculations, which allows for either 1 FP + 1 FP or alternatively 1 FP + 1 INT.
So they did it to specifically enable FP+FP. I suppose this is probably aimed at scientific calculations, etc? Are there any desktop use cases where this could come into play?

You could do that but everything in engineering is a balancing game. You have size and power considerations, register pressure considerations, bandwidth considerations, etc.
Is it possible to guess how much bigger the die would've been if they implemented three paths?

I'm trying to imagine how much higher the gaming performance would be with three paths. IINM games tend to be FP heavy; the jump in performance would probably still be substantial, no?
 
Last edited:

Hitman928

Diamond Member
Apr 15, 2012
5,232
7,773
136
So this is why 30 series cards' gaming performance is so much lower than the max theoretical compute performance.

A big reason, yes. The quoted FLOPS figure assumes you are doing FP+FP the entire time, which is not realistic for gaming. I haven't studied all of Nvidia's marketing slides, but usually there are additional FLOPS quotes where they count tensor FLOPS and everything else together as well, which is also unrealistic.
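For example (3080 core count and boost clock from the spec sheet; the FP share on the shared datapath is purely illustrative):

```python
cores, boost_ghz = 8704, 1.71   # RTX 3080 spec-sheet figures

# Marketing peak assumes every core does an FP32 FMA (2 ops) every clock.
peak_tflops = cores * 2 * boost_ghz / 1000
print(f"Peak:      {peak_tflops:.1f} TFLOPS")   # ~29.8

# If the shared datapath spends, say, 1/3 of its time on INT32 (made-up mix),
# half the "cores" are always FP32 and the other half are FP32 only 2/3 of
# the time, so the achievable FP32 rate drops accordingly.
fp_share_shared = 2 / 3
effective_tflops = (cores / 2) * (1 + fp_share_shared) * 2 * boost_ghz / 1000
print(f"Effective: {effective_tflops:.1f} TFLOPS")   # ~24.8
```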

So they did it to specifically enable FP+FP. I suppose this is probably aimed at scientific calculations, etc? Are there any desktop use cases where this could come into play?

Pure compute should benefit obviously, but I'm sure gaming will benefit as well because, as I said, games tend to be FP heavy, so you'll get a decent speed boost out of it. Nowhere near 2x, mind you, but it should be a noticeable boost.

Is it possible to guess how much bigger the die would've been if they implemented three paths?

Not really, not without knowing the size of each CU and each datapath inside the CU, which I'm pretty sure no one outside of Nvidia can tell you, but if someone can, I'd be happy to hear it.

I'm trying to imagine how much higher the gaming performance would be with three paths. IINM games tend to be FP heavy; the jump in performance would probably still be substantial, no?

Depends on a lot of things. If we wave away power and size requirements, then most likely you could get a very noticeable speed bump. However, we do have size and power requirements, so the real question is what you have to sacrifice in order to put that third path in. I'm guessing at 8 nm, you would do more harm than good putting in a third path.
 

dr1337

Senior member
May 25, 2020
329
547
106
So this is why 30 series cards' gaming performance is so much lower than the max theoretical compute performance.

Are there any desktop use cases where this could come into play?
Well, with the benchmarks we have now, it would seem that the 3070 with "5888 CUDA cores" is indeed faster than the 4352 cores of the current 2080 Ti. Though until 3rd party benchmarks come out, it's not going to be exactly clear how much the new design scales in games and typical desktop applications. If we take the Ampere reveal at face value, then objectively it would seem they have truly doubled compute performance.

TFLOPS have never been a direct comparison for gaming performance between architectures. If you can really squeeze the performance out of Ampere like Nvidia claims, then this generation has really good potential for rendering and mining.
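The paper math shows exactly that (spec-sheet boost clocks; real clocks vary by card and load):

```python
# Peak FP32 TFLOPS = cores * 2 (FMA) * boost clock in GHz / 1000.
cards = {
    "RTX 2080 Ti": (4352, 1.545),
    "RTX 3070":    (5888, 1.725),
}
for name, (cores, ghz) in cards.items():
    print(f"{name}: {cores * 2 * ghz / 1000:.1f} TFLOPS")
# RTX 2080 Ti: 13.4 TFLOPS
# RTX 3070:    20.3 TFLOPS
```

If the 3070 only matches or slightly beats the 2080 Ti in games, that's ~50% more paper TFLOPS for similar gaming performance.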
 

Tup3x

Senior member
Dec 31, 2016
955
938
136
If it's actually a $200 premium with no other Ti-style increase, that's a no for me. Better to just get the 3080 now, then sell it a year to 18 months down the road and pick up something more powerful with more memory. I doubt 10GB is going to be a bottleneck, even at 4K, anytime soon.
I game at 1440p so I'm pretty sure it will be just fine for quite some time. At this point I just want to upgrade - an RTX 3080 would be an absolutely massive upgrade, and it would still offer 2 GB more VRAM. Heck, I don't care what comes out after.
 

maddie

Diamond Member
Jul 18, 2010
4,738
4,667
136
One datapath is capable of FP. The other datapath is capable of FP or INT, so you can't do FP+FP+INT all at the same time.



Correct, as answered for point 1. The FP and INT compute cores aren't all that's necessary to do calculations. There are registers, dispatch units, etc. that are required to actually get the correct data to and from the FP and INT units. The second datapath that can do INT or FP will share some of that circuitry to save on space and power.



That is what Turing did. Turing could do 1 INT + 1 FP. This is the next evolution, where there is an additional datapath, but only for FP calculations, which allows for either 1 FP + 1 FP or alternatively 1 FP + 1 INT.



You could do that but everything in engineering is a balancing game. You have size and power considerations, register pressure considerations, bandwidth considerations, etc. Basically Nvidia looked at what calculation mix modern games use, what mix they think games will use in the near future, and planned the architecture accordingly. More specifically, Nvidia looked at the calculation mix they think Ampere will be asked to perform in modern games and determined that it would be roughly 2/3 FP and 1/3 INT. So this was their solution to tweak Turing into the most balanced and efficient architecture for that compute mix. Hopefully this makes sense.

Edit: the 2/3 and 1/3 numbers are for illustrative purposes only. The actual mix could be 3/4 and 1/4, or even 9/10 and 1/10. The point is that Nvidia sees the compute needs as being FP heavy, but INT compute is still very much needed, so why have a dedicated INT path when it will just sit there waiting for work the majority of the time?
Yep. It seems that Nvidia is allowing for some flexibility in the INT/FP instruction ratio, as the fixed ratio they used for the 2xxx series could only rarely be optimal for any one game. This way they get closer to the best possible mix.