Question 'Ampere'/Next-gen gaming uarch speculation thread


Ottonomous

Senior member
May 15, 2014
559
292
136
How much is the Samsung 7nm EUV process expected to provide in terms of gains?
How will the RTX components be scaled/developed?
Any major architectural enhancements expected?
Will VRAM be bumped to 16/12/12 for the top three?
Will there be further fragmentation in the lineup? (Keeping Turing at cheaper prices, while offering 'beefed up RTX' options at the top?)
Will the top card be capable of >4K60, at least 90?
Would Nvidia ever consider an HBM implementation in the gaming lineup?
Will Nvidia introduce new proprietary technologies again?

Sorry if imprudent/uncalled for, just interested in forum members' thoughts.
 

Konan

Senior member
Jul 28, 2017
360
291
106
If AMD's RDNA2 performance is at the RTX 3070 level, then wouldn't it technically compete with Nvidia's previous high end 2080 Ti? I do wish Nvidia had been able to get the Ampere chips manufactured at 7 nm, which was one of the reasons I didn't bother with Turing. I also wonder just how big the next generation video cards are going to be. It would be funny if the missing coprocessor is a CPU socket because the video card is so big, it becomes its own system.

NV is saying that the 3070 bests a 2080 Ti. We don't know by how much yet, but it was a point that they stressed. Can't wait to see the reviews. I suspect maybe 7% to 17% as an overall average across many games.
There are so many people dumping 2080 Tis onto the used market right now for less than $500 because of the launch announcement.

All of the card dimensions are listed on Nvidia’s website.
 

AtenRa

Lifer
Feb 2, 2009
14,000
3,357
136
No way would TSMC give AMD as good a deal as Nvidia for 8nm.

I don't believe anyone suggested something like that.
What was said was that, as the number one 7nm customer, AMD can get better prices than it had last year on TSMC 7nm.

Also to note,
In February 2019 they released the Radeon VII, a 331mm2 die on what was then a new and extremely expensive 7nm process, with 16GB of HBM2 memory at $699, the same price as the current RTX 3080 10GB GDDR6X.
Fast forward a year and a half, with higher TSMC 7nm capacity, lower wafer prices and an even better wafer deal for AMD than other 7nm customers get, I'm sure everyone can agree that they can have an RTX 3080 competitor at the same $699 MSRP and make a nice profit.
Will they make the same profit as NV does on the RTX 3080? Probably not, but they will surely make a lot more than they did last year with the Radeon VII.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
So this is why 30 series cards' gaming performance is so much lower than the max theoretical compute performance.

It's not that "IPC" went down as MooresLawIsDead is claiming, or efficiency is down.

It's that not everything is about the compute units. What do you think the memory controller is doing? What about texture throughput? The compute doubles but the ROP/TMU stays the same. Memory bandwidth increases but not as much as compute.

People think Compute units/Flops = performance because generally, for balance, when you increase compute units by 50%, you try to bump everything else (ROP/TMU/Memory bandwidth) by 50%.

Ampere changes that balance.
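
To put rough numbers on that, here is a back-of-the-envelope, Amdahl-style sketch. It assumes a frame splits cleanly into an FP32-bound part and everything else (ROP/TMU/memory), and that only the FP32-bound part gets faster. The function name and the fractions are made up for illustration; none of this is measured Ampere data.

```python
def frame_speedup(fp32_bound_fraction: float, fp32_scale: float = 2.0) -> float:
    """Estimate whole-frame speedup when only the FP32-bound share gets faster.

    fp32_bound_fraction: share of frame time limited by FP32 throughput (0..1).
    fp32_scale: how much FP32 throughput grows (2.0 = doubled).
    """
    if not 0.0 <= fp32_bound_fraction <= 1.0:
        raise ValueError("fp32_bound_fraction must be between 0 and 1")
    if fp32_scale <= 0.0:
        raise ValueError("fp32_scale must be positive")
    # Time for the part that does not scale (ROP/TMU/memory) stays as it was.
    unchanged = 1.0 - fp32_bound_fraction
    return 1.0 / (unchanged + fp32_bound_fraction / fp32_scale)


# Hypothetical workload mixes, purely for illustration.
for fraction in (0.25, 0.5, 0.75, 1.0):
    print(f"{fraction:.0%} FP32-bound -> {frame_speedup(fraction):.2f}x")
```

Under these assumptions a frame that is only half FP32-bound gains about 1.33x from doubled FP32, not 2x.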
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Ray tracing denoising shaders are good examples that might benefit greatly from doubling FP32 throughput.

So much for needing the Tensor cores for this.

Sounds like something is missing and out of context. Because if true, an ideal RTX 3080 Ti would have been increasing rasterization performance further by taking out the tensor units and keeping the RT cores.

Same could be said for RTX 2000's.
 
  • Like
Reactions: uzzi38

Veradun

Senior member
Jul 29, 2016
564
780
136
No offense meant when I say this...
but everyone knows Doom is really, really well coded compared to other traditional AAA games.
Hence it really does not show off much.

I'd really like to see it with something that we know breaks hardware, like Minecraft with ray tracing enabled (not joking), or Final Fantasy 15 will also do.
It's more about being shader heavy than well coded, but it definitely benefits from the doubling of FP32 ALUs.
 

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
Will they make the same profit as NV does on the RTX 3080? Probably not, but they will surely make a lot more than they did last year with the Radeon VII.

But the comparison fails because VII was basically reject dies that couldn't be used for the workstation parts. So the "chip" was essentially free, as the option was to trash it or put it on a consumer card. So the $699 price tag only needed to cover the board + HBM2 memory, as the chip was available anyway.

With RDNA2 this doesn't apply, as it is mostly a gaming card, hence they must recoup the full cost mostly from gaming sales. That could be a disadvantage versus NV, as their Ampere chips certainly will be used in Quadro and Tesla SKUs.
 

leoneazzurro

Senior member
Jul 26, 2016
906
1,430
136
The compute architecture is CDNA, true, but that is designed for pure compute without display connectivity, like the A100. Pro cards for CAD (the equivalent of Quadro) will almost surely be based on the RDNA architecture, too.
 
  • Like
Reactions: lightmanek

AtenRa

Lifer
Feb 2, 2009
14,000
3,357
136
But the comparison fails because VII was basically reject dies that couldn't be used for the workstation parts. So the "chip" was essentially free, as the option was to trash it or put it on a consumer card. So the $699 price tag only needed to cover the board + HBM2 memory, as the chip was available anyway.

With RDNA2 this doesn't apply, as it is mostly a gaming card, hence they must recoup the full cost mostly from gaming sales. That could be a disadvantage versus NV, as their Ampere chips certainly will be used in Quadro and Tesla SKUs.

This is pure FUD.
Radeon VII uses the same chips as Radeon Instinct MI50.
Radeon Instinct MI50 has exactly the same number of CUs and the same memory configuration as Radeon VII, but MI50 has lower clocks.
There is no way a chip that was good enough to become a Radeon VII card with 1800MHz peak clocks and 300W TDP was rejected as an MI50 with 1725MHz peak clocks and 300W TDP.
 

eddman

Senior member
Dec 28, 2010
239
87
101
It's that not everything is about the compute units. What do you think the memory controller is doing? What about texture throughput? The compute doubles but the ROP/TMU stays the same. Memory bandwidth increases but not as much as compute.

People think Compute units/Flops = performance because generally, for balance, when you increase compute units by 50%, you try to bump everything else (ROP/TMU/Memory bandwidth) by 50%.

I know that, but even with the same number of ROPs, TMUs, etc. had it been able to do FP+FP+INT at the same time, the performance would've probably still been quite higher, even if not all the cores were fully utilized. It was certainly not worth the added chip size, complexity, and hence cost, for a sub-optimal solution that would not have been able to fully utilize the cores.

That is what Turing did. Turing could do 1 INT + 1 FP. This is the next evolution where there is an additional data path but only for FP calculations which allows for either 1FP + 1FP or alternatively 1FP + 1INT.

I've been thinking a bit more about this. They state that a Turing SM could do 64 FP + 64 INT, while an Ampere SM does either 64 FP + 64 INT, or 64 FP + 64 FP.

Now, if there are physically double the FP32 cores in Ampere, why not do it similar to Turing and put all the FP cores in their own paths? Wouldn't that result in a 128 FP + 64 INT setup?

Actually, are there even double the FP cores? Jensen said they made the FP cores double-issue; would that imply that the number of physical cores is actually the same, but they've been upgraded to do twice the work?

In that case, perhaps by "datapath", they mean a logical path, not a physical one?

I'm just speculating here (I don't really know how these things work exactly); perhaps the scheduler/dispatcher is itself capable of double-issue at most, so it can either issue two FP instructions at the same time, or FP + INT.

Maybe they could've upgraded the scheduler/dispatcher to a triple-issue setup, but it would've been too big and complex, and probably unnecessary since all the other GPU units weren't increased in numbers?
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
I know that, but even with the same number of ROPs, TMUs, etc. had it been able to do FP+FP+INT at the same time, the performance would've probably still been quite higher, even if not all the cores were fully utilized. It was certainly not worth the added chip size, complexity, and hence cost, for a sub-optimal solution that would not have been able to fully utilize the cores.

It wouldn't have been. They said for every 100 fp instructions, there are 34 integer instructions. So the gain might have been 5-10%, at the most.

Consider that games vary tremendously in requirements so the second unit being flexible for integer or fp means it can adjust to it as needed. The gains in real world with a separate integer path might have been even less, at 3-4%. Certainly not worth the die/power tradeoff.
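
For anyone who wants to play with the instruction mix, here is a toy issue-rate model. It takes the 100 FP : 34 INT mix above and the per-SM figures quoted earlier in the thread (Turing 64 FP + 64 INT; Ampere 64 FP + 64 FP-or-INT), and compares them with the hypothetical FP+FP+INT layout. The function and its lane parameters are made up for illustration, and the model assumes perfect scheduling with nothing else (ROP/TMU/memory) ever limiting, so it only gives a ceiling for purely issue-bound code.

```python
import math


def cycles_needed(fp_ops, int_ops, fp_only, int_only, flex):
    """Minimum cycles to issue the mix, given lanes per cycle of each kind.

    fp_only lanes take FP32 only, int_only lanes take INT32 only,
    flex lanes take either. All three bounds must hold for a valid schedule.
    """
    def bound(work, lanes):
        if work == 0:
            return 0.0
        return work / lanes if lanes > 0 else math.inf

    return max(
        bound(fp_ops, fp_only + flex),                       # FP work alone
        bound(int_ops, int_only + flex),                     # INT work alone
        bound(fp_ops + int_ops, fp_only + int_only + flex),  # everything
    )


FP_OPS, INT_OPS = 100, 34  # instruction mix quoted above

turing = cycles_needed(FP_OPS, INT_OPS, fp_only=64, int_only=64, flex=0)
ampere = cycles_needed(FP_OPS, INT_OPS, fp_only=64, int_only=0, flex=64)
triple = cycles_needed(FP_OPS, INT_OPS, fp_only=128, int_only=64, flex=0)  # hypothetical

print(f"Ampere vs Turing: {turing / ampere:.2f}x")
print(f"FP+FP+INT vs Ampere: {ampere / triple:.2f}x")
```

With this mix the toy model gives about 1.49x for Ampere over Turing and a 1.34x ceiling for FP+FP+INT over Ampere. That ceiling applies only to code that is purely issue-bound, so whole-game gains would be far smaller once the unchanged ROP/TMU/memory limits are counted.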

It's similar to what Intel does with AVX-512. It doesn't have two additional AVX-512 units. Instead, one of them works by treating 2x 256-bit units as 1x 512-bit. That saves on die area and power.

Intel actually does the same with their GPU architecture. The Integer unit is always shared with FP and not on a separate pipe. In Gen 9 GPU, you had 2x 4-wide FP/Integer unit per EU. In Gen 11, both can still do FP, but only one does Integer. In Xe it goes back to 2x FP/integer again.

Maybe they could've upgraded the scheduler/dispatcher to a triple-issue setup, but it would've been too big and complex, and probably unnecessary since all the other GPU units weren't increased in numbers?

Very likely these are low level detail decisions.

It's less a question of "what's possible" and more "what's possible in a given power/area budget?"
 

xpea

Senior member
Feb 14, 2014
429
135
116
A little information about the new dual-issue FP32 that I got from my usual source: Ampere is apparently a monster in video encoding, 3D rendering and compute in general. The 2x FP32 is not fake, and when the workloads match the new SM organization, the speedup over Turing is close to the theoretical TFLOPS numbers... For example, I saw crazy numbers in the Blender bench...
 

majord

Senior member
Jul 26, 2015
433
523
136
Re FP32 - One thing I can't work out (until the whitepaper, anyway): can the execution units in the shared datapath execute FP and INT concurrently? Wasn't the split datapath arrangement in Turing what allowed this concurrent execution? The info from the Q&A doesn't make this clear either.
 

MrTeal

Diamond Member
Dec 7, 2003
3,554
1,658
136
Makes perfect sense. Knowing that, the 3090 is an incredible bargain for compute guys and I guess it will be in short supply for a long time...
It wouldn’t even have to be strictly traditional compute guys. If you have a job that benefits from GPU acceleration, the ROI starts to look pretty good if you save 5 minutes a day but you’re charging out $150/hr.
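
A quick sketch of that arithmetic, assuming the saved time really does get billed and taking $1,499 (the 3090's launch MSRP) as the card price. The function names are made up for illustration.

```python
import math


def daily_value(minutes_saved: float, hourly_rate: float) -> float:
    """Dollar value of the time saved per working day."""
    return minutes_saved * hourly_rate / 60.0


def payback_days(card_price: float, minutes_saved: float, hourly_rate: float) -> float:
    """Working days until the saved time has paid for the card."""
    return card_price / daily_value(minutes_saved, hourly_rate)


CARD_PRICE = 1499.0  # assumption: RTX 3090 launch MSRP in USD

per_day = daily_value(5, 150)
days = payback_days(CARD_PRICE, 5, 150)
print(f"${per_day:.2f} billed per day, payback in about {math.ceil(days)} working days")
```

That works out to $12.50 a day, so the card pays for itself in roughly 120 working days if every saved minute is billable.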
 

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
If you have a job that benefits from GPU acceleration, the ROI starts to look pretty good if you save 5 minutes a day but you’re charging out $150/hr.

Those 5 minutes are usually spent on more breaks, chatting or browsing the internet, not on actual productivity. But yeah, you can always use that argument to get a nice new shiny box at work.
 
  • Like
Reactions: lightmanek

MrTeal

Diamond Member
Dec 7, 2003
3,554
1,658
136
Those 5 minutes are usually spent on more breaks, chatting or browsing the internet, not on actual productivity. But yeah, you can always use that argument to get a nice new shiny box at work.
The real challenge is convincing the purse strings to open even if there is a compelling case for it. :p
 

traderjay

Senior member
Sep 24, 2015
220
165
116
Is this Samsung foundry's first major contract for a complex chip like this? If so, they must've bent over backward to please a first big customer like Nvidia. Plus, Nvidia has a history of being extremely aggressive in negotiations as well.
 
  • Like
Reactions: KompuKare

maddie

Diamond Member
Jul 18, 2010
4,723
4,628
136
Is this Samsung foundry's first major contract for a complex chip like this? If so, they must've bent over backward to please a first big customer like Nvidia. Plus, Nvidia has a history of being extremely aggressive in negotiations as well.
Just had a terrible thought. Not realistic at all but it just popped up.

What if this results in the reverse of what happened to TSMC when they took on Apple as a client? Apple helped TSMC succeed, while Nvidia helps sink Samsung by being too aggressive. All for me, none for thee.
 

shiznit

Senior member
Nov 16, 2004
422
13
81
Just had a terrible thought. Not realistic at all but it just popped up.

What if this results in the reverse of what happened to TSMC when they took on Apple as a client? Apple helped TSMC succeed, while Nvidia helps sink Samsung by being too aggressive. All for me, none for thee.
It will most likely help Samsung. When other vendors of large ASICs see the volume and margins Nvidia will be pushing with Ampere, they'll come knocking. Nobody wants to fight over TSMC's gaps if they can help it.

Also, Samsung doesn't need Apple's SoC business to fund innovation. Their enormous NAND and RAM business does that already.
 
  • Like
Reactions: n0x1ous