Question 'Ampere'/Next-gen gaming uarch speculation thread

Page 149

Ottonomous

Senior member
May 15, 2014
559
292
136
How much is the Samsung 7nm EUV process expected to provide in terms of gains?
How will the RTX components be scaled/developed?
Any major architectural enhancements expected?
Will VRAM be bumped to 16/12/12 for the top three?
Will there be further fragmentation in the lineup? (Keeping Turing at cheaper prices while offering 'beefed up RTX' options at the top?)
Will the top card be capable of >4K60, at least 90?
Would Nvidia ever consider an HBM implementation in the gaming lineup?
Will Nvidia introduce new proprietary technologies again?

Sorry if imprudent/uncalled for, just interested in the forum members' thoughts.
 

leoneazzurro

Golden Member
Jul 26, 2016
1,052
1,716
136
The compute architecture is CDNA, true, but that is designed for pure compute without display connectivity, like the A100. Pro cards for CAD (the equivalent of Quadro) will almost surely be based on the RDNA architecture, too.
 
  • Like
Reactions: lightmanek

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
But the comparison fails because VII was basically reject dies that couldn't be used for the workstation parts. So the "chip" was essentially free, as the option was to trash it or put it on a consumer card. So the $699 price tag only needed to cover the board + HBM2 memory, as the chip was available anyway.

With RDNA2 this doesn't apply, as it is mostly a gaming card, hence they must recoup the full cost mostly from gaming sales. That could be an upside for NV, as their Ampere chips will certainly be used in Quadro and Tesla SKUs.

This is pure FUD.
Radeon VII uses the same chips as Radeon Instinct MI50.
Radeon Instinct MI50 has exactly the same number of CUs and the same memory configuration as Radeon VII, but MI50 has lower clocks.
There is no way a chip good enough to become a Radeon VII card with 1800MHz peak clocks and 300W TDP would be rejected as an MI50 with 1725MHz peak clocks and 300W TDP.
 

eddman

Senior member
Dec 28, 2010
239
87
101
It's that not everything is about the compute units. What do you think the memory controller is doing? What about texture throughput? The compute doubles but the ROP/TMU stays the same. Memory bandwidth increases but not as much as compute.

People think compute units/FLOPS = performance because, generally, to keep things balanced, when you increase compute units by 50% you also try to bump everything else (ROP/TMU/memory bandwidth) by 50%.

I know that, but even with the same number of ROPs, TMUs, etc. had it been able to do FP+FP+INT at the same time, the performance would've probably still been quite higher, even if not all the cores were fully utilized. It was certainly not worth the added chip size, complexity, and hence cost, for a sub-optimal solution that would not have been able to fully utilize the cores.

That is what Turing did. Turing could do 1 INT + 1 FP. This is the next evolution where there is an additional data path but only for FP calculations which allows for either 1FP + 1FP or alternatively 1FP + 1INT.

I've been thinking a bit more about this. They state that a Turing SM could do 64 FP + 64 INT, while an Ampere SM does either 64 FP + 64 INT, or 64 FP + 64 FP.

Now, if there are physically double the FP32 cores in Ampere, why not do it similar to Turing and put all the FP cores in their own paths? Wouldn't that result in a 128 FP + 64 INT setup?

Actually, are there even double the FP cores? Jensen said they made the FP cores double-issue; would that imply that the number of physical cores is actually the same, but they've been upgraded to do twice the work?

In that case, perhaps by "datapath", they mean a logical path, not a physical one?

I'm just speculating here (I don't really know how these things work exactly); perhaps the scheduler/dispatcher is itself capable of double-issue at most, so it can either issue two FP instructions at the same time, or FP + INT.

Maybe they could've upgraded the scheduler/dispatcher to a triple-issue setup, but it would've been too big and complex, and probably unnecessary since all the other GPU units weren't increased in numbers?
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
I know that, but even with the same number of ROPs, TMUs, etc. had it been able to do FP+FP+INT at the same time, the performance would've probably still been quite higher, even if not all the cores were fully utilized. It was certainly not worth the added chip size, complexity, and hence cost, for a sub-optimal solution that would not have been able to fully utilize the cores.

It wouldn't have been. They said for every 100 fp instructions, there are 34 integer instructions. So the gain might have been 5-10%, at the most.

Consider that games vary tremendously in requirements so the second unit being flexible for integer or fp means it can adjust to it as needed. The gains in real world with a separate integer path might have been even less, at 3-4%. Certainly not worth the die/power tradeoff.
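To put rough numbers on the three SM layouts being compared here, below is a back-of-envelope sketch. It assumes a purely issue-bound workload, 64 lanes per datapath per clock (from the 64 FP + 64 INT figures quoted above), and the 100 FP : 34 INT mix mentioned in this post. The function names and the "triple" layout (128 FP + 64 INT) are hypothetical, made up for illustration; nothing here models memory, ROPs, TMUs or the scheduler.

```python
LANES = 64  # lanes per datapath per clock, taken from the 64 + 64 figures in the thread


def clocks_needed(fp, integer, layout):
    """Idealized clocks for one SM to retire `fp` FP32 and `integer` INT32 lane-ops."""
    if layout == "turing":
        # One FP-only path plus one INT-only path, running side by side.
        return max(fp, integer) / LANES
    if layout == "ampere":
        # One FP-only path plus one shared path that takes either FP or INT.
        # All INT work must go through the shared path.
        return max((fp + integer) / (2 * LANES), integer / LANES)
    if layout == "triple":
        # Hypothetical layout from the thread: two FP-only paths plus one INT-only path.
        return max(fp / (2 * LANES), integer / LANES)
    raise ValueError(f"unknown layout: {layout}")


def speedup(fp, integer, old, new):
    """How much faster `new` is than `old` for the same instruction mix."""
    return clocks_needed(fp, integer, old) / clocks_needed(fp, integer, new)


if __name__ == "__main__":
    fp, integer = 100, 34  # instruction mix quoted in the post
    print("Ampere-style vs Turing-style:", round(speedup(fp, integer, "turing", "ampere"), 2))
    print("Triple-path vs Ampere-style: ", round(speedup(fp, integer, "ampere", "triple"), 2))
```

In this idealized model the separate-INT layout tops out at roughly 1.34x over the shared layout for that mix. That is only a ceiling for the case where issue rate is the sole limit; anything bound by the units that were not doubled would see less.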

It's similar to what Intel does with AVX-512. It doesn't have two additional AVX-512 units. Instead, one of them works by treating 2x 256-bit units as 1x 512-bit. That saves on die area and power.

Intel actually does the same with their GPU architecture. The Integer unit is always shared with FP and not on a separate pipe. In Gen 9 GPU, you had 2x 4-wide FP/Integer unit per EU. In Gen 11, both can still do FP, but only one does Integer. In Xe it goes back to 2x FP/integer again.

Maybe they could've upgraded the scheduler/dispatcher to a triple-issue setup, but it would've been too big and complex, and probably unnecessary since all the other GPU units weren't increased in numbers?

Very likely these are low-level design decisions.

It's less "what's possible" and more "what's possible in a given power/area budget?"
 

xpea

Senior member
Feb 14, 2014
451
153
116
A little information about the new dual-issue FP32 that I got from my usual source: Ampere is apparently a monster in video encoding, 3D rendering and compute in general. The 2x FP32 is not fake, and when the workload matches the new SM organization, the speedup over Turing is close to the theoretical TFLOPS numbers... For example, I saw crazy numbers in the Blender bench...
 

majord

Senior member
Jul 26, 2015
493
641
136
Re FP32 - One thing I can't work out (until the whitepaper, anyway): can the execution units in the shared datapath execute FP and INT concurrently? Wasn't the split datapath arrangement in Turing what allowed this concurrent execution? The info from the Q&A doesn't make this clear either.
 

MrTeal

Diamond Member
Dec 7, 2003
3,614
1,816
136
Makes perfect sense. Knowing that, the 3090 is an incredible bargain for compute guys and I guess it will be in short supply for a long time...
It wouldn’t even have to be strictly traditional compute guys. If you have a job that benefits from GPU acceleration, the ROI starts to look pretty good if you save 5 minutes a day but you’re charging out $150/hr.
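The ROI claim can be checked with quick arithmetic. This sketch uses the 5 minutes a day and $150/hr from the post; the card price is an assumption that is not stated in the thread (the RTX 3090's $1,499 launch list price), and the function name is hypothetical.

```python
ASSUMED_CARD_PRICE = 1499.0  # USD; assumption, not from the thread (3090 launch list price)


def payback_days(card_price, minutes_saved_per_day, hourly_rate):
    """Working days until the billable time saved equals the card price."""
    value_per_day = minutes_saved_per_day / 60 * hourly_rate
    return card_price / value_per_day


if __name__ == "__main__":
    days = payback_days(ASSUMED_CARD_PRICE, minutes_saved_per_day=5, hourly_rate=150)
    print(f"Time saved is worth ${5 / 60 * 150:.2f} per day")
    print(f"Card pays for itself after about {days:.0f} working days")
```

With those inputs the saved time is worth about $12.50 a day, so the card would pay for itself in roughly 120 working days, provided the saved minutes are actually billed.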
 

beginner99

Diamond Member
Jun 2, 2009
5,233
1,610
136
If you have a job that benefits from GPU acceleration, the ROI starts to look pretty good if you save 5 minutes a day but you’re charging out $150/hr.

Those 5 min are usually spent on more breaks, chatting or browsing the internet, not on actual productivity. But yeah, you can always use that argument to get a nice new shiny box at work.
 
  • Like
Reactions: lightmanek

MrTeal

Diamond Member
Dec 7, 2003
3,614
1,816
136
Those 5 min are usually spent on more breaks, chatting or browsing the internet, not on actual productivity. But yeah, you can always use that argument to get a nice new shiny box at work.
The real challenge is convincing whoever holds the purse strings to open them, even if there is a compelling case for it. :p
 

traderjay

Senior member
Sep 24, 2015
220
165
116
Is this Samsung foundry's first major contract for a complex chip like this? If so, they must've bent over backward to please a first big customer like Nvidia. Plus, Nvidia has a history of being extremely aggressive in negotiations as well.
 
  • Like
Reactions: KompuKare

maddie

Diamond Member
Jul 18, 2010
4,881
4,951
136
Is this Samsung foundry's first major contract for a complex chip like this? If so, they must've bent over backward to please a first big customer like Nvidia. Plus, Nvidia has a history of being extremely aggressive in negotiations as well.
Just had a terrible thought. Not realistic at all but it just popped up.

What if this results in the reverse of what happened to TSMC when they took on Apple as a client? Apple helps TSMC succeed while Nvidia helps sink Samsung by being too aggressive. All for me, none for thee.
 

shiznit

Senior member
Nov 16, 2004
424
13
81
Just had a terrible thought. Not realistic at all but it just popped up.

What if this results in the reverse of what happened to TSMC when they took on Apple as a client? Apple helps TSMC succeed while Nvidia helps sink Samsung by being too aggressive. All for me, none for thee.
It will most likely help Samsung. When other vendors of large ASICs see the volume and margins Nvidia will be pushing with Ampere, they'll come knocking. Nobody wants to fight over TSMC's gaps if they can help it.

Also, Samsung doesn't need Apple's SoC business to fund innovation. Their enormous NAND and RAM business does that already.
 
  • Like
Reactions: n0x1ous

xpea

Senior member
Feb 14, 2014
451
153
116
Is this Samsung foundry's first major contract for a complex chip like this? If so, they must've bent over backward to please a first big customer like Nvidia. Plus, Nvidia has a history of being extremely aggressive in negotiations as well.
IBM's Power10 and Z monster datacenter CPUs will be made by Samsung on 7nm EUV next year.
 
  • Like
Reactions: lightmanek

traderjay

Senior member
Sep 24, 2015
220
165
116
So in a roundabout way you're saying Apple drives the success of the foundries?

It's a symbiotic relationship; they are virtually inseparable and they feed on each other's success. Apple's huge volume (revenue for the foundry) drives innovation and more R&D. Just look at Huawei's Kirin without a fab: they are on life support now and won't last much longer.

IBM's Power10 and Z monster datacenter CPUs will be made by Samsung on 7nm EUV next year.

Interesting, but I guess the IBM Power CPUs don't move anywhere near the same volume as Nvidia, so I think Nvidia is the make-or-break moment for Samsung's foundry.
 
  • Like
Reactions: raghu78

maddie

Diamond Member
Jul 18, 2010
4,881
4,951
136
It will most likely help Samsung. When other vendors of large ASICs see the volume and margins Nvidia will be pushing with Ampere, they'll come knocking. Nobody wants to fight over TSMC's gaps if they can help it.

Also, Samsung doesn't need Apple's SoC business to fund innovation. Their enormous NAND and RAM business does that already.
Well, they're obviously not quite doing everything right. One only has to compare the 3080 & 3090 parameters to realize that something is very rotten with SS8.

Edit:
Just a thought experiment for argument's sake, pushing to the corner case here. Let's assume that the fab cost of a die is free.

Do we have any idea of what the product using it will have to sell at to break even?

Assume now that the performance characteristics are worse than the competitor. Power, performance, etc.

What is the new selling price to compete in the marketplace?

In other words, at what price point is a free die equal to or worse than one with superior parameters?
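One way to frame that thought experiment is as a per-card margin comparison. The sketch below is only a toy model: it assumes the market price scales linearly with relative performance, and every dollar figure in the usage section is a made-up placeholder, not real BOM or pricing data. All function and parameter names are hypothetical.

```python
def margin(price, die_cost, other_costs):
    """Per-card margin: selling price minus die cost minus everything else (board, memory, cooler)."""
    return price - die_cost - other_costs


def free_die_margin_gap(rival_price, rival_die_cost, rival_other_costs,
                        relative_perf, free_other_costs):
    """Margin of the free-die card minus margin of the rival card.

    Assumes the free-die card must be priced at rival_price * relative_perf to compete.
    A negative result means the free die is the worse deal.
    """
    free_price = rival_price * relative_perf
    return (margin(free_price, 0.0, free_other_costs)
            - margin(rival_price, rival_die_cost, rival_other_costs))


def break_even_relative_perf(rival_price, rival_die_cost, rival_other_costs, free_other_costs):
    """Relative performance at which the free-die card earns the same margin as the rival."""
    rival_margin = margin(rival_price, rival_die_cost, rival_other_costs)
    return (rival_margin + free_other_costs) / rival_price


if __name__ == "__main__":
    # Placeholder numbers only: a $700 rival card with a $100 die and $300 of other costs,
    # versus a free-die card at 85% of the performance that needs $330 of other costs
    # (for example a bigger cooler and power delivery).
    print(free_die_margin_gap(700, 100, 300, relative_perf=0.85, free_other_costs=330))
    print(break_even_relative_perf(700, 100, 300, free_other_costs=330))
```

With those placeholder inputs the free die ends up about $35 per card behind and would need roughly 90% of the rival's performance to break even. The point is only that a free die stops being an advantage once the performance deficit and the extra board costs outweigh the rival's die cost.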
 
  • Like
Reactions: KompuKare