8GB VRAM not enough (and 10 / 12)

BFG10K · Jul 13, 2021

This thread was started in mid 2021 and is being retired/locked. As the OP is no longer active, or updating and maintaining it.

Mod DAPUNISHER

8GB

Dying Light 2, Battlefield 2042, Avatar Frontiers of Pandora, Call of Duty Modern Warfare 3, Starfield, Forza Horizon 5, Horizon Forbidden West, Alan Wake 2, The Last of Us, Tekken 8, FC 24, Detroit Become Human, Death Stranding, Alone in The Dark, Overwatch 2
3060 12GB higher 1% low than 4060 8GB https://www.youtube.com/watch?v=gIfz57wmywg
Cyberpunk 2077, Diablo 4, Hogwarts Legacy, Horizon Zero Dawn, Microsoft Flight Sim 2020, Witcher 3, Watch Dogs Legion skip to 4K + DLSS for each game https://www.youtube.com/watch?v=tPiaWWd0xGc
Ratchet & Clank, Plague Tale Requiem, Last Of Us, Jedi Survivor, Far Cry 6, Hogwarts Legacy, Forza Horizon 5, Cyberpunk 2077 https://www.youtube.com/watch?v=Wr3MckDxMfE&pp=ygUTMTZnYiB2cyA4Z2IgNDA2MCB0aQ==
Far Cry 6, Godfall, Hogwarts Legacy, Forza Horizon 5, Forspoken, Doom Eternal, Watchdogs Legion https://youtu.be/fJc--C01P90?t=268
Last Of Us, Resident Evil 4, Callisto Protocol, Plague Tale Requiem, Halo Infinite, Forspoken https://youtu.be/2_Y3E631ro8?t=429
Last Of Us, Hogwarts Legacy, Resident Evil 4, Forspoken, Plague Tale Requiem, Callisto Protocol https://youtu.be/Rh7kFgHe21k?t=288
Hogwarts Legacy, Last Of Us, Spider Man Remastered, Doom Eternal, Forza Horizon 5 https://youtu.be/R943CbDTq_s?t=1473
Horizon Forbidden West 4060 has issues @ 1080p + DLSS unless you switch off frame generation and reduce settings to high https://youtu.be/3AyKvI23VGw?t=233
Horizon Forbidden West 3070 requires reducing textures 2 notches https://youtu.be/xTwEGy6HHKo?t=351
Deathloop https://youtu.be/lu_XHyO-6zY?t=918
Ghost Recon Breakpoint https://youtu.be/iUGppQVpXzU?t=529
Resident Evil 2 https://youtu.be/Aa-gsgRjMII?t=796
Talos Principle 2 https://youtu.be/K6FATQAuEwI?t=337
Spider Man Remastered https://youtu.be/vRVR_plVSlc?t=1060
Witcher 3 https://youtu.be/vRVR_plVSlc?t=1240

Horizon Forbidden West 3060 is faster than the 2080 Super despite the former usually competing with the 2070. Also 3060 has a better 1% low than 4060 and 4060Ti 8GB.

Resident Evil Village 3060TI/3070 tanks at 4K and is slower than the 3060/6700XT when ray tracing:

Company Of Heroes 3060 has a higher minimum than the 3070TI:

10GB / 12GB

Plague Tale Requiem 3080 10GB tanks if you enable ray tracing, would be fast enough if it had more VRAM because if you stop moving, the framerate stabilizes to >60FPS https://youtu.be/7zjrMww3fAc?t=496
Hogwarts Legacy 4K + RT, 4060Ti 16GB is faster than 4070 12GB https://youtu.be/X-h_R488Mq8?t=191

Reasons why still shipping 8GB since 2014 isn't NV's fault.

It's the player's fault.
It's the reviewer's fault.
It's the developer's fault.
It's AMD's fault.
It's the game's fault.
It's the driver's fault.
It's a system configuration issue.
Wrong settings were tested.
Wrong area was tested.
Wrong games were tested.
4K is irrelevant.
Texture quality is irrelevant as long as it matches a console's.
Detail levels are irrelevant as long as they match a console's.
There's no reason a game should use more than 8GB, because a random forum user said so.
It's completely acceptable for the more expensive 3070/3070TI/3080 to turn down settings while the cheaper 3060/6700XT has no issue.
It's an anomaly.
It's a console port.
It's a conspiracy against NV.
8GB cards aren't meant for 4K / 1440p / 1080p / 720p gaming.
It's completely acceptable to disable ray tracing on NV while AMD has no issue.
Polls, hardware market share, and game title count are evidence 8GB is enough, but are totally ignored when they don't suit the ray tracing agenda.

According to some people here, 8GB is neeeevaaaaah NV's fault and objective evidence "doesn't count" because of reasons(tm). If you have others please let me know and I'll add them to the list. Cheers!

CakeMonster · May 14, 2023

Aapje said:
So, a processing unit in a CPU might be built to operate on 32 bits at a time. If the address to denote a certain part of the RAM is then also 32 bit long, you use that processing unit optimally. If you actually need only need a 24 bit number to address the memory, you still have to use a 32 bit operation and it is no faster than manipulating a 32 bit number.

Off topic but curious: Does this impact the 24/48/96GB DDR5 kits that are coming out now?

Aapje · May 14, 2023

CakeMonster said:
Off topic but curious: Does this impact the 24/48/96GB DDR5 kits that are coming out now?

In my earlier post I kept it a bit simplistic. If you have two DIMMs, then each of those has memory starting from address zero. So you can't actually use a single memory address for both, since if you then refer to a certain address, you would have no idea from which DIMM you should be reading, since both have that address.

So you have to use a translation from the memory address that the software uses to a different address on the actual DIMMs, which is what the memory controller does. This way, software doesn't actually have to keep track of how many DIMMs you have in your system and address them separately, which would be a nightmare.

However, OS developers soon realized that since you have to translate the addresses anyway, you can actually completely separate the addresses that are used by the applications from the real hardware addresses. This is called virtual memory. It has a lot of advantages, like security, since you can now prevent applications from reading memory that doesn't belong to them. In the old school computers, this was a nightmare, because a malfunctioning program could just overwrite the memory of other applications or even the OS. And any malicious program could do pretty much anything.

Another advantage of virtual memory is that you can track what memory is almost never used and store it on disk. So the application puts something in memory, but the OS figures out that it is rarely used and it can move it to the disk and free the real memory. This is especially useful for poorly optimized software that just loads all kinds of crap into memory that it never actually uses.

Because of virtual memory, you could actually use more than 4 GB on (some) 32-bit OS's. So 32 bits is sufficient to address 4 GB. Back then, the application would usually just use 32 bit addresses, but they had a hack in the OS to have 36 bit virtual memory in total, which is good enough for up to 64 GB. So you could have two applications that would each get 4 GB, but the OS would map that to a 36 bit address. So a certain address from app 1 would refer to a completely different location in the virtual memory than the exact same address in program 2.

Of course, nowadays we have 64-bit OS's, which is sufficient for 128 TB of virtual memory (although far less than that for the actual memory, for reasons).

Anyway, to answer your question I have to address the simplification where above, I pretended that DIMMs have a single address space. They don't. In reality, they have banks which are always a fixed number (16 for DDR4, 32 for DDR5). The memory controller sends a command to select a certain bank and also has separate addresses for the row and column to read from. They use 17 bits for the column with DDR5 and 13 bits for the row.

So actually when you get to these lower levels, they don't use power of 2 at all.

In any case, 24/48/96 GB are no problem as long as they fit in the available total address space. In the case of DDR4, that was 64 GB, so you could never have a DIMM bigger than that. With DDR5 they can go up to 512 GB, so theoretically they could make 512 GB DIMMs and if you put 4 of them in your system, you would have 2 TB of RAM and an empty bank account.

CakeMonster · May 14, 2023

Cool, thanks for the explanation, I'm gonna have to read that again to properly understand it all. But glad to have confirmation that uncommon sizes are no problem by themselves, it would have been disappointing if it was. I hope we'll see more flexibility in both the RAM and VRAM spaces going forward.

VRAMdemon · May 14, 2023

jpiniero said:
I expect nVidia to refresh Ada with GDDR7. And I would say 3 or 4 GB chips is theoretically possible then. But that's just a guess.

So you could say that the bus width decisions could have been made thinking that the refresh would have 50% or 100% more memory than the initial.

Thanks for the explanation guys. I always thought 192 bit meant numbers of 6. 6/12/18/24 GB like the 1060 and 3060. I wasn't sure if they could just add another 6 GB to 18 and make the 4070 and 4070ti much better card. I now know it doesn't work that way at the moment.

Aapje · May 15, 2023

VRAMdemon said:
Thanks for the explanation guys. I always thought 192 bit meant numbers of 6. 6/12/18/24 GB like the 1060 and 3060. I wasn't sure if they could just add another 6 GB to 18 and make the 4070 and 4070ti much better card. I now know it doesn't work that way at the moment.

Every individual memory module uses 32 bit of the bus for full speed. 192 bit divided by 32 is 6, so you can support 6 modules at full speed.

Since there are modules of 1 and 2 GB, you can have 6 x 1 GB = 6 GB or 6 x 2 GB = 12 GB.

However, the GDDR standard actually requires the memory modules to support a 'clamshell' mode, where each module only gets a 16 bit wide bus. So then you can support up to 192/16 = 12 modules. So with 2 GB modules, that is 12 x 2 = 24 GB on a 192 bit bus. However, AFAIK there is no actual requirement that every module uses the same mode, so theoretically they could run 4 modules in normal mode and 4 modules in clamshell mode, for a total of 16 GB.

Note that the actual implementation of the clamshell mode uses striping, similar to RAID 0. Memory modules in that mode always operate in pairs that share a command line:

So both memory modules get the instruction to store data at the same address, but the memory controller sends half of the data to each memory module.

So to give a simplified example, imagine that the GPU asks to store the word 'complicated' at address 5. Then both memory modules get the instruction to store something at address 5, but memory module 1 gets 'compl' and the other memory module gets 'icated.' Then when reading the data, the opposite happens, where both memory modules get asked to return the value at address 5. Then module 1 returns 'compl' and module 2 returns 'icated.' Then the memory controller combines this back into 'complicated.'

This setup means that reading from and writing to modules in clamshell mode is just as fast as reading/writing from/to modules in normal mode. However, of course it twice as fast to have 24 GB of memory on a 384 bit bus rather than have 24 GB on a 192 bit bus with clamshell mode.

DeathReborn · May 15, 2023

Aapje said:
Every individual memory module uses 32 bit of the bus for full speed. 192 bit divided by 32 is 6, so you can support 6 modules at full speed.

Since there are modules of 1 and 2 GB, you can have 6 x 1 GB = 6 GB or 6 x 2 GB = 12 GB.

However, the GDDR standard actually requires the memory modules to have a 'clamshell' mode, where each module only gets a 16 bit wide bus. So then you can support up to 192/16 = 12 modules. So with 2 GB modules, that is 12 x 2 = 24 GB on a 192 bit bus. However, AFAIK there is no actual requirement that every module uses the same mode, so theoretically they could run 4 modules in normal mode and 4 modules in clamshell mode, for a total of 16 GB.

Note that the actual implementation of the clamshell mode uses striping, similar to RAID 0. Memory modules in that mode always operate in pairs that share a command line:

View attachment 80629
So both memory modules get the instruction to store data at the same address, but the memory controller sends half of the data to each memory module.

So to give a simplified example, imagine that the GPU asks to store the word 'complicated' at address 5. Then both memory modules get the instruction to store something at address 5, but memory module 1 gets 'compl' and the other memory module gets 'icated.' Then when reading the data, the opposite happens, where both memory modules get asked to return the value at address 5. Then module 1 returns 'compl' and module 2 returns 'icated.' Then the memory controller combines this back into 'complicated.'

This setup means that reading from and writing to modules in clamshell mode is just as fast as reading/writing from/to modules in normal mode. However, of course it twice as fast to have 24 GB of memory on a 384 bit bus rather than have 24 GB on a 192 bit bus with clamshell mode.

GDDR6 introduced chips with 2 Channel 16bit and for Clamshell that drops to 8bit per channel. GDDR6 also has specs for 1GB, 1.5GB, 2GB, 3GB & 4GB capacity dies, Samsung is believed to have 4GB GDDR6 dies but nobody has the 1.5GB/3GB dies as of yet.

coercitiv · May 15, 2023

Aapje said:
so theoretically they could run 4 modules in normal mode and 4 modules in clamshell mode, for a total of 16 GB.

Would that not results in a region of very slow memory? AFAIK in this case the last 4GB of memory space on the card would be addressed though just 64 bits.

Aapje · May 15, 2023

DeathReborn said:
GDDR6 introduced chips with 2 Channel 16bit and for Clamshell that drops to 8bit per channel. GDDR6 also has specs for 1GB, 1.5GB, 2GB, 3GB & 4GB capacity dies, Samsung is believed to have 4GB GDDR6 dies but nobody has the 1.5GB/3GB dies as of yet.

Samsung announced GDDR6W, which does have 4 GB, but it it also has a 64 bit connection, rather than 32 bit. So you don't actually get any more memory, because two modules of 2 GB together use 64 bits of the bus, the same as a single 4 GB GDDR6W module.

The main advantage of GDDR6W is that you can use a smaller PCB and require fewer traces for the power delivery. So it's cheaper.

Aapje · May 15, 2023

coercitiv said:
Would that not results in a region of very slow memory? AFAIK in this case the last 4GB of memory space on the card would be addressed though just 64 bits.

The latency is the same because of the striping, but the throughput is halved of those memory modules that share a 32 bit bus compared to memory modules that each get a dedicated 32 bits.

coercitiv · May 15, 2023

Aapje said:
The latency is the same because of the striping, but the throughput is halved of those memory modules that share a 32 bit bus compared to memory modules that each get a dedicated 32 bits.

Like I said, data striping would work on all 6 buses until the 12GB limit is reached, after which the last 4GB would be striped on just 2 buses. It's the same mechanic that allows us to have asymmetric memory arrangement in a dual-channel CPU system: 8GB on channel A and 4GB on channel B would result in dual-channel speed fort the first 8GB of the memory space, and another 4GB working in single channel mode.

Aapje · May 15, 2023

I think that you are confusing striping with parallelization.

Lets say that you have 4 modules, where the last two are in clamshell mode, then those share a connection. So every cycle, something can be written or read from each of module 1, module 2 and (module 3+4). So the last two modules act like one.

So the maximum every cycle:
1. read/write -> Module 1
2. read/write -> Module 2
3. read/write -> Module 3+4

While if you have a bigger bus so module 3 and 4 get 32 bits of dedicated bus space:
1. read/write -> Module 1
2. read/write -> Module 2
3. read/write -> Module 3
4. read/write -> Module 4

However, you don't get worse throughput without clamshell mode (and thus less memory):
1. read/write -> Module 1
2. read/write -> Module 2
3. read/write -> Module 3

The main issue with sharing part of the bus while having more memory is that there is a bigger chance that you need a lot of data from module 3+4, and the relatively small bus becomes a choke point. Of course, using a clamshell for all the memory creates a choke point on all memory modules, but then it is consistent, so the 1% lows should be better. An asymmetric design should cause more variance.

DeathReborn · May 15, 2023

Aapje said:
Samsung announced GDDR6W, which does have 4 GB, but it it also has a 64 bit connection, rather than 32 bit. So you don't actually get any more memory, because two modules of 2 GB together use 64 bits of the bus, the same as a single 4 GB GDDR6W module.

The main advantage of GDDR6W is that you can use a smaller PCB and require fewer traces for the power delivery. So it's cheaper.

GDDR6 (no W) does go up to 32Gbit 4GB, this is Micron documentation showing it (Bottom of Page 2). Further down it does gop into Clamshell for GDDR6.

https://media-www.micron.com/-/media/client/global/documents/products/technical-note/dram/tn-ed-04_gddr6_design_guide.pdf?rev=b010b4b6181d4153b778152c617a61a4

Aapje · May 15, 2023

DeathReborn said:
GDDR6 (no W) does go up to 32Gbit 4GB, this is Micron documentation showing it (Bottom of Page 2).

It says that the standard supports it, but that Micron doesn't make them.

And as far as I know, no one does. So in reality we are limited to 2 GB modules or a clamshell.

Mopetar · May 15, 2023

coercitiv said:
Would that not results in a region of very slow memory? AFAIK in this case the last 4GB of memory space on the card would be addressed though just 64 bits.

It doesn't slow the memory down due to the way the data is stored. The throughput is always based on the size of the bus and the operating frequency of the memory. The bus width isn't changing so the throughput to any memory in clamshell configuration is unchanged.

The only theoretical way to see that slowdown is to actually reduce the size of the bus servicing the last region of memory. I don't even know if the standard supports this or if anyone builds hardware that way. The last time anything remotely like this happened was with the 970 and that caused such a mess that I don't think anyone would even try something that results in a similar sort of issue.

Clamshell just means you have twice as much memory, but it's not any faster to access it. A 24 GB 4070 Ti would have the same memory bandwidth as a 12 GB 4070 Ti. The only time you see a performance benefit is when the VRAM isn't sufficient to hold everything and main memory is used, which is far slower to access because the GPU has to go off card to retrieve it.

CakeMonster · May 15, 2023

Mopetar said:
The only theoretical way to see that slowdown is to actually reduce the size of the bus servicing the last region of memory. I don't even know if the standard supports this or if anyone builds hardware that way. The last time anything remotely like this happened was with the 970 and that caused such a mess that I don't think anyone would even try something that results in a similar sort of issue.

The Xbox X does something similar I think, although it performs quite closely to the PS5 still (based on watching DF videos, don't know the specs).

igor_kavinski · May 15, 2023

CakeMonster said:
The Xbox X does something similar I think, although it performs quite closely to the PS5 still

That's coz the developers know about the limitation and code accordingly. Few, if any, game developers will spend time on optimizing their game for a few SKUs in the PC space that have two different levels of VRAM performance.

But I think there is merit in having different VRAM performance levels. First level could be GDDR6 and second level could be plain and cheap DDR5 and it would still be a heck of a lot faster than going out to system RAM. Benefit would be having more VRAM available (suppose 8GB GDDR6 and 12 or 16GB DDR5).

coercitiv · May 15, 2023

Mopetar said:
The only theoretical way to see that slowdown is to actually reduce the size of the bus servicing the last region of memory.

That's exactly what we were discussing. The example I asked about was 192 bit bus width with 4 modules in normal mode and 4 modules in clamshell mode, all of them being 2GB to achieve 16GB.

CakeMonster said:
The Xbox X does something similar I think, although it performs quite closely to the PS5 still (based on watching DF videos, don't know the specs).

Yes, the Xbox series X has 16GB of RAM, out of which only 10GB are serviced through the entire 320bit memory bus. This is happening because 6 out of the 10 memory modules are 2GB, while the rest are 1GB in size. The first 10GB of the memory space can therefore be serviced using the full bus width, while the last 6GB can be accessed via 192 bits.

A detailed description is available here.

Mopetar · May 15, 2023

CakeMonster said:
The Xbox X does something similar I think, although it performs quite closely to the PS5 still (based on watching DF videos, don't know the specs).

They're just using two separate types of memory. It does encourage developers to stick to 10 GB of VRAM usage as anything beyond that will be hitting slower memory, but that can be designed around to some extent.

A console can get away with this because it's the only specification to develop towards for a first party title. That's not always the case, but targeting a few different hardware specifications to optimize for is a lot less development effort than the PC space where it's thousands of different permutations.

DeathReborn · May 15, 2023

Aapje said:
It says that the standard supports it, but that Micron doesn't make them.

And as far as I know, no one does. So in reality we are limited to 2 GB modules or a clamshell.

That paper is from 2018, it was updated later but not that section. Micron could have made 24/32Gbit prototypes, they could have none but they'd only say so publicly if they were going to launch them. I have heard that Samsung had 32Gbit GDDR6 in the labs a few months back but a little searching shows kopite7kimi saying they had dies back in 2022.

https://twitter.com/x/status/1498929902955147271

Aapje · May 15, 2023

DeathReborn said:
Micron could have made 24/32Gbit prototypes, they could have none but they'd only say so publicly if they were going to launch them. I have heard that Samsung had 32Gbit GDDR6 in the labs a few months back but a little searching shows kopite7kimi saying they had dies back in 2022.

Prototypes don't mean much. They don't go on cards unless they can be produced in large numbers for a decent price.

By now it may not be worth it to bring them out, rather than focus on GDDR7.

BFG10K · May 18, 2023

$399 & $499. LMAO.

NVIDIA GeForce RTX 4060 Ti rumored to cost $399 (8GB) and $499 (16GB) - VideoCardz.com

NVIDIA RTX 4060 Ti has rumored prices MEGAsizeGPU, an NVIDIA hardware leaker who is rarely wrong has just shared the first pricing rumors for NVIDIA’s upcoming RTX 4060 Ti series. According to the leaker, there should be a $100 price gap between the 8GB and 16GB version. The cheaper SKU is said...

videocardz.com

If true this is utter garbage. Same price as a two year old 3060Ti with 35% less bandwidth, a bit less power, and (probably) not much faster.

And they're trying to tell us 8GB VRAM costs $100. Pffff.

As I said above, this is NV's 14+++++ nm because they're a monopoly. Like Intel's 4 cores with +5% performance and a forced motherboard upgrade every 18 months.

igor_kavinski · May 18, 2023

Well, at least it's progress for a lot of people to be able to afford a 16GB Nvidia card, rather than stick with 8GB coz they refuse to consider AMD as an option.

VirtualLarry · May 18, 2023

BFG10K said:
If true this is utter garbage. Same price as a two year old 3060Ti with 35% less bandwidth, a bit less power, and (probably) not much faster.

And they're trying to tell us 8GB VRAM costs $100. Pffff.

As I said above, this is NV's 14+++++ nm because they're a monopoly. Like Intel's 4 cores with +5% performance and a forced motherboard upgrade every 18 months.

Yep, pure garbage. NVidia has basically lost me as a customer forever, at this point.

TESKATLIPOKA · May 18, 2023

BFG10K said:
If true this is utter garbage. Same price as a two year old 3060Ti with 35% less bandwidth, a bit less power, and (probably) not much faster.

It's not a bad card per se, only the price is not right.

If you think about It, then AMD did pretty much the same thing with RX 6600(6650)XT vs RX 5700XT.

VirtualLarry said:
Yep, pure garbage. NVidia has basically lost me as a customer forever, at this point.

It's not like there is anything in this price range from the competition this gen.
Unless AMD releases N32, then there is no other option.

Ok, you can buy 6800(XT) 16GB, but that also Is not a great option If you ask me.
If we talk only about raster, then they will be faster, but the moment you enable RT the performance will be pretty comparable to this "garbage".

Aapje · May 18, 2023

TESKATLIPOKA said:
It's not like there is anything in this price range from the competition this gen.
Unless AMD releases N32, then there is no other option.

The best option is just not to buy this crap. Be creative and play older games that you missed the first time around or just accept that you have to lower settings more for a while.

The more this generation fails to sell, the better they are going to make the next generation, if they actually still are about large sales. We've already seen that they can't even wait that long and have to discount the current generation, although I doubt that it will ever become a good deal.

8GB VRAM not enough (and 10 / 12)

Lifer

Golden Member

Golden Member

Golden Member

Diamond Member

Golden Member

Platinum Member

Diamond Member

Golden Member

Golden Member

Diamond Member

Golden Member

Platinum Member

Golden Member

Diamond Member

Golden Member

Lifer

Diamond Member

Diamond Member

Platinum Member

Golden Member

Lifer

Lifer

No Lifer

Platinum Member

Golden Member