WCCftech: Memory allocation problem with GTX 970 [UPDATE] PCPer: NVidia response

Status
Not open for further replies.

hawtdawg

Golden Member
Jun 4, 2005
1,223
7
81
It would only be a stupid statement if the 780ti did suffer the same issues as the 970 under the same settings and circumstances. If the Ti cruised smoothly under conditions where the 970 faltered at 3.5gb+ vram, then it is not a vram issue but simply GPU power that is slowing the 970 down. Did you not even consider that possibility ffs?


This is an even dumber statement than your last.

You can think about what you said for a day. Obviously you don't care to read moderator warnings in threads, since you continue to post these types of statements, which add no value to the discussion.

-Rvenger
 
Last edited by a moderator:

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
Something that crossed my mind:

What about DirectCompute? Does it have the same problem CUDA has? What about games that use a lot of DirectCompute?

DirectCompute is handled via DirectX.

I would bet you every single harvested part works the same way, including AMD before they decoupled the ROPs.
 

amenx

Diamond Member
Dec 17, 2004
4,616
2,930
136
This is an even dumber statement than your last.
Dumbness is for those who cannot even elaborate.

Yet you took the bait and fired back. Next time it happens you will not be posting in this thread any longer.

-Rvenger
 
Last edited by a moderator:

Abwx

Lifer
Apr 2, 2011
11,991
4,948
136
It's not going over the bus. Someone in OCN just ran a comparison between a pinned copy of 1.5 GB in VRAM against 2 GB in VRAM. The last 500 MB is getting a little over 25 GB/s so that's way too fast for the PCIe bus.

PCIe speed is 22.5GB/s, and the 25GB/s is an aggregate bandwidth: data goes from the 768MB of RAM to a cache and then eventually to the PCIe bus. At first the caches fill rapidly and the first slowed chunk benefits from the available cache resources, but the next chunks will see bandwidth collapse to near PCIe bandwidth, because the cache resources are exhausted by the previously loaded data, which can't be evacuated fast enough through the PCIe bus.
 

dacostafilipe

Senior member
Oct 10, 2013
810
315
136
DirectCompute is handled via DirectX.

The swapping is not done by DirectX (Direct3D) but by the driver. And because DirectCompute needs a special driver (like CUDA), we don't know if it's also affected.

I would bet you every single harvested part works the same way, including AMD before they decoupled the ROPs.

I don't think so. The problem seems to be the lack of a complex crossbar switch (which requires more transistors and consumes more energy).
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
The swapping is not done by DirectX (Direct3D) but by the driver. And because DirectCompute needs a special driver (like CUDA), we don't know if it's also affected.

I don't think so. The problem seems to be the lack of a complex crossbar switch (which requires more transistors and consumes more energy).

gtx 670
allocating memory . . .
Chunk size: 128 mibyte
allocated 15 chunks
allocated 1920 mibyte
benchmarking dram
dram-bandwidth of chunk no. 0 (0 mibyte to 128 mibyte):159.97 gbyte/s
dram-bandwidth of chunk no. 1 (128 mibyte to 256 mibyte):160.25 gbyte/s
dram-bandwidth of chunk no. 2 (256 mibyte to 384 mibyte):160.12 gbyte/s
dram-bandwidth of chunk no. 3 (384 mibyte to 512 mibyte):160.09 gbyte/s
dram-bandwidth of chunk no. 4 (512 mibyte to 640 mibyte):159.93 gbyte/s
dram-bandwidth of chunk no. 5 (640 mibyte to 768 mibyte):159.90 gbyte/s
dram-bandwidth of chunk no. 6 (768 mibyte to 896 mibyte):159.83 gbyte/s
dram-bandwidth of chunk no. 7 (896 mibyte to 1024 mibyte):160.06 gbyte/s
dram-bandwidth of chunk no. 8 (1024 mibyte to 1152 mibyte):160.16 gbyte/s
dram-bandwidth of chunk no. 9 (1152 mibyte to 1280 mibyte):160.17 gbyte/s
dram-bandwidth of chunk no. 10 (1280 mibyte to 1408 mibyte):160.00 gbyte/s
dram-bandwidth of chunk no. 11 (1408 mibyte to 1536 mibyte):159.85 gbyte/s
dram-bandwidth of chunk no. 12 (1536 mibyte to 1664 mibyte):158.89 gbyte/s
dram-bandwidth of chunk no. 13 (1664 mibyte to 1792 mibyte): 8.21 gbyte/s
dram-bandwidth of chunk no. 14 (1792 mibyte to 1920 mibyte): 3.30 gbyte/s
benchmarking l2-cache
l2-cache-bandwidth of chunk no. 0 (0 mibyte to 128 mibyte):289.68 gbyte/s
l2-cache-bandwidth of chunk no. 1 (128 mibyte to 256 mibyte):289.69 gbyte/s
l2-cache-bandwidth of chunk no. 2 (256 mibyte to 384 mibyte):289.69 gbyte/s
l2-cache-bandwidth of chunk no. 3 (384 mibyte to 512 mibyte):289.69 gbyte/s
l2-cache-bandwidth of chunk no. 4 (512 mibyte to 640 mibyte):289.69 gbyte/s
l2-cache-bandwidth of chunk no. 5 (640 mibyte to 768 mibyte):289.69 gbyte/s
l2-cache-bandwidth of chunk no. 6 (768 mibyte to 896 mibyte):289.69 gbyte/s
l2-cache-bandwidth of chunk no. 7 (896 mibyte to 1024 mibyte):289.69 gbyte/s
l2-cache-bandwidth of chunk no. 8 (1024 mibyte to 1152 mibyte):289.69 gbyte/s
l2-cache-bandwidth of chunk no. 9 (1152 mibyte to 1280 mibyte):289.69 gbyte/s
l2-cache-bandwidth of chunk no. 10 (1280 mibyte to 1408 mibyte):289.69 gbyte/s
l2-cache-bandwidth of chunk no. 11 (1408 mibyte to 1536 mibyte):289.69 gbyte/s
l2-cache-bandwidth of chunk no. 12 (1536 mibyte to 1664 mibyte):289.69 gbyte/s
kernel launch failed: The launch timed out and was terminated
kernel launch failed: The launch timed out and was terminated
press any key to continue . . .
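A minimal sketch of how output like the above could be scanned for the chunks where DRAM bandwidth falls off. The helper names (`parse_dram_chunks`, `slow_chunks`) and the "below half the median" cutoff are assumptions made for illustration; they are not part of the benchmark tool.

```python
import re
from statistics import median

# Hypothetical helper: matches the "dram-bandwidth of chunk no. N" lines only.
# The l2-cache lines use a different prefix and are ignored.
LINE_RE = re.compile(
    r"dram-bandwidth of chunk no\.\s*(\d+)\s*"
    r"\((\d+)\s*mibyte to\s*(\d+)\s*mibyte\):\s*([\d.]+)\s*gbyte/s",
    re.IGNORECASE,
)


def parse_dram_chunks(log):
    """Return a list of (chunk_no, start_mib, end_mib, gbyte_per_s) tuples."""
    return [
        (int(m.group(1)), int(m.group(2)), int(m.group(3)), float(m.group(4)))
        for m in LINE_RE.finditer(log)
    ]


def slow_chunks(log, ratio=0.5):
    """Chunks whose bandwidth is below `ratio` times the median (assumed cutoff)."""
    chunks = parse_dram_chunks(log)
    if not chunks:
        return []
    cutoff = ratio * median(c[3] for c in chunks)
    return [c for c in chunks if c[3] < cutoff]


# The last four DRAM lines of the GTX 670 run quoted above.
sample = """\
dram-bandwidth of chunk no. 11 (1408 mibyte to 1536 mibyte):159.85 gbyte/s
dram-bandwidth of chunk no. 12 (1536 mibyte to 1664 mibyte):158.89 gbyte/s
dram-bandwidth of chunk no. 13 (1664 mibyte to 1792 mibyte): 8.21 gbyte/s
dram-bandwidth of chunk no. 14 (1792 mibyte to 1920 mibyte): 3.30 gbyte/s
"""

for chunk_no, start, end, bw in slow_chunks(sample):
    print(f"chunk {chunk_no} ({start}-{end} MiB): {bw:.2f} GByte/s")
```

On this sample only chunks 13 and 14 are flagged. The script only locates the drop; it says nothing about why those chunks are slow.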
 

Pneumothorax

Golden Member
Nov 4, 2002
1,182
23
81
It would only be a stupid statement if the 780ti did suffer the same issues as the 970 under the same settings and circumstances. If the Ti cruised smoothly under conditions where the 970 faltered at 3.5gb+ vram, then it is not a vram issue but simply GPU power that is slowing the 970 down. Did you not even consider that possibility ffs?

Nope, plain and simple. I wouldn't have sold my 780ti and 'upgraded' to a 3.5gb 970 SLI setup. One of the main reasons was that I thought I was getting a more future-proof setup than just adding another 3gb 780ti. I would've either just gone with a single 980 or CF 290x, both of which have full-speed 4gb of vram. Now that devs aren't being held back by 7-year-old consoles and have up to 5-6gb to play with on the xbone/ps4, the days of 3gb being enough for >1080P are over...
 
Last edited:

dacostafilipe

Senior member
Oct 10, 2013
810
315
136
It is flawed, since it doesn't reach the second segment on harvested parts.

Yes, but we know this already ... why bring that up again?

There are other numbers floating around the internet at the moment. For example Marc from Hardware.fr using OCCT 4.4.1 DirectX 11:

OCCT 4.4.1 error detection / memory usage, 970 Strix vs 980 ref
1500 MB setting = approx. 3337 MB of vram usage at the start of the test: 207.4 vs 246.9 fps
1750 MB setting = approx. 3935 MB of vram usage at the start of the test: 165.5 vs 246.1 fps

I'm not saying that this proves anything, just that we have to keep looking because this does not look right.
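To put the two OCCT results side by side, here is a small sketch that expresses the 970's frame rate as a fraction of the 980's at each setting. The function name `relative_fps` is made up for this example; the fps and vram figures are the ones quoted above.

```python
def relative_fps(fps_a, fps_b):
    """Frame rate of card A as a fraction of card B."""
    return fps_a / fps_b


# vram usage at start of test (MB) -> (970 Strix fps, 980 ref fps), from the post above
results = {
    3337: (207.4, 246.9),
    3935: (165.5, 246.1),
}

for vram_mb, (fps_970, fps_980) in results.items():
    ratio = relative_fps(fps_970, fps_980)
    print(f"{vram_mb} MB in use: 970 runs at {ratio:.1%} of the 980")
```

The 970 goes from roughly 84% of the 980 at about 3.3 GB in use to roughly 67% at about 3.9 GB, while the 980 itself barely moves. Two data points from one tool are not proof of anything, which matches the caution in the post.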
 

amenx

Diamond Member
Dec 17, 2004
4,616
2,930
136
Nope, plain and simple. I wouldn't have sold my 780ti and 'upgraded' to a 3.5gb 970 SLI setup. One of the main reasons was that I thought I was getting a more future-proof setup than just adding another 3gb 780ti. I would've either just gone with a single 980 or CF 290x, both of which have full-speed 4gb of vram. Now that devs aren't being held back by 7-year-old consoles and have up to 5-6gb to play with on the xbone/ps4, the days of 3gb being enough for >1080P are over...
Valid point, but not exactly what I was getting at. My point is from an analytical, testing perspective: a 780ti can be an illustrative example of what happens when or if a GPU runs out of vram. If someone can run the same game, settings, res, etc. to the point where the 970 falters @ 3.5gb+, and if the 780ti can do it without hitching, stuttering or faltering, then it is not a vram issue, but rather the 970 simply not having enough GPU power to push through. Your point, though, about purchasing two 970s to provide enough GPU power to complement a full 4gb vram card is well taken.
 

Abwx

Lifer
Apr 2, 2011
11,991
4,948
136
and if the 780ti can do it without hitching, stuttering or faltering, then it is not a vram issue, but rather the 970 simply not having enough GPU power to push through.

That's not a question of RAM quantity but of RAM speed. What if the 970 has enough GPU power but the RAM is too slow in some instances?

The 780ti will do better because there will be no stall in the data flow; it could even be slower FPS-wise, but with no stutter.
 

amenx

Diamond Member
Dec 17, 2004
4,616
2,930
136
That's not a question of RAM quantity but of RAM speed. What if the 970 has enough GPU power but the RAM is too slow in some instances?

The 780ti will do better because there will be no stall in the data flow; it could even be slower FPS-wise, but with no stutter.
You mean slow in the last 500mb section of its vram? And that this will hamper the fast 3.5gb section's data flow? If so, then it would be a poor design choice on Nvidia's part. Shouldn't the .5gb part be eliminated altogether?
 

Erenhardt

Diamond Member
Dec 1, 2012
3,251
105
101
You mean slow in the last 500mb section of its vram? And that this will hamper the fast 3.5gb section's data flow? If so, then it would be a poor design choice on Nvidia's part. Shouldn't the .5gb part be eliminated altogether?

That would be a poor marketing choice.
 

Abwx

Lifer
Apr 2, 2011
11,991
4,948
136
You mean slow in the last 500mb section of its vram? And that this will hamper the fast 3.5gb section's data flow? If so, then it would be a poor design choice on Nvidia's part. Shouldn't the .5gb part be eliminated altogether?

Yes, in essence the 768MB partition is useless for anything related to the game; it's just too slow to be of any real use. As said, that data can't be accessed by the functional compute units. They could just as well sell it with 3GB of RAM; performance would be exactly the same.

As for why they chose such a scheme: surely the GPU design started well before games became RAM hogs with usage approaching 4GB. The 970 wouldn't have been as appealing with 3328MB, so they resorted to this lame implementation to release a card with a fake 4GB memory pool. Notice that the 3328MB is accessed through a 208-bit bus, hence the GPU is physically unable to use the full bandwidth of a genuine 256-bit bus. The advertised 224GB/s is not accurate; the real speed is logically 182GB/s at stock settings. The OP's use of AIDA was very smart in this respect, since he didn't rely only on the CUDA bench.
 

Erenhardt

Diamond Member
Dec 1, 2012
3,251
105
101
so they resorted to this lame implementation to release a card with a fake 4GB memory pool. Notice that the 3328MB is accessed through a 208-bit bus, hence the GPU is physically unable to use the full bandwidth of a genuine 256-bit bus. The advertised 224GB/s is not accurate; the real speed is logically 182GB/s at stock settings. The OP's use of AIDA was very smart in this respect, since he didn't rely only on the CUDA bench.

Is that so? In my country that would force shops to offer a refund or a free upgrade to a 980 to anyone who bought a 970.
 

Abwx

Lifer
Apr 2, 2011
11,991
4,948
136
How did it now grow to 768MB?

There are only 208 bits active out of 256 when it comes to feeding the SMMs; 48 bits are inactive compute-wise, and those are used for the data in the 768MB pool. Do the maths starting from 4096MB: you'll have 3328MB addressed by the first 208 bits and 768MB addressed by the remaining 48 bits.

The discrepancy comes from the fact that the 768MB addressed by the remaining 48 bits can't be processed by the functional SMMs; there are no crossbars to send the data to the functional SMMs' caches.
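The arithmetic in this post can be reproduced with a short sketch. It assumes, as the poster does, that both capacity and bandwidth scale linearly with the number of active bus bits; that proportionality is the poster's premise, not a confirmed description of the hardware, and the function names are invented for the example.

```python
def split_by_bus_width(total_mb, total_bits, active_bits):
    """Split memory proportionally to bus bits (the poster's assumption)."""
    fast_mb = total_mb * active_bits // total_bits
    return fast_mb, total_mb - fast_mb


def scaled_bandwidth(advertised_gbs, total_bits, active_bits):
    """Scale the advertised bandwidth by the fraction of active bus bits."""
    return advertised_gbs * active_bits / total_bits


# The poster's figures: 208 of 256 bits active, 4096MB total, 224GB/s advertised.
fast_mb, slow_mb = split_by_bus_width(4096, 256, 208)
print(fast_mb, slow_mb)                 # fast pool and slow pool in MB
print(scaled_bandwidth(224, 256, 208))  # bandwidth of the fast pool in GB/s
```

With 208 active bits this gives the 3328MB / 768MB split and the 182GB/s figure from the posts above. Plugging in 224 active bits instead gives 3584MB / 512MB, which is the 3.5gb / 500mb split other posters in the thread are working from, so the disagreement over 768MB versus 500MB comes down to which bus width one assumes.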

Is that so? In my country that would force shops to offer a refund or a free upgrade to a 980 to anyone who bought a 970.

Not sure. All they have to do is use the 768MB for something, even something close to useless, and technically it will be a 4GB card. The potential weak point is the advertised 224GB/s bandwidth, which can be shown to be inaccurate. The 256-bit bus claim can't be attacked since the bus is effectively 256-bit; it's just that 48 bits are almost useless for anything other than said marketing.
 
Last edited:

amenx

Diamond Member
Dec 17, 2004
4,616
2,930
136
There are only 208 bits active out of 256 when it comes to feeding the SMMs; 48 bits are inactive compute-wise, and those are used for the data in the 768MB pool. Do the maths starting from 4096MB: you'll have 3328MB addressed by the first 208 bits and 768MB addressed by the remaining 48 bits.

The discrepancy comes from the fact that the 768MB addressed by the remaining 48 bits can't be processed by the functional SMMs; there are no crossbars to send the data to the functional SMMs' caches.

Not sure. All they have to do is use the 768MB for something, even something close to useless, and technically it will be a 4GB card. The potential weak point is the advertised 224GB/s bandwidth, which can be shown to be inaccurate. The 256-bit bus claim can't be attacked since the bus is effectively 256-bit; it's just that 48 bits are almost useless for anything other than said marketing.
In essence what you're saying is that not only is the 768mb part useless but detrimental. If they just built a straight 208bit bus card with only 3328mb vram, it would be faster with less potential of hitching or stuttering.
 

dacostafilipe

Senior member
Oct 10, 2013
810
315
136
Hardware.fr Forum:

Regarding the GTX 970 and the 3.5/4 GB question, it is fairly complex, and the misunderstanding stems from an error in the specs communicated to the tech press. Nvidia has just explained itself; check back at 19:00 to understand what is going on ;) (mini-NDA, yes :eek:)

Source: http://forum.hardware.fr/hfr/Hardware/2D-3D/unique-nvidia-maxwell-sujet_962857_393.htm#t9398954

Apparently, there was a mistake in the spec nVidia provided to the press. More information at 19h CET.
 

Magic Carpet

Diamond Member
Oct 2, 2011
3,477
234
106
The potential weak point is the advertised 224GB/s bandwidth, which can be shown to be inaccurate. The 256-bit bus claim can't be attacked since the bus is effectively 256-bit; it's just that 48 bits are almost useless for anything other than said marketing.

The 970 is just too gimped. Period.
 
Last edited:

ocre

Golden Member
Dec 26, 2008
1,594
7
81
It's ~3% relative to itself <3.5gb but ~5% relative to the 980, in that the 970 suffers more in that AVERAGE FPS result that NV posted. Note that even severe min fps spikes over a long bench will only affect the average fps by a small amount.

The Shadow of Mordor bench clearly shows major frame time stutters when vram is above 3.5gb; the results you posted from neogaf show that. You need to re-examine it: look at the time axis and compare the vram/frame time. When vram >3.5gb, you get lots of bad frame latency spikes. It's linked.

The skyrim gameplay video shows the stutter behavior also.

I hope it makes sense, if you take a bench that goes on for a few minutes, but one has stutters or low min fps spikes, while the other does not, overall their average fps is going to vary by a small margin.
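The claim that a handful of frame-time spikes barely moves the average can be checked with made-up numbers. The sketch below uses two synthetic runs invented for illustration (not measurements from any card): 3600 frames at 16 ms each, and the same run with 20 of those frames taking 116 ms.

```python
def summarize(frame_times_ms):
    """Return (average fps, worst frame time in ms) for a list of frame times."""
    total_s = sum(frame_times_ms) / 1000.0
    return len(frame_times_ms) / total_s, max(frame_times_ms)


# Synthetic data, assumed for the example only.
smooth = [16] * 3600
stutter = [16] * 3580 + [116] * 20

for name, run in (("smooth", smooth), ("stutter", stutter)):
    avg_fps, worst_ms = summarize(run)
    print(f"{name}: {avg_fps:.1f} fps average, worst frame {worst_ms} ms")
```

The average drops from 62.5 fps to about 60.4 fps, a little over 3%, while the worst frame time goes from 16 ms to 116 ms. An average-fps chart alone would hide exactly the kind of stutter being argued about here, which is why the frame-time plots matter.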

That is not what you experience when you run out of vram
You looked at those results and concluded that?

Coming from a 2gb card, when you run out of VRAM it becomes a slide show. There is a blip, but the settings are high enough to cause that blip all by themselves. Did you see how the performance leveled out? Even though the card jumped up to nearly 4gb, the performance, frame rate and frame times all leveled out and there is no consistent stutter.

Running dead smack out of VRAM is nothing like that at all. It's a slide show.

It would be much slower if data in the 500mb had to run back down the PCIe bus to system ram and then back up to be loaded into the 3.5gb, much slower than just having it load from system ram all along. Using the 500mb would then cause crippling performance, far worse than my experiences running out of VRAM on a 2gb card. Once I ran out of VRAM, it literally hard-stopped and became a true-blue slide show. The issue is, you're not able to recover. You're out of VRAM = completely unplayable.

The blip in the example I showed is not representative of running out of VRAM. Not at all. All of a sudden you have settings that demand almost all the VRAM, right up to the 4gb max, and there is a blip but everything levels out. You cannot level out when you're out of VRAM.

I cannot say for sure how nvidia is managing the ram that has no dedicated SM attached to it. But it's not acting like the ram goes back down the PCIe bus. It seems unlikely that data cannot be swapped on the GPU itself. It makes much more sense to me that the SMs stay saturated and fed well enough without the extra 500mb, and that is reason enough not to swap for data on the 500mb segment if you don't have to.

I am interested in how this is handled. Can't wait to see what reviewers find. But I have the 970 and have forced it to use more than 3.5gb. Just like people have noticed, it seems to avoid going over 3500mb. It will stay happily under unless I pile on higher resolutions/settings. But once it decides to go over, once the settings are high enough, it seems to stay above 3500mb no problem. It doesn't seem like I have run out of VRAM; it's not like that at all. The settings often make my card struggle but the frame rate is consistent.

So, I am not saying that there is no penalty for using the extra 500mb. I am not saying that at all. Really, I think that's the wrong way to even look at it. I believe that each SM has its own dedicated cache and memory blocks, and that it's engineered efficiently enough that each SM already has enough resources to completely overwhelm it. So if the data in the other segment has to be swapped, it's just an extra step that naturally would be avoided unless you absolutely have no other choice. I think this is the situation: the ram is usable, just generally not needed.

I have the card and have forced it to use up to 4 gb of ram. As long as I stay under 4gb, like 3800-3950mb the frame rate is steady. If I force the card over 4gb, or right on it, there is a huge difference. It's completely inconsistent and unplayable. Completely different scenario. The two are very different situations.

This is why I have an opinion on it. That 500mb seems very usable even though at first it seems to avoid it. Something is going on but the gtx970 can use up to 4gb of ram. I have been playing with it and this is what I see. Take it or leave it.
 

Pneumothorax

Golden Member
Nov 4, 2002
1,182
23
81
Is that so? In my country that would force shops to offer a refund or a free upgrade to a 980 to anyone who bought a 970.

Don't worry, we United States citizens get an even better deal! Some class action lawyers get together, sue Nvidia, and win millions of dollars each, while each poor sod gets a $10 rebate on their next Nvidia card. A win-win.
 