Question Diablo 4 causing GPUs to die

Ranulf

Platinum Member
Jul 18, 2001
2,385
1,264
136
Just a heads up for anyone trying the Diablo 4 beta this weekend. The game is apparently bricking 3080 Ti cards, Gigabyte ones in particular. There are reports of it hitting other cards too, including AMD.


While Diablo IV's lenient PC spec requirements indicate a well-optimized game, some users share troubling reports of their expensive graphics cards failing during gameplay. There have been multiple reports of NVIDIA RTX 3080 Ti GPUs failing while playing the Diablo IV early access beta, with symptoms like GPU fan speeds skyrocketing to 100% followed by an outright hardware shutdown.

Blizz forum post on it:


Jayz2c video:

 

Mopetar

Diamond Member
Jan 31, 2011
7,936
6,233
136
This reminds me of earlier years when people would use FurMark as a torture test or to ensure that their overclock was actually stable. Even review sites would use it to help test max power draw since it would usually max out the GPU more than any actual game or software could.

I think both AMD (then ATI) and NVidia hated it to a large extent and I remember them basically calling it a power virus, but they still had to keep their cards from blowing up while running it. This shouldn't be a difficult problem and there's no reason my 100 billion transistor GPU can't devote a few hundred thousand to preventing its own destruction.
 

coercitiv

Diamond Member
Jan 24, 2014
6,256
12,189
136
In other words folks, don't buy used video cards from gamers. What if they disabled vsync and the card is already one foot in the grave? /s
 
  • Like
Reactions: Leeea

Hail The Brain Slug

Diamond Member
Oct 10, 2005
3,178
1,531
136
But there are key differences.

1: A CPU does not have any power circuits on it. The motherboard handles all the power delivery. If this issue happened for a CPU, the CPU would not be impacted at all. It would be the motherboard that failed. Which HAS happened. But the blame is always on the motherboard maker, not the CPU manufacturer.

2: At most, a consumer level CPU has 16 cores. A GPU has thousands, so load transients can be significantly larger.

3: CPUs use less power than the GPUs being impacted by these issues. So if a load profile was created that could cause large transient spikes on a CPU, those spikes would be much smaller.

4: The types of loads a CPU sees are drastically different from what a GPU sees. CPUs being general purpose means they are constantly context switching. Video encoding or the like would be similar, but those loads are very constant. Little risk of transients.

And for the second time, I am not defending the GPU makers.

And for those saying there is no reason the GPU makers could not prevent this sort of thing: they are limited in what they can do. Yes, the board maker could add some sort of hardware overcurrent protection and have circuit breakers that shut all power off to the card if it hit the designated limit. However, those circuits rarely react fast enough to handle transients. So then it comes down to only triggering on sustained load, which would result in system crashes whenever that limit was hit. That would inevitably make end users angry and make the board manufacturer look bad.

Most cards already have software power limits, but these will also not catch transients. And if the software detects high power usage, all it can really do is ramp down clocks in an attempt to lower power consumption. These systems have to have a lot of averaging in them though, and they are slow to react.
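As a rough illustration of why an averaged software limit misses transients, here is a minimal hypothetical sketch (Python, not any vendor's actual driver logic; the window size and wattage limit are made-up numbers):

```python
# Hypothetical sketch of a rolling-average software power limiter.
# The limiter only reacts to the average, so a single brief spike never trips it.
from collections import deque

WINDOW = 50        # samples in the rolling average (made-up value)
LIMIT_W = 350.0    # sustained board power limit in watts (made-up value)

samples = deque(maxlen=WINDOW)

def should_throttle(power_w: float) -> bool:
    """Feed one power reading; return True only when the average exceeds the limit."""
    samples.append(power_w)
    return sum(samples) / len(samples) > LIMIT_W

# Steady 300 W load with one 600 W transient spike at the end:
readings = [300.0] * 49 + [600.0]
print(any(should_throttle(r) for r in readings))  # False: the averaged limiter never trips
```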

All of these issues can likely be traced back to the fact that newer high-end GPUs draw crazy amounts of power. With these very high loads, there is far less room for error in the power delivery circuit. We saw this with the 12-pin power connectors and the EVGA 3090s: a tiny build-up of tolerances resulted in catastrophic failure of the units. High-end GPUs used to draw only 200-300W. Now we have cards that draw 450W.
If you have to state multiple times you're not defending a party, you're probably defending the party.

This isn't 1999 when power viruses could actually exploit the hardware to damage it. It's 2023. The hardware can manage itself just fine. Anything less is a total and complete failure of the design or components.

Keep in mind New World and Diablo 4 are not maliciously designed applications - they are completely legitimate games designed to be games. If you can somehow twist and spin legitimate AAA games causing GPUs to physically damage themselves as not 100% completely the OEM's or vendor's fault, you need to reevaluate.
 

BFG10K

Lifer
Aug 14, 2000
22,709
2,978
126
And it's worth noting that with DX12 and Vulkan, games are much closer to the metal than they were with older APIs.
Can you please demonstrate anything intrinsic in those APIs that's specifically designed to kill cards? Show us a kill_hardware(), fry_card(), invoke_RMA_process(), or similar. Thanks.

The issue with New World was the game itself being poorly coded which resulted in load characteristics that caused massive spikes in current.
Games do not cause "massive spikes in current" or even "load characteristics". They call an intermediate software stack called an API.

The API is then translated by the GPU driver which causes the GPU UEFI/hardware to react based on how the driver programs it, including physical changes such as current and transistor load.

We do not yet know what is causing these failures, but it would not be surprising if it is caused by a load profile that induces radical current transients.
Which again means a faulty driver/UEFI/hardware stack on nVidia's cards.
 
  • Like
Reactions: KompuKare

Rebel_L

Senior member
Nov 9, 2009
449
61
91
Sounds nice, but it's just naive. How bad an input power are they required to handle? Are they going to insulate them from solar flares too?

At some point they have to be “used as intended” or they will break like any other tool. Usually they just break in a less permanent manner.

And that's ignoring the fact that with an engineering stack as complicated as these, there will be issues that are unknown at the time of engineering and that manifest themselves afterwards. Look at the hardware and software stack that's required for these to function - and the bugs and hand optimizations/fixes that happen on a title-by-title basis. It's clear that some of this interaction between software and hardware is made up as they go along, putting fingers in the holes in the dam.

I guess it's not hard for me to conclude, with a systems engineering background, that conditions exist for naughty software to put the hardware at risk, especially considering how non-standard and fully variable a hand-built PC is.

What if there are other bad variables? Lmao, like if these Gigabyte GPUs were bundled with their garbage PSUs for some sort of promotion 😂
If you don't think those things are accounted for, you are the naive one. While I'm sure they don't sit down and individually account for every variable they can think of, and instead cover it with a standard safety margin slapped onto the specs, that doesn't mean it's not accounted for. If GPUs couldn't handle solar flares, everyone would lose all their hardware on a daily to weekly basis.

No one said it's easy to design these things with all the variables to consider, but it seems pretty obvious that if your card's circuitry allows x power to reach a component, that component had better be capable of handling x power. If your idea of "intended use" for a GPU doesn't cover situations like trying out a public beta for a new game that uses your hardware differently than a previous game did, you should put large disclaimers on the box warning people of that. Working in an industrial field, I can tell you that design there is expected to cover the possibilities. If you want a cheaper component in place, you had better have interlocks or something else in place that ensures the part can't be exposed to over-spec conditions. To not design for that is certainly not considered "good" engineering.

It does seem that the engineering on GPUs is generally pretty decent, as these kinds of issues are not that common, but that doesn't mean we shouldn't call them out when they screw up and come up with flimsy excuses to try to deflect blame.
 

Saylick

Diamond Member
Sep 10, 2012
3,217
6,583
136
Thanks for the PSA. My brother has a 3080 Ti and he loves Blizzard games, so I immediately told him NOT to touch the D4 beta. He doesn't have a Gigabyte version (it's a Founders Edition), but you can't be too safe. I ain't tryna pay another $650 to replace it with something equivalent.
 

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136
Sounds like more people who think frame caps are a joke and that they should be able to run any game they want at unlocked frame rates. Because as we all know, in this PvE game, higher FPS means you kill demons way faster.

I just do not understand people who think it's fine to run the GPU at 100% load with uncapped FPS all the time.
 

Ranulf

Platinum Member
Jul 18, 2001
2,385
1,264
136
Sigh. Under-engineered cards FTL.

Maybe. It could also be bad programming in some games. Total War: Warhammer 3 has a known problem on many cards of pushing the GPU to 100% on the campaign map. It means my 2060S runs its fans at 2500 RPM no matter the graphics settings, unless I cap the frames to 30 fps in the Nvidia control panel.
 

VirtualLarry

No Lifer
Aug 25, 2001
56,400
10,082
126
Maybe. It could also be bad programming in some games. Total War: Warhammer 3 has a known problem on many cards of pushing the GPU to 100% on the campaign map. It means my 2060S runs its fans at 2500 RPM no matter the graphics settings, unless I cap the frames to 30 fps in the Nvidia control panel.
Could do what miners do, power-limit in AB and lower core clock.
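For anyone who would rather script it than use Afterburner, roughly the same idea can be done through nvidia-smi. The sketch below is only an illustration under assumptions: it assumes a reasonably recent driver where --power-limit and --lock-gpu-clocks are supported, it needs admin rights, and the 250 W / 1700 MHz numbers are placeholders, not recommendations.

```python
# Hypothetical illustration only: cap board power and lock core clocks via nvidia-smi.
# Requires admin/root and a driver that supports these flags; values are placeholders.
import subprocess

GPU_INDEX = "0"

# Lower the sustained power limit (watts); must stay within the range the driver allows.
subprocess.run(["nvidia-smi", "-i", GPU_INDEX, "--power-limit=250"], check=True)

# Lock the core clock to a fixed range (MHz), similar in spirit to a miner-style downclock.
subprocess.run(["nvidia-smi", "-i", GPU_INDEX, "--lock-gpu-clocks=300,1700"], check=True)

# To undo the clock lock later:
# subprocess.run(["nvidia-smi", "-i", GPU_INDEX, "--reset-gpu-clocks"], check=True)
```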
 
  • Like
Reactions: Shmee

Saylick

Diamond Member
Sep 10, 2012
3,217
6,583
136
If a card can't handle that, then it's faulty defective design.
In a vacuum, I think this is the correct take. If a card can't do 100% sustained load for an extended period of time that is analogous to a long gaming session, the card wasn't built right. This naturally includes the power circuitry along with the cooling system. However, I also do think that game developers should be cognizant that certain portions of their games should not be pegging the GPU at 100% load, e.g. menus and pre-scripted animations. It shouldn't require an fps cap that is set by the user to resolve this.
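For what it's worth, capping a menu's frame rate on the developer side is not much code. A bare-bones sketch, assuming a simple single-threaded loop, with a hypothetical render_menu() standing in for the real draw call:

```python
# Hypothetical sketch of a developer-side frame cap for menus and cutscenes.
import time

MENU_FPS_CAP = 60                   # made-up cap for static screens
FRAME_BUDGET = 1.0 / MENU_FPS_CAP

def render_menu():
    """Placeholder for the real menu draw call."""
    pass

def run_menu_loop(frames: int = 600) -> None:
    for _ in range(frames):
        frame_start = time.perf_counter()
        render_menu()
        # Sleep away the rest of the frame budget instead of redrawing
        # a static menu thousands of times per second at full GPU load.
        remaining = FRAME_BUDGET - (time.perf_counter() - frame_start)
        if remaining > 0:
            time.sleep(remaining)

run_menu_loop()
```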
 
  • Like
Reactions: Pohemi and Ranulf

Ranulf

Platinum Member
Jul 18, 2001
2,385
1,264
136
Could do what miners do, power-limit in AB and lower core clock.

Yup, that is the other solution: lower clocks etc. to maybe 75-85%. It is just easier to limit it to 30 fps, though that sucks for the RTS battles. It is clearly the game, though. The recommended settings for my system put almost everything at high, and tweaking most settings does nothing except, I think, turning AA off completely. I've kinda given up on the game anyway and gone back to WH2.
 

Ranulf

Platinum Member
Jul 18, 2001
2,385
1,264
136
100% is not "pushing".
GPUs have to be able to handle that.

Sure, what I mean is that the game should not be pegging a GPU at 100% given what is displayed on screen. If it does, fine, but it is sloppy coding by the devs when it happens on a cutscene or a menu screen, or when the WH3 campaign map, which is not objectively better looking than WH2's, taxes a system that can run WH2 at ultra settings at over 80 fps with vsync off. It's a game where the RTS battles, with hundreds of little units fighting on screen, stress the card less than an animated campaign map does.
 
  • Like
Reactions: Pohemi

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136
If a card can't handle that, then it's faulty defective design.

While I believe that's likely the case here, there is nuance to it.

100% utilization with static load is one thing.

100% utilization with sudden changes in load is another. In the case of New World, the sudden changes in load caused massive current spikes. The utilization stayed 100%, but the type of load would suddenly change.

When you have chips that suffer from crazy transient issues like the 3000 series does, these fluctuations in the load make it significantly worse.

My comment was aimed more at my annoyance anytime I run across somebody who absolutely has to run their games with uncapped FPS. Absolutely nothing is gained from it in an action RPG like Diablo. If you want to run the game at a high refresh rate, that's one thing: 120 Hz, 144 Hz, etc. The game is very well optimized, so that's not hard to do. But if somebody kills their card because they were running hundreds, and in some cases thousands, of FPS, I have zero pity for them.
 

ZGR

Platinum Member
Oct 26, 2012
2,054
661
136
I'm pretty sure this game would eventually kill my 3080 if I left it at stock. The stock fan curve at stock voltage sees my junction temps go well above 105C, and power consumption is over 300W sustained. I like to stay below 95C and well under 0.9V.

I'm not gonna test this. I like my GPU. I think this is another wake-up call to undervolt cheaply made RTX 3000 GPUs. Performance scaling is awful past 0.85V anyway.
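If anyone wants to sanity-check their own card while testing an undervolt, a rough read-only monitoring loop is sketched below. It assumes the nvidia-ml-py (pynvml) package is installed, and it reports the core temperature, since junction/hotspot isn't exposed through this particular call.

```python
# Rough read-only monitoring sketch using the pynvml (nvidia-ml-py) bindings.
# Logs board power draw and core temperature once per second for about a minute.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    for _ in range(60):
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"power: {power_w:6.1f} W   core temp: {temp_c} C")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```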
 

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136
I'm pretty sure this game would eventually kill my 3080 if I left it at stock. The stock fan curve at stock voltage sees my junction temps go well above 105C, and power consumption is over 300W sustained. I like to stay below 95C and well under 0.9V.

I'm not gonna test this. I like my GPU. I think this is another wake-up call to undervolt cheaply made RTX 3000 GPUs. Performance scaling is awful past 0.85V anyway.

And what happens if you turn on vsync?

Software should not be able to kill hardware operating in 'normal' conditions.
Imagine if CPUs were this finicky and sensitive.

Right, but running without a frame cap isn't "normal" conditions.