Question Reference Vega64 Temps Jumped Overnight

May 3, 2019
53
20
81
I suddenly started experiencing game stuttering. Normally I run maxed settings, max AA, and uncapped FPS (240-ish, which is good for game tick). I checked the GPU temp and when it hit 85c it would start. I read that it was likely throttling at 85c causing it. I had to manually tune the fan rpm to keep it under 85c and working properly. I think the standard fan settings were nothing over 45% rpm; now I'm at nearly double that, and that's with no AA and fps capped at 144. Anything more and I'll be at 100% fan speed.

It works, but this is all out of the norm. I'm mostly curious about the health of the card. There was a crazy heat wave here 3 weeks ago. Maybe that did some damage (but 3 weeks later and suddenly?). The fan revs up nicely; I got it to over 4k rpm and there are no bad bearing sounds. Right now my best guess is that the thermal paste suddenly expired? Yeah... I don't like that that's my best guess. Any advice or insight on these symptoms would be nice.

I've tried rolling back and updating graphics drivers as well. Verified game files, but it's multiple games/platforms.

Also, I had my pci bus drivers randomly reinstall and require a reboot, twice. I suspected the graphics driver, but after all the graphics driver updates it hasn't done it again.

Thanks for your time. =)
 

DAPUNISHER

Super Moderator CPU Forum Mod and Elite Member
Super Moderator
Aug 22, 2001
28,446
20,437
146
First thing: Look up a vid on how to replace the thermal paste on the ref. model.

Next: Look up how to under-volt the ref. model. Unless you are an ultimate loser of the silicon lottery, your card will run its rated speeds on less voltage. Less voltage = less heat.
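As a rough illustration of why less voltage means less heat: dynamic switching power in CMOS logic scales approximately with V² × f. The sketch below assumes that simplified model (it ignores leakage and HBM power), and the voltages are hypothetical examples, not measured values for any particular Vega 64.

```python
def relative_dynamic_power(v_new, v_old, f_new=1.0, f_old=1.0):
    """Ratio of dynamic power after a voltage/clock change, using P ~ V^2 * f."""
    return (v_new / v_old) ** 2 * (f_new / f_old)

# Hypothetical example: dropping core voltage from 1.20 V to 1.05 V at the same clock.
ratio = relative_dynamic_power(1.05, 1.20)
print(f"Estimated dynamic power: {ratio:.0%} of stock")
```

Because voltage enters squared, even a modest undervolt at the same clock gives a noticeably larger drop in heat than the voltage change alone would suggest.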
 

DAPUNISHER

I did the auto-undervolt tune in the drivers. I did that after all of this started, but yes, something I plan on keeping.
Then you got this. Replacing the paste is very straightforward, and should get you back to normal.

You should probably use DDU in safe mode to wipe all the drivers and start fresh just to be safe. And since you don't see artifacts or any other symptoms, hopefully the card is fine.

EDIT: worst case is you may have to underclock it to be stable. Sometimes even faulty ram will play nice for another year or more that way.
 

Leeea

Diamond Member
Apr 3, 2020
3,617
5,363
136
I checked the GPU temp and when it hit 85c it would start. I read that it was likely throttling at 85c causing it.

Verify the heatsink is clear of dust and debris. Verify your power supply is clear of dust and debris.

If heatsink is clear of junk, it is time to paste it up.

If that does not work, take a hard look at your power supply*. It is not uncommon for components to overheat as the power supply goes. Undervolting the graphics card could help with this, simply because it would reduce the load on the power supply.

*as the power supply fails volts go down, amps go up, heat goes way up
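The footnote above can be sketched with basic arithmetic, assuming the GPU's regulators behave roughly like a constant-power load: if the rail voltage sags, current rises (I = P / V), and resistive heating in cables and connectors rises with the square of the current (I² × R). The wattage and resistance below are hypothetical round numbers chosen only to show the direction of the effect.

```python
def cable_loss_watts(load_watts, rail_volts, resistance_ohms):
    """I^2 * R heat in cabling/connectors for a load that draws constant power."""
    current = load_watts / rail_volts      # I = P / V
    return current ** 2 * resistance_ohms  # P_loss = I^2 * R

# Hypothetical numbers: 300 W GPU load, 0.02 ohm of total cable/connector resistance.
for volts in (12.0, 11.4, 10.8):
    loss = cable_loss_watts(300, volts, 0.02)
    print(f"{volts:4.1f} V -> {300 / volts:5.1f} A, {loss:4.1f} W lost as heat")
```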

I had my pci bus drivers randomly reinstall and require a reboot
-raises eyebrow- that is not normal, and is not caused by the graphics card or the graphics card drivers
 
May 3, 2019
53
20
81
I verified the system was clean. I keep my computers spotless so I wasn't expecting anything, but you can't post for help and not take the advice right?

New events: When I had my computer opened up I decided to reconfigure my water cooling and lighting. When I put it back together I connected only the power cord and powered it up to verify fan directions and pump function. Then I hooked everything up. I don't normally do it that way; I'd usually hook everything up and then power on. While I was hooking it up I could hear the GPU fan start. Then the red lights that say Radeon on the side would blink and the fan would stop. Then 5 seconds would go by, the fan would start, the lights would blink once, and the fan would stop. Repeat. Once the mouse and keyboard were hooked up I logged in to Windows and it stopped... I read that the PCI bus hiccup could absolutely be caused by the driver, which makes sense given it hasn't happened again, and it's odd that the blinking/fan nonsense would stop after signing in.

I can't imagine thermal paste would "suddenly" fail. It simply doesn't make sense to me, but I've never had a card act like this before and it's getting weirder by the minute.

My PSU is an AX1200i by Corsair. How might I test it?

I'm waiting until tomorrow morning to try the DDU driver suggestion made here. I'll report back, Thanks again for the help so far.
 

DAPUNISHER

My PSU is an AX1200i by Corsair. How might I test it?
While an older model, that is a great PSU. And it powers up the system fine. The easiest way to check if all the rails are holding steady is to monitor them with HWiNFO while gaming and see if the minimum on any of the rails is dipping too low/out of spec. I doubt it is the culprit.
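The rail check described above could be automated against a logged sensor file. This is a minimal sketch, assuming the readings have been exported to CSV and using the ATX tolerance of ±5% on the +12V, +5V and +3.3V rails. The column names and sample readings are hypothetical; a real export will label its columns differently and they would need to be mapped first.

```python
import csv
import io

# ATX tolerance for the main positive rails is +/-5% of nominal.
NOMINAL_RAILS = {"+12V": 12.0, "+5V": 5.0, "+3.3V": 3.3}
TOLERANCE = 0.05

def find_rail_dips(csv_text):
    """Return {rail: minimum reading} for rails whose minimum is below spec."""
    minimums = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for rail in NOMINAL_RAILS:
            if row.get(rail):
                value = float(row[rail])
                minimums[rail] = min(value, minimums.get(rail, value))
    return {
        rail: low
        for rail, low in minimums.items()
        if low < NOMINAL_RAILS[rail] * (1 - TOLERANCE)
    }

# Hypothetical log excerpt; real exports use different, longer column names.
sample_log = """Time,+12V,+5V,+3.3V
20:01:00,12.10,5.02,3.31
20:01:02,11.95,5.01,3.30
20:01:04,11.30,5.00,3.29
"""
print(find_rail_dips(sample_log))
```

Software voltage readings come from motherboard sensors and are not very precise, so a flagged dip is a reason to investigate further rather than proof of a failing PSU.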

And while paste does not suddenly fail, perhaps the temps were creeping up and you did not notice? And there does come a point when it dries out enough to separate and stop making good contact in places. That's when people definitely take note.

I don't know what the card's normal behavior is, but if you have another x16 slot on your board, you could try it in the 2nd, just in case the primary is getting wonky.
 

solidsnake1298

Senior member
Aug 7, 2009
302
168
116
While an older model, that is a great PSU. And it powers up the system fine. The easiest way to check if all the rails are holding steady is to monitor them with HWiNFO while gaming and see if the minimum on any of the rails is dipping too low/out of spec. I doubt it is the culprit.

And while paste does not suddenly fail, perhaps the temps were creeping up and you did not notice? And there does come a point when it dries out enough to separate and stop making good contact in places. That's when people definitely take note.

I don't know what the card's normal behavior is, but if you have another x16 slot on your board, you could try it in the 2nd, just in case the primary is getting wonky.

Didn't Vega cards use some sort of semi-custom graphene thermal pad instead of thermal paste? Because of the slight height difference between the GPU die and HBM.
 

solidsnake1298

Looks like I was thinking of the Radeon VII.

 
May 3, 2019
53
20
81
I followed the DDU directions in safe mode etc and there's been no change. Worth a shot though.

I tried my number 2 PCI slot (x8) and there was no change.

I have a post here about thermal pads as I've never changed them out before and I wanted size recommendations. I'll grab some AS5 or ICD7 and tackle that in a few days.

Once again, I appreciate all the help so far.
 
May 3, 2019
53
20
81
I changed out the pads and paste and there was a decrease in temps. I was able to enable SMAA which just wasn't possible before, but I still had to cap fps to 144. It would still stutter like it did at 85c, but now at 82c so I still had to goose the fans at higher temps. I was able to maintain good playable results with AA but things did not return to how they were before. I'm not sure if my thermal paste needs to break in or not. Thanks for the help.
 

Leeea

I changed out the pads and paste and there was a decrease in temps. I was able to enable SMAA which just wasn't possible before, but I still had to cap fps to 144. It would still stutter like it did at 85c, but now at 82c so I still had to goose the fans at higher temps. I was able to maintain good playable results with AA but things did not return to how they were before. I'm not sure if my thermal paste needs to break in or not. Thanks for the help.
Did you ever check your power supply voltages in HWInfo* while your game or etc was running?

*HWinfo is free software
 

DAPUNISHER

I changed out the pads and paste and there was a decrease in temps. I was able to enable SMAA which just wasn't possible before, but I still had to cap fps to 144. It would still stutter like it did at 85c, but now at 82c so I still had to goose the fans at higher temps. I was able to maintain good playable results with AA but things did not return to how they were before. I'm not sure if my thermal paste needs to break in or not. Thanks for the help.
You mentioned at the beginning that the undervolt is auto. You may have to experiment with it manually. Your GPU and/or HBM may be degrading i.e. plain old wearing out. It may be at a point where it needs more voltage to maintain the same or less performance. You have done so much already, there is little left to point the finger at.
 
May 3, 2019
53
20
81
Did you ever check your power supply voltages in HWInfo* while your game or etc was running?

*HWinfo is free software

I forgot to do that. Thanks for the reminder. I wasn't able to see where that data might be extracted. I'll need to do some more research on that. I tried HWMonitor since I've been a long time supporter and they had some data logs, but it wasn't easy for me to understand. But, it might be moot. Moot and very helpful.

You mentioned at the beginning that the undervolt is auto. You may have to experiment with it manually. Your GPU and/or HBM may be degrading i.e. plain old wearing out. It may be at a point where it needs more voltage to maintain the same or less performance. You have done so much already, there is little left to point the finger at.

That thought occurred to me and I removed the undervolt yesterday during testing.

However, there were some breakthroughs today! When I was attempting to tax the system for the PSU logging (in my normal fashion) I couldn't get the GPU over 75c. I uncapped FPS, made sure SMAA was enabled, and de-goosed the fans. I couldn't get it above 83c and there was no stuttering. I went as slow as 38% on fans at 80c and it hit 86c with no stuttering. Obviously I won't leave the fans that low, but I witnessed one stutter in 20 mins of testing which is absolutely acceptable. I'm unsure about whether to add the under-volt back.

The lesson I learned here is to change your paste every few years and, if you do, allow for it to break in. You guys nailed it on the first guess. Thanks so much!
 
May 3, 2019
53
20
81
Couple weeks later update.

Everything was going great, no stuttering, temps were good. Today while gaming the screens went black. It rebooted ok, but the driver settings for the fans I changed earlier were all set to automatic again. I assume that's because it crashed, but I'm not sure. Now games stutter quite a bit worse than before and at 75c now instead of 85c.
 

Leeea

I assume that's because it crashed, but I'm not sure.
Every time an amd card crashes, all of the settings are reverted to default.

It is annoying, but kind of makes sense. I have crashed a lot of amd cards over the years when I set out to abuse them.


Everything was going great, no stuttering, temps were good. Today while gaming the screens went black. It rebooted ok, but the driver settings for the fans I changed earlier were all set to automatic again. I assume that's because it crashed, but I'm not sure. Now games stutter quite a bit worse than before and at 75c now instead of 85c.

Something weird is going on. I imagine you did a visual inspection when you had it apart. If you pull back on the power bar, (cut it back to 50% or whatever), does that reduce the stuttering? (you will drop FPS when you do this, but looking to see if the stuttering goes away)
 

DAPUNISHER

Time to test another power supply. After that, try undervolting and underclocking to address potential component degradation.
 

Leeea

address potential component degradation.
Yeah, I am wondering that too. But we need to get the more likely possibility of power issues out of the way first.

That is a nice power supply he has, but every company has one fail a bit early every once in a while. In my very limited experience, the higher wattage units, specifically 950 watts and larger, seem to fail more frequently, regardless of company.

Thinking specifically the memory. The GPU checksums the memory, and it will reject bad data.

His situation is similar to when the memory is pushed too far when overclocking, and the GPU rejects the corrupted data from the RAM and either re-reads it or requests it from system memory. That can create a stutter of doom.
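To make the re-read idea concrete, here is a toy model in Python. It is not how the GPU's memory controller actually works; it only assumes the general pattern described above, where a block that fails its checksum is read again, and every extra read costs time. The `read_with_retry` helper and the "flaky memory" data are hypothetical.

```python
import zlib

def read_with_retry(read_block, expected_crc, max_retries=5):
    """Re-read a block until its CRC32 matches; return (data, attempts used)."""
    for attempt in range(1, max_retries + 1):
        data = read_block()
        if zlib.crc32(data) == expected_crc:
            return data, attempt
    raise IOError("block still corrupt after retries")

good = b"texture data"
# Hypothetical flaky memory: the first two reads come back with a corrupted byte.
reads = iter([b"texture dat\x00", b"textur\x00 data", good])
data, attempts = read_with_retry(lambda: next(reads), zlib.crc32(good))
print(attempts)  # each extra attempt is time the frame spends waiting
```

The data eventually arrives intact, so nothing shows up as an artifact on screen, but the retries add latency, which is one plausible way marginal memory could show up as stutter rather than corruption.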

Re-pasting likely helped him with this, as he would have repasted all the HBM memory chips too.

But I stuck this in a spoiler because my gut says power issues. Leaky caps, flaky VRMs, and power supplies on the fritz seem far more common.

The power supply seems like his logical next stop.
 
May 3, 2019
53
20
81
He really should stop that. :rolleyes:

I've had some deadlines so I couldn't mess with it, but today I walked away and came back and the fans, pump, and lights were all on, but the screens were black. The debug light on the motherboard said CPU. This reminded me of some other odd things that happened around the same time, like the PCI bus reinstalling (mentioned above). So I promptly pulled the PSU and threw in a 750W I had around, and the card still stuttered if I got it to 77c. Not the PSU.

I wasn't sure what to try so I:

volts-low.png


Defaults:
volts.png


Results:
165fps @ 74c
102fps @ 85c
Zero stuttering.
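For a sense of scale, the two results above can be converted to frame times. This is plain arithmetic on the reported figures, nothing measured beyond what is posted.

```python
def frame_time_ms(fps):
    """Average time per frame in milliseconds."""
    return 1000.0 / fps

cool, hot = 165, 102  # fps figures reported above at 74c and 85c
drop = (cool - hot) / cool
print(f"{frame_time_ms(cool):.2f} ms -> {frame_time_ms(hot):.2f} ms per frame "
      f"({drop:.0%} fewer frames)")
```

That works out to roughly 6 ms per frame at 74c versus nearly 10 ms at 85c, which is a smooth slowdown rather than the stutter seen earlier.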

Progress is nice, but I was pretty happy it was the PSU. He has to get over that disappointment.
 
May 3, 2019
53
20
81
It stopped booting consistently. CPU debug light. Tried a few times. DRAM debug light. Then only DRAM. Swapped sticks and tried every config: a2/b2 didn't work but all possible others did. Then a2/b2 worked again for some reason. It wouldn't boot with only 1 stick, which I found odd. I'm running memtest86 now. It's made it through 2 passes with no errors.

This all happened when I put the old PSU back in, because we thought we had ruled it out in favor of component degradation.

Just a tad overwhelmed so I'll come back to it tomorrow.
 

DAPUNISHER

You have done more than due diligence, and have reached the point where we joke about exorcism and gremlins. But being more serious, maybe the board is the bad guy here. Certainly a cheaper replacement than a GPU right now, if it comes to that.

Grabbing a decent in-stock AM4 board from your local Best Buy or MicroCenter, or ordering from Amazon so you can return it without a restocking fee, would be my next move. Testbed it barebones on the box, and replicate results if possible. If it stutters still, maybe the card is on its last legs. I would also use an annoyingly high fan curve to make certain it is getting good heat dissipation.
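For anyone unsure what an "annoyingly high" fan curve looks like in practice, here is a minimal sketch of the usual piecewise-linear mapping from temperature to fan duty. The curve points are hypothetical and would be entered by hand in the driver's tuning panel; this code does not talk to any driver.

```python
# Hypothetical (temperature C, fan %) points for an aggressive curve.
FAN_CURVE = [(40, 30), (60, 50), (75, 80), (82, 100)]

def fan_percent(temp_c, curve=FAN_CURVE):
    """Linearly interpolate fan duty between curve points, clamped at both ends."""
    if temp_c <= curve[0][0]:
        return float(curve[0][1])
    for (t0, f0), (t1, f1) in zip(curve, curve[1:]):
        if temp_c <= t1:
            return f0 + (f1 - f0) * (temp_c - t0) / (t1 - t0)
    return float(curve[-1][1])

for temp in (35, 70, 85):
    print(temp, round(fan_percent(temp)))
```

The point of such a curve is to reach full fan speed a few degrees before the temperature where the stuttering was observed, so the card never sits at that threshold.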
 
May 3, 2019
53
20
81
I was thinking about it and it's funny. The last time I had these types of issues was 1999. It was my second computer build. I was buying parts at Fry's because I hated myself. I did all the testing and decided it was the MB. I returned it and I had the same problems. Questioned my results, retested, and still suspected the MB. My cousin worked there and told me they buy bad lots at discount rates. Stuck the next one in and pulled it out immediately. I went back one more time and got a board that worked and lasted me many years. Asus A8N-E Deluxe, AMD 3200+ :cool:

I'll grab a new MB and see how it goes. It's actually one of the cheapest solutions. However, in my experience ghost-in-the-machine errors come from the PSU more often, and motherboards start bad but don't go bad (unless you overclock). If it ends up being the PSU I'll likely keep this motherboard/CPU for my wife.

I really appreciate the help.
 

DAPUNISHER

I was buying parts at Fry's because I hated myself.
LULZ that is comedy gold. Never been to one, but when they were going OOB I read a bunch of stories about their terrible practices. Stuff like, you had to check the box before leaving the store, because they would put broken returns right back out for sale. Then deny the return because it is broken.

Anyways, hopefully it is a less expensive part than the GPU behind the demonic possession. Good luck, and thanks for checking back in. Hoping you get to mark this thread solved at some point before the heat death of the universe. ;)
 

Ranulf

Platinum Member
Jul 18, 2001
2,348
1,165
136
There was an old website with Fry's horror stories, mostly from former and current employees. I can't remember the site name but they were hilarious and shocking. The last time I bothered to make the 45min trip to the one near me for parts was 10+ years ago. Fry's was also a big proponent for the Door Nazi security check phenomenon. All the drama during checkout, parts brought by staff to the cashier and then some goon at the exit had to check your receipt despite the fact that you were funneled from the cashier to the exit with no way to go back into the store but out the exit and back in the entrance.
 
May 3, 2019
53
20
81
They aren't exaggerations. We had to tell the guy that he was putting the boards we were returning back onto the shelves. He just shrugged so we stashed them under the heaviest and tallest stack of motherboards we could find so nobody would buy them.

So my little story about my early system build experience made me realize something, I've gone soft. At some point you get married, have kids, and your time is worth so much that you just become a consumer even though you have the knowledge and experience and interest. Time passes and you assume everything has changed so you reach out for help (which you guys did!), but it started to bother me. I unpacked the old test bench, stress tested everything, modified the heatsinks on a Morpheus 2 to fit my vega and we're good. Things haven't changed that much.

Temps (depending on ambient):
Idle 25-40c
Load 35-54c

There was a dodgy cable in the mix and I believe some RAM wasn't seated properly at one point which caused the boot issues (and then no boot issues). PSU and MB are fine. And obviously as you pointed out the card has degraded (114f heat wave pushed it over the edge) and at these temps and underclock I hope to ride it out until the GPU market levels off. Can't say thanks enough! I have a lot of fun. :)
 