Installed a new cooler on my Vega64; temps (mostly) look great, but 3DMark is now crashing my PC

Dankk · Sep 10, 2017

Today I installed a MORPHEUS II CORE cooler on my Vega64. It doesn't look like I've broken anything (yet); the system boots with no issues. But when I run a graphics-intensive program such as 3DMark, the program will either crash or shut down my PC entirely after just a couple of minutes.

Under load, both my GPU and HBM temperatures look very good... around 60 (C) for both.

In GPU-Z though, the "Hot Spot" reading starts going over 100 degrees. Could that be causing it? With the stock cooler, this sensor was already getting pretty hot (up into the 90s), but with the new cooler, it seems to be getting even hotter.

My power supply is a Rosewill Capstone 550w Gold-certified unit. I know 550w is maybe cutting it close for Vega, but normally this is a really efficient PSU, and I had no issues with crashing before installing the new cooler.

(There was some thermal throttling with the stock cooler, but even before it reached the thermal limit, it would run just fine for a few minutes with no throttling or crashing.)

Speaking of throttling: Another thing I've noticed is that while running 3DMark, my GPU speed begins throttling and fluctuating all over the place. I can actually see 3DMark stuttering when it happens. But I'm nowhere near the GPU's thermal limit. I have the power limit set to +50% as well (crashing happens with the power limit set to either 0 or +50%, doesn't matter). So why would it be throttling?

Any advice would be appreciated. Thanks.

Dankk · Sep 11, 2017

I'm beginning to wonder if I don't have good enough heatsink contact on the card's VRMs and MOSFETs. I thought I did, but in hindsight, there are some areas where I could've done better. Tomorrow I think I'll try disassembling the cooler and re-applying the heatsinks to see if that helps.

Wall Street · Sep 11, 2017

If you have a hotspot, then you spread your thermal paste poorly and/or didn't use enough. A GPU isn't like a CPU heat spreader - your GPU core absolutely needs 100% coverage. I would try to remount the cooler. I do a few things differently from mounting a CPU heatsink when doing a GPU:

1. Use quite a bit more thermal paste - a grain of rice is not even close to enough.
2. I like to apply a very small amount to both the cooler and the GPU die and, with my finger wrapped in some Saran wrap to keep finger oils off, rub it into both surfaces.
3. Then I apply a pea-sized drop to the GPU core (larger for the biggest GPUs) and, instead of just mounting the cooler straight down, I move the cooler in a small circular motion 2-3 times before tightening it down to help spread the TIM.

Remember, too much TIM just makes a mess and almost never hurts you.

R0H1T · Sep 11, 2017

Have you tried undervolting using Wattman? I've seen results and even video reviews where undervolting makes a significant difference.

naukkis · Sep 11, 2017

PSU problem. With better cooling, the GPU will use more power. If your PC powers down completely, it's a severe overload and the PSU's protection circuits are kicking in.

Minimum recommended PSU for Vega 64 is 750w.

EXCellR8 · Sep 11, 2017

+1 on PSU... you need a 650w absolute min. 750w is an "ideal min" in my opinion and that's what I run.

Even if the card is running super hot, it shouldn't crash the entire computer.

thilanliyan · Sep 11, 2017

If the card was running fine before the cooler change then it isn't the PSU, unless it was throttling severely enough to reduce clocks/volts/power draw.

EDIT: On 2nd look at that PSU, yeah I'd not rule that out as the cause. Of 65 reviews on newegg, 9 of them were bad...not good percentage wise.

thilanliyan · Sep 11, 2017

OP, how did you mount the cooler? The hole spacing on Vega is different to every card listed on the compatibility list.

krumme · Sep 11, 2017

I run an RX 64 at 50% perf on a 500w PSU with no issues. Same for an RX 56 on another 550w PSU. Both are highest-end models, a be quiet! E10 and a Corsair something. The watt label means nothing, just like the watt rating on a music amp. It's like people never understand that.
E.g. the be quiet! E10 500w PSU has 480w on the 12V rails and 130w on the 3.3V and 5V rails, with good specs to boot at that load. With less strict demands and a later shutdown, it could even be labelled a 650w PSU.

As for this PSU, I even have trouble finding proper specs, and it seems cheap to me. Never good signs.

Reapply paste as described, then set power to -35%. If the problem goes away, your PSU is probably too weak.

Dankk · Sep 12, 2017

Hey guys. I appreciate the advice, but I'm not convinced that my PSU is at fault here. Consider the following:

I tried lowering the Power Limit in Wattman to -15%, and then booted up 3DMark to see what would happen. To be fair, yes, I was able to run the benchmark much longer this time without crashing. However, after ~10 minutes or so, my PC still eventually shut off.

Now, my PC and all of its peripherals are connected to a UPS (uninterruptible power supply) under my desk. The UPS has a digital readout showing how many watts the system is pulling. When I run 3DMark on a loop, and I have the Power Limit set to -15%, the UPS shows no more than ~380w being used. (Keep in mind that this "380w" includes both of my monitors, and other accessories plugged into the UPS, so in reality, my PC itself is using even less power. But to be generous, we'll just pretend that the PC is drawing all of those 380 watts.)

That would mean: My 550w gold-rated PSU is dying at only 380w power usage. Okay, so maybe it is, in fact, just a crappy PSU after all?

Except, I was already using my Vega64 for about two weeks with the stock cooler, and on default settings (no change to the power limit) I had no crashing whatsoever. And yes, I know that the Vega64 will definitely thermal throttle with the reference cooler... but, even on cold nights, where I have the AC on full blast and my room is cold, I was able to run the card at a pretty stable boost clock without any crashes, for a good few minutes before reaching the 85c limit. During that time, my UPS was showing a power usage upwards of ~450w, and showing no signs of stability issues.

Meanwhile, with my new aftermarket cooler, my PC is shutting off at a measly 380w. Unless my PSU coincidentally just had a big drop in efficiency, or decided to start giving up, then I don't think it's the PSU.
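
A rough sanity check of the numbers above, as a minimal sketch. The UPS readout measures wall-side (AC) power, so the load actually placed on the PSU's outputs is lower by the PSU's efficiency. The 0.90 efficiency figure is an assumption (roughly what a Gold-rated unit delivers around half load), not a measurement of this particular Capstone, and the function names are made up for illustration.

```python
PSU_RATING_W = 550  # label rating of the PSU discussed in the thread


def dc_load_watts(wall_watts: float, efficiency: float = 0.90) -> float:
    """Estimate the DC load on the PSU from an AC wall-side reading.

    efficiency is an assumed value, not a measured one.
    """
    return wall_watts * efficiency


def load_fraction(wall_watts: float, psu_rating_watts: float,
                  efficiency: float = 0.90) -> float:
    """Fraction of the PSU's label rating that the estimated DC load uses."""
    return dc_load_watts(wall_watts, efficiency) / psu_rating_watts


# Wall readings quoted in the post (they also include monitors and
# accessories, so the real PC load is lower still).
readings = [
    ("aftermarket cooler, PL -15%", 380),
    ("stock cooler, default PL", 450),
]

for label, wall in readings:
    dc = dc_load_watts(wall)
    frac = load_fraction(wall, PSU_RATING_W)
    print(f"{label}: ~{dc:.0f} W DC, {frac:.0%} of label rating")
```

Under that assumption the shutdowns happen at a lower estimated DC load (about 342 W) than the load the same PSU carried without trouble on the stock cooler (about 405 W). Note that this only looks at average draw; a slow UPS readout would not show short spikes.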

Today, I disassembled the cooler, cleaned off all of the thermal compound, and I'm in the process of re-applying heatsinks on the VRMs and MOSFETs. I don't have enough sinks of the right size, so I have a few more coming in the mail tomorrow so I can finish the job.

If I achieve better heat transfer with the new sinks, but my PC still shuts off, then I will stand corrected and I'll go out and buy a new, higher-wattage PSU.

thilanliyan said:
OP, how did you mount the cooler? The hole spacing on Vega is different to every card listed on the compatibility list.

I mounted it perfectly fine. Raijintek doesn't officially list Vega as a supported card, but several users over on the AMD subreddit have pretty much confirmed that it works. Vega requires the 64x64 size bracket on the cooler, same size as Fury. As far as I'm aware, the MORPHEUS II is the only aftermarket air cooler that's big enough to support Vega right now.

Wall Street said:
If you have a hotspot, then you spread your thermal paste poorly and/or didn't use enough.

Just to be clear: Are you saying that every temperature sensor Vega has, is located solely within the GPU and HBM?

When TechPowerup released a new version of GPU-Z a few days ago, they included proper support for Vega cards. Under the Sensors tab, there's a new sensor called "GPU Hotspot", which is different from the regular "GPU" sensor. This is what I'm referring to. I've never seen a sensor called "Hotspot" in GPU-Z before, and I think it's specific to Vega. I'm just not 100% sure what it means.

I think AMD once stated that there are multiple temperature sensors placed on the card. The "Hotspot" reading may just be finding the hottest one, and displaying it. I'm guessing that, if my VRMs are too hot, then maybe the card is monitoring this, and that would explain why the Hotspot reading is going even higher than it was before. But this is just a guess.

Wall Street said:
A GPU isn't like a CPU heat spreader - your GPU core absolutely needs 100% coverage. I would try to remount the cooler. I do a few things differently from mounting a CPU heatsink when doing a GPU:

1. Use quite a bit more thermal paste - a grain of rice is not even close to enough.
2. I like to apply a very small amount to both the cooler and the GPU die and, with my finger wrapped in some Saran wrap to keep finger oils off, rub it into both surfaces.
3. Then I apply a pea-sized drop to the GPU core (larger for the biggest GPUs) and, instead of just mounting the cooler straight down, I move the cooler in a small circular motion 2-3 times before tightening it down to help spread the TIM.

Remember, too much TIM just makes a mess and almost never hurts you.

I've replaced the thermal paste on a GPU before, and I'm mostly aware of how to do it correctly. In this case, I put a generously-sized blob on the main GPU die, and then another couple of smaller (but still generous) blobs on the two HBM modules. Like I said: The main GPU and HBM temperature readings are coming up very nicely. It's the hotspot reading I'm more concerned about.

krumme said:
I run an RX 64 at 50% perf on a 500w PSU with no issues. Same for an RX 56 on another 550w PSU. Both are highest-end models, a be quiet! E10 and a Corsair something. The watt label means nothing, just like the watt rating on a music amp. It's like people never understand that.

E.g. the be quiet! E10 500w PSU has 480w on the 12V rails and 130w on the 3.3V and 5V rails, with good specs to boot at that load. With less strict demands and a later shutdown, it could even be labelled a 650w PSU.

Yeah, this is my basic understanding. A high-quality 500w PSU can sustain about the same load as a crappy 750w one. But, manufacturers have to overshoot the PSU requirements on the product labels, in order to account for the lowest common denominator (people who buy really mediocre/crappy "high watts" PSUs).

Assuming a high quality PSU, saying that the Vega64 "requires" 750w is a bit silly. The OC'd/liquid-cooled edition is a different story though.

In my case - I know that Rosewill isn't necessarily a luxury brand, but my research showed that the Capstone units in particular are nice, and I actually bought one based off of a recommendation from this very website. Not sure if Rosewill is still actively making those, but from my personal experience it's been a solid unit thus far.

I'll still keep the PSU side of the problem in mind. I'm just saying, most signs are pointing to the PSU not being the problem.

Anyway, I'll report back tomorrow with my findings.

krumme · Sep 12, 2017

I tend to agree this PSU probably isn't the cause.
Can you provoke the issue in other situations, like gaming, where you get near 100% GPU activity?
When it crashes, does the computer shut down?

krumme · Sep 12, 2017

I've got the same cooler on the way, and others are thinking about it too. Please post pics of the installation in the builders thread so we have it there for the longer term. I will update the OP then, once we have the right approach.

thilanliyan · Sep 12, 2017

Dankk said:
I mounted it perfectly fine. Raijintek doesn't officially list Vega as a supported card, but several users over on the AMD subreddit have pretty much confirmed that it works. Vega requires the 64x64 size bracket on the cooler, same size as Fury. As far as I'm aware, the MORPHEUS II is the only aftermarket air cooler that's big enough to support Vega right now.

Thanks for that info. Will come in handy when I build a custom mount for my waterblock.

naukkis · Sep 12, 2017

An ATX power supply is designed to handle power spikes of 1/2 of its rated wattage. It's not only the average wattage that determines whether a PSU is enough or not. Vega 64 will hit 375W power spikes, so the ATX-specified minimum PSU is 750W, and even good PSUs have trouble meeting the ATX standards for power spikes.

If the PC shuts down, there is nothing wrong with the GPU. The motherboard can power the system down when its guard devices (if it has any) detect an unstable power supply, but it will usually also report that at the next boot; a PSU critical shutdown is silent. A quality PSU will power down not only for overload but also for unstable output.

krumme · Sep 12, 2017

naukkis said:
An ATX power supply is designed to handle power spikes of 1/2 of its rated wattage. It's not only the average wattage that determines whether a PSU is enough or not. Vega 64 will hit 375W power spikes, so the ATX-specified minimum PSU is 750W, and even good PSUs have trouble meeting the ATX standards for power spikes.

If the PC shuts down, there is nothing wrong with the GPU. The motherboard can power the system down when its guard devices (if it has any) detect an unstable power supply, but it will usually also report that at the next boot; a PSU critical shutdown is silent. A quality PSU will power down not only for overload but also for unstable output.

If a PSU's specs say it can handle 500w on the 12V rails, it can take 500w on the 12V rails. Not 250w.
If the specs are correct and the voltage fluctuations at that load are sane, spikes to 500w are surely no problem. There are capacitors in a PSU to handle short spikes and keep the noise low. The shutdown limit is often well above the paper specs for a good-quality PSU.

The idea of going by the watt label on the box is wrong. Funds are limited, and they are better directed toward higher efficiency, lower noise, longevity and stability. It just takes some work.

Since Maxwell we have seen GPUs produce some hefty spikes, and now Vega does too. It can also be seen in the demand for 8-pin connectors. And yes, it stresses the PSU, especially if the capacitors are crap, as they always are on the cheaper stuff.
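
To make the disagreement in the two posts above concrete, here is a minimal sketch that applies both ways of reasoning to the same numbers. The 375 W spike and the 480 W 12 V rail figure are the ones quoted in the thread; the 100 W allowance for the rest of the system is purely an assumption, and neither rule has been checked against the ATX specification here. The function names are hypothetical.

```python
def passes_half_rating_rule(psu_rating_w: float, gpu_spike_w: float) -> bool:
    """naukkis's rule of thumb: a spike must not exceed half the label rating."""
    return gpu_spike_w <= psu_rating_w / 2


def rail_headroom_w(rail_12v_w: float, other_12v_load_w: float,
                    gpu_spike_w: float) -> float:
    """krumme's view: compare total 12 V draw during a spike with the
    12 V capacity stated in the PSU's specs. Positive means headroom."""
    return rail_12v_w - (other_12v_load_w + gpu_spike_w)


GPU_SPIKE_W = 375      # spike figure quoted in the thread
OTHER_LOAD_W = 100     # ASSUMED draw of CPU, drives, fans on the 12 V rail

for rating in (550, 750):
    verdict = "ok" if passes_half_rating_rule(rating, GPU_SPIKE_W) else "too small"
    print(f"half-rating rule, {rating} W label: {verdict}")

# 480 W is the 12 V capacity quoted for the 500 W unit in the thread.
headroom = rail_headroom_w(480, OTHER_LOAD_W, GPU_SPIKE_W)
print(f"12 V rail view, 480 W rail: {headroom:+.0f} W headroom during a spike")
```

The two approaches can give opposite verdicts for the same hardware, which is essentially what the argument is about: one goes by the label wattage, the other by the stated 12 V capacity and how much of it the rest of the system uses.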

Dankk · Sep 13, 2017

I carefully re-seated chipsinks on all of the VRMs and MOSFETs that require it, re-applied thermal paste to the die, and then re-seated the cooler.

"Hot Spot" temperature has maybe improved a little bit, but not by much. After running 3DMark for several minutes, the Hot Spot temperature sensor still slowly creeps up to over 100 degrees. I canceled the test before it got any higher, since my PC seems to usually shut down when this sensor hits 110 degrees.

However, I think I've alleviated the issue, but in a different way. I think I may have narrowed down the reasons why I'm getting abnormally high readings from this sensor:

1) Low fan speed. Considering how massive the MORPHEUS II cooler is, I was being cocky, and assuming that I could get away with having both 120mm fans run at a static ~1,000 RPM each, with no temperature issues. This is still partly true - but not entirely. GPU and HBM temperatures look very good, but something about that damn "Hot Spot" sensor requires me to crank up the fan speed to something higher. This helps a little bit. In a couple of days, I will install an adapter that lets me plug the fans directly into the GPU itself (instead of the motherboard) so the GPU can control the speed dynamically.

2) Sustained boost clocks. Earlier, I argued that I never had this issue with crashing while I was using the stock cooler, even for the few minutes that I could get the card to run at full speed before thermal throttling kicked in. Well, it turns out that the Hot Spot sensor increases temperature quite gradually, before it reaches awful temps (past 90). Even under ideal conditions, in a cold room, I think I still would have hit the GPU thermal limit before the Hot Spot sensor would have gotten that high, which would explain why I never saw it get that high in the first place. To fix this, I've applied a more aggressive undervolt to my card. P6 and P7 are at 1.03v, HBM is at .950v, and PL is set to +25%.

This undervolt, combined with higher fan speeds, seems to have mostly solved my problem, if not eliminated it entirely. I will still have to do some more rigorous testing, but so far, running Firestrike on a loop for 1 hour seemed to be fine.

This also pretty much shelves my concern about having an underpowered PSU. It's running great. (Unless, for some reason having an underpowered PSU can make your VRMs hotter, in which case someone please tell me because that might actually explain some things too.)

krumme said:
I've got the same cooler on the way, and others are thinking about it too. Please post pics of the installation in the builders thread so we have it there for the longer term. I will update the OP then, once we have the right approach.

Sure, I'll try to post something in that thread soon. Even though the cooler definitely fits, it's not an easy installation, and there are definitely some pitfalls that need to be considered if you want to do this mod (most importantly - the cooler doesn't come with enough of the correct sized heatsinks to fit all of the card's components, and also, some components have to be partially uncovered because they sit too close to the cooler bracket).
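
For anyone who wants to watch the Hot Spot behaviour described in this post without staring at the sensor tab, here is a minimal sketch that scans a logged sensor CSV and flags rows where the hot spot reaches the levels mentioned above (100 degrees, where the run was cancelled, and 110 degrees, where the PC usually shut down). The column names `gpu_c` and `hot_spot_c` are hypothetical; a real log exported from a monitoring tool will use its own headers, so the `column` argument would need to be adjusted. The sample rows are invented purely to exercise the code and are not real readings.

```python
import csv
import io

WARN_C = 100.0      # level at which the poster cancelled the test
SHUTDOWN_C = 110.0  # level at which the poster's PC usually shut down


def classify(hot_spot_c: float) -> str:
    """Map a hot spot temperature to 'ok', 'warn' or 'critical'."""
    if hot_spot_c >= SHUTDOWN_C:
        return "critical"
    if hot_spot_c >= WARN_C:
        return "warn"
    return "ok"


def scan_log(csv_text: str, column: str = "hot_spot_c"):
    """Return (row_number, temperature, status) for every row at or above WARN_C."""
    flagged = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        temp = float(row[column])
        status = classify(temp)
        if status != "ok":
            flagged.append((row_number, temp, status))
    return flagged


# Synthetic sample with hypothetical column names; not real sensor data.
SAMPLE = """gpu_c,hot_spot_c
58,92
60,101
61,110
"""

for entry in scan_log(SAMPLE):
    print(entry)
```

The point of checking the hot spot column separately is the one made in the post: the regular GPU and HBM readings can look fine (around 60) while the hot spot reading is the one approaching the shutdown level.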

krumme · Sep 27, 2017

Dankk said:
I carefully re-seated chipsinks on all of the VRMs and MOSFETs that require it, re-applied thermal paste to the die, and then re-seated the cooler.

"Hot Spot" temperature has maybe improved a little bit, but not by much. After running 3DMark for several minutes, the Hot Spot temperature sensor still slowly creeps up to over 100 degrees. I canceled the test before it got any higher, since my PC seems to usually shut down when this sensor hits 110 degrees.

However, I think I've alleviated the issue, but in a different way. I think I may have narrowed down the reasons why I'm getting abnormally high readings from this sensor:

1) Low fan speed. Considering how massive the MORPHEUS II cooler is, I was being cocky, and assuming that I could get away with having both 120mm fans run at a static ~1,000 RPM each, with no temperature issues. This is still partly true - but not entirely. GPU and HBM temperatures look very good, but something about that damn "Hot Spot" sensor requires me to crank up the fan speed to something higher. This helps a little bit. In a couple of days, I will install an adapter that lets me plug the fans directly into the GPU itself (instead of the motherboard) so the GPU can control the speed dynamically.

2) Sustained boost clocks. Earlier, I argued that I never had this issue with crashing while I was using the stock cooler, even for the few minutes that I could get the card to run at full speed before thermal throttling kicked in. Well, it turns out that the Hot Spot sensor increases temperature quite gradually, before it reaches awful temps (past 90). Even under ideal conditions, in a cold room, I think I still would have hit the GPU thermal limit before the Hot Spot sensor would have gotten that high, which would explain why I never saw it get that high in the first place. To fix this, I've applied a more aggressive undervolt to my card. P6 and P7 are at 1.03v, HBM is at .950v, and PL is set to +25%.

This undervolt, combined with higher fan speeds, seems to have mostly solved my problem, if not eliminated it entirely. I will still have to do some more rigorous testing, but so far, running Firestrike on a loop for 1 hour seemed to be fine.

This also pretty much shelves my concern about having an underpowered PSU. It's running great. (Unless, for some reason having an underpowered PSU can make your VRMs hotter, in which case someone please tell me because that might actually explain some things too.)

Sure, I'll try to post something in that thread soon. Even though the cooler definitely fits, it's not an easy installation, and there are definitely some pitfalls that need to be considered if you want to do this mod (most importantly - the cooler doesn't come with enough of the correct sized heatsinks to fit all of the card's components, and also, some components have to be partially uncovered because they sit too close to the cooler bracket).

How is it going? Did raising the fan speed solve the issue?

I have some instability issues as well in BF1 every hour or so. Hot spot goes to 105 but temp stays at 65 or so.

Got a new PSU, so it's definitely not that.

Dankk · Oct 4, 2017

krumme said:
How is it going? Did raising the fan speed solve the issue?

I have some instability issues as well in BF1 every hour or so. Hot spot goes to 105 but temp stays at 65 or so.

Got a new PSU, so it's definitely not that.

Sorry for the delayed response. I went ahead and added a post in the main Vega Builders thread:

https://forums.anandtech.com/thread...read-rx-64-rx-56.2516510/page-5#post-39103634

tl;dr: GPU temps are mostly in good shape, but could be better. Thermal application and mounting technique are definitely important here. Still an improvement over the crappy reference cooler overall.

Phynaz · Oct 4, 2017
