Hey guys. I appreciate the advice, but I'm not convinced that my PSU is at fault here. Consider the following:
I tried lowering the Power Limit in Wattman to -15%, and then booted up 3DMark to see what would happen. To be fair, yes, I was able to run the benchmark much longer this time without crashing. However, after ~10 minutes or so, my PC still eventually shut off.
Now, my PC and all of its peripherals are connected to a UPS (uninterruptible power supply) under my desk. The UPS has a digital readout showing how many watts the system is pulling. When I run 3DMark on a loop, and I have the Power Limit set to -15%, the UPS shows no more than ~380w being used. (Keep in mind that this "380w" includes both of my monitors, and other accessories plugged into the UPS, so in reality, my PC itself is using even less power. But to be generous, we'll just pretend that the PC is drawing all of those 380 watts).
That would mean: My 550w gold-rated PSU is dying at only 380w power usage. Okay, so maybe it is, in fact, just a crappy PSU after all?
Except, I was already using my Vega64 for about two weeks with the stock cooler, and on default settings (no change to the power limit) I had no crashing whatsoever. And yes, I know that the Vega64 will definitely thermal throttle with the reference cooler... but, even on cold nights, where I have the AC on full blast and my room is cold, I was able to run the card at a pretty stable boost clock without any crashes, for a good few minutes before reaching the 85c limit. During that time, my UPS was showing a power usage upwards of ~450w, and showing no signs of stability issues.
Meanwhile, with my new aftermarket cooler, my PC is shutting off at a measly 380w. Unless my PSU coincidentally just had a big drop in efficiency, or decided to start giving up, then I don't think it's the PSU.
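To put rough numbers on that comparison, here's a quick sketch using only the UPS readings quoted above. The 90% efficiency figure is an assumption for illustration (I haven't measured it), and the function names are just made up for this example. Since the UPS reads wall-side power and also includes the monitors, these are upper bounds on what the PSU is actually delivering.

```python
# Figures quoted in the post above (wall-side readings from the UPS display).
PSU_RATED_W = 550          # label rating of the PSU
UPS_AT_SHUTDOWN_W = 380    # peak reading with the new cooler, Power Limit -15%
UPS_STABLE_STOCK_W = 450   # reading with the stock cooler, default power limit

# ASSUMPTION: illustrative efficiency for a gold-rated unit, not a measured value.
ASSUMED_EFFICIENCY = 0.90


def dc_load_estimate(wall_watts: float, efficiency: float = ASSUMED_EFFICIENCY) -> float:
    """Rough upper bound on the DC load the PSU delivers for a given wall reading."""
    return wall_watts * efficiency


def load_fraction(wall_watts: float, rated_watts: float = PSU_RATED_W) -> float:
    """Estimated DC load as a fraction of the PSU's label rating."""
    return dc_load_estimate(wall_watts) / rated_watts


if __name__ == "__main__":
    readings = [("shutdown (new cooler)", UPS_AT_SHUTDOWN_W),
                ("stable (stock cooler)", UPS_STABLE_STOCK_W)]
    for label, watts in readings:
        print(f"{label}: ~{dc_load_estimate(watts):.0f}w DC, "
              f"{load_fraction(watts):.0%} of rating")
```

Under that assumed efficiency, the shutdown case sits at roughly 62% of the label rating, while the earlier stable case was around 74%.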
Today, I disassembled the cooler, cleaned off all of the thermal compound, and I'm in the process of re-applying heatsinks on the VRMs and MOSFETs. I don't have enough sinks of the right size, so I have a few more coming in the mail tomorrow so I can finish the job.
If I achieve better heat transfer with the new sinks, but my PC still shuts off, then I will stand corrected and I'll go out and buy a new, higher-wattage PSU.
OP, how did you mount the cooler? The hole spacing on Vega is different to every card listed on the compatibility list.
I mounted it perfectly fine. Raijintek doesn't officially list Vega as a supported card, but several users over on the AMD subreddit have pretty much confirmed that it works. Vega requires the 64x64 size bracket on the cooler, same size as Fury. As far as I'm aware, the MORPHEUS II is the only aftermarket air cooler that's big enough to support Vega right now.
If you have a hotspot, then you spread your thermal paste poorly and/or didn't use enough.
Just to be clear: Are you saying that every temperature sensor Vega has is located solely within the GPU and HBM?
When TechPowerUp released a new version of GPU-Z a few days ago, they included proper support for Vega cards. Under the Sensors tab, there's a new sensor called "GPU Hotspot", which is different from the regular "GPU" sensor. This is what I'm referring to. I've never seen a sensor called "Hotspot" in GPU-Z before, and I think it's specific to Vega. I'm just not 100% sure what it means.
I think AMD once stated that there are multiple temperature sensors placed on the card. The "Hotspot" reading may just be finding the hottest one, and displaying it. I'm guessing that, if my VRMs are too hot, then maybe the card is monitoring this, and that would explain why the Hotspot reading is going even higher than it was before. But this is just a guess.
A GPU isn't like a CPU heat spreader - your GPU core absolutely needs 100% coverage. I would try to remount the cooler. I like to do a few methods that differ from mounting a CPU heatsink when doing a GPU:
1. Use quite a bit more thermal paste - a grain of rice is not even close to enough.
2. I like to apply a very small amount to both the cooler and the GPU die and, with my finger wrapped in some Saran wrap to keep finger oils off, rub it into both surfaces.
3. Then I apply a pea-sized drop to the GPU core (larger for the biggest GPUs) and, instead of just mounting the cooler straight down, I move the cooler in a small circular motion 2-3 times before tightening it down, to help spread the TIM.
Remember, too much TIM just makes a mess and almost never hurts you.
I've replaced the thermal paste on a GPU before, and I'm mostly aware of how to do it correctly. In this case, I put a generously-sized blob on the main GPU die, and then another couple of smaller (but still generous) blobs on the two HBM modules. Like I said: The main GPU and HBM temperature readings are coming up very nicely. It's the hotspot reading I'm more concerned about.
I run an rx64 at 50% perf on a 500w PSU with no issues. Same for an rx56 on another 550w PSU. Both are highest-end models, a be quiet E10 and a Corsair something. The watt label means nothing, like the watts on an amp for music. It's like people never understand it.
E.g. the be quiet E10 500w PSU has 480w on the 12V rails and 130w on the 3.3V and 5V rails, with good specs to boot at that load. With less strict demands and a later shutdown, this could even be labelled a 650w PSU.
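The arithmetic behind that can be sketched quickly, using only the figures quoted above. The function and key names are made up for this example, and adding the rails together is just an illustration of the labelling point: PSUs generally can't deliver every rail's maximum at the same time, so the sum is not a usable power budget.

```python
# Figures quoted above for the 500w unit.
LABEL_W = 500
RAIL_CAPACITY_W = {"12V": 480, "3.3V+5V": 130}


def rail_summary(label_w: int, rails: dict) -> dict:
    """Compare the per-rail capacities with the number printed on the label."""
    total = sum(rails.values())
    return {
        "sum_of_rails_w": total,                      # naive sum, illustration only
        "excess_over_label_w": total - label_w,       # how far the sum exceeds the label
        "12v_share_of_label": rails["12V"] / label_w  # fraction of the label available on 12V
    }


if __name__ == "__main__":
    for key, value in rail_summary(LABEL_W, RAIL_CAPACITY_W).items():
        print(f"{key}: {value}")
```

The point it shows is that nearly the whole label rating is available on the 12V rails, which is where a GPU pulls its power from.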
Yeah, this is my basic understanding. A high-quality 500w PSU can sustain about the same load as a crappy 750w one. But, manufacturers have to overshoot the PSU requirements on the product labels, in order to account for the lowest common denominator (people who buy really mediocre/crappy "high watts" PSUs).
Assuming a high quality PSU, saying that the Vega64 "requires" 750w is a bit silly. The OC'd/liquid-cooled edition is a different story though.
In my case - I know that Rosewill isn't necessarily a luxury brand, but my research showed that the Capstone units in particular are nice, and I actually bought one based off of a recommendation from this very website. Not sure if Rosewill is still actively making those, but from my personal experience it's been a solid unit thus far.
I'll still keep the PSU side of the problem in mind. I'm just saying, most signs are pointing to the PSU not being the problem.
Anyway, I'll report back tomorrow with my findings.