Warning: This is a fairly involved read that will probably take around 15 minutes.
1. Question/issue: How should we test the stability of CPUs which are able to boost past their all-core frequency? In other words, how should we test the stability of CPUs which are able to operate at a higher frequency than their “base frequency” when only a few of their cores are loaded?
Why this is more complicated than it might first appear
2. Take for example my AMD 3950X. It is able to operate up to ~4.2GHz when all cores are loaded, but is supposedly able to run 1 core at up to 4.7GHz. The implication is that if I were just to run 32 threads of Prime95, I would only really be testing whether my 3950X is Prime95 stable at or below ~4.2GHz (and not whether it is stable running 1 core at 4.7GHz, or anything above ~4.2GHz for that matter). The corollary, leaving aside the complication discussed in paragraph 5 of this post, is that if I wanted to check whether the 3950X is Prime95 stable, I would have to run 1 thread of Prime95, then 2 threads of Prime95, and so on up to 32 threads of Prime95, so that I can be sure my 3950X is Prime95 stable regardless of how many cores are active.
3. A further complication is that the maximum heat output does not necessarily occur when all cores are loaded. For example, the 3950X that Anandtech tested consumed the most power when 10 of its 16 cores were loaded. (See Anandtech’s 3950X review, page 2.) The implication is that a 3950X may very well be stable when all 16 cores are loaded, but unstable when 10 cores are loaded, because the temperature of the 3950X is higher when 10 cores are loaded.
4. Tentative conclusion #1 in light of the issues I raised in paragraphs 2 and 3 of this post: One characteristic of an ideal stress test is that it is able to dynamically adjust the number of active threads as the stress test progresses.
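As a rough illustration of the thread-count sweep described above, here is a minimal Python sketch. It is not a substitute for Prime95: the workload (a simple 64-bit integer recurrence), the function names and the command-line argument are all assumptions made up for this example, and the operating system's scheduler, not the script, decides which cores the workers land on. Each step runs the same deterministic calculation on N worker processes and flags a mismatch if any worker returns a different result.

```python
import multiprocessing as mp
import sys


def checksum(iterations: int) -> int:
    """Deterministic integer workload: the same input must always give the same output."""
    x = 1
    for _ in range(iterations):
        # 64-bit linear congruential step (illustrative constants, not Prime95's math).
        x = (x * 6364136223846793005 + 1442695040888963407) % (1 << 64)
    return x


def sweep_schedule(max_workers: int) -> list[int]:
    """Worker counts to test, in order: 1, 2, ..., max_workers."""
    return list(range(1, max_workers + 1))


def run_step(workers: int, iterations: int) -> bool:
    """Run the same workload on `workers` processes; True if every result agrees."""
    with mp.Pool(processes=workers) as pool:
        results = pool.map(checksum, [iterations] * workers)
    return len(set(results)) == 1


if __name__ == "__main__":
    # Dry run unless an iteration count per worker is given on the command line.
    iterations = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    for workers in sweep_schedule(mp.cpu_count()):
        if iterations == 0:
            print(f"dry run: would load {workers} worker(s)")
        else:
            ok = run_step(workers, iterations)
            print(f"{workers:2d} worker(s): {'ok' if ok else 'MISMATCH'}")
```

Run with a large iteration count so that each step lasts long enough for the CPU to settle at the boost frequency for that number of loaded cores. A real test would also want a much heavier workload per core than this loop provides.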
5. In addition, Ryzen 3rd generation CPUs (i.e. Zen 2 architecture using TSMC’s 7nm manufacturing process) are only able to reach (close to) their advertised single-core max boost speeds for extremely brief periods of time. (See Anandtech’s 3950X review (at page 2), where it was stated that peak single core frequency of 4650 MHz on the Ryzen High Performance (RHP) power plan was “very instantaneous, as when we put a consistent single thread load on the core, the [frequency] very quickly came down”. Also see this Anandtech article (at page 7) where it was stated that “Ultimately, by opting for a more aggressive binning strategy so close to silicon limits, AMD has reached a point where, depending on the workload and the environment, a desktop CPU might only sustain a top Turbo bins momentarily”.)
This behaviour is unlike modern Intel CPUs which, given sufficient cooling and a sufficiently high Power Limit 2 value, are able to boost to their maximum single-core boost frequencies until the Power Limit 2 (PL2) duration – aka Turbo Time Parameter (Tau or τ) – is reached. (See Anandtech’s 2019 interview with Guy Therien, this 2019 article, and this 2018 article.)
The implication is that since there’s no way of sustaining the maximum frequencies achieved by the Ryzen 3rd generation CPUs for any meaningful duration, there is no way of testing whether such a CPU is stable at the highest frequencies which it is able to achieve for only brief periods of time.
6. Tentative conclusion #2 in light of the issue I raised in paragraph 5 of this post: Another characteristic of an ideal stress test is that it is able to generate bursts of intense workloads interspersed with zero loads, in order to coax the CPU into operating at its highest frequencies.
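The burst-and-idle pattern described above could be sketched as follows. This is a single-threaded Python illustration only: the function names are hypothetical, and the burst and idle durations in the example are guesses rather than values measured on any CPU. Whether a given pattern actually coaxes the processor into its top boost bin would have to be checked with a monitoring tool.

```python
import time


def burst_plan(total_s: float, burst_s: float, idle_s: float) -> list[tuple[str, float]]:
    """Alternate ("load", burst_s) and ("idle", idle_s) phases until total_s is used up."""
    if burst_s <= 0 or idle_s < 0:
        raise ValueError("burst_s must be positive and idle_s non-negative")
    plan = []
    elapsed = 0.0
    while elapsed < total_s:
        for phase, length in (("load", burst_s), ("idle", idle_s)):
            length = min(length, total_s - elapsed)  # trim the final phase
            if length <= 0:
                break
            plan.append((phase, length))
            elapsed += length
    return plan


def spin(seconds: float) -> int:
    """Busy-loop on one core for roughly `seconds`; returns the iteration count."""
    deadline = time.perf_counter() + seconds
    n = 0
    while time.perf_counter() < deadline:
        n += 1
    return n


def run_bursts(plan: list[tuple[str, float]]) -> int:
    """Execute the plan: spin during load phases, sleep during idle phases."""
    total = 0
    for phase, length in plan:
        if phase == "load":
            total += spin(length)
        else:
            time.sleep(length)
    return total


if __name__ == "__main__":
    # Illustrative numbers only: 50 ms bursts separated by 200 ms of idle, for 2 s.
    plan = burst_plan(total_s=2.0, burst_s=0.05, idle_s=0.2)
    print(run_bursts(plan), "iterations over", len(plan), "phases")
```

Combining this with a varying number of workers would cover both tentative conclusions at once, at the cost of a much larger space of patterns to test.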
Different instruction sets
7. An ideal stress test would also test every possible type of instruction a CPU supports (and every combination thereof).
8. Prime95 presumably doesn’t do this, hence I chose my words very carefully and said “Prime95 stable” in my posts and not merely “stable”. Ex hypothesi, this also means the Prime95 algorithm shouldn’t be placed on a pedestal as the gold standard of stability tests, but treated as merely one of several stability tests to perform.
9. Prime95-specific observation: I noticed that Prime95 version 28.9 causes my 3950X to produce varying amounts of heat, at least when 32 threads are running. To elaborate:
(a) The 3950X would hum along at ~70C most of the time, then occasionally hit ~90C before going back to ~70C. The cycle then repeats.
(b) Also, the 3950X would operate anywhere between 3.3GHz and 4.2GHz when all cores are loaded, but mostly between 3.8GHz and 4GHz. This is probably because the intensity of the Prime95 workload varies over time, and because the 3950X is being forced to operate within the specified power or current limits, viz.:
(i) Package Power Tracking (PPT), the power threshold that is allowed to be delivered to the socket;
(ii) Thermal Design Current (TDC), the maximum amount of current delivered by the motherboard’s voltage regulators when under thermally constrained scenarios; and
(iii) Electrical Design Current (EDC), the maximum amount of current at any instantaneous short period of time that can be delivered by the motherboard’s voltage regulators.
(Definitions taken from Anandtech’s deep dive article on 3700x and 3900x, penultimate page.)
Those with a 3950X or any other Ryzen 3rd gen CPU, do you notice similar behaviour when running Prime95?
Screenshot of HWInfo and Ryzen Master after running Prime95 Blend (32 threads)
10. I’ll include the relevant specifications/configuration of my system for reference:
- Motherboard: MSI x570 Unify
- Motherboard BIOS: 7C35vA2 (released 2019-11-07), and most likely includes the AMD ComboPI188.8.131.52 Patch B (SMU v46.54)
- AMD Chipset Driver version: 184.108.40.2064 (released 11/25/2019), which inter alia includes AMD Ryzen Power Plan v220.127.116.11
- Windows Power Plan: AMD Ryzen High Performance plan (which, unlike the Ryzen Balanced plan, retains the fast Frequency Ramp-Up times - see Anandtech’s article on Collaborative Processor Performance Control 2 (CPPC2), but see this Anandtech article (at page 7) for a better explanation of CPPC2)
- Windows build: 10.0.18363 (version 1909)
11. As an aside, I am of the opinion that my Noctua NH-U14S is adequate for running a 3950X at stock, since it is, broadly speaking, able to keep the 3950X at around 70C when running Prime95 even when the ambient temperature is a fairly warm ~28.5C. The occasional spikes to 90C when running Prime95 will probably still occur even on the best ambient water-cooling system, since the bottleneck of the heat dissipation seems to occur at the interface between the die and heat spreader, or even within the die itself. Moreover, the heatsink only feels warm to the touch (as distinct from being so warm that it is uncomfortable to touch for long periods of time), further suggesting that the heat dissipation capability of the NH-U14S is adequate for a 3950X running at stock.
12. The high temperatures observed with Ryzen 3rd Generation on ambient cooling are likely due to the 7nm node – a lot of heat is being generated by a relatively small die.
13. If you want to do significantly better than air cooling, then you would have to look at cooling solutions which are able to bring the temperature of the heat spreader below ambient temperature, such as phase-change systems or Peltier coolers (aka thermoelectric cooling), because it is only by decreasing the temperature of the heat spreader that the rate of heat dissipation from the die to the heat spreader would improve. (For the proposition that the rate of heat dissipation is proportional to the temperature difference, see Fourier’s Law.) High-end air cooling already seems to be able to maintain the heat spreader at close to ambient temperature, so the best ambient water-cooling is unlikely to yield any significant benefit.
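To make the Fourier's Law argument concrete, here is the one-dimensional, steady-state form applied to the layer between the die and the heat spreader. The symbols are assumptions introduced for this illustration: \dot{Q} is the heat flow out of the die, k the thermal conductivity of the interface material, A its area, d its thickness, and T_die and T_IHS the temperatures on either side. A real CPU package is not a uniform slab, so this is only a rough model.

```latex
% Steady-state, one-dimensional conduction through the die-to-heat-spreader layer
\dot{Q} = \frac{k A}{d}\,\bigl(T_{\text{die}} - T_{\text{IHS}}\bigr)
\quad\Longrightarrow\quad
T_{\text{die}} = T_{\text{IHS}} + \dot{Q}\,\frac{d}{k A}
```

For a fixed heat output, the die temperature in this model tracks the heat spreader temperature degree for degree. An ambient cooler can at best bring T_IHS down to the ambient temperature, so the die cannot go below ambient plus the \dot{Q} d / (k A) term however good that cooler is. Only a sub-ambient T_IHS lowers it further.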
14. Rant #1: HWInfo v6.20 does not report correct voltages, clock speeds, etc. when “Memory Integrity” in Windows Security is enabled.
15. Rant #2: Ryzen Master v18.104.22.1684 will not open/start if Virtualization Based Security is enabled.
16. To reiterate, and putting my original question differently: is there any CPU stability test which incorporates the principles I mentioned in paragraphs 4, 6 and 7 of this post? If not, I would like to invite developers or programmers out there to design one.
17. The reasons why I believe it’s particularly important to test Ryzen 3rd Generation CPUs even at stock and independently verify the stability of the CPU are as follows:
(a) First, because of how close these CPUs are operating to their maximum potential. As noted in this Anandtech write-up on AMD’s boost behaviour (at page 3), “….the CPU out of the box is already near its peak limits, and AMD’s metrics from manufacturing state that the CPU has a lifespan that AMD is happy with despite being near silicon limits…”.
(b) Second, AMD apparently increases the voltage gradually over time to compensate for the effects of electromigration (ibid). But can we trust this algorithm to accurately compensate for the effects of electromigration?