Solved! How should we test the stability of CPUs which are able to boost past their all-core frequency? (With particular focus on AMD's 3950X)

Pro-competition

Junior Member
Dec 13, 2019
11
0
6
Warning: This is a very involved read that probably would take around 15 minutes.

1. Question/issue: How should we test the stability of CPUs which are able to boost past their all-core frequency? In other words, how should we test the stability of CPUs which are able to operate at a higher frequency than their “base frequency” when only a few of their cores are loaded?

Why this is more complicated than it might first appear
2. Take for example my AMD 3950X. It is able to operate at up to ~4.2GHz when all cores are loaded, but is supposedly able to run 1 core at up to 4.7GHz. The implication is that if I were just to run 32 threads of Prime95, I would only really be testing whether my 3950X is Prime95 stable at or below ~4.2GHz (and not whether it is stable running 1 core at 4.7GHz, or at anything above ~4.2GHz for that matter). The corollary, leaving aside the complication discussed in paragraph 5 of this post, is that if I wanted to check whether the 3950X is Prime95 stable, I would have to run 1 thread of Prime95, then 2 threads of Prime95, and so on up to 32 threads of Prime95, so that I can be sure my 3950X is Prime95 stable regardless of how many cores are active.
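To make the sweep in paragraph 2 concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names are made up, and the load is a plain busy loop standing in for a real stress workload (it keeps cores busy but does not check results the way Prime95 does).

```python
import os
import subprocess
import sys
import time

# Hypothetical stand-in for a real stress workload: a Python busy loop.
# It loads a core but performs no result checking.
BUSY_LOOP = "while True: pass"


def sweep_plan(logical_cpus):
    """Return the worker counts to test: 1, 2, ..., logical_cpus."""
    return list(range(1, logical_cpus + 1))


def run_step(workers, seconds):
    """Run `workers` busy-loop processes for `seconds`, then stop them."""
    procs = [
        subprocess.Popen([sys.executable, "-c", BUSY_LOOP])
        for _ in range(workers)
    ]
    try:
        time.sleep(seconds)
    finally:
        # Always clean up the workers, even if the sleep is interrupted.
        for p in procs:
            p.terminate()
        for p in procs:
            p.wait()
    return workers


def run_sweep(logical_cpus=None, seconds_per_step=60.0):
    """Step through every worker count from 1 up to the logical CPU count."""
    n = logical_cpus or os.cpu_count() or 1
    return [run_step(w, seconds_per_step) for w in sweep_plan(n)]
```

On a 3950X, `run_sweep(32, 600)` would spend ten minutes at each of the 32 thread counts, which shows how quickly a full sweep becomes a multi-hour job.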

3. A further complication is that the maximum heat output does not necessarily occur when all cores are loaded. For example, the 3950X that Anandtech tested consumed the most power when 10 of its 16 cores were loaded. (See Anandtech’s 3950X review, page 2.) The implication is that a 3950X may very well be stable when all 16 cores are loaded, but unstable when 10 cores are loaded, because the temperature of the 3950X is higher when 10 cores are loaded.

4. Tentative conclusion #1 in light of the issues I raised in paragraphs 2 and 3 of this post: One characteristic of an ideal stress test is that it is able to dynamically adjust the number of active threads as the stress test progresses.
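One way to picture "dynamically adjusting the number of active threads" is as a schedule of thread counts that the test walks through. The sketch below is only an illustration under my own assumptions (the function name and the shuffle-per-round policy are not from any existing tool): each round visits every thread count exactly once, in a different order, so the count at which power peaks is always covered without having to know it in advance.

```python
import random


def dynamic_schedule(logical_cpus, rounds, seed=0):
    """Build a list of active-thread counts for a stress run.

    Each round is a shuffled pass over 1..logical_cpus, so every
    count is exercised once per round in a varying order.
    """
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    schedule = []
    for _ in range(rounds):
        counts = list(range(1, logical_cpus + 1))
        rng.shuffle(counts)
        schedule.extend(counts)
    return schedule
```

Each entry would then be handed to whatever load generator is in use, for a fixed time slice per entry. Seeding the generator is a deliberate choice: if a crash occurs, the same sequence of thread counts can be reproduced.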

5. In addition, Ryzen 3rd generation CPUs (i.e. Zen 2 architecture using TSMC’s 7nm manufacturing process) are only able to reach (close to) their advertised single-core max boost speeds for extremely brief periods of time. (See Anandtech’s 3950X review (at page 2), where it was stated that peak single core frequency of 4650 MHz on the Ryzen High Performance (RHP) power plan was “very instantaneous, as when we put a consistent single thread load on the core, the [frequency] very quickly came down”. Also see this Anandtech article (at page 7) where it was stated that “Ultimately, by opting for a more aggressive binning strategy so close to silicon limits, AMD has reached a point where, depending on the workload and the environment, a desktop CPU might only sustain a top Turbo bins momentarily”.)

This behaviour is unlike modern Intel CPUs which, given sufficient cooling and a sufficiently high Power Limit 2 value, are able to boost to their maximum single-core boost frequencies until the Power Limit 2 (PL2) duration – aka Turbo Time Parameter (Tau or τ) – is reached. (See Anandtech’s 2019 interview with Guy Therien, this 2019 article, and this 2018 article.)

The implication is that since there’s no way of sustaining the maximum frequencies achieved by the Ryzen 3rd generation CPUs for any meaningful duration, there is no way of testing whether such a CPU is stable at the highest frequencies which it is able to achieve for only brief periods of time.

6. Tentative conclusion #2 in light of the issue I raised in paragraph 5 of this post: Another characteristic of an ideal stress test is that it is able to generate bursts of intense workloads interspersed with zero loads, in order to coax the CPU into operating at its highest frequencies.
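A rough sketch of the burst-then-idle idea follows, in Python. It is an illustration only: the function names, the arithmetic used as the workload, and the idle time are all my assumptions, and an interpreted loop like this is a far lighter load than Prime95. Whether bursts of this kind actually push a given CPU to its peak boost bin would need to be confirmed with a monitoring tool.

```python
import time


def fixed_work(iterations):
    """A deterministic chunk of integer arithmetic with a checkable result."""
    acc = 0
    for i in range(iterations):
        acc = (acc * 31 + i * i) % 1_000_000_007
    return acc


def burst_test(cycles, iterations, idle_seconds):
    """Alternate idle periods with short bursts of work.

    Returns the number of bursts whose result differed from the
    reference value; any non-zero count means a computation error.
    """
    reference = fixed_work(iterations)
    mismatches = 0
    for _ in range(cycles):
        time.sleep(idle_seconds)  # zero load, so the core can drop to idle
        if fixed_work(iterations) != reference:
            mismatches += 1
    return mismatches
```

Using a fixed amount of work per burst, rather than a fixed duration, means every burst should produce the same answer, so a silent miscalculation shows up as a mismatch instead of going unnoticed.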

Different instruction sets

7. An ideal stress test would also test every possible type of instruction a CPU supports (and every combination thereof).

8. Prime95 presumably doesn’t do this, hence I chose my words very carefully and said “Prime95 stable” in my posts and not merely “stable”. Ex hypothesi, this also means the Prime95 algorithm shouldn’t be placed on a pedestal as the gold standard of stability tests, but treated as merely one of several stability tests to perform.

Prime95 oddity

9. Prime95-specific observation: I noticed that Prime95 version 28.9 causes my 3950X to produce varying amounts of heat, at least when 32 threads are running. To elaborate:

(a) The 3950X would hum along at ~70C most of the time, then occasionally hit ~90C before going back to ~70C. The cycle then repeats.
(b) Also, the 3950X would operate anywhere between 3.3GHz and 4.2GHz when all cores are loaded, but mostly between 3.8GHz and 4GHz. This is probably because the intensity of the Prime95 workload varies over time, and because the 3950X is being forced to operate within the specified power and current limits, viz.:
(i) Package Power Tracking (PPT), the power threshold that is allowed to be delivered to the socket;
(ii) Thermal Design Current (TDC), the maximum amount of current delivered by the motherboard’s voltage regulators when under thermally constrained scenarios; and
(iii) Electrical Design Current (EDC), the maximum amount of current at any instantaneous short period of time that can be delivered by the motherboard’s voltage regulators.
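To see which of the three limits is holding the clocks back at a given moment, the readings can be expressed as a percentage of each limit. The sketch below is an assumption-laden illustration: the limit values (142 W, 95 A, 140 A) are the commonly cited stock figures for 105 W AM4 parts and are not taken from this post, so check what your own board reports, and the sample readings are made up.

```python
# Assumed stock limits for a 105 W TDP AM4 CPU (not taken from this thread).
STOCK_LIMITS = {"PPT_W": 142.0, "TDC_A": 95.0, "EDC_A": 140.0}


def utilisation(readings, limits=STOCK_LIMITS):
    """Return each reading as a percentage of its limit."""
    return {name: 100.0 * readings[name] / limits[name] for name in limits}


def binding_limit(readings, limits=STOCK_LIMITS):
    """Return the name of the limit that is closest to being reached."""
    pct = utilisation(readings, limits)
    return max(pct, key=pct.get)


# Hypothetical readings copied by hand from a monitoring tool.
sample = {"PPT_W": 137.74, "TDC_A": 76.0, "EDC_A": 140.0}
print(utilisation(sample))
print(binding_limit(sample))
```

With these made-up numbers the package sits at about 97% of PPT but is already at 100% of EDC, so in that situation EDC would be the limit constraining the clocks.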

Those with 3950X or any other Ryzen 3rd gen CPU, do you notice a similar behaviour when running Prime95?

[Attachment: 4GHz 90C 97% power - 24 Dec 2019.jpg. Screenshot of HWInfo and Ryzen Master after running Prime95 Blend (32 threads).]

10. I’ll include the relevant specifications/configuration of my system for reference:
  • Motherboard: MSI x570 Unify
  • Motherboard BIOS: 7C35vA2 (released 2019-11-07), and most likely includes the AMD ComboPI1.0.0.4 Patch B (SMU v46.54)
  • AMD Chipset Driver version: 1.11.22.454 (released 11/25/2019), which inter alia includes AMD Ryzen Power Plan v5.0.0.0
  • Windows Power Plan: AMD Ryzen High Performance plan (which, unlike the Ryzen Balanced plan, retains the fast Frequency Ramp-Up times - see Anandtech’s article on Collaborative Processor Performance Control 2 (CPPC2), but see this Anandtech article (at page 7) for a better explanation of CPPC2)
  • Windows build: 10.0.18363 (version 1909)

Air cooling is adequate for 3950X at default settings

11. As an aside, I am of the opinion that my Noctua NH-U14S is adequate for running the 3950X at stock, since it is, broadly speaking, able to keep the 3950X at around 70C when running Prime95 even when the ambient temperature is a fairly warm ~28.5C. The occasional spikes to 90C when running Prime95 will probably still occur even on the best ambient water-cooling system, since the bottleneck in heat dissipation seems to be at the interface between the die and the heat spreader, or even within the die itself. Moreover, the heatsink only feels warm to the touch (as distinct from being so warm that it is uncomfortable to touch for long periods of time), further suggesting that the heat dissipation capability of the NH-U14S is adequate for a 3950X running at stock.

12. The high temperatures observed with Ryzen 3rd Generation on ambient cooling are likely due to the 7nm node: a lot of heat is being generated by a relatively small die.

13. If you want to do significantly better than air cooling, then you would have to look at cooling solutions which are able to bring the temperature of the heat spreader below ambient temperature, such as phase-change systems or Peltier coolers (aka thermoelectric cooling). This is because it is only by decreasing the temperature of the heat spreader that the rate of heat dissipation from the die to the heat spreader would improve. (For the proposition that the rate of heat dissipation is proportional to the temperature difference, see Fourier’s Law.) High-end air cooling already seems to be able to maintain the heat spreader at close to ambient temperature, so the best ambient water-cooling is unlikely to yield any significant benefit.
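For reference, the argument in paragraph 13 can be written out using the one-dimensional, steady-state form of Fourier's Law. This is a simplification (real heat flow out of a die is not one-dimensional), and the symbols are my own labels: $\dot{Q}$ is the heat flow in watts, $k$ the thermal conductivity of the material between die and heat spreader, $A$ the contact area, and $d$ the thickness of that layer.

```latex
\[
  \dot{Q} \;=\; \frac{k A}{d}\,\bigl(T_{\text{die}} - T_{\text{IHS}}\bigr)
  \quad\Longrightarrow\quad
  T_{\text{die}} \;=\; T_{\text{IHS}} + \frac{d}{k A}\,\dot{Q}
\]
```

For a given heat output, the die therefore sits a fixed offset above the heat spreader temperature. A cooler cannot change $k$, $A$ or $d$, so the only lever it has on die temperature is lowering $T_{\text{IHS}}$, and once that is already near ambient, only sub-ambient cooling can lower it further.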

Rant

14. Rant #1: HWInfo v6.20 does not report correct voltages, clock speeds, etc. when “Memory Integrity” in Windows Security is enabled.

15. Rant #2: Ryzen Master v2.1.0.1424 will not open/start if Virtualization Based Security is enabled.

Conclusion

16. To reiterate, and putting my original question differently: is there any CPU stability test which incorporates the principles I mentioned in paragraphs 4, 6 and 7 of this post? If not, I would like to invite developers or programmers out there to design one.

17. The reasons why I believe it’s particularly important to test Ryzen 3rd Generation CPUs even at stock, and to independently verify the stability of the CPU, are as follows:

(a) First, because of how close these CPUs are operating to their maximum potential. As noted in this Anandtech write-up on AMD’s boost behaviour (at page 3), “….the CPU out of the box is already near its peak limits, and AMD’s metrics from manufacturing state that the CPU has a lifespan that AMD is happy with despite being near silicon limits…”.
(b) Second, apparently AMD would gradually increase the voltage over time to compensate for the effects of electromigration (ibid). But can we trust this algorithm to accurately compensate for the effects of electromigration?
 
Solution
I understand the original post, and the questions make sense in a way, ultimately boiling down to “how do you know a processor is stable if you can’t test it in a manner that stresses it similarly to normal use, due to how the boost behavior works?”

I would answer by saying “just use it”. If the system crashes a lot, it’s not stable. If it doesn’t crash, it’s good.

-AG

DrMrLordX

Lifer
Apr 27, 2000
21,637
10,855
136
Then you've completely missed my point...

Your point wasn't worth getting.

Also, Anandtech shouldn't bother writing articles more than ~500 words then

AT tends to write articles I actually want to read.

Could you summarise my post in 100 words then? I'd be grateful.

How about we address the issue of your 3950X having temp/power fluctuations first, and then go from there? That's the predicate for much of your post.
 

Makaveli

Diamond Member
Feb 8, 2002
4,723
1,058
136
I must say you guys handled this exceptionally well; this post is suspect.

No wonder I've been a member for like 18 years.

Well Done boys!
 

Micrornd

Golden Member
Mar 2, 2013
1,279
178
106
Just a question, since I haven't seen it asked or answered anywhere :confused_old:

When boosted above base turbo (all cores same speed) my Xeons are constantly shifting the cores that are boosted (I assume this is to even out thermal loads).
Do AMD processors perform the same way?
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,776
3,156
136
Just a question, since I haven't seen it asked or answered anywhere :confused_old:

When boosted above base turbo (all cores same speed) my Xeons are constantly shifting the cores that are boosted (I assume this is to even out thermal loads).
Do AMD processors perform the same way?
Is it changing boost, or is the scheduler moving threads that then result in different core clocks/boost?
 

Micrornd

Golden Member
Mar 2, 2013
1,279
178
106
The boosted cores are constantly changing. As an example, with 2 cores @ 3.9GHz and 22 @ 3.1GHz, the 2 @ 3.9GHz may be core #1 and core #22 one second and core #5 and core #16 the next, etc.
This is readily visible in HWiNFO64 for instance.
Based on the percentage of load per core (as shown in Windows 10 and CoreTemp), the OS scheduler is also moving threads.
But the thread movement all happens too fast for me to verify that the greatest thread count/percentage of load is assigned to and moved with the highest boosted cores as they switch constantly, and my old eyes don't move that fast. :oops:

My reasoning is that this is done to distribute the thermal load across the die when boosted above base turbo speeds, as the OS also constantly moves threads/loads from core to core at non-boosted speeds (again, I assume, to distribute thermal load across the die and thereby lower the package temp).
The threads/loads are still constantly moved from core to core at max base turbo speed (in my case all 24 cores @ 3.1GHz), which seems to reinforce the idea that thread/load movement is to reduce heat build-up in the die.

I'm just curious as to whether AMD processors do this also?
 

DrMrLordX

Lifer
Apr 27, 2000
21,637
10,855
136
Just a question, since I haven't seen it asked or answered anywhere :confused_old:

When boosted above base turbo (all cores same speed) my Xeons are constantly shifting the cores that are boosted (I assume this is to even out thermal loads).
Do AMD processors perform the same way?

Depends on the generation of the chip. If you're asking about Zen2/Matisse, then no, they do not perform that way.

Matisse CPUs have "good" cores and "not so good" cores. Ryzen Master can usually tell you which ones, and somehow the Windows (and Linux) schedulers also have this knowledge and use it to shift workloads accordingly. Matisse CPUs with two CCDs (3900x, 3950x) have a "good" and "bad" CCD, with better clocks possible on the "good" CCD and all-core clocks being mostly governed by the limits of the "bad" CCD for reasons having to do with the way the boost algorithm works. Even if you had two "good" CCDs on a single Matisse CPU, your all-core clocks wouldn't change much @ default since the boost maps would keep you at the same clocks/voltages/temps anyway. The use of a "bad" CCD that is only invoked in MT workloads makes sense. As an end-user, you lose nothing unless your goal is all-core overclocks (then you lose maybe 100-200 MHz over what you would get with two "good" CCDs).

Furthermore, unlike the Xeon, Matisse tends not to boost small numbers of cores above the pack when all cores are loaded. Your Xeon might do it, but Matisse instead tries to keep the highest MT clocks it can get on every core while remaining within its TDP limit (if any; PBO is broken though, so let's not discuss it here at length). If a smaller number of cores is loaded - let's say half of them or less - then the threads are biased heavily towards the "good" cores. For chips with two CCDs, that means the "good" cores on the "good" CCD see most of the activity. The scheduler won't shift work away from those cores very often to manage thermal load. The 3900x from which I'm posting this likes Cores 1 and 3 the most, so those see the most work. The second CCD is barely used at all.