Question: Degrading Raptor Lake CPUs


Kocicak

Golden Member
Jan 17, 2019
1,074
1,131
136
I noticed some reports about degrading i9 13900K and KF processors.

I experienced this problem myself when I ran it at 6 GHz, light load (3 threads of Cinebench), at an acceptable temperature and non-extreme voltage. After only a few minutes it crashed, and then it could not run even at stock settings without bumping the voltage a bit.

I was thinking about the cause of this, and I believe the problem is that people do not appreciate how high these frequencies are, and that the real comfortable frequency limit of these CPUs is probably something like 5500 or 5600 MHz. These CPUs are made on the same process (possibly improved somehow) on which Alder Lake CPUs were made. See the frequencies the 12900KS runs at. The frequency improvement from the new process tweak may not be as high as some people presume.

Those 13900K CPUs are probably highly binned to find dies with cores that can reliably run at 5800 MHz. Some 13900Ks probably have little or no OC reserve left, and pushing them will cause them to degrade or break.

The conclusion for me is that the best thing you can do for your 13900K or 13900KF is to disable the 5800 MHz peak, which will allow you to offset the voltage lower, and then set the all-core maximum frequency to some comfortable level; I guess the maximum could be 5600 MHz. With the lowered voltage, this frequency should be gentler on the processor than running it at the original 5500 MHz at a higher voltage. You can also run it at lower frequencies, allowing for an even larger voltage drop, but then the CPU slowly loses its point (unless you want a high-efficiency CPU intended for heavy multithreaded loads).

Running it with a power limit matched to your cooling solution, to keep the CPU at a sensible temperature, will certainly help too.
 
Last edited:

alcoholbob

Diamond Member
May 24, 2005
6,311
357
126
What's the best way to underclock / undervolt the CPU? I wouldn't mind giving up 5-10% of performance or even more if it means I'll have a stable NAS / home server appliance. I don't need to squeeze out every last bit of performance in my use case.

Undervolting tends to increase performance, not decrease it. Generally doing an adaptive voltage offset while keeping CEP on is a good strategy.
 

Kocicak

Golden Member
Jan 17, 2019
1,074
1,131
136
Are we really comparing HEDT with a 14900KS?
Why not? Some "gaming service providers" apparently had no problem using the 14900K in their servers.

There is no reason to limit the frequency of workstation CPUs without cause, because it limits performance. If Intel feels they cannot run these CPUs, made on the same process and using the same cores (?), faster than 4800 MHz, they probably have some serious reason for doing so?
 
Last edited:
  • Like
Reactions: Joe NYC

cebri1

Senior member
Jun 13, 2019
350
373
136
Why not? Some "gaming service providers" apparently had no problem using the 14900K in their servers.

There is no reason to limit the frequency of workstation CPUs without cause, because it limits performance. If Intel feels they cannot run these CPUs, made on the same process and using the same cores (?), faster than 4800 MHz, they probably have some serious reason for doing so?

There is no consumer part with 16 P-cores. AMD has a consumer product and an HEDT product with the same number and type of cores, and the HEDT product is also clocked lower.

Why? Stability, thermals, power consumption, not needed for typical HEDT workloads, etc.
 

Hulk

Diamond Member
Oct 9, 1999
4,472
2,438
136
Are these CPUs made on the same process as normal desktop CPUs?

View attachment 106466

Intel seems to think that for reliable workstation operation these CPUs can run only up to 4.8 GHz. That is 1.4 GHz lower than what the 14900KS can run at.

Does anybody have experience with these CPUs? Does this 4.8 GHz limit have any significance, or is the "real limit" even lower, at 4.6 GHz?

The 5400 MHz limit I set on the new CPU suddenly seems way too high again... It is funny that everything seems to indicate that the "gut feeling" I got years ago, that this process is only good up to 5 GHz, was correct.
Are you having stability issues at 5.4? I've been fine at 5.5 for over a year.
 

Kocicak

Golden Member
Jan 17, 2019
1,074
1,131
136
Are you having stability issues at 5.4? I've been fine at 5.5 for over a year.
The CPU is brand new; it would not have any problems even at Intel's stock breakneck speeds right now. I would like to get a few years of completely stable operation out of this CPU, so I am not sure whether 5.4 is too high. I was running the previous 14900K with a 5 or 5.2 GHz limit most of the time.
 

Hulk

Diamond Member
Oct 9, 1999
4,472
2,438
136
The CPU is brand new; it would not have any problems even at Intel's stock breakneck speeds right now. I would like to get a few years of completely stable operation out of this CPU, so I am not sure whether 5.4 is too high. I was running the previous 14900K with a 5 or 5.2 GHz limit most of the time.
I would be more concerned with voltage and heat. I have decided on a manual max voltage of 1.3V, which is about 1.15V under heavy load. Temps are always under 75C with my cooling. I can get 5.5GHz without HT stable in all my apps with that setting. Personally I think anything over 1.3V is too much for long-term operation.
 
  • Like
Reactions: DAPUNISHER

Kocicak

Golden Member
Jan 17, 2019
1,074
1,131
136
Gee, good thing Intel capped voltage at 1.55v . . .
What they did now prolongs the time before failure by a few months; I am not sure it is even half a year. They are not solving the problem for consumers, they are just minimising the short-term damage to the company.
 
Last edited:

Kocicak

Golden Member
Jan 17, 2019
1,074
1,131
136
I have decided on a manual max voltage of 1.3V ... I can get 5.5GHz
You DO NOT WANT to limit both voltage and frequency!!!

See this example: if I wanted to cap the voltage at 1.2 V, I would set ONLY a frequency limit, and I would retain the whole 150 mV voltage safety margin for stability (green).

cpu limits safety margin.png

If you decide to both limit the voltage at 1.3 V and select a frequency at which the CPU is stable at the moment, say 5.6 GHz, you will be left with just a fraction of the original voltage safety margin. Over time, as the CPU slowly and naturally degrades, the voltage needed for stability rises until it reaches your voltage limit, and you may start getting instability without really noticing it, such as data corruption, etc. You do not want that.
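To make the numbers concrete, here is a toy Python sketch of the idea. Every figure in it (the starting Vmin, the drift per year, the 150 mV stock margin, the 1.30 V ceiling) is a made-up illustrative assumption, not measured data:

# Toy model: voltage headroom left as Vmin slowly shifts up with age.
# All numbers are made-up illustrative assumptions, not measurements.
VMIN_TODAY_MV = 1250    # millivolts the chip needs today at the chosen frequency (assumed)
DRIFT_MV_PER_YEAR = 10  # assumed upward Vmin shift per year of use
STOCK_MARGIN_MV = 150   # assumed stock safety margin above today's Vmin

applied_freq_cap_only = VMIN_TODAY_MV + STOCK_MARGIN_MV  # stock voltage request, full margin kept
applied_both_caps = 1300                                 # manual 1.30 V ceiling at that frequency

for year in range(7):
    vmin = VMIN_TODAY_MV + DRIFT_MV_PER_YEAR * year
    print(f"year {year}: freq-cap-only margin {applied_freq_cap_only - vmin} mV, "
          f"volt+freq-cap margin {applied_both_caps - vmin} mV")

With these made-up numbers, the fixed 1.3 V ceiling is out of margin after about five years of drift, while the frequency-cap-only setup still has around 100 mV left, which is the whole point of keeping the stock margin intact.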
 
Last edited:

Hulk

Diamond Member
Oct 9, 1999
4,472
2,438
136
You DO NOT WANT to limit both voltage and frequency!!!

See this example: if I wanted to cap the voltage at 1.2 V, I would set ONLY a frequency limit, and I would retain the whole 150 mV voltage safety margin for stability (green).

View attachment 106516

If you decide to both limit the voltage at 1.3 V and select a frequency at which the CPU is stable at the moment, say 5.6 GHz, you will be left with just a fraction of the original voltage safety margin. Over time, as the CPU slowly and naturally degrades, the voltage needed for stability rises until it reaches your voltage limit, and you may start getting instability without really noticing it, such as data corruption, etc. You do not want that.
But in reality the CPU is almost never at the selected voltage, because of C-states when not under load.
 

Hulk

Diamond Member
Oct 9, 1999
4,472
2,438
136
Rendering with Magix Vegas Pro 21, using Voukoder, to x265.

Totally stable running 5.5/4.4 no HT at 170ish watts under 80C.

Posting for those who might be looking for good Raptor settings.

Been running this way over a year on air with no issues.

1726670472332.png
 
  • Like
Reactions: lightmanek

IGBT

Lifer
Jul 16, 2001
17,961
140
106
I downloaded Speccy... my XPS 8960 / i9-13900 has the 0x129 update. I'm seeing idle temps at approx. 38 degrees Celsius / 100.4 degrees Fahrenheit. Is this considered normal after the 0x129 update? (Ambient room temp is 78 degrees Fahrenheit and my XPS has Dell-installed liquid cooling.) Are there any reviews of the before/after results of the 0x129 update?
 

coercitiv

Diamond Member
Jan 24, 2014
6,631
14,065
136
So they finally identified the root cause, and as we speculated on this thread before, we had a perfect storm in terms of factors that accelerated the process:
Intel® has localized the Vmin Shift Instability issue to a clock tree circuit within the IA core which is particularly vulnerable to reliability aging under elevated voltage and temperature. Intel has observed these conditions can lead to a duty cycle shift of the clocks and observed system instability.

Intel® has identified four (4) operating scenarios that can lead to Vmin shift in affected processors:
  1. Motherboard power delivery settings exceeding Intel power guidance.
  2. eTVB Microcode algorithm which was allowing Intel® Core™ 13th and 14th Gen i9 desktop processors to operate at higher performance states even at high temperatures.
  3. Microcode SVID algorithm requesting high voltages at a frequency and duration which can cause Vmin shift.
  4. Microcode and BIOS code requesting elevated core voltages which can cause Vmin shift especially during periods of idle and/or light activity.
 

Kocicak

Golden Member
Jan 17, 2019
1,074
1,131
136
Intel posted an update on the 13th/14th gen stability issues and referenced their next microcode update. They’re saying this next update doesn’t really change performance; it’s within run-to-run variability.
... Intel can now confirm the root cause diagnosis for the issue. ...
Intel has localized the Vmin Shift Instability issue to a clock tree circuit within the IA core which is particularly vulnerable to reliability aging under elevated voltage and temperature. Intel has observed these conditions can lead to a duty cycle shift of the clocks and observed system instability.
So they finally found the part of the silicon particularly sensitive to degradation? That would be good.

Intel® has identified four (4) operating scenarios that can lead to Vmin shift in affected processors:

  1. Motherboard power delivery settings exceeding Intel power guidance.
    a. Mitigation: Intel® Default Settings recommendations for Intel® Core™ 13th and 14th Gen desktop processors.
  2. eTVB Microcode algorithm which was allowing Intel® Core™ 13th and 14th Gen i9 desktop processors to operate at higher performance states even at high temperatures.
    a. Mitigation: microcode 0x125 (June 2024) addresses eTVB algorithm issue.
  3. Microcode SVID algorithm requesting high voltages at a frequency and duration which can cause Vmin shift.
    a. Mitigation: microcode 0x129 (August 2024) addresses high voltages requested by the processor.
  4. Microcode and BIOS code requesting elevated core voltages which can cause Vmin shift especially during periods of idle and/or light activity.
    a. Mitigation: Intel® is releasing microcode 0x12B, which encompasses 0x125 and 0x129 microcode updates, and addresses elevated voltage requests by the processor during idle and/or light activity periods.
Regarding the 0x12B update, Intel® is working with its partners to roll out the relevant BIOS update to the public.
...
Next Steps
For all Intel® Core™ 13th/14th Gen desktop processor users:
the 0x12B microcode update must be loaded via BIOS update and has been distributed to system and motherboard manufacturers to incorporate into their BIOS. Intel is working with its partners to encourage timely validation and rollout of the BIOS update for systems currently in service. This process may take several weeks.
...

The real reason for the quick degradation is missing: frequencies that are too high, which are the underlying cause of the elevated temperature and voltage and the resulting high electric current density.

I am not convinced that even a brand new CPU running the 0x12B microcode will work reliably for many years at those extreme frequencies.
 
Last edited:

KompuKare

Golden Member
Jul 28, 2009
1,183
1,470
136
So they finally identified the root cause, and as we speculated on this thread before, we had a perfect storm in terms of factors that accelerated the process:
I'm also inclined to think that the 5th cause, not listed, was running the CPUs too fast, even if parts other than the top i9 are affected. That, plus inadequate binning of that part of the die.

Anyway, of the four causes listed, presumably not all contribute equally to the degradation, so why did Intel lead with the part for which motherboard vendors are to blame?!
 

coercitiv

Diamond Member
Jan 24, 2014
6,631
14,065
136
I'm also inclined to think that the 5th cause, not listed, was running the CPUs too fast, even if parts other than the top i9 are affected. That, plus inadequate binning of that part of the die.
The 14600K has a top frequency of 5.4 GHz, the 12900KS reaches 5.5 GHz. The first CPU is subject to Vmin shift (albeit "slower" degradation), the second is not.

They made multiple bad choices when it came to keeping voltage and temperature under control and all of them combined nuked the (overly?) sensitive circuit in the IA cores. I would argue all of the bad choices can be considered catalysts: no limits for power/current, eTVB disabled and/or bugged, voltage spikes from microcode etc. High frequency indirectly exacerbates all of the above, by amplifying the effects of badly configured voltage and temp limits.

I would put high frequency as the 5th cause if fixing the first four is not enough to ensure both reliability and max boost clocks.

Anyway, of the four causes listed, presumably not all contribute equally to the degradation, so why did Intel lead with the part for which motherboard vendors are to blame?!
Because they want to remind us how they failed to enforce safe settings on premium motherboards aimed at high spenders and prosumers, essentially destroying their brand in the eyes of their biggest spenders? Intel inside, they used to say. Now it's Asus, Gigabyte, Acer, MSI inside :)
 

Hulk

Diamond Member
Oct 9, 1999
4,472
2,438
136
Thermal Velocity Boost ignoring temps is a big no-no. I had always wondered why these things were boosting at temps over 70C. I don't think that was a bug at all. I think it was "marketing." Now they are slowly walking back some bad decisions and calling them "bugs."
 

DAPUNISHER

Super Moderator CPU Forum Mod and Elite Member
Super Moderator
Aug 22, 2001
29,572
24,451
146
The 14600K has a top frequency of 5.4 GHz, the 12900KS reaches 5.5 GHz. The first CPU is subject to Vmin shift (albeit "slower" degradation), the second is not.
It may be too early to sound the all-clear for the KS. They stopped making it, after all. Using 13th gen as a touchstone, the discontinuation of LGA 1700 SKUs is a bad sign. It's only been 2.5 years since it hit the shelves. Perhaps it'll just take an extra year or two to crap the bed compared to Raptor? Intel probably sold a fairly limited number, so it is unlikely failure complaints will ever be widespread enough to raise red flags either. /hot take
Because they want to remind us how they failed to enforce safe settings on premium motherboards aimed at high spenders and prosumers, essentially destroying their brand in the eyes of their biggest spenders? Intel inside, they used to say. Now it's Asus, Gigabyte, Acer, MSI inside :)
That's a great way to put it; take all the credit, but share all the blame.
 

Kocicak

Golden Member
Jan 17, 2019
1,074
1,131
136
You DO NOT WANT to limit both voltage and frequency!!!

See this example: if I wanted to cap the voltage at 1.2 V, I would set ONLY a frequency limit, and I would retain the whole 150 mV voltage safety margin for stability (green).

View attachment 106516

If you decide to both limit the voltage at 1.3 V and select a frequency at which the CPU is stable at the moment, say 5.6 GHz, you will be left with just a fraction of the original voltage safety margin. Over time, as the CPU slowly and naturally degrades, the voltage needed for stability rises until it reaches your voltage limit, and you may start getting instability without really noticing it, such as data corruption, etc. You do not want that.

I found this five-year-old interview, in which Guy Therien from Intel talks about degradation, or as he calls it, "wearing out":

https://www.anandtech.com/show/1458...ng-an-interview-with-intel-fellow-guy-therien

When we sell our parts, in respect to our internal tools, we do modelling to determine how long they expected to last. There is a mathematical wear out expression that is our spec. That spec is based upon projected wear of the CPU, and how long we think typical parts or worst case (or most used) parts will spend in turbo. Our internal datasheets specify the percentage of the parts are projected to last a certain amount of time with what workload. So our users understand that, they'll buy our parts (we’re talking the really high-end server folks), and they’ll say that they understand that there is a limit on how long they will last when used under the conditions Intel have specified – but that they won’t be using it under those conditions. They ask us that if they put it in turbo and leave it in turbo, 24 hours a day, 7 days a week, how long will it last? They look at us, they understand what we sell and at what price, but they ask us how long our products last under their specific conditions. When this first started happening, we said we didn’t know, but we would try and figure it out.

So an effort went underway to try to measure the wear of these systems. Just to be clear, all systems slowly wear out and become slower / need a higher voltage for the same frequency over time. What we do, as what everyone in the industry does, is add some voltage, a wear-out margin, to ensure that the part continues to operate in spec over a specific lifetime of the part. So you can measure how much voltage the parts need as they wear out over time, and hopefully figure out when parts are wearing out (if they wear out at all, as some don’t wear out very much at all). They wanted to know if we could assess this offline and give them an indicator of when a part was going to wear out. It turned out that there was a long effort to try to do this. As server availability has to be up like 99.999% of the time, ultimately the project was unsuccessful. We had false positives and false negatives and we couldn’t tell them exactly when each specific part was going to wear out, and when we did tell them it wasn’t going to wear out, it eventually did. So it’s a very difficult task, right? So I learned about this effort, and one of the revelations that the team had was in order to improve their accuracy they couldn’t take measures while an OS was running, because of the variability caused by the OS, due to interrupts and other background processes, so they learned to do the measurements in an environment offline outside the OS.
What I got from this:

Degradation is a process that can affect CPUs even within a time span of a few years, as is evident from the discussion with the customers.

The idea many people apparently still hold that CPUs should last decades is simply WRONG.

The second takeaway is that the life of the CPU really depends a lot on the frequency the CPU runs at.

What surprised me is that when they wanted to get accurate predictions, they needed to make the measurements outside of the OS, as it alone could affect the life of the CPU.


So no, please do not undervolt or set the CPU in a way that could lower the "wear-out voltage margin".

Intel insisting on running the CPUs at those extreme frequencies PREVENTS SOLVING the degradation problem.

As Intel apparently knows a lot about these CPUs, they know what frequencies they should be running at to ENSURE LONG-TERM STABILITY, and they should publish these frequencies for people who voluntarily want to decrease the performance of their CPUs in order to ensure their long-term reliable operation.


I am really pissed off that they are leaving people to grope in the dark and guess if they should limit their CPUs to 5400, 5000 or even 4600 MHz.
 

Jan Olšan

Senior member
Jan 12, 2017
404
710
136
I found this five-year-old interview, in which Guy Therien from Intel talks about degradation, or as he calls it, "wearing out":

https://www.anandtech.com/show/1458...ng-an-interview-with-intel-fellow-guy-therien


What I got from this:

Degradation is a process that can affect CPUs even within a time span of a few years, as is evident from the discussion with the customers.

The idea many people apparently still hold that CPUs should last decades is simply WRONG.

The second takeaway is that the life of the CPU really depends a lot on the frequency the CPU runs at.

What surprised me is that when they wanted to get accurate predictions, they needed to make the measurements outside of the OS, as it alone could affect the life of the CPU.


So no, please do not undervolt or set the CPU in a way that could lower the "wear-out voltage margin".

Intel insisting on running the CPUs at those extreme frequencies PREVENTS SOLVING the degradation problem.

As Intel apparently knows a lot about these CPUs, they know what frequencies they should be running at to ENSURE LONG-TERM STABILITY, and they should publish these frequencies for people who voluntarily want to decrease the performance of their CPUs in order to ensure their long-term reliable operation.


I am really pissed off that they are leaving people to grope in the dark and guess if they should limit their CPUs to 5400, 5000 or even 4600 MHz.
This post is hugely incorrect.

Frequency is not really accelerating (or even causing) the wear and degradation; it's the voltage that the frequency requires. A bleeding-edge boost frequency like 6 GHz does not happen in isolation; it requires voltages that are close to being able to quickly damage the chip. It's the voltage that is doing the harm.
There is likely a non-linear curve where a voltage like 1.0 V could keep the chip operating practically forever (as in, decades), >1.5 V causes quite accelerated aging, and say 1.8 V can kill the chip instantly. (At room temperature; it changes when temps are low/subzero/LN2.)


What the Intel dude meant by the voltage margin is that when they find the CPU needs 1.1 V to be stable at the required clock, they will sell it with a 1.2 V stock voltage, so that as the silicon degrades over time and the actually required voltage becomes, say, 1.15 V after 6 years, the CPU still works fine. (Remember how people undervolting GPUs rant that company X does poor binning if you can undervolt a GPU? Perhaps this is part of the reason. It also means that those undervolts will likely become unstable over time and need dialing back.)

So if you undervolt, you lower the margin your processor has left against that inherent aging of the silicon. You are close to the initially required voltage, so any increase in the required voltage trips you up.

However, that doesn't mean the undervolt was harmful. It just makes you notice the aging sooner (and if you readjust your undervolt, things will continue to work). Actually, since the aging gets faster with voltage, that means that if you undervolted at the start, you likely slowed the aging down. If it was a significant reduction, you could even slow it down so much that you won't realistically see the effects of aging before you stop using the PC.

Of course, it gets a bit more complicated because the voltage of current CPUs is not constant: they run at lower voltages in multicore loads, but then ramp to those 1.5 V levels during brief single-thread boosting when you browse the web. The undervolt margins are different between those states, I think, but generally, by lowering the voltage you can only make the CPU age/degrade less, not more.
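To give a feel for how strong that non-linearity can be, here is a purely illustrative Python sketch of an assumed exponential voltage-acceleration model. The reference voltage and the scale constant are made-up numbers for the sake of the example; Intel's real reliability models are not public:

import math

# Assumed toy model: relative aging rate grows exponentially with core voltage.
# V_REF is the baseline we normalise to; V_SCALE sets how fast the rate grows.
# Both values are made-up for illustration, not taken from any datasheet.
V_REF = 1.00     # volts, assumed "ages extremely slowly" baseline
V_SCALE = 0.08   # volts per e-fold increase of the aging rate (assumed)

def relative_aging_rate(voltage):
    # Aging rate relative to the V_REF baseline (illustrative only).
    return math.exp((voltage - V_REF) / V_SCALE)

for v in (1.00, 1.20, 1.30, 1.40, 1.50):
    print(f"{v:.2f} V -> roughly {relative_aging_rate(v):.0f}x the baseline aging rate")

The exact numbers mean nothing, but the shape is the point: in a model like this, shaving even 50-100 mV off the voltage that an aggressive boost clock demands cuts the aging rate by a large factor, which is why undervolting tends to slow degradation down rather than cause it.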
 
Last edited:

Jan Olšan

Senior member
Jan 12, 2017
404
710
136
I'm also inclined to think that the 5th cause, not listed, was running the CPUs too fast, even if parts other than the top i9 are affected. That, plus inadequate binning of that part of the die.

Anyway, of the four causes listed, presumably not all contribute equally to the degradation, so why did Intel lead with the part for which motherboard vendors are to blame?!
Well, they probably don't mind if people keep blaming the motherboard vendors (and I think there are types of Intel fans out there that will).
But I think the list could actually simply be chronological; it shows the development of the analysis and the solutions Intel has come up with.

The power limits (and the related Intel ACTUAL Default Settings countermeasure) were the first thing they went public about. Then they went public with the TVB issue (late June or something like that), and then came the big announcement of the whole high-voltage problem and the 1.55 V request cap as the countermeasure.