Question Machine Check Exception Event 46 and 3B crashes

LordMangas

Junior Member
Apr 12, 2022
12
1
11
Hi forum

First type of crash: Event 46. The component is listed as Memory and the error as a Memory Check Exception. The PC freezes in game and reboots.

Second type of crash: while browsing in Windows (YouTube etc.), the PC freezes and reboots. This one is a crash with code 3B.

These memory check exception crashes are recent. I have undervolted with PBO2, and I have a Ryzen 5800X with a 6700XT.

The motherboard's red lights show BOOT and VGA.

I've posted the WinDbg crash dumps below in case anyone can tell me what the issue is.


SYSTEM_SERVICE_EXCEPTION (3b)
An exception happened while executing a system service routine.
Arguments:


PROCESS_NAME: procexp64.exe





STACK_TEXT:
ffff800a`b6587590 fffff800`716251e4 : 00000000`00000003 ffff800a`b6587b20 00000000`00000000 00000000`00000004 : dxgkrnl!DxgkVidMmAllowFailOnOfferReclaimErrors+0xb85
ffff800a`b6587600 fffff800`715740cb : ffffa989`637ec080 ffffa989`637ec080 00000000`00000004 ffff73fd`1a200000 : dxgkrnl!NtGdiDdDDIQueryClockCalibration+0x6b4
ffff800a`b6587a70 fffff800`63a28a78 : ffffa989`637ec080 ffff800a`b6587b20 00000000`000008fc ffffa989`00000000 : dxgkrnl!NtGdiDdDDIQueryStatistics+0xb


SYMBOL_NAME: dxgmms2!VidSchQueryProcessNodeStatistics+30

MODULE_NAME: dxgmms2

IMAGE_NAME: dxgmms2.sys

FAILURE_BUCKET_ID: AV_dxgmms2!VidSchQueryProcessNodeStatistics


WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
nt!_WHEA_ERROR_RECORD structure that describes the error condition. Try !errrec Address of the nt!_WHEA_ERROR_RECORD structure to get more details.
Arguments:
Arg1: 0000000000000000, Machine Check Exception

MODULE_NAME: AuthenticAMD

IMAGE_NAME: AuthenticAMD.sys

STACK_COMMAND: .cxr; .ecxr ; kb

FAILURE_BUCKET_ID: 0x124_0_AuthenticAMD_MEMORY__UNKNOWN_FATAL_IMAGE_AuthenticAMD.sys
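
When comparing several dumps like the two above, it can help to pull out just the bugcheck code and the summary fields. The following is a minimal sketch, assuming the text is pasted exactly in the "KEY_NAME: value" layout shown in this post; the function name `summarize_dump` and the shortened sample are illustrative only and not part of WinDbg.

```python
import re

# Matches summary lines of the form "KEY_NAME: value".
FIELD_RE = re.compile(r"^([A-Z_]+):\s+(\S.*)$")
# Matches a bugcheck header such as "WHEA_UNCORRECTABLE_ERROR (124)".
BUGCHECK_RE = re.compile(r"^([A-Z_]+) \(([0-9a-fA-F]+)\)$")


def summarize_dump(text):
    """Split pasted WinDbg analysis text into one dict per bugcheck."""
    reports = []
    for raw in text.splitlines():
        line = raw.strip()
        header = BUGCHECK_RE.match(line)
        if header:
            reports.append({"name": header.group(1),
                            "code": int(header.group(2), 16),
                            "fields": {}})
            continue
        field = FIELD_RE.match(line)
        if field and reports:
            # Attach the field to the most recent bugcheck header.
            reports[-1]["fields"][field.group(1)] = field.group(2)
    return reports


# Shortened copy of the two reports from this post.
sample = """\
SYSTEM_SERVICE_EXCEPTION (3b)
PROCESS_NAME: procexp64.exe
MODULE_NAME: dxgmms2
FAILURE_BUCKET_ID: AV_dxgmms2!VidSchQueryProcessNodeStatistics

WHEA_UNCORRECTABLE_ERROR (124)
MODULE_NAME: AuthenticAMD
FAILURE_BUCKET_ID: 0x124_0_AuthenticAMD_MEMORY__UNKNOWN_FATAL_IMAGE_AuthenticAMD.sys
"""

for report in summarize_dump(sample):
    print(hex(report["code"]), report["name"], report["fields"]["MODULE_NAME"])
```

This only groups what the debugger already printed. It makes it easier to see that the 3B crash points at a graphics kernel module while the 124 crash is reported against the CPU, but it does not add any diagnosis of its own.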
 
Jul 27, 2020
15,749
9,812
106

At the end of that page:

[Attached image: 1649815914734.png]

Looks bad. You may have to RMA the CPU.
 

Shmee

Memory & Storage, Graphics Cards Mod Elite Member
Super Moderator
Sep 13, 2008
7,381
2,415
146
Have you tried resetting the undervolt? If you undervolt too far, that is bound to happen.
 

In2Photos

Golden Member
Mar 21, 2007
1,600
1,637
136
I ran into some crashing on my son's 5600X build. It ran benchmarks fine, and a few games also ran without any problems, but then some other games started crashing. A BIOS update fixed his issue.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
MCE => machine check exception: basically the hardware has detected that some horrible error is happening (like an uncorrectable cache hierarchy error) and throws in the towel.

An unstable "undervolt" will result in errors like these. When undervolting, a CPU can sometimes be perfectly fine under heavy AVX2 load and then crash at idle while browsing the web, because it is not being fed enough voltage when transitioning between low clocks etc.
Usually a good test to confirm this theory is disabling C-states in BIOS (but beware of high idle power use, and don't forget to restore it). I recently had a system running a 9700K start throwing MCE errors in cache and correctly deduced that it was due to chip degradation, with the cache no longer happy with idle voltage and transitions between deeper sleep modes. I fixed it by passing "intel_idle.max_cstate=3" in kernel params; it has worked perfectly for about 4 months now, when it would soil its pants every week before.

Or simply reduce the undervolt by 1 step and hope it sticks; that's what I'd do.
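
For the Linux workaround mentioned above, it is worth confirming that the parameter actually reached the kernel after a reboot. A minimal sketch, assuming a Linux system using the intel_idle driver (so it does not apply to the Windows/Ryzen machine in this thread); the helper name `max_cstate_from_cmdline` and the sample command line are made up for illustration.

```python
def max_cstate_from_cmdline(cmdline):
    """Return the intel_idle.max_cstate value from a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "intel_idle.max_cstate" and sep and value.isdigit():
            return int(value)
    return None


# Hypothetical command line for illustration; on a running Linux system the
# real one can be read from /proc/cmdline.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro intel_idle.max_cstate=3"
print(max_cstate_from_cmdline(example))
```

If the function returns None for the contents of /proc/cmdline, the bootloader entry was probably not updated and the C-state limit is not in effect.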
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,712
142
106
Yes, MCE is machine check exception, not "memory" check exception.
It could just as likely be any component of the CPU, which could include the UMC, IF (IOD and/or CCD), cache, etc.
Really, the error is not descriptive enough to be certain of anything. But the fact that you have modified some CPU settings with PBO means anything is possible.
 

LordMangas

Both crashes still happen even with Global C-State disabled.
If these crashes were due to the undervolt, Event Viewer would always show a Processor Core error, with the WHEA APIC ID telling me which core crashed. That is not what's happening. Event Viewer is telling me "Memory" is crashing. I don't know if it's the memory in the CPU, the RAM, or the GPU.
 

LordMangas

I suspect it is the CPU, but before doing any RMA I would like to diagnose whether the CPU really is at fault. How can I test whether the undervolt might be causing this?
 

Soulkeeper

To be honest here, I'm having similar problems with my 5950X.
It's producing MCE errors every 6+ hours or so.
I've kept adjusting voltages and retrying for the past 2 weeks, and they won't go away.

My error messages are slightly different from yours, however:
ras-mc-ctl --errors |tail
137 2022-04-13 07:38:43 -0700 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=1, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x41e432940, misc=0xd01a000101000000, walltime=0x6256e073, cpuid=0x00a20f12, bank=0x00000012
138 2022-04-13 15:27:22 -0700 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x4d0493d00, misc=0xd01a000101000000, walltime=0x62574e4a, cpuid=0x00a20f12, bank=0x00000011
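
The status and walltime fields in these log lines can also be decoded by hand, which gives a second opinion independent of the text labels the tool prints. A rough sketch follows. The bit positions are my understanding of the AMD64 MCA status register layout (the same positions the Linux kernel names MCI_STATUS_*), so treat them as an assumption and check them against the documentation for your CPU family. The function names are illustrative.

```python
from datetime import datetime, timezone

# Assumed bit positions in the 64-bit MCA status value (see note above).
STATUS_BITS = {
    "Val": 63,       # register holds a valid error
    "Overflow": 62,  # an earlier error was overwritten
    "UC": 61,        # uncorrected error
    "En": 60,        # reporting enabled for this error
    "MiscV": 59,     # the misc field is valid
    "AddrV": 58,     # the addr field is valid
    "PCC": 57,       # processor context corrupt
    "CECC": 46,      # corrected by ECC
    "UECC": 45,      # ECC could not correct the error
}


def decode_status(status):
    """Return the flags set in an MCA status value plus the low 16-bit error code."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}
    flags["ErrorCode"] = status & 0xFFFF
    return flags


def walltime_utc(walltime):
    """Interpret the walltime field as Unix seconds and return a UTC datetime."""
    return datetime.fromtimestamp(walltime, tz=timezone.utc)


# Values copied from log entry 137 above.
flags = decode_status(0x9C2040000000011B)
print([name for name, is_set in flags.items() if is_set is True])
print(hex(flags["ErrorCode"]), walltime_utc(0x6256E073).isoformat())
```

For entry 137 this reports CECC set with UC, UECC and PCC clear, which agrees with the "Corrected error" text, and the walltime converts to 14:38:43 UTC on 2022-04-13, matching the logged 07:38:43 -0700.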

I looked in the AMD PDF datasheets for what banks 17 and 18 are, and they just list "Reserved". Not even documented. So the entire output of ras-mc-ctl is suspect: it says "Unified Memory Controller" and always "CPU 2", but this could be wrong.
It's also worth noting that I have ECC memory and have not seen a single edac-utils ECC error reported for my memory. I can also pass memtest86 runs over and over without errors.

Your errors are popping up as "UNCORRECTABLE" and mine are popping up as CECC, i.e. corrected ECC.
But I have also experienced several random reboots.

Short of having defective hardware, my gut is telling me that both our problems relate to voltages.
Possibly memory timings or signal integrity.

Even if the memory itself is stable, the UMC, Infinity Fabric (VDDG IOD and CCD), DDR PHY, other parts of the SoC, CPU cache, etc. might not be able to handle certain memory timings, clocks, CAD, etc.

PBO2 and Curve Optimizer are perhaps the biggest PITA I've ever experienced with a CPU in my 20+ years of hobbying around.
 

Soulkeeper

I suggest disabling PBO2 and Curve Optimizer (officially not supported or guaranteed to work). Use AMD's officially supported settings (AUTO or NORMAL) for most other settings and try to rule out a defective CPU.

Although "officially supported" is kinda hard to pinpoint with an automatically overclocking design like this one and with motherboard-specific settings varying.
 

LordMangas

If it is really due to voltages, is the fix to disable PBO2 and Curve Optimizer?

I had a search on Google, and 3 people with the exact same Event 46 error as me fixed the issue by replacing the CPU. Of course that isn't me, so I might have another problem. I already tried dropping the RAM down to 3200 MHz with 1600 MHz FCLK and got the same crashes, so I don't think it is the RAM sticks. I am going to try disabling PBO2 and Curve Optimizer; however, the reason I undervolted in the first place is thermal issues on the 5800X, so undoing the Curve Optimizer goes against my original plan.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,478
14,434
136
What is the RAM voltage? 3200 needs about 1.35-1.375 V to be stable.
 

Markfw

On XMP Profile after double checking, it was on 1.35v so no stability issue there.
However, that brings up a new possibility. XMP does not always get the settings right. What are the exact specs of your memory?

For example, most of mine is Samsung B-die 3200, 14-14-14-34 at 1.35 V. I set it to 3200/1600/1T/1.35 V and make sure the timings are correct.

Here:
 

moinmoin

Diamond Member
Jun 1, 2017
4,933
7,619
136
"I had a search on Google, and 3 people with the exact same Event 46 error as me fixed the issue by replacing the CPU."
So by that account you're pretty close to RMAing your CPU anyway. If I were you I wouldn't waste any more time. The last thing you could try is to reset everything and test whether that's stable. If it is, you could use it for some more time. If not (or if you don't want to run non-optimized defaults, in which case you don't need to test that), do an RMA request.
 

Soulkeeper

Personally, I'd rather know what the issue was with some certainty before doing an RMA.
I read a Reddit post where the guy returned his CPU for RMA and the new one gave the exact same MCE error.
It can be time-consuming, but really all we can do is try eliminating the components one by one to determine which one is having the problem. Unfortunately I have no experience with Windows, but I'd think there is a way to get a more detailed error report than what's in your first post.
When I was dialing in my RAM settings, I first set my CPU to an all-core clock with all the PBO stuff disabled.

You have to find some state where everything is fully stable and work from there (if at all possible).
You can try running your RAM one step slower in MHz just to see if it is stable. Is your FCLK/UCLK running 1:1 with MCLK?
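
As a quick illustration of the 1:1 question: the memory clock (MCLK) is half the DDR data rate, so the 3200 setting mentioned earlier in the thread corresponds to 1600 MHz, and 1:1 means FCLK and UCLK are also at 1600 MHz. A small sketch of that arithmetic; the function name and the 3600 comparison case are hypothetical and not taken from anyone's actual BIOS settings.

```python
def memory_clocks(ddr_rate_mts, fclk_mhz, uclk_mhz=None):
    """Return MCLK in MHz and whether FCLK (and UCLK, if given) run 1:1 with it."""
    mclk_mhz = ddr_rate_mts / 2  # DDR transfers data twice per memory clock
    clocks = [fclk_mhz] + ([uclk_mhz] if uclk_mhz is not None else [])
    return mclk_mhz, all(clock == mclk_mhz for clock in clocks)


print(memory_clocks(3200, 1600))        # the 3200 / 1600 FCLK setup from the thread
print(memory_clocks(3600, 1600, 1800))  # hypothetical mismatch: FCLK lags MCLK
```

The values to plug in are whatever the BIOS or a monitoring tool reports for the memory speed, FCLK and UCLK.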

EDIT: Post some pictures of your BIOS settings (all voltages in particular).