Dual XEON 2696 v4 + Supermicro X10DAX Build - BSOD frequently - please help

traderjay

Senior member
Sep 24, 2015
220
165
116
I finally received the CPU and had a chance to install it in my new system using the SuperMicro X10DAX motherboard. Unfortunately, Windows 10 Pro is experiencing random and frequent BSODs with varying error messages, such as whea uncorrectable error, irql_not_less_or_equal, kernel mode trap, and KMODE_EXCEPTION_NOT_HANDLED.

I am using Crucial ECC RAM that is compatible with the board, along with a Seasonic 1000W Titanium PSU. I also tried running the system in single-CPU mode to try to isolate the problem, but the BSODs persist.

The CPU steppings are:

CPU1 - SR2J0 (reading this off the heatspreader)
CPU2 - Revision B00001B, Stepping 1, Model 4F, CPU Family 6

Any ideas, guys?
 

StefanR5R

Elite Member
Dec 10, 2016
5,498
7,786
136
No idea what's happening there.
You do have both 8 pin power connectors plugged in, right?
Are you sure both CPUs are at the same microcode version?
Have you run a memory tester?
Maybe try a live Linux on it and see whether or not that's stable.

Mine are SR2J0 as well, and Linux' /proc/cpuinfo shows the same microcode version as your CPU2:
Code:
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
stepping        : 1
microcode       : 0xb00001b
I run BIOS 2.0b, October 05 2016.
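
For anyone who wants to check the microcode question from a live Linux, here is a rough sketch that groups /proc/cpuinfo entries by socket and compares family, model, stepping and microcode. The SAMPLE text is made up for illustration (the second socket is an assumption, not output from either machine in this thread), and the function names are hypothetical. Note that "model : 79" in cpuinfo is the same value as "Model 4F" shown in hex by Windows tools.

```python
from collections import defaultdict

# Illustrative input only; on a real system read open("/proc/cpuinfo").read()
SAMPLE = """\
processor       : 0
physical id     : 0
cpu family      : 6
model           : 79
stepping        : 1
microcode       : 0xb00001b

processor       : 1
physical id     : 1
cpu family      : 6
model           : 79
stepping        : 1
microcode       : 0xb00001b
"""


def signatures_by_socket(cpuinfo_text):
    """Map each 'physical id' (socket) to the set of CPU signatures seen on it."""
    sockets = defaultdict(set)
    # /proc/cpuinfo separates logical processors with a blank line
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "physical id" not in fields:
            continue
        signature = (
            int(fields["cpu family"]),
            int(fields["model"]),
            int(fields["stepping"]),
            fields["microcode"],
        )
        sockets[fields["physical id"]].add(signature)
    return dict(sockets)


def sockets_match(sockets):
    """True when every logical CPU on every socket reports one identical signature."""
    all_signatures = set().union(*sockets.values())
    return len(all_signatures) == 1


sockets = signatures_by_socket(SAMPLE)
print(sockets)
print("matching:", sockets_match(sockets))
```

If the two sockets report different stepping or microcode values, sockets_match returns False and the printed dictionary shows which socket differs.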
 

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
Memtest and a BIOS update would be my first two ideas. Although I don't like the fact that you have mismatched steppings on your CPUs.
 

traderjay

Thanks all for the responses. Do I for sure have mismatched stepping? One CPU is on the table and all I can read is the printing on the heatspreader.

CPU1 - SR2J0 (reading this off the heatspreader)
CPU2 - Revision B00001B, Stepping 1, Model 4F CPU Family 6
 

evilr00t

Member
Nov 5, 2013
29
8
81
The last Xeons to need mismatched stepping support were the Sandy Bridge Xeons, so if your CPUIDs don't match, you're running an ES with a retail stepping of the chip. That's unsupported and will blow up.

whea uncorrectable error: bad hardware - post your crash dump (you should have a minidump from the WHEA uncorrectable error saved somewhere on your drive), I can decode MCEs. It may point to processor, memory, cache, QPI, or IIO issues.
irql_not_less_or_equal: driver fault
kernel mode trap: driver fault
KMODE_EXCEPTION_NOT_HANDLED: driver fault
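
The first-pass triage above can be written down as a small lookup. This is only a sketch: the bug check codes are the standard Windows values, the categories are the ones given in this post (a starting point, not a diagnosis), and mapping "kernel mode trap" to UNEXPECTED_KERNEL_MODE_TRAP is an assumption about which stop code was meant. The triage function name is hypothetical.

```python
# Stop code name -> (bug check code, first-pass category from the post above)
BUGCHECKS = {
    "WHEA_UNCORRECTABLE_ERROR": (0x124, "hardware"),
    "IRQL_NOT_LESS_OR_EQUAL": (0x0A, "driver"),
    "UNEXPECTED_KERNEL_MODE_TRAP": (0x7F, "driver"),
    "KMODE_EXCEPTION_NOT_HANDLED": (0x1E, "driver"),
}

# Assumption: the shorthand used in the thread refers to this stop code
ALIASES = {"KERNEL_MODE_TRAP": "UNEXPECTED_KERNEL_MODE_TRAP"}


def triage(name):
    """Return (code, category) for a stop code name, or None if it is not listed."""
    key = name.strip().upper().replace(" ", "_")
    key = ALIASES.get(key, key)
    return BUGCHECKS.get(key)


for reported in ["whea uncorrectable error", "irql_not_less_or_equal",
                 "kernel mode trap", "KMODE_EXCEPTION_NOT_HANDLED"]:
    code, category = triage(reported)
    print(f"{reported}: 0x{code:X} -> {category}")
```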
 

traderjay

Thanks for the help. I have some updates on stability testing:

I've confirmed via CPU-Z and HwInfo that both CPUs have identical steppings.

Last night I was able to stabilize the system by running in single-CPU mode, updating the NVIDIA drivers to the latest version, and setting performance mode to High. Adding the second CPU back to the motherboard resulted in BSODs again, so I bought a copy of HCI MemTest Pro and let it run; it completed 100% coverage with no errors.

I also let LinX run for more than 12 hours without any errors or crashes. At this stage, can I rule out hardware issues?

I've uploaded my crash dump files here - https://drive.google.com/drive/folders/0BwLj-DyWRhuzTWt3dnV0a3JtMjA?usp=sharing
 

evilr00t

9812 - nvidia driver bug
The rest: whea uncorrectable.
0xf200020000010005 - Proc 67, bank 0 - 14265
0xb200000000010005 - Proc 66, bank 0 - 14250
0xf2001b8000010005 - Proc 66, bank 0 - 14078
0xf200d88000010005 - Proc 66, bank 0 - 10765
Bank 0 is the CPU L1 instruction cache. Notice how all of the problems seem to happen on Proc 66 or 67 - they're likely the same core, different thread. I can't check because !smt does not work with your dumps. For 14265, 14078, and 10765, you have logged multiple machine checks (the status starts with 0xf instead of 0xb, meaning the overflow bit is set), and you have also logged many correctable errors, specifically 866 errors on 10765. The error is a parity error, so consider lowering clocks or reducing thermals and see if things become more stable. If you're undervolting or overclocking, stop!

tldr probably a bad processor. Try each processor separately in the first CPU socket alone, and run something like linpack on it to see if it blows up. Chances are, you probably have a bad chip.
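
For anyone who wants to reproduce the decode, here is a minimal sketch that splits the four status values above into fields. It assumes the architectural IA32_MCi_STATUS layout (bit 62 is the overflow flag, bits 52:38 hold the corrected error count, bits 15:0 hold the MCA error code, where 0x0005 is the internal parity error code). The function name and dictionary keys are made up for illustration, and this is not how WinDbg does it internally.

```python
def decode_mci_status(status):
    """Split a 64-bit machine-check bank status value into its main fields."""
    return {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),         # more errors arrived than could be logged
        "uncorrected": bool((status >> 61) & 1),
        "enabled": bool((status >> 60) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,                  # 0x0005 = internal parity error
    }


# Status values and dump names as listed in the post above
DUMPS = {
    "14265": 0xF200020000010005,
    "14250": 0xB200000000010005,
    "14078": 0xF2001B8000010005,
    "10765": 0xF200D88000010005,
}

for name, status in DUMPS.items():
    fields = decode_mci_status(status)
    print(name,
          "overflow" if fields["overflow"] else "single",
          "corrected:", fields["corrected_count"],
          "mca_code:", hex(fields["mca_code"]))
```

The leading 0xf versus 0xb nibble is just the overflow bit, and the corrected count for 10765 comes out as the 866 mentioned above.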
 

traderjay

Thanks so much for the response. I've been testing the system in the dual-CPU config with two instances of LinX (problem size 61474, memory 31 GB), and it's loading both CPUs at 100% with 99% memory consumption. So far it has just passed the 12-hour mark without problems.
 

Shmee

Memory & Storage, Graphics Cards Mod Elite Member
Super Moderator
Sep 13, 2008
7,400
2,437
146
I also suspect CPU, though running memtest would also be a good idea.
 

traderjay

I ran HCI MemTest for 24 hours, close to 200% coverage, without any issues. LinX ran for 35 hours straight without errors, and once I stopped the stress test the system went into a BSOD straight away. I've ordered an ASUS Z10PE-D16 WS to do additional testing.
 

Shmee

I would use bootable Memtest86+ if possible, for at least 4 passes.
 

traderjay

I read in the sticky posted on this forum that HCI MemTest is better than Memtest86+? Either way, I just swapped out the motherboard and the BSODs continue with the ASUS board.
 

traderjay

Post all BIOS settings.
I do have experience with these type setups and can help.

Thanks, and I will post my BIOS settings once the new CPUs arrive. I am now using the ASUS Z10PE-D16, which is very similar to your board.

I have another pair of brand new E5-2696V4s with identical stepping coming to do additional testing, while the seller of the first pair of CPUs has refunded the purchase. I figured the chances of 3 motherboards, 2 sets of RAM, 2 PSUs (Corsair & Seasonic) and two GPUs being bad is next to nil and the culprit might just be the CPU.

When I put the E5-2696V3 CPUs in the same three motherboards, no BSOD whatsoever happens.

By the way, how are you liking your motherboard and setup so far? Must perform like a champ! On the rare occasions when the two 2696V4s work, getting a Cinebench score in the high 5000s makes me smile :)