Data integrity with a MCE overclocked high core count E5 Xeon?

cbn · Mar 24, 2015

The Xeons and MCE thread got me wondering what would happen to data integrity if MCE overclocking really does turn out to be possible on the various Haswell E5 Xeons?

In the case of enabling MCE on the E5-2699 v3 (18 cores/36 threads with 3.6 GHz turbo and 2.3 GHz base clock), this would amount to a 56% overclock for operations involving all cores.

In the case of enabling MCE on the E5-2690 v3 (12 cores/24 threads with 3.5 GHz turbo and 2.6 GHz base clock), this would amount to a 35% overclock for operations involving all cores.
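As a quick check on those two percentages, here is a minimal sketch of the arithmetic. It assumes, as the post does, that MCE would hold every core at the single-core turbo clock and that the baseline for comparison is the base clock; the function name is made up for illustration.

```python
def overclock_percent(base_ghz, target_ghz):
    """Percentage increase of target_ghz over base_ghz."""
    return (target_ghz / base_ghz - 1.0) * 100.0

# Clocks as quoted in the post (base clock -> max turbo applied to all cores).
e5_2699_v3 = overclock_percent(2.3, 3.6)
e5_2690_v3 = overclock_percent(2.6, 3.5)

print(f"E5-2699 v3: {e5_2699_v3:.1f}% over base")
print(f"E5-2690 v3: {e5_2690_v3:.1f}% over base")
```

Note that the baseline here is the base clock, not whatever all-core turbo the chip reaches at stock, so the gain over real stock behaviour under load would be smaller than these figures.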

If CPU cooling were sufficient, how much more prevalent could various types of errors be? Could ECC RAM be used with such an MCE-overclocked E5 Xeon?

Assume the primary task is video editing, but I am very interested in hearing opinions about other types of tasks classic to high core count server Xeons.

Idontcare · Mar 24, 2015

This question is unanswerable by the target audience.

Only Intel can fully, properly, and sufficiently respond to your concerns.

If you truly care to know the answer to your question, sampling random forum folks is no substitute.

If, however, you were simply looking for conversation akin to 12 midnight bar room philosophizing, you have come to the right place!

cbn · Mar 24, 2015

Idontcare said:
This question is unanswerable by the target audience.

Only Intel can fully, properly, and sufficiently respond to your concerns.

If you truly care to know the answer to your question, sampling random forum folks is no substitute.

If, however, you were simply looking for conversation akin to 12 midnight bar room philosophizing, you have come to the right place!

I'd imagine Intel would probably tell me to buy a 2P motherboard with a C6xx chipset and two lesser E5 Xeons... and actually the purchase price and performance of two $2094 E5-2690 v3 chips at stock speed would be comparable to one $4115 Xeon E5-2699 v3 that is MCE overclocked. The tradeoff would be greater motherboard cost for 2P, but lower CPU cooling costs. Of course, once warranty, reliability and other variables are factored in, the 2P no doubt wins hands down. However, for those of us still using X99 at some point down the road, I think the idea of extending the life of our X99 boards via MCE overclocking (although not officially sanctioned) with some kind of used E5 Xeon (that has fallen out of warranty) is a very practical consideration. So any intelligent theorizing is very welcome.
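To put rough numbers on "comparable", here is a small sketch using only the prices and clocks quoted in this thread. The "core-GHz" figure (sockets x cores x clock) is an assumed, very crude throughput proxy that ignores memory bandwidth, NUMA effects and the actual all-core turbo behaviour at stock, so treat it as bar-room arithmetic rather than a benchmark.

```python
def core_ghz(sockets, cores_per_socket, clock_ghz):
    """Crude aggregate throughput proxy (an assumption, not a benchmark)."""
    return sockets * cores_per_socket * clock_ghz

# Two E5-2690 v3 at stock, taken at the 2.6 GHz base clock (prices as quoted).
dual_2690_cost = 2 * 2094
dual_2690_throughput = core_ghz(2, 12, 2.6)

# One E5-2699 v3 with MCE, assuming all 18 cores hold 3.6 GHz.
single_2699_cost = 4115
single_2699_throughput = core_ghz(1, 18, 3.6)

print(f"2x E5-2690 v3 stock: ${dual_2690_cost}, {dual_2690_throughput:.1f} core-GHz")
print(f"1x E5-2699 v3 MCE:   ${single_2699_cost}, {single_2699_throughput:.1f} core-GHz")
```

Under those assumptions the two options come out at $4188 versus $4115 and about 62.4 versus 64.8 core-GHz, which is within a few percent on both axes.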

Idontcare · Mar 24, 2015

cbn said:
any intelligent theorizing is very welcome.

Well in that case!

Provided the random (daily-weekly) error isn't going to render you sans honeymoon pictures or penniless and without a job, what's to worry about? Go for it!

SOFTengCOMPelec · Mar 24, 2015

Redacted

Headfoot · Mar 24, 2015

cbn said:
The Xeons and MCE thread got me wondering what would happen to data integrity if MCE overclocking really does turn out to be possible on the various Haswell E5 Xeons?

In the case of enabling MCE on the E5-2699 v3 (18 cores/36 threads with 3.6 GHz turbo and 2.3 GHz base clock), this would amount to a 56% overclock for operations involving all cores.

In the case of enabling MCE on the E5-2690 v3 (12 cores/24 threads with 3.5 GHz turbo and 2.6 GHz base clock), this would amount to a 35% overclock for operations involving all cores.

If CPU cooling were sufficient, how much more prevalent could various types of errors be? Could ECC RAM be used with such an MCE-overclocked E5 Xeon?

Assume the primary task is video editing, but I am very interested in hearing opinions about other types of tasks classic to high core count server Xeons.

If you're talking about video editing, it's pretty irrelevant. Data corruption would still be highly, highly infrequent even overclocked (so long as you're stable, of course) and would manifest as an incorrectly colored pixel in many cases. It would be nearly impossible to notice.

Jovec · Mar 24, 2015

I think the core premise is wrong. I don't think you'll hit +13 multi on 18 cores at any reasonable voltage and temp at near 100% load, even if MCE worked on it. +2/+4 may be possible, but there is a reason Intel drops clocks as it adds cores.

As IDC said, you have little to no method to verify corruption, though it might not matter for video encoding. ECC isn't going to help with CPU errors: ECC will protect against the memory flipping bits or corrupting data, but will do nothing if the CPU itself is sending bad data to the RAM (ECC will just ensure that the bad data doesn't change silently while it is in memory).

Editing may be a real-time task, but most encoding isn't. A dual-core Celeron can effectively be as fast as that 18-core Xeon as long as they both complete the task before next use (encode overnight, in the background, etc.). Something like a 4790K at 4.4 GHz (stock, single-core turbo) might even be preferable if the editing tasks are single-threaded, providing more performance while you are actually using the system for editing.
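The ECC point can be illustrated with a toy sketch. It uses a Hamming(7,4) code on a 4-bit value purely for illustration; real ECC memory uses a wider single-error-correcting code over much larger words, and the function names here are made up. The idea is the same, though: the code fixes a bit that flips while stored, but it faithfully protects whatever value it was handed, right or wrong.

```python
def encode(nibble):
    """Hamming(7,4): return 7 bits (positions 1..7) for a 4-bit value."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def decode(code):
    """Correct at most one flipped bit; return (value, corrected_position)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3  # 0 means no error detected
    if pos:
        c[pos - 1] ^= 1
    data_bits = [c[2], c[4], c[5], c[6]]
    return sum(bit << i for i, bit in enumerate(data_bits)), pos

correct = 0b1010

# Case 1: the CPU writes the right value, then a bit flips in "memory".
stored = encode(correct)
stored[4] ^= 1
recovered, fixed_at = decode(stored)

# Case 2: an unstable CPU computes a wrong value and writes that instead.
bad_result = correct ^ 0b0100
passed_through, flagged_at = decode(encode(bad_result))

print(recovered == correct, fixed_at)         # memory error repaired
print(passed_through == correct, flagged_at)  # CPU error stored as-is
```

In the first case the decoder repairs the flipped bit and returns the original value. In the second it reports no error at all, because from the memory's point of view nothing went wrong.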

DrMrLordX · Mar 24, 2015

If you are talking about data corruption in the sense of "I tried to store x to the <insertstoragesolutionhere> and instead I got y which is not what I intended", then I would say +0%. What you are talking about is only really going to be possible if the storage controller(s) involved mishandle(s) a write operation, which DOES happen, but again, that's on the storage controller.

So, unless you are pushing the storage controller out of spec (which you aren't with MCE), then you have nothing to worry about in that department.

Yes, a machine running "out-of-spec" could flip a wrong bit somewhere from a CPU or memory overclock, making it possible that (to cite an above example) you'd have one pixel colored incorrectly on one frame of a video or . . . something. But I drew a distinction between memory corruption and storage corruption for a reason: if you flip the wrong bit in memory, there's a strong probability that doing so in a non-trivial fashion will cause an error or just bring down the entire machine. The probability that you will flip a wrong bit AND flip a bit that is non-essential to the current stable operation of the machine is really quite low.

In contrast, a storage controller silently switching a 0 to a 1 somewhere won't make a whit of difference, since the system isn't depending on that written bit for operation, unless it accidentally corrupts the entire file table or something.

Any kind of potential problem from overclocking CPU or RAM will probably show up as system instability before you get to the point that you are unintentionally writing garbage to the disc/SSD/SAN/whatever.

Jovec · Mar 24, 2015

DrMrLordX said:
memory corruption and storage corruption

There is also calculation corruption. It might not matter for video tasks, but sometimes all 64, 128, or 256 bits of precision are needed.

ClockHound · Mar 24, 2015

Oh gawd....not again....silent data corruption from overclocking is about #178 on the top 100 things that can go wrong in an independent, budget-strapped video edit suite.

Let's get all the video drivers perfect first. And new, must-have efx plugins that don't trash a project non-silently.

Then never, ever have a drive fail in your 16TB RAID0 array that was last backed up last month (incompletely) by an intern who didn't have time to finish it because the RAID0 backup drives were too full.

RAID0 is the #1 choice for cheapskate VEs and is the default setup for most commercial video edit drive array vendors. Because....4K source footage to be transformed into a crappy 1 megabit stream YouTube video for the betterment of cats, their staff and their nutritional needs.

If I only had to face the extreme lethal danger of silent data corruption from my OC'd VE workstations, my work life would be far too easy and profitable.

Anyways, this is a fun midnight bar room philosophizing thread. Be very afraid of everything. Drink up! ;-)

DrMrLordX · Mar 24, 2015

Jovec said:
There is also calculation corruption. It might not matter for video tasks, but sometimes all 64, 128, or 256 bits of precision are needed.

Right, and that inevitably becomes corruption of something in memory if the CPU spits out the wrong or "unexpected" result. Different cause, same basic effect, and highly likely to halt the program or crash the machine.

cbn · Mar 24, 2015

...Looking through some old news posts I think it is interesting there were rumors of unlocked Haswell E5 Xeons back in May of 2014:

http://vr-zone.com/articles/computex-will-show-desktop-alive-well/77282.html

Add to this the rumoured confirmation that, unlike their predecessors, Haswell-EP Xeons, including likely the 14-core and 18-core flavours, will have several top bin un-locked and even liquid-cooling optimised variants meant for HPC, workstations and high frequency trading, and you can guess the implications: the Haswell-E and Haswell-EP platforms will again be the overclockers heaven.

In general, these are very good news as, with Haswell and Broadwell next-gen high end platforms, we will get the unlocking and speeding-up capabilities we saw in the high end desktops of the past, this time spread across both single socket and dual socket desktops and workstations, not to mention HPC supercomputing platforms.

Crossing fingers that Intel eventually releases unlocked E5 SKUs.

Bubbleawsome · Mar 24, 2015

Last edited by SOFTengCOMPelec; Today at 06:41 AM. Reason: Stabilty=ECC+Xeon. Overclocking=unstable. Leaving thread, quickly

I think those are his thoughts.

I think it would be fine unless you want to host an always-up server or do mission-critical computation.

SOFTengCOMPelec · Mar 24, 2015

Bubbleawsome said:
I think those are his thoughts.

I think it would be fine unless you want to host an always-up server or do mission-critical computation.

I agree.

I can't see that there is much point to using ECC, if the work is NOT important (critical), and you are overclocking (or similar), which throws maximum stability out of the window.

A post on the official Intel forum seems to indicate that MCE on Xeons (if it is even possible on later Xeons) is probably limited to the spec-sheet TDP.

I.e., even if you cooled it with liquid nitrogen, it would NOT allow you to exceed the rated TDP (on modern Xeons), even with the chip sitting at -200 degrees C.

Source

Disclaimer:
That is "MY" interpretation of the "source"; your opinion of it might differ. It is talking about the 2013 E3. Other Xeons, e.g. (unreleased/upcoming) Skylake-Es etc., may be different.

I would suggest pointing to the nature of the Intel® Turbo Boost technology. This feature is auto-managed by the processor having a balance between the power consumption and the chip heat.

Also, the maximum Turbo Boost speed won’t be reached by all of the processor cores at the same time, the value is assigned to the cores being used gradually at the moment the Turbo Boost kicks out so it is possible that only one core reaches the maximum Turbo Boost speed assigned by default.

If you change the Turbo Boost frequency multiplier manually then the processor will try to apply it to the cores in use but if the maximum power allowed (TDP) is already covered then the CPU won’t continue increasing the cores speed to prevent heat problems. Since you are setting a new value manually for the Turbo Boost frequency multiplier it is possible that the CPU is finding a power limit when reaching the 3.5GHz.

If you compare this Xeon processor (max TDP 69W) with the I7-3820 (max TDP 130W), the behavior will be different because the architecture is not the same. The Intel I7 processor is able to get to the maximum turbo frequency since it has higher TDP.

Again, you can change the settings, but we do not recommend.
