Nvidia's Titan V GPUs spit out 'wrong answers' in scientific simulations

formulav8 · Mar 21, 2018

2 + 2 = 4, er, 4.1, no, 4.3... Nvidia's Titan V GPUs spit out 'wrong answers' in scientific simulations

Fine for gaming, not so much for modeling, it is claimed

Krteq · Mar 22, 2018

Wow, it seems it's even worse than the FDIV bug on the first Pentiums.

LTC8K6 · Mar 22, 2018

Krteq said:
Wow, it seems it's even worse than the FDIV bug on the first Pentiums.

Article says it could be the memory.

Krteq · Mar 22, 2018

There is ECC built into GV100 and even into the HBM2 itself, so those errors couldn't be caused by "bad memory".

ECC Memory Resiliency
The Tesla V100 HBM2 memory subsystem supports Single Error Correcting Double-Error Detecting (SECDED) Error Correction Code (ECC) to protect data. ECC provides higher reliability for compute applications that are sensitive to data corruption. It is especially important in large scale cluster computing environments where GPUs process very large datasets and/or run applications for extended periods.

HBM2 supports native or sideband ECC where a small memory region, separate from main memory, is used for ECC bits. This compares to inline ECC where a portion of main memory is carved out for ECC bits, as in the GDDR5 memory subsystem of the Tesla K40 GPU where 6.25% of the overall GDDR5 is reserved for ECC bits. With V100 and P100, ECC can be active without a bandwidth or capacity penalty. For memory writes, the ECC bits are calculated across 32 bytes of data in a write request. Eight ECC bits are created for each eight bytes of data. For memory reads, the 32 ECC bits are read in parallel with each 32 byte read of data. ECC bits are used to correct single bit errors or flag double bit errors.

Other key structures in GV100 are also protected by SECDED ECC, including the SM register file, L1 cache, and L2 cache. The same SECDED ECC protection was provided across the same structures in Pascal GP100 to ensure a high level of error detection and correction, and overall memory resiliency.

Volta architecture whitepaper
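
For anyone unfamiliar with SECDED, here is a minimal Python sketch of a textbook extended Hamming code. The whitepaper excerpt does not say which code NVIDIA actually uses, so the construction below (7 Hamming check bits plus one overall parity bit per 64-bit word) is an assumption, picked only because it matches the quoted ratio of eight ECC bits per eight bytes of data. The names `encode` and `decode` are illustrative, not from any NVIDIA API.

```python
from typing import List, Optional, Tuple


def _is_power_of_two(pos: int) -> bool:
    # Positions 1, 2, 4, 8, ... hold Hamming check bits.
    return pos & (pos - 1) == 0


def _check_bits(k: int) -> int:
    # Smallest r with 2**r >= k + r + 1 (r = 7 for k = 64).
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    return r


def encode(data: int, k: int = 64) -> List[int]:
    """Encode k data bits; index 0 is the overall parity bit."""
    r = _check_bits(k)
    n = k + r
    code = [0] * (n + 1)
    bit = 0
    for pos in range(1, n + 1):          # data goes in non-power-of-two slots
        if not _is_power_of_two(pos):
            code[pos] = (data >> bit) & 1
            bit += 1
    syndrome = 0
    for pos in range(1, n + 1):          # XOR of the positions of all set bits
        if code[pos]:
            syndrome ^= pos
    for i in range(r):                   # check bits force that XOR to zero
        code[1 << i] = (syndrome >> i) & 1
    code[0] = sum(code[1:]) & 1          # overall parity enables double-error detection
    return code


def decode(code: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); data is None when the error is not correctable."""
    n = len(code) - 1
    word = list(code)
    syndrome = 0
    for pos in range(1, n + 1):
        if word[pos]:
            syndrome ^= pos
    if sum(word) & 1:                    # odd parity: assume a single flipped bit
        if syndrome > n:
            return None, "uncorrectable"
        word[syndrome] ^= 1              # syndrome 0 means the parity bit itself
        status = "corrected"
    elif syndrome:                       # even parity but nonzero syndrome
        return None, "double_error"
    else:
        status = "ok"
    data, bit = 0, 0
    for pos in range(1, n + 1):
        if not _is_power_of_two(pos):
            data |= word[pos] << bit
            bit += 1
    return data, status
```

Encoding a 64-bit value gives a 72-bit codeword. Flipping any one bit before calling `decode` still returns the original value with status `"corrected"`, while flipping two bits returns `(None, "double_error")`. That is the "correct single, flag double" behaviour the excerpt describes. Without a scheme like this, a flipped bit in memory simply goes unnoticed.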

Kenmitch · Mar 22, 2018

Maybe it has something to do with the "not for data center usage" restriction that popped up at the release of the card. Either it's done intentionally to enforce its intended uses, or there is a flaw in the design of the GPU.

Phynaz · Mar 22, 2018

"Article" has no sources, it's just clickbait.

Let's see who else falls for it.

LTC8K6 · Mar 22, 2018

If it were the GPU, wouldn't we expect all four cards to have the error?
And for it to have shown up widely already?

What is the likelihood that HBM2 memory chips are bad vs the GPU being bad?

formulav8 · Mar 22, 2018

An update from Nvidia.

A spokesperson for Nvidia has been in touch to say: "All of our GPUs add correctly. Other than an isolated report from a user of Amber, a molecular dynamics application that requires days to compute, we have not seen any issues in the field. Our Tesla line, which has ECC [error-correcting code], is designed for these types of large scale, high performance simulations. Anyone who does experience issues should contact support@nvidia.com."

piesquared · Mar 22, 2018

Hmm, but they don't multiply and divide properly, I guess? And are they admitting that if you want that to function correctly you need to buy a Tesla-line card? Strange wording.

Flapdrol1337 · Mar 22, 2018

Is ECC turned off on the non-Teslas?

sontin · Mar 22, 2018

Titan cards don't support ECC.

DaveSimmons · Mar 22, 2018

One report, of one(?) card possibly defective or possibly run out of spec.

I'm going to need a little more evidence than that before accepting a worse-than-GTX 970-level design flaw.

sontin · Mar 22, 2018

Maybe HBM is more error-prone for complex calculations.

Genx87 · Mar 23, 2018

Need a little more than "one issue, this one time at band camp." Is it possible? Very much so. But let's see something a little more in depth before declaring the Titan V needs to be avoided.

dajeepster · Apr 24, 2018

How about releasing the code for this so-called simulation software? ... oh ... my bad ... my code is bad, but let's blame the hardware so I can get my bonus.

As the poster said above... no source, it's just clickbait.

No software name, no data to back up the story. Just links to Nvidia's product line and presentations.

DaveSimmons · Apr 24, 2018

Also, one month later and no confirmation from reputable sites.

So: false rumor is false.
