Nvidia's Titan V GPUs spit out 'wrong answers' in scientific simulations

Krteq

Wow, it seems it's even worse than the FDIV bug on the first Pentiums.
 

Krteq

There is ECC built into GV100 and even into the HBM2 itself, so those errors couldn't be caused by "bad memory".
ECC Memory Resiliency
The Tesla V100 HBM2 memory subsystem supports Single Error Correcting Double-Error Detecting (SECDED) Error Correction Code (ECC) to protect data. ECC provides higher reliability for compute applications that are sensitive to data corruption. It is especially important in large scale cluster computing environments where GPUs process very large datasets and/or run applications for extended periods.

HBM2 supports native or sideband ECC where a small memory region, separate from main memory, is used for ECC bits. This compares to inline ECC where a portion of main memory is carved out for ECC bits, as in the GDDR5 memory subsystem of the Tesla K40 GPU where 6.25% of the overall GDDR5 is reserved for ECC bits. With V100 and P100, ECC can be active without a bandwidth or capacity penalty. For memory writes, the ECC bits are calculated across 32 bytes of data in a write request. Eight ECC bits are created for each eight bytes of data. For memory reads, the 32 ECC bits are read in parallel with each 32 byte read of data. ECC bits are used to correct single bit errors or flag double bit errors.

Other key structures in GV100 are also protected by SECDED ECC, including the SM register file, L1 cache, and L2 cache. The same SECDED ECC protection was provided across the same structures in Pascal GP100 to ensure a high level of error detection and correction, and overall memory resiliency.
Volta architecture whitepaper
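
For anyone wondering what "eight ECC bits for each eight bytes" buys you, here is a minimal sketch of a SECDED scheme. It assumes a textbook extended Hamming (72,64) code; the whitepaper excerpt does not say which code GV100 or HBM2 actually uses, so the bit layout and the names `encode` and `decode` are purely illustrative.

```python
# Textbook extended Hamming (72,64) SECDED code: 64 data bits + 8 check bits.
# Assumption: illustrative only; the real GV100/HBM2 code is not given in the quote.

PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]  # the seven Hamming check bits
DATA_POSITIONS = [p for p in range(1, 72) if p not in PARITY_POSITIONS]  # 64 slots


def encode(data):
    """Encode a 64-bit integer as a list of 72 bits (index 0 = overall parity)."""
    if not 0 <= data < (1 << 64):
        raise ValueError("data must fit in 64 bits")
    code = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (data >> i) & 1
    for p in PARITY_POSITIONS:
        # Check bit p covers every position whose index has bit p set.
        code[p] = sum(code[pos] for pos in range(1, 72) if pos & p and pos != p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity enables double-error detection
    return code


def decode(code):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 72):
        if code[pos]:
            syndrome ^= pos  # XOR of the indices of all set bits
    parity_ok = sum(code) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok and syndrome < 72:
        # Odd number of flips: treat as a single error at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity holds but the syndrome is non-zero: two bits flipped.
        return None, "uncorrectable"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= code[pos] << i
    return data, status


if __name__ == "__main__":
    word = 0x0123456789ABCDEF
    cw = encode(word)
    cw[37] ^= 1            # one flipped bit is repaired
    print(decode(cw))
    cw[5] ^= 1             # a second flipped bit is only detected
    print(decode(cw))
```

A single flipped bit is silently repaired, while a double flip is only flagged, which is the "correct single bit errors or flag double bit errors" behaviour described in the quote. Three or more flips can slip past a code like this as a wrong "correction".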
 

Kenmitch

Maybe it has something to do with the "not for data center usage" restriction that popped up at the release of the card. Either it's done intentionally to enforce its intended uses, or there is a flaw in the design of the GPU.
 

Phynaz

"Article" has no sources, it's just clickbait.

Let's see who else falls for it.
 

LTC8K6

If it were the GPU, wouldn't we expect all four cards to have the error?
And for it to have shown up widely already?

What is the likelihood that HBM2 memory chips are bad vs the GPU being bad?
 

formulav8

An update from Nvidia.

A spokesperson for Nvidia has been in touch to say: "All of our GPUs add correctly. Other than an isolated report from a user of Amber, a molecular dynamics application that requires days to compute, we have not seen any issues in the field. Our Tesla line, which has ECC [error-correcting code], is designed for these types of large scale, high performance simulations. Anyone who does experience issues should contact support@nvidia.com."
 

piesquared

Hmm, but they don't multiply and divide properly, I guess? And are they admitting that if you want that to function correctly, you need to buy a Tesla-line card? Strange wording.
 

DaveSimmons

One report, of one(?) card possibly defective or possibly run out of spec.

I'm going to need a little more evidence than that before accepting a worse-than-GTX 970-level design flaw.
 

Genx87

Need a little more than "one issue, this one time at band camp." Is it possible? Very much so. But let's see something a little more in-depth before declaring that the Titan V needs to be avoided.
 

dajeepster

how about the released code for this so-called simulation software?... oh... my bad... my code is bad :eek:, but let's blame the hardware so I can get my bonus :oops:

as the poster said above... no source, it's just clickbait.

no software name, no data to back up story. just links to nvidia's product line and presentations.