ECC Memory Resiliency
The Tesla V100 HBM2 memory subsystem supports Single-Error-Correcting, Double-Error-Detecting (SECDED) Error Correction Code (ECC) to protect data. ECC provides higher reliability for compute applications that are sensitive to data corruption. It is especially important in large-scale cluster computing environments where GPUs process very large datasets or run applications for extended periods.
HBM2 supports native, or sideband, ECC, in which a small memory region separate from main memory holds the ECC bits. This contrasts with inline ECC, in which a portion of main memory is carved out for ECC bits, as in the GDDR5 memory subsystem of the Tesla K40 GPU, where 6.25% of the overall GDDR5 is reserved for ECC bits. Because the ECC bits are stored separately and fetched in parallel with the data, V100 and P100 can run with ECC active without a bandwidth or capacity penalty. For memory writes, the ECC bits are calculated across the 32 bytes of data in a write request: eight ECC bits are created for each eight bytes of data, giving 32 ECC bits per request. For memory reads, the 32 ECC bits are read in parallel with each 32-byte read of data. The ECC bits are used to correct single-bit errors and to flag double-bit errors.
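To make the "eight ECC bits per eight bytes" arithmetic and the SECDED behavior concrete, the following is a minimal Python sketch of a textbook extended Hamming (72,64) code. It is an illustration under stated assumptions, not the hardware's implementation: the text does not say which SECDED code the memory controller uses, and the names encode, decode, and the constants below are hypothetical.

```python
# Textbook extended Hamming (72,64) SECDED code. Illustrative only: the text
# does not specify the actual code used by the GPU memory controller.

CODE_LEN = 71                                  # Hamming positions 1..71; bit 0 holds overall parity
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]    # 7 Hamming check bits
DATA_POSITIONS = [p for p in range(1, CODE_LEN + 1) if p & (p - 1)]  # 64 non-power-of-two slots

ECC_BITS_PER_WORD = len(PARITY_POSITIONS) + 1             # 7 + overall parity = 8
ECC_BITS_PER_32_BYTES = (32 // 8) * ECC_BITS_PER_WORD     # four 8-byte words -> 32 bits


def _syndrome(word):
    """XOR of the positions (1..71) of all set bits in the codeword."""
    syn = 0
    for pos in range(1, CODE_LEN + 1):
        if (word >> pos) & 1:
            syn ^= pos
    return syn


def encode(data):
    """Encode a 64-bit integer into a 72-bit codeword (returned as an int)."""
    if not 0 <= data < (1 << 64):
        raise ValueError("data must fit in 64 bits")
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            word |= 1 << pos
    # Set the Hamming check bits so the syndrome of a clean codeword is zero.
    syn = _syndrome(word)
    for p in PARITY_POSITIONS:
        if syn & p:
            word |= 1 << p
    # Bit 0 makes the total number of set bits even.
    if bin(word).count("1") % 2:
        word |= 1
    return word


def decode(word):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    syn = _syndrome(word)
    odd = bin(word).count("1") % 2
    if syn == 0 and not odd:
        status = "ok"
    elif odd and syn <= CODE_LEN:
        # Odd overall parity: assume a single flipped bit at position syn
        # (syn == 0 means the overall parity bit itself was flipped).
        word ^= 1 << syn
        status = "corrected"
    else:
        # Even parity with a nonzero syndrome: double-bit error detected.
        # Odd parity with an out-of-range syndrome: more than two flips.
        return None, "uncorrectable"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status


if __name__ == "__main__":
    payload = 0x0123456789ABCDEF
    codeword = encode(payload)
    print(decode(codeword))                        # clean read
    print(decode(codeword ^ (1 << 17)))            # one flipped bit
    print(decode(codeword ^ (1 << 17) ^ (1 << 40)))  # two flipped bits
```

In this sketch each 8-byte word carries 8 check bits, so a 32-byte request made of four such words carries 32 check bits, which matches the figures in the text. Any single flipped bit in the 72-bit codeword is located by the syndrome and repaired, while any two flipped bits leave the overall parity even with a nonzero syndrome and are reported rather than silently miscorrected.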
Other key structures in GV100 are also protected by SECDED ECC, including the SM register file, L1 cache, and L2 cache. Pascal GP100 provided the same SECDED ECC protection across these structures, ensuring a high level of error detection and correction and overall memory resiliency.