Which hash algorithm samples the most data?

TheDarkKnight

Senior member
Jan 20, 2011
This question arose from a need to verify the integrity of some of my DVDs using software that relies on the CRC stored on a CD/DVD disc.

Which hash method samples the most of the data that it represents:

CRC-32, MD5, or SHA-1?

I am curious: given a hash generated from a regular-size DVD .ISO file of ~4.38 GiB, what percentage of all those bytes does the algorithm actually use to generate the final hash?
 

Zxian

Senior member
May 26, 2011
All of them should be hashing the entire contents. The only difference is the length of the resultant output, and therefore the probability of collision. If you're checking what should be an identical file, then any errors will result in a mismatched hash.
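
To illustrate the point that every byte goes into the result, here is a minimal Python sketch that streams a file through CRC-32, MD5, and SHA-1 at the same time and counts the bytes it fed in. The function name `hash_file` and the 1 MiB chunk size are my own choices for illustration, not anything the DVD tool mentioned above is known to use.

```python
import hashlib
import zlib


def hash_file(path, chunk_size=1 << 20):
    """Return (bytes_read, crc32_hex, md5_hex, sha1_hex) for the file at path.

    hash_file and chunk_size are illustrative names, not part of any tool
    discussed in this thread.
    """
    crc = 0
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            # Every chunk, and therefore every byte, updates all three states.
            crc = zlib.crc32(chunk, crc)
            md5.update(chunk)
            sha1.update(chunk)
    return total, format(crc & 0xFFFFFFFF, "08x"), md5.hexdigest(), sha1.hexdigest()
```

For a ~4.38 GiB image, `bytes_read` should equal the file size, so the answer to the percentage question is 100% for all three algorithms.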
 

Soulkeeper

Diamond Member
Nov 23, 2001
CRC and MD5 are usually chosen for speed; SHA is slower but produces a longer digest, so accidental collisions are less likely.
They all use 100% of the bytes, but it is possible for two different files to have the same hash; it is just rare.
Generally, the larger the hash, the less likely a collision, e.g. 32 bits, 256 bits, 512 bits, etc.
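
To put rough numbers on "just rare", here is a small Python sketch under an idealized assumption: each algorithm behaves like a uniformly random function of its input. That is only an approximation (CRC-32 in particular is not a random function, and this says nothing about deliberately crafted collisions). The helper names are hypothetical.

```python
import math

# Digest sizes in bits for the three algorithms asked about.
DIGEST_BITS = {"CRC-32": 32, "MD5": 128, "SHA-1": 160}


def undetected_probability(bits):
    """Chance a random corruption yields the same digest (idealized model)."""
    return 2.0 ** -bits


def files_for_half_collision_chance(bits):
    """Birthday-bound estimate of how many random files give a ~50% chance
    that at least two of them share a digest."""
    return math.sqrt(2 * math.log(2)) * 2.0 ** (bits / 2)


for name, bits in DIGEST_BITS.items():
    print(
        f"{name}: undetected corruption ~ {undetected_probability(bits):.3e}, "
        f"~50% collision after ~ {files_for_half_collision_chance(bits):.3e} files"
    )
```

Under this model, even the 32-bit case misses a random corruption only about once in 4.3 billion tries, which is why any of the three is adequate for checking a handful of discs.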
 

BrightCandle

Diamond Member
Mar 15, 2007
Don't worry about it; just use whichever is easiest. They will all detect a problem on the DVDs perfectly well. The chance of any of these hashes colliding is just too tiny to matter at this size.
 

tynopik

Diamond Member
Aug 10, 2004
use par2 and you'll not only detect problems but might actually be able to fix them
 

StorageGuru

Junior Member
Sep 9, 2012
Actually, the data on a DVD is protected against errors by erasure codes.

Both CDs and DVDs use Reed-Solomon codes. You can read about it on Wikipedia.
 

Elixer

Lifer
May 7, 2002
Eh? Erasure codes? You mean ECC (error-correcting code). While it is true that the hardware has ECC capability, both DVDs and CDs can still develop errors that make the disc worthless; there is only so much that ECC can "fix".

Using PAR2 recovery volumes is an excellent way to add an extra layer of protection so that you can recover your data. You just need at least as many intact recovery blocks available as there are damaged blocks in the original file(s).
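
To show the basic idea of recovery data, here is a deliberately simplified Python sketch that uses a single XOR parity block. This is not the Reed-Solomon scheme that PAR2 or the disc hardware uses; it is a stand-in that can rebuild exactly one missing block, and the function names are hypothetical.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            result[i] ^= value
    return bytes(result)


def make_parity(data_blocks):
    """Build one recovery block from the data blocks (simplified stand-in)."""
    return xor_blocks(data_blocks)


def recover_missing(blocks_with_gap, parity):
    """Rebuild the single block marked as None using the parity block."""
    missing = [i for i, block in enumerate(blocks_with_gap) if block is None]
    if len(missing) != 1:
        raise ValueError("one parity block can rebuild exactly one missing block")
    survivors = [block for block in blocks_with_gap if block is not None]
    repaired = list(blocks_with_gap)
    # XOR of all surviving blocks and the parity equals the lost block.
    repaired[missing[0]] = xor_blocks(survivors + [parity])
    return repaired


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(data)
damaged = [data[0], None, data[2]]  # pretend the second block is unreadable
assert recover_missing(damaged, parity) == data
```

The same budget applies in PAR2, just with more recovery blocks: each lost block uses up one of them, and once the damage exceeds what you stored, recovery fails.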
 

StorageGuru

Junior Member
Sep 9, 2012
Reed-Solomon is an erasure code, also known as forward error correction, and an erasure code is an error-correcting code. So don't tell me what I mean...
The Parchive erasure code might not be the most effective.
To be 100% sure that you can recover the data, you need to use a code with maximum Hamming distance.
However, the question is formulated in an unspecific way. A hash function is a one-way function, not meant to hold any additional information.

-SG
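
For readers unfamiliar with the term, here is a small Python sketch of Hamming distance and why it matters for recovery. A code whose codewords differ pairwise in at least d positions can rebuild up to d - 1 erased symbols, and for n symbols carrying k data symbols d can never exceed n - k + 1. My reading is that "maximum Hamming distance" refers to codes that reach that limit, but that is an assumption. The toy single-parity code and helper names are illustrative only.

```python
from itertools import combinations, product


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(codewords):
    """Smallest Hamming distance between any two distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def single_parity_code(k):
    """All codewords made of k data bits followed by one even-parity bit."""
    return [bits + (sum(bits) % 2,) for bits in product((0, 1), repeat=k)]


k = 3
code = single_parity_code(k)   # n = k + 1 = 4 bits per codeword
d = minimum_distance(code)     # reaches the limit n - k + 1 = 2
print(f"minimum distance d = {d}, recoverable erasures = {d - 1}")
```

Reed-Solomon codes also reach this limit, which is why n - k recovery symbols can replace any n - k lost ones when the code is constructed correctly.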
 

Mark R

Diamond Member
Oct 9, 1999
StorageGuru wrote: "The Parchive erasure code might not be the most effective."

Correct.

Parchive version 2 (.par2) has a minor bug in the specification which can, under some circumstances, give suboptimal protection. There are also a number of PAR2 software packages which have additional bugs (in particular, there is one package with a major bug in construction of the Reed-Solomon codes resulting in readable and valid files, but with virtually no redundancy).

There is a proposed .par3 standard, but only one beta software package uses it (MultiPar), whose author wrote the .par3 proposal. This algorithm is mathematically optimal, as far as is known; it is also computationally far more efficient than .par2, so it can be an order of magnitude faster to calculate for large datasets. The risk with .par3 is that the standard has not been finalized, and therefore the file format is subject to change. If you choose to use it, make sure you keep a copy of the exact software you used to create the files, in case future versions cannot read them.