MD5 Sum not being unique (different data having same sum)

RaiderJ

Diamond Member
Apr 29, 2001
Saw this on Slashdot the other day, and it basically says that two different sets of data could potentially have the same MD5 sum, or a set of data could be changed but still give the same sum (therefore MD5 is basically unsafe).

But isn't this obvious? Maybe I'm missing something, but the only way every possible set of data could have a unique MD5 sum is if that sum could perfectly describe that data? Also, isn't the point of MD5 not to stop "doppelganger" changes, but to make sure your download matches what was originally on the server?

My post reads kinda confusing, sorry!
 

Matthias99

Diamond Member
Oct 7, 2003
Originally posted by: RaiderJ
Saw this on Slashdot the other day, and it basically says that two different sets of data could potentially have the same MD5 sum, or a set of data could be changed but still give the same sum (therefore MD5 is basically unsafe).

But isn't this obvious? Maybe I'm missing something, but the only way every possible set of data could have a unique MD5 sum is if that sum could perfectly describe that data?

Yep -- but it was thought that it was very difficult to take a specific MD5 sum and produce a piece of data that matches it. Turns out it's not quite as hard as was previously thought.

Also, isn't the point of MD5 not to stop "doppelganger" changes, but to make sure your download matches what was originally on the server?

Well, um, those are sort of the same thing. MD5 can be used as a way to validate secure connections and messages. If someone could forge a fake message (or a virus-laced file) that has the same MD5 checksum as a real message or file, it makes it less useful as a verification tool.
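For the download case, the check itself is short. Here is a minimal sketch, assuming the server publishes the MD5 as a hex string next to the file; the helper names `file_md5` and `matches_published` are made up for illustration:

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """True if the file's MD5 equals the published hex digest.

    `published_hex` is assumed to be the checksum string copied from the
    download page; surrounding whitespace and letter case are ignored.
    """
    return file_md5(path) == published_hex.strip().lower()
```

A match only shows that the file has the same digest as the published one. Whether that rules out a deliberately forged file depends on how hard it is to construct a second file with that digest, which is exactly what is in question here.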
 

uOpt

Golden Member
Oct 19, 2004
Of course, it is just plain impossible to derive a checksum shorter than the data that is never the same for two different inputs.

However, this is missing the point. MD5 sums are not meant to tell one file from another. For example, you cannot use them to find identical files on your system without doing a real comparison pass over the whole files after the MD5s match.

MD5 sums are meant to keep people from tampering with files undetected.

MD5 is meant to protect a file against accidental or deliberate modification. It is hard in the extreme to change some bytes or sections in a file and have the same MD5 sum for the changed and unchanged file. It is incredibly hard to create a sequence of bytes that does what you want (e.g. breaks into a system) and also has a predetermined MD5 result.
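To see why a small modification is normally easy to detect, here is a small sketch; the two sample byte strings and the helper names are made up for illustration. It hashes two inputs that differ in a single character and counts how many of the 128 digest bits differ:

```python
import hashlib

# Two made-up inputs that differ only in the last character.
original = b"hello world"
modified = b"hello worle"


def md5_hex(data):
    """Hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def differing_bits(a, b):
    """Number of bit positions (out of 128) where the two MD5 digests differ."""
    h1 = int.from_bytes(hashlib.md5(a).digest(), "big")
    h2 = int.from_bytes(hashlib.md5(b).digest(), "big")
    return bin(h1 ^ h2).count("1")


if __name__ == "__main__":
    print(md5_hex(original))
    print(md5_hex(modified))
    print(differing_bits(original, modified), "of 128 bits differ")
```

This shows accidental or naive changes being caught. It says nothing about an attacker who exploits a structural weakness in the algorithm to pick changes that cancel out.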

Hope this clears things up.
 

bsobel

Moderator Emeritus, Elite Member
Dec 9, 2001
Originally posted by: RaiderJ
Saw this on Slashdot the other day, and it basically says that two different sets of data could potentially have the same MD5 sum, or a set of data could be changed but still give the same sum (therefore MD5 is basically unsafe).

You're right; otherwise we'd only need 128 bits to describe any file that ever existed, exists, or will exist. The issue the article is talking about is that it should be impossible (short of brute-force trying all the combinations) to deliberately generate a new file with the same MD5 sum. However, it appears there may be some weakness in the algorithm which would (eventually) allow an attacker to create payloads with MD5 sums designed to match those of legitimate payloads.
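The counting argument can be seen on a small scale. The sketch below is an illustration only, not the attack the article describes: it keeps just the first two bytes (16 bits) of each MD5 digest, so there are only 65,536 possible values, and hashes the strings "0", "1", "2", ... until two of them share a value. The function names and the 2-byte truncation are assumptions made for the demo.

```python
import hashlib


def truncated_md5(data, n_bytes=2):
    """First `n_bytes` bytes of the MD5 digest of `data`."""
    return hashlib.md5(data).digest()[:n_bytes]


def find_truncated_collision(n_bytes=2):
    """Return two different messages whose truncated digests are equal.

    With n_bytes=2 there are only 2**16 possible tags, so by the pigeonhole
    principle the loop must stop within 2**16 + 1 distinct messages.
    """
    seen = {}  # tag -> first message that produced it
    counter = 0
    while True:
        msg = str(counter).encode()
        tag = truncated_md5(msg, n_bytes)
        if tag in seen:
            return seen[tag], msg
        seen[tag] = msg
        counter += 1


if __name__ == "__main__":
    a, b = find_truncated_collision()
    print(a, b, truncated_md5(a).hex())
```

The same argument applies to the full 128 bits: collisions must exist. What matters for security is that nobody can find one for the full digest with a practical amount of work.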

Let's say I have a document where I promise to pay you $10,000. I have an MD5 of the document so I can prove it's the right one and you haven't changed it. But if MD5 is indeed broken, let's say you can craft a replacement letter and change the amount to $100,000. Your new file has the same MD5 (maybe you add random data into another part of the Word doc that isn't normally displayed). Now, how do I prove which file is 'real'?

Bill
 

bsobel

Moderator Emeritus, Elite Member
Dec 9, 2001
MD5 sums are not meant to tell one file from another. For example, you cannot use them to find identical files on your system without doing a real comparison pass over the whole files after the MD5s match.

Actually, that's wrong. File name and file length, along with MD5, are an EXTREMELY reliable basis for comparison. Moving to SHA1, the rate of collision is less than the rate of random drive failure and other events which may affect file contents.
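A rough sketch of that kind of comparison follows. The function names are hypothetical, and it groups files by length and digest only, leaving the file name aside. The optional `confirm` flag adds the byte-for-byte pass for anyone who does not want to rely on the digest alone.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, algo="sha1", chunk_size=65536):
    """Hex digest of a file, read in chunks; `algo` is any hashlib name."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths, algo="sha1", confirm=False):
    """Group paths that have the same length and the same digest.

    If `confirm` is True, each candidate is also compared byte for byte
    against the first file in its group (a simplification: a file that
    differs from the first one is dropped from that group).
    """
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique length cannot have a duplicate
        by_digest = defaultdict(list)
        for p in same_size:
            by_digest[file_digest(p, algo)].append(p)
        for candidates in by_digest.values():
            if confirm and len(candidates) > 1:
                first = candidates[0]
                candidates = [first] + [
                    p for p in candidates[1:]
                    if filecmp.cmp(first, p, shallow=False)
                ]
            if len(candidates) > 1:
                groups.append(candidates)
    return groups
```

Checking the length first means most files never get hashed at all, since a file with a unique length cannot be a duplicate.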

Bill

edit: Errr, should have said 'was', now that MD5 appears to have been weakened. Replace MD5 above with SHA1.
 

uOpt

Golden Member
Oct 19, 2004
Originally posted by: bsobel
MD5 sums are not meant to tell one file from another. For example, you cannot use them to find identical files on your system without doing a real comparison pass over the whole files after the MD5s match.

Actually, that's wrong. File name and file length, along with MD5, are an EXTREMELY reliable basis for comparison. Moving to SHA1, the rate of collision is less than the rate of random drive failure and other events which may affect file contents.

Bill

edit: Errr, should have said 'was', now that MD5 appears to have been weakened. Replace MD5 above with SHA1.

Fair points. In practice you can probably rely on length and MD5; it will be statistically sufficient for the number of files today's computers can store (not too sure about my news spool, though :)).
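A back-of-the-envelope check of "statistically sufficient" can be done with the usual birthday approximation. This is a sketch under a big assumption: the hash behaves like a uniformly random function and nobody is crafting collisions on purpose, which is precisely the assumption a weakened MD5 no longer supports against an attacker. The function name and the sample file counts are made up.

```python
import math


def collision_probability(n_items, bits=128):
    """Approximate chance that any two of `n_items` random inputs collide.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / 2**(bits+1)),
    which assumes digests are uniformly random `bits`-bit values.
    """
    exponent = n_items * (n_items - 1) / 2 ** (bits + 1)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12):
        print(n, "files:", collision_probability(n, bits=128))
```

For accidental collisions among ordinary files the numbers come out vanishingly small even for very large collections. Ignoring the length check, as this estimate does, only makes it more pessimistic.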

I was just illustrating what MD5 and other cryptographic checksumming is about: not telling random files from each other, but telling related, slightly modified files from each other, to detect tampering and corruption.