Anyone know of a program that will diff a large number of files and give statistics on ones that are similar?

pX

I have reason to believe that a lot of students may have cheated, since they don't think I am going to read their code. That's a pretty good assumption, since there are 80 students. I'd love to have a program (or script) that reads all the text files in a directory, diffs each one against every other, records how many lines are exactly the same, and then spits out a report of some sort. I could even see writing this as some sort of UNIX script, but I'd assume something like this is already "out there".
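
For reference, the comparison described above can be sketched in a few lines of Python. This is only a rough sketch, not an existing tool: the function names, the "*.asm" default pattern, and the choice to score each pair by shared lines divided by the length of the shorter file are all assumptions made for illustration.

```python
import itertools
from collections import Counter
from pathlib import Path


def common_line_count(lines_a, lines_b):
    """Count lines present in both files (multiset intersection)."""
    return sum((Counter(lines_a) & Counter(lines_b)).values())


def pairwise_report(directory, pattern="*.asm"):
    """Compare every pair of matching files in a directory.

    Returns (ratio, shared_lines, file_a, file_b) tuples, most similar first.
    The ratio is shared lines divided by the length of the shorter file.
    """
    files = {}
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(errors="replace")
        # Strip whitespace and drop blank lines before comparing.
        files[path.name] = [ln.strip() for ln in text.splitlines() if ln.strip()]

    rows = []
    for a, b in itertools.combinations(files, 2):
        shared = common_line_count(files[a], files[b])
        shorter = min(len(files[a]), len(files[b])) or 1  # avoid division by zero
        rows.append((shared / shorter, shared, a, b))
    rows.sort(reverse=True)
    return rows
```

Printing the first few rows of pairwise_report("submissions") would give the pairs worth reading by hand. With 80 files that is 3,160 pairs, which is small enough to brute-force.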

Anyone?
 

BCYL

I KNOW they exist: when I was still in school, they ran them on all our assignments/projects (and they caught people regularly, too). However, I don't know what the tool is called, since they wouldn't tell us (afraid we would find a way to beat it).
 

glugglug

Identical lines aren't that meaningful, since EVERY program will have lines like
}
or
{

Identical BLOCKS of code are another matter. And I know there ARE tools to detect them, even in compiled code, and they have been used to prove copyright violations before.
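
One way to act on this point, sketched in Python under some assumptions: drop the trivial lines (blank lines and lone braces), then compare runs of k consecutive lines instead of single lines. The function names, the default k=3, and the use of Jaccard similarity as the score are illustrative choices, not how any particular detection tool works.

```python
def significant_lines(text):
    """Strip whitespace and drop blank or brace-only lines."""
    trivial = {"", "{", "}", "};"}
    stripped = (ln.strip() for ln in text.splitlines())
    return [s for s in stripped if s not in trivial]


def blocks(lines, k=3):
    """Return the set of all runs of k consecutive lines."""
    return {tuple(lines[i:i + k]) for i in range(len(lines) - k + 1)}


def block_similarity(text_a, text_b, k=3):
    """Jaccard similarity of the two files' k-line block sets (0.0 to 1.0)."""
    a = blocks(significant_lines(text_a), k)
    b = blocks(significant_lines(text_b), k)
    if not a or not b:
        return 0.0  # a file shorter than k lines has no blocks to compare
    return len(a & b) / len(a | b)
```

Two files that share only braces score 0.0 here, while a copied function shows up as a run of shared blocks. A larger k makes accidental matches rarer but also makes the check easier to defeat by inserting a line.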
 

pX

Ah, maybe I should just ask the professor if there is such a tool here.

I can see just writing a simple UNIX script that runs diff xxx.asm yyy.asm on every pair of files, with whatever flags make diff report the number of lines in common instead of an actual diff. Meh, too much work.
 

BCYL

Originally posted by: pX
Ah, maybe I should just ask the professor if there is such a tool here.

I can see just writing a simple UNIX script that runs diff xxx.asm yyy.asm on every pair of files, with whatever flags make diff report the number of lines in common instead of an actual diff. Meh, too much work.

That wouldn't work too well, I think. Anyone copying code would have the sense to change variable names around, which would make the lines look completely different even though the code is basically the same.

We were told the program they used at my school compares compiled code, memory usage, etc., so changing variable names means nothing.
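
Comparing compiled code is beyond a quick script, but a much simpler technique gets part of the way against renaming: replace every identifier and number with a placeholder before comparing lines. The Python sketch below illustrates that idea only. It is not the tool described above, and the keyword list is a hypothetical C-like placeholder that would need to be swapped for the mnemonics or keywords of whatever language the submissions use.

```python
import re

# Hypothetical keyword list (C-like); replace with the target language's own.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}

# An identifier, a run of digits, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(line):
    """Replace identifiers with ID and numbers with NUM, keeping keywords."""
    out = []
    for tok in TOKEN.findall(line):
        if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS:
            out.append("ID")
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return " ".join(out)
```

With this, "int total = count + 1;" and "int sum=n+2;" both normalize to "int ID = ID + NUM ;", so a line-by-line or block-by-block comparison of the normalized text no longer cares what the variables are called. It also raises the false-positive rate, since short unrelated statements now look alike, so it works best combined with block matching.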