Fastest Way: Checking for Duplicates in an Array

Schnozzinkobenstein · Jul 23, 2010

Is it possible to check for duplicates in an array faster than (N^2)/2?

The only way I can think of to do so would be like so with an array of 5 elements:
check 1 against 2,3,4,5
check 2 against 3,4,5
check 3 against 4,5
check 4 against 5

Ken g6 · Jul 23, 2010

Absolutely! I can think of several methods. But I hesitate to disclose them until I know exactly what this homework problem is asking.

eLiu · Jul 23, 2010

Schnozzinkobenstein said:
Is it possible to check for duplicates in an array faster than (N^2)/2?

The only way I can think of to do so would be like so with an array of 5 elements:
check 1 against 2,3,4,5
check 2 against 3,4,5
check 3 against 4,5
check 4 against 5

You need to provide more information:
1) how large is the array?
2) what is the data inside the array? (positive integers? doubles? other?)
3) can I modify the original array?
4) can I use additional storage?
5) do you need to locate ALL duplicates?

If the original array cannot be changed and I cannot create additional storage, then this is pretty hard. If you can restrict the data type to like positive integers, there might be some magic you can play with imagining the array as a directed graph. Each array index is a node; the value at each index describes an edge going to the node with that value. Then something like the tortoise & hare algorithm can find cycles which correspond to duplicates. Not sure that this would work... I think you need pretty strong restrictions on the data type/range?

Hm, Ken brings up a good point. I removed the two actual solutions that are the most practical in case this is homework.

Ken g6 · Jul 23, 2010

Having thought about it, I'll provide this general hint: Presuming you can rearrange or duplicate the array elements, how could you move all duplicates closer together? Like right next to each other or right on top of each other?

I can think of *three* practical solutions.

None of them use graphs.

eLiu · Jul 23, 2010

Ken g6 said:
Having thought about it, I'll provide this general hint: Presuming you can rearrange or duplicate the array elements, how could you move all duplicates closer together? Like right next to each other or right on top of each other?

I can think of *three* practical solutions. None of them use graphs.

Haha. I can only think of two practical solutions; one is faster but requires more space... one is slower but requires more time.

The graph thing was kind of a random idea I had from the way the OP listed the numbers. Like consider this set:
A = 2,3,1,4,1,5 <--data
ind=1 2 3 4 5 6 <--array indexes (yeah Fortran indexing...)

Ok so following my idea, you start at node 6. A[6] = 5 tells you to go to node 5. A[5] =1 tells you to go to node 1. 1 goes to 2, 2 goes to 3, 3 goes back to 1. So that's a cycle, which we can detect using say tortoise vs hare. Though the idea seems to be extremely limited. Like I'm not sure how you'd detect all duplicates and prove it's faster than O(N^2).

This is of course... not a practical solution. I just thought it was kind of cute.

esun · Jul 23, 2010

Yes.

Look up hashing.

eLiu · Jul 23, 2010

esun said:
Yes.

Look up hashing.

Ok Ken, that covers 2 methods... your hint & hashing. What on earth is the third one?

Cogman · Jul 23, 2010

Well lets see, there's a heap of things that might work. The quick one is probably what you are after. Merging these hints might help. Sounds sorted to me.

Ken g6 · Jul 23, 2010

Sounds like a bunch of treemendous ideas in this thread.

Cogman · Jul 23, 2010

Ken g6 said:
Sounds like a bunch of treemendous ideas in this thread.

No two ways about it

(that was another I thought about doing)

iCyborg · Jul 23, 2010

Are there space constraints? In the OP, the elements are integers, with ~512MB extra space, it can be done in O(n) (worst case)

Cogman · Jul 23, 2010

iCyborg said:
Are there space constraints? In the OP, the elements are integers, with ~512MB extra space, it can be done in O(n) (worst case)

Well, you could get an O(n) response in much less (If you remember that O(3*n) = O(n))

iCyborg · Jul 23, 2010

With constant space? I can't see what you have in mind...

lambchops511 · Jul 23, 2010

you can do it in O(N) .. hashmap

Daverino · Jul 28, 2010

If you can use a hash (which takes space) you can do it in O(n)

O(n log n) for constant space. Sort the array. All the duplicates will be next to each other and can be found in n time.

Cogman · Jul 28, 2010

Daverino said:
If you can use a hash (which takes space) you can do it in O(n)

O(n log n) for constant space. Sort the array. All the duplicates will be next to each other and can be found in n time.

There is a constant space O(n) sort. It just isn't a comparison sort.

Radix sort.

Scali · Jul 29, 2010

Heh, I implemented a routine like that just yesterday.
I was importing some raw vertex data from 3dsmax, and wanted to convert it into an indexed vertexbuffer.
I used the hash method (memory is cheap these days).

Daverino · Jul 31, 2010

Cogman said:
There is a constant space O(n) sort. It just isn't a comparison sort.

Radix sort.

Radix is O(nk) where k is the width of the value being sorted. If you're sorting fixed width integers, it's great. It's not a general sorting algorithm, though. If you make no assumptions about the data, O(n log n) is the best you're going to do for sorting, practically speaking.

Cogman · Aug 1, 2010

Daverino said:
Radix is O(nk) where k is the width of the value being sorted. If you're sorting fixed width integers, it's great. It's not a general sorting algorithm, though. If you make no assumptions about the data, O(n log n) is the best you're going to do for sorting, practically speaking.

O(n log n) is the best you can get with a comparison sort. However, a count sort is a general sort that will always be O(n) (with a space requirement the size of all possible values of element n). Theoretically, O(n) is the best general sorting speed you can get. Practically, it depends. an n^2 algorithm will often beat most n log n algorithms for small n.

Daverino · Aug 1, 2010

Cogman said:
O(n log n) is the best you can get with a comparison sort. However, a count sort is a general sort that will always be O(n) (with a space requirement the size of all possible values of element n). Theoretically, O(n) is the best general sorting speed you can get. Practically, it depends. an n^2 algorithm will often beat most n log n algorithms for small n.

I wouldn't say a count sort is general. It can only sort algebraically and only when the numbers are fixed width. If you're sorting strings, you'll need to convert them to some numeric form and your bucket count could be the number of possible characters. That's fine for lower-case English (26) or even ASCII (256), but it's going to be scary for Unicode. Your bucket width will be the maximum length of the strings. Sorting mixed types is also more difficult, as well as handling arbitrary precision floats.

The thing is, with a comparison sort the algorithm stays the same and the only thing that changes between implementations is your comparison operator. With counting sorts, you need to change the counting method and the counting space. That's usually a LOT harder to do, which is why counting sorts are not general sorting algorithms. They need to be customized to your data and it's not guaranteed that a counting sort will be appropriate to that data. IF the data works well with a counting sort then you can beat a comparison sort. In most common examples of sorting, like integers, counting sorts will be superior in complexity.

Cogman · Aug 2, 2010

Daverino said:
I wouldn't say a count sort is general. It can only sort algebraically and only when the numbers are fixed width. If you're sorting strings, you'll need to convert them to some numeric form and your bucket count could be the number of possible characters. That's fine for lower-case English (26) or even ASCII (256), but it's going to be scary for Unicode. Your bucket width will be the maximum length of the strings. Sorting mixed types is also more difficult, as well as handling arbitrary precision floats.

The thing is, with a comparison sort the algorithm stays the same and the only thing that changes between implementations is your comparison operator. With counting sorts, you need to change the counting method and the counting space. That's usually a LOT harder to do, which is why counting sorts are not general sorting algorithms. They need to be customized to your data and it's not guaranteed that a counting sort will be appropriate to that data. IF the data works well with a counting sort then you can beat a comparison sort. In most common examples of sorting, like integers, counting sorts will be superior in complexity.

Don't get me wrong, I wasn't trying to argue that a counting sort was the best sort out there. My argument was more along the lines of "Be careful not to get too tied up in the big oh notation." Anything that can be comparison sorted, can be count sorted. The problem is you have to assign each "thing" a value in order for them to be sorted. That operation alone is probably going to be an n log n operation. As you pointed out, that is only not the case for the simplest of situation (integers). However, it IS a general case sorting algorithm because it CAN sort anything that a comparison sort can.

Daverino · Aug 2, 2010

Cogman said:
However, it IS a general case sorting algorithm because it CAN sort anything that a comparison sort can.

How would you sort an array of strings based on the count of vowels in each string using a count sort?

Cogman · Aug 2, 2010

Daverino said:
How would you sort an array of strings based on the count of vowels in each string using a count sort?

Go Order the string using a comparison sort,
Assign a number to each string.
Sort with the count sort using those numbers.

(if you wanted to not physically sort the string, you could use something like a selection sort to find the smallest value without a value, give it a value, and repeat until all values have a value.)

Not as efficient, still O(n) (because the preprocessing doesn't fall in the sorting complexity).

Heck, if all you are doing is counting vowels, then you can just use the valow count (a standard integer) to do the sort. What I provided above is the general case to sort anything a comparison sort can sort (by, using a comparison sort to assign the numbers used for the count sort :awe: )

Fastest Way: Checking for Duplicates in an Array

Member

Programming Moderator, Elite Member

Diamond Member

Programming Moderator, Elite Member

Diamond Member

Platinum Member

Diamond Member

Lifer

Programming Moderator, Elite Member

Lifer

Golden Member

Lifer

Golden Member

Senior member

Platinum Member

Lifer

Banned

Platinum Member

Lifer

Platinum Member

Lifer

Platinum Member

Lifer