Fastest Way: Checking for Duplicates in an Array

Nov 17, 2005
86
0
0
Is it possible to check for duplicates in an array faster than (N^2)/2?

The only way I can think of to do so would be like so with an array of 5 elements:
check 1 against 2,3,4,5
check 2 against 3,4,5
check 3 against 4,5
check 4 against 5
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,609
4,530
75
Absolutely! I can think of several methods. But I hesitate to disclose them until I know exactly what this homework problem is asking. ;)
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Is it possible to check for duplicates in an array faster than (N^2)/2?

The only way I can think of to do so would be like so with an array of 5 elements:
check 1 against 2,3,4,5
check 2 against 3,4,5
check 3 against 4,5
check 4 against 5

You need to provide more information:
1) how large is the array?
2) what is the data inside the array? (positive integers? doubles? other?)
3) can I modify the original array?
4) can I use additional storage?
5) do you need to locate ALL duplicates?

If the original array cannot be changed and I cannot create additional storage, then this is pretty hard. If you can restrict the data type to like positive integers, there might be some magic you can play with imagining the array as a directed graph. Each array index is a node; the value at each index describes an edge going to the node with that value. Then something like the tortoise & hare algorithm can find cycles which correspond to duplicates. Not sure that this would work... I think you need pretty strong restrictions on the data type/range?

Hm, Ken brings up a good point. I removed the two actual solutions that are the most practical in case this is homework.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,609
4,530
75
Having thought about it, I'll provide this general hint: Presuming you can rearrange or duplicate the array elements, how could you move all duplicates closer together? Like right next to each other or right on top of each other?

I can think of *three* practical solutions. :) None of them use graphs. :confused:
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Having thought about it, I'll provide this general hint: Presuming you can rearrange or duplicate the array elements, how could you move all duplicates closer together? Like right next to each other or right on top of each other?

I can think of *three* practical solutions. :) None of them use graphs. :confused:

Haha. I can only think of two practical solutions; one is faster but requires more space... one is slower but requires more time.

The graph thing was kind of a random idea I had from the way the OP listed the numbers. Like consider this set:
A = 2,3,1,4,1,5 <--data
ind=1 2 3 4 5 6 <--array indexes (yeah Fortran indexing...)

Ok so following my idea, you start at node 6. A[6] = 5 tells you to go to node 5. A[5] =1 tells you to go to node 1. 1 goes to 2, 2 goes to 3, 3 goes back to 1. So that's a cycle, which we can detect using say tortoise vs hare. Though the idea seems to be extremely limited. Like I'm not sure how you'd detect all duplicates and prove it's faster than O(N^2).

This is of course... not a practical solution. I just thought it was kind of cute.
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
Well lets see, there's a heap of things that might work. The quick one is probably what you are after. Merging these hints might help. Sounds sorted to me.
 
Last edited:

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,609
4,530
75
Sounds like a bunch of treemendous ideas in this thread. :D
 

iCyborg

Golden Member
Aug 8, 2008
1,344
61
91
Are there space constraints? In the OP, the elements are integers, with ~512MB extra space, it can be done in O(n) (worst case) :)
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
Are there space constraints? In the OP, the elements are integers, with ~512MB extra space, it can be done in O(n) (worst case) :)

Well, you could get an O(n) response in much less (If you remember that O(3*n) = O(n))
 

Daverino

Platinum Member
Mar 15, 2007
2,004
1
0
If you can use a hash (which takes space) you can do it in O(n)

O(n log n) for constant space. Sort the array. All the duplicates will be next to each other and can be found in n time.
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
If you can use a hash (which takes space) you can do it in O(n)

O(n log n) for constant space. Sort the array. All the duplicates will be next to each other and can be found in n time.

There is a constant space O(n) sort. It just isn't a comparison sort.

Radix sort.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
Heh, I implemented a routine like that just yesterday.
I was importing some raw vertex data from 3dsmax, and wanted to convert it into an indexed vertexbuffer.
I used the hash method (memory is cheap these days).
 

Daverino

Platinum Member
Mar 15, 2007
2,004
1
0
There is a constant space O(n) sort. It just isn't a comparison sort.

Radix sort.
Radix is O(nk) where k is the width of the value being sorted. If you're sorting fixed width integers, it's great. It's not a general sorting algorithm, though. If you make no assumptions about the data, O(n log n) is the best you're going to do for sorting, practically speaking.
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
Radix is O(nk) where k is the width of the value being sorted. If you're sorting fixed width integers, it's great. It's not a general sorting algorithm, though. If you make no assumptions about the data, O(n log n) is the best you're going to do for sorting, practically speaking.

O(n log n) is the best you can get with a comparison sort. However, a count sort is a general sort that will always be O(n) (with a space requirement the size of all possible values of element n). Theoretically, O(n) is the best general sorting speed you can get. Practically, it depends. an n^2 algorithm will often beat most n log n algorithms for small n.
 

Daverino

Platinum Member
Mar 15, 2007
2,004
1
0
O(n log n) is the best you can get with a comparison sort. However, a count sort is a general sort that will always be O(n) (with a space requirement the size of all possible values of element n). Theoretically, O(n) is the best general sorting speed you can get. Practically, it depends. an n^2 algorithm will often beat most n log n algorithms for small n.

I wouldn't say a count sort is general. It can only sort algebraically and only when the numbers are fixed width. If you're sorting strings, you'll need to convert them to some numeric form and your bucket count could be the number of possible characters. That's fine for lower-case English (26) or even ASCII (256), but it's going to be scary for Unicode. Your bucket width will be the maximum length of the strings. Sorting mixed types is also more difficult, as well as handling arbitrary precision floats.

The thing is, with a comparison sort the algorithm stays the same and the only thing that changes between implementations is your comparison operator. With counting sorts, you need to change the counting method and the counting space. That's usually a LOT harder to do, which is why counting sorts are not general sorting algorithms. They need to be customized to your data and it's not guaranteed that a counting sort will be appropriate to that data. IF the data works well with a counting sort then you can beat a comparison sort. In most common examples of sorting, like integers, counting sorts will be superior in complexity.
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
I wouldn't say a count sort is general. It can only sort algebraically and only when the numbers are fixed width. If you're sorting strings, you'll need to convert them to some numeric form and your bucket count could be the number of possible characters. That's fine for lower-case English (26) or even ASCII (256), but it's going to be scary for Unicode. Your bucket width will be the maximum length of the strings. Sorting mixed types is also more difficult, as well as handling arbitrary precision floats.

The thing is, with a comparison sort the algorithm stays the same and the only thing that changes between implementations is your comparison operator. With counting sorts, you need to change the counting method and the counting space. That's usually a LOT harder to do, which is why counting sorts are not general sorting algorithms. They need to be customized to your data and it's not guaranteed that a counting sort will be appropriate to that data. IF the data works well with a counting sort then you can beat a comparison sort. In most common examples of sorting, like integers, counting sorts will be superior in complexity.

Don't get me wrong, I wasn't trying to argue that a counting sort was the best sort out there. My argument was more along the lines of "Be careful not to get too tied up in the big oh notation." Anything that can be comparison sorted, can be count sorted. The problem is you have to assign each "thing" a value in order for them to be sorted. That operation alone is probably going to be an n log n operation. As you pointed out, that is only not the case for the simplest of situation (integers). However, it IS a general case sorting algorithm because it CAN sort anything that a comparison sort can.
 

Daverino

Platinum Member
Mar 15, 2007
2,004
1
0
However, it IS a general case sorting algorithm because it CAN sort anything that a comparison sort can.

How would you sort an array of strings based on the count of vowels in each string using a count sort?
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
How would you sort an array of strings based on the count of vowels in each string using a count sort?

Go Order the string using a comparison sort,
Assign a number to each string.
Sort with the count sort using those numbers.

(if you wanted to not physically sort the string, you could use something like a selection sort to find the smallest value without a value, give it a value, and repeat until all values have a value.)

Not as efficient, still O(n) (because the preprocessing doesn't fall in the sorting complexity).

Heck, if all you are doing is counting vowels, then you can just use the valow count (a standard integer) to do the sort. What I provided above is the general case to sort anything a comparison sort can sort (by, using a comparison sort to assign the numbers used for the count sort :awe: )
 
Last edited: