• We’re currently investigating an issue related to the forum theme and styling that is impacting page layout and visual formatting. The problem has been identified, and we are actively working on a resolution. There is no impact to user data or functionality, this is strictly a front-end display issue. We’ll post an update once the fix has been deployed. Thanks for your patience while we get this sorted.

Fastest Way: Checking for Duplicates in an Array

Is it possible to check for duplicates in an array faster than (N^2)/2?

The only way I can think of to do so would be like so with an array of 5 elements:
check 1 against 2,3,4,5
check 2 against 3,4,5
check 3 against 4,5
check 4 against 5
 
Absolutely! I can think of several methods. But I hesitate to disclose them until I know exactly what this homework problem is asking. 😉
 
Is it possible to check for duplicates in an array faster than (N^2)/2?

The only way I can think of to do so would be like so with an array of 5 elements:
check 1 against 2,3,4,5
check 2 against 3,4,5
check 3 against 4,5
check 4 against 5

You need to provide more information:
1) how large is the array?
2) what is the data inside the array? (positive integers? doubles? other?)
3) can I modify the original array?
4) can I use additional storage?
5) do you need to locate ALL duplicates?

If the original array cannot be changed and I cannot create additional storage, then this is pretty hard. If you can restrict the data type to like positive integers, there might be some magic you can play with imagining the array as a directed graph. Each array index is a node; the value at each index describes an edge going to the node with that value. Then something like the tortoise & hare algorithm can find cycles which correspond to duplicates. Not sure that this would work... I think you need pretty strong restrictions on the data type/range?

Hm, Ken brings up a good point. I removed the two actual solutions that are the most practical in case this is homework.
 
Having thought about it, I'll provide this general hint: Presuming you can rearrange or duplicate the array elements, how could you move all duplicates closer together? Like right next to each other or right on top of each other?

I can think of *three* practical solutions. 🙂 None of them use graphs. 😕
 
Having thought about it, I'll provide this general hint: Presuming you can rearrange or duplicate the array elements, how could you move all duplicates closer together? Like right next to each other or right on top of each other?

I can think of *three* practical solutions. 🙂 None of them use graphs. 😕

Haha. I can only think of two practical solutions; one is faster but requires more space... one is slower but requires more time.

The graph thing was kind of a random idea I had from the way the OP listed the numbers. Like consider this set:
A = 2,3,1,4,1,5 <--data
ind=1 2 3 4 5 6 <--array indexes (yeah Fortran indexing...)

Ok so following my idea, you start at node 6. A[6] = 5 tells you to go to node 5. A[5] =1 tells you to go to node 1. 1 goes to 2, 2 goes to 3, 3 goes back to 1. So that's a cycle, which we can detect using say tortoise vs hare. Though the idea seems to be extremely limited. Like I'm not sure how you'd detect all duplicates and prove it's faster than O(N^2).

This is of course... not a practical solution. I just thought it was kind of cute.
 
Well lets see, there's a heap of things that might work. The quick one is probably what you are after. Merging these hints might help. Sounds sorted to me.
 
Last edited:
Sounds like a bunch of treemendous ideas in this thread. 😀
 
Are there space constraints? In the OP, the elements are integers, with ~512MB extra space, it can be done in O(n) (worst case) 🙂
 
If you can use a hash (which takes space) you can do it in O(n)

O(n log n) for constant space. Sort the array. All the duplicates will be next to each other and can be found in n time.
 
If you can use a hash (which takes space) you can do it in O(n)

O(n log n) for constant space. Sort the array. All the duplicates will be next to each other and can be found in n time.

There is a constant space O(n) sort. It just isn't a comparison sort.

Radix sort.
 
Heh, I implemented a routine like that just yesterday.
I was importing some raw vertex data from 3dsmax, and wanted to convert it into an indexed vertexbuffer.
I used the hash method (memory is cheap these days).
 
There is a constant space O(n) sort. It just isn't a comparison sort.

Radix sort.
Radix is O(nk) where k is the width of the value being sorted. If you're sorting fixed width integers, it's great. It's not a general sorting algorithm, though. If you make no assumptions about the data, O(n log n) is the best you're going to do for sorting, practically speaking.
 
Radix is O(nk) where k is the width of the value being sorted. If you're sorting fixed width integers, it's great. It's not a general sorting algorithm, though. If you make no assumptions about the data, O(n log n) is the best you're going to do for sorting, practically speaking.

O(n log n) is the best you can get with a comparison sort. However, a count sort is a general sort that will always be O(n) (with a space requirement the size of all possible values of element n). Theoretically, O(n) is the best general sorting speed you can get. Practically, it depends. an n^2 algorithm will often beat most n log n algorithms for small n.
 
O(n log n) is the best you can get with a comparison sort. However, a count sort is a general sort that will always be O(n) (with a space requirement the size of all possible values of element n). Theoretically, O(n) is the best general sorting speed you can get. Practically, it depends. an n^2 algorithm will often beat most n log n algorithms for small n.

I wouldn't say a count sort is general. It can only sort algebraically and only when the numbers are fixed width. If you're sorting strings, you'll need to convert them to some numeric form and your bucket count could be the number of possible characters. That's fine for lower-case English (26) or even ASCII (256), but it's going to be scary for Unicode. Your bucket width will be the maximum length of the strings. Sorting mixed types is also more difficult, as well as handling arbitrary precision floats.

The thing is, with a comparison sort the algorithm stays the same and the only thing that changes between implementations is your comparison operator. With counting sorts, you need to change the counting method and the counting space. That's usually a LOT harder to do, which is why counting sorts are not general sorting algorithms. They need to be customized to your data and it's not guaranteed that a counting sort will be appropriate to that data. IF the data works well with a counting sort then you can beat a comparison sort. In most common examples of sorting, like integers, counting sorts will be superior in complexity.
 
I wouldn't say a count sort is general. It can only sort algebraically and only when the numbers are fixed width. If you're sorting strings, you'll need to convert them to some numeric form and your bucket count could be the number of possible characters. That's fine for lower-case English (26) or even ASCII (256), but it's going to be scary for Unicode. Your bucket width will be the maximum length of the strings. Sorting mixed types is also more difficult, as well as handling arbitrary precision floats.

The thing is, with a comparison sort the algorithm stays the same and the only thing that changes between implementations is your comparison operator. With counting sorts, you need to change the counting method and the counting space. That's usually a LOT harder to do, which is why counting sorts are not general sorting algorithms. They need to be customized to your data and it's not guaranteed that a counting sort will be appropriate to that data. IF the data works well with a counting sort then you can beat a comparison sort. In most common examples of sorting, like integers, counting sorts will be superior in complexity.

Don't get me wrong, I wasn't trying to argue that a counting sort was the best sort out there. My argument was more along the lines of "Be careful not to get too tied up in the big oh notation." Anything that can be comparison sorted, can be count sorted. The problem is you have to assign each "thing" a value in order for them to be sorted. That operation alone is probably going to be an n log n operation. As you pointed out, that is only not the case for the simplest of situation (integers). However, it IS a general case sorting algorithm because it CAN sort anything that a comparison sort can.
 
How would you sort an array of strings based on the count of vowels in each string using a count sort?

Go Order the string using a comparison sort,
Assign a number to each string.
Sort with the count sort using those numbers.

(if you wanted to not physically sort the string, you could use something like a selection sort to find the smallest value without a value, give it a value, and repeat until all values have a value.)

Not as efficient, still O(n) (because the preprocessing doesn't fall in the sorting complexity).

Heck, if all you are doing is counting vowels, then you can just use the valow count (a standard integer) to do the sort. What I provided above is the general case to sort anything a comparison sort can sort (by, using a comparison sort to assign the numbers used for the count sort :awe: )
 
Last edited:
Back
Top