Originally posted by: kamper
if you had, for example, only the 4 unique characters in your example G,C,T and A, there are only 4^8 unique strings, (64k). So most of your strings are duplicates and the short-circuiting means O(n) turns into O(64k) or, more generally, the lesser of O(n) and O(m^8) where m is the number of unique characters in your strings. At m = 6 your example n is again smaller though.
Another thought is that if you did do a hash, you could take the limited string possibilities into account and devise a very good hashing function, much better than a string one. Then you could cut your memory usage down without collisions. With only 4 characters you could make a bloody good one too. Eg:
int hashcode = 0
for each char in string
...hashcode *= 4 // literally, shift 2 bits
...if G hashcode += 0 // now fill in the low bits...
...if C hashcode += 1
...if T hashcode += 2
...else hashcode += 3
You end up with an int between 0 and 64k, guaranteed to be unique so collisions = duplicates, which is what you want.
If you only have ACTG, then I agree that this is probably the best approach to take -- take advantage of the fact that you can only have 64K possible values.
E.g., to elaborate:
Create a char array of 64k values. Initialize it to 0. Then take each input string, and convert it to a number using the above "hash function". Look at the array at the position of the hash -- if it's 0, put a 1 there. If it's 1, then you've already been there and have a duplicate. Output your input index.
This can be extended to store the original position as well. This would cost a bit more space, but still be small.
But if your input character set was bigger, e.g. 15 characters as in
http://en.wikipedia.org/wiki/DNA_sequence , then the space would get large. 15^8 = 2,562,890,625. You could reduce space consumption by storing only a bit at each position, using a bit vector implementation or rolling your own. The space would be around 2,562,890,625 / 8 = 320,361,328 bytes, which is perhaps manageable.
If the space wasn't manageable, or you had additional storage needs for other purposes, then you'd have to look at other data structures. E.g. an associate map on the "hash" (maybe even using a real hash table). E.g. a trie.
Now that I've written all this, it seems that most of these ideas have already been touched upon by kamper. Well, I hope it's a bit clearer now..