
searching an unsorted list

Originally posted by: Markbnj
I like btrees as well, and find them very useful for large static data sets. One way to help keep them balanced is to either use some algorithm to select a root term that is somewhere in the middle of your possible range, or to insert an artificial middle value as the root and ignore that node in searches. This makes sure that statistically half your inserts will go in one direction, and half in the other. If the data is dynamic I don't like them as much, due to the costs of dynamically rebalancing the tree. In that case I lean toward hashed dictionaries.
Picking one good root won't help that much, as you could still end up with lists on either side. You could pick artificial middle values at each level, but that's the same as a trie 🙂 If you went dynamic, AVL and RB trees are still log n for most stuff.
 
Throw it into a database table and run the following query

SELECT Table.RecordID, Table.DNAString from
(
SELECT DNAString, Count(DNAString) As DNAStringTotal from Table GROUP BY DNAString having Count(DNAString) > 1
) sub INNER JOIN Table on sub.DNAString = Table.DNAString
ORDER BY Table.DNAString, Table.RecordID

That query would show you any strings that are duplicates, and its RecordID (aka its position) if you imported it in the same order that you'd look at it in an array.
 
Originally posted by: kamper
I'd build a trie. Building it is O(n) and searching is O(1), just like a hash but probably using less memory and with no chance of collisions. Plus, in case you ever need it, you get free prefix matching 🙂
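The trie idea above can be sketched in a few lines. This is a minimal illustration in Python (the thread doesn't name a language), using nested dicts as trie nodes and assuming fixed-length strings so no string is a prefix of another:

```python
# Hypothetical sketch: each trie node is a dict branching on one character.
# Inserting an 8-char string costs 8 steps regardless of how many strings
# you've stored, which is the O(1)-per-lookup behaviour described above.
def find_duplicates(strings):
    root = {}
    dups = []
    for i, s in enumerate(strings):
        node = root
        for ch in s:
            node = node.setdefault(ch, {})  # walk/create one branch per char
        if node.get("$end"):                # marker: this string seen before
            dups.append((i, s))
        else:
            node["$end"] = True
    return dups

print(find_duplicates(["GATTACAG", "CCCCGGGG", "GATTACAG"]))
# [(2, 'GATTACAG')]
```

The `"$end"` marker key is an arbitrary choice here; any sentinel that can't collide with an input character would do.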

The problem with a binary tree is that you get O(log n) searching iff you keep it balanced, which is extra complexity. And building a binary tree is O(n log n).

Actually the interesting thing about a binary tree is that the complexity will never actually be as bad as n if your possible strings are fewer. What I mean is that if you had, for example, only the 4 unique characters in your example G,C,T and A, there are only 4^8 unique strings, (64k). So most of your strings are duplicates and the short-circuiting means O(n) turns into O(64k) or, more generally, the lesser of O(n) and O(m^8) where m is the number of unique characters in your strings. At m = 6 your example n is again smaller though.

Another thought is that if you did do a hash, you could take the limited string possibilities into account and devise a very good hashing function, much better than a string one. Then you could cut your memory usage down without collisions. With only 4 characters you could make a bloody good one too. Eg:
int hashcode = 0
for each char in string
...hashcode *= 4 // literally, shift 2 bits
...if G hashcode += 0 // now fill in the low bits...
...if C hashcode += 1
...if T hashcode += 2
...else hashcode += 3

You end up with an int between 0 and 64k, guaranteed to be unique so collisions = duplicates, which is what you want.
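The pseudocode above translates directly into runnable form. A minimal Python sketch, assuming fixed 8-character strings over G, C, T, A (two bits per character packs each string into an int in [0, 4^8), so equal hashes imply equal strings):

```python
# 2-bit code per character; the mapping order follows the pseudocode above.
CODE = {"G": 0, "C": 1, "T": 2, "A": 3}

def dna_hash(s):
    h = 0
    for ch in s:
        h = (h << 2) | CODE[ch]  # shift 2 bits, then fill in the low bits
    return h

print(dna_hash("GGGGGGGG"))  # 0
print(dna_hash("AAAAAAAA"))  # 65535, i.e. 4**8 - 1
```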

good advice there, the trie is definitely a good option..
you might also want to look at this:
http://en.wikipedia.org/wiki/Judy_array
 
Originally posted by: kamper
if you had, for example, only the 4 unique characters in your example G,C,T and A, there are only 4^8 unique strings, (64k). So most of your strings are duplicates and the short-circuiting means O(n) turns into O(64k) or, more generally, the lesser of O(n) and O(m^8) where m is the number of unique characters in your strings. At m = 6 your example n is again smaller though.

Another thought is that if you did do a hash, you could take the limited string possibilities into account and devise a very good hashing function, much better than a string one. Then you could cut your memory usage down without collisions. With only 4 characters you could make a bloody good one too. Eg:
int hashcode = 0
for each char in string
...hashcode *= 4 // literally, shift 2 bits
...if G hashcode += 0 // now fill in the low bits...
...if C hashcode += 1
...if T hashcode += 2
...else hashcode += 3

You end up with an int between 0 and 64k, guaranteed to be unique so collisions = duplicates, which is what you want.

If you only have ACTG, then I agree that this is probably the best approach to take -- take advantage of the fact that you can only have 64K possible values.

E.g., to elaborate:

Create a char array of 64k values. Initialize it to 0. Then take each input string, and convert it to a number using the above "hash function". Look at the array at the position of the hash -- if it's 0, put a 1 there. If it's 1, then you've already been there and have a duplicate. Output your input index.

This can be extended to store the original position as well. This would cost a bit more space, but it would still be small.
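The elaboration above fits in a few lines. A sketch in Python, using a 64k presence array indexed by the 2-bit packing "hash" described earlier (restated here so the example is self-contained):

```python
CODE = {"G": 0, "C": 1, "T": 2, "A": 3}

def dna_hash(s):
    h = 0
    for ch in s:
        h = (h << 2) | CODE[ch]
    return h

def report_duplicates(strings):
    seen = bytearray(4 ** 8)   # 64k slots, initialized to 0
    dup_indices = []
    for i, s in enumerate(strings):
        h = dna_hash(s)
        if seen[h]:            # already been here: duplicate
            dup_indices.append(i)
        else:
            seen[h] = 1        # first visit: mark the slot
    return dup_indices

print(report_duplicates(["GATTACAG", "CCCCGGGG", "GATTACAG"]))  # [2]
```

Storing the first occurrence's index instead of a 1 in each slot would give you the "store the original position" extension at the cost of a wider array.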

But if your input character set was bigger, e.g. 15 characters as in http://en.wikipedia.org/wiki/DNA_sequence, then the space would get large. 15^8 = 2,562,890,625. You could reduce space consumption by storing only a bit at each position, using a bit vector implementation or rolling your own. The space would be around 2,562,890,625 / 8 = 320,361,328 bytes, which is perhaps manageable.

If the space wasn't manageable, or you had additional storage needs for other purposes, then you'd have to look at other data structures. E.g. an associative map on the "hash" (maybe even using a real hash table). E.g. a trie.

Now that I've written all this, it seems that most of these ideas have already been touched upon by kamper. Well, I hope it's a bit clearer now..
 
My hash function is also a string trie where the bit pattern is actually the branches of the tree. It's also kind of a bucket sort. As long as the population is dense (m^8 < n), it's basically all the same, just with different implementations. 😛
 
Originally posted by: kamper
Originally posted by: Markbnj
I like btrees as well, and find them very useful for large static data sets. One way to help keep them balanced is to either use some algorithm to select a root term that is somewhere in the middle of your possible range, or to insert an artificial middle value as the root and ignore that node in searches. This makes sure that statistically half your inserts will go in one direction, and half in the other. If the data is dynamic I don't like them as much, due to the costs of dynamically rebalancing the tree. In that case I lean toward hashed dictionaries.
Picking one good root won't help that much, as you could still end up with lists on either side. You could pick artificial middle values at each level, but that's the same as a trie 🙂 If you went dynamic, AVL and RB trees are still log n for most stuff.

Yeah, but it's a nice start, and if the dataset is sufficiently disordered and isn't dynamic, the chance of getting a pretty balanced tree is high. You have to choose the structure based on the data, and trees are for inherently unordered data. If you get lists then you chose the wrong structure.

I was finishing up an implementation of a BSP tree last night, three dimensions, 10k nodes. With a mean-valued root and a randomized input set I ended up with a depth of 97 and was able to do range searches checking distance on an average of just 15 nodes to completion. For some things trees excel. I think in the OPs case it really depends entirely on the nature of the data.
 
Sure, but why bother when hashing is still better? Arguing for a btree here is like arguing for a bubble sort because it's better than O(n^3). A btree's best case scenario is worse than hashing and, if you insist on not using an established general-case balancing algorithm, then your worst case is far worse.

I still don't see the value in choosing a mean-valued root. It does not improve your runtime order and should only be cutting your time in half at best (I'm guessing 25% will be the average). If you're confident enough in the randomness of the distribution of your data to not balance the whole tree, you should be confident enough to not bother balancing the top level.
 
Sure, but why bother when hashing is still better? Arguing for a btree here is like arguing for a bubble sort because it's better than O(n^3). A btree's best case scenario is worse than hashing and, if you insist on not using an established general-case balancing algorithm, then your worst case is far worse.

I'm not really in disagreement with that. Sort of got off the OP's original question. The tree approach might or might not be better in this case. I don't really know enough about the data to say.

Choosing a mean-valued root is pretty much a standard approach for building any n-ary tree. All it does is avoid the pathological case where you hang everything off one branch of the root. In the case of sufficiently random data you're right that it doesn't have much overall effect on balance, since the next layer down should distribute fairly evenly, but it's just not as aesthetic to have a lopsided tree 😉.
 