
searching an unsorted list

MrDudeMan

Lifer
I have an array 1,167,501 members long, each with a pointer to 8 characters, effectively being array[1167501][8].

I need to check to see if any of the 1167501 members contain the same 8 characters verbatim.

I am reading in a DNA pattern from a file, i.e. GCTGACGA. If any of them are the same, I need to know the position of the duplicate.

Is the fastest way to do this just to read the 8 characters from position 0, then check them against the 8 in position 1-1167501, increment to position 1 and check 0 and 2-1167501?

I can't sort the list, so normal searching algorithms won't work.
 
That's not the fastest way; that's an O(n^2) algorithm, which is horribly inefficient.

Are all 1,167,501 pointers different? Or is it really the pointers that might be duplicated? I would double check that before you start comparing the wrong values for duplicates.

Either way you will have to sort the list somehow, whether it's implicit in your search algorithm or you sort the list into a new data structure and then search.

Your current choice is similar to how a bubble sort would work.

I would create a hash table that uses the 8-char sequence as the key, and the index in the array that that sequence belongs to as the value for the table. Just make sure you allow for multiple values with the same key.

Then just iterate down the hash table; for any key that has more than 1 value associated with it, iterate down the list of values, and those are the indexes of the original array entries that contain the same 8-character sequence.

That will give you much better runtime performance.

You could also do this by inserting the character sequences into a binary search tree, but that's not ideal in my opinion.
 
> I am reading in a DNA pattern from a file, i.e. GCTGACGA. If any of them are the same, I need to know the position of the duplicate.
> Is the fastest way to do this just to read the 8 characters from position 0, then check them against the 8 in position 1-1167501, increment to position 1 and check 0 and 2-1167501?

What do you do when you find a pair? What about a triple or more?

I'm too lazy to grab an algorithm text or visit wiki to see if a hash will work, but if not:

For index n, why do you think you need to search 0....n-1 again when you already searched using them as the starting index? Only searching forward will at least cut the work of a set of linear searches in half to roughly (n^2)/2.
 
Originally posted by: DaveSimmons
> I am reading in a DNA pattern from a file, i.e. GCTGACGA. If any of them are the same, I need to know the position of the duplicate.
> Is the fastest way to do this just to read the 8 characters from position 0, then check them against the 8 in position 1-1167501, increment to position 1 and check 0 and 2-1167501?

What do you do when you find a pair? What about a triple or more?

I'm too lazy to grab an algorithm text or visit wiki to see if a hash will work, but if not:

For index n, why do you think you need to search 0....n-1 again when you already searched using them as the starting index? Only searching forward will at least cut the work of a set of linear searches in half to roughly (n^2)/2.


Which still makes it O(n^2) 😛
 
In abstract terms yes, but in the real world twice as fast does often matter.

Obviously if there is an O(n log n) way such as sorting, it would be better.

> I can't sort the list, so normal searching algorithms won't work.

Why not? With only about 14 MB of data (8 chars + a 4-byte pointer per entry) I don't see why buffering the file and doing an in-memory sort wouldn't work, unless this is running on some low-memory embedded device.
 
Twice as fast doesn't rescue an n^2 algorithm. What if all of a sudden his list has 20,000,000 records in it? It will not scale.

I could see using something that is n^2 IF you knew your data length will NEVER change and it is relatively small compared to the rest of the program's operations.

If this is a quick and dirty hack to do a one time parse/transform of some data then not worrying about poorly scalable algorithms is fine, but as soon as this code is used somewhere in a user land or server application then that is a very poor design decision.

Now, if this is on an embedded device, then all sorts of other considerations need to be looked at such as processor type and speed, memory type and speed, and even your input/output sources.


Either way... some more information is needed from the OP.
 
Are all 1,167,501 pointers different? Or is it really the pointers that might be duplicated? I would double check that before you start comparing the wrong values for duplicates.

If the pointers point to the same string then someone already did the work 🙂.

OP, the hashed dictionary already suggested, or sorting the array alphabetically into a binary tree, will probably work well if the data isn't very dynamic. Either will greatly reduce the number of direct comparisons you have to make.
 
Originally posted by: Markbnj
Are all 1,167,501 pointers different? Or is it really the pointers that might be duplicated? I would double check that before you start comparing the wrong values for duplicates.

If the pointers point to the same string then someone already did the work 🙂.

OP, the hashed dictionary already suggested, or sorting the array alphabetically into a binary tree, will probably work well if the data isn't very dynamic. Either will greatly reduce the number of direct comparisons you have to make.

Yes, but he would still be storing multiple copies of the same pointer!

 
I'm not very good at trees and tree-like structures because I'm not clear on how to prevent them from becoming dangling, lopsided monstrosities.

If you're in the same boat, here is an idea off the top of my head:

First, I'm assuming that this array is a db (flat, relational, ISAM, doesn't matter; the storage method is what matters) so it's not in memory all the time, or at least comes from a db. If this is not the case, a rethinking is needed, but without knowing for certain, I'll go with what seems logical.

The array index is the record position? Yes...no...maybe?

You need to sort it. I'm not a tree man, so I'll come up with what I think can be a decent solution for a non-tree algo:

You'll need a structure, one that stores two values: index of record position and value of data.

Sort on data. Don't read it all in at once before sorting. Insert sort (not insertion sort). Read one record and attempt to find it in the list; if it's found, discard it and record the info you need. If it's not found, insert it in the list and record the info you need.

I use this method to read in a list of triangles that make up a 3D geometry. It eliminates redundant vertices by attempting to find the current vertex in the list. If it's found, it returns the index of the existing vertex. If it's not found, it returns -1 (since arrays are zero-based, 0 is a valid return value).

The test model for my program is 1366 polys, 3x that for vertices. It reduces it to 390 vertices (I think 390, it's in the 300s).

On top of removing redundant vertices, it also links references between polys and vertices, so that vertices contain references to the polys that use them and each poly has a pointer to the vertices that define its extents. It has zero redundancy and loads that 1366-poly model faster than I can measure by inspection. IOW, I don't notice any slowness in the algo. It loads and processes in much less than a second, so I'd have to use profiling tools to get an idea.



 
Originally posted by: Argo
HashSet is your answer.

Yup... basically create a hash_set:
http://www.sgi.com/tech/stl/hash_set.html

Then read your file in one item at a time. Check to see if it is in your set (O(log n)), if not add it (O(log n)). If it is in the set then you have a duplicate.

Checking & adding n items should be O(n log n)... There is going to be some memory overhead for the hash_set, probably several megs for that much data....


 
I need to start applying OOP more. I have a book, "Data Structures in C++", from a prior programming class, that is entirely about STL, but I know using templates creates serious overhead, and in real-time computer games that overhead can be cumulative.
 
Originally posted by: aCynic2
I need to start applying OOP more. I have a book, "Data Structures in C++" , from a prior programming class, that is entirely about STL, but I know using templates creates serious overhead and in real-time computer games, that overhead can be cumulative.

Templates create compile-time overhead. There is no run-time cost, and in many cases code using STL will be considerably faster than its C equivalent. See the performance of qsort vs. std::sort, for example.
 
Originally posted by: statik213
Originally posted by: Argo
HashSet is your answer.

Yup... basically create a hash_set:
http://www.sgi.com/tech/stl/hash_set.html

Then read your file in one item at a time. Check to see if it is in your set (O(log n)), if not add it (O(log n)). If it is in the set then you have a duplicate.

Checking & adding n items should be O(n log n)... There is going to be some memory overhead for the hash_set, probably several megs for that much data....

The lookup time of a HashSet is constant (provided a good hashCode function), so the total cost would be O(n) with some memory overhead.
 
Originally posted by: Venix

Templates create compile-time overhead. There is no run-time cost, and in many cases code using STL will be considerably faster than its C equivalent. See the performance of qsort vs. std::sort, for example.

I don't believe that at all! There is an overhead incurred from the adaptability, and from the excess instructions to which the template compiles. This is not a case where I'm going to accept someone's or some company's word for it.

I almost never take someone or some company's word at face value. To do so is just plain stupid.

I'll need to see ASM code of the resulting object file to believe it.

Show me proof.

This kind of attitude has gotten me behind in school, simply because I wouldn't accept what they said. If I could not reason it out, there was no proof it worked under all circumstances (one of the reasons I have trouble with infinite series).

I can generally see a near 1:1 conversion in C. C++ makes me think it's going to bloat and slow down and, pardon me, young people aren't familiar with the C-to-C++ comparison. They aren't familiar with C's and Pascal's performance (Pascal actually outperforming C in general).
 
Originally posted by: aCynic2
Originally posted by: Venix

Templates create compile-time overhead. There is no run-time cost, and in many cases code using STL will be considerably faster than its C equivalent. See the performance of qsort vs. std::sort, for example.

I don't believe that at all! There is an overhead incurred from the adaptability, and from the excess instructions to which the template compiles. This is not a case where I'm going to accept someone's or some company's word for it.

I almost never take someone or some company's word at face value. To do so is just plain stupid.

I'll need to see ASM code of the resulting object file to believe it.

Show me proof.

This kind of attitude has gotten me behind in school, simply because I wouldn't accept what they said. If I could not reason it out, there was no proof it worked under all circumstances (one of the reasons I have trouble with infinite series).

I can generally see a near 1:1 conversion in C. C++ makes me think it's going to bloat and slow down and, pardon me, young people aren't familiar with the C-to-C++ comparison. They aren't familiar with C's and Pascal's performance (Pascal actually outperforming C in general).
What "excess instructions"? Templates incur only a compile-time overhead; there is no extra code created for templated code. Template resolutions are ALL made during compile time; there's no such thing as a "dynamic template" (that just doesn't make sense, since the way templates work is inherently compile-time). It's the same as saying that using #define will slow down runtime performance; these are all compile-time operations, not runtime ones.

If you don't believe me, or anyone else, then try it for yourself. Compile a piece of code using templates and one without. They'll perform exactly the same. In some cases, templated code can even increase performance. For example, to convert strings to numeric data types, C provides atoi() and atof(). In C++, these can be merged into a single templated function, which is only compiled for the types you actually use, resulting in a smaller executable and less code.
 
Originally posted by: aCynic2
Originally posted by: Venix

Templates create compile-time overhead. There is no run-time cost, and in many cases code using STL will be considerably faster than its C equivalent. See the performance of qsort vs. std::sort, for example.

I don't believe that at all! There is an overhead incurred from the adaptability, and from the excess instructions to which the template compiles. This is not a case where I'm going to accept someone's or some company's word for it.

I almost never take someone or some company's word at face value. To do so is just plain stupid.

I'll need to see ASM code of the resulting object file to believe it.

Show me proof.

This kind of attitude has gotten me behind in school, simply because I wouldn't accept what they said. If I could not reason it out, there was no proof it worked under all circumstances (one of the reasons I have trouble with infinite series).

I can generally see a near 1:1 conversion in C. C++ makes me think it's going to bloat and slow down and, pardon me, young people aren't familiar with the C-to-C++ comparison. They aren't familiar with C's and Pascal's performance (Pascal actually outperforming C in general).

"Overhead from adaptability"? STL algorithms are faster than their C equivalents, and match hand-coded versions.

"Excess instructions"? Are you seriously suggesting that SomeFunc<int> compiles to something different than if it were written without a template?

In any case, there are tons of real-world examples of highly performant template-dependent code. Blitz++ offers scientific computation performance that rivals Fortran. Boost.Graph is fast enough that I use it in performance-critical image processing code. The Havok physics engine relies heavily on STL containers and algorithms. To quote the Blitz++ documentation: "These results are being obtained not through better optimizing compilers, preprocessors, or language extensions, but through the use of template techniques. By using templates cleverly, optimizations such as loop fusion, unrolling, tiling, and algorithm specialization can be performed automatically at compile time."
 
Just to add something else that hasn't been covered:

If you can't sort or move data, you can create a second array of pointers to the data and sort that instead.

However, while you will then be able to implement a smarter search algorithm, scanning the sorted list (of pointers) will not be memory- or cache-efficient, as the data itself will still be fragmented (i.e. linearly walking the sorted list will access random elements of the original data set).
 
Very well, let me ask, sc4freak and Venix: if C++ is so all-around good, why is C still predominantly used in embedded control?

 
> "if C++ is so all-around good, why is C still predominantly used in embedded control?"

Because the C runtime has a smaller memory footprint? Or because it was cheaper to stop with C runtime support, and C was good enough?

In your earlier comment you appear to have confused classes (which often do add slight overhead) with templates, which are compile-time code directives that work like macros.

C++ can be as efficient as C; where you can run into trouble is in using existing libraries without being aware of their performance and memory use. That's true with any language, though.

For heavy processing of several GB of text data it might make sense to use fixed char buffers and C functions instead of std library strings and string functions, but you can use either approach in C++.

Outside of embedded systems, maintainability of code is often much more important than pure efficiency, so a few lost cycles from vtable lookup won't matter.
 
aCynic, as Dave and others have suggested, there is no additional runtime overhead involved with using templated rather than hand-specialized classes. The C++/C debate in general was interesting in about 1990, and believe me it was vigorous. The bottom line is that C++ introduces no more runtime overhead than any other highly-modular programming style. Classes aren't different from structs in most metrics, and the overhead of indirection through a virtual function table isn't worth worrying about.

Embedded systems are not high performance systems, in terms of code execution speed. The kinds of systems that are, i.e. games and realtime simulations, are almost all written in C++ now, because they need the organizational capabilities of C++. What embedded systems require is very tight control over memory layout and adherence to severe constraints on the amount of memory, and precise timing of instructions. Assembly is preferred for this, and C by extension, and I suspect that long-standing practice is not a small part of the reason. These also typically are not large, complicated systems that benefit as much from the high-level organizational concepts of C++.
 
Very well, I'm going to start looking into it again. The book I've always liked, "C++: How to Program" by Deitel and Deitel, is out of date by 4 editions, so it's time to update. I got this book first for a college course; it's good to see it hasn't lost its price tag. 🙁

Dave, I've only ever seen templates used in classes, so for me it's a way of making a general catch-all class that can adapt to whatever base data type is needed. After all, isn't the STL a class template library?

Added a tutorial and reference for STL as well...bah...almost $200 for two books.
 
I'd build a trie. Building it is O(n) and searching is O(1), just like a hash but probably using less memory and with no chance of collisions. Plus, in case you ever need it, you get free prefix matching 🙂

The problem with a binary tree is that you only get O(log n) searching if you keep it balanced, which is extra complexity. And building a binary tree is O(n log n).

Actually the interesting thing about a binary tree is that the complexity will never actually be as bad as n if your possible strings are fewer. What I mean is that if you had, for example, only the 4 unique characters in your example G,C,T and A, there are only 4^8 unique strings, (64k). So most of your strings are duplicates and the short-circuiting means O(n) turns into O(64k) or, more generally, the lesser of O(n) and O(m^8) where m is the number of unique characters in your strings. At m = 6 your example n is again smaller though.

Another thought is that if you did do a hash, you could take the limited string possibilities into account and devise a very good hashing function, much better than a generic string one. Then you could cut your memory usage down without collisions. With only 4 characters you could make a bloody good one too. E.g.:

int hashcode = 0;
for (int i = 0; i < 8; i++) {
    hashcode *= 4;                        /* literally, shift 2 bits */
    if (s[i] == 'G') hashcode += 0;       /* now fill in the low bits... */
    else if (s[i] == 'C') hashcode += 1;
    else if (s[i] == 'T') hashcode += 2;
    else hashcode += 3;                   /* 'A' */
}

You end up with an int between 0 and 64k, guaranteed to be unique so collisions = duplicates, which is what you want.
 
I like btrees as well, and find them very useful for large static data sets. One way to help keep them balanced is to either use some algorithm to select a root term that is somewhere in the middle of your possible range, or to insert an artificial middle value as the root and ignore that node in searches. This makes sure that statistically half your inserts will go in one direction, and half in the other. If the data is dynamic I don't like them as much, due to the costs of dynamically rebalancing the tree. In that case I lean toward hashed dictionaries.
 