Two questions re: string comparisons in Perl and hashing

KingNothing

Diamond Member
Apr 6, 2002
Question the first:

Is there a more efficient way to see if two strings are equal than $x eq $y?

Question the second:

It seems that the smaller two strings are, the faster the comparison would be. Is it feasible to hash two large (>500 character) strings and compare the hashes to determine if the original strings were equal? What would be a good algorithm for doing this?

Thanks in advance.
 

UCJefe

Senior member
Jan 27, 2000
Originally posted by: KingNothing
Question the first:

Is there a more efficient way to see if two strings are equal than $x eq $y?

I doubt it. That's probably the most efficient method built into Perl.

Are you really doing something where performance is this critical? You're talking about saving nanoseconds here; with strings of that size, maybe a millisecond or two.

Question the second:

It seems that the smaller two strings are, the faster the comparison would be. Is it feasible to hash two large (>500 character) strings and compare the hashes to determine if the original strings were equal? What would be a good algorithm for doing this?

Well, I suppose you could do something like this, but now you have a time versus space tradeoff. If you use SHA-256 (just one of many hash algorithms), you'll spend an extra 32 bytes storing each hash, which may or may not matter to you. You also have the extra computation of creating the hash, which, by the way, has to scan through every character of the string anyway. Unless you have a scenario where time isn't critical when the strings are created but is when you compare them, I don't know that a hash comparison would buy you anything.

An alternative to using a SHA implementation (or some other cryptographic hash algorithm) would be to throw both strings into a Perl hash as keys and check how many keys you end up with. Again, Perl has to scan the strings to hash them. You're not going to do better than O(n): whatever method you use has to go through the whole string, either while comparing or while hashing.

All in all, it's really quite pointless to try to beat the performance of Perl's built-in operators. People a lot smarter than either of us spent a long time making sure they run as fast as possible.

If you really need to ensure that two strings are equivalent character by character, use 'eq'.
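In Python terms the key-count trick looks like this (a dict works the same way as a Perl hash here; the variable names are made up for illustration):

```python
# The key-count trick: use both strings as keys in a dict (Python's
# equivalent of a Perl hash). Equal strings collapse into a single
# key, so ending up with one key means they matched.
x = "some long string " * 30
y = "some long string " * 30
print(len({x: 1, y: 1}) == 1)   # prints True, since x equals y
```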

 

Barnaby W. Füi

Elite Member
Aug 14, 2001
If you need to do the comparison multiple times, hashing could help. Otherwise, there's not much you can do. You can't compare something without looking at it!
 

KingNothing

Diamond Member
Apr 6, 2002
Originally posted by: BingBongWongFooey
If you need to do the comparison multiple times, hashing could help. Otherwise, there's not much you can do. You can't compare something without looking at it!

Right, but if it's possible to create, say, a 20-character hash out of a 500-character string it'd certainly be faster to compare the smaller strings.

Every little bit helps, I guess. The hashing can be done ahead of time so there's no performance hit during the time-critical portion of the program.

Well, I suppose you could do something like this, but now you have a time versus space tradeoff. If you use SHA-256 (just one of many hash algorithms), you'll spend an extra 32 bytes storing each hash, which may or may not matter to you.

If the hash isn't significantly smaller than the source string then there's not much point to this idea.

An alternative to using a SHA implementation (or some other cryptographic hash algorithm) would be to throw both strings into a Perl hash as keys and check how many keys you end up with.

Can you elaborate?

Basically what I have right now is two files, each of which has several hundred long strings, one on each line. I need to determine if any string in file 1 matches any string in file 2 in the fastest manner possible. Right now I do this. Better ideas are appreciated.
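In outline, it's something like this (a Python sketch of the task, not my actual script; the function name and the minimum-length cutoff are invented):

```python
# Rough sketch of the task: compare every line of one file against
# every line of the other with plain string equality ('eq' in Perl),
# skipping very short or blank lines.
def any_match(lines_a, lines_b, min_len=10):
    for a in lines_a:
        for b in lines_b:
            if len(a) >= min_len and len(b) >= min_len and a == b:
                return True
    return False

print(any_match(["x" * 20, "y" * 20], ["z" * 20, "y" * 20]))  # prints True
```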

 

KingNothing

Diamond Member
Apr 6, 2002
Just by way of explanation, I'm doing this for several hundred files. This loop is contained in an outer loop that generates all the 2-combinations of the set of files. The length check is a hack to make sure blank or very short lines are ignored.
 

Barnaby W. Füi

Elite Member
Aug 14, 2001
Originally posted by: KingNothing

Right, but if it's possible to create, say, a 20-character hash out of a 500-character string it'd certainly be faster to compare the smaller strings.

Not if you're just doing a single comparison, because then you'd have to loop through each string once and, on top of that, compare the hashes. Without the hashes you'd just compare the strings, which doesn't even require looping through the entire strings unless the differences are right at the end.

Every little bit helps, I guess. The hashing can be done ahead of time so there's no performance hit during the time-critical portion of the program.
That is something I was also going to mention, but I wasn't sure if you had distinct "startup" and "run" times, or if it was just a script that did everything in one shot and then exited.
 

KingNothing

Diamond Member
Apr 6, 2002
7,141
1
0
Originally posted by: BingBongWongFooey
Originally posted by: KingNothing

Right, but if it's possible to create, say, a 20-character hash out of a 500-character string it'd certainly be faster to compare the smaller strings.

Not if you're just doing a single comparison, because then you'd have to loop through each string once and, on top of that, compare the hashes. Without the hashes you'd just compare the strings, which doesn't even require looping through the entire strings unless the differences are right at the end.

Every little bit helps, I guess. The hashing can be done ahead of time so there's no performance hit during the time-critical portion of the program.
That is something I was also going to mention, but I wasn't sure if you had distinct "startup" and "run" times, or if it was just a script that did everything in one shot and then exited.

I do have distinct startup and run times. I think what you're saying in the first paragraph assumes I do not, or am I reading it wrong?
 

Barnaby W. Füi

Elite Member
Aug 14, 2001
Yes, you're right. With hashes you'd technically be doing more computation overall, but you could get the long string inspections out of the way at the beginning and only compare smaller hashes at runtime.

And I dunno about SHA and all that crypto business, but in Python, hashes are just normal 32-bit ints. That's what Python dicts (hash tables) use for their keys. Doesn't get much faster than comparing ints!

>>> hash('hey guy!')
498234720
>>> hash('foo bar')
-584096744
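A quick sketch of how that plays out with a startup/run split (the lists here are made up; note that on modern Python, hash() of a str is word-sized and randomized per process, but equal strings always hash equal within one run):

```python
# Startup time: compute a cheap integer fingerprint for every string.
strings_a = ["alpha" * 100, "beta" * 100]
strings_b = ["beta" * 100, "gamma" * 100]

pre_a = [(hash(s), s) for s in strings_a]
pre_b = [(hash(s), s) for s in strings_b]

# Run time: compare the ints first, and only fall back to the full
# O(n) string comparison when the fingerprints agree.
matches = [sa for ha, sa in pre_a
           for hb, sb in pre_b
           if ha == hb and sa == sb]
print(matches == ["beta" * 100])   # prints True
```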
 

KingNothing

Diamond Member
Apr 6, 2002
Originally posted by: BingBongWongFooey
Yes, you're right. With hashes you'd technically be doing more computation overall, but you could get the long string inspections out of the way at the beginning and only compare smaller hashes at runtime.

And I dunno about SHA and all that crypto business, but in Python, hashes are just normal 32-bit ints. That's what Python dicts (hash tables) use for their keys. Doesn't get much faster than comparing ints!

>>> hash('hey guy!')
498234720
>>> hash('foo bar')
-584096744

Indeed. That's exactly what I'm looking for. Any idea how to do that in Perl? :D Googling for "perl hash function" turns up a lot of stuff on the hash datatype, which isn't what I want.
 

Barnaby W. Füi

Elite Member
Aug 14, 2001
Yeah, I found that as well, and wrote it off for that same reason.

This is how Python does it, where p is a pointer to the char array, and x is the hash itself (a long):

x = *p << 7;
while (--len >= 0)
    x = (1000003*x) ^ *p++;
x ^= a->ob_size;
if (x == -1)
    x = -2;
 

KingNothing

Diamond Member
Apr 6, 2002
Originally posted by: BingBongWongFooey
Yeah, I found that as well, and wrote it off for that same reason.

This is how Python does it, where p is a pointer to the char array, and x is the hash itself (a long):

x = *p << 7;
while (--len >= 0)
    x = (1000003*x) ^ *p++;
x ^= a->ob_size;
if (x == -1)
    x = -2;

^ = xor, right?
What is a->ob_size? I don't see an a variable anywhere else in the code.
 

Barnaby W. Füi

Elite Member
Aug 14, 2001
a is the PyStringObject, and ob_size is the string length. 'len' is simply a copy of ob_size, so that it can be decremented.
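If it helps, here's a rough Python transcription of that C loop (the 32-bit mask stands in for C long overflow, so this is a sketch of the algorithm, not a byte-for-byte reproduction of Python's own hash()):

```python
def old_py_string_hash(s, mask=0xFFFFFFFF):
    # Seed from the first character, then multiply-and-XOR across
    # the whole string, and finally mix in the length (ob_size in
    # the C version). The mask emulates fixed-width C overflow.
    if not s:
        return 0
    x = (ord(s[0]) << 7) & mask
    for ch in s:
        x = ((1000003 * x) ^ ord(ch)) & mask
    x ^= len(s)
    if x == mask:       # CPython reserves -1 as an error value,
        x = mask - 1    # so the C code remaps -1 to -2
    return x
```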
 

KingNothing

Diamond Member
Apr 6, 2002
Originally posted by: BingBongWongFooey
a is the PyStringObject, and ob_size is the string length. 'len' is simply a copy of ob_size, so that it can be decremented.

Gotcha. Looks like my source strings are too long for that one to be effective, unfortunately; I'm getting too many spurious matches. Assuming hashes are even a viable way of doing this... but if they are, it looks like I'll need an algorithm that generates a hash string, not just an integer. I'll play with it some more later; maybe I can shorten the string length and concatenate the hashes of the pieces together or something.
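For the "hash string" idea, a cryptographic digest is one option; Perl's Digest::MD5 and Digest::SHA modules produce exactly that kind of longer digest. A Python sketch of the same idea (strings made up):

```python
import hashlib

# A 256-bit digest makes accidental collisions between distinct
# long strings vanishingly unlikely, so comparing the 64-character
# hex digests is a practical stand-in for comparing the strings.
def digest(s):
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

a = "some long line " * 40
b = "some long line " * 40
c = "a different long line " * 40

print(digest(a) == digest(b))   # prints True: equal strings
print(digest(a) == digest(c))   # prints False
```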

Thanks for all your help!
 

notfred

Lifer
Feb 12, 2001
I don't think hashing your values is really going to help much, since well over 90% of your strings will fail to match on the first character.

A 1300-character string starting with "a" will fail to match a 1300-character string starting with "b" in just one comparison, not 1300. You'd only get a significant benefit from hashing on strings that actually match.

If you REALLY want to do the hashing, you can use crypt, but I doubt it's going to help much.

You should really think about building search trees out of each file instead. The comparisons would be much faster.
 

notfred

Lifer
Feb 12, 2001
Here's another idea involving hashing that might be more useful: Actually build a hash (the data structure) to check these things...

This will be memory-intensive, but it'll be fast.
If you need to keep track of which files each string belongs to, use arrays as the values in the hash.
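The same idea in Python (a dict standing in for the Perl hash; the function name and file layout are invented for the example):

```python
from collections import defaultdict

# One pass per file: map each line to the list of files it appears
# in. Any line whose list names more than one file is a cross-file
# match, and no file is ever scanned more than once.
def index_files(paths):
    seen = defaultdict(list)
    for path in paths:
        with open(path) as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    seen[line].append(path)
    return {s: ps for s, ps in seen.items() if len(ps) > 1}
```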
 

statik213

Golden Member
Oct 31, 2004
Originally posted by: notfred
Here's another idea involving hashing that might be more useful: Actually build a hash (the data structure) to check these things...

This will be memory-intensive, but it'll be fast.
If you need to keep track of which files each string belongs to, use arrays as the values in the hash.

But... two different strings (especially long ones) can hash to the same value, so you'll need an actual string comparison to determine whether the strings are really the same. Most hash functions are not one-to-one.
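A toy example makes the point (the deliberately bad one-character hash is invented for illustration):

```python
# A deliberately terrible hash: just the first character. The two
# hashes collide even though the strings differ, so a matching hash
# only says "maybe equal"; the full comparison gives the real answer.
def toy_hash(s):
    return s[0]

a, b = "apple pie", "artichoke"
print(toy_hash(a) == toy_hash(b))   # prints True: hashes collide
print(a == b)                       # prints False: strings differ
```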
 

notfred

Lifer
Feb 12, 2001
Originally posted by: statik213
Originally posted by: notfred
Here's another idea involving hashing that might be more useful: Actually build a hash (the data structure) to check these things...

This will be memory-intensive, but it'll be fast.
If you need to keep track of which files each string belongs to, use arrays as the values in the hash.

But... two different strings (especially long ones) can hash to the same value, so you'll need an actual string comparison to determine whether the strings are really the same. Most hash functions are not one-to-one.

1) You don't understand the point of hashing.
2) You don't understand what a hash (the data structure) is. Perl hashes already handle collisions internally: looking up a key compares the actual strings that landed in the same bucket, so you still get exact matches without writing the comparison yourself.