Two questions re: string comparisons in Perl and hashing

KingNothing

Diamond Member
Apr 6, 2002
Question the first:

Is there a more efficient way to see if two strings are equal than $x eq $y?

Question the second:

It seems that the smaller two strings are, the faster the comparison would be. Is it feasible to hash two large (>500 character) strings and compare the hashes to determine if the original strings were equal? What would be a good algorithm for doing this?

Thanks in advance.
 

UCJefe

Senior member
Jan 27, 2000
Originally posted by: KingNothing
Question the first:

Is there a more efficient way to see if two strings are equal than $x eq $y?

I doubt it. That's probably the most efficient method built into Perl.

Are you really doing something where performance is this critical? You're talking about saving nanoseconds here; with strings of that size, maybe a millisecond or two.

Question the second:

It seems that the smaller two strings are, the faster the comparison would be. Is it feasible to hash two large (>500 character) strings and compare the hashes to determine if the original strings were equal? What would be a good algorithm for doing this?

Well, I suppose you could do something like this, but now you have a time versus space tradeoff. If you use SHA-256 (just one of many hash algorithms), you'll spend an extra 32 bytes storing each hash, which may or may not matter to you. You also have the extra computation of creating the hash, which, by the way, has to scan through every character of the string anyway. Unless you have a scenario where time isn't critical when the strings are created but is when you compare them, I don't know that a hash comparison would buy you anything.

An alternative to using a SHA implementation (or some other cryptographic hash algorithm) would be to throw both strings into a Perl hash as keys and check how many keys you end up with. Again, Perl has to scan the strings to hash them. You're not going to do better than O(n): whatever method you use has to go through the whole string, either while comparing or while hashing.

All in all, it's really quite pointless to try to beat the performance of Perl's built-in operators. People a lot smarter than either of us spent a long time making sure they run as fast as possible.

If you really need to ensure that two strings are equivalent character by character, use 'eq'.
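In Python terms the key-count trick looks like this (a dict works the same way as a Perl hash here; the variable names are made up for illustration):

```python
# The key-count trick: use both strings as keys in a dict (Python's
# equivalent of a Perl hash). Equal strings collapse into a single
# key, so ending up with one key means they matched.
x = "some long string " * 30
y = "some long string " * 30
print(len({x: 1, y: 1}) == 1)   # prints True, since x equals y
```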

 

Barnaby W. Füi

Elite Member
Aug 14, 2001
If you need to do the comparison multiple times, hashing could help. Otherwise, there's not much you can do. You can't compare something without looking at it!
 

KingNothing

Diamond Member
Apr 6, 2002
Originally posted by: BingBongWongFooey
If you need to do the comparison multiple times, hashing could help. Otherwise, there's not much you can do. You can't compare something without looking at it!

Right, but if it's possible to create, say, a 20-character hash out of a 500-character string it'd certainly be faster to compare the smaller strings.

Every little bit helps, I guess. The hashing can be done ahead of time so there's no performance hit during the time-critical portion of the program.

Well, I suppose you could do something like this, but now you have a time versus space tradeoff. If you use SHA-256 (just one of many hash algorithms), you'll spend an extra 32 bytes storing each hash, which may or may not matter to you.

If the hash isn't significantly smaller than the source string then there's not much point to this idea.

An alternative to using a SHA implementation (or some other cryptographic hash algorithm) would be to throw both strings into a Perl hash as keys and check how many keys you end up with.

Can you elaborate?

Basically what I have right now is two files, each of which has several hundred long strings, one on each line. I need to determine if any string in file 1 matches any string in file 2 in the fastest manner possible. Right now I do this. Better ideas are appreciated.
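In outline, it's something like this (a Python sketch of the task, not my actual script; the function name and the minimum-length cutoff are invented):

```python
# Rough sketch of the task: compare every line of one file against
# every line of the other with plain string equality ('eq' in Perl),
# skipping very short or blank lines.
def any_match(lines_a, lines_b, min_len=10):
    for a in lines_a:
        for b in lines_b:
            if len(a) >= min_len and len(b) >= min_len and a == b:
                return True
    return False

print(any_match(["x" * 20, "y" * 20], ["z" * 20, "y" * 20]))  # prints True
```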

 

KingNothing

Diamond Member
Apr 6, 2002
Just by way of explanation, I'm doing this for several hundred files. This loop is contained in an outer loop that generates all the 2-combinations of the set of files. The length check is a hack to make sure blank or very short lines are ignored.
 

Barnaby W. Füi

Elite Member
Aug 14, 2001
Originally posted by: KingNothing

Right, but if it's possible to create, say, a 20-character hash out of a 500-character string it'd certainly be faster to compare the smaller strings.

Not if you're just doing a single comparison, because then you'd have to loop through each string once and, on top of that, compare the hashes. Without the hashes you'd just compare the strings, which doesn't even require looping through the entire strings unless the differences are right at the end.

Every little bit helps, I guess. The hashing can be done ahead of time so there's no performance hit during the time-critical portion of the program.
That is something I was also going to mention, but I wasn't sure if you had distinct "startup" and "run" times, or if it was just a script that did everything in one shot and then exited.
 

KingNothing

Diamond Member
Apr 6, 2002
7,141
1
0
Originally posted by: BingBongWongFooey
Originally posted by: KingNothing

Right, but if it's possible to create, say, a 20-character hash out of a 500-character string it'd certainly be faster to compare the smaller strings.

Not if you're just doing a single comparison, because then you'd have to loop through each string once and, on top of that, compare the hashes. Without the hashes you'd just compare the strings, which doesn't even require looping through the entire strings unless the differences are right at the end.

Every little bit helps, I guess. The hashing can be done ahead of time so there's no performance hit during the time-critical portion of the program.
That is something I was also going to mention, but I wasn't sure if you had distinct "startup" and "run" times, or if it was just a script that did everything in one shot and then exited.

I do have distinct startup and run times. I think what you're saying in the first paragraph assumes I do not, or am I reading it wrong?
 

Barnaby W. Füi

Elite Member
Aug 14, 2001
Yes, you're right. With hashes you'd technically be doing more computation overall, but you could get the long string inspections out of the way at the beginning and only compare smaller hashes at runtime.

And I dunno about SHA and all that crypto business, but in Python, hashes are just normal 32-bit ints. That's what Python dicts (hash tables) use for their keys. Doesn't get much faster than comparing ints!

>>> hash('hey guy!')
498234720
>>> hash('foo bar')
-584096744
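A quick sketch of how that plays out with a startup/run split (the lists here are made up; note that on modern Python, hash() of a str is word-sized and randomized per process, but equal strings always hash equal within one run):

```python
# Startup time: compute a cheap integer fingerprint for every string.
strings_a = ["alpha" * 100, "beta" * 100]
strings_b = ["beta" * 100, "gamma" * 100]

pre_a = [(hash(s), s) for s in strings_a]
pre_b = [(hash(s), s) for s in strings_b]

# Run time: compare the ints first, and only fall back to the full
# O(n) string comparison when the fingerprints agree.
matches = [sa for ha, sa in pre_a
           for hb, sb in pre_b
           if ha == hb and sa == sb]
print(matches == ["beta" * 100])   # prints True
```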
 

KingNothing

Diamond Member
Apr 6, 2002
Originally posted by: BingBongWongFooey
Yes, you're right. With hashes you'd technically be doing more computation overall, but you could get the long string inspections out of the way at the beginning and only compare smaller hashes at runtime.

And I dunno about SHA and all that crypto business, but in Python, hashes are just normal 32-bit ints. That's what Python dicts (hash tables) use for their keys. Doesn't get much faster than comparing ints!

>>> hash('hey guy!')
498234720
>>> hash('foo bar')
-584096744

Indeed. That's exactly what I'm looking for. Any idea how to do that in Perl? :D Googling for "perl hash function" turns up a lot of stuff on the hash datatype, which isn't what I want.
 

Barnaby W. Füi

Elite Member
Aug 14, 2001
Yeah, I found that as well, and wrote it off for that same reason.

This is how Python does it, where p is a pointer to the char array, and x is the hash itself (a long):

x = *p << 7;
while (--len >= 0)
    x = (1000003*x) ^ *p++;
x ^= a->ob_size;
if (x == -1)
    x = -2;
 

KingNothing

Diamond Member
Apr 6, 2002
Originally posted by: BingBongWongFooey
Yeah, I found that as well, and wrote it off for that same reason.

This is how Python does it, where p is a pointer to the char array, and x is the hash itself (a long):

x = *p << 7;
while (--len >= 0)
    x = (1000003*x) ^ *p++;
x ^= a->ob_size;
if (x == -1)
    x = -2;

^ = xor, right?
What is a->ob_size? I don't see an a variable anywhere else in the code.
 

Barnaby W. Füi

Elite Member
Aug 14, 2001
a is the PyStringObject, and ob_size is the string length. 'len' is simply a copy of ob_size, so that it can be decremented.
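If it helps, here's a rough Python transcription of that C loop (the 32-bit mask stands in for C long overflow, so this is a sketch of the algorithm, not a byte-for-byte reproduction of Python's own hash()):

```python
def old_py_string_hash(s, mask=0xFFFFFFFF):
    # Seed from the first character, then multiply-and-XOR across
    # the whole string, and finally mix in the length (ob_size in
    # the C version). The mask emulates fixed-width C overflow.
    if not s:
        return 0
    x = (ord(s[0]) << 7) & mask
    for ch in s:
        x = ((1000003 * x) ^ ord(ch)) & mask
    x ^= len(s)
    if x == mask:       # CPython reserves -1 as an error value,
        x = mask - 1    # so the C code remaps -1 to -2
    return x
```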
 

KingNothing

Diamond Member
Apr 6, 2002
Originally posted by: BingBongWongFooey
a is the PyStringObject, and ob_size is the string length. 'len' is simply a copy of ob_size, so that it can be decremented.

Gotcha. Looks like my source strings are too long for that one to be effective, unfortunately; I'm getting too many spurious matches. Assuming hashes are even a viable way of doing this... but if they are, it looks like I'll need an algorithm that generates a hash string, not just an integer. I'll play with it some more later; maybe I can shorten the string length and concatenate the hashes of the pieces together or something.
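For the "hash string" idea, a cryptographic digest is one option; Perl's Digest::MD5 and Digest::SHA modules produce exactly that kind of longer digest. A Python sketch of the same idea (strings made up):

```python
import hashlib

# A 256-bit digest makes accidental collisions between distinct
# long strings vanishingly unlikely, so comparing the 64-character
# hex digests is a practical stand-in for comparing the strings.
def digest(s):
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

a = "some long line " * 40
b = "some long line " * 40
c = "a different long line " * 40

print(digest(a) == digest(b))   # prints True: equal strings
print(digest(a) == digest(c))   # prints False
```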

Thanks for all your help!
 

notfred

Lifer
Feb 12, 2001
I don't think hashing your values is really going to help much, since well over 90% of your strings will fail to match on the first character.

A 1300-character string starting with "a" will fail to match a 1300-character string starting with "b" in just one comparison, not 1300. You'd only get a significant benefit from hashing on strings that actually match.

If you REALLY want to do the hashing, you can use crypt, but I doubt it's going to help much.

You should really think about building search trees out of each file instead. The comparisons would be much faster.
 

notfred

Lifer
Feb 12, 2001
Here's another idea involving hashing that might be more useful: Actually build a hash (the data structure) to check these things...

This will be memory-intensive, but it'll be fast.
If you need to keep track of which files each string belongs to, use arrays as the values in the hash.
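The same idea in Python (a dict standing in for the Perl hash; the function name and file layout are invented for the example):

```python
from collections import defaultdict

# One pass per file: map each line to the list of files it appears
# in. Any line whose list names more than one file is a cross-file
# match, and no file is ever scanned more than once.
def index_files(paths):
    seen = defaultdict(list)
    for path in paths:
        with open(path) as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    seen[line].append(path)
    return {s: ps for s, ps in seen.items() if len(ps) > 1}
```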
 

statik213

Golden Member
Oct 31, 2004
Originally posted by: notfred
Here's another idea involving hashing that might be more useful: Actually build a hash (the data structure) to check these things...

This will be memory-intensive, but it'll be fast.
If you need to keep track of which files each string belongs to, use arrays as the values in the hash.

But... two different strings (especially long ones) can hash to the same value, so you'll need an actual string comparison to determine whether the strings are really the same. Most hash functions are not one-to-one.
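A toy example makes the point (the deliberately bad one-character hash is invented for illustration):

```python
# A deliberately terrible hash: just the first character. The two
# hashes collide even though the strings differ, so a matching hash
# only says "maybe equal"; the full comparison gives the real answer.
def toy_hash(s):
    return s[0]

a, b = "apple pie", "artichoke"
print(toy_hash(a) == toy_hash(b))   # prints True: hashes collide
print(a == b)                       # prints False: strings differ
```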
 

notfred

Lifer
Feb 12, 2001
Originally posted by: statik213
Originally posted by: notfred
Here's another idea involving hashing that might be more useful: Actually build a hash (the data structure) to check these things...

This will be memory-intensive, but it'll be fast.
If you need to keep track of which files each string belongs to, use arrays as the values in the hash.

But... two different strings (especially long ones) can hash to the same value, so you'll need an actual string comparison to determine whether the strings are really the same. Most hash functions are not one-to-one.

1) You don't understand the point of hashing.
2) You don't understand what a hash (the data structure) is. Perl hashes already handle collisions internally: looking up a key compares the actual strings that landed in the same bucket, so you still get exact matches without writing the comparison yourself.