regex question

Schrodinger

Golden Member
Nov 4, 2004
1,274
0
0
Hey

Well, I'm writing a Bencoding class in Java for my own BitTorrent client from scratch (just for fun; I know there are a billion of them :p). I'm trying to decode strings and I lack the regex skills to do the check.

Basically, a string is encoded into the message as a base-ten length prefix, a separating colon, and then the string itself (followed by the rest of the bencoded message).

8:doorknobxyz

where I want to check that the prefixed length is backed by enough characters (8 characters in doorknob; the trailing xyz represents the rest of the bencoded message, which is ignored at this point). So far I've come up with:

^([0-9]+):.{}$

I don't know what expression to put inside the {} though. IIRC (it has been years) in Perl you could reference the groupings using $1, $2 ... $9, but that was a feature of the language itself. Is there any way I can take the grouping value (in this case it's a base-ten length value) and put it into the { }?

thanks ;)
 

Barnaby W. Füi

Elite Member
Aug 14, 2001
12,343
0
0
I don't think you can do this with a single regex. Even the $1-style variables you mentioned in Perl are (afaik) only usable once the match has happened, e.g. in the replacement side of a substitution: s/([0-9]+)/the number is $1/
 

Schrodinger

oh well thanks anyhow

I'm going to look into Java's regular expression support more. The only thing I've used so far is the matches() method from the String class.
 

Barnaby W. Füi

You're using $ as if the end of the string is detectable -- if it is, then why do you need to know the length? ^([0-9]+):.*$
 

Schrodinger

Originally posted by: BingBongWongFooey
You're using $ as if the end of the string is detectable -- if it is, then why do you need to know the length? ^([0-9]+):.*$

Whoops. This is just my test regex; the real one will have .* on the end, so I shouldn't even have the $. The ^ can stay, though, because each time I extract a bencoded value (string, int, list or dictionary) I take it off the front and call decode() again on what's left (recursively, for lists and dictionaries). The $ was a mistake.

Basically I'm just using the regex to see that the prefixed digits represent a valid length and that there are enough characters for it after the colon. It's a preliminary check, and then I will go into the Java code and extract the substring value.
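
For reference, the "take a value off the front and call decode() again" idea described above could be sketched roughly like this. It is only a minimal illustration, not the poster's actual class: the name BencodeSketch and its helper methods are made up here, it tracks a position index instead of literally chopping the string, and it assumes the message is held in a Java String with one char per byte (real bencoded strings are byte strings, so a real client would more likely work on a byte[]).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; class and method names are not from the thread.
final class BencodeSketch {
    private final String text; // assumption: one char per byte of the message
    private int pos = 0;       // index of the next unread character

    private BencodeSketch(String text) {
        this.text = text;
    }

    /** Decodes the first bencoded value in text; anything after it is ignored. */
    static Object decode(String text) {
        return new BencodeSketch(text).next();
    }

    private Object next() {
        char c = peek();
        if (c == 'i') {                        // integer: i<digits>e
            pos++;
            int end = text.indexOf('e', pos);
            if (end < 0) throw new IllegalArgumentException("unterminated integer");
            long value = Long.parseLong(text.substring(pos, end));
            pos = end + 1;
            return value;
        }
        if (c == 'l') {                        // list: l<value>...e
            pos++;
            List<Object> list = new ArrayList<>();
            while (peek() != 'e') list.add(next());   // recursive call per element
            pos++;
            return list;
        }
        if (c == 'd') {                        // dictionary: d<key><value>...e
            pos++;
            Map<String, Object> map = new LinkedHashMap<>();
            while (peek() != 'e') {
                String key = readString();     // keys are always strings
                map.put(key, next());
            }
            pos++;
            return map;
        }
        return readString();                   // otherwise expect <length>:<chars>
    }

    private String readString() {
        int colon = text.indexOf(':', pos);
        if (colon < 0) throw new IllegalArgumentException("missing ':'");
        // parseInt throws NumberFormatException if the prefix is not a number
        int length = Integer.parseInt(text.substring(pos, colon));
        int start = colon + 1;
        if (length < 0 || length > text.length() - start) {
            throw new IllegalArgumentException("length prefix does not fit the input");
        }
        pos = start + length;
        return text.substring(start, pos);
    }

    private char peek() {
        if (pos >= text.length()) throw new IllegalArgumentException("unexpected end of input");
        return text.charAt(pos);
    }

    public static void main(String[] args) {
        System.out.println(decode("8:doorknobxyz")); // doorknob
        System.out.println(decode("l4:spami7ee"));   // [spam, 7]
    }
}
```

Note that no regex is involved at all in this version; the length check falls out of the substring bounds test in readString().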
 

Barnaby W. Füi

Aha. Yeah I'm pretty sure it would have to be a multi-step process like:

#pseudo-code
num = replace("([0-9]+):.*", "$1", text);

if(!num.isnumeric())
    return failure;

if(text.length < num.length + 1 + int(num))
    return failure;
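
A fairly literal Java rendering of the pseudo-code above might look like the following. Treat it as a sketch under assumptions: the class and method names are invented for illustration, the (?s) flag is added so that . also matches newlines, and the 9-digit cap on the prefix is an arbitrary guard so that Integer.parseInt cannot overflow.

```java
// Hypothetical helper; names are illustrative, not from the thread.
final class LengthPrefixCheck {

    /** Returns true if text starts with "<len>:" followed by at least len characters. */
    static boolean hasValidStringPrefix(String text) {
        // Step 1: replace the whole text with just the captured digits.
        // If the pattern does not match, replaceFirst returns text unchanged.
        String num = text.replaceFirst("(?s)^([0-9]+):.*$", "$1");

        // Step 2: if what came back is not purely digits, the pattern did not match.
        // The {1,9} cap is an assumption to keep the value inside int range.
        if (!num.matches("[0-9]{1,9}")) {
            return false;
        }

        // Step 3: digits + colon + payload must all fit inside text.
        long needed = num.length() + 1L + Integer.parseInt(num);
        return text.length() >= needed;
    }

    public static void main(String[] args) {
        System.out.println(hasValidStringPrefix("8:doorknobxyz")); // true
        System.out.println(hasValidStringPrefix("8:door"));        // false
    }
}
```

An input made only of digits (no colon) survives step 2 but always fails step 3, because the text would need to be longer than itself.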
 

Schrodinger

The good news is that you can use the Matcher object (java.util.regex package) and its group() method (returns the String value of the captured group).

;)

(just playing with it now...sweet )


Edit: thanks again BingBongWongFooey for giving it a look over
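
A minimal sketch of the Matcher/group() approach mentioned above, assuming the regex only has to recognise the length prefix and the "enough characters" check is done in plain Java. The class name, method name and the 9-digit cap are illustrative choices, not anything from the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative.
final class PrefixMatcherSketch {
    // Only the "<digits>:" prefix is matched by the regex.
    // {1,9} is an assumed cap so Integer.parseInt cannot overflow.
    private static final Pattern PREFIX = Pattern.compile("([0-9]{1,9}):");

    /** Returns the string at the front of text, or null if the check fails. */
    static String leadingString(String text) {
        Matcher m = PREFIX.matcher(text);
        if (!m.lookingAt()) {                       // lookingAt() matches from the start, like ^
            return null;
        }
        int length = Integer.parseInt(m.group(1)); // group(1) is the captured digits
        int start = m.end();                       // index just past the colon
        if (length > text.length() - start) {
            return null;                           // not enough characters after the colon
        }
        return text.substring(start, start + length);
    }

    public static void main(String[] args) {
        System.out.println(leadingString("8:doorknobxyz")); // doorknob
        System.out.println(leadingString("8:door"));        // null
    }
}
```

The remainder of the message starts at index start + length, which is what the next decode() call would be given.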
 

notfred

Lifer
Feb 12, 2001
38,241
4
0
I wrote a torrent decoder in Perl, and even in Perl I didn't use regular expressions. You can do the whole thing by looping through the .torrent file one char at a time, with only a couple of comparisons.
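
To illustrate the one-char-at-a-time idea for the string case only: the sketch below is not notfred's Perl code, just a guess at the same approach written in Java to match the rest of the thread. The class and method names are made up, and it assumes the input is a Java String.

```java
// Hypothetical sketch of a regex-free length-prefix scan; names are illustrative.
final class CharScanSketch {

    /** Returns the string at the front of text, or null if it is malformed. */
    static String leadingString(String text) {
        int i = 0;
        long length = 0;
        // Accumulate the base-ten length one digit at a time.
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            length = length * 10 + (text.charAt(i) - '0');
            if (length > Integer.MAX_VALUE) {
                return null;               // prefix too large to be a Java string length
            }
            i++;
        }
        // There must be at least one digit, and the next char must be the colon.
        if (i == 0 || i >= text.length() || text.charAt(i) != ':') {
            return null;
        }
        int start = i + 1;
        if (length > text.length() - start) {
            return null;                   // fewer characters than the prefix promises
        }
        return text.substring(start, start + (int) length);
    }

    public static void main(String[] args) {
        System.out.println(leadingString("8:doorknobxyz")); // doorknob
    }
}
```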
 

DJFuji

Diamond Member
Oct 18, 1999
3,643
1
76
If you find yourself writing regexes often, google for the MS RegEx Workbench. It's an awesome tool for writing them. ASP.NET also lets you validate text fields in a web form using regexes. Very handy.
 

Schrodinger

Originally posted by: notfred
I wrote a torrent decoder in Perl, and even in Perl I didn't use regular expressions. You can do the whole thing by looping through the .torrent file one char at a time, with only a couple of comparisons.

I'm writing a generic Bencoding class with a decode() method though. I'll probably never need it again, but whatever ;)
 

agnitrate

Diamond Member
Jul 2, 2001
3,761
1
0
I thought the whole limitation of regular expressions is that they cannot 'count'. You can make one count a predefined number, but you can't have the number come from the input and then expect the regex to recognize that many characters, unless you build a DFA for each possible length you would encounter. That was my understanding of DFAs and regular expressions from when I took compilers and we built our scanner.
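
One way to see the "predefined number" point in Java: a {n} quantifier needs a literal count, so the count has to be read out first and a second pattern built at runtime with that number baked in. The sketch below does exactly that. It is illustrative only; the class and method names are made up, and the 9-digit cap on the prefix is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical two-pass sketch; names are illustrative.
final class TwoPassRegexSketch {
    private static final Pattern PREFIX = Pattern.compile("([0-9]{1,9}):");

    /** Returns true if text starts with "<len>:" followed by at least len characters. */
    static boolean startsWithBencodedString(String text) {
        // Pass 1: read the number out of the input.
        Matcher prefix = PREFIX.matcher(text);
        if (!prefix.lookingAt()) {
            return false;
        }
        String digits = prefix.group(1);
        int n = Integer.parseInt(digits);

        // Pass 2: build a pattern with the count as a literal, e.g. "8:.{8}".
        // DOTALL lets '.' match line terminators as well.
        Pattern counted = Pattern.compile(digits + ":.{" + n + "}", Pattern.DOTALL);
        return counted.matcher(text).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(startsWithBencodedString("8:doorknobxyz")); // true
        System.out.println(startsWithBencodedString("8:door"));        // false
    }
}
```

In effect this builds a separate matcher for each length actually encountered, rather than one regex that handles every length.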