Playing with regular expressions. A little help.

Locut0s · Jul 15, 2009

So here's what I'm trying to do. I did a raw dump of a dir command to a txt file and I want to edit out everything but the file names from the txt file. Here is an example of what the txt looks like:

10/07/2009 03:27 AM 82,779 007_-_a_view_to_kill.rar
10/07/2009 03:27 AM 101,585 007_-_goldfinger.rar
10/07/2009 03:27 AM 221,073 007_licence_to_kill.rar
11/08/2005 05:38 PM 316,779 1 To Nil Soccer Manager (1992)(Wizard Games Of Scotland Ltd).zip
11/08/2005 05:38 PM 1,203,348 1000 Miglia (1991)(Simulmondo).zip
11/08/2005 05:38 PM 100,283 100000 Pyramid (1988)(Basada).zip
11/08/2005 05:38 PM 30,017 10th Frame (1986)(Access Software Inc).zip
11/08/2005 05:38 PM 645,102 15x15 Picture Puzzle (1996)(Freeware).zip
11/08/2005 05:38 PM 4,129,713 1830 Railroads And Robber Barons (1995)

So I was going to use a regular expression search and replace to strip out everything before the file names. Stupid thing is, I got halfway in before I realized that in the actual txt file the file names all start exactly 39 characters in, so I could have just deleted the first 39 characters of each line (stupid me). Anyway, now that I'm doing this I have a different question. Supposing I want to do it the hard way, like I was doing, what would be a good regular expression to match this first section? What I came up with works but it looks awkward. Is there a better way?

I came up with this:

[:digit:]{2}/[:digit:]{2}/[:digit:]{4}[:space:]{2}[:digit:]{2}:[:digit:]{2}[:space:]{1}(AM|PM)

QUESTION 2

Suppose I do just want to match the first 39 characters of each line. I would think this would do it:

^[:print:]{39}

And it does. However, in Open Office, which is what I'm using, it keeps matching ANY 39-character stretch, so if there are 78 characters on the line it will match both 39-character segments, when all I want is for it to pick up ONLY the first 39 characters of each line and nothing after that.

Example: if I wanted to match the first 4 chars instead of the first 39, I would want this:

[asdf]jhasdkfjhasdkjfhqw

Instead I'm getting this:

[asdf][jhas][dkfj][hasd][kjfh]qw
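
For comparison, here is a minimal Python sketch of the two behaviours on the 4-character example. It is not OpenOffice's engine; Python's `re.sub` is assumed only as a stand-in, and the variable names are illustrative. Anchoring with `^` and allowing a single replacement gives the wanted result, while an unanchored pattern applied across the whole line reproduces the unwanted one.

```python
import re

line = "asdfjhasdkfjhasdkjfhqw"

# Anchored at the start of the line, one replacement only.
wanted = re.sub(r"^(.{4})", r"[\1]", line, count=1)

# Unanchored: every consecutive 4-character run is matched in turn.
unwanted = re.sub(r"(.{4})", r"[\1]", line)

print(wanted)
print(unwanted)
```

Whether OpenOffice gets to the second result by re-searching the edited line or by some other route is not something this sketch can show; it only illustrates the difference between one anchored match and repeated matches.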

Ken g6 · Jul 15, 2009

1. Your old regular expression does well until near the end. {1} is redundant, and (AM|PM) could be better done with [AP]M. I personally do [0-9] instead of [:digit:]; but I haven't done Open Office regular expressions.

2. ^.{39} ought to work. Your expression should, too, unless [:print:] does something really weird. It might be a search-and-replace bug in OOo, where the replace is done, and the line is searched again.
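
As a rough illustration of point 1, the tightened pattern might look like this in Python (an assumption; the thread itself uses OpenOffice's find-and-replace syntax). `\s+` stands in for the fixed space counts because the exact column spacing of the real file is not shown here, and `PREFIX` is a name made up for this sketch.

```python
import re

# Date, whitespace, time, one whitespace character, then AM or PM via [AP]M.
PREFIX = re.compile(r"^[0-9]{2}/[0-9]{2}/[0-9]{4}\s+[0-9]{2}:[0-9]{2}\s[AP]M")

sample = "11/08/2005 05:38 PM 30,017 10th Frame (1986)(Access Software Inc).zip"
match = PREFIX.match(sample)
matched_text = match.group(0) if match else None
print(matched_text)
```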

Locut0s · Jul 16, 2009

Thanks for the help!

1. I thought the {1} should be redundant as well, but I couldn't get it to match unless I added it. Thanks, I'll change it to [0-9] as that is more literal than [:digit:], although [:digit:] is part of the POSIX character class standard.

2. Yes, OOo is indeed doing the replace and then searching the same line again. I don't know if this is a bug or a deliberate design choice?

esun · Jul 16, 2009

Although your expression isn't too far off, it's needlessly complicated. Generally speaking I try to find the minimum unique pattern to search for. For example, if I wanted just the filenames from that file, I would do something like this (done in Perl, but it should be easy to see what I mean):

($filename) = ($line =~ /M [\d,]+ (.*)/);

That is, match the capital M in AM/PM, the following space, the number possibly with commas, the following space, then capture the filename, which is everything after all of those things. Note that the regex engine takes the leftmost match, so the portion before the capture matches the first instance of that pattern, and the greedy .* then grabs everything up to the end of the line. That is the desired behavior (since otherwise a filename containing a pattern like that would cause problems).
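
The same capture can be sketched in Python, assuming its `re` module as a stand-in for Perl (the helper name `extract_filename` and the second file name are hypothetical). `re.search` returns the leftmost match, so a file name that itself contains something like "M 2 " should not confuse it.

```python
import re

def extract_filename(line):
    """Return the text after 'M <size> ', or None if the line does not match."""
    m = re.search(r"M [\d,]+ (.*)", line)
    return m.group(1) if m else None

print(extract_filename("11/08/2005 05:38 PM 1,203,348 1000 Miglia (1991)(Simulmondo).zip"))
# Hypothetical file name that contains a similar 'M <digits> ' pattern.
print(extract_filename("10/07/2009 03:27 AM 82,779 BOOM 2 demo.rar"))
```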

If you wanted to do it via replacement rather than matching, it would be even easier:

$filename =~ s/^.*?M [\d,]+ //;

That is, replace any number of characters (non-greedy) from the start of a line followed by an M, a space, a number (with commas), and another space with nothing.
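
A Python counterpart of the substitution, again only a sketch with `re` assumed in place of Perl and two sample lines taken from the listing in the first post. `count=1` limits it to one replacement per line.

```python
import re

listing = """10/07/2009 03:27 AM 82,779 007_-_a_view_to_kill.rar
11/08/2005 05:38 PM 100,283 100000 Pyramid (1988)(Basada).zip"""

# Lazily skip to the first 'M <size> ' on each line and delete through it.
names = [re.sub(r"^.*?M [\d,]+ ", "", line, count=1) for line in listing.splitlines()]
print(names)
```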

Regarding your second question, /^.{39}/ should work.

Locut0s · Jul 16, 2009

Thanks! Will go over that a bit. As for the reason the 2nd question doesn't work, it's because Open Office is performing the search on the line again after it has edited it. So it's running the same line through the search and replace multiple times until the regex doesn't find anything more before going on to the next line.

Edit: OK, I think I get those, thanks! That's exactly what I was aiming for. I think this can be done in Open Office, though the way OO does regular expressions it wouldn't look anything like that.

esun · Jul 16, 2009

BTW, while you're learning regexes (and even after you know them well but are building complicated ones), you may find a tool like this helpful:

http://gskinner.com/RegExr/

Basically it lets you build up a regex and see how it matches on some sample text that you can paste in.

Locut0s · Jul 16, 2009

Thanks!

statik213 · Jul 16, 2009

Just as an alternative: on Windows, you can do "dir /b" to get just a list of file names. Adding /s will cause it to recurse and print full paths, and /a will include hidden and system files. Most of the other options (see dir /?) will work as well.
