Perl expression help *problem solved*

kt

Diamond Member
Apr 1, 2000
6,032
1,348
136
Anyone want to point out what's wrong with this subroutine? I am trying to read in an HTML file and parse it into two sections: the head section and the body section. I am not sure why, but somehow the head section is repeated each time this subroutine is run. For example, if the head section looks something like this:

<head>
<title>my title</title>
</head>

Then what gets returned in $head is:

<head>
<title>my title</title>
</head>
<head>
<title>my title</title>
</head>


Here's the code for the subroutine:

sub load_file {

    my $self = shift;
    my $filename = shift;
    my $productdir = shift;

    if(!open(PRODFH,$productdir."/".$filename)) {
        $self{'error'} = "cannot open $filename $!";
        return -1;
    }

    $text = join '',<PRODFH>;
    close(PRODFH);

    $text =~ /(<\s*HEAD\s*>.*<\s*\/\s*HEAD\s*>)/ism;
    $head = $1;
    $text =~ /<\s*\/\s*HEAD\s*>(.*<\s*\/\s*HTML\s*>)/ism;
    $body = $1;
    $body =~ s/<\s*\/html\s*>//ismg;

    $self{'init'} = 1;
    $self{'head'} = $head;
    $self{'body'} = $body;

    if($body =~ /<!--PROD_DESC-->(.*?)<!--END_PROD_DESC-->/isgm) {
        $self{'proddesc'} = $1;
    }

    return 1;
}
 

Palek

Senior member
Jun 20, 2001
937
0
0
Well, I just ran a slightly modified version of your subroutine (see below) and it returned only a single copy of the header for me. I don't know, maybe we have different versions of perl. One thing's for sure: the "m" option does nothing in these matches, since it only changes where ^ and $ can match and the patterns use neither. Using /is versus /ism made no difference. "s" and "m" can legally be combined ("s" lets . match newlines, "m" affects the anchors), so I doubt the "m" is causing the problem, but removing it is harmless.

#!/usr/local/bin/perl
$infile = $ARGV[0];
open(INFILE, "<$infile") || die "could not open $infile.\n";
$text = join '',<INFILE>;
close(INFILE);

$text=~/(<\s*HEAD\s*>.*<\s*\/\s*HEAD\s*>)/is;
$head = $1;

print $head,"\n";
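
For reference, here is a minimal sketch of what the two modifiers do on a multi-line string like the one in the question. The sample text and variable names are made up for illustration; only the modifier behaviour is the point.

```perl
use strict;
use warnings;

# Sample input (an assumption, modelled on the head section in the question).
my $text = "<head>\n<title>my title</title>\n</head>\n";

# /s lets "." match newlines, so the capture can span several lines.
my ($with_s) = $text =~ /(<head>.*<\/head>)/is;

# Without /s, "." stops at each newline, so this match fails here.
my ($without_s) = $text =~ /(<head>.*<\/head>)/i;

# /m only changes ^ and $ so they match at every line start and end.
my @line_starts = $text =~ /^(<\/?\w+>)/mg;

print defined($with_s)    ? "with /s: matched\n"    : "with /s: no match\n";
print defined($without_s) ? "without /s: matched\n" : "without /s: no match\n";
print "tags at line starts: @line_starts\n";
```

Since the patterns in the thread contain no ^ or $, adding or dropping /m should not change what they match.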
 

stndn

Golden Member
Mar 10, 2001
1,886
0
0
Originally posted by: kt

$text =~ /(<\s*HEAD\s*>.*<\s*\/\s*HEAD\s*>)/ism;
$head = $1;
$text =~ /<\s*\/\s*HEAD\s*>(.*<\s*\/\s*HTML\s*>)/ism;
$body = $1;
$body =~ s/<\s*\/html\s*>//ismg;

is there any reason why you don't match the second expression as <body> to </body> if you want the second section to be body section?

i tried running your subroutine separately (since i don't know what parameters you pass in), and i only got one head and one body section ... maybe it has something to do with the parameters you pass to the function?

and oh, just a few things i thought i'd throw in (not really a fix, just some little things):

sub load_file {
    my ($self, $filename, $productdir) = @_; # is easier than three separate shifts

    $text =~ m|(<\s*HEAD\s*>.*<\s*/\s*HEAD\s*>)|ism;
    # so you don't have to use \/ when referring to a "/" ...
    # nothing major, just makes it a little easier to read :-)
}

not really of much help here, but ... it works at my place, so .... check the call to the function or something...
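
A couple of other things in the original subroutine may be worth checking, sketched below under some assumptions: that $self is meant to be a hash reference (the thread never shows how it is built), and with a hypothetical helper split_html that takes the page as a string so the example runs without a file. In the original, $self{'head'} writes to a package hash %self rather than into $self, and $text, $head and $body are undeclared globals.

```perl
use strict;
use warnings;

# Hypothetical helper: same splitting logic as load_file, but working on a
# string so the example is self-contained.
sub split_html {
    my ($self, $text) = @_;    # lexical variables, fresh on every call

    # Only use $1 when the match succeeded; after a failed match it may
    # still hold a capture from an earlier successful match.
    if ($text =~ m|(<\s*head\s*>.*<\s*/\s*head\s*>)|is) {
        $self->{head} = $1;    # $self->{...} stores into the object itself
    }
    if ($text =~ m|<\s*/\s*head\s*>(.*)<\s*/\s*html\s*>|is) {
        $self->{body} = $1;
    }
    return (defined($self->{head}) && defined($self->{body})) ? 1 : -1;
}

my $page = {};    # assumed: the object is a plain hash reference
my $ok = split_html($page, "<html><head><title>t</title></head><body>hi</body></html>");
print "$ok\n$page->{head}\n$page->{body}\n";
```

With use strict in place, the $self{'error'} form from the original would be rejected at compile time, which makes this kind of slip easier to spot.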
 

kt


Originally posted by: stndn
is there any reason why you don't match the second expression as <body> to </body> if you want the second section to be body section?

What do you mean? The second expression takes everything in between </head> and </html>, which is where <body> and </body> usually are. So $body will contain the whole body, including the <body> and </body> tags.

Originally posted by: stndn
i tried running your subroutine separately (since i don't know what the parameters you passed into are), and i only got one head and one body section ... maybe it has something to do with the parameters you pass to the function?

I think there may be something wrong with the webserver machine. Several of my friends said the same thing: only one head and one body section. There's nothing special about the parameters I pass it, just the filename.
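
One thing that might be worth ruling out, purely as a guess: the greedy .* in the head pattern runs to the last </head> in the text. If the text that reaches the match ever contains two head sections back to back (the sample below is constructed that way on purpose; nothing in the thread confirms this is what happens on the server), the capture holds both, while a non-greedy .*? stops at the first closing tag.

```perl
use strict;
use warnings;

# Constructed input with a doubled head section (an assumption for the demo).
my $one  = "<head>\n<title>my title</title>\n</head>\n";
my $text = "<html>\n" . $one . $one . "<body>\n</body>\n</html>\n";

# Greedy: ".*" extends to the LAST </head> in the text.
my ($greedy) = $text =~ m|(<\s*head\s*>.*<\s*/\s*head\s*>)|is;

# Non-greedy: ".*?" stops at the FIRST </head>.
my ($lazy) = $text =~ m|(<\s*head\s*>.*?<\s*/\s*head\s*>)|is;

# Count the opening <head> tags inside each capture.
my $greedy_count = () = $greedy =~ m|<\s*head\s*>|ig;
my $lazy_count   = () = $lazy   =~ m|<\s*head\s*>|ig;

print "greedy captured $greedy_count head block(s)\n";
print "non-greedy captured $lazy_count head block(s)\n";
```

Printing $text just before the match on the server would show whether the input itself is already doubled.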
 

stndn

is there any reason why you don't match the second expression as <body> to </body> if you want the second section to be body section?
What do you mean? The second expression takes everything in between </head> and </html>, which is where <body> and </body> usually are. So $body will contain the whole body, including the <body> and </body> tags.

what i mean by the above is, instead of using:
$text =~ /<\s*\/\s*HEAD\s*>(.*<\s*\/\s*HTML\s*> )/ism;
$body = $1;
$body =~ s/<\s*\/html\s*>//ismg;

why not just use:
$text =~ m|(<\s*body\s*>.*<\s*/\s*body\s*> )|ism;
$body = $1;

(like you did for <head>)
and save removing the </head> and </html> from the entry?

(and the second thing i changed is that i use m|/| instead of /\// ...)

and maybe that will work on your server (although i doubt there will be any difference, but who knows?)

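edit1: i spaced out the last > from ) because with no space between them the forum turns >) into a smiley image, which can be confusing ... take that space back out before running the code, since without /x it is a literal space in the pattern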

edit 2: the other thing is that some people may put comments between the </head> and <body> section, and that will result in you getting the comments before <body> section.... that's why i think just matching the body sections is better than trying to match everything and removing some things
 

kt

Originally posted by: stndn

what i mean by the above is, instead of using:
$text =~ /<\s*\/\s*HEAD\s*>(.*<\s*\/\s*HTML\s*> )/ism;
$body = $1;
$body =~ s/<\s*\/html\s*>//ismg;

why not just use:
$text =~ m|(<\s*body\s*>.*<\s*/\s*body\s*> )|ism;
$body = $1;

(like you did for <head>)
and save removing the </head> and </html> from the entry?

Oh, I see. The reason is that the <body> tag may carry attributes, like a background color. The easiest way is to go from </head> to </html>, since I don't think those tags take attributes. I'd rather just account for extra whitespace in the </head> and </html> tags.
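
For what it's worth, attributes on the opening tag do not rule out matching <body> directly. A minimal sketch, assuming the attribute values never contain a literal ">" (the sample page is made up):

```perl
use strict;
use warnings;

# Made-up page with a comment before <body> and an attribute on the tag.
my $text = qq{<html><head><title>t</title></head>\n}
         . qq{<!-- note -->\n<body bgcolor="#ffffff">\nhello\n</body></html>\n};

# \b ends the tag name, [^>]* skips any attributes up to the closing ">".
my ($body) = $text =~ m|(<\s*body\b[^>]*>.*?<\s*/\s*body\s*>)|is;

print "$body\n";
```

This also leaves out anything sitting between </head> and <body>, such as the comment in the sample. For pages that are not under your control, a real HTML parser is likely to be more robust than either regex.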

Originally posted by: stndn
(and the second thing i changed is that i use m|/| instead of /\// ...)

and maybe that will work on your server (although i doubt there will be any difference, but who knows?)

I tried that, no luck. It still does the same thing.