www.yellowpages.com - any way to extract business names & addresses?

robphelan

Diamond Member
Aug 28, 2003
4,085
17
81

When I go to a site such as yellowpages.com & enter my City & Business Type (such as Electronics) I get a listing back with X number of hits.

Anybody know of a way to get that info in text format (only business name & address)?

I can view the source of the page, or copy/paste the info, but it would be a pain to extract it manually.

thanks a lot.
rp.
 

robphelan

nope. like get a listing of all electronics stores in my city so i can convert them to POIs on my GPS.
 

QED

Diamond Member
Dec 16, 2005
3,428
3
0
If you are handy with Unix, you can easily parse the data using your choice of grep/awk/sed or Perl.

Or, if you want, post your search parameters here (or PM them to me) and I can do the extraction for you.
 

robphelan

i'm decently handy with UNIX.

I can grep/awk etc. I'll PM you, but I'm not sure it's even possible with this site.
 

sdifox

No Lifer
Sep 30, 2005
95,345
15,306
126
Originally posted by: robphelan
nope. like get a listing of all electronics stores in my city so i can convert them to POIs on my GPS.

Print to PDF, then use a text extractor? Or just print to a PostScript file.
 

robphelan

i may just have to copy/paste into Excel and figure out a macro to delete the unwanted rows.

it looks like each business name came in with a space in front of it, so it should be easy enough to test that the cell is populated and then check whether the 1st char is a space.
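
For anyone who would rather do that filter outside Excel, here is a minimal Perl sketch of the same test. It assumes the pasted text really does mark each business name with a leading space, which is only what was observed above and may not hold for every results page. The sub name `name_lines` and the sample rows are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the rows that are populated and start
# with a space, then strip that leading whitespace.
sub name_lines {
    my @lines = @_;
    my @names;
    for my $line (@lines) {
        next unless $line =~ /\S/;       # skip empty or blank rows
        next unless $line =~ /^ /;       # keep rows whose 1st char is a space
        (my $name = $line) =~ s/^\s+//;  # drop the leading space
        push @names, $name;
    }
    return @names;
}

# Made-up sample rows standing in for the pasted listing.
my @pasted = (" Acme Electronics", "123 Main St", "", " Example Audio", "More Info");
print "$_\n" for name_lines(@pasted);
```

This only pulls out the names; the address rows would need their own rule.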
 

QED

Pipe the page source code through a perl script like what's below.

It will parse out the store name and address and write them to the terminal. It is easy to modify the code to write the output in whatever delimited format you would like instead.

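===== begin parse.pl =========
#!/usr/bin/perl
use strict;
use warnings;

# State flags: are we currently inside a name span or an address span?
my $inname  = 0;
my $inaddr  = 0;
my $curname = "";
my $curaddr = "";

while (<>) {
    my $line = $_;
    $line =~ s/\n//;        # drop the trailing newline
    $line =~ s/<a.*>//;     # remove from "<a" through the last ">" on the line
    $line =~ s/<\/a.*>//;   # remove closing </a> tags
    $line =~ s/<br.*>//;    # remove <br> tags

    # A closing </span> ends the current field; closing an address ends the record.
    if ($line =~ /<\/span>/) {
        if ($inaddr > 0) {
            print "Name: $curname \nAddress: $curaddr\n\n";
            $curname = "";
            $curaddr = "";
        }
        $inname = 0;
        $inaddr = 0;
    }

    # Accumulate text while inside a name or address span.
    if ($inname > 0) {
        $curname = $curname . " " . $line;
    }

    if ($inaddr > 0) {
        $curaddr = $curaddr . " " . $line;
    }

    # An opening span takes effect from the next line onward.
    if ($line =~ /<span class="name">/) {
        $inname = 1;
    }

    if ($line =~ /<span class="list_address">/) {
        $inaddr = 1;
    }
}
====end parse.pl====================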
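
As a rough sketch of the "delimited format" change mentioned above, here is one way to turn a name/address pair into a CSV line. The helpers `csv_field` and `csv_record` are hypothetical names, not part of parse.pl, and the sample values are made up. The idea would be to replace the print statement in the script with something like `print csv_record($curname, $curaddr), "\n";`.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy one field and quote it for CSV.
sub csv_field {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;   # trim both ends
    $text =~ s/\s+/ /g;        # collapse runs of whitespace
    $text =~ s/"/""/g;         # CSV escapes a double quote by doubling it
    return qq{"$text"};
}

# Hypothetical helper: build one CSV line from a name and an address.
sub csv_record {
    my ($name, $addr) = @_;
    return join(",", csv_field($name), csv_field($addr));
}

# Made-up sample values.
print csv_record(" Acme Electronics ", " 123 Main St,  Springfield"), "\n";
```

Quoting every field keeps commas inside an address from being read as column breaks when the file is imported.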
 

robphelan

thanks a lot for the code - my telnet access into our unix environment is locked down pretty tight.

i've ftp'd the perl & source to my personal web site & will give it a go.

thanks again.
rp.
 

Bootprint

Diamond Member
Jan 11, 2002
9,847
0
0
I usually use excel for that sort of thing. If you can get it into excel, you could use the CONCATENATE command, along with 'Text to Columns' command.

If the addresses look like:

1) Place
2) Address
3) City
4) Phone Number

and you want them to look like place, address, city, phone #

I usually just use =CONCATENATE(A1,"|",A2,"|",A3,"|",A4) a couple of columns over, then save as a .csv and re-import, erase the old columns and sort to get rid of the blank rows.
Then run the 'Text to Columns' command, using the | as a delimiter.
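
The same grouping can be sketched in Perl for comparison. This assumes every listing is exactly four non-blank lines (place, address, city, phone); a listing with a missing line would shift everything after it. The sub name `join_records` and the sample rows are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: join every $per_record non-blank lines with "|".
sub join_records {
    my ($per_record, @lines) = @_;
    my @fields = grep { /\S/ } @lines;   # drop blank rows
    my @records;
    while (@fields) {
        # splice removes and returns the next group of fields
        push @records, join("|", splice(@fields, 0, $per_record));
    }
    return @records;
}

# Made-up sample rows: two listings separated by a blank row.
my @rows = ("Acme Electronics", "123 Main St", "Springfield", "555-0100", "",
            "Example Audio", "9 Oak Ave", "Springfield", "555-0101");
print "$_\n" for join_records(4, @rows);
```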
 

robphelan

actually, I was able to execute it here. it worked very well :thumbsup:

the only thing is that some of the "Names" contained the actual Anchor link because that's what was contained in the inname variable.

is there any easy way to scrape all that out & leave only the needed text?

for instance, the output for the name of the Modern Wire & Lighting is:


thanks a lot.
 

robphelan

Originally posted by: Bootprint
I usually use excel for that sort of thing. If you can get it into excel, you could use the CONCATENATE command, along with 'Text to Columns' command.

If the addresses look like:

1) Place
2) Address
3) City
4) Phone Number

and you want them to look like place, address, city, phone #

I usually just use =CONCATENATE(A1,"|",A2,"|",A3,"|",A4) a couple of columns over, then save as a .csv and re-import, erase the old columns and sort to get rid of the blank rows.
Then run the 'Text to Columns' command, using the | as a delimiter.

thanks, this is what I was going to resort to, but it does involve a fair amount of copy/cut/pasting - I was hoping to avoid that since I'd like to make many of these POI files.
 

QED

The line:

$line=~s/<a.*>//;

is supposed to take care of removing <a> links inside of the code. Maybe try replacing that line of code with:

$line=~s/<a.*>//g;

and see if that makes a difference.
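
For what it's worth, /g only repeats the substitution along the line; it does not change how much each `.*` swallows. A small sketch of the difference on a made-up one-line sample (the real page markup may differ, and the sub names are hypothetical):

```perl
use strict;
use warnings;

# Greedy: ".*" runs to the LAST ">" on the line, so any link text goes too.
sub strip_greedy {
    my ($html) = @_;
    $html =~ s/<a.*>//g;
    return $html;
}

# Narrower: "[^>]*" stops at the first ">", so only the tags are removed.
sub strip_narrow {
    my ($html) = @_;
    $html =~ s/<\/?a\b[^>]*>//g;
    return $html;
}

# Made-up sample: opening tag, link text and closing tag on one line.
my $sample = '<a href="/biz/1">Acme Electronics</a>';
print "greedy: [", strip_greedy($sample), "]\n";
print "narrow: [", strip_narrow($sample), "]\n";
```

Neither pattern helps when a tag is split across lines, since the script only ever sees one line at a time.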
 

robphelan

hmm. that didn't work either..

#!/usr/bin/perl
my $inname=0;
my $inaddr=0;

while (<>){
    $line=$_;
    $line=~s/\n//;
    $line=~s/<a.*>//g;
    $line=~s/<\/a.*>//;
    $line=~s/<br.*>//;

    if ($line =~ /<\/span>/ ){
        if ($inaddr > 0){
            print "Name: $curname \nAddress: $curaddr\n\n";
            $curname="";
            $curaddr="";
        }
        $inname=0;
        $inaddr=0;
    }

    if ($inname > 0){
        $curname = $curname." ".$line;
    }

    if ($inaddr > 0){
        $curaddr = $curaddr." ".$line;
    }

    if ($line =~ /<span class=\"name\"\>/ ){
        $inname=1;
    }

    if ($line =~ /<span class=\"list_address\">/ ){
        $inaddr=1;
    }
}
 

QED

Ok... I see the problem. When you go to the second page of results they break up the link tags <A> so they span multiple lines.

Add the following two lines to the code in the same area as the other filters:

$line=~s/<a.*$//;
$line=~s/*****=.*>//;


EDIT:

Replace the stars "****" with "on Click" (remove the spaces). The forum is censoring it because it is a javascript keyword.
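
To illustrate what the two extra filters are meant to catch, here is a small sketch that runs the whole filter chain over a made-up link tag split across two lines. The sample markup and the sub name `clean_line` are assumptions for illustration, not the site's actual HTML, and the last filter is written with the keyword spelled out.

```perl
use strict;
use warnings;

# Apply the filters, in order, to one line of page source.
sub clean_line {
    my ($line) = @_;
    $line =~ s/\n//;
    $line =~ s/<a.*>//;        # <a ...> tag that closes on the same line
    $line =~ s/<\/a.*>//;      # closing </a>
    $line =~ s/<br.*>//;       # <br>
    $line =~ s/<a.*$//;        # <a ... tag that continues on the next line
    $line =~ s/onClick=.*>//;  # the continuation line that finishes the tag
    return $line;
}

# Made-up markup: an <a> tag broken across two lines.
my @source = (
    qq{<a href="/biz/2"\n},
    qq{onClick="track(2)">\n},
    qq{Modern Wire & Lighting\n},
    qq{</a>\n},
);
my @kept = grep { /\S/ } map { clean_line($_) } @source;
print "$_\n" for @kept;
```

Only the line holding the link text should survive; the two tag fragments and the closing tag come out empty.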
 

robphelan

i'm now getting the following error msg.

Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE ****=.*>/ at addressparse.pl line 10.

line 10 is $line=~s/*****=.*>//;

again, thanks for your help.
 

QED

See my edit above. The line should be

$line=~s/on Click=.*>//;

without the extra space between "on" and "Click". The forum is censoring the combined word because it is a javascript keyword.
 

robphelan

that absolutely did it.

thanks a ton.. this is going to save me a lot of tedious work. :beer: