Need advice on how to crawl/parse websites (I'm thinking I need a Perl guru)

Mrdzone

Senior member
Sep 29, 2002
OK, here's what I need to do. I'm wondering if you guys can help me figure out the best way to do it. I have limited experience in Perl and Python, and good experience in C++. I feel comfortable learning languages quickly, though.

I need to access a particular URL of the form www.domain.com/query?page=i, then follow every link on the page with the format "www.domain.com/directory/page", where domain and directory are constant but page is different every time.

Once I get each new page, I need to extract a particular table and write the contents of its tr and th tags to an Excel (or CSV) file.

Once I finish, I need to increment i, retrieve a new page, and start the whole process over again.

I'm thinking the best way to do this is in Perl, as parsing the files shouldn't be too hard, but if I go this route can someone recommend a way to retrieve/spider the pages?

If there is an easier way to do this than Perl, I'm definitely open to ideas, as I'm largely going to have to relearn most of the Perl syntax anyway.
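For reference, here's a rough sketch of that loop in Python (one of the languages mentioned above). The base URL, the /directory/ prefix, and the page range are placeholders from the description, and the regex assumes double-quoted hrefs, so treat it as a starting point rather than working code:

```python
import re
import urllib.request

# Placeholder pattern: keep only links under the constant /directory/ path.
LINK_RE = re.compile(r'href="(/directory/[^"]+)"')

def fetch(url):
    """Download a page and return its HTML as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_links(html):
    """Return the unique /directory/page links, in first-seen order."""
    seen, links = set(), []
    for path in LINK_RE.findall(html):
        if path not in seen:
            seen.add(path)
            links.append(path)
    return links

def crawl(base, last_page):
    """Walk page=1..last_page and yield the HTML of every linked page."""
    for i in range(1, last_page + 1):
        index = fetch(f"{base}/query?page={i}")
        for path in extract_links(index):
            yield fetch(base + path)
```

The table-extraction step would then run on each page that crawl() yields.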

Merry Christmas

-Bob
 
Oct 27, 2007
I did something similar in Java and it was fairly straightforward; it was one of the first things I programmed in Java. Use a Set to maintain the list of pages so you don't get duplicates. The crawler just needs to iterate through every line of the HTML page looking for ahref= (I stripped the spaces out of mine to make this easier / more reliable), grab the substring between the quotes, and do the same thing to the page you grab.
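A minimal Python version of that scan (the ahref= trick described above, with spaces stripped first; it assumes double-quoted hrefs) might look like:

```python
def links_in_line(line):
    """Strip spaces from the line, then scan for ahref= and grab the
    substring between the pair of quotes that follows each occurrence."""
    marker = 'ahref="'
    squeezed = line.replace(" ", "")
    found = []
    start = squeezed.find(marker)
    while start != -1:
        begin = start + len(marker)
        end = squeezed.find('"', begin)
        if end == -1:
            break
        found.append(squeezed[begin:end])
        start = squeezed.find(marker, end)
    return found
```

Feeding each result through a Set before queueing it, as described above, keeps duplicate pages out.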

As for doing the table stuff, I haven't done this before, but I can't imagine it would be difficult as long as the tables are well-formed HTML.
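For the table half, Python's stdlib html.parser can pull th/td text out of well-formed HTML. This sketch collects rows from every table on the page, so picking out the one particular table the OP wants would still need an extra check (e.g. on the table's id attribute):

```python
import csv
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each th/td cell, grouped into one list per tr."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

def table_to_csv(html, path):
    """Parse the rows out of the HTML and append them to a CSV file."""
    parser = TableRows()
    parser.feed(html)
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(parser.rows)
```

Excel opens the resulting CSV directly, which should cover the "Excel (or CSV)" requirement.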

Having said that, I have absolutely no experience in Perl or Python so it could be the case that this is much easier in those languages. But post back if you want to view some code snippets from my crawler, or you can download the source at http://martindoms.com/crawler.html

Edit - I should note that the code in that program is pretty horrible; I rushed it out and I was a total noob, so don't judge me :)
 

Ken g6

Programming Moderator, Elite Member
Dec 11, 1999
Wget will get pages recursively for you. If I read that and the manual right...

wget -r -l1 -I/directory "http://www.domain.com/query?page=i"

That saves each page as a file. If every "page" is unique, you can do "-nd -P." and save them in the local directory.

Next, are you looking for one Excel file per "page", or per "i"? "-O-" might allow doing it all in pipes if you want one file per "i"; but I haven't looked into that.