
Program that will crawl and extract all URLs from a site?

I want a program that will crawl through a site, get all the URLs, and dump them into an Excel sheet. Anyone know of a program that'll do it?
 
It will be very hard to find ready-made software for this, but you could ask a programmer to build it for you, or try to make it yourself in Visual Basic 2008. If you have questions about VB 2008, ask me.
 
httrack does this already, but you'll still need to enter the URLs into Excel yourself.

After you download the site with httrack, look at the new.lst and new.txt files that httrack will have created.
 
Why the need to directly enter it into a spreadsheet?

Get the list and then import it yourself.
 
Hmm, I've tried httrack, but it seems to only have the option of downloading entire websites. Is there a way to just get the new.lst file, since that's all I'd need?
 
Not that I know of, but you can avoid downloading any big files using the exclude list. Or, under the Links tab, try setting "get HTML files first" and then cancel the operation once it seems to have all the HTML files.

 
Originally posted by: nova2
Not that I know of, but you can avoid downloading any big files using the exclude list. Or, under the Links tab, try setting "get HTML files first" and then cancel the operation once it seems to have all the HTML files.

Ah, I figured it out now; the program is awesome! Is there any way to tell when it's saved all the HTML files while it's running?
 
Well, you'll see it downloading the HTML files first, and after those it usually starts on whatever is next. So when it starts downloading images or other files, that's probably when you can try canceling the download and see if the list of links has been compiled.

If canceling the website download (after it gets all the HTML files) keeps it from creating the list of links, then you could exclude all images and everything else besides HTML files.

Or you can do this under the Scan Rules tab:

-* (very important: exclude all links)
-*.jpg (don't download JPG files)
+http://www.somewebsite.com/folder1/*
+http://www.somewebsite.com/folder2/*

What this does is download only what you have specified in the scan rules list box.
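As a rough illustration of how +/- rules like these select URLs: each URL is tested against the patterns in order, and the last matching rule decides whether it's kept. This is a loose approximation of HTTrack's scan-rule behavior, sketched in Python with shell-style wildcards (not HTTrack's actual matching engine):

```python
from fnmatch import fnmatch

def allowed(url, rules):
    """Return True if the last rule matching url is a '+' rule.
    rules: strings like '-*' or '+http://site/folder1/*'.
    A loose approximation of HTTrack-style scan rules."""
    verdict = True  # default: allowed when no rule matches
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatch(url, pattern):
            verdict = (sign == "+")
    return verdict

# Placeholder site, mirroring the rules above:
rules = [
    "-*",                                      # exclude everything...
    "+http://www.somewebsite.com/folder1/*",   # ...except these
    "+http://www.somewebsite.com/folder2/*",   # two folders
]
```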
 