
Program that will crawl and extract all URLs from a site?

I want a program that will crawl through a site, get all the URLs, and dump them into an Excel sheet. Anyone know of a program that'll do it?
 
It will be very hard to find ready-made software for this, but you could ask a programmer to build it for you, or try to make it yourself in Visual Basic 2008. If you have questions about VB 2008, ask me.
 
httrack does this already, but you'll still need to enter the URLs into Excel yourself.

After you download the site with httrack, look at the new.lst and new.txt files that httrack will have created.
 
Why the need to directly enter it into a spreadsheet?

Get the list and then import it yourself.
 
Hmm, I've tried httrack, but it seems to only have the option of downloading entire websites. Is there a way to just get the new.lst file, since that's all I'd need?
 
Not that I know of, but you can avoid downloading any big files using the exclude list. Or, under the Links tab, try setting "get HTML files first" and then cancel the operation once it seems to have all the HTML files.

 
Originally posted by: nova2
Not that I know of, but you can avoid downloading any big files using the exclude list. Or, under the Links tab, try setting "get HTML files first" and then cancel the operation once it seems to have all the HTML files.

Ah, I figured it out now; the program is awesome! Is there any way to tell when it's saved all the HTML files while it's running?
 
Well, you'll see it downloading the HTML files first, and after those it usually starts on whatever is next. So when it starts downloading images or other files, that's probably when you can try canceling the download and see if the list of links has been compiled.

If canceling the website download (after it gets all the HTML files) keeps it from creating the list of links, then you could exclude all images and everything else besides HTML files.

Or you can do this under the Scan Rules tab:

-* (very important: exclude all links)
-*.jpg (don't download JPG files)
+http://www.somewebsite.com/folder1/*
+http://www.somewebsite.com/folder2/*

What this does is download only what you have specified in the scan rules list box.
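As a rough illustration of how +/- rules like these select URLs: each URL is tested against the patterns in order, and the last matching rule decides whether it's kept. This is a loose approximation of HTTrack's scan-rule behavior, sketched in Python with shell-style wildcards (not HTTrack's actual matching engine):

```python
from fnmatch import fnmatch

def allowed(url, rules):
    """Return True if the last rule matching url is a '+' rule.
    rules: strings like '-*' or '+http://site/folder1/*'.
    A loose approximation of HTTrack-style scan rules."""
    verdict = True  # default: allowed when no rule matches
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatch(url, pattern):
            verdict = (sign == "+")
    return verdict

# Placeholder site, mirroring the rules above:
rules = [
    "-*",                                      # exclude everything...
    "+http://www.somewebsite.com/folder1/*",   # ...except these
    "+http://www.somewebsite.com/folder2/*",   # two folders
]
```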
 