Apache access logs

IBhacknU

Diamond Member
Oct 9, 1999
My question is as follows:

Searching my access logs, I find this:

216.35.103.80 - - [24/Oct/2000:07:34:59 -1000] "GET /robots.txt HTTP/1.0" 404 283
216.35.103.80 - - [24/Oct/2000:07:35:21 -1000] "GET / HTTP/1.0" 200 1788
216.35.103.79 - - [24/Oct/2000:07:39:03 -1000] "GET /robots.txt HTTP/1.0" 404 283
216.35.103.79 - - [24/Oct/2000:07:39:35 -1000] "GET / HTTP/1.0" 200 1788
216.35.103.81 - - [24/Oct/2000:07:50:26 -1000] "GET /robots.txt HTTP/1.0" 404 283
216.35.103.81 - - [24/Oct/2000:07:50:47 -1000] "GET / HTTP/1.0" 200 1788

and then this....
213.216.143.39 - - [25/Oct/2000:03:18:30 -1000] "GET /robots.txt HTTP/1.0" 404 283
213.216.143.39 - - [25/Oct/2000:03:18:32 -1000] "GET / HTTP/1.0" 200 1788
213.216.143.37 - - [25/Oct/2000:05:18:04 -1000] "GET /robots.txt HTTP/1.0" 404 283
213.216.143.37 - - [25/Oct/2000:05:18:05 -1000] "GET / HTTP/1.0" 200 1788


Is this a web crawler?
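
For what it's worth, this is roughly how I'd pull every host that requested /robots.txt out of a bigger log (a rough Python sketch; it assumes Common Log Format, and "access_log" is just a placeholder filename):

-----8<-----
import re
from collections import defaultdict

# Common Log Format: host ident authuser [date] "request" status bytes
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

requests_by_ip = defaultdict(list)

with open("access_log") as f:          # placeholder path -- point it at your log
    for line in f:
        m = LOG_RE.match(line)
        if m:
            ip, when, request, status, size = m.groups()
            requests_by_ip[ip].append(request)

# Hosts that ask for /robots.txt are almost always robots of some kind.
for ip, reqs in requests_by_ip.items():
    if any(r.startswith("GET /robots.txt") for r in reqs):
        print(ip, "fetched /robots.txt,", len(reqs), "requests total")
-----8<-----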

cmv

Diamond Member
Oct 10, 1999
Most likely. You can check for yourself, too, by going to, say, anandtech.com like this:

www.anandtech.com/robots.txt

Judging by the user agent (which can be spoofed), it is a web crawler.

EDIT: anandtech.com/robots.txt is a bad example because they don't have a robots.txt. Try this one:

google.com/robots.txt
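
If you'd rather grab it from a script than a browser, here's a minimal Python sketch (google.com is just the example above; swap in whatever host you're curious about):

-----8<-----
from urllib.request import urlopen

# Fetch a site's robots.txt the same way a crawler would.
with urlopen("http://www.google.com/robots.txt") as resp:
    print(resp.read().decode("utf-8", errors="replace"))
-----8<-----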

IBhacknU

Diamond Member
Oct 9, 1999
So, given this reasoning (that crawlers look for this file), what sort of info might one want to put in this .txt file, and how would it be treated?

Example of google.com/robots.txt:

User-agent: *
Disallow: /search
Disallow: /keyword/


NiTeByTe

Junior Member
Oct 5, 2000
Although I use a default template with all the websites I've designed, here are some links to documents describing the robots.txt file.

http://www.searchtools.com/robots/robots-txt.html

http://info.webcrawler.com/mak/projects/robots/norobots.html

Here's one of my robots.txt files.

-----8<-----
User-agent: *
Disallow: /cgi-bin/
Disallow: /reports/
-----8<-----

It only tells spiders to stay out of the CGI directory and the reports directory (the reports created by AccessWatch). Disallow values are matched as path prefixes, so no trailing wildcard is needed.
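
For the "how would it be treated" part of the question above, here is a minimal sketch, using Python's standard robots.txt parser, of how a compliant spider reads a file like that (the example.com URLs are made up):

-----8<-----
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /reports/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A polite spider checks each URL against the rules before fetching it.
print(rp.can_fetch("*", "http://example.com/cgi-bin/search.pl"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))         # True
-----8<-----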

On one of those pages above there is mention of a Perl script which 'fakes out' the spiders. I have not used a script like that (yet) but have seen how well they work.

Just my $.02 worth.

BTW: To definitively find out if it was a spider, you can do a reverse DNS lookup on the addresses or, if that's unsuccessful, do a `whois -h whois.arin.net xxx.xxx.xxx.0` to find out who owns the IPs.
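
Here's a quick sketch of the reverse-DNS part in Python, using two of the addresses from the log excerpts above; if there's no PTR record, you'd fall back to whois:

-----8<-----
import socket

# Addresses taken from the log excerpts earlier in the thread.
for ip in ("216.35.103.80", "213.216.143.39"):
    try:
        host, _aliases, _addrs = socket.gethostbyaddr(ip)
        print(ip, "->", host)   # crawlers usually reverse to their owner's domain
    except socket.herror:
        print(ip, "-> no PTR record, try whois instead")
-----8<-----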