How does google cache pages?

CrazyLazy

Platinum Member
Jun 21, 2008
How does google cache pages? What would it take for me to cache pages on my host in a similar manner? Obviously on a much, much smaller scale, just caching a couple of pages. I googled it and turned up nothing helpful, so I turn to you, ATP.
 

Fallen Kell

Diamond Member
Oct 9, 1999
Well, Google caches pages by using automated "bots", or spiders, that crawl links on the internet. When the program loads a page, it saves it to a local file inside Google's server farm, which gets propagated to their many different server farm sites across the world. When a page is captured, they also grab other metadata, like the time it was captured, where it was captured, and other things, so that it can be linked to from their search functions and the cached page can be added as a link with search results...

Your browser already does some local caching of pages, and uses them instead of downloading the page every time you hit the back/forward buttons. Now, how do you intend to use the cached pages? Do you have your own website that dynamically creates content, and you want it to create a cached page so that it doesn't need to access a database backend for every single person who goes to the page, and can instead serve them the cached version? Or are you trying to cache pages that you browse to often? I mean, if it is just stuff that you browse to, well, that would be a function of the browser to look at the cached version instead of downloading the one from the internet.

On a side note, WOW you have a lot of posts for only being here 5 months...
 

CrazyLazy

Platinum Member
Jun 21, 2008
Thanks for the reply. I am looking to cache pages as google does, for other people to use. I don't need a spider to crawl looking for pages; I just have a few pages I need to cache individually/manually. Hope that explains things well enough.
 

Markbnj

Elite Member, Moderator Emeritus
Sep 16, 2005
www.markbetz.net
Hold on a sec. Google does cache pages, but I think you're mixing up some concepts here. The purpose of a cache is either: a) to provide faster access to a resource by delivering it from a server that is closer to the client than the server that originally provided it; or b) to retain an archival copy of a page as it appeared at a given point in time. An adjunct to both of these is that caching might be more granular than page-level, i.e. images, media, and other individual resources can also be cached.

Both of these definitions of caching are independent of the mechanisms used to locate pages in the first place, which is what the spider bots Fallen Kell referred to do.

So, the question is: what do you mean by caching? It doesn't sound like you want a general performance increase, since you mention caching just a few pages. If you want general caching you can get it pretty easily using squid on a linux box.
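For reference, a general-purpose squid cache needs very little setup; a minimal squid.conf sketch (the paths, sizes, and network range here are illustrative, not a drop-in config) looks something like this:

```
# listen for proxy requests on the standard Squid port
http_port 3128
# on-disk cache: 1024 MB under /var/spool/squid, 16 first-level dirs, 256 second-level
cache_dir ufs /var/spool/squid 1024 16 256
# only let machines on the local network use the proxy
acl localnet src 192.168.0.0/16
http_access allow localnet
http_access deny all
```

Point your browsers at that box on port 3128 and repeat requests get served from the cache.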

So I guess you want to snapshot an offline version of these pages, and check for updates in the background? There are a couple of ways to do this, and in fact I think you can do it within both IE and Firefox without requiring any other tools.
 

CrazyLazy

Platinum Member
Jun 21, 2008
I want to save the pages for archival purposes. I would want to host the pages online so that other people can access archived versions of a page. I was looking around and found magpierss, which I think might be what I'm looking for. Basically takes pages from RSS feeds and caches the archived version of the page on your server. I still need to mess around with it to make sure it's what I want.
 

Leros

Lifer
Jul 11, 2004
What are you trying to accomplish? I mean, I know you're saving copies of a page at certain points in time, but why?
 

Ken g6

Programming Moderator, Elite Member
Dec 11, 1999
Sounds like WGet might be what you're after, at least on the retrieval side. Take a look at the manual and see if it will help you.

I used to use it to save these very forums offline when I had a very limited dial-up connection.
 

CrazyLazy

Platinum Member
Jun 21, 2008
Originally posted by: Leros
What are you trying to accomplish? I mean, I know you're saving copies of a page at certain points in time, but why?

I have a dynamic page that changes a lot and I want to have copies of it from different points in time. I then want to host the archived pages so other people can see them.

The closest I can come to caching stuff now is opening Firefox, right clicking and hitting "Save Page As". I then have to upload the html file and all other files associated with that page. Doing this manually is a pain in the ass, and it's the kind of thing I'm sure I can automate fairly easily with the right knowledge.
 

Ken g6

Programming Moderator, Elite Member
Dec 11, 1999
16,836
4,817
75
WGet should work for you. To get a page and all associated files, I would use "wget -ENHkp -P.", which saves the page and any associated files to a directory structure matching the source URLs. If you don't want all those directories, you could add the "-nd" switch as well to put everything in the current directory.

Next, you'll need to copy/rename whatever files changed with new filenames, probably based on the date. Hopefully, your "other associated files", which includes all images on the page, scripts, and anything in a frame or iframe, don't change. If something does change, you either need to upload all files for each snapshot, or do some automated HTML editing. Otherwise, a short script (e.g. a Perl one-liner) can do the renaming.

If the other associated files were static, you only have to upload all the files once; then just automatically upload the current version of the main HTML file into the same directory. You may want to try cURL for automatic uploading; it works with all the protocols I know of.
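To automate the whole snapshot-and-upload cycle, a shell sketch along these lines would do it (the site URL, archive host, and credentials are made up, so the wget and curl lines are commented out; the date-stamping logic is the real point):

```shell
#!/bin/sh
# 1) Fetch the current page (hypothetical URL; uncomment for real use):
# wget -ENHkp -P. http://mysite.com/page.html
# Stand-in for the downloaded file so the script runs as-is:
echo '<html><body>snapshot</body></html>' > page.html

# 2) Copy the page to a date-stamped name so earlier snapshots survive:
stamp=$(date +%Y-%m-%d)
cp page.html "page-$stamp.html"

# 3) Upload the snapshot with cURL (hypothetical host and credentials):
# curl -T "page-$stamp.html" ftp://archive.example.com/snapshots/ --user name:password
```

Run it from cron once a day and you get one archived copy per day with no manual "Save Page As" step.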
 

CrazyLazy

Platinum Member
Jun 21, 2008
2,124
1
0
Originally posted by: Ken_g6
WGet should work for you. To get a page and all associated files, I would use "wget -ENHkp -P.", which saves the page and any associated files to a directory structure matching the source URLs. If you don't want all those directories, you could add the "-nd" switch as well to put everything in the current directory.

Next, you'll need to copy/rename whatever files changed with new filenames, probably based on the date. Hopefully, your "other associated files", which includes all images on the page, scripts, and anything in a frame or iframe, don't change. If something does change, you either need to upload all files for each snapshot, or do some automated HTML editing. Otherwise, a short script (e.g. a Perl one-liner) can do the renaming.

If the other associated files were static, you only have to upload all the files once; then just automatically upload the current version of the main HTML file into the same directory. You may want to try cURL for automatic uploading; it works with all the protocols I know of.

Thanks, wget seems to be what I want. Right now I am just doing wget -m http://mysite.com, which saves the whole site just fine. I don't need the whole site though, just a single page. Is there a command to do this? I googled, and the command I found still seemed to save everything.
 

Ken g6

Programming Moderator, Elite Member
Dec 11, 1999
I suggested a command before, but I'm going to modify it a little:

wget -ENHKkp -nd -P. http://mysite.com

wget -p gets a Page.
-k converts links, so the site shows up right.
-N prevents re-downloading stuff you already have.
-K basically makes -N and -k work nicely together.
-H spans hosts (so you can get external sites' graphics).
-E adds a ".html" extension to "text/html" or "application/xhtml+xml" files without it.
-nd doesn't create subdirectories.
-P. gets everything to the current directory, and creates links that way too.

This is all in that manual I linked to in my first post.
 

CrazyLazy

Platinum Member
Jun 21, 2008
Originally posted by: Ken_g6
I suggested a command before, but I'm going to modify it a little:

wget -ENHKkp -nd -P. http://mysite.com

wget -p gets a Page.
-k converts links, so the site shows up right.
-N prevents re-downloading stuff you already have.
-K basically makes -N and -k work nicely together.
-H spans hosts (so you can get external sites' graphics).
-E adds a ".html" extension to "text/html" or "application/xhtml+xml" files without it.
-nd doesn't create subdirectories.
-P. gets everything to the current directory, and creates links that way too.

This is all in that manual I linked to in my first post.

Thanks, that works; I need to work on my reading comprehension skills some. Is there a way to automatically cache images that are referenced within style.css? Right now it just links to the external site containing the image. I'll look through the manual more to see if I can find anything on it.

EDIT: Question #2, is there an easy way to use wget with rss feeds? My attempts at this have failed as well.
 

Ken g6

Programming Moderator, Elite Member
Dec 11, 1999
WGet does (X)HTML. As far as I know, it doesn't do Javascript, and it doesn't do CSS.

But if the .css file is always the same, and the images it links to are always the same, you could make a static copy manually, or semi-automatically as Crusty describes. FYI, wget's -i option will download from a given text file listing URLs. If all that changes is the HTML, and not any of the images or other files, and you upload just that changed file as I described earlier, this should work.
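Since wget won't parse CSS itself, one workaround is to pull the url(...) references out of the stylesheet yourself and hand them to wget -i. A rough sketch (the style.css contents below are a stand-in for a real stylesheet, and the final wget is commented out because it needs network access):

```shell
#!/bin/sh
# Stand-in stylesheet with two image references:
cat > style.css <<'EOF'
body { background: url(http://example.com/img/bg.png); }
h1   { background: url("http://example.com/img/logo.gif"); }
EOF

# Grab everything inside url(...), then strip the wrapper and any quotes:
grep -o 'url([^)]*)' style.css \
  | sed -e 's/^url(//' -e 's/)$//' -e 's/^["'\'']//' -e 's/["'\'']$//' \
  > image-urls.txt

# Download the extracted URLs (commented here; needs network):
# wget -nd -P. -i image-urls.txt
```

That only covers plain url() references; CSS pulled in via @import or built by Javascript would slip through.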

If some of my assumptions are wrong...then your task is hard.
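As for the RSS question: the same -i trick applies. Scrape the <link> elements out of the feed and feed the resulting list to wget. A sketch (the feed XML below is a stand-in, and the final wget is commented out because it needs network access; a real feed may also put links in attributes, which this naive grep would miss):

```shell
#!/bin/sh
# Stand-in RSS feed with two items:
cat > feed.xml <<'EOF'
<rss><channel>
  <item><link>http://example.com/post1.html</link></item>
  <item><link>http://example.com/post2.html</link></item>
</channel></rss>
EOF

# Pull out the <link> elements and strip the tags:
grep -o '<link>[^<]*</link>' feed.xml \
  | sed -e 's/<link>//' -e 's|</link>||' \
  > feed-urls.txt

# Fetch each linked page (commented here; needs network):
# wget -ENHKkp -nd -P. -i feed-urls.txt
```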