How does google cache pages?

CrazyLazy

Platinum Member
Jun 21, 2008
How does google cache pages? What would it take for me to cache pages on my host in a similar manner? Obviously on a much, much smaller scale, just caching a couple of pages. I googled it and turned up nothing helpful, so I turn to you, ATP.
 

Fallen Kell

Diamond Member
Oct 9, 1999
Well, Google caches pages by using automated "bots", or spiders, that crawl links on the internet. When the program loads a page, it saves it to a local file inside Google's server farm, which gets propagated to their many different server farm sites across the world. When a page is captured, they also grab other metadata, like the time it was captured, where it was captured, and other things, so that it can be linked to from their search functions and the cached page can be added as a link with search results...

Your browser already does some local caching of pages, and uses them instead of downloading the page every time you hit the back/forward buttons. Now, how do you intend to use the cached pages? Do you have your own website that dynamically creates content, and you want it to create a cached page so that it doesn't need to access a database backend for every single person who goes to the page, and can instead serve them the cached version? Or are you trying to cache pages that you browse to often? I mean, if it is just stuff that you browse to, well, that would be a function of the browser to look at the cached version instead of downloading the one from the internet.

On a side note, WOW you have a lot of posts for only being here 5 months...
 

CrazyLazy

Platinum Member
Jun 21, 2008
Thanks for the reply. I am looking to cache pages as google does, for other people to use. I don't need a spider to crawl looking for pages; I just have a few pages I need to cache individually/manually. Hope that explains things well enough.
 

Markbnj

Elite Member, Moderator Emeritus
Sep 16, 2005
www.markbetz.net
Hold on a sec. Google does cache pages, but I think you're mixing up some concepts here. The purpose of a cache is either: a) to provide faster access to a resource by delivering it from a server that is closer to the client than the server that originally provided it; or b) to retain an archival copy of a page as it appeared at a given point in time. An adjunct to both of these is that caching might be more granular than page-level, i.e. images, media, and other individual resources can also be cached.

Both of these definitions of caching are independent of the mechanisms used to locate pages in the first place, which is what the spider bots Fallen Kell referred to do.

So, the question is: what do you mean by caching? It doesn't sound like you want a general performance increase, since you mention caching just a few pages. If you want general caching you can get it pretty easily using squid on a linux box.
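For reference, a general-purpose squid cache needs very little setup; a minimal squid.conf sketch (the paths, sizes, and network range here are illustrative, not a drop-in config) looks something like this:

```
# listen for proxy requests on the standard Squid port
http_port 3128
# on-disk cache: 1024 MB under /var/spool/squid, 16 first-level dirs, 256 second-level
cache_dir ufs /var/spool/squid 1024 16 256
# only let machines on the local network use the proxy
acl localnet src 192.168.0.0/16
http_access allow localnet
http_access deny all
```

Point your browsers at that box on port 3128 and repeat requests get served from the cache.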

So I guess you want to snapshot an offline version of these pages, and check for updates in the background? There are a couple of ways to do this, and in fact I think you can do it within both IE and Firefox without requiring any other tools.
 

CrazyLazy

Platinum Member
Jun 21, 2008
I want to save the pages for archival purposes. I would want to host the pages online so that other people can access archived versions of a page. I was looking around and found magpierss, which I think might be what I'm looking for. Basically takes pages from RSS feeds and caches the archived version of the page on your server. I still need to mess around with it to make sure it's what I want.
 

Leros

Lifer
Jul 11, 2004
What are you trying to accomplish? I mean, I know you're saving copies of a page at certain points in time, but why?
 

Ken g6

Programming Moderator, Elite Member
Dec 11, 1999
Sounds like WGet might be what you're after, at least on the retrieval side. Take a look at the manual and see if it will help you.

I used to use it to save these very forums offline when I had a very limited dial-up connection.
 

CrazyLazy

Platinum Member
Jun 21, 2008
Originally posted by: Leros
What are you trying to accomplish? I mean, I know you're saving copies of a page at certain points in time, but why?

I have a dynamic page that changes a lot and I want to have copies of it from different points in time. I then want to host the archived pages so other people can see them.

The closest I can come to caching stuff now is opening Firefox, right clicking and hitting "Save Page As". I then have to upload the html file and all other files associated with that page. Doing this manually is a pain in the ass, and it's the kind of thing I'm sure I can automate fairly easily with the right knowledge.
 

Ken g6

Programming Moderator, Elite Member
Dec 11, 1999
16,836
4,817
75
WGet should work for you. To get a page and all associated files, I would use "wget -ENHkp -P.", which saves the page and any associated files to a directory structure matching the source URLs. If you don't want all those directories, you could add the "-nd" switch as well to put everything in the current directory.

Next, you'll need to copy/rename whatever files changed with new filenames, probably based on the date. Hopefully, your "other associated files", which includes all images on the page, scripts, and anything in a frame or iframe, don't change. If something does change, you either need to upload all files for each snapshot, or do some automated HTML editing. Otherwise, a short script (e.g. a Perl one-liner) can do the renaming.

If the other associated files were static, you only have to upload all the files once; then just automatically upload the current version of the main HTML file into the same directory. You may want to try cURL for automatic uploading; it works with all the protocols I know of.
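To automate the whole snapshot-and-upload cycle, a shell sketch along these lines would do it (the site URL, archive host, and credentials are made up, so the wget and curl lines are commented out; the date-stamping logic is the real point):

```shell
#!/bin/sh
# 1) Fetch the current page (hypothetical URL; uncomment for real use):
# wget -ENHkp -P. http://mysite.com/page.html
# Stand-in for the downloaded file so the script runs as-is:
echo '<html><body>snapshot</body></html>' > page.html

# 2) Copy the page to a date-stamped name so earlier snapshots survive:
stamp=$(date +%Y-%m-%d)
cp page.html "page-$stamp.html"

# 3) Upload the snapshot with cURL (hypothetical host and credentials):
# curl -T "page-$stamp.html" ftp://archive.example.com/snapshots/ --user name:password
```

Run it from cron once a day and you get one archived copy per day with no manual "Save Page As" step.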
 

CrazyLazy

Platinum Member
Jun 21, 2008
2,124
1
0
Originally posted by: Ken_g6
WGet should work for you. To get a page and all associated files, I would use "wget -ENHkp -P.", which saves the page and any associated files to a directory structure matching the source URLs. If you don't want all those directories, you could add the "-nd" switch as well to put everything in the current directory.

Next, you'll need to copy/rename whatever files changed with new filenames, probably based on the date. Hopefully, your "other associated files", which includes all images on the page, scripts, and anything in a frame or iframe, don't change. If something does change, you either need to upload all files for each snapshot, or do some automated HTML editing. Otherwise, a short script (e.g. a Perl one-liner) can do the renaming.

If the other associated files were static, you only have to upload all the files once; then just automatically upload the current version of the main HTML file into the same directory. You may want to try cURL for automatic uploading; it works with all the protocols I know of.

Thanks, wget seems to be what I want. Right now I am just doing wget -m http://mysite.com, which saves the whole site just fine. I don't need the whole site though, just a single page. Is there a command to do this? I googled, and the command I found still seemed to save everything.
 

Ken g6

Programming Moderator, Elite Member
Dec 11, 1999
I suggested a command before, but I'm going to modify it a little:

wget -ENHKkp -nd -P. http://mysite.com

wget -p gets a Page.
-k converts links, so the site shows up right.
-N prevents re-downloading stuff you already have.
-K basically makes -N and -k work nicely together.
-H spans hosts (so you can get external sites' graphics).
-E adds a ".html" extension to "text/html" or "application/xhtml+xml" files without it.
-nd doesn't create subdirectories.
-P. gets everything to the current directory, and creates links that way too.

This is all in that manual I linked to in my first post.
 

CrazyLazy

Platinum Member
Jun 21, 2008
Originally posted by: Ken_g6
I suggested a command before, but I'm going to modify it a little:

wget -ENHKkp -nd -P. http://mysite.com

wget -p gets a Page.
-k converts links, so the site shows up right.
-N prevents re-downloading stuff you already have.
-K basically makes -N and -k work nicely together.
-H spans hosts (so you can get external sites' graphics).
-E adds a ".html" extension to "text/html" or "application/xhtml+xml" files without it.
-nd doesn't create subdirectories.
-P. gets everything to the current directory, and creates links that way too.

This is all in that manual I linked to in my first post.

Thanks, that works; I need to work on my reading comprehension skills some. Is there a way to automatically cache images that are referenced within style.css? Right now it just links to the external site containing the image. I'll look through the manual more to see if I can find anything on it.

EDIT: Question #2, is there an easy way to use wget with rss feeds? My attempts at this have failed as well.
 

Ken g6

Programming Moderator, Elite Member
Dec 11, 1999
WGet does (X)HTML. As far as I know, it doesn't do Javascript, and it doesn't do CSS.

But if the .css file is always the same, and the images it links to are always the same, you could make a static copy manually, or semi-automatically as Crusty describes. FYI, wget's -i option will download from a given text file listing URLs. If all that changes is the HTML, and not any of the images or other files, and you upload just that changed file as I described earlier, this should work.
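Since wget won't parse CSS itself, one workaround is to pull the url(...) references out of the stylesheet yourself and hand them to wget -i. A rough sketch (the style.css contents below are a stand-in for a real stylesheet, and the final wget is commented out because it needs network access):

```shell
#!/bin/sh
# Stand-in stylesheet with two image references:
cat > style.css <<'EOF'
body { background: url(http://example.com/img/bg.png); }
h1   { background: url("http://example.com/img/logo.gif"); }
EOF

# Grab everything inside url(...), then strip the wrapper and any quotes:
grep -o 'url([^)]*)' style.css \
  | sed -e 's/^url(//' -e 's/)$//' -e 's/^["'\'']//' -e 's/["'\'']$//' \
  > image-urls.txt

# Download the extracted URLs (commented here; needs network):
# wget -nd -P. -i image-urls.txt
```

That only covers plain url() references; CSS pulled in via @import or built by Javascript would slip through.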

If some of my assumptions are wrong...then your task is hard.
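As for the RSS question: the same -i trick applies. Scrape the <link> elements out of the feed and feed the resulting list to wget. A sketch (the feed XML below is a stand-in, and the final wget is commented out because it needs network access; a real feed may also put links in attributes, which this naive grep would miss):

```shell
#!/bin/sh
# Stand-in RSS feed with two items:
cat > feed.xml <<'EOF'
<rss><channel>
  <item><link>http://example.com/post1.html</link></item>
  <item><link>http://example.com/post2.html</link></item>
</channel></rss>
EOF

# Pull out the <link> elements and strip the tags:
grep -o '<link>[^<]*</link>' feed.xml \
  | sed -e 's/<link>//' -e 's|</link>||' \
  > feed-urls.txt

# Fetch each linked page (commented here; needs network):
# wget -ENHKkp -nd -P. -i feed-urls.txt
```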