Originally posted by: graemenz
How do these thousands of computers communicate? Presumably they each store a small percentage of the total web pages, so how do the results of a search get pooled, and how does the search request get sent to them?
With one billion searches per day, that is roughly 11,500 searches per second, which is far too many for one computer to receive and respond to, so they must have a lot of duplication, where each search gets farmed out to one particular group of computers. For example, they could have 500 groups of computers with 100 computers per group. I wonder if there's some customised hardware that routes the initial search request to a particular group of computers. I also wonder how the result gets back onto the internet, and whether it all goes through a single server or not.
I also wonder if they have a master database somewhere, with tape backup or something, so that the info on each of the smaller computers can be regenerated at any time.
Graeme
The answers to these questions are basically why the guys who founded Google are worth billions and we're not.
Educated guesses:
They have to use some sort of clustered web hosting to handle that kind of volume (other companies like Microsoft do this as well -- it's just not possible to host such a busy website from one server). When you submit a search at www.google.com, your request probably gets handed to one of dozens (or maybe even hundreds) of servers for processing. Then there's some system that sits in the middle and manages all the connections, so that the right HTTP results page gets back to the right requester.
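Here's a very rough sketch of what that middle layer might do -- entirely my guess, and the backend host names, the port, and the simple round-robin choice are all invented:

```python
# Hypothetical "system in the middle": a tiny front end that round-robins
# incoming search requests across a pool of backend servers and hands the
# chosen backend's response straight back to the original requester.
# The backend host names below are placeholders, not real machines.
import http.server
import itertools
import urllib.request

BACKENDS = ["http://backend-%02d.internal:8080" % i for i in range(12)]
next_backend = itertools.cycle(BACKENDS)  # simple round-robin rotation

class FrontEnd(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        backend = next(next_backend)                  # pick the next server in the pool
        with urllib.request.urlopen(backend + self.path) as resp:
            body = resp.read()
        self.send_response(200)                       # return the backend's answer
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    http.server.HTTPServer(("", 8000), FrontEnd).serve_forever()
```

The real thing would obviously have to notice dead backends, keep connections alive, and so on, but the basic hand-off-and-return shape is probably similar.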
Probably they parallelize the search in some way -- like you suggested, they may split their systems up into smaller "clusters" internally and parcel out search requests like that (and then hand them to different "clusters" if the needed data is not available where the request went the first time). My understanding is that the hardware is all off-the-shelf, just using 1 Gbps Ethernet switches to connect the systems. However, splitting up and recombining the search results like that is a very tricky problem, especially at the kind of volume they have to handle.
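To make the split-up-and-recombine idea concrete, here's a toy scatter/gather version. The shard names, the search_shard function, and the fake data are all made up -- this is just the shape of the problem, not how Google actually does it:

```python
# Toy scatter/gather: ask every index shard in parallel, then merge the
# partial hit lists and keep the best-scoring results overall.
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["shard-a", "shard-b", "shard-c"]  # each holds a slice of the index

def search_shard(shard, query):
    """Pretend network call: returns a list of (score, url) pairs from one shard."""
    fake_index = {
        "shard-a": [(0.9, "http://example.com/1"), (0.4, "http://example.com/4")],
        "shard-b": [(0.7, "http://example.com/2")],
        "shard-c": [(0.8, "http://example.com/3")],
    }
    return fake_index[shard]

def search(query, top_n=10):
    # Scatter: query every shard at the same time.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda s: search_shard(s, query), SHARDS)
    # Gather: flatten the partial lists and rank by score.
    merged = [hit for partial in partials for hit in partial]
    return sorted(merged, key=lambda hit: hit[0], reverse=True)[:top_n]

print(search("example query"))
```

The tricky part they have to solve is doing that merge quickly and correctly when a shard is slow, down, or missing the data, at thousands of queries a second.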
From what I have read, I do not believe that they maintain separate backups, at least not in anything resembling real time. All data in their system is replicated several times on redundant hardware -- and gets re-replicated somewhere else if some hardware goes down. They need almost all the data to be accessible online, so that they can search through it if needed.
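The re-replication part could be as simple in spirit as this toy loop -- the chunk names, machine names, and the replication factor of three are all my invention:

```python
# Toy illustration of "re-replicated somewhere else if some hardware goes down":
# keep every chunk of data on at least REPLICAS live machines, and when a
# machine dies, copy its chunks onto other machines until the count recovers.
import random

REPLICAS = 3
machines = {"m1", "m2", "m3", "m4", "m5"}
# chunk id -> set of machines currently holding a copy
placement = {"chunk-001": {"m1", "m2", "m3"},
             "chunk-002": {"m2", "m4", "m5"}}

def machine_died(dead):
    machines.discard(dead)
    for chunk, holders in placement.items():
        holders.discard(dead)                          # that copy is gone
        while len(holders) < REPLICAS:                 # copy it somewhere else
            target = random.choice(sorted(machines - holders))
            holders.add(target)
            print(f"re-replicating {chunk} onto {target}")

machine_died("m2")
```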
Their database is so large and grows so fast that straightforward backups would likely be impossible. However, you could structure it so that a point-in-time snapshot of the whole network is possible, though it might take days or weeks to actually copy all the data from that snapshot out to an external storage array or set of backup tapes.
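Pure speculation on my part, but here's the sort of scheme that would let them take a snapshot without stopping the world: version every record, freeze a version number, and copy the frozen versions out lazily. All the names and the logical-clock trick are just for illustration:

```python
# Keep every record versioned; a "snapshot" is just a frozen version number.
# Newer writes keep piling on top while the slow copy-out runs in the background.
version = 0                      # logical clock: bumps on every write
store = {}                       # key -> list of (version, value), newest last

def write(key, value):
    global version
    version += 1
    store.setdefault(key, []).append((version, value))

def take_snapshot():
    return version               # freeze this instant; no data is copied yet

def copy_out(snapshot_version):
    """Slow background copy to tape/array: latest value of each key as of the snapshot."""
    backup = {}
    for key, versions in store.items():
        as_of = [val for ver, val in versions if ver <= snapshot_version]
        if as_of:
            backup[key] = as_of[-1]
    return backup

write("page:example.com", "old crawl")
snap = take_snapshot()
write("page:example.com", "new crawl")   # lands after the snapshot; not included
print(copy_out(snap))                    # -> {'page:example.com': 'old crawl'}
```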