Originally posted by: graemenz
How do these thousands of computers communicate? Presumably they each store a small percentage of the total web pages, so how do the results of a search get pooled, and how does the search request get sent to them?
With one billion searches per day, that is roughly 11,500 searches per second, which is far too many for one computer to receive and respond to, so they must have a lot of duplication, where each search gets farmed out to one particular group of computers. For example, they could have 500 groups of computers with 100 computers per group. I wonder if there's some customised hardware that routes the initial search request to a particular group of computers. I also wonder how the result gets back onto the internet, and whether it all goes through a single server or not.
I also wonder if they have a master database somewhere, with tape backup or something, so that the info on each of the smaller computers can be regenerated at any time.
Graeme
The answers to these questions are basically why the guys who founded Google are worth billions and we're not.
Educated guesses:
They have to use some sort of clustered web hosting to handle that kind of volume (other companies like Microsoft do this as well -- it's just not possible to host such a busy website from one server). When you submit a search at www.google.com, your request probably gets handed to one of dozens (or maybe even hundreds) of servers for processing. Then there's some system that sits in the middle and manages all the connections, so that the right HTTP results page gets back to the right requester.
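Here's a very rough sketch of what that middle layer might do -- entirely my guess, and the backend host names, the port, and the simple round-robin choice are all invented:

```python
# Hypothetical "system in the middle": a tiny front end that round-robins
# incoming search requests across a pool of backend servers and hands the
# chosen backend's response straight back to the original requester.
# The backend host names below are placeholders, not real machines.
import http.server
import itertools
import urllib.request

BACKENDS = ["http://backend-%02d.internal:8080" % i for i in range(12)]
next_backend = itertools.cycle(BACKENDS)  # simple round-robin rotation

class FrontEnd(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        backend = next(next_backend)                  # pick the next server in the pool
        with urllib.request.urlopen(backend + self.path) as resp:
            body = resp.read()
        self.send_response(200)                       # return the backend's answer
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    http.server.HTTPServer(("", 8000), FrontEnd).serve_forever()
```

The real thing would obviously have to notice dead backends, keep connections alive, and so on, but the basic hand-off-and-return shape is probably similar.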
Probably they parallelize the search in some way -- like you suggested, they may split their systems up into smaller "clusters" internally and parcel out search requests like that (and then hand them to different "clusters" if the needed data is not available where the request went the first time). My understanding is that the hardware is all off-the-shelf, just using 1 Gbps Ethernet switches to connect the systems. However, splitting up and recombining the search results like that is a very tricky problem, especially at the kind of volume they have to handle.
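To make the split-up-and-recombine idea concrete, here's a toy scatter/gather version. The shard names, the search_shard function, and the fake data are all made up -- this is just the shape of the problem, not how Google actually does it:

```python
# Toy scatter/gather: ask every index shard in parallel, then merge the
# partial hit lists and keep the best-scoring results overall.
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["shard-a", "shard-b", "shard-c"]  # each holds a slice of the index

def search_shard(shard, query):
    """Pretend network call: returns a list of (score, url) pairs from one shard."""
    fake_index = {
        "shard-a": [(0.9, "http://example.com/1"), (0.4, "http://example.com/4")],
        "shard-b": [(0.7, "http://example.com/2")],
        "shard-c": [(0.8, "http://example.com/3")],
    }
    return fake_index[shard]

def search(query, top_n=10):
    # Scatter: query every shard at the same time.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda s: search_shard(s, query), SHARDS)
    # Gather: flatten the partial lists and rank by score.
    merged = [hit for partial in partials for hit in partial]
    return sorted(merged, key=lambda hit: hit[0], reverse=True)[:top_n]

print(search("example query"))
```

The tricky part they have to solve is doing that merge quickly and correctly when a shard is slow, down, or missing the data, at thousands of queries a second.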
From what I have read, I do not believe that they maintain separate backups, at least not in anything resembling real time. All data in their system is replicated several times on redundant hardware -- and gets re-replicated somewhere else if some hardware goes down. They need almost all the data to be accessible online, so that they can search through it if needed.
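The re-replication part could be as simple in spirit as this toy loop -- the chunk names, machine names, and the replication factor of three are all my invention:

```python
# Toy illustration of "re-replicated somewhere else if some hardware goes down":
# keep every chunk of data on at least REPLICAS live machines, and when a
# machine dies, copy its chunks onto other machines until the count recovers.
import random

REPLICAS = 3
machines = {"m1", "m2", "m3", "m4", "m5"}
# chunk id -> set of machines currently holding a copy
placement = {"chunk-001": {"m1", "m2", "m3"},
             "chunk-002": {"m2", "m4", "m5"}}

def machine_died(dead):
    machines.discard(dead)
    for chunk, holders in placement.items():
        holders.discard(dead)                          # that copy is gone
        while len(holders) < REPLICAS:                 # copy it somewhere else
            target = random.choice(sorted(machines - holders))
            holders.add(target)
            print(f"re-replicating {chunk} onto {target}")

machine_died("m2")
```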
Their database is so large and grows so fast that straightforward backups would likely be impossible. However, you could structure it so that a point-in-time snapshot of the whole network is possible, though it might take days or weeks to actually copy all the data from that snapshot out to an external storage array or set of backup tapes.
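Pure speculation on my part, but here's the sort of scheme that would let them take a snapshot without stopping the world: version every record, freeze a version number, and copy the frozen versions out lazily. All the names and the logical-clock trick are just for illustration:

```python
# Keep every record versioned; a "snapshot" is just a frozen version number.
# Newer writes keep piling on top while the slow copy-out runs in the background.
version = 0                      # logical clock: bumps on every write
store = {}                       # key -> list of (version, value), newest last

def write(key, value):
    global version
    version += 1
    store.setdefault(key, []).append((version, value))

def take_snapshot():
    return version               # freeze this instant; no data is copied yet

def copy_out(snapshot_version):
    """Slow background copy to tape/array: latest value of each key as of the snapshot."""
    backup = {}
    for key, versions in store.items():
        as_of = [val for ver, val in versions if ver <= snapshot_version]
        if as_of:
            backup[key] = as_of[-1]
    return backup

write("page:example.com", "old crawl")
snap = take_snapshot()
write("page:example.com", "new crawl")   # lands after the snapshot; not included
print(copy_out(snap))                    # -> {'page:example.com': 'old crawl'}
```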