how do large-scale web servers work?

bizmark

Banned
Feb 4, 2002
2,311
0
0
Okay, so you can run a basic web server on a P100 running apache. Obviously, bumping things up in speed, hard drive, RAM etc. makes everything run smoother and allows for things such as databases, scripts, smoother access with multiple users at a time, and so forth. So that's for a basic, run-of-the-mill web page. It can all be run off of one single computer no different from the one sitting on my desk right now. Even getting several hundred hits a day, if it's just HTML and images, there's no problem with say a PII with 256MB. (Obviously bandwidth is a concern, but assume for this discussion that bandwidth is available as much as we need.)

What happens when you get thousands of hits a day? what about streaming audio and video (i.e. not just files being transferred -- something like internet radio or live video)? How much processing power does that require, per user?

You get obvious benefits from scaling and speeding things up within the one server, e.g. RAID, SMP, faster RAM and HDD's, faster bus speeds, faster connection to the network. But what happens when you still need more than that? In other words, when you clearly need more than one computer to handle the load. How is this done? Are they all symmetrical, with each holding the same information, and the users are randomly assigned to one server or another to keep the loads as even as possible? Are there different servers for different parts of the page? Is there some sort of protocol that handles this type of thing? Can multiple computers have the same IP, if they manage themselves within that domain (i.e. they communicate and avoid conflicts)?

[edit] I just looked at the other thread in this form where they mentioned the IBM P4. Surely there exist situations in which multiples of those are necessary, right? Or do gigantic websites (say MSNBC) just keep one server and keep upgrading/replacing it with something newer/faster/better all the time?
 

Agent004

Senior member
Mar 22, 2001
492
0
0
It's highly unlikely that a site such as MSNBC uses just one server; more likely it uses multiple servers plus a load-balancing server to spread out the load, which also makes the site more redundant. When they want to upgrade, they can take the servers out and upgrade them one at a time, or take the site down for maintenance and do all the upgrades at once.

How much processing power it needs depends heavily on the software you use. For example, you can run IIS with 100 connections and max out your CPU at 100% with a 700 MB swap file, whereas with something like phttp (I forget the name) you can do 400 connections at 50% CPU with only a 200 MB swap file. What I'm trying to say is that there is a significant performance difference between different software.

 

Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
Using a server farm (more than one server to distribute the load) is about more than just dealing with the load; it's also for redundancy. If you have 5 web servers split by a hardware load balancer or round-robin DNS and one goes down, no one will notice because the other 4 will pick up the slack.
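A minimal sketch of that idea (hypothetical hostnames, not any real site's setup): a round-robin picker that skips a dead box, so the remaining servers absorb its share of the requests.

```python
from itertools import cycle

# Hypothetical pool of five web servers behind one public name.
SERVERS = ["web1", "web2", "web3", "web4", "web5"]

def round_robin(servers, healthy):
    """Yield the next healthy server, silently skipping any that are down."""
    pool = cycle(servers)
    while True:
        s = next(pool)
        if s in healthy:
            yield s

# Suppose web3 has crashed; the other four pick up the slack.
healthy = {"web1", "web2", "web4", "web5"}
picker = round_robin(SERVERS, healthy)
first_eight = [next(picker) for _ in range(8)]
print(first_eight)
# -> ['web1', 'web2', 'web4', 'web5', 'web1', 'web2', 'web4', 'web5']
```

Plain round-robin DNS works the same way in spirit, except the DNS server has no health information, so clients can still be handed a dead address until its record is pulled.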

Also, most big web sites have a database holding a chunk of their data (user information, stats, etc.), so part of the load is pushed onto that server. Some even have separate servers for HTML and images.
 

Mark R

Diamond Member
Oct 9, 1999
8,513
16
81
The answer is multiple servers - the scale of this is sometimes quite extraordinary.

Google.com, for example, is reputed to have 10,000 servers at each of its datacentres. In their case they have found that more relatively modest machines (Pentium IIIs with 2 GB RAM running Linux) work better than fewer, more powerful machines; however, this is not a generalisable rule.
 

bizmark

Banned
Feb 4, 2002
2,311
0
0


<< Google.com, for example, is reputed to have 10,000 servers at each of its datacentres. In their case they have found that more relatively modest machines (Pentium IIIs with 2 GB RAM running Linux) work better than fewer, more powerful machines; however, this is not a generalisable rule. >>



But what exactly happens when I connect to Google? I connect to one of the machines, right? And surely this single machine doesn't hold everything that Google knows on its hard drives. The various servers must communicate with each other somehow and pass information between them. The complexity of this task is almost too much to think about.
 

zephyrprime

Diamond Member
Feb 18, 2001
7,512
2
81
wbwither wants a detailed technical explanation of how multiple servers are used. He's already figured out that multiple servers are necessary. How is it that multiple computers can appear as a single IP address? What's the division of labor between computers? Is there a structured design hierarchy? What protocol is used to communicate between machines? What kind of hardware is the front-line interface for the T3/OC12/whatever? I have no idea how it's done myself.
 

Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
If load balancing is done right you won't be able to see it in action, so www1, www2, etc. is a bad thing IMHO. It's easy for plain HTTP because you can forward ports to a NAT'd host without issues; HTTP is a very simple protocol.

The way I see it: some load balancers have software you put on each server, so the balancer can tell the real load (CPU, memory, etc.) on the boxes, make a real decision about which one is least used right now, and give it the current request. Or you can go for 'dumb' load balancing, where it just hands requests to a list of servers in order. The second method is much simpler, but if one box gets stuck with a big request it may get backed up, since the load balancer has no way of knowing how busy it is.
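The two strategies above can be sketched side by side (made-up server names and load numbers, purely illustrative):

```python
# 'Dumb' vs. 'smart' load balancing, sketched with fake load figures.

servers = ["a", "b", "c"]
load = {"a": 0, "b": 0, "c": 0}   # e.g. active connections reported by an agent

def pick_dumb(state={"i": 0}):
    """'Dumb' balancing: hand requests out in a fixed order, ignoring load."""
    s = servers[state["i"] % len(servers)]
    state["i"] += 1
    return s

def pick_least_loaded():
    """'Smart' balancing: agents on each box report real load; pick the minimum."""
    return min(servers, key=lambda s: load[s])

# Server 'a' gets stuck with one huge request...
load["a"] = 50
# ...round-robin still sends it every third request; least-loaded avoids it.
print(pick_dumb(), pick_least_loaded())
# -> a b
```

The trade-off is exactly the one described: the smart scheme needs an agent on every box and a reporting channel, while the dumb scheme needs nothing but a server list.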
 

Buddha Bart

Diamond Member
Oct 11, 1999
3,064
0
0
I can explain the setup I'm about to roll out at work.

First off, the connection hits one of the two load balancers. One actually does the job; the other one monitors the first and, if it detects it's down, takes over (hot standby). From there the connection gets NAT'd to one of the three webservers. This is done with a piece of software from http://www.linuxvirtualserver.org. It's basically just port forwarding, except you tell it to forward to multiple servers and to decide amongst them using some algorithm (round robin, least-connections, etc.).
Then, in the case of my website, the webservers run their little scripts (mostly PHP, some JSP) which pull info from the database servers. There are two of these, and they operate much like the load balancers, with one active and one on hot standby. They use MySQL's replication feature to stay almost constantly in sync. Plus, with this you can take the slave DB server down for backups and then bring it back up without affecting the service. For JSP pages, we run Tomcat on the webservers, whose servlets also usually connect to the database.

Also, the webservers don't log normally; they use ApacheDB to log to a MySQL database located on the log server. This way we can combine their logs easily and run complex queries to generate detailed stats. In addition, the log server receives all syslog events from the other servers in the cluster, and runs some monitors for the availability of other machines and services. This machine too has a hot-standby backup; their databases are kept in sync by the same MySQL replication, and their files by a cron job that frequently runs rsync.

Lastly, we have a file server, which is what we actually SSH into when we work on files. It controls our whole publishing system and rsyncs the updates out to each of the three webservers.
The load balancers run a daemon called 'mon', which is a really simple framework for running scripts that monitor a service (like a Perl script that makes an HTTP connection every 5 seconds) and then initiating some event based on the response. For instance, if a webserver goes down, it rewrites the load-balancing table to exclude that machine, then keeps checking, and when the machine comes back up it rewrites the rules and puts it back in. It also emails us, and pages us for different events.
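The core of that monitor loop is tiny. A toy version of the 'mon' idea (the probe here is a stand-in for the real HTTP check every few seconds, and the hostnames are made up):

```python
# Toy health-check loop: probe each webserver, and rebuild the forwarding
# table when one disappears or comes back.

def rebuild_table(servers, probe):
    """Return only the servers that currently answer the probe."""
    return [s for s in servers if probe(s)]

servers = ["web1", "web2", "web3"]

# Pretend web2 has stopped answering:
down = {"web2"}
table = rebuild_table(servers, lambda s: s not in down)
print(table)   # web2 dropped from the forwarding table -> ['web1', 'web3']

# web2 recovers:
down.clear()
table = rebuild_table(servers, lambda s: s not in down)
print(table)   # web2 put back in -> ['web1', 'web2', 'web3']
```

The real daemon does the same thing, plus the side effects: rewriting the LVS table and firing off email/pager alerts on each state change.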
Most of our machines are 1.13 GHz P3s in 1U IBM boxes. I haven't gotten a chance to really push the setup yet, but I bet CPU usage is ridiculously small.

here's some links
www.linuxvirtualserver.org
www.linux-ha.com
web.marist.edu/~stbm/server/commmajor.pdf (I made this because my boss's boss was too dumb to get the network setup verbally... comm majors...)
www.redbooks.ibm.com/redbooks/SG245994.html

bart
 

Tails

Junior Member
Sep 5, 2001
11
0
0


<< But what exactly happens when I connect to Google? I connect to one of the machines, right? And surely this single machine doesn't hold everything that Google knows on its hard drives. The various servers must communicate with each other somehow and pass information between them. The complexity of this task is almost too much to think about. >>



I'm going from memory here, but if it serves me correctly, when you type in www.google.com you get the IP address of their load balancer. When you request the homepage (which is static), it either pulls it directly out of memory and serves it to you, or pulls it from one of the n webservers that hold static content and relays it to you (it tries to pull from the server with the least load).

When you do a search, your query is sent to the load balancer which relays your request to the cluster of machines that most likely has the answer to your query. They collaborate to produce the end results and relay the results back to you.
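A guess at what that collaboration looks like (this is the generic scatter/gather pattern, not Google's actual code; the shards and scores are invented): the front end fans the query out to machines that each hold a slice of the index, then merges their partial results.

```python
# Scatter/gather search across index shards (hypothetical data).

def search_shard(shard, query):
    """Each shard holds a slice of the index; return (score, doc) pairs."""
    return [(score, doc) for doc, score in shard.items() if query in doc]

def search(shards, query, k=3):
    """Scatter the query to every shard, gather and merge the partial results."""
    hits = []
    for shard in shards:
        hits.extend(search_shard(shard, query))
    hits.sort(reverse=True)            # best score first
    return [doc for score, doc in hits[:k]]

shards = [
    {"python tutorial": 0.9, "cat pictures": 0.1},
    {"python snakes": 0.7},
    {"learn python fast": 0.8, "dog pictures": 0.2},
]
print(search(shards, "python"))
# -> ['python tutorial', 'learn python fast', 'python snakes']
```

In a real system the shard searches run in parallel on separate machines, which is why thousands of modest boxes can answer a query faster than one giant one.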

By using 10,000+ modest servers, they are able to distribute the load of 150+ million queries a day...

--
Tails
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
I still find it impressive that google can search 1 billion pages as fast as it does, no matter how many boxes there are
 

bizmark

Banned
Feb 4, 2002
2,311
0
0
awesome. Thanks Bart for the highly detailed info -- that's exactly what I wanted to know (nice comm major diagram BTW!) -- and Zephyr for clarifying what I was asking :)
 

Buddha Bart

Diamond Member
Oct 11, 1999
3,064
0
0
There's actually a lot that can be done to improve my setup (and make it cheaper).

The 3 big servers (4U IBM x350s) were ordered without my consent; they're totally wrong for the project. It would have been much better to load-balance the database reads across 2 or 3 more of the 1Us and direct all writes to a single machine (with a hot standby). That way you're far more fault tolerant, and you can scale a lot better. Plus, the 3 machines I mentioned for that setup would cost about 6K, just over half of what ONE of those x350s cost. That's if you intend to operate your whole DB mostly out of memory. If you ever get to the point where 4 GB isn't enough, you can get an external drive array and just chain that on. Or you can cram a pair of 15Krpm SCSI drives in a RAID 0 setup in there (keep in mind, you don't need redundant RAID here because you have redundant machines, and redundant redundancy is usually better known as 'waste').
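The read/write split described there can be sketched as a simple query router (hypothetical class and hostnames, not a real MySQL driver): SELECTs rotate across the replicated slaves, everything else must hit the single master.

```python
from itertools import cycle

class SplitRouter:
    """Route reads to replicas and writes to the master (illustrative only)."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = cycle(slaves)

    def route(self, sql):
        # SELECTs can go to any replica; writes must go to the master,
        # and replication carries them out to the slaves.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self.slaves)
        return self.master

router = SplitRouter("db-master", ["db-slave1", "db-slave2"])
print(router.route("SELECT * FROM users"))   # -> db-slave1
print(router.route("INSERT INTO logs ..."))  # -> db-master
print(router.route("SELECT now()"))          # -> db-slave2
```

This only works cleanly when reads dominate writes, which is exactly why it scales better than one big box: you add cheap 1U replicas as read traffic grows.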

The file server in that diagram is the single most blatant and downright pathetic case of overkill ever. It exists solely to keep the 'reference copy' of the website and run the rsync scripts that push changes out to the webservers. It could probably keep all the changes we make in a week in its level-2 cache (2 MB).

Now, if you wanted to go even more buckwild with the fault tolerance, each of the machines has 2 ethernet adapters and the ability to fail over between them in hardware, so you could add a second switch and double-cable everything. Frankly... I'm too lazy.

Because this setup is NAT-based on a 2.2 kernel, it maxes out at 4000 connections per second. I think I'll do just fine under that.

bart