Setting up a Beowulf Cluster?

excalibur3

Member
Oct 14, 2005
149
0
0
Has anyone had luck setting up a cluster? I do a lot of computational chemistry and the server we have at school is kind of wimpy and very busy. I came across an article by a guy from Calvin College who built an 8-core / 4-processor Beowulf cluster for $2500 in early 2007. Using Newegg, I figured I could build a 16-core / 4-processor cluster today for under $2000:
4x Core 2 Quad (Kentsfield) - $189/piece
4x gigabit motherboards - $130/piece
16 GB memory - $260 total
4x Antec 350 W power supplies - $20/piece
Seagate 750 GB hard drive - $100
4x case fans

I'm not sure how necessary this is, but:
8 gigabit Ethernet cards
Counting the two ports already on each motherboard, that would give one gigabit Ethernet port per core.

Then you would just need a 16-port gigabit Ethernet switch. The cheapest I found on Newegg was $150, but I've seen them cheaper elsewhere (~$70).

To be honest, what got me started on this was an article where a guy built a 24-node cluster and put it in an Ikea cabinet. He doesn't seem to take as much care about giving each core a dedicated Ethernet line as the guys at Calvin College did, but then again, 24 is way more nodes than 8. Still, at $7.39 per card, it isn't a huge investment.

I guess what I'm wondering is how much of a pain it would be to get everything working together properly. I would obviously want to run Linux and then, preferably, submit jobs to it remotely. I'm familiar with a lot of the basics of Linux and I think this could be a cool project, but I have no idea of the magnitude of what I'm considering. Would it be better to go with a dual-processor motherboard in a similar price range, or is this a somewhat sane idea?
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
It can be done, and I've done it. It is only a slightly sane idea if you REALLY need the extra compute power AND your software scales well over MPI or whatever distributed computing infrastructure it supports.

You should know, or be able to find out, how compute bound, memory bound, or I/O bound your code is, and then design the system according to its needs, optimizing for cost / performance.

A P35-generation Intel chipset with DDR2 PC2-6400 or PC2-8000 memory delivers around 4 GBy to 5 GBy/second of RAM bandwidth given a moderate overclock of the FSB / CPUs. If your application is RAM I/O bound at or below this level of bandwidth, adding CPU cores to the same PC box will not help. I.e., if a single-core CPU running SSE2 code with 4x 32-bit single precision operands per cycle already uses nearly all of the maximum possible 4 GBy/second of CPU-to-RAM bandwidth on a continual basis, then adding a 2nd, 3rd, or 4th CPU core or a 2nd physical CPU to that motherboard will be useless, because the other cores / CPUs will just be fighting the first CPU over RAM access bandwidth it is already consuming entirely. High-end GPUs have RAM bandwidth numbers in the 40 GBy/s to 120 GBy/s range for various card models, i.e. 10x to 30x higher than is possible on the motherboard with DDR2 RAM. So if you're RAM B/W limited, forget the CPU and go with a GPU if feasible.
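As a back-of-the-envelope way to reason about this, here's a rough roofline-style sketch (my own illustration; the numbers are the ballpark figures from this thread, not measurements): sustained throughput is capped by whichever runs out first, peak compute or what the memory bus can feed.

```python
def sustainable_gflops(bandwidth_gbs, flops_per_byte, peak_gflops):
    """Roofline model: throughput is the lesser of peak compute and
    what the memory system can feed (bandwidth x arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# A streaming kernel doing 1 flop per 4-byte float read (0.25 flop/byte)
# on a hypothetical 4 GBy/s DDR2 box is capped at 1 Gflop/s no matter
# how many cores share the bus:
cpu = sustainable_gflops(4.0, 0.25, 48.0)      # -> 1.0

# The same kernel on a GPU with ~100 GBy/s of RAM bandwidth:
gpu = sustainable_gflops(100.0, 0.25, 1000.0)  # -> 25.0
```

Under this (admittedly crude) model, the bandwidth-bound kernel runs 25x faster on the GPU purely because of the wider memory bus, which is the point being made above.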

Q6600-G0 Kentsfield for around $180 is the best Intel CPU bang for the buck if you are CPU bound and can effectively use SMP parallelism within a given PC: decent performance across a wide variety of workloads. Some of the AMD CPUs do post higher arithmetic performance and memory bandwidth benchmarks per dollar and per MHz, though, so YMMV as to the performance of your codes. Generally compilers (Fortran, C++, ...) favor Intel CPUs over AMD ones. The larger cache on the Q9450 can help certain applications as well. The "Nehalem" generation CPUs are looking architecturally attractive for performance over the Kentsfield / Yorkfield quads; there's no firm word on price or exact availability date though.

If you're I/O bound between PCs, then of course adding more GbEth links to the mesh will help, and additional NICs are cheap enough. I don't think many applications need more than 1-4 GbEth links per chassis, though. Figure out the MPI / BOINC / ... I/O needs of your application and scale the connectivity accordingly. Obviously, if you can make effective use of SMP, putting quad-core CPUs in each chassis and equipping each chassis with ample local shared RAM (8 GBy) will cut down the overall communication needs (since more of the traffic is handled locally within a chassis), so a single or dual GbE link out of each chassis may be all you need.
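To see why packing more cores per chassis cuts network traffic, here's a toy model (my own illustrative sketch, assuming uniform all-to-all communication between MPI ranks, which real DFT codes only approximate) counting what fraction of rank pairs must cross the wire:

```python
def offnode_fraction(total_ranks, ranks_per_node):
    """Fraction of rank pairs whose traffic crosses the network,
    assuming uniform all-to-all communication (a simplifying assumption)."""
    nodes = total_ranks // ranks_per_node
    all_pairs = total_ranks * (total_ranks - 1) // 2
    intra_pairs = nodes * ranks_per_node * (ranks_per_node - 1) // 2
    return 1 - intra_pairs / all_pairs

# 16 ranks spread 1 per node: every communicating pair is off-node.
print(offnode_fraction(16, 1))   # 1.0
# 16 ranks packed 4 per quad-core node: only 80% of pairs hit the network.
print(offnode_fraction(16, 4))   # 0.8
```

The absolute traffic savings grow further once you count message volume, since neighbors on the same node exchange data through shared memory at RAM speed instead of over GbE.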

If your application is embarrassingly parallel and heavily compute bound or heavily RAM I/O bound, look at GPUs in single precision or double precision instead of CPU cores. One HD4850 GPU gives a peak of around 1 Teraflop/s in single precision versus a peak of around 48 Gflop/s single precision for a Q6600-class CPU; clearly there is just no comparison for embarrassingly parallel, simply threaded, **MATH HEAVY** applications. If your application is RAM bandwidth heavy / register heavy, or has very significant serial portions that don't parallelize, then your actual performance will drop WAY, WAY below peak values on both GPUs and CPUs. It is not uncommon to see GPUs maxed out on RAM bandwidth, yet still so bandwidth starved that the GPU ALUs run at 1% or less efficiency, in very I/O heavy workloads where there isn't enough math that can run from a SMALL on-chip cache / register set to take pressure off the memory load/store bus. Of course, even if you can only use a tiny fraction of the ALUs on a GPU efficiently (the HD4850 has 800 single precision ALUs, or 160 double precision ones), the 10x higher RAM bandwidth of the GPU vs. a CPU can still make the GPU win big over even a high-end quad-CPU box.

VMD and Folding@Home are computational chemistry applications that run well on GPUs.

Keep in mind that a fully loaded quad-core Kentsfield box will be drawing around 200-300 watts full time while it is doing heavy calculations, and will seldom consume less than 160 watts even when IDLE (unless put to sleep or powered off). Multiply that out over a cluster and you've got well over 1 kW of draw for even a small cluster, before you involve high-end GPUs in the mix. Run 24/7, that is comparable to the entire electricity consumption of many households, so you could easily add $70/month or more in electricity bills just to pay for the cluster's energy use.

*** Also, you will find that it makes the room it is in INTOLERABLY HOT *** unless you have EXCELLENT active ventilation, basically exhausting the entire contents of the room's air outside several times per hour. If the outside air is too hot to blow into the room, congratulations: you're now the proud owner of a new 1-ton (12,000 BTU) or larger (15,000, 18,000) air conditioner for the cluster room, and probably another $50/month or so in air conditioning bills on top of the electricity cost to run the cluster.

The cost savings on the PC hardware make this a tempting project, but paying for electricity and cooling / air conditioning will dominate the cost of the project by the 2nd year or so, and may well make the difference between cost-ineffective and attractive.

Also, these will be noisy units, so I hope you have a 200-400 sq. ft. office room (with ventilation / air conditioning) to put the cluster in; they'll be louder than most people want to work next to.

Also keep in mind that consumer-class hardware doesn't use ECC memory, so several times per month you may get random glitches of data corruption appearing somewhere in the memory or calculations of your system. You might crash, or you might (and probably will) just get a wrong numeric result with no other indication of trouble. These are not the kinds of machines to run month-long simulations on with no built-in error checking in the calculations; otherwise you'll have a nice heater that reliably gives you a wrong result. If your calculation runs usually finish within a few days or a few hours, though, then most of them will "probably" calculate correctly, with the occasional botched one.
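To put a rough number on that, assume upsets arrive independently at some constant average rate (a Poisson model; the 3-per-month rate below is a made-up illustration matching "several times per month", not a measured figure):

```python
import math

def p_clean_run(upsets_per_month, run_days):
    """Probability a run sees zero upsets, assuming upsets are
    independent events at a constant average rate (Poisson model)."""
    expected_upsets = upsets_per_month * run_days / 30.0
    return math.exp(-expected_upsets)

# At a hypothetical 3 upsets/month per box:
print(p_clean_run(3, 3))    # ~0.74 for a 3-day run
print(p_clean_run(3, 30))   # ~0.05 for a month-long run
```

Which matches the intuition above: a few-day run will probably finish clean, while a month-long run on non-ECC hardware almost certainly won't.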

 

excalibur3

Member
Oct 14, 2005
149
0
0
Wow! Thank you for such a detailed response! The thing I don't get, though, is that this should only be as hot and power intensive as 4 computers, right? Or are those quad cores much more power hungry than most other computers? I guess I never realized how much power computers use compared to other things in the house. Does this mean a typical computer consumes 25% of the electricity in the house, or is it different because the cluster runs all the time? I had never realized how much a single desktop heats up a room. Does this really become essential to consider at 4 processors, or only for a huge cluster? How large was the one you made? Now that I look into it, the plans I read for the 24-core cluster only drew 400 W when running. Why wouldn't my potential cluster use around ~260 W then? If that's the case, it should only run as hot as 2.5 (100 W) light bulbs, right?
I definitely need to do some research about where exactly the bottleneck is, though. The programs I'd want to run do density functional theory calculations, namely VASP and SIESTA. I've noticed significant speedups when using more nodes on the cluster I use now, but I only have a qualitative sense of it. I can't thank you enough for this response; there are so many things, like error correction, that I take for granted in normal computer operation, and I hadn't thought about how a setup would have to change if you can't accept any errors. I would say my calculations would take a few days to a week max. They largely iterate, trying to find the lowest-energy state of a system.
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
Use a better quality PSU than an Antec 350; you'll be glad you did.
In fact, if you put a GPU card into the system you'll have no choice BUT to use a better PSU; my pal just blew the (non-user-serviceable) fuse on an Antec 450 W unit with a Q9450 / 4 GB RAM / HDD / X1950 GPU system.

Something like a Corsair HX520W would be more appropriate. Don't forget a 500 VA to 750 VA UPS for each box too, and even that would have just enough runtime to shut the box down gracefully.

Since you're not building dozens of boxes with 1GBy each, I'd look to see if 8GB RAM / box would make sense given the scale of your simulations of interest. It is cheap enough that it's almost silly NOT to go with more RAM unless your application just has no need / use for it.

You may want to buy better CPU heatsinks like the Xigmatek HDT-S1283 ($25/unit on sale recently at Newegg). It is WAY better than the Intel stock cooler, which matters if you're going to overclock the CPUs to around 3.0-3.3 GHz, which is very conservatively achievable on a Q6600/Q9450. They're also quieter than the Intel units while dissipating heat more efficiently and keeping the CPUs cooler.

If you can find some of the clearance-priced Rocketfish tower cases at Best Buy (few stores have them in stock any more, but if you look around you can probably find 4 somewhere...), that's $50 or so for a good full tower. Otherwise something like the Antec P180B would be a nice case.

I'm guessing you'll probably end up with 2 quad-core X38/X48 boxes with 2x PCIe 2.0 x16 slots and a couple of 1 GBy-or-so VRAM GPUs (like the Tesla C1060 or GTX 260) per case for computational chemistry, rather than 4 quad-core no-GPU boxes...
http://www.nvidia.com/object/t...mputing_solutions.html

Anyway, start with 1 quad-core (+ GPU, if NVIDIA+CUDA or AMD+Brook/CAL works for you) system, benchmark your code, and see how it works. If it works well, add a second and see how the efficiency scales. If you get something like 170% performance when you add the 2nd unit, consider a 3rd... check the scaling... look for I/O dependence (is GbEth a limit or not?)... if it looks good, add another.....
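That "check the scaling" step can be made concrete with a simple efficiency calculation (the 1.7x figure is just the example number from the previous paragraph; the ~70-80% cutoff is a common rule of thumb, not a hard rule):

```python
def parallel_efficiency(speedup, n_nodes):
    """Measured speedup relative to ideal linear scaling."""
    return speedup / n_nodes

# 1.7x total throughput on 2 nodes = 85% efficiency; a common rule of
# thumb is to keep adding nodes only while efficiency stays above ~70-80%.
print(parallel_efficiency(1.7, 2))   # 0.85
```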

 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0

I run 4 nodes, each a Q6600 / 8 GB RAM, over GbEthernet, plus a few dual-core and a couple of single-core nodes in addition to those. They're not dedicated to cluster use, since sometimes they're used for sandboxes, development, individual applications, etc.

Yes, it is only as power intensive as 4 computers, but most computers aren't quad core, and most run at around 10% or less computational load. Since power draw is dominated by the actual computations, energy consumption rises pretty significantly as you push the workload to well over 70% of the total CPU capacity of the units. I believe 200 W-350 W is pretty average for a Q6600 system with a typical motherboard, hard disk, efficient power supply, a few fans, and a modest video card. I can dig up specific benchmarks of actual numbers measured from the UPS or from the wall socket, but they're usually consistently in that range unless very idle or equipped with very high power GPU card(s).

Still, 250 W x 4 = 1 kW, and 30 days/month * 24 hours/day * 1 kW = 720 kWh/month just for the PCs running 24/7. That's probably comparable to, or a little more than, the entire current usage of an average US household.
There are some good cost estimators of the cost of running computers 24/7 under heavy load done by some of the distributed computing participants like people running Folding@Home.

Here's an actual AC power draw benchmark: a Q6600 whole system drew 177 W IDLE, 230 W LOAD. Keep in mind that your 'LOAD' may be more intensive than theirs if your software makes more efficient use of the hardware, and overclocking will add power consumption proportionally:
http://techgage.com/article/in...2_quad_q9450_266ghz/12

Anyway, figure your specific cost given your utility rates per kWh in the 1 MWh/month to 2 MWh/month range and you'll know the direct electricity cost. Add the costs for cooling / ventilation and you'll have more realistic numbers for operational costs.

1 kW of heating is around 3,400 BTU/hour of added heat, so figure that amount on top of whatever AC size would normally be needed to keep a room of that size and type below 80F on a hot day; that gives your minimum AC size. Usually that ends up being a minimum 5,000 BTU unit for a small closed room, + 3,400 = 8,400... round up and that's 10k BTU, more like 12k or 15k if the duty factor will be large, the room runs hot even before the PCs are added, or you're contemplating equipment expansions.
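The sizing arithmetic above, as a sketch (the 5,000 BTU small-room baseline and the unit sizes are the assumptions from the preceding paragraph, not an HVAC standard):

```python
def ac_btu_estimate(it_watts, room_baseline_btu=5000):
    """Minimum AC size: room baseline plus ~3.412 BTU/h per watt of
    computer load, rounded up to the next common window-unit size."""
    needed = room_baseline_btu + it_watts * 3.412
    for size in (5000, 8000, 10000, 12000, 15000, 18000):
        if size >= needed:
            return size
    return 24000  # beyond window units; time for real HVAC

print(ac_btu_estimate(1000))   # 10000 for a ~1 kW cluster in a small room
```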

Heat is basically watts for watts, so if you assume that 100% of your power draw from the wall is dissipated as heat (a good assumption), then you're looking at around 1 kW of heat load in the room for 4x 250 W units: something approximating a modest space heater or hair dryer. Not HUGE, but hardly insignificant. It is usually 10F hotter in the computer room here than in the adjacent room, and that's usually with the doors / windows open and some fans going, etc. AC or strong ventilation is just a necessity on hot days: one, for staying comfortable, and two, for keeping the systems cool enough that the heat doesn't give them problems (typically you really don't want to exceed a CPU LOAD temperature of 65C or so, and often you choose your overclock to keep it near that level on "normal" temperature days).

Your cluster could use less than 150-250 W/node if your nodes aren't very busy computationally, or if they use extremely efficient CPUs; the Yorkfield is far more power efficient at low loads / idle than the Kentsfield, for instance. Then again, why build a cluster if you're going to keep it at low load most of the time?

A single high end GPU would use something like 150W all by itself, with some being well higher than that, so basically add another 300W or so to the total power budget for a pair of those in each system case if you're going GPGPU.

Iterative calculations can sometimes be made more tolerant of SEU (single event upset: a random glitch flipping a bit) errors, since they can be programmed to include some kind of per-iteration sanity check on the reasonableness or consistency of the calculation. Of course, it helps if the problem is well conditioned, so that small errors don't magnify over time into large catastrophic ones. You'd have to check with your code providers to see what their experience has been with error rates and handling on individual workstations and clusters, keeping in mind that most academic / commercial clusters use ECC-corrected RAM and hardware from the start, so they'll see dramatically fewer errors than you might with consumer commodity hardware. Maybe the Pande Group (Folding@Home) can suggest helpful resources for estimating your error rates with commodity PC clustering.
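A minimal sketch of that per-iteration sanity-checking idea (my own illustration, not VASP/SIESTA code: `step` and `energy` stand in for whatever update and objective your solver uses, and a non-finite or implausibly jumping energy is treated as possible corruption and rolled back):

```python
import math

def minimize_with_checks(step, energy, x0, max_iter=200, jump_tol=10.0):
    """Iterative minimization with a crude corruption check: if the
    energy goes non-finite or jumps up implausibly, roll back to the
    last trusted state instead of propagating the glitch."""
    x, e = x0, energy(x0)
    good_x, good_e = x, e          # in-memory "checkpoint"
    for _ in range(max_iter):
        x_new = step(x)
        e_new = energy(x_new)
        if not math.isfinite(e_new) or e_new > good_e + jump_tol:
            x, e = good_x, good_e  # discard the suspect iterate
            continue
        good_x, good_e = x_new, e_new
        x, e = x_new, e_new
    return x, e

# Toy quadratic "relaxation": converges toward the x = 0 minimum.
x_min, e_min = minimize_with_checks(lambda x: 0.8 * x, lambda x: x * x, 1.0)
```

Real codes would checkpoint to disk and use physically meaningful bounds (total energy, charge conservation) rather than a fixed jump tolerance.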
