looking for a distributed computing consultant

alyarb

Platinum Member
Jan 25, 2009
2,425
0
76
I work for a cloud services company. Our bread and butter is online backup/DR, but we also colocate and host servers; Dell blades + ESXi is our preferred setup for hosting.

My boss is intrigued by the success others have had using commodity desktop PCs as worker nodes in vast Linux clusters.

I, on the other hand, am skeptical that the commodity approach is as cost-efficient as our current storage hardware.

I'm looking for someone with solid experience setting up and administering Hadoop/HDFS, FhGFS, Gluster, QFS, or any other open-source distributed file system, so that we can have a conference call during our next meeting and get some conventional wisdom. Show us what you've set up, name your price, and hopefully I can just put you on speaker and you'll blow our minds. We can set up screen-sharing software and I can put the display on the projector as well if you want to show us something.

I do not have the experience to back up my claim that HPC-style storage clusters are not a good fit for bulk archival storage. Yes, an HDFS/FhGFS cluster offers high aggregate bandwidth, but we are not bandwidth-limited, and the cost is still higher than our current setup for less total storage. Plus, most of these systems default to a replication factor of 3, so for every block you write, two replicas get pushed to other nodes and the usable storage in your cluster is divided by three. That's not cheap storage at all; it just offers high aggregate bandwidth and reasonable fault tolerance.
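A quick back-of-the-envelope comparison makes the point. This is a minimal sketch assuming 4 TB drives and 16-drive RAID 6 groups, which are illustrative numbers, not our actual hardware:

```python
# Usable capacity: 3x replication vs. RAID 6 parity.
# Drive size and RAID group width are illustrative assumptions.
drives = 384                # drives per rack (our current density)
drive_tb = 4                # assumed TB per drive
raw_tb = drives * drive_tb  # 1536 TB raw

# HDFS-style 3x replication divides usable capacity by 3.
usable_replicated = raw_tb / 3                    # 512 TB

# RAID 6 in 16-drive groups loses only 2 parity drives per group.
groups = drives // 16
usable_raid6 = (drives - 2 * groups) * drive_tb   # 1344 TB

print(usable_replicated, usable_raid6)            # 512.0 1344
```

Same drives, but the parity layout keeps more than 2.5x the usable capacity, at the cost of rebuild behavior rather than whole-block replicas.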

We aren't bandwidth-starved, and our current setup is pretty dense (384 drives per rack, about one petabyte of usable storage per rack after parity). In fact, I have tried to price out little barebones systems with cheap 2 TB drives. It still comes out to over $400 per node and nowhere near one petabyte per rack, not even half a petabyte. With that many nodes you also waste far too many rack units on 48-port switches rather than disks.

What it all boils down to is that we are all helpless little babies when it comes to distributed storage or computing, but we want to explore it and are looking for someone to share their experience. What do you guys think? Am I in the wrong forum?
 

lxskllr

No Lifer
Nov 30, 2004
60,926
11,258
126
I'm not sure what forum it should go in. That's kind of a technical niche, and I'm not sure you'll get many bites here. I'd try other places as a backup. Maybe look around reddit. They have a broad range of users, and someone may have experience in what you're looking for.
 

Fallen Kell

Diamond Member
Oct 9, 1999
6,249
561
126
The people who have success using desktops in vast Linux clusters are places that already have vast numbers of desktops. Using something like Open Grid Engine, it is extremely easy to harvest the spare CPU cycles of systems spread across your internal network and have them do meaningful work, as long as the work is CPU-intensive and not I/O- or memory-bound. If the jobs are written so they can be paused, Grid Engine will pre-empt the "cluster" job in favor of the end user's desktop tasks (assuming he/she is doing something demanding on the computer). But if the cluster job has high memory requirements or does a lot of I/O, commodity desktops are HORRIBLE for this kind of work. You can set up different run queues or create custom consumable resources within Grid Engine to set limits on what certain hardware can actually handle (e.g., you know from past profiling that your job needs about 10GB of RAM, so at submit time you request only systems with 10GB or more of available RAM and "consume" that amount from the consumable resource, which is freed once your job ends).
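The consumable-resource bookkeeping can be illustrated with a toy sketch. This is not the Grid Engine API, just the accounting it performs; the host names and sizes are made up:

```python
# Toy model of a Grid Engine-style consumable resource: a job requests
# memory at submit time, the scheduler only places it on a host with
# enough free, and the amount stays "consumed" until the job ends.
hosts = {"node1": 8, "node2": 32}   # free RAM in GB (hypothetical)

def place(job_gb):
    """Return a host with job_gb free and consume it, else None."""
    for name, free in hosts.items():
        if free >= job_gb:
            hosts[name] = free - job_gb   # consume the resource
            return name
    return None                           # job waits in the queue

def release(host, job_gb):
    hosts[host] += job_gb                 # freed when the job exits

h = place(10)        # the hypothetical 10 GB job skips node1 (only 8 free)
release(h, 10)       # capacity returns when the job finishes
```

The point of the mechanism is exactly this: jobs that would thrash a small desktop never land on it in the first place.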

But if you are looking to just buy a bunch of desktops and set them up in a hosting room, that is a foolish thing to do. Those systems are extremely limited in memory, I/O, and network bandwidth. You are much better off buying cheap dual-CPU 1U servers, putting 4 or 8GB of RAM per CPU core in them, and even just using the built-in dual or quad gigabit Ethernet links for networking. Real clusters, though, use InfiniBand, and connect to their storage systems over it as well for low-latency, high-performance access, since most tasks are actually bound by I/O or by message passing between CPUs/systems.
 

alyarb

Platinum Member
Jan 25, 2009
2,425
0
76
I guess I wasn't as clear in the OP as I was in my own mind: we are just evaluating different storage clusters for a backup client we are currently developing.

1 gigabit Ethernet per node will offer plenty of throughput, and we don't need much CPU at all in the storage node. All the client does is "put" or "append", with the occasional "get."

We are not doing any kind of analytics or execution on the data; we just want to explore the "cheap storage with commodity parts" possibilities everyone is yapping about.

We want the throughput and availability of HDFS, but the durability and capacity tradeoffs against our multiple RAID setups aren't clear.

Here's a pretty rough node I put together. It's 1U; there is no chassis, just a thin flat shelf that slides out of the rack, with the parts sitting on it. Assume I will come up with a reasonable way to fasten everything down without screws:

http://www.sdc-hosting.com/images/datanode_draft2.png

40 Pentium-class DataNodes in a rack, a beefy Ivy Bridge-EP NameNode, and a 48-port switch at the top with 10-gigabit uplinks. I'm using the Hadoop vocabulary for now because that's what everyone seems to be using. My point is that even if we get 2.2 petabytes of raw storage in a rack, after the cost of replication we are down to about 750 TB usable in the rack, which is less than we are achieving now with our RAIDs.
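For concreteness, the arithmetic behind those numbers (assuming 4 TB drives, which is what 2.2 PB raw across 40 fourteen-drive nodes implies):

```python
# Raw vs. usable capacity for the proposed rack (figures from the post).
nodes = 40
drives_per_node = 14
drive_tb = 4                                  # assumed 4 TB drives
raw_tb = nodes * drives_per_node * drive_tb   # 2240 TB, i.e. ~2.2 PB raw
usable_tb = raw_tb / 3                        # after 3x replication
print(raw_tb, round(usable_tb))               # 2240 747
```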

I'm just hoping a Hadoop expert can come in and make up our minds for us.
 

ggadrian

Senior member
May 23, 2013
270
0
76
I don't know a lot about the matter, but I think Windows Server 2012 R2 puts a lot of emphasis on using servers with commodity hardware as storage nodes; you might want to check it out.
 

Fallen Kell

Diamond Member
Oct 9, 1999
6,249
561
126
alyarb said:

I guess I wasn't as clear in the OP as I was in my own mind: we are just evaluating different storage clusters for a backup client we are currently developing.

1 gigabit Ethernet per node will offer plenty of throughput, and we don't need much CPU at all in the storage node. All the client does is "put" or "append", with the occasional "get."

We are not doing any kind of analytics or execution on the data; we just want to explore the "cheap storage with commodity parts" possibilities everyone is yapping about.

We want the throughput and availability of HDFS, but the durability and capacity tradeoffs against our multiple RAID setups aren't clear.

Here's a pretty rough node I put together. It's 1U; there is no chassis, just a thin flat shelf that slides out of the rack, with the parts sitting on it. Assume I will come up with a reasonable way to fasten everything down without screws:

http://www.sdc-hosting.com/images/datanode_draft2.png

40 Pentium-class DataNodes in a rack, a beefy Ivy Bridge-EP NameNode, and a 48-port switch at the top with 10-gigabit uplinks. I'm using the Hadoop vocabulary for now because that's what everyone seems to be using. My point is that even if we get 2.2 petabytes of raw storage in a rack, after the cost of replication we are down to about 750 TB usable in the rack, which is less than we are achieving now with our RAIDs.

I'm just hoping a Hadoop expert can come in and make up our minds for us.

I am all for cheap storage with commodity parts, but that design is going to be a nightmare for maintenance, let alone the vibration, EMI, and electrocution/fire risks to the operators and other equipment, and COOLING. Disks die, pure and simple. As you noted with your replication, 750TB is all that is usable, and on top of that you have to take an entire group of 14 drives offline at once just to replace a single failed drive. You are sacrificing far too many disks to replication/redundancy while gaining no real operational redundancy, because a simple disk failure forces so many extra disks offline for maintenance.

Let's do the math: you have approximately 560 hard drives in a rack with that configuration. Studies of HDD failure rates put it between 3% and 5% per disk annually. With 560 disks, that means each year you should expect to replace around 28 disks (put another way, about one disk failure every other week). And disk failure rates are not flat over time; they follow a "bathtub" curve (high in the beginning, low for about two years, then growing rapidly after about the third year, reaching a 60-70% chance of failure per year around the fifth year). In practice that means disks older than three years should be considered for outright replacement, and after about five years you are at serious risk of catastrophic data loss, potentially losing half your hard drives in a single year if you have not already replaced them.
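That failure arithmetic, sketched out (the 3-5% annualized failure rate is the range cited above):

```python
# Expected annual drive replacements at a 3-5% annualized failure rate.
drives = 560
for afr in (0.03, 0.05):
    expected = drives * afr          # 16.8 to 28 failures per year
    weeks_between = 52 / expected    # roughly one every 2-3 weeks
    print(round(expected, 1), round(weeks_between, 1))
```

And remember this is the flat bottom of the bathtub curve; in years four and five the same rack generates several times as many swaps.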

Now let's look at the cost of a hard drive swap. In that setup, you need to identify the rack unit with the problem, verify that you can safely take the 13 other good disks in that node offline, shut down the node, and disconnect the power and network (I am assuming you will not have rack-mount cable-management arms that let you slide out the shelf, since this is a custom shelf). Then you pull out the shelf, identify the bad disk (hopefully your controller can tell you which port it was connected to, since you do not have the status LEDs that storage arrays have), swap the bad disk, slide the shelf back into the rack, reconnect power and networking, boot the system back up, tell your storage management software that you replaced the failed disk and that the other 13 are back online (the second part might happen automatically), and then wait for all 14 disks to be rebuilt. The 13 "good" disks need to catch up on all the writes that occurred while they were offline, and most software will not remember the state each disk was in and which writes it missed, so it will simply do a full rebuild just as if each were a replaced disk. With 4TB drives that are, say, 50% used, that means 28TB of data transferred to that node, 28TB of read activity from the replica drives across the rest of the storage solution, and drives that only sustain true write speeds of 50-60MB/s; it will take 10-14 hours to rebuild at absolute best, assuming the software is smart and only transfers actual data. If your disks are 80-90% utilized, you are into 24-hour rebuild times.
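The rebuild estimate works out as follows; a minimal sketch assuming the 14 drives rebuild in parallel and only real data is copied:

```python
# Rebuild time for one 14-drive node, drives rebuilding in parallel.
drive_tb = 4
write_mb_s = 55                         # sustained write, midpoint of 50-60 MB/s

def rebuild_hours(utilization):
    data_mb = drive_tb * utilization * 1e6   # TB -> MB per drive
    return data_mb / write_mb_s / 3600       # each drive fills independently

print(round(rebuild_hours(0.5), 1))     # ~10 hours at 50% full
print(round(rebuild_hours(0.9), 1))     # ~18 hours at 90% full
```

And that assumes the network and the replica drives on the read side can keep up; in practice they are serving production traffic at the same time, which is where the 24-hour figure comes from.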

I'm not going to go into the electrical and fire hazards of not using a case. Your fire code may not have a problem with it, but ours does, and the fire marshal would shut us down without proper equipment certifications (to say nothing of the insurance company denying any claims).

You are also going to need a way to damp the vibration of those drives; simply screwing them down to a shelf will transfer it straight into the rack. With that many disks spinning in one rack, you will exceed the drives' vibration limits without damping (just look at the white papers out there from places like Google, Nexsan, etc.). I believe Nexsan oriented their drives bottom-to-bottom so that adjacent disks spin in opposite directions, partially cancelling the vibration within each two-disk group and lowering the overall vibration enough for the damping material on the mounting arm to absorb the rest.



Look, I am not saying this can't be done, just that there are a lot of problems you have not looked at or run into yet because you have not moved to scale. Sure, 14 disks will work fine like that on a rack tray. Scale the vibration up across 40 trays and you will have data corruption and head crashes. Add in the extra maintenance pain of a simple disk swap and your costs skyrocket. Being able to access and hot-swap a failed disk is something you really need when you are talking about hundreds of drives. In your configuration, a 30-second task just turned into a 30-minute task, and a single-disk rebuild just turned into a 14-disk rebuild.

You are much, MUCH better off building your own custom disk arrays using COTS (commercial off-the-shelf) array cases like the Supermicro SC847DE16-R1K28LPB, which holds 72 hot-swap drives in 4U of space (it uses 2 disks per grouping, so to replace a single disk you have to wait for 2 drives to rebuild, but that is 7x better than the 14 disks in your setup). Using that case, you increase your disk density to 18 per 1U, an almost 30% improvement over your design. And they have already engineered proper cooling for that number of disks in such a small space. In that Supermicro case, you simply need a board like the one I am looking at for my home setup (a Supermicro X9SRL-F LGA 2011 board with an Intel Xeon E5-1620 v2 CPU, roughly $600 total for both, and you can use regular unbuffered non-ECC DDR3 RAM as well). That ~$600 is cheaper than the $840 for the four motherboard + CPU sets you would need for 4U of rack space, and the difference leaves room for the extra SATA controller cards you would need to address 72 drives, making the cost about the same. The only added cost is the case itself, but you save around $1,300 in fan and power-supply costs, which would almost pay for the entire case (between $1,500 and $1,800). And for that, you get the benefit of transferring the risk of failure from your engineers to Supermicro (or whoever makes the case), since they certify that it will operate correctly and send replacement parts if it fails. Oh, and you also gain redundant power.
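Pulling the prices from that comparison together (all figures are the rough ones quoted above; taking the midpoint of the chassis price range is my assumption):

```python
# Rough 4U cost comparison: custom shelves vs. one Supermicro chassis.
shelf_boards = 840                 # four motherboard + CPU sets, 4x 1U shelves
sm_board_cpu = 600                 # X9SRL-F + Xeon E5-1620 v2
sm_case = (1500 + 1800) / 2        # chassis, midpoint of the quoted range
fan_psu_savings = 1300             # fans/PSUs already built into the chassis

net_extra = (sm_board_cpu + sm_case) - shelf_boards - fan_psu_savings
print(net_extra)                   # ~$110 extra at the midpoint, before the
                                   # SATA HBAs either build needs anyway
```

Roughly a wash on hardware cost, and the difference buys hot-swap bays, engineered cooling, redundant power, and a vendor warranty.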
 

Fallen Kell

Diamond Member
Oct 9, 1999
6,249
561
126
I was trying to remember the name of this company when I posted my last reply, and they just turned up on one of the sites I monitor today, so I thought I would post here again since it directly pertains.

These guys open-sourced their hardware solution to the very problem you are discussing, including custom case schematics, etc. This is the latest version they have posted: http://blog.backblaze.com/2013/02/20/180tb-of-good-vibrations-storage-pod-3-0/

It is not as dense as the Supermicro case I mentioned above, but it is under ~$2,000 for the case, power supply, fans, motherboard, CPU, RAM, and OS drive in their suggested configuration (which you can customize; they just recommend it based on their experience), and it fits 45 drives (plus the OS drive) in 4U of space. These pods were designed for mass storage, not necessarily high-performance storage (similar to the configuration I listed above).

They also publish fairly extensive hard-drive failure statistics for many different models of consumer drives. A good read if you are looking to use massive numbers of consumer drives (nothing that hasn't been studied before, but good empirical confirmation of other studies, since they have 20,000+ hard drives in use and collect statistics on them).