
Does anybody have any experience with work distribution algorithms?

Argo

Lifer
Here's a quick synopsis of the algorithm:

You have 'n' computers, each with 't' available slots for work. Once in a while, one of the computers needs to distribute 'k' tasks among the other boxes.

The algorithm involves figuring out which box(es) to send the work to. There are several potential algorithms. The easiest is to send work to the box with the largest number of free slots (the least busy box). However, that algorithm exhibits a "black hole" problem - everybody sends work to the least busy box, quickly overloading it.

Another potential algorithm is round-robin distribution, yet that seems inefficient.

I'd like to learn more on the subject and I'm wondering if anyone can point me towards some research done on this.
 
You can make some sort of work queue and have the slave nodes pick off jobs when they are available.
You can have slave nodes phone home to the master process to request work.

Tracking the least busy box would actually require considerable overhead to record the state of the cluster. If you're going to invest that much, you could at least add a mechanism that allows a slave node to reject work, to avoid the black hole problem.
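A toy sketch of that pull model, with an in-process `queue.Queue` standing in for whatever transport the cluster actually uses (all names here are invented for illustration):

```python
import queue
import threading

# Shared job queue: worker nodes drain it when they have a free slot,
# instead of a master pushing work at them.
jobs = queue.Queue()
results = []
results_lock = threading.Lock()

def worker(node_id):
    # A node "rejects" excess work simply by not asking for more than
    # it can run; nothing can flood it from outside.
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return  # cluster-wide, nothing left to do
        with results_lock:
            results.append((node_id, job))

for j in range(10):
    jobs.put(j)

threads = [threading.Thread(target=worker, args=(n,)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each worker stands in for one node's free slot; the "phone home" variant is the same loop with the `get` going over the network instead of to a local object.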
 
That's the problem - there are no slaves/masters here. Every box is equal and any of them can be the "work distributor".
 
What I do, not for clusters, but for thread pooling on multi-core systems, is have each execution slot self-multitask. Instead of dishing out the work manually with a master planner and guessing and re-balancing workload, you have the first available execution slot just grab the next piece of work, via a monitor/idle thread that each core runs when it's not executing a user work job. Basically it's communism and capitalism in programming 😉 A master planner isn't as efficient as letting each core determine when it's available for work on its own.

You can either replicate the job list to all machines and keep it updated across the cluster each time the list changes, or simply implement a broadcast protocol to enumerate available jobs across the cluster network, where each computer simply replies to the broadcast with all its queued jobs.

So if you have 3 PCs with two cores each (6 concurrent execution slots) and 7 jobs, one machine is going to have 1 job in queue. The first of the 6 execution slots that frees up would self-multitask by first checking the local queue, which, if empty, would broadcast to other machines for their queues until a job is found. In this case only the one machine would reply with its 1 available job.
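A toy version of that local-queue-then-broadcast lookup (machine names and job names invented; a real version would use an actual network broadcast rather than a shared dict):

```python
# Simulated cluster state from the 3-PC / 7-job example: six of the
# jobs are already executing, one is still queued on pc3.
queues = {"pc1": [], "pc2": [], "pc3": ["job7"]}

def next_job(machine):
    # An idle slot checks its local queue first...
    if queues[machine]:
        return queues[machine].pop(0)
    # ...and only then "broadcasts" to the other machines for their queues.
    for other, q in queues.items():
        if other != machine and q:
            return q.pop(0)
    return None  # nothing left anywhere in the cluster

first = next_job("pc1")  # pulls 'job7' over from pc3's queue
```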

Autonomous self-tasking is the ideal solution for a 'masterless' distribution system. In fact there is a paper on this tactic as one of the three primary distribution models for Cell, and this would be the same thing, except you'd have to extend the job search functionality across a network that has fragmented and distributed job lists. (The other two models for Cell are using the PPE as master planner with the SPEs as slaves that only do what they are assigned, or arranging the SPEs like an assembly line, serially, where each SPE does a specific task and all data passes sequentially through all SPEs that have fixed jobs.)

You could even optimize it a bit by having the monitor/idle job (the one that runs on each execution slot by default when there is no work to do) automatically rebalance job queues in the background, so that they get shuffled around concurrently while other work is being done. The idea being that the first execution slot that finishes its current job on a machine with an empty local job queue goes into a 'redistribute' mode, which only one slot can be in at any time, and balances the job queues so that all the local queues have something in them by the time the next slot opens up. Ideally, that next available slot will be the one that just completed the redistribution and re-entered its monitor/idle/job-hunt loop, while the others across the cluster won't even know their local queues were ever empty.

The only major synchronization here will be that only one execution slot can go into redistribution mode cluster wide, and on the local level, adding or removing jobs to or from the local queue must be thread safe (locked).

It will be kind of tricky. For example, what happens when one core is looking for work across the cluster, and all the other cores cluster-wide are working on jobs and there isn't a core available to service the broadcast right away? Should the remote machine wait until the next job completes and service the broadcast as part of the monitor/idle loop before grabbing the next local job, or should such system-level requests take precedence and interrupt immediately? What happens if, by the time all the network broadcasting goes on, another core has in the meantime grabbed the job off its local queue, aborting the negotiation of the transfer and wasting both CPU time and LAN bandwidth? Should you skip the job being transferred, or better, only transfer jobs from the back of the queue instead of the front? Use timestamps and priorities to determine a distribution schema? etc. Lots to think about.
 
Someone needs to look at Sun Grid Engine...

That said, for developing your own setup, the only real solution is to have a "master" node that schedules all the jobs to other nodes. All the other nodes can produce work units, but they submit them to the master, which then figures out where to actually run each unit. All the nodes should be updating the master with basic information like CPU usage %, memory usage %, network usage %, etc., so that the master can better decide which system to send the unit to, not just based on "x slots free" - your unit of work might be disk-I/O intensive, spending most of its time reading/writing files while the CPU sits idle and is wasted. The master can also keep track of units it has sent and add simulated "load" to a system, with a decay time, so it knows that a unit was just sent to that machine and may not yet be showing up in the actual CPU usage. In this way, the genuinely least-utilized systems get the newest work units. The master can then queue up units that can't run right away and submit them as slots free up, either on a FIFO basis or by some other priority given to the individual work units.
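That dispatch-with-simulated-load idea could be sketched roughly like this; `Master`, the `DECAY` constant, and the load numbers are all invented for illustration:

```python
import time

class Master:
    # Hypothetical master-side bookkeeping: last reported load per node,
    # plus simulated load for units just dispatched, decaying over time.
    DECAY = 5.0  # seconds until a dispatched unit should show up in real stats

    def __init__(self, nodes):
        self.reported = {n: 0.0 for n in nodes}  # last reported load (0..1)
        self.pending = {n: [] for n in nodes}    # (dispatch_time, weight) pairs

    def report(self, node, load):
        self.reported[node] = load

    def effective_load(self, node, now):
        # Dispatched-but-unreported work still counts, linearly fading to zero.
        extra = sum(w * max(0.0, 1 - (now - t) / self.DECAY)
                    for t, w in self.pending[node])
        return self.reported[node] + extra

    def dispatch(self, weight=1.0, now=None):
        now = time.monotonic() if now is None else now
        node = min(self.reported, key=lambda n: self.effective_load(n, now))
        self.pending[node].append((now, weight))
        return node

m = Master(["a", "b"])
m.report("a", 0.2)
m.report("b", 0.3)
first = m.dispatch(now=0.0)   # 'a': lowest reported load
second = m.dispatch(now=0.0)  # 'b': 'a' now carries simulated load
```

The decay is what prevents the master from dog-piling one node between its status reports.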

Personally I wouldn't reinvent the wheel here. Look at Sun Grid Engine. It's a free download and works on Linux, Unix, and Windows.
 
I'd try a CSMA sort of thing, if you can afford some overhead in distribution. Basically, if a box that is low on work gets flooded really quickly, it starts rejecting work orders. The boxes distributing the rejected orders see this and back off for a random period of time before trying again to find the least loaded box. I don't know how far that would scale, but it would be completely decentralized.
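A rough sketch of that reject-and-back-off loop; `Box`, the capacities, and the load-fraction probe are all made up for illustration:

```python
import random

class Box:
    # Hypothetical receiving box with a hard cap; once full it starts
    # rejecting work orders, which is the sender's signal to back off.
    def __init__(self, capacity):
        self.capacity = capacity
        self.load = 0

    def accept(self, job):
        if self.load >= self.capacity:
            return False
        self.load += 1
        return True

def send_with_backoff(boxes, job, max_tries=8, rng=random.random):
    # CSMA-style sender: probe the least-loaded box (by load fraction);
    # on rejection, widen the random backoff window and try again.
    window = 1.0
    for _ in range(max_tries):
        target = min(boxes, key=lambda b: b.load / b.capacity)
        if target.accept(job):
            return target
        window *= 2            # exponential backoff
        _ = rng() * window     # a real sender would time.sleep() this long
    return None                # gave up: cluster saturated

a, b = Box(capacity=1), Box(capacity=2)
placed = [send_with_backoff([a, b], j) for j in ("j1", "j2", "j3")]
```

Using the load *fraction* rather than the raw slot count keeps a small box from looking perpetually attractive.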

Whoops, just noticed that your title said 'experience' 😛 Obviously I'm just making stuff up here. It's too fun to think about.
 
How about a randomized approach where the probability of sending a job to a machine is inversely proportional to its current load? This may get you to within an (expected) logarithmic factor of evenly distributed jobs.
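A minimal sketch of that weighted draw, assuming load is a non-negative number per machine (the +1 is just to keep an idle machine from dividing by zero; all names invented):

```python
import random

def pick_target(loads, rng=random.random):
    # Draw a machine with probability inversely proportional to its load.
    machines = list(loads)
    weights = [1.0 / (loads[m] + 1) for m in machines]
    r = rng() * sum(weights)
    for machine, w in zip(machines, weights):
        r -= w
        if r <= 0:
            return machine
    return machines[-1]  # guard against float rounding

loads = {"a": 0, "b": 3}
# weights are 1 and 0.25, so "a" is drawn 80% of the time
```

Because the choice is randomized, no single least-loaded box gets unanimously dog-piled, which sidesteps the black hole problem without any coordination.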
 
I'm leaning towards a formula-based approach - the machine with the highest value gets work assigned to it, where the value is calculated from the following criteria:

free work slots, resources, and the number of recently rejected work requests.

That should help route work to the least busy candidate, yet prevent the black hole problem.
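One way that formula could look; the weights are made-up knobs, not tuned values. Subtracting recent rejections is what steers traffic away from a box that is already turning work down:

```python
def score(free_slots, cpu_free, recent_rejections,
          w_slots=1.0, w_cpu=0.5, w_reject=2.0):
    # Higher score = better candidate for the next work unit.
    return (w_slots * free_slots
            + w_cpu * cpu_free
            - w_reject * recent_rejections)

candidates = {
    "a": score(free_slots=4, cpu_free=0.9, recent_rejections=3),  # busy saying no
    "b": score(free_slots=3, cpu_free=0.8, recent_rejections=0),
}
best = max(candidates, key=candidates.get)
# "a" has more free slots, but its rejections hand the job to "b"
```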
 
Self-task each work slot and have them pull work from the cluster instead of being pushed work. No master needed, no distribution scheme needed, no black hole problem, and it's scalable. Each slot just stays in an endless loop and keeps looking for work to do until there are no more jobs left. The real work here is then balancing the local queues ahead of local slot availability, to minimize network latency and slot idle time when the local queue empties and network broadcasts go out sniffing for more work.

That's the most popular method for multi-threading the SPEs on Cell; it should extend over a cluster pretty well.
 
Originally posted by: exdeath
Self-task each work slot and have them pull work from the cluster instead of being pushed work. No master needed, no distribution scheme needed, no black hole problem, and it's scalable. Each slot just stays in an endless loop and keeps looking for work to do until there are no more jobs left. The real work here is then balancing the local queues ahead of local slot availability, to minimize network latency and slot idle time when the local queue empties and network broadcasts go out sniffing for more work.

That's the most popular method for multi-threading the SPEs on Cell; it should extend over a cluster pretty well.

Interesting idea. Would you have a shared common queue that is maintained across the cluster and have each work slot poll it? Or alternatively have a member broadcast that it has messages in its queue and then have all work slots poll that member until it runs out of work?

I'm leaning towards the second.
 
Originally posted by: Argo
Originally posted by: exdeath
Self-task each work slot and have them pull work from the cluster instead of being pushed work. No master needed, no distribution scheme needed, no black hole problem, and it's scalable. Each slot just stays in an endless loop and keeps looking for work to do until there are no more jobs left. The real work here is then balancing the local queues ahead of local slot availability, to minimize network latency and slot idle time when the local queue empties and network broadcasts go out sniffing for more work.

That's the most popular method for multi-threading the SPEs on Cell; it should extend over a cluster pretty well.

Interesting idea. Would you have a shared common queue that is maintained across the cluster and have each work slot poll it? Or alternatively have a member broadcast that it has messages in its queue and then have all work slots poll that member until it runs out of work?

I'm leaning towards the second.

Up to you to decide. My main idea here is self-tasking. Keeping the local queues filled will be the meat of your balancing algorithm, with the goal being that the local threads should never have to broadcast over the network because of an empty queue. They should always find something in their local queues thanks to some housekeeping that occurs in the idle loop of each 'slot' before or after a job has started/completed. The idea is that the first slot to go idle will lock the queue, pull in more jobs from other machines based on workload stats, and unlock it. It can be assumed that the other slots on that machine will soon be going idle too, in just a few moments when they complete their current tasks; since the first one to go idle locks the queue, imports jobs, and unlocks it, the other slots will then be able to immediately find local work and not have to repeat those steps (they will be blocked briefly if they happen to go idle before the first one finishes filling the queue).

while (1)
{
    nextjob().execute();
    housekeeping(); // grab jobs from another queue, go into sleep state if all jobs are complete, etc.
}
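A slightly fuller sketch of that housekeeping step in Python, with invented names; the cluster-wide rule that only one slot may redistribute at a time is omitted here for brevity, and jobs are pulled from the back of the donor's queue:

```python
import threading
from collections import deque

class Node:
    # One machine's local job list; add/remove must be thread safe.
    def __init__(self, jobs=()):
        self.queue = deque(jobs)
        self.lock = threading.Lock()

def housekeeping(me, cluster, batch=2):
    # Idle-loop half of the worker loop: if the local queue is empty,
    # pull a batch from the back of the deepest remote queue so the
    # *next* slot to go idle finds local work immediately.
    with me.lock:
        if me.queue:
            return  # still local work; nothing to do
    donor = max((n for n in cluster if n is not me),
                key=lambda n: len(n.queue), default=None)
    if donor is None:
        return
    with donor.lock:
        grabbed = [donor.queue.pop()
                   for _ in range(min(batch, len(donor.queue)))]
    with me.lock:
        me.queue.extend(grabbed)

idle, busy = Node(), Node(["j1", "j2", "j3", "j4"])
housekeeping(idle, [idle, busy])
# jobs moved from the back of busy's queue; its front jobs stay put
```

Pulling from the back of the donor queue also dodges the race described earlier, where a donor's own slot grabs the front job mid-transfer.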

My thought was a distributed job list where no jobs are duplicated anywhere, so each machine's local list is unique and exclusive, but you can move jobs between machines before they go into "executing" status.

The idea is that each machine balances itself by requesting work, so there is no master. First check the local queues; then, if the local queue is empty, take advantage of that slot's momentary free time to go into a 'housekeeping' mode where it polls the status of all participants' queues and, based on some performance stats, pulls work over (for example, choose the queue of the machine that responded with the deepest queue and move n jobs from it). The machine doing this automatically knows it is the least busy and the best candidate to take on a large workload from another, simply by nature of being the first to go into this 'housekeeping' mode in the first place! This is the automatic load balancing feature that keeps everything busy. No guessing or predicting which machine should take more work. No burdening a single machine that has to decide when to devote time to balancing. The machine that is least busy at any given time will elect itself to balance, making even the balancing algorithm a distributed 'job' in and of itself. This is where self-tasking has its strength: the machine that is least busy will automatically take on work from a busy machine without any master. Having no master also makes it more generic and scalable.

The pitfall is you have to be careful here about redundant network traffic, as newly idle threads on different machines will go ping-ponging jobs around the network. Instead of a black hole problem you'll have the opposite: machines being selfish and trying to hoard jobs, pulling them in every which direction. Just include some kind of stat in your polling/balancing algorithm, such as a timestamp on each job that is reset when it moves to a new machine. Then, in addition to using the machine with the most jobs in queue, also look at the average timestamp of the jobs. A machine with a smaller queue but older timestamps has few but timely jobs that have been sitting there a while, and it might take priority over a machine with a larger queue filled with jobs that just got moved there.
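A toy scoring function along those lines, mixing queue depth with average job age so a stale short queue can outrank a freshly shuffled deep one (weights and timestamps invented):

```python
def donor_score(jobs, now, depth_weight=1.0, age_weight=0.5):
    # jobs is a list of (job_id, arrival_timestamp) pairs; the timestamp
    # resets whenever the job moves to a new machine. Higher score means
    # a better queue to pull work from.
    if not jobs:
        return 0.0
    avg_age = sum(now - arrived for _, arrived in jobs) / len(jobs)
    return depth_weight * len(jobs) + age_weight * avg_age

now = 10.0
deep_but_fresh = [("j%d" % i, 9.9) for i in range(4)]  # 4 jobs, just moved here
short_but_stale = [("j4", 0.0), ("j5", 1.0)]           # 2 jobs, waiting a while
# 4 + 0.5*0.1 = 4.05  vs  2 + 0.5*9.5 = 6.75: pull from the stale queue
```

Since recently moved jobs score low, two idle machines won't keep yanking the same jobs back and forth.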

I'm actually making up a lot of this on the fly, and thinking about how I could extend what I know about multi core programming to also work over a network. Thinking aloud 😉
 