Scripts for getting rid of jammed tasks

StefanR5R · Feb 19, 2018

Sometimes you are working on a less-than-stable BOINC project which gives you tasks that won't finish and that clog up your host. Short of abandoning such a project, there are some measures you can take:

1. Take leave from your day job, check your rig all day for faulty tasks, and clean them out as they occur.
2. Rent a monkey from the local zoo and train it to do the job for you.
3. Run a script which keeps watching for such tasks and deals with them.

Here is an example of doing it the third way.

Suspend endless Universe@Home tasks

The problem:
In November last year, during the Formula Boinc sprint at Universe@Home, I and others frequently encountered tasks which ran normally for a while and then got stuck: they kept running without making any more progress, i.e. the progress bar as seen in boincmgr no longer advanced.

The workaround:
Many of these tasks would get going again if the option "Leave non-GPU tasks in memory while suspended" was kept off and the tasks were suspended and then resumed. Sometimes they would get stuck again after resumption, in which case they should be aborted.
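
As a rough sketch of that suspend/resume step: the helper below wraps the two boinccmd calls. The function name and the default pause of 10 seconds are my own assumptions, and for a remote or password-protected client you would add --host and --passwd as in the script further down.

```shell
#!/bin/bash

# Hypothetical helper: suspend one task, wait a moment, then resume it.
# Arguments: project URL, task name, pause in seconds (default 10, an assumption).
kick_task() {
    local url=$1 task=$2 pause=${3:-10}
    boinccmd --task "${url}" "${task}" suspend || return 1
    sleep "${pause}"
    boinccmd --task "${url}" "${task}" resume
}
```

It would be called like: kick_task "https://universeathome.pl/universe/" "name_of_the_stuck_task"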

I have not checked whether Universe@Home is still plagued by this problem. (Recent forum posts indicate that it is.) For reference, and as inspiration for what can be done by scripting, I am reposting a bash script which Luigi R. of team BOINC.Italy posted at the Universe@Home forum in September. (The following code is marginally modified by me.)

Code:

#!/bin/bash

# Slightly modified from Luigi R.'s script,
# https://universeathome.pl/universe/forum_thread.php?id=199&postid=2406
#
# Usage: ./suspend_endless_universe_tasks.sh [host[:port] [password [interval]]]

host=${1:-localhost}
password=${2:-mysupersecurepassword}
boinccmd="boinccmd --host ${host} --passwd ${password}"
universeathome_url="https://universeathome.pl/universe/"

if [[ -z `echo $($boinccmd --get_simple_gui_info)` ]]
then
    echo "BOINC is not running. Exit..."
    exit
fi

name_old=()
fraction_done_old=()
faulty_tasks=()
start_time=$(date +%s)
iter=0
interval=${3:-600}

while true; do
    # Time vars
    time=$(date +%H':'%M':'%S)
    now_time=$(date +%s)
    script_time=$(echo "$now_time - $start_time" | bc)
    script_time_str=$(printf '%03dd:%02dh:%02dm:%02ds\n' $(($script_time/86400)) $(($script_time%86400/3600)) $(($script_time%3600/60)) $(($script_time%60)))
    iter=$((iter+1))
    reset
    echo -e "${host} | Time: ${time} | Execution time: ${script_time_str} | Iteration N.${iter} | Interval: ${interval}s\n"
    ###

    # BOINC vars
    name=(`echo $($boinccmd --get_tasks | grep -v 'WU name'| grep 'name' | awk '{print $2}') | cut -d " " -f 1-`)
    project_url=(`echo $($boinccmd --get_tasks | grep 'project URL' | awk '{print $3}') | cut -d " " -f 1-`)
    active_task_state=(`echo $($boinccmd --get_tasks | grep 'active_task_state' | awk '{print $2}') | cut -d " " -f 1-`)
    fraction_done=(`echo $($boinccmd --get_tasks | grep 'fraction done' | awk '{print $3}') | cut -d " " -f 1-`)
    ###

    # Loop vars
    ntasks=${#name[@]}
    noldtasks=${#name_old[@]}
    tmp_name=() # U@H names
    tmp_fraction_done=() # U@H fractions done
    ###

    # Loop
    for (( i = 0; i < ntasks; i++ )) do
        if [ ${active_task_state[$i]} == "EXECUTING" ]; then
            if [ ${project_url[$i]} == $universeathome_url ]; then
                #if [ "$noldtasks" == "0" ]; then                     # Case 1: no old tasks
                #    echo -e "${name[$i]} | \e[1;32mOK\e[0m"      
                #fi
                name_not_found=1
                for (( j = 0; j < noldtasks; j++ )) do             # Case 2: old tasks exist
                    if [ ${name[$i]} == ${name_old[$j]} ]; then # Case 2a: executing task still in old tasks
                        name_not_found=0
                        if [ ${fraction_done[$i]} == ${fraction_done_old[$j]} ]; then
                            $boinccmd --task $universeathome_url ${name[$i]} suspend
                            echo -e "${name[$i]} | \e[1;31mFAULT\e[0m"
                            faulty_tasks+=("${name[$i]}")
                        else
                            echo -e "${name[$i]} | \e[1;32mOK\e[0m"
                        fi
                        break
                    fi
                done
                if [ "$name_not_found" == "1" ]; then            # noldtasks == 0 is TRUE => name_not_found == 1 is TRUE
                    echo -e "${name[$i]} | \e[1;32mOK\e[0m"     # Case 2b: new executing task, no match in old tasks
                fi
                tmp_name+=("${name[$i]}")
                tmp_fraction_done+=("${fraction_done[$i]}")
            else
                echo -e "${name[$i]} | \e[1;33mNot U@H\e[0m"
            fi
        fi
    done
    name_old=("${tmp_name[@]}")
    fraction_done_old=("${tmp_fraction_done[@]}")
    ###

    # Print faulty U@H tasks
    if [ ${#faulty_tasks[@]} -gt 0 ]; then
        echo -e "\n\e[0;31mFaulty U@H tasks (${#faulty_tasks[@]}):\e[0m\n"$( echo ${faulty_tasks[@]} | sed 's/ /,\n/g' )
    fi
    ###

    sleep $interval
done

The script detects Universe@Home tasks which no longer increase their progress percentage, suspends them, and logs those tasks.

The user still needs to resume or abort these tasks manually, but this is something which needs only infrequent attention.

There is a little bug in the script: In a single loop iteration, it calls "boinccmd --get_tasks ..." four times in order to collect the name, project, state, and completion percentage of each task. It should call boinccmd only once per iteration; otherwise there is no guarantee that task names match task progress and so on. However, while I used this script, it worked perfectly for me despite this rather theoretical problem.

Kudos to Luigi R. for this script; it was a huge help to me when I ran Universe@Home.

Next up: A script of my own to deal with a different problem in Cosmology@Home.

StefanR5R · Feb 19, 2018

Abort preempted Cosmology@Home tasks

For the last week I have been running Cosmology@Home, configured to receive the VirtualBox based camb_boinc2docker tasks only. VirtualBox based applications have a lot of problems, such as wasting host resources, not being integrated with boinc-client's resource management, and being difficult to set up.

There are many pitfalls which can cause 100 % of the vbox based tasks to fail. But this post is not about that.

The problem:
This post is about a different problem that I am experiencing: Most of the camb_boinc2docker tasks work well, but a tiny fraction (less than 1 %, on my host that's circa 1 task per day) start to run, then fail to complete the guest system's initialization with the message "Postponed: VM Hypervisor failed to enter an online state in a timely fashion" in boinc-client's log, and are then suspended by boinc-client.

If you get a dump of all tasks and their properties with "boinccmd --get_tasks", those suspended tasks can be identified by having "scheduler state: preempted".
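
A minimal sketch of that identification step: the function below reads such a dump on stdin and prints the names of the tasks whose scheduler state is "preempted". The function name and the sample dump are made up for illustration; the field labels are the ones which the script further down matches on, and a real dump contains many more fields per task.

```shell
#!/bin/bash

# Hypothetical helper: print the names of tasks with "scheduler state: preempted".
# Reads text in the style of "boinccmd --get_tasks" from stdin.
list_preempted_tasks() {
    local line name=""
    while read -r line          # read trims the leading indentation
    do
        case ${line} in
            "name: "* )
                name=${line#"name: "};;
            "scheduler state: "* )
                if [ "${line#"scheduler state: "}" = "preempted" ]
                then
                    echo "${name}"
                fi;;
        esac
    done
    return 0
}

# Made-up sample dump, reduced to the fields used here.
sample_dump='1) -----------
   name: camb_example_task_A
   WU name: camb_example_wu_A
   project URL: http://www.cosmologyathome.org/
   scheduler state: uninitialized
2) -----------
   name: camb_example_task_B
   WU name: camb_example_wu_B
   project URL: http://www.cosmologyathome.org/
   scheduler state: preempted'

printf '%s\n' "${sample_dump}" | list_preempted_tasks
```

Against a live client, the same function would be fed by: boinccmd --get_tasks | list_preempted_tasks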

The big deal about those auto-suspended tasks is that the client then goes on to finish all remaining Cosmology@Home tasks, but no longer fetches new work. Soon your host will sit idle (or perhaps run other projects, if allowed to).

The workaround:
As far as I can tell, the camb_boinc2docker tasks which were suspended need to be aborted. I came up with the following bash script to do that for me.

Code:

#!/bin/bash

# Usage: ./abort_preempted_cosmology_tasks.sh [host[:port] [password]]

host=${1:-localhost}
password=${2:-mysupersecurepassword}

project_url="http://www.cosmologyathome.org/"

      check_every_n_minutes=5
  timestamp_every_n_minutes=30
update_at_less_than_n_tasks=10

           name_pattern='^[ ]\+name:[ ]'
        project_pattern='^[ ]\+project[ ]URL:[ ]'
          state_pattern='^[ ]\+state:[ ]'
scheduler_state_pattern='^[ ]\+scheduler[ ]state:[ ]'

          waiting_state="downloaded"
waiting_scheduler_state="uninitialized"
 faulty_scheduler_state="preempted"

boinccmd="boinccmd --host ${host} --passwd ${password}"
i=${timestamp_every_n_minutes}

while true
do
    if (( (i += ${check_every_n_minutes}) >= ${timestamp_every_n_minutes} ))
    then
        date
        i=0
    fi

    tasks=$(${boinccmd} --get_tasks)

               name=($(echo "${tasks}" | grep -e "${name_pattern}"            | sed -e "s/${name_pattern}//"           ))
            project=($(echo "${tasks}" | grep -e "${project_pattern}"         | sed -e "s/${project_pattern}//"        ))
              state=($(echo "${tasks}" | grep -e "${state_pattern}"           | sed -e "s/${state_pattern}//"          ))
    scheduler_state=($(echo "${tasks}" | grep -e "${scheduler_state_pattern}" | sed -e "s/${scheduler_state_pattern}//"))

    k=0
    for j in ${!name[*]}
    do
        if [ "${project[$j]}" != "${project_url}" ]
        then
            continue
        fi
        if [ "${state[$j]}" = "${waiting_state}" -a "${scheduler_state[$j]}" = "${waiting_scheduler_state}" ]
        then
            (( k++ ))
        fi
        if [ "${scheduler_state[$j]}" = "${faulty_scheduler_state}" ]
        then
            ${boinccmd} --task "${project_url}" "${name[$j]}" abort && echo "abort ${name[$j]}"
        fi
    done
    if (( $k < ${update_at_less_than_n_tasks} ))
    then
        ${boinccmd} --project "${project_url}" update && echo "update ${project_url}"
    fi

    sleep "${check_every_n_minutes}m"
done

The script also has a safeguard built in which forces a project update as soon as it sees fewer than a certain number of tasks "ready to start" remaining. Normally this section of the code is never executed, since the client keeps fetching new tasks in a timely manner when the bad tasks are aborted early enough. But I left this extra in there for safety and as another demo of how the task list can be evaluated and acted upon.

StefanR5R · Feb 19, 2018

Some general notes about scripted control of boinc-client:

The two scripts which I posted interact with boinc-client solely through the "boinccmd" command line utility. That is, they don't require access to the boinc-client's data directory.
If you have set up remote GUI control to your hosts so that you can run boincmgr or boinctasks on a different PC than the client, then you can also run boinccmd and these scripts remotely.

And more generally about scripting:

The two scripts above require the "bash" shell as interpreter, and a few little programs, e.g. the "sed" text stream editor, which are commonly found on Linux and other unixoid OSs. Occasionally I run such scripts on Windows, and have the Cygwin toolset installed on my Windows PCs for this and similar purposes.
There are a few ways to start a bash script:
- Run it within an existing bash session by "source path/to/the_script.sh" or shorter ". path/to/the_script.sh". I don't recommend this, as all the variables from the script are propagated into the session. Though in special cases this side effect is desired.
- Run it by calling the interpreter explicitly: "bash path/to/the_script.sh".
- Make the file executable by "chmod +x path/to/the_script.sh" or by editing the file properties in the file manager of your choice. Then call the script directly by just "path/to/the_script.sh" or by clicking on it in the file manager. Note, the operating system knows that the script is to be fed into bash not by looking at the file name extension, but by the first line in the script, which needs to say "#!/path/to/the_desired_interpreter". You could save the script with any file name extension you like, or without one.
The two scripts which I posted loop infinitely on their own. You quit them either by closing the terminal window or by hitting [Ctrl][c].
For editing shell scripts and such, I prefer editors with syntax highlighting. This gives good feedback while typing, such that e.g. closing quotes and brackets aren't lost.
When you edit a bash script, you can use [Space] or [Tab] where whitespace is required.
In bash scripts, the # sign starts comment lines. I routinely use this to deactivate unused lines which I may want to edit back in at a later time.
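
The second and third way of starting a script can be tried out safely with a throwaway file. This is only a sketch: the file is created with mktemp, and its content is a stand-in for a real script.

```shell
#!/bin/bash

# Create a throwaway demo script (its content is a stand-in for a real script).
demo=$(mktemp)
printf '%s\n' \
    '#!/bin/bash' \
    '# a comment line; bash ignores it' \
    'echo "hello from the demo script"' > "${demo}"

# Way 2: hand the file to the interpreter explicitly; no execute bit is needed.
out_explicit=$(bash "${demo}")

# Way 3: set the execute bit, then call the file directly.
# The interpreter is picked from the "#!" line, not from the file name.
chmod +x "${demo}"
out_direct=$("${demo}")

echo "explicit: ${out_explicit}"
echo "direct:   ${out_direct}"
rm -f "${demo}"
```

Note that the direct call can fail on systems where the temporary directory is mounted without permission to execute files; the explicit call via bash still works there.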

StefanR5R · Mar 3, 2018

BTW, to get a picture of what scripts which use boinccmd could do, check out the boinccmd documentation.

About boinccmd, remote GUI password, and secondary client instances:

If boinccmd and boinc-client run on the same host, and you have set the PATH variable properly, you can access the first client instance by any of the following calls:

boinccmd --get_host_info
boinccmd --host localhost --get_host_info
boinccmd --host localhost:31416 --get_host_info

(Taking --get_host_info merely as an example for the various "get" and "set" commands that are supported.) If you run a second instance of boinc-client on the very same host OS instance (i.e. not in a virtual machine), this second client instance would be accessed by:

boinccmd --host localhost:31499 --get_host_info

Replace 31499 by whatever actual number you chose for the remote GUI port.

I found that the above syntax works for both "get" and "set" commands only on Windows, whereas on Linux the "set" commands require that the GUI password from the gui_rpc_auth.cfg file in the boinc data directory is given explicitly as a parameter to boinccmd. So, for the first client instance:

boinccmd --passwd whateveryourpasswordis --set_run_mode auto
boinccmd --host localhost --passwd whateveryourpasswordis --set_run_mode auto
boinccmd --host localhost:31416 --passwd whateveryourpasswordis --set_run_mode auto

(Again taking --set_run_mode only as an example. You can watch its effects by saying either always or auto or never, and check in boincmgr's Activity menu how the client changes its run mode accordingly.) And the second client instance:

boinccmd --host localhost:31499 --passwd whateveryourpasswordis --set_run_mode auto

(Actually, under special circumstances the password parameter is not needed for "set" commands on Linux either: if there is a gui_rpc_auth.cfg in the current directory from which boinccmd is run, the current user has read permission for this file, and the file contains the correct password. Or, in the case of the primary boinc-client instance, if the current user has read access to the gui_rpc_auth.cfg file in the boinc data directory. I guess the latter is why it generally works without the --passwd parameter on Windows.)
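
If you would rather not type the password on the command line, a script can read it from the file itself. A minimal sketch, assuming that the file holds the password on its first line; the function name is made up, and the path in the usage line below is only a placeholder, since the location of the data directory differs between installations.

```shell
#!/bin/bash

# Hypothetical helper: print the GUI RPC password stored in the given file.
# Assumes that the password is the first line of the file.
read_gui_password() {
    local file=$1 password=""
    [ -r "${file}" ] || return 1        # file is missing or not readable
    IFS= read -r password < "${file}"   # also copes with a missing final newline
    printf '%s\n' "${password}"
}
```

Usage would be along the lines of: boinccmd --host localhost:31416 --passwd "$(read_gui_password /path/to/boinc_data_dir/gui_rpc_auth.cfg)" --set_run_mode auto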

You can run boinccmd and boinc-client on different hosts also. In that case, replace "localhost" by the hostname or IP address of the host on which the client runs. The client must be configured to allow this remote access, and all routers or firewalls between boinccmd and boinc-client need to allow this too, including "desktop firewalls" like Windows' built-in one.

If you run another boinc-client instance in a virtual machine, then I suppose you can either run boinccmd in the same virtual machine too and have it access its own localhost, or run it outside the virtual machine, in which case you need to specify a hostname or IP address under which the virtual machine is seen from the outside. However, I have never tried using virtual machines for such purposes myself.

StefanR5R · Oct 20, 2018

Suspend endless Universe@home tasks - update to post #1

Here is a bug-fixed version of the script which monitors for Universe@home tasks which don't make any progress, and suspends them:

Code:

#!/bin/bash

# modified from Luigi R.'s script,
# https://universeathome.pl/universe/forum_thread.php?id=199&postid=2406
#
# Usage: ./suspend_endless_universe_tasks_v2.sh [host[:port] [password [interval]]]

host=${1:-localhost}
password=${2:-mydefaultpassword}
boinccmd="boinccmd --host ${host} --passwd ${password}"
universeathome_url="https://universeathome.pl/universe/"

if [[ -z `echo $($boinccmd --get_simple_gui_info)` ]]
then
	echo "BOINC is not running. Exit..."
	exit
fi

name_old=()
fraction_done_old=()
faulty_tasks=()
start_time=$(date +%s)
iter=0
interval=${3:-180}

while true
do
	# log a timestamp
	time=$(date +%H':'%M':'%S)
	now_time=$(date +%s)
	script_time=$(echo "$now_time - $start_time" | bc)
	script_time_str=$(printf '%03dd:%02dh:%02dm:%02ds\n' $(($script_time/86400)) $(($script_time%86400/3600)) $(($script_time%3600/60)) $(($script_time%60)))
	iter=$((iter+1))
	reset
	echo -e "${host} | Time: ${time} | Execution time: ${script_time_str} | Iteration N.${iter} | Interval: ${interval}s\n"

	# get state of all tasks
	unset name project_url active_task_state fraction_done
	while read line
	do
		case ${line} in
		                [1-9]* )                     i=${line%)*};;
		             "name: "* )              name[$i]=${line#*"name: "};;
		      "project URL: "* )       project_url[$i]=${line#*"project URL: "};;
		"active_task_state: "* ) active_task_state[$i]=${line#*"active_task_state: "};;
		    "fraction done: "* )     fraction_done[$i]=${line#*"fraction done: "};;
		esac
	done <<< "$(${boinccmd} --get_tasks)"	# quoted, so that line breaks survive on older bash versions too

	noldtasks=${#name_old[@]}
	unset tmp_name tmp_fraction_done

	# inspect current tasks
	for i in ${!name[@]}
	do
		if [ "${active_task_state[$i]}" != "EXECUTING" ]
		then
			continue
		fi
		if [ "${project_url[$i]}" != "$universeathome_url" ]
		then
			printf "%-58s | \e[1;33mNot U@H\e[0m\n" "${name[$i]}"
			continue
		fi
		# look for tasks which were also executing during the last iteration
		for (( j = 0; j < noldtasks; j++ ))
		do
			if [ "${name[$i]}" = "${name_old[$j]}" ]
			then
				if [ "${fraction_done[$i]}" = "${fraction_done_old[$j]}" ]
				then
					$boinccmd --task $universeathome_url ${name[$i]} suspend
					printf "%-58s | \e[1;31mFAULT\e[0m\n" "${name[$i]}"
					faulty_tasks+=("${name[$i]}")
				else
					printf "%-58s | \e[1;32mOK\e[0m\n" "${name[$i]}"
				fi
				break
			fi
		done
		# the rest are tasks which were not yet executing in the last iteration
		if [ $j = $noldtasks ]
		then
			printf "%-58s | \e[1;32mOK\e[0m\n" "${name[$i]}"
		fi
		tmp_name+=("${name[$i]}")
		tmp_fraction_done+=("${fraction_done[$i]}")
	done
	name_old=("${tmp_name[@]}")
	fraction_done_old=("${tmp_fraction_done[@]}")

	# list faulty tasks
	if [ ${#faulty_tasks[@]} -gt 0 ]
	then
		echo -e "\n\e[0;31mFaulty U@H tasks (${#faulty_tasks[@]}):\e[0m\n"$( echo ${faulty_tasks[@]} | sed 's/ /,\n/g' )
	fi
	echo

	sleep $interval
done

Changes:

Bug fix: Not all task attributes are listed in every task record in the output of boinccmd --get_tasks. The old script could therefore associate these attributes with the wrong task names, causing false positive detections of tasks without progress. (This bug hits especially while uploading tasks are in the queue.) Fixed by a rewritten parser which puts such attributes into sparse array variables.
Bug fix/ optimization: Use only one boinccmd --get_tasks call per iteration. This ensures a consistent view upon task state and improves performance.
Decrease default checking interval from 10 to 3 minutes.
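
To illustrate the first bug fix, here is a small sketch which parses the same dump both ways. The sample dump is made up, and which attribute is missing from which record is an assumption for the sake of the demo. It needs bash, because of the arrays.

```shell
#!/bin/bash

# Made-up dump: record 1 has no "fraction done" line, record 2 has one.
sample_dump='1) -----------
   name: example_task_without_fraction
2) -----------
   name: example_task_with_fraction
   fraction done: 0.25'

# Packed (old approach): values are appended in order of appearance,
# so the only fraction lands at index 0, next to the first task name.
packed_name=($(grep -v 'WU name' <<< "${sample_dump}" | grep 'name' | awk '{print $2}'))
packed_fraction=($(grep 'fraction done' <<< "${sample_dump}" | awk '{print $3}'))

# Sparse (new approach): values are stored at the record number
# which is taken from the "N)" line that starts each record.
unset sparse_name sparse_fraction
while read -r line
do
    case ${line} in
                 [1-9]* )                   i=${line%)*};;
              "name: "* )    sparse_name[$i]=${line#"name: "};;
        "fraction done: "* ) sparse_fraction[$i]=${line#"fraction done: "};;
    esac
done <<< "${sample_dump}"

echo "packed: ${packed_name[0]} is paired with ${packed_fraction[0]}"
echo "sparse: ${sparse_name[2]} is paired with ${sparse_fraction[2]}"
echo "sparse indices of fraction done: ${!sparse_fraction[*]}"
```

With the packed arrays, the fraction ends up next to the task that never reported one; with the sparse arrays, index 1 of the fraction array simply stays empty.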

For the purpose of running this script over longer time frames without supervision, it could be extended to (1.) resume previously suspended tasks, after which there is a good chance that they complete normally, (2.) abort tasks which already went through one or two suspend/ resume cycles before but are again observed to not make any progress.
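
The bookkeeping for such an extension could look like the sketch below. It only decides what to do with a stalled task; nothing in it talks to the client. The function name, the associative array, and the limit of two cycles are assumptions for illustration, and associative arrays need bash 4 or later.

```shell
#!/bin/bash

declare -A stall_count      # task name -> how often the task was seen stalled
max_cycles=2                # assumed limit of suspend/resume cycles per task

# Hypothetical helper: record one more stall of a task and set ${action}.
# It sets a variable instead of echoing, so that it can be called without
# a subshell and the counter survives between calls.
decide_action() {
    local task=$1
    stall_count[$task]=$(( ${stall_count[$task]:-0} + 1 ))
    if (( ${stall_count[$task]} > max_cycles ))
    then
        action="abort"
    else
        action="suspend_then_resume"
    fi
}

# Demo: the same task is seen stalled three times in a row.
for round in 1 2 3
do
    decide_action "example_task"
    echo "stall ${round}: ${action}"
done
```

In the script above, decide_action would be called where the FAULT line is printed. The "abort" outcome would map to the abort operation of boinccmd --task, the other outcome to the existing suspend followed by a resume one iteration later.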
