The current status of Rosetta@home appears to be that there are
- batches of classic Rosetta work becoming available at some "happy hours" almost every day recently,¹ which are slurped up quickly by users who have their clients set up to pull Rosetta work whenever possible (on hosts without VirtualBox, or with VirtualBox disabled),
- batches of the newer vbox-based work available at all times of day, but supported by far fewer contributors than the classic Rosetta work queue.
¹) Or perhaps "happy minutes", rather.
That is, those who want to run classic Rosetta still can, but need to jump through hoops with work buffering because of its intermittent availability. I haven't tried this myself, but the steps to take in order to run classic Rosetta are evident:
- Do not install VirtualBox, or set <dont_use_vbox>1</dont_use_vbox> in cc_config.xml.
- Set a large enough work buffer. But don't overdo it; the reporting deadline is 3 days.
- Do not run any other project in the same client, or if you do, set the other project(s) to 0 % resource share, or set up a limit on tasks in progress for these projects. (The latter is only possible at a few projects as a web preferences option.)
- Run a script which periodically reminds the client to 'update' the Rosetta project (I haven't investigated what period would be good to set; a minimal sketch follows this list), or run a client with modified source code which does not prolong the request backoff period too much (I never tested this approach myself).
- Optionally, set the "Target CPU run time" in the Rosetta@home web preferences to more than the default 8 hours, so that the tasks which you do get last longer.
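For the 'reminder' script, something as simple as the following should do. The 15-minute interval is an untested guess, and it assumes being run from the client's data directory (or with --passwd added) so that boinccmd can authenticate:
Code:
#!/bin/bash
# Periodically ask the client to contact Rosetta@home for work, overriding
# the client's own request backoff. The interval is a guess -- tune as needed.
PROJECT_URL="https://boinc.bakerlab.org/rosetta/"
INTERVAL=900    # seconds between nudges

while true; do
    boinccmd --project "$PROJECT_URL" update
    sleep "$INTERVAL"
done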
If you go for such a setup but want some safeguard against idle time, you could activate one or another additional project in the same client but set them to 0 % resource share, so that they don't fill the work buffer which you want to reserve for Rosetta. Or, speaking theoretically, you could set up another client instance and have a watchdog script monitor the number of running tasks in the Rosetta instance and reduce or increase the number of active CPUs in the other instance accordingly.
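Purely to illustrate that theoretical idea: the "number of active CPUs" part could be done by rewriting global_prefs_override.xml in the other instance's data directory and telling it to re-read the file. Everything below (paths, port, percentages, the exact boinccmd output field) is an untested assumption:
Code:
#!/bin/bash
# Hypothetical sketch: when the Rosetta instance has tasks actually running,
# shrink the CPU allowance of the other client instance, otherwise widen it.
OTHER_DIR=/var/lib/boinc_other      # assumed data directory of the other instance
OTHER_PORT=31418                    # assumed GUI RPC port of the other instance

set_cpu_pct() {
    cat > "$OTHER_DIR/global_prefs_override.xml" <<EOF
<global_preferences>
   <max_ncpus_pct>$1</max_ncpus_pct>
</global_preferences>
EOF
    boinccmd --host localhost:$OTHER_PORT --read_global_prefs_override
}

# Count tasks the (default) Rosetta instance is currently executing;
# the field name may differ slightly between client versions.
running=$(boinccmd --get_tasks | grep -c "active_task_state: EXECUTING")
if [ "$running" -gt 0 ]; then
    set_cpu_pct 50      # Rosetta busy: leave it half of the CPUs
else
    set_cpu_pct 100     # Rosetta idle: let the other instance use all CPUs
fi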
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
So the above was all about running the classic Rosetta application under the current circumstances. But now about running the newer VirtualBox based Rosetta application, a.k.a. "rosetta python projects":
I had my computers off during the week but reactivated two of them yesterday. CPDN would have been my project of choice but doesn't have Linux work at this time, and so I decided to go for a mix of QuChemPedIA and the awful Rosetta Python Projects.
So the main issues with the latter are that these tasks require a larger than usual amount of RAM, that they may make the computer less responsive, and that they have a tendency to fall into the notorious "Postponed: VM job unmanageable, restarting later" state. I believe the latter happens especially if (or rather: when) the computer is not very responsive.
Right now I am using two Linux computers for the QuChem + Rosetta mix, both with the same hardware and BOINC settings. Each computer has 64 cores/128 threads, 256 GB RAM, and NVMe flash storage. The nice thing about bigger computers is that there is more flexibility in how they can be used for parallel workloads.
Using two client instances
I run QuChem's native Linux "nwchem" application in one client instance, and "rosetta python projects" in another client instance. This lets me control the number of running tasks and the number of downloaded tasks for each of these projects separately. (I do like being in control a lot.)
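For reference, starting such a second instance is roughly a matter of giving it its own data directory and GUI RPC port; the path and port number below are just examples, not necessarily my setup:
Code:
$ mkdir -p /var/lib/boinc_rosetta
$ boinc --dir /var/lib/boinc_rosetta --gui_rpc_port 31418 --allow_multiple_clients &
$ boinccmd --host localhost:31418 --get_tasks    # address this instance explicitly
Each instance is then attached only to its own project, which is what gives the separate control over running and downloaded tasks.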
No swapping to disk
The computers have a swap disk, but I have switched swap off for the time being, in order to avoid the huge latencies that would occur if memory pages were swapped out to disk.
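For the record, this is simply (swap comes back from fstab at the next boot unless it is removed there):
Code:
$ sudo swapoff -a    # disable all active swap areas for this boot
$ swapon --show      # prints nothing once swap is off
$ free -h            # the Swap line now reads 0B across the board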
RAM utilization
QuChem's nwchem tasks occupy about 150 MB of resident memory each. "rosetta python projects" tasks take about 1.5 to 1.8 GB of resident memory each at this time. This fits easily into the mentioned 256 GB RAM of my computers, but obviously the operating system needs RAM for other purposes too. By far the largest chunk of RAM goes to filesystem caches, as long as there is no more pressing use for it.
With 56 running QuChem tasks + 8 running Rosetta vbox tasks:
Code:
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           251Gi        20Gi       147Gi        15Mi        84Gi       229Gi
Swap:             0B          0B          0B
With 48 running QuChem tasks + 16 running Rosetta vbox tasks:
Code:
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           251Gi        32Gi        74Gi        15Mi       144Gi       217Gi
Swap:             0B          0B          0B
Note that "buff/cache" memory is also part of the total "available" memory here. However, since I want to keep the computers highly responsive, I do want to give the tasks as much "buff/cache" as they want.
As you can tell, I could still increase the number of Rosetta vbox tasks before running short of RAM for the filesystem cache. (Rough cross-check: 48 × 0.15 GB + 16 × 1.7 GB ≈ 34 GB, which is in line with the ~32 GiB "used" above.)
Furthermore, I am obviously running merely 64 tasks in total on a 64c/128t computer. This is an arbitrary decision, favoring (a) high responsiveness of course, and (b) throughput per task, over total machine throughput. In particular, these are dual-socket computers set to a 180 W power limit per socket, hence 2 × 180 W over 64 cores ≈ 5.6 W average power budget per core (and, since I run 64 tasks, 5.6 W per task). That is not a very high power budget per core to begin with, so I don't expect the loss of total machine throughput from leaving half of the hardware threads idle to be very high.
Even if I wanted to go for a higher task count, I certainly would not run 128 concurrent tasks, but would rather leave a few spare hardware threads in reserve for whenever the various wrapper processes, which both "nwchem" and "rosetta python projects" keep around, wake up.
I still want to refine my setup for Rosetta a bit
To do:
- Implement a watchdog script which detects Rosetta tasks in the "Postponed: VM job unmanageable, restarting later" state, and shuts down and restarts the boinc client instance in such cases. (A rough sketch of this and the next watchdog follows the list.)
IME, client shutdown and restart is both necessary and sufficient to get such tasks going again.
I have been running "rosetta python projects" work for somewhat more than a day now and got a few of these incidents on both computers during this time, despite the measures described above.
- Implement a watchdog script which detects Rosetta tasks with unusually long elapsed time, and aborts those.
During the 1+ day of running this, I got one task with now 23+ hours elapsed time and another with now 12+ hours elapsed time.
Of the 100+ already completed tasks (all valid), the run times on my computers are
3.3 h on average,
2.3 h as the 10-percentile, 4.3 h as the 90-percentile,
1.8 h min, 5.5 h max.
Therefore, aborting a task after maybe 10 h seems like a good idea. But a much better criterion would be to abort tasks whose current CPU time is considerably less than their elapsed time, because the defining property of these very long running tasks is that they use almost no CPU time.
- Unrelated to Rosetta: Have a watchdog script abort QuChem tasks with unusually long elapsed time. Occasionally, this application simply doesn't converge but isn't able to abort its iterations, from what I understand. Such tasks will blissfully continue to run forever if left alone.
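Here is roughly how I imagine the first two of these watchdogs could look, combined into one cron-driven script. This is only a sketch: the data directory path is a placeholder, the thresholds are guesses, and whether the "unmanageable" wording really shows up in stdoutdae.txt, and whether the boinccmd field names match your client version, would have to be verified first.
Code:
#!/bin/bash
# Sketch of a watchdog for the Rosetta client instance (run e.g. every 5 min
# from cron). It expects the instance's data directory as working directory
# so that boinccmd can read gui_rpc_auth.cfg.
DATA_DIR=/var/lib/boinc_rosetta                     # placeholder path
PROJECT_URL="https://boinc.bakerlab.org/rosetta/"
MAX_ELAPSED=$((10 * 3600))                          # ~10 h, per the statistics above

cd "$DATA_DIR" || exit 1

# 1) "Postponed: VM job unmanageable, restarting later":
#    look only at event-log lines added since the last run (ignores log
#    rotation), then shut the client down and start it again.
last=$(cat .watchdog_lineno 2>/dev/null || echo 0)
wc -l < stdoutdae.txt > .watchdog_lineno
if tail -n +"$((last + 1))" stdoutdae.txt | grep -q "unmanageable"; then
    boinccmd --quit
    sleep 30
    boinc --dir "$DATA_DIR" --daemon                # or: systemctl restart <your unit>
    exit 0                                          # skip part 2 while the client restarts
fi

# 2) Abort tasks which have been running far too long while using almost no CPU.
boinccmd --get_tasks | awk -v max="$MAX_ELAPSED" '
    /^ *name:/              { name = $2; elapsed = 0 }
    /^ *elapsed task time:/ { elapsed = $4 }
    /^ *current CPU time:/  { if (elapsed > max && $4 < elapsed / 2) print name }
' | while read -r task; do
    boinccmd --task "$PROJECT_URL" "$task" abort
done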