Question Odd problem with BOINC on ONE computer out of 18.

Markfw · Apr 27, 2021

So this computer, my dual EPYC 7601 box, has been fine for months, maybe years. But now, in one week, I have had to go to the computer and, on the Projects tab, select Update, since it's idle and has hundreds of tasks that are ready to report. As soon as I do, it updates everything, downloads new units, and goes merrily on... for a few days, then "rinse and repeat". Is there some setting that is corrupted or something? Any ideas? I've never had this happen on any other box. It's running Rosetta, one of 4 of my EPYC boxes running that.

crashtech · Apr 27, 2021

Did you look in the Networking tab to see if anything changed?

Markfw · Apr 27, 2021

crashtech said:
Did you look in the Networking tab to see if anything changed?

Nothing at all is checked. This box is a dedicated BOINC box, and has 256 gig ram for 128 threads, NO GPU, and 100 % cpu used for BOINC, and NO changes have been made recently.

Oh, stats:
OS linux mint 19 (.1 or .2)
BOINC version 7.9.3
Rosetta version (what its running) 4.2

crashtech · Apr 27, 2021

Well, you could just run a script that updates the project on a regular basis. I don't have any ideas on how to fix it the right way.

Fardringle · Apr 27, 2021

Are there any error messages or anything else unusual in the BOINC Event Log during the time when it sits there not reporting the completed tasks?

Markfw · Apr 27, 2021

Fardringle said:
Are there any error messages or anything else unusual in the BOINC Event Log during the time when it sits there not reporting the completed tasks?

I checked, and out of the hundreds of lines, the log only went back as far as today after I clicked Update. The only red line was a message that said "this project is using an old URL, please change it to ..."

ZipSpeed · Apr 27, 2021

Markfw said:
I checked, and out of the hundreds of lines, the log only went back as far as today after I clicked Update. The only red line was a message that said "this project is using an old URL, please change it to ..."

Is this for Rosetta? If I remember correctly, Rosetta sent out a message under BOINC Notices last year to detach the project and then reattach as they switched to a new URL.

Markfw · Apr 27, 2021

ZipSpeed said:
Is this for Rosetta? If I remember correctly, Rosetta sent out a message under BOINC Notices last year to detach the project and then reattach as they switched to a new URL.

Yes, but I have 4 machines, all linux, all EPYC boxes all doing Rosetta, and this is the only one that has this problem, and they are all the same old URL.

mmonnin03 · Apr 27, 2021

Have you reset the project, or detached and reattached? Tasks will be lost when doing so and will sit in limbo until their deadlines are reached.

Markfw · Apr 27, 2021

Well, I hate to abort over 100 tasks, plus probably 500 in the queue, but after all the replies, I removed the project and re-added it, so I guess we will see what happens in a week or so.

StefanR5R · Apr 28, 2021

Markfw said:
I have 4 machines, all linux, all EPYC boxes all doing Rosetta, and this is the only one that has this problem,

A question: Are these computers running other projects simultaneously with Rosetta? Or is Rosetta currently the only active project on them?

And an observation: The dual 7601 computer has, for whatever reason, a higher rate of invalid and error results than the other three (Zen 2/ Rome based?) computers. Maybe this causes the client on the 7601 to issue fewer scheduler requests.

computer	valid	invalid	error	% valid	% invalid	% error
3758038	676	155	291	60	14	26
3770782	771	6	21	97	1	3
4689209	757	8	24	96	1	3
5925533	762	10	22	96	1	3

Markfw · Apr 28, 2021

StefanR5R said:
A question: Are these computers running other projects simultaneously with Rosetta? Or is Rosetta currently the only active project on them?

And an observation: The dual 7601 computer has, for whatever reason, a higher rate of invalid and error results than the other three (Zen 2/ Rome based?) computers. Maybe this causes the client on the 7601 to issue fewer scheduler requests.

computer	valid	invalid	error	% valid	% invalid	% error
3758038	676	155	291	60	14	26
3770782	771	6	21	97	1	3
4689209	757	8	24	96	1	3
5925533	762	10	22	96	1	3

Yes, all 4 do Rosetta only, and yes, the other 3 are Rome, while this one is Naples. But they all use the same memory (ECC/registered), and this has only started in the last week or so.

voodoo5_6k · Apr 28, 2021

StefanR5R said:
And an observation: The dual 7601 computer has, for whatever reason, a higher rate of invalid and error results than the other three (Zen 2/ Rome based?) computers. Maybe this causes the client on the 7601 to issue fewer scheduler requests.

@Markfw Do you see something like "Deferred for: xx:yy:zz" on the project tab or under projects in BoincTasks? When I recently observed the high error rates with then current Rosetta WUs, the phenomenon StefanR5R described was exactly what was happening on my system. With each failed WU, the time was increased, accumulating to several days in few hours of processing. For that reason, I made a little script that would initiate the update for each project and installed it as a crontab, running it every 8h (or three times a day).
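
A minimal sketch of what such an update script could look like, not voodoo5_6k's actual script. It assumes boinccmd is installed and on the PATH, and that the URL in PROJECT_URLS matches the URL the client is attached with; the function names are made up for illustration.

```python
import subprocess

# Assumption: replace with the exact URL your client shows for the project.
PROJECT_URLS = ["https://boinc.bakerlab.org/rosetta/"]


def update_command(project_url):
    """Build the boinccmd invocation that requests a project update."""
    return ["boinccmd", "--project", project_url, "update"]


def update_all(urls, run=subprocess.run):
    """Ask the local client to contact each project; return the exit codes.

    `run` is injectable so the function can be exercised without a client.
    """
    codes = []
    for url in urls:
        result = run(update_command(url), check=False)
        codes.append(result.returncode)
    return codes
```

Calling update_all(PROJECT_URLS) at the end of the file and running it from a crontab line such as `0 */8 * * * python3 /path/to/script.py` (path is a placeholder) would give the every-8-hours schedule described above.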

crashtech · Apr 28, 2021

Seems as if there was a batch of very error-prone tasks put out, but when I recently began crunching Rosetta, they were mostly gone. Looks like the particular box in question got a large number of those problematic tasks. Either that, or it is malfunctioning.

StefanR5R · Apr 28, 2021

Link to the result list: host 3758038

A quick look at the error results shows two types:

- Error after less than a minute; stderr.txt contains "Error reading and gzipping output datafile: default.out". This is certainly something wrong with the workunits, because the second replica of the same WU apparently always fails in the same way.
- Error after Mark's configured run time limit of 4 hours; stderr contains "finish file present too long". This could be a certain old (still unfixed?) bug in the client.

The invalid results are of two types too:

- Finished after less than a minute and submitted for validation. This is certainly also something wrong in the workunit, since these are replicas of tasks which failed before on the other host, with an error after less than a minute, featuring the mentioned "Error reading and gzipping output datafile: default.out".
- Finished after Mark's configured run time limit of 4 hours. This could be a fault on this computer, because at least none of the other current top 20 hosts has invalid results of this type.

I browsed through the top 200 hosts but have not seen another EPYC Naples for comparison. @Markfw, if you connect the BMC's Ethernet port, you could log in to the IPMI web interface and have a look at the server health log, to look for anything suspicious in there.

Furthermore, if the kernel has got the EDAC driver loaded, memory faults may (or may not) show up with the following command:
grep . /sys/devices/system/edac/mc/mc*/*count
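
For checking this repeatedly, here is a rough Python equivalent of the grep above. It is only a sketch: it assumes the EDAC counters are plain integer files under the same sysfs path, and the function names are made up for illustration.

```python
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {path relative to root: integer count} for every mc*/*count file."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(root, "mc*", "*count"))):
        try:
            with open(path) as handle:
                counts[os.path.relpath(path, root)] = int(handle.read().strip())
        except (OSError, ValueError):
            # Unreadable or non-numeric entries are skipped rather than guessed at.
            continue
    return counts


def nonzero_counts(counts):
    """Keep only the counters that have recorded at least one event."""
    return {name: value for name, value in counts.items() if value != 0}
```

An empty result from nonzero_counts(read_edac_counts()) means either no errors were logged or the EDAC driver is not loaded, so like the grep it may (or may not) show memory faults.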

StefanR5R · Apr 28, 2021

By the way, for troubleshooting, it is possible to increase the verbosity of boinc-client's logging by switching on a few more log flags. Edit cc_config.xml (add log flags which aren't there already, switch log flags from 0 = off to 1 = on), then let the client re-read its configuration files: boincmgr advanced view -> Options -> Read config file.

Default configuration:

XML:

<cc_config>
    <log_flags>
        <!-- *on* by default in most but not all client versions -->
        <task>1</task>                                  <!-- The start and completion of compute jobs (should get two messages per job). -->
        <file_xfer>1</file_xfer>                        <!-- The start and completion of file transfers. -->
        <sched_ops>1</sched_ops>                        <!-- Connections with scheduling servers. -->

        <!-- *off* by default in most client versions -->
        <app_msg_receive>0</app_msg_receive>            <!-- Shared-memory messages received from applications. -->
        <app_msg_send>0</app_msg_send>                  <!-- Shared-memory messages sent to applications. -->
        <async_file_debug>0</async_file_debug>          <!-- Asynchronous copy and checksum of large (> 10 MB) files. -->
        <benchmark_debug>0</benchmark_debug>            <!-- Debugging information about CPU benchmarks. -->
        <checkpoint_debug>0</checkpoint_debug>          <!-- Show when applications checkpoint. -->
        <coproc_debug>0</coproc_debug>                  <!-- Show details of coprocessor (GPU) scheduling. -->
        <cpu_sched>0</cpu_sched>                        <!-- CPU scheduler actions (preemption and resumption). -->
        <cpu_sched_debug>0</cpu_sched_debug>            <!-- Explain CPU scheduler decisions. -->
        <cpu_sched_status>0</cpu_sched_status>          <!-- Show what tasks are running. -->
        <dcf_debug>0</dcf_debug>                        <!-- For seeing changes in DCF. -->
        <disk_usage_debug>0</disk_usage_debug>          <!-- Show disk usage info. -->
        <file_xfer_debug>0</file_xfer_debug>            <!-- Show completion status of file transfers. -->
        <gui_rpc_debug>0</gui_rpc_debug>                <!-- Debugging information about GUI RPC operations. -->
        <http_debug>0</http_debug>                      <!-- Debugging information about HTTP operations. -->
        <http_xfer_debug>0</http_xfer_debug>            <!-- Debugging information about network communication. -->
        <mem_usage_debug>0</mem_usage_debug>            <!-- Application memory usage. -->
        <network_status_debug>0</network_status_debug>  <!-- Network status (whether need physical connection). -->
        <priority_debug>0</priority_debug>              <!-- Changes to project scheduling priority. -->
        <poll_debug>0</poll_debug>                      <!-- Show what poll functions do. -->
        <proxy_debug>0</proxy_debug>                    <!-- Debugging information about HTTP proxy operations. -->
        <rr_simulation>0</rr_simulation>                <!-- Results of the round-robin simulation used by CPU scheduler and work-fetch. -->
        <sched_op_debug>0</sched_op_debug>              <!-- Details of scheduler RPCs; also shows deferral intervals and other low-level info. -->
        <scrsave_debug>0</scrsave_debug>                <!-- Debugging information about the screen saver. -->
        <slot_debug>0</slot_debug>                      <!-- Prints messages about allocation of slots, creating/removing files in slot dirs. -->
        <state_debug>0</state_debug>                    <!-- Show summary of client state after scheduler RPC and garbage collection -->
        <statefile_debug>0</statefile_debug>            <!-- Show when and why state file is written. -->
        <suspend_debug>0</suspend_debug>                <!-- Show details of processing and network suspend/resume. -->
        <task_debug>0</task_debug>                      <!-- Low-level details of process start/end (status codes, PIDs etc.), and when applications checkpoint. -->
        <time_debug>0</time_debug>                      <!-- Updates to on_frac, active_frac, connected_frac. -->
        <trickle_debug>0</trickle_debug>                <!-- Details of trickles. -->
        <unparsed_xml>0</unparsed_xml>                  <!-- Show any unparsed XML. -->
        <work_fetch_debug>0</work_fetch_debug>          <!-- Work fetch policy decisions. -->
    </log_flags>

    <options>
        <!-- your options go here -->
    </options>
</cc_config>

(source)
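
If editing by hand is tedious, flags could also be switched on with a few lines of Python. This is a sketch under assumptions: the function name is made up, the file is assumed to be well-formed XML with a cc_config root, and ElementTree drops the XML comments when it rewrites the text.

```python
import xml.etree.ElementTree as ET


def enable_log_flags(xml_text, flags):
    """Return cc_config XML text with the given log flags set to 1."""
    root = ET.fromstring(xml_text)
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")
    log_flags = root.find("log_flags")
    if log_flags is None:
        # No <log_flags> section yet: create one.
        log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        element = log_flags.find(name)
        if element is None:
            # Flag not listed yet: add it.
            element = ET.SubElement(log_flags, name)
        element.text = "1"
    return ET.tostring(root, encoding="unicode")
```

For the deferral question in this thread, enable_log_flags(text, ["sched_op_debug"]) would be the relevant call. After writing the result back to cc_config.xml, the client still has to re-read it, either from the manager menu mentioned above or with `boinccmd --read_cc_config`.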

Markfw · Apr 28, 2021

StefanR5R said:
Link to the result list: host 3758038

A quick look at the error results shows two types:

error after less than a minute, stderr.txt contains "Error reading and gzipping output datafile: default.out" — certainly something wrong with the workunits, because the second replica of the same WU apparently always fails in the same way

error after Mark's configured run time limit of 4 hours, stderr contains "finish file present too long" — could be a certain old (still unfixed?) bug in the client

The invalid results are of two types too:

finished after less than a minute and submitted for validation — certainly also something wrong in the workunit, since these are replicas of tasks which failed before on the other host, with an error after less than a minute, featuring the mentioned "Error reading and gzipping output datafile: default.out"

finished after Mark's configured run time limit of 4 hours — this could be a fault on this computer, because at least none of the other current 20 top hosts has invalid results of this type.

I browsed through the top 200 hosts but have not seen another EPYC Naples for comparison. @Markfw, if you connect the BMC's Ethernet port, you could log in to the IPMI web interface and have a look at the server health log, to look for anything suspicious in there.

Furthermore, if the kernel has got the EDAC driver loaded, memory faults may (or may not) show up with the following command:
grep . /sys/devices/system/edac/mc/mc*/*count

That command works, and displays all 0's, I think for the 16 RAM sticks. And these are retail 7601's. When I left this morning, I saw a bunch of "computational error" tasks with about 4 hours elapsed. Right now there are 128 tasks running with a 4 hour window, and one more hour to go.

BTW, the mention of my 4 hour time limit, where do I change that ?

Ken g6 · Apr 28, 2021

StefanR5R said:
A quick look at the error results shows two types:

error after less than a minute, stderr.txt contains "Error reading and gzipping output datafile: default.out" — certainly something wrong with the workunits, because the second replica of the same WU apparently always fails in the same way

error after Mark's configured run time limit of 4 hours, stderr contains "finish file present too long" — could be a certain old (still unfixed?) bug in the client

Could there be a problem with the drive? Try `df` to make sure you're not out of disk space. You should be able to force a FileSystem ChecK with `touch /forcefsck` and a reboot.

Markfw said:
BTW, the mention of my 4 hour time limit, where do I change that ?

https://boinc.bakerlab.org/rosetta/prefs.php?subset=project

Edit Preferences and change Target CPU Runtime.

Markfw · Apr 28, 2021

Disk usage
mark@dual-EPYC-7601:~$ df
Filesystem 1K-blocks Used Available Use% Mounted on
udev 131971272 0 131971272 0% /dev
tmpfs 26407456 2392 26405064 1% /run
/dev/nvme0n1p1 960379920 63900184 847625312 8% /
tmpfs 132037280 154232 131883048 1% /dev/shm
tmpfs 5120 0 5120 0% /run/lock
tmpfs 132037280 0 132037280 0% /sys/fs/cgroup
tmpfs 26407456 32 26407424 1% /run/user/1000

Looks like 11% used total (or less)
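
As a cross-check of the listing above, the Use% column can be recomputed from the Used and Available columns. This is a sketch that assumes df rounds the percentage up, which is consistent with every row shown; the helper names are made up for illustration.

```python
# Data rows copied from the df listing above (sizes in 1K blocks).
DF_ROWS = """\
udev 131971272 0 131971272 0% /dev
tmpfs 26407456 2392 26405064 1% /run
/dev/nvme0n1p1 960379920 63900184 847625312 8% /
tmpfs 132037280 154232 131883048 1% /dev/shm
tmpfs 5120 0 5120 0% /run/lock
tmpfs 132037280 0 132037280 0% /sys/fs/cgroup
tmpfs 26407456 32 26407424 1% /run/user/1000
"""


def use_percent(used, available):
    """Used share of (used + available), rounded up to a whole percent."""
    total = used + available
    if total == 0:
        return 0
    return -(-used * 100 // total)  # ceiling division on integers


def percent_by_mount(rows):
    """Map each mount point to its recomputed Use% value."""
    result = {}
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip anything that is not a six-column data row
        used, available, mount = int(fields[2]), int(fields[3]), fields[5]
        result[mount] = use_percent(used, available)
    return result
```

The root filesystem comes out at 8%, matching the listing, so running out of disk space looks unlikely to be the cause here.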

Markfw · Apr 28, 2021

Could the 4 hours be the reason for the aborts ? I changed it to 8 hours.. So far, so good.

Also, my new 7551 came in today to replace my 7551 ES that was at ONE GHZ. The real thing runs at 2.5 ghz. Cost me $900 for the chip and a new motherboard, but I still have the ES1 system if I want to use the electricity for one ghz.. Both systems only have 64 gig of ram and only 4 channel memory. If I decide I need 8 channels, and am never going to run the ES ever again, I will put those 4 sticks in the real 7551 retail.

crashtech · Apr 28, 2021

Markfw said:
Could the 4 hours be the reason for the aborts ? I changed it to 8 hours.. So far, so good...

That's interesting. I changed mine to "No Preference" when I was getting a lot of errors but I actually forgot I did so until now. When I fired Rosetta back up, errors were vastly reduced, and the tasks were taking 8 hours. Apparently that is the default.

StefanR5R · Apr 29, 2021

Markfw said:
Could the 4 hours be the reason for the aborts ?

This is unlikely. It should not be, and your other hosts ran fine with it. (They had error rates and error types just like the other hosts in the top 20.) Many people have used both shorter and longer target runtimes in the past without issues. However, given that there are faulty workunits in circulation currently, all bets are off.

Markfw said:
I changed it to 8 hours.. So far, so good.

Note that this increased the time required for a 1000-task buffer on a 128-thread computer to 2.6 days. That is OK if the computers run 24/7, because Rosetta's reporting deadline is 3.0 days. But soon, Pentathlon will start. :-)
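
The arithmetic behind the 2.6 days, as a small sketch. It assumes every thread is busy around the clock and every task runs for exactly the target run time; the function name is made up for illustration.

```python
def buffer_days(tasks, target_runtime_hours, threads):
    """Days needed to work through a task buffer with all threads busy 24/7."""
    total_cpu_hours = tasks * target_runtime_hours
    wall_clock_hours = total_cpu_hours / threads
    return wall_clock_hours / 24


# 1000 tasks at 8 h each on 128 threads: 62.5 h of wall-clock time.
print(round(buffer_days(1000, 8, 128), 1))
# The same buffer at the earlier 4 h target takes half as long.
print(round(buffer_days(1000, 4, 128), 1))
```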

PS, some side notes on changing the target run time:

When the target run time is changed in the project preferences on the web site, the boinc client receives this setting when the client performs its next scheduler request. This can be forced with a project update.
The new target run time will affect
- tasks which are downloaded from then on,
- already downloaded tasks which were not started yet,
- already downloaded tasks which are suspended to disk, and resumed after the scheduler request.
Vice versa, the new target run time will not affect
- tasks which are already running since before the scheduler request,
- and, I believe, tasks which are suspended to RAM and resumed after the scheduler request.
Important: The new target run time preference will not affect the client's estimation of time remaining for the tasks (already downloaded, and to be downloaded subsequently) — at least not initially.
Therefore, when the target run time is increased, the client will underestimate the time needed to complete tasks, which also means that the client will likely download more work than desired. Vice versa, when the target run time is decreased, the client will overestimate the time needed to complete tasks.
Gradually, the client will adjust its run time estimation while it observes more tasks completing with the new target time.
You can decrease or increase the target run time any time you want. But due to the mentioned problem with the client's run time estimation, modify the target run time only in small steps if you have larger work buffers, or perform the modification combined with a very short work buffer setting.
Default target run time is 8 hours, which the project admins see as a good compromise between the science to accomplish, task turnaround times, client behavior, and user expectations.
Example applications:
- Decreasing the target run time is a good alternative to aborting tasks when you finish working at Rosetta and want to turn to other projects.
- Increasing the target run time is good when you want to have low network traffic. But keep the 3 d reporting deadline in mind.
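
To illustrate the estimation problem with numbers, here is a deliberately simplified model. It is not BOINC's actual work-fetch algorithm: it assumes the client fills the buffer purely from its per-task run time estimate, and the function names are made up for illustration.

```python
def tasks_to_fill_buffer(buffer_days, threads, estimated_hours):
    """Tasks a client would fetch if it trusted its per-task run time estimate."""
    return int(buffer_days * 24 * threads / estimated_hours)


def actual_days(tasks, threads, actual_hours):
    """Days the fetched tasks really take with all threads busy 24/7."""
    return tasks * actual_hours / threads / 24


# Target raised from 4 h to 8 h, but the client still estimates 4 h per task.
fetched = tasks_to_fill_buffer(buffer_days=1, threads=128, estimated_hours=4)
print(fetched)                       # tasks fetched for a 1-day buffer
print(actual_days(fetched, 128, 8))  # days they take at the new 8 h target
```

In this model a 1-day buffer turns into 2 days of work right after the change, which is why small steps or a very short work buffer are the safer way to raise the target run time.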

Markfw · Apr 29, 2021

StefanR5R said:
This is unlikely. It should not be, and your other hosts ran fine with it. (They had error rates and error types just like the other hosts in the top 20.) Many people have used both shorter and longer target runtimes in the past without issues. However, given that there are faulty workunits in circulation currently, all bets are off.

Note that this increased the time required for a 1000 tasks buffer on a 128 threaded computer to 2.6 days. Which is OK if the computers run 24/7 because Rosetta's reporting deadline is 3.0 days. But soon, Pentathlon will start. :-)

PS, some side notes on changing the target run time:

When the target run time is changed in the project preferences on the web site, the boinc client receives this setting when the client performs its next scheduler request. This can be forced with a project update.

The new target run time will affect

tasks which are downloaded from then on,

already downloaded tasks which were not started yet,

already downloaded tasks which are suspended to disk, and resumed after the scheduler request.

Vice versa, the new target run time will not affect

tasks which are already running since before the scheduler request,

and, I believe, tasks which are suspended to RAM and resumed after the scheduler request.

Important: The new target run time preference will not affect the client's estimation of time remaining for the tasks (already downloaded, and to be downloaded subsequently) — at least not initially.
Therefore, when the target run time is increased, the client will underestimate the time needed to complete tasks, which also means that the client will likely download more work than desired. Vice versa, when the target run time is decreased, the client will overestimate the time needed to complete tasks.
Gradually, the client will adjust its run time estimation while it observes more tasks completing with the new target time.

You can decrease or increase the target run time any time you want. But due to the mentioned problem with the client's run time estimation, modify the target run time only in small steps if you have larger work buffers, or perform the modification combined with a very short work buffer setting.

Default target run time is 8 hours, which the project admins see as a good compromise between the science to accomplish, task turnaround times, client behavior, and user expectations.

Example applications:

Decreasing the target run time is a good alternative to aborting tasks when you finish working at Rosetta and want to turn to other projects.

Increasing the target run time is good when you want to have low network traffic. But keep the 3 d reporting deadline in mind.

One thought... That computer is slower by a bit than the others, Naples, and all the others are Rome.
