WCG problems


crashtech

Lifer
Latest: WU Distribution Update

We are working towards resuming a consistent WU supply similar to what we had before the storage system failure. The recent sparsity of OPN1 WUs was caused by a batch that blocked the create-work process for all other projects. We have found and fixed the glitch, and the system is busy creating work for OPN1 right now. We still have an ARP1 backlog of unsent results (see the ARP project update), but we now have spare capacity for a larger backlog. After OPN1 work units are prepared, the system will prepare ARP1 work units.

On the back end, we still had to finalize the setup of the new storage, as a networking issue was preventing us from accessing the tape archive. The data center admins have helped us fix it, and the production system on the new storage is now being backed up.

We continue to investigate the errors in the BOINC system services, specifically assimilators and validators. Unfortunately, the application is written such that an unexpected error halts the service (which happened when our storage system failed). We are attempting to clear out the problematic data to allow the applications to continue processing other results, but BOINC doesn't seem to have an easy method of flushing specific workunits or results out of its system.
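For readers wondering what "flushing" a workunit would involve: cancelling a stuck workunit server-side generally comes down to editing the BOINC project database directly, which is roughly what BOINC's cancel-jobs admin operation does. Below is a minimal sketch of that idea, assuming MySQL access via pymysql, the standard BOINC server schema (workunit.error_mask, result.server_state, result.outcome), and placeholder connection settings and workunit IDs; the exact cleanup WCG needs may well differ.

```python
# Hedged sketch: cancel a list of problematic workunits directly in the BOINC
# project database, so the transitioner/assimilator can move past them.
# Assumptions: MySQL access via pymysql, standard BOINC server schema,
# placeholder connection settings and workunit IDs -- adjust for the real project.
import pymysql

WU_ERROR_CANCELLED = 16          # bit in workunit.error_mask (BOINC convention)
RESULT_SERVER_STATE_UNSENT = 2   # result not yet sent to a host
RESULT_SERVER_STATE_OVER = 5     # result retired
RESULT_OUTCOME_DIDNT_NEED = 5    # result was not needed

def cancel_workunits(conn, wu_ids):
    """Mark workunits as cancelled and retire their unsent results."""
    with conn.cursor() as cur:
        fmt = ",".join(["%s"] * len(wu_ids))
        # Flag the workunits themselves as cancelled.
        cur.execute(
            f"UPDATE workunit SET error_mask = error_mask | %s WHERE id IN ({fmt})",
            [WU_ERROR_CANCELLED, *wu_ids],
        )
        # Retire results that were never sent, so they don't go out to hosts.
        cur.execute(
            f"UPDATE result SET server_state = %s, outcome = %s "
            f"WHERE server_state = %s AND workunitid IN ({fmt})",
            [RESULT_SERVER_STATE_OVER, RESULT_OUTCOME_DIDNT_NEED,
             RESULT_SERVER_STATE_UNSENT, *wu_ids],
        )
    conn.commit()

if __name__ == "__main__":
    # Placeholder credentials and hypothetical workunit IDs.
    conn = pymysql.connect(host="localhost", user="boincadm",
                           password="...", database="wcg_project")
    cancel_workunits(conn, [123456, 123457])
```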

If you have any comments or questions, please leave them in this thread for us to answer. Thank you for your support, patience and understanding.

WCG team


StefanR5R

Elite Member
I received ARP1 work (it's the only WCG subproject I currently have selected), but its large result files are uploading very slowly, considerably slower than my own slowish internet uplink would allow. It's a combination of a generally low transfer rate and occasional transient HTTP errors.

It's a good thing that the client eventually stops requesting more work when there are too many uploads in progress (per project). Right now, though, this client-side stopper isn't even triggered, because the server-side ARP1 work supply appears to have dried up again already. (This subproject has always submitted work in waves rather than continuously.)
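If you want to keep an eye on how many uploads are actually pending at any moment, the client's transfer list can be polled from the command line. A rough sketch, assuming boinccmd is installed and allowed to talk to the local client, and that its text output contains one "direction: upload" line per pending upload (field names can differ slightly between client versions):

```python
# Hedged sketch: count in-progress uploads reported by the local BOINC client.
# Assumptions: boinccmd is on PATH, the client accepts RPC from this machine,
# and each pending upload is listed with a "direction: upload" line.
import subprocess

def count_pending_uploads() -> int:
    """Return the number of file transfers the client reports as uploads."""
    out = subprocess.run(
        ["boinccmd", "--get_file_transfers"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in out.splitlines()
               if line.strip().startswith("direction:") and "upload" in line)

if __name__ == "__main__":
    print(f"pending uploads: {count_pending_uploads()}")
```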