
Warning: Rosetta@home beta MAY make your computer reboot.

Markfw

Moderator Emeritus, Elite Member
I have had the problem with my 7950x3d for days now, and it just happened to my 9654 RETAIL, which has never rebooted since I bought it; it only shut down when we lost power. The 9654 is also running Linux.

So it's on hold on both boxes for now. The 7950x3d never made it past 12 hours with this issue, so let's see what happens.
 
Reboot due to a userspace application only happens due to hardware defects.
Kernel code however, such as device driver kernel modules or virtualization kernel modules, can crash the kernel of course.

Is Rosetta Beta a native application or a virtualized one?
 
Reboot due to a userspace application only happens due to hardware defects.
Kernel code however, such as device driver kernel modules or virtualization kernel modules, can crash the kernel of course.

Is Rosetta Beta a native application or a virtualized one?
It's native. I can't believe that a real server running Linux would have these problems. Since removing Rosetta, neither box has rebooted. The 7950x3d lasted about 12 hours between reboots, and the interval was pretty fixed. It's been over 12 hours now, and no reboot.
 
It's native. I can't believe that a real server running Linux would have these problems. Since removing Rosetta, neither box has rebooted. The 7950x3d lasted about 12 hours between reboots, and the interval was pretty fixed. It's been over 12 hours now, and no reboot.
Also, the server that rebooted has 192 GB of RAM. The 7950x3d has 16 GB.
 
My first thought when I saw your post was that it ran out of memory. How much memory per task for the beta tasks? You're at 1 GB per and 0.5 GB per. I've seen Rosetta use more than that per task.
 
My first thought when I saw your post was that it ran out of memory. How much memory per task for the beta tasks? You're at 1 GB per and 0.5 GB per. I've seen Rosetta use more than that per task.
Well, it says 252 meg for 14 tasks on the one box still running it (a 7950x, BTW). The rest are WCG. I never checked the 7950x3d. Not going to again.
 
My first thought when I saw your post was that it ran out of memory.
In that case the Linux OOM killer takes down the most memory-hungry process(es). System processes whose takedown could cause a reboot ('init' comes to mind) are unlikely to get killed in such a situation.

How much memory per task for the beta tasks? You're at 1 GB per and 0.5 GB per. I've seen Rosetta use more than that per task.
Rosetta Beta memory footprint, from WUProp participants:
https://wuprop.boinc-af.org/results/projet.py?projet=Rosetta@home&application=Rosetta+Beta
It's less than Rosetta classic currently.
https://wuprop.boinc-af.org/results/projet.py?projet=Rosetta@home&application=Rosetta

But it's still an order of magnitude more than WCG MCM:
https://wuprop.boinc-af.org/results...unity+Grid&application=Mapping+Cancer+Markers

Perhaps Rosetta Beta's memory footprint and/or memory access patterns tripped some RAM defects (or RAM VRM defects, or CPU IMC defects) which MCM does not.
Perhaps a ≥24h Memtest86 run is advisable.

I can't believe that a real server in linux would have these problems.
Server hardware (mainboard, memory, PSU, et cetera) and server firmware² can be defective too. It's not an IBM mainframe.¹

________
¹) More correctly, mainframes can be defective as well. It's just that they have even more built-in RAS features than servers, so that mainframes keep functioning error-free despite partial defects.
²) Also more correctly than saying "server BIOSes can be defective" is to say that "server BIOSes are defective". It's rather a question of whether or not the particular bugs of the respective BIOS are affecting the server user.

________
PS: Apropos RAM, regardless of whether or not it is the culprit here: mainstream server hardware is designed on the assumption that there is fast airflow over the RAM and the mainboard, including the RAM VRMs.
 
Clearly the beta software is doing something illegal that is even tripping Linux. Could be some undocumented direct hardware access which both Windows and Linux have no idea how to handle so they are panicking and rebooting to prevent damage.

One idea could be to run Rosetta without admin/root access.
 
Could be some undocumented direct hardware access
No, not without special user privileges.

One idea could be to run Rosetta without admin/root access.
The BOINC client, and thereby the science tasks which it launches, runs as the unprivileged 'boinc' pseudo user by default.

Running BOINC as root is not wise in general, as it downloads and starts executables from project servers on the internet.

And what of 12 channel 6000 CL30 ? mine)
What do you have in the crashing server?

EPYC 9654 is specified for DDR5-4800. Kingston "server premier", Micron, Samsung, SK Hynix offer DDR5-4800 RDIMMs with CL40. Kingston "Fury Renegade Pro" (read: factory OC'd) is CL36.
 
What do you have in the crashing server?

EPYC 9654 is specified for DDR5-4800. Kingston "server premier", Micron, Samsung, SK Hynix offer DDR5-4800 RDIMMs with CL40. Kingston "Fury Renegade Pro" (read: factory OC'd) is CL36.
In the 7950x3d, it's 6000 CL30 using EXPO.

In the 9654 that rebooted, it's 4800 registered ECC. As I said, it's gone years without a reboot, running 100% load all that time. Until this Rosetta beta.
 
I'm not sure why you're running beta and expecting stability.

Those two terms are generally mutually exclusive.
 
I'm not sure why you're running beta and expecting stability.
A beta application could crash itself, and the BOINC client finishes the task with an error status.
If the computer reboots, the computer has got a hardware defect. (Or it has got a bug in the BIOS or kernel or kernel modules. 3rd party kernel modules, i.e. ones which are not maintained by Torvalds & friends themselves, are the most likely to have quality issues. But Rosetta beta is unlikely to use special buggy kernel calls.)

In the 9654 that rebooted, it's 4800 registered ECC.
Standard "green" sticks without heatsink, CL40, 1.1V? Or Kingston Fury oc'd sticks?

it's gone years without a reboot, running 100% load all that time. Until this Rosetta beta.
That was then and this is now. Not all defects occur right away when a computer is still new.
The differences between Rosetta beta and WCG MCM seem to lie in memory footprint, and hence in memory access patterns, and perhaps also in CPU power use. PrimeGrid applications are of course much more power hungry, but they in turn have a lower memory footprint and a much more even instruction profile than a physics application like Rosetta.

Were the computers able to write a kernel panic message into the system log before reboot?
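To make that question concrete, here is a minimal Python sketch of how one might look for traces of a panic, a machine check, or an OOM kill in the kernel log of the previous boot. It assumes a systemd machine with a persistent journal, so that `journalctl -k -b -1` returns the previous boot's kernel messages. The regular expressions are illustrative guesses at typical kernel wording, not a complete list, and the function names are made up for this example. Keep in mind that a sudden hardware reset often leaves nothing in the log at all, so an empty result does not prove anything.

```python
import re
import subprocess

# Illustrative patterns only (assumption): typical wording of kernel messages,
# not an exhaustive or authoritative list.
PATTERNS = {
    "kernel_panic": re.compile(r"Kernel panic", re.IGNORECASE),
    "machine_check": re.compile(r"mce:|Machine check|Hardware Error", re.IGNORECASE),
    "oom_kill": re.compile(r"Out of memory|invoked oom-killer|oom-kill", re.IGNORECASE),
}


def classify_lines(lines):
    """Sort log lines into the categories above; a line may match several."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


def previous_boot_kernel_log():
    """Return kernel messages of the previous boot as a list of lines.

    Assumes systemd-journald with persistent storage; returns an empty
    list if journalctl has nothing to report.
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.splitlines()
```

Calling `classify_lines(previous_boot_kernel_log())` on the affected box after a reboot would show whether the kernel managed to write anything before going down. Machine check entries would point toward hardware, OOM entries toward the memory theory discussed earlier.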

Because Rosetta has no app selection.
I guess one could write an app_info.xml which includes only vanilla Rosetta. But this wouldn't repair the two unstable computers. Besides, if the affected computers received both vanilla and beta tasks, we don't actually know which of the two crashed the computers.
 
If both systems use the exact same PSU, that could be a factor. Maybe the Rosetta beta client is causing some sort of unusually high power spike that the PSU cannot handle reliably.
 
I don't know if the possibility of the task being run as root has been definitively excluded. I know that I used to run multiple instances as root before I knew better, so it's not unheard of.
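
One way to settle the root question on a running box is to check who owns the BOINC and Rosetta processes. Below is a small Python sketch that parses the output of `ps -eo user,comm`. The name fragments "boinc" and "rosetta" are assumptions about what the process names contain, so adjust them to whatever `ps` actually shows on your system. The function names are hypothetical.

```python
import subprocess


def processes_owned_by(ps_output, name_fragments, user="root"):
    """Return command names owned by `user` that contain any given fragment.

    `ps_output` is the text printed by `ps -eo user,comm`, header included.
    """
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        owner, command = parts[0], parts[1].strip()
        lowered = command.lower()
        if owner == user and any(f.lower() in lowered for f in name_fragments):
            matches.append(command)
    return matches


def current_ps_output():
    """Run `ps -eo user,comm` and return its output as text."""
    result = subprocess.run(
        ["ps", "-eo", "user,comm"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

If `processes_owned_by(current_ps_output(), ["boinc", "rosetta"])` returns an empty list while tasks are running, nothing matching those fragments is running as root.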
 