FahCore returned: FAILED_2 (1 = 0x1) when running a22 WUs?

Posted: Mon Dec 06, 2021 3:55 pm
by novosirj
Hi there,

We often run F@H on systems that need simulated load or what have you, since it makes productive use of equipment that's being tested for another reason. I'm seeing this on one such system, however, and am not sure what to make of it. Any ideas? These are RTX 2080 Ti cards with driver 470.74. The F@H client is 7.6.21. I know this setup did work before, and I've seen the occasional work unit succeed in the last couple of days I've been running (CPU WUs are running fine):

Code:

11:44:42:WU07:FS07:Starting
11:44:42:WU07:FS07:Running FahCore: /usr/bin/FAHCoreWrapper /scratch/novosirj/FAH/16880643_14/cores/cores.foldingathome.org/lin/64bit/22-0.0.18/Core_22.fah/FahCore_22 -dir 07 -suffix 01 -version 706 -lifeline 35953 -checkpoint 1 -opencl-platform 0 -opencl-device 6 -cuda-device 6 -gpu-vendor nvidia -gpu 6 -gpu-usage 100
11:44:42:WU07:FS07:Started FahCore on PID 36477
11:44:42:WU07:FS07:Core PID:36481
11:44:42:WU07:FS07:FahCore 0x22 started
11:44:42:WARNING:WU07:FS07:FahCore returned: FAILED_2 (1 = 0x1)
11:45:42:WU07:FS07:Starting
11:45:42:WU07:FS07:Running FahCore: /usr/bin/FAHCoreWrapper /scratch/novosirj/FAH/16880643_14/cores/cores.foldingathome.org/lin/64bit/22-0.0.18/Core_22.fah/FahCore_22 -dir 07 -suffix 01 -version 706 -lifeline 35953 -checkpoint 1 -opencl-platform 0 -opencl-device 6 -cuda-device 6 -gpu-vendor nvidia -gpu 6 -gpu-usage 100
11:45:42:WU07:FS07:Started FahCore on PID 36609
11:45:42:WU07:FS07:Core PID:36613
11:45:42:WU07:FS07:FahCore 0x22 started
11:45:43:WARNING:WU07:FS07:FahCore returned: FAILED_2 (1 = 0x1)
11:45:43:WARNING:WU07:FS07:Too many errors, failing
11:45:43:WU07:FS07:Sending unit results: id:07 state:SEND error:FAILED project:18201 run:44260 clone:0 gen:25 core:0x22 unit:0x0000000000000019000047190000ace4
11:45:43:WU07:FS07:Connecting to 128.252.203.11:8080
11:45:43:WU07:FS07:Server responded WORK_ACK (400)
11:45:43:WU07:FS07:Cleaning up
I don't really see any more information anywhere. Just FWIW, the reason I'm running this on this machine is that I suspect a problem with one of the GPUs (falling off the bus), but there's no indication that that is causing this problem.

Re: FahCore returned: FAILED_2 (1 = 0x1) when running a22 WU

Posted: Mon Dec 06, 2021 6:19 pm
by toTOW
If you try to start the core manually from a terminal, you'll get a more detailed error.

See the global announcement about core 22 v0.0.18: viewtopic.php?f=24&t=37391
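
To start the core manually as suggested above, the exact command line can be lifted from the "Running FahCore:" line in the client log. Here is a minimal Python sketch that does this. The helper name `fahcore_argv` is made up for illustration, and dropping the `FAHCoreWrapper` prefix to call the core binary directly is an assumption, not something confirmed in this thread. Since `-dir 07` is a relative path, the command presumably has to be run from the client's work directory.

```python
import shlex

MARKER = "Running FahCore: "

def fahcore_argv(log_line):
    """Return the FahCore command from a client log line as an argv list.

    Returns an empty list if the line does not contain a FahCore command.
    """
    _, found, command = log_line.partition(MARKER)
    return shlex.split(command) if found else []

# Log line copied from the first post in this thread.
line = (
    "11:44:42:WU07:FS07:Running FahCore: /usr/bin/FAHCoreWrapper "
    "/scratch/novosirj/FAH/16880643_14/cores/cores.foldingathome.org"
    "/lin/64bit/22-0.0.18/Core_22.fah/FahCore_22 "
    "-dir 07 -suffix 01 -version 706 -lifeline 35953 -checkpoint 1 "
    "-opencl-platform 0 -opencl-device 6 -cuda-device 6 "
    "-gpu-vendor nvidia -gpu 6 -gpu-usage 100"
)

argv = fahcore_argv(line)

# Assumption: skip the wrapper (argv[0]) and print the core binary plus its
# arguments as a shell-safe string that can be pasted into a terminal.
print(shlex.join(argv[1:]))
```

Pasting the printed command into a terminal on the affected machine should show whatever the core writes to stderr before it exits, which the client log does not capture.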

Re: FahCore returned: FAILED_2 (1 = 0x1) when running a22 WU

Posted: Mon Dec 06, 2021 7:53 pm
by novosirj
Thanks. My guess is I'll need to build a new container with a newer version of the OS with newer GLIBC support. I may have used CentOS 7.x for my current container.

Re: FahCore returned: FAILED_2 (1 = 0x1) when running a22 WU

Posted: Tue Dec 07, 2021 12:27 pm
by toTOW
I confirm that the glibc version in CentOS 7 is too old ... :(
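
For anyone wanting to check this on their own system, here is a rough Python sketch that compares the glibc version of the host (or container) with the newest `GLIBC_x.y` symbol version tag found in a binary. It is the same heuristic as running `strings` on the file and grepping for `GLIBC_`. The function names are made up for illustration, and the sample bytes below are synthetic, not taken from the real FahCore_22 binary. For reference, CentOS 7 ships glibc 2.17 and CentOS 8 ships glibc 2.28.

```python
import platform
import re

# Matches symbol version tags such as GLIBC_2.17 or GLIBC_2.2.5
# (GLIBC_PRIVATE has no digits and is ignored).
GLIBC_TAG = re.compile(rb"GLIBC_(\d+)\.(\d+)(?:\.(\d+))?")

def newest_glibc_tag(binary):
    """Return the highest GLIBC version tag in the bytes as a tuple, or None."""
    versions = [
        tuple(int(part) for part in match.groups() if part)
        for match in GLIBC_TAG.finditer(binary)
    ]
    return max(versions, default=None)

def host_glibc():
    """Return the running system's glibc version as a tuple, or None."""
    name, version = platform.libc_ver()
    if name != "glibc" or not version:
        return None
    return tuple(int(part) for part in version.split("."))

# Synthetic example data, NOT the contents of the real core binary.
sample = b"\x00GLIBC_2.2.5\x00GLIBC_2.17\x00GLIBC_2.27\x00GLIBC_PRIVATE\x00"
needed = newest_glibc_tag(sample)
have = host_glibc()

print("newest tag in sample:", needed)
print("host glibc:", have)
if needed and have:
    print("host is new enough:", have >= needed)
```

To check a real file, pass `open(path, "rb").read()` to `newest_glibc_tag`, where `path` points at the core binary inside the client's `cores` directory. A byte scan like this is only a heuristic, so running the core from a terminal remains the more reliable test.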

Re: FahCore returned: FAILED_2 (1 = 0x1) when running a22 WU

Posted: Tue Dec 07, 2021 2:38 pm
by PaulTV
CentOS 7 (and RHEL 7) doesn't even have Python 3 in the default repo... RH releases are more conservative than carrot-haired 70-yo presidents.

Re: FahCore returned: FAILED_2 (1 = 0x1) when running a22 WU

Posted: Tue Dec 07, 2021 5:57 pm
by Neil-B
PaulTV wrote: CentOS 7 (and RHEL 7) doesn't even have Python 3 in the default repo... RH releases are more conservative than carrot-haired 70-yo presidents.
... since FaH currently doesn't need Python 3 they are a perfect match ;)

Re: FahCore returned: FAILED_2 (1 = 0x1) when running a22 WU

Posted: Wed Dec 08, 2021 12:34 pm
by toTOW
No, because CentOS 7 has a glibc implementation that is too old for core 22 v0.0.18 ... :(

Re: FahCore returned: FAILED_2 (1 = 0x1) when running a22 WU

Posted: Thu Dec 09, 2021 1:48 am
by novosirj
Generating a new Singularity container (that's how I currently run FaH on our clusters) that uses CentOS 8 solved the problem with no other changes.

It seems like kind of a shame, but I guess it's true that most of the target audience for this software isn't running legacy-ish enterprise systems (and we have a solution for that anyway).