project:17435 run:0 clone:1286 gen:36 failing on v7.4.4

Posted: Tue Feb 16, 2021 1:04 am
by wuffy68
Ubuntu 16.04.1 SMP x86_64
Intel Core i7 2.8 GHz
nVidia 960 GTX [GM206]
Driver 384.130
Folding Client 7.4.4

project:17435 run:0 clone:1286 gen:36 causes the FAH service to crash after reaching ~0.04% complete. Something similar happened on another work unit on Saturday, forcing me to dump the WU. After that, the client ran well for a couple of days, but now the same problem is back.

syslog:

Feb 15 17:33:05 curecoinproject1 kernel: [    0.194735] acpi PNP0A08:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
Feb 15 17:33:05 curecoinproject1 kernel: [    1.316545] nvidia: module verification failed: signature and/or required key missing - tainting kernel
Feb 15 17:33:07 curecoinproject1 thermald[947]: THD engine start failed
Feb 15 17:33:15 curecoinproject1 NetworkManager[1116]: nm_device_get_device_type: assertion 'NM_IS_DEVICE (self)' failed
Feb 15 17:33:15 curecoinproject1 NetworkManager[1116]: <warn>  [1613435595.8848] failed to enumerate oFono devices: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name org.ofono was not provided by any .service files
Feb 15 17:33:19 curecoinproject1 nm-dispatcher: req:2 'up' [docker0], "/etc/NetworkManager/dispatcher.d/01ifupdown": complete: failed with Script '/etc/NetworkManager/dispatcher.d/01ifupdown' exited with error status 1.
Feb 15 17:33:19 curecoinproject1 NetworkManager[1116]: <warn>  [1613435599.9162] dispatcher: (3) 01ifupdown failed (failed): Script '/etc/NetworkManager/dispatcher.d/01ifupdown' exited with error status 1.
Feb 15 17:33:24 curecoinproject1 nm-dispatcher: req:3 'up' [enp2s0], "/etc/NetworkManager/dispatcher.d/01ifupdown": complete: failed with Script '/etc/NetworkManager/dispatcher.d/01ifupdown' exited with error status 1.
Feb 15 17:33:24 curecoinproject1 NetworkManager[1116]: <warn>  [1613435604.1199] dispatcher: (5) 01ifupdown failed (failed): Script '/etc/NetworkManager/dispatcher.d/01ifupdown' exited with error status 1.
Feb 15 17:33:34 curecoinproject1 fwupd[2735]: (fwupd:2735): Fu-WARNING **: FuMain: failed to load AppStream data: Failed to parse /var/cache/app-info/xmls/fwupd.xml file: Error on line 2672: Entity did not end with a semicolon; most likely you used an ampersand character without intending to start an entity - escape ampersand as &amp;
Feb 15 17:33:34 curecoinproject1 fwupd[2735]: (fwupd:2735): Fu-WARNING **: disabling plugin because: failed to coldplug uefi: UEFI firmware updating not supported
Feb 15 17:33:34 curecoinproject1 fwupd[2735]: (fwupd:2735): Fu-WARNING **: disabling plugin because: failed to coldplug raspberrypi: Raspberry PI firmware updating not supported, no /boot/start.elf
Feb 15 17:33:52 curecoinproject1 pulseaudio[1964]: [pulseaudio] bluez5-util.c: GetManagedObjects() failed: org.freedesktop.DBus.Error.TimedOut: Failed to activate service 'org.bluez': timed out
Feb 15 17:34:49 curecoinproject1 pulseaudio[1964]: [pulseaudio] module-x11-bell.c: XkbQueryExtension() failed
Feb 15 17:34:49 curecoinproject1 pulseaudio[1964]: [pulseaudio] module.c: Failed to load module "module-x11-bell" (argument: "display=:10.0 sample=bell.ogg"): initialization failed.

FAH log:

N/A for that time period - appears to have rolled

I realize this is an old build, old GPU and old driver ... but I figured it's worth reporting.

Thank you,

wuffy68

Re: project:17435 run:0 clone:1286 gen:36 failing on v7.4.4

Posted: Tue Feb 16, 2021 4:15 am
by PantherX
It is strongly recommended to use version 7.6.21, since FahCore_22 has some new arguments that the older clients do not support, so the simplest fix is to update the client. Since you are on Ubuntu 16, I think it can handle Python 2 without issues, so the upgrade should be easy.

BTW, you can have up to 16 previous logs in the logs folder by default so you can check the file in there if needed :)

Re: project:17435 run:0 clone:1286 gen:36 failing on v7.4.4

Posted: Tue Feb 16, 2021 6:29 am
by wuffy68
PantherX wrote: BTW, you can have up to 16 previous logs in the logs folder by default so you can check the file in there if needed :)
Thanks - yea, I haven't looked at Linux logs for a while (found them in /var/lib/fahclient/logs) ... looks like a "BAD_FRAME_CHECKSUM" upon restart, and the work unit auto-dumped in this case. Both failures came from project 17435.

01:02:03:WU01:FS01:0x22:Project: 17435 (Run 0, Clone 1286, Gen 36)
01:02:03:WU01:FS01:0x22:Unit: 0x00000000000000000000000000000000
01:02:03:WU01:FS01:0x22:Digital signatures verified
01:02:03:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
01:02:03:WU01:FS01:0x22:Version 0.0.13
01:02:03:WU01:FS01:0x22:  Checkpoint write interval: 25000 steps (2%) [50 total]
01:02:03:WU01:FS01:0x22:  JSON viewer frame write interval: 12500 steps (1%) [100 total]
01:02:03:WU01:FS01:0x22:  XTC frame write interval: 10000 steps (0.8%) [125 total]
01:02:03:WU01:FS01:0x22:  Global context and integrator variables write interval: disabled
01:02:03:WU01:FS01:0x22:No -opencl-device specified; using deprecated -gpu argument as an alias for -opencl-device.
01:02:03:WU01:FS01:0x22:Please consider upgrading your client version.
01:02:03:WU01:FS01:0x22:There are 3 platforms available.
01:02:03:WU01:FS01:0x22:Platform 0: Reference
01:02:03:WU01:FS01:0x22:Platform 1: CPU
01:02:03:WU01:FS01:0x22:Platform 2: OpenCL
01:02:03:WU01:FS01:0x22:  opencl-device 0 specified
01:02:05:WU00:FS00:0xa7:Completed 366522 out of 500000 steps (73%)
01:02:33:WU01:FS01:0x22:Attempting to create OpenCL context:
01:02:33:WU01:FS01:0x22:  Configuring platform OpenCL
01:02:41:Removing old file 'configs/config-20200713-054312.xml'
01:02:41:Saving configuration to /etc/fahclient/config.xml
01:02:41:<config>
01:02:41:  <!-- Network -->
01:02:41:  <proxy v=':8080'/>
01:02:41:
01:02:41:  <!-- Slot Control -->
01:02:41:  <pause-on-battery v='false'/>
01:02:41:  <power v='full'/>
01:02:41:
01:02:41:  <!-- User Information -->
01:02:41:  <passkey v='********************************'/>
01:02:41:  <team v='43573'/>
01:02:41:  <user v='Ivan_Tuma'/>
01:02:41:
01:02:41:  <!-- Folding Slots -->
01:02:41:  <slot id='0' type='CPU'/>
01:02:41:  <slot id='1' type='GPU'/>
01:02:41:</config>
01:02:55:WU01:FS01:0x22:  Using OpenCL on platformId 0 and gpu 0
01:02:55:WU01:FS01:0x22:ERROR:Guru Meditation #0.baef2504129c7209 (0.42744404) '01/01/checkpoint'
^[[93m01:02:55:WARNING:WU01:FS01:FahCore returned: BAD_FRAME_CHECKSUM (112 = 0x70)^[[0m
^[[93m01:02:55:WARNING:WU01:FS01:Fatal error, dumping^[[0m
01:02:55:WU01:FS01:Sending unit results: id:01 state:SEND error:DUMPED project:17435 run:0 clone:1286 gen:36 core:0x22 unit:0x00000506000000240000441b00000000
01:02:55:WU01:FS01:Connecting to 206.223.170.146:8080
01:02:56:WU01:FS01:Server responded WORK_ACK (400)
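
For anyone else digging through rotated logs for the same failure, here is a rough Python sketch that pulls every "FahCore returned" line out of a set of log files. It is only an illustration: the regular expressions are modelled on the excerpt above, the helper names `find_core_returns` and `scan_log_dir` are made up for this post, and it assumes every regular file in the logs folder is a plain-text log.

```python
import re
from pathlib import Path

# Strips ANSI colour sequences. Handles a real ESC byte as well as the
# caret notation "^[" that shows up when a log is pasted as text.
ANSI_RE = re.compile(r"(?:\x1b|\^\[)\[[0-9;]*m")

# Modelled on the excerpt above, e.g.
# "01:02:55:WARNING:WU01:FS01:FahCore returned: BAD_FRAME_CHECKSUM (112 = 0x70)"
RETURN_RE = re.compile(
    r"(?P<time>\d\d:\d\d:\d\d):(?:WARNING:)?(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"FahCore returned: (?P<status>\w+) \((?P<code>\d+) = 0x[0-9a-fA-F]+\)"
)


def find_core_returns(lines):
    """Return (time, wu, slot, status, code) for each 'FahCore returned' line."""
    results = []
    for line in lines:
        clean = ANSI_RE.sub("", line)
        match = RETURN_RE.search(clean)
        if match:
            results.append(
                (
                    match["time"],
                    match["wu"],
                    match["slot"],
                    match["status"],
                    int(match["code"]),
                )
            )
    return results


def scan_log_dir(log_dir):
    """Map each file name in log_dir to the core return lines found in it.

    Assumption: every regular file in the directory is a text log.
    Files without any matching line are left out of the result.
    """
    hits = {}
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            found = find_core_returns(handle)
        if found:
            hits[path.name] = found
    return hits
```

On my machine the call would be `scan_log_dir("/var/lib/fahclient/logs")`, which may need root depending on how the folder permissions are set. Filtering the result for a status such as `BAD_FRAME_CHECKSUM` then shows which work units were affected.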

Re: project:17435 run:0 clone:1286 gen:36 failing on v7.4.4

Posted: Tue Feb 16, 2021 6:50 am
by PantherX
Generally speaking, one cause of that could be a faulty disk drive: the core reads the checkpoint data to resume, and if that data fails the checksum, that's a strong indication that something is off. Check that your filesystem is healthy and repair any issues that are found. Also check that your drive (HDD/SSD) is operating within normal parameters.
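
As a quick first pass before running proper filesystem and drive diagnostics, the syslog can be scanned for kernel messages that usually accompany storage trouble. The Python sketch below is illustrative only: the pattern list is a small hand-picked set of strings the Linux kernel commonly logs for block-device and ext4 errors, not an exhaustive one, and `storage_error_lines` is a hypothetical helper name. An empty result does not prove the drive is healthy.

```python
import re

# Illustrative, non-exhaustive patterns for kernel storage error messages.
STORAGE_PATTERNS = re.compile(
    r"I/O error|EXT4-fs error|Medium Error|failed command: (?:READ|WRITE)",
    re.IGNORECASE,
)


def storage_error_lines(lines):
    """Return the kernel log lines that match one of the storage patterns."""
    matches = []
    for line in lines:
        # Only consider kernel messages, so unrelated "failed" lines from
        # NetworkManager, fwupd and the like are ignored.
        if " kernel: " not in line:
            continue
        if STORAGE_PATTERNS.search(line):
            matches.append(line.rstrip("\n"))
    return matches


def scan_syslog(path="/var/log/syslog"):
    """Read a syslog file and return its suspicious kernel lines.

    Assumption: the default path is the usual Ubuntu location.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        return storage_error_lines(handle)
```

None of the lines in the syslog excerpt posted at the top of this thread match these patterns, but that excerpt only covers boot, so scanning the full file around the time of the crash would be more informative.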