
[SOLVED] FahCore returned: WU_STALLED (127 = 0x7f) | A100X

Posted: Thu Oct 29, 2020 2:12 pm
by benc
I'm having trouble getting the GPUs on my DGX to work, though the CPU is folding fine. It should be possible based on this thread: viewtopic.php?f=80&t=36079

head -n 200 log.txt

Code:

*********************** Log Started 2020-10-29T13:03:00Z ***********************
13:03:00:******************************* libFAH ********************************
13:03:00:         Date: Oct 20 2020
13:03:00:         Time: 20:36:41
13:03:00:     Revision: 5ca109d295a6245e2a2f590b3d0085ad5e567aeb
13:03:00:       Branch: master
13:03:00:     Compiler: GNU 4.9.4
13:03:00:      Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
13:03:00:               -funroll-loops
13:03:00:     Platform: linux2 5.8.0-1-amd64
13:03:00:         Bits: 64
13:03:00:         Mode: Release
13:03:00:****************************** FAHClient ******************************
13:03:00:      Version: 7.6.21
13:03:00:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
13:03:00:    Copyright: 2020 foldingathome.org
13:03:00:     Homepage: https://foldingathome.org/
13:03:00:         Date: Oct 20 2020
13:03:00:         Time: 20:38:59
13:03:00:     Revision: 6efbf0e138e22d3963e6a291f78dcb9c6422a278
13:03:00:       Branch: master
13:03:00:     Compiler: GNU 4.9.4
13:03:00:      Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
13:03:00:               -funroll-loops
13:03:00:     Platform: linux2 5.8.0-1-amd64
13:03:00:         Bits: 64
13:03:00:         Mode: Release
13:03:00:         Args: --config=config.xml
13:03:00:       Config: /var/lib/home/scp/tmp/folding/usr/bin/config.xml
13:03:00:******************************** CBang ********************************
13:03:00:         Date: Oct 20 2020
13:03:00:         Time: 18:38:01
13:03:00:     Revision: 7e4ce85225d7eaeb775e87c31740181ca603de60
13:03:00:       Branch: master
13:03:00:     Compiler: GNU 4.9.4
13:03:00:      Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
13:03:00:               -funroll-loops -fPIC
13:03:00:     Platform: linux2 5.8.0-1-amd64
13:03:00:         Bits: 64
13:03:00:         Mode: Release
13:03:00:******************************* System ********************************
13:03:00:          CPU: AMD EPYC 7742 64-Core Processor
13:03:00:       CPU ID: AuthenticAMD Family 23 Model 49 Stepping 0
13:03:00:         CPUs: 256
13:03:00:       Memory: 1007.70GiB
13:03:00:  Free Memory: 928.82GiB
13:03:00:      Threads: POSIX_THREADS
13:03:00:   OS Version: 5.4
13:03:00:  Has Battery: false
13:03:00:   On Battery: false
13:03:00:   UTC Offset: 0
13:03:00:          PID: 121492
13:03:00:          CWD: /var/lib/home/scp/tmp/folding/usr/bin
13:03:00:           OS: Linux 5.4.0-48-generic x86_64
13:03:00:      OS Arch: AMD64
13:03:00:         GPUs: 8
13:03:00:        GPU 0: Bus:7 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 1: Bus:15 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 2: Bus:71 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 3: Bus:78 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 4: Bus:135 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 5: Bus:144 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 6: Bus:183 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 7: Bus:189 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:CUDA Device 0: Platform:0 Device:0 Bus:7 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 1: Platform:0 Device:1 Bus:15 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 2: Platform:0 Device:2 Bus:71 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 3: Platform:0 Device:3 Bus:78 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 4: Platform:0 Device:4 Bus:135 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 5: Platform:0 Device:5 Bus:144 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 6: Platform:0 Device:6 Bus:183 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 7: Platform:0 Device:7 Bus:189 Slot:0 Compute:8.0 Driver:11.0
13:03:00:       OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
13:03:00:               libOpenCL.so: cannot open shared object file: No such file or
13:03:00:               directory
13:03:00:***********************************************************************
13:03:00:<config>
13:03:00:  <!-- Network -->
13:03:00:  <proxy v='seprivatezen.astrazeneca.net:9480'/>
13:03:00:  <proxy-enable v='true'/>
13:03:00:
13:03:00:  <!-- Slot Control -->
13:03:00:  <power v='full'/>
13:03:00:
13:03:00:  <!-- User Information -->
13:03:00:  <passkey v='*****'/>
13:03:00:  <user v='BenjaminHCCarr'/>
13:03:00:
13:03:00:  <!-- Folding Slots -->
13:03:00:  <slot id='0' type='CPU'/>
13:03:00:  <slot id='1' type='GPU'>
13:03:00:    <pci-bus v='7'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='2' type='GPU'>
13:03:00:    <pci-bus v='15'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='3' type='GPU'>
13:03:00:    <pci-bus v='71'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='4' type='GPU'>
13:03:00:    <pci-bus v='78'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='5' type='GPU'>
13:03:00:    <pci-bus v='135'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='6' type='GPU'>
13:03:00:    <pci-bus v='144'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='7' type='GPU'>
13:03:00:    <pci-bus v='183'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='8' type='GPU'>
13:03:00:    <pci-bus v='189'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:</config>
13:03:00:Trying to access database...
13:03:00:Successfully acquired database lock
13:03:00:FS00:Initialized folding slot 00: cpu:248
13:03:00:FS01:Initialized folding slot 01: gpu:7:0 GA100 [GRID A100X]
13:03:00:FS02:Initialized folding slot 02: gpu:15:0 GA100 [GRID A100X]
13:03:00:FS03:Initialized folding slot 03: gpu:71:0 GA100 [GRID A100X]
13:03:00:FS04:Initialized folding slot 04: gpu:78:0 GA100 [GRID A100X]
13:03:00:FS05:Initialized folding slot 05: gpu:135:0 GA100 [GRID A100X]
13:03:00:FS06:Initialized folding slot 06: gpu:144:0 GA100 [GRID A100X]
13:03:00:FS07:Initialized folding slot 07: gpu:183:0 GA100 [GRID A100X]
13:03:00:FS08:Initialized folding slot 08: gpu:189:0 GA100 [GRID A100X]
13:03:00:WU01:FS01:Starting
13:03:00:WU01:FS01:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 0 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU01:FS01:Started FahCore on PID 121518
13:03:00:WU01:FS01:Core PID:121522
13:03:00:WU01:FS01:FahCore 0x22 started
13:03:00:WU02:FS02:Starting
13:03:00:WU02:FS02:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 02 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 1 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU02:FS02:Started FahCore on PID 121523
13:03:00:WU02:FS02:Core PID:121527
13:03:00:WU02:FS02:FahCore 0x22 started
13:03:00:WU05:FS04:Starting
13:03:00:WU05:FS04:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 05 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 3 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU05:FS04:Started FahCore on PID 121528
13:03:00:WU05:FS04:Core PID:121532
13:03:00:WU05:FS04:FahCore 0x22 started
13:03:00:WU07:FS05:Starting
13:03:00:WU07:FS05:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 07 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 4 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU07:FS05:Started FahCore on PID 121533
13:03:00:WU07:FS05:Core PID:121537
13:03:00:WU07:FS05:FahCore 0x22 started
13:03:00:WU10:FS07:Starting
13:03:00:WU10:FS07:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 10 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 6 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU10:FS07:Started FahCore on PID 121538
13:03:00:WU10:FS07:Core PID:121542
13:03:00:WU10:FS07:FahCore 0x22 started
13:03:00:WU13:FS08:Starting
13:03:00:WU13:FS08:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 13 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 7 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU13:FS08:Started FahCore on PID 121543
13:03:00:WU13:FS08:Core PID:121547
13:03:00:WU13:FS08:FahCore 0x22 started
13:03:00:WU00:FS00:Connecting to seprivatezen.astrazeneca.net:9480
13:03:00:WU03:FS03:Connecting to seprivatezen.astrazeneca.net:9480
13:03:00:WU04:FS06:Connecting to seprivatezen.astrazeneca.net:9480
13:03:00:WARNING:WU01:FS01:FahCore returned: WU_STALLED (127 = 0x7f)
13:03:00:WARNING:WU02:FS02:FahCore returned: WU_STALLED (127 = 0x7f)
13:03:00:WARNING:WU05:FS04:FahCore returned: WU_STALLED (127 = 0x7f)
13:03:00:WARNING:WU07:FS05:FahCore returned: WU_STALLED (127 = 0x7f)
13:03:00:WARNING:WU10:FS07:FahCore returned: WU_STALLED (127 = 0x7f)
13:03:00:WARNING:WU13:FS08:FahCore returned: WU_STALLED (127 = 0x7f)
13:03:00:WU01:FS01:Starting
13:03:00:WU01:FS01:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 0 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU01:FS01:Started FahCore on PID 121548
13:03:00:WU01:FS01:Core PID:121552
13:03:00:WU01:FS01:FahCore 0x22 started
13:03:00:WU02:FS02:Starting
13:03:00:WU02:FS02:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 02 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 1 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU02:FS02:Started FahCore on PID 121553
13:03:00:WU02:FS02:Core PID:121557
13:03:00:WU02:FS02:FahCore 0x22 started
13:03:00:WU05:FS04:Starting
13:03:00:WU05:FS04:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 05 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 3 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU05:FS04:Started FahCore on PID 121558
13:03:00:WU05:FS04:Core PID:121562
13:03:00:WU05:FS04:FahCore 0x22 started
13:03:00:WU07:FS05:Starting
13:03:00:WU07:FS05:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 07 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 4 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU07:FS05:Started FahCore on PID 121563
13:03:00:WU07:FS05:Core PID:121567
13:03:00:WU07:FS05:FahCore 0x22 started
13:03:00:WU10:FS07:Starting
13:03:00:WU10:FS07:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 10 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 6 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU10:FS07:Started FahCore on PID 121568
13:03:00:WU10:FS07:Core PID:121572
13:03:00:WU10:FS07:FahCore 0x22 started
13:03:00:WU13:FS08:Starting
13:03:00:WU13:FS08:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 13 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 7 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
I would like to get the GRID A100Xs folding before we put this into production.
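With eight GPU slots restarting in a loop, it can help to condense the log into a per-slot count of failed core launches. The following is a minimal sketch, not part of FAHClient: the function name `count_core_failures` and the regular expression are assumptions based only on the `FahCore returned:` lines shown in the log above.

```python
import re
from collections import Counter

# Matches warning lines of the form seen in the log above, e.g.
# 13:03:00:WARNING:WU01:FS01:FahCore returned: WU_STALLED (127 = 0x7f)
RETURN_LINE = re.compile(
    r"WARNING:WU(?P<wu>\d+):FS(?P<slot>\d+):FahCore returned: "
    r"(?P<status>\w+) \((?P<code>\d+) = 0x[0-9a-fA-F]+\)"
)


def count_core_failures(log_lines):
    """Count 'FahCore returned' warnings, keyed by (slot, status, exit code)."""
    failures = Counter()
    for line in log_lines:
        match = RETURN_LINE.search(line)
        if match:
            key = (int(match["slot"]), match["status"], int(match["code"]))
            failures[key] += 1
    return failures


# Two lines copied from the log above, used here as sample input.
sample = [
    "13:03:00:WARNING:WU01:FS01:FahCore returned: WU_STALLED (127 = 0x7f)",
    "13:03:00:WARNING:WU02:FS02:FahCore returned: WU_STALLED (127 = 0x7f)",
]
for (slot, status, code), count in sorted(count_core_failures(sample).items()):
    print(f"slot {slot:02d}: {status} (exit code {code}) x{count}")
```

To run it against the real file, pass an open file object instead of `sample`, for example `count_core_failures(open("log.txt"))`. If every GPU slot reports the same status within the same second while the CPU slot does not, that points at something common to the GPU core rather than at individual work units.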

Re: FahCore returned: WU_STALLED (127 = 0x7f) | GRID A100X /

Posted: Thu Oct 29, 2020 2:17 pm
by benc
Reading this Reddit thread: https://www.reddit.com/r/Folding/commen ... u_stalled/

Code:

13:03:00:       OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
13:03:00:               libOpenCL.so: cannot open shared object file: No such file or
13:03:00:               directory
Will it fail with CUDA but no OpenCL?

Re: FahCore returned: WU_STALLED (127 = 0x7f) | GRID A100X /

Posted: Thu Oct 29, 2020 2:27 pm
by benc
So this is the driver I am running

And this is from the release notes: https://docs.nvidia.com/datacenter/tesl ... index.html

Code:

API Support
This release supports the following APIs:
- NVIDIA® CUDA® 11.0 for NVIDIA® Kepler™, Maxwell™, Pascal™, Volta™, Turing™ and NVIDIA Ampere architecture GPUs
- OpenGL® 4.5
- Vulkan® 1.1
- DirectX 11
- DirectX 12 (Windows 10)
- Open Computing Language (OpenCL™ software) 1.2

Re: FahCore returned: WU_STALLED (127 = 0x7f) | GRID A100X /

Posted: Sun Nov 01, 2020 6:37 am
by PantherX
Just wondering, do you have the OpenCL package installed (sudo apt-get install ocl-icd-opencl-dev)?
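A quick way to check this before and after installing the package is to try loading the library under the exact name the client log complains about, `libOpenCL.so`. This is a minimal sketch: the helper name `can_load` is made up for illustration, and it assumes the Python process sees the same library search path as the folding client, which may not hold if the client is started with a different environment.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library is missing or cannot be opened.
        return False
    return True


# 'libOpenCL.so' is the name from the log message above. The versioned
# 'libOpenCL.so.1' is checked too, since the runtime library can be present
# without the unversioned name.
for name in ("libOpenCL.so", "libOpenCL.so.1"):
    print(f"{name}: {'loadable' if can_load(name) else 'NOT loadable'}")
```

On Debian and Ubuntu the unversioned `libOpenCL.so` name is normally supplied by a development package, which would explain a system where the driver advertises OpenCL support but the client still reports it as not detected.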

Re: FahCore returned: WU_STALLED (127 = 0x7f) | GRID A100X /

Posted: Mon Nov 02, 2020 3:08 pm
by benc
Thank you @PantherX, that was the missing package!

Re: [SOLVED] FahCore returned: WU_STALLED (127 = 0x7f) | A10

Posted: Tue Nov 03, 2020 7:29 am
by PantherX
Glad that it was a simple fix! Hopefully, your 8 GPUs can be fed without issues :)

Re: [SOLVED] FahCore returned: WU_STALLED (127 = 0x7f) | A10

Posted: Thu Nov 05, 2020 2:43 pm
by benc
We're feeding two NVIDIA DGX systems, so two sets of:
- 2x AMD EPYC 7742 64-Core Processor (1 TB RAM)
- 8x A100-SXM4-40GB