GRID A100X not using full power

Posted: Mon Sep 07, 2020 8:06 am
by luckenbach
I've noticed that my DGX A100 is not using the full power of its GRID A100X cards, and I wondered if this is an OpenCL/NVIDIA driver thing. Here is my nvidia-smi output from inside the FAH container.

Code:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   48C    P0   168W / 400W |    482MiB / 40537MiB |     91%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                    0 |
| N/A   44C    P0   192W / 400W |    536MiB / 40537MiB |     91%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:47:00.0 Off |                    0 |
| N/A   46C    P0   194W / 400W |    536MiB / 40537MiB |     94%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:4E:00.0 Off |                    0 |
| N/A   46C    P0   196W / 400W |    536MiB / 40537MiB |     91%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                    0 |
| N/A   60C    P0   224W / 400W |    476MiB / 40537MiB |     80%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:00.0 Off |                    0 |
| N/A   59C    P0   253W / 400W |    588MiB / 40537MiB |     98%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:B7:00.0 Off |                    0 |
| N/A   60C    P0   186W / 400W |    488MiB / 40537MiB |     86%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:BD:00.0 Off |                    0 |
| N/A   55C    P0   204W / 400W |    536MiB / 40537MiB |     94%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Re: GRID A100X not using full power

Posted: Mon Sep 07, 2020 8:54 am
by PantherX
Welcome to the F@H Forum luckenbach,

Can you please post the log file? Ensure you include the first 100 lines which will inform us of what the system configuration is and what the client settings are. If you require guidance, please view this topic: viewtopic.php?f=24&t=26036

Please note that on high-end GPUs, if you're folding a small WU, the power consumption will be low since the GPU isn't being fully utilized. Moreover, F@H uses only specific GPU components, so the unused components draw little power and less wattage is recorded. If you provide the log files as above, we can shed more light on your situation :)
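To put numbers on the gap between power draw and GPU utilization, here is a minimal Python sketch (not part of F@H or the original posts). It assumes nvidia-smi is queried with `--query-gpu=index,power.draw,power.limit,utilization.gpu --format=csv,noheader,nounits`; the helper names `parse_gpu_rows` and `query_gpus` are hypothetical, and the sample rows are simply GPU 0 and GPU 5 from the nvidia-smi table earlier in this thread.

```python
import subprocess

# Fields requested from nvidia-smi; the order must match the unpacking below.
QUERY_FIELDS = "index,power.draw,power.limit,utilization.gpu"


def parse_gpu_rows(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts.

    Assumes every field is numeric; a GPU that reports "[N/A]" for power
    would need extra handling that this sketch leaves out.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        index, draw, limit, util = (field.strip() for field in line.split(","))
        rows.append({
            "index": int(index),
            "power_draw_w": float(draw),
            "power_limit_w": float(limit),
            "gpu_util_pct": float(util),
            # Fraction of the power cap actually being drawn.
            "power_fraction": float(draw) / float(limit),
        })
    return rows


def query_gpus():
    """Run nvidia-smi on the local machine and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_rows(result.stdout)


# Sample rows taken from the table posted above (GPU 0 and GPU 5).
SAMPLE = """\
0, 168.00, 400.00, 91
5, 253.00, 400.00, 98
"""

if __name__ == "__main__":
    for gpu in parse_gpu_rows(SAMPLE):
        print(f"GPU {gpu['index']}: {gpu['gpu_util_pct']:.0f}% util, "
              f"{gpu['power_fraction']:.0%} of power cap")
```

On a machine with the NVIDIA driver installed, calling `query_gpus()` instead of parsing `SAMPLE` would report live values. A high GPU-Util figure next to a low share of the power cap is consistent with the explanation above: the GPU is busy, but not all of its components are being exercised.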

Re: GRID A100X not using full power

Posted: Mon Sep 07, 2020 8:06 pm
by luckenbach

Code:

21:39:41:Trying to access database...
21:39:41:Successfully acquired database lock
21:39:41:Read GPUs.txt
21:39:41:Enabled folding slot 00: READY cpu:248
21:39:41:Enabled folding slot 01: READY gpu:0:GA100 [GRID A100X]
21:39:41:Enabled folding slot 02: READY gpu:1:GA100 [GRID A100X]
21:39:41:Enabled folding slot 03: READY gpu:2:GA100 [GRID A100X]
21:39:41:Enabled folding slot 04: READY gpu:3:GA100 [GRID A100X]
21:39:41:Enabled folding slot 05: READY gpu:4:GA100 [GRID A100X]
21:39:41:Enabled folding slot 06: READY gpu:5:GA100 [GRID A100X]
21:39:41:Enabled folding slot 07: READY gpu:6:GA100 [GRID A100X]
21:39:41:Enabled folding slot 08: READY gpu:7:GA100 [GRID A100X]
21:39:41:****************************** FAHClient ******************************
21:39:41:        Version: 7.6.13
21:39:41:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
21:39:41:      Copyright: 2020 foldingathome.org
21:39:41:       Homepage: https://foldingathome.org/
21:39:41:           Date: Apr 28 2020
21:39:41:           Time: 04:20:16
21:39:41:       Revision: 5a652817f46116b6e135503af97f18e094414e3b
21:39:41:         Branch: master
21:39:41:       Compiler: GNU 8.3.0
21:39:41:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
21:39:41:                 -funroll-loops -fno-pie
21:39:41:       Platform: linux2 4.19.0-5-amd64
21:39:41:           Bits: 64
21:39:41:           Mode: Release
21:39:41:           Args: --chdir /fah
21:39:41:         Config: /fah/config.xml
21:39:41:******************************** CBang ********************************
21:39:41:           Date: Apr 25 2020
21:39:41:           Time: 00:07:53
21:39:41:       Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
21:39:41:         Branch: master
21:39:41:       Compiler: GNU 8.3.0
21:39:41:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
21:39:41:                 -funroll-loops -fno-pie -fPIC
21:39:41:       Platform: linux2 4.19.0-5-amd64
21:39:41:           Bits: 64
21:39:41:           Mode: Release
21:39:41:******************************* System ********************************
21:39:41:            CPU: AMD EPYC 7742 64-Core Processor
21:39:41:         CPU ID: AuthenticAMD Family 23 Model 49 Stepping 0
21:39:41:           CPUs: 256
21:39:41:         Memory: 1007.70GiB
21:39:41:    Free Memory: 971.79GiB
21:39:41:        Threads: POSIX_THREADS
21:39:41:     OS Version: 5.4
21:39:41:    Has Battery: false
21:39:41:     On Battery: false
21:39:41:     UTC Offset: 0
21:39:41:            PID: 1
21:39:41:            CWD: /fah
21:39:41:             OS: Linux 5.4.0-45-generic x86_64
21:39:41:        OS Arch: AMD64
21:39:41:           GPUs: 8
21:39:41:          GPU 0: Bus:7 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
21:39:41:          GPU 1: Bus:15 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
21:39:41:          GPU 2: Bus:71 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
21:39:41:          GPU 3: Bus:78 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
21:39:41:          GPU 4: Bus:135 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
21:39:41:          GPU 5: Bus:144 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
21:39:41:          GPU 6: Bus:183 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
21:39:41:          GPU 7: Bus:189 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
21:39:41:  CUDA Device 0: Platform:0 Device:0 Bus:7 Slot:0 Compute:8.0 Driver:11.0
21:39:41:  CUDA Device 1: Platform:0 Device:1 Bus:15 Slot:0 Compute:8.0 Driver:11.0
21:39:41:  CUDA Device 2: Platform:0 Device:2 Bus:71 Slot:0 Compute:8.0 Driver:11.0
21:39:41:  CUDA Device 3: Platform:0 Device:3 Bus:78 Slot:0 Compute:8.0 Driver:11.0
21:39:41:  CUDA Device 4: Platform:0 Device:4 Bus:135 Slot:0 Compute:8.0 Driver:11.0
21:39:41:  CUDA Device 5: Platform:0 Device:5 Bus:144 Slot:0 Compute:8.0 Driver:11.0
21:39:41:  CUDA Device 6: Platform:0 Device:6 Bus:183 Slot:0 Compute:8.0 Driver:11.0
21:39:41:  CUDA Device 7: Platform:0 Device:7 Bus:189 Slot:0 Compute:8.0 Driver:11.0
21:39:41:OpenCL Device 0: Platform:0 Device:0 Bus:7 Slot:0 Compute:1.2 Driver:450.51
21:39:41:OpenCL Device 1: Platform:0 Device:1 Bus:15 Slot:0 Compute:1.2 Driver:450.51
21:39:41:OpenCL Device 2: Platform:0 Device:2 Bus:71 Slot:0 Compute:1.2 Driver:450.51
21:39:41:OpenCL Device 3: Platform:0 Device:3 Bus:78 Slot:0 Compute:1.2 Driver:450.51
21:39:41:OpenCL Device 4: Platform:0 Device:4 Bus:135 Slot:0 Compute:1.2 Driver:450.51
21:39:41:OpenCL Device 5: Platform:0 Device:5 Bus:144 Slot:0 Compute:1.2 Driver:450.51
21:39:41:OpenCL Device 6: Platform:0 Device:6 Bus:183 Slot:0 Compute:1.2 Driver:450.51
21:39:41:OpenCL Device 7: Platform:0 Device:7 Bus:189 Slot:0 Compute:1.2 Driver:450.51
21:39:41:******************************* libFAH ********************************
21:39:41:           Date: Apr 15 2020
21:39:41:           Time: 21:43:24
21:39:41:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
21:39:41:         Branch: master
21:39:41:       Compiler: GNU 8.3.0
21:39:41:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
21:39:41:                 -funroll-loops -fno-pie
21:39:41:       Platform: linux2 4.19.0-5-amd64
21:39:41:           Bits: 64
21:39:41:           Mode: Release
21:39:41:***********************************************************************
21:39:41:<config>
21:39:41:  <!-- HTTP Server -->
21:39:41:  <allow v='10.0.0.0/8'/>
21:39:41:
21:39:41:  <!-- Slot Control -->
21:39:41:  <power v='full'/>
21:39:41:
21:39:41:  <!-- User Information -->
21:39:41:  <team v='238525'/>
21:39:41:  <user v='zhilliard'/>
21:39:41:
21:39:41:  <!-- Web Server -->
21:39:41:  <web-allow v='10.0.0.0/8'/>
21:39:41:
21:39:41:  <!-- Folding Slots -->
21:39:41:  <slot id='0' type='CPU'/>
21:39:41:  <slot id='1' type='GPU'/>
21:39:41:  <slot id='2' type='GPU'/>
21:39:41:  <slot id='3' type='GPU'/>
21:39:41:  <slot id='4' type='GPU'/>
21:39:41:  <slot id='5' type='GPU'/>
21:39:41:  <slot id='6' type='GPU'/>
21:39:41:  <slot id='7' type='GPU'/>
21:39:41:  <slot id='8' type='GPU'/>
21:39:41:</config>
21:39:41:WU05:FS05:Starting
21:39:41:WU05:FS05:Running FahCore: /usr/bin/FAHCoreWrapper /fah/cores/cores.foldingathome.org/lin/64bit/22-0.0.11/Core_22.fah/FahCore_22 -dir 05 -suffix 01 -version 706 -lifeline 1 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 4 -cuda-device 4 -gpu 4
21:39:41:WU05:FS05:Started FahCore on PID 15
21:39:41:WU05:FS05:Core PID:19
21:39:41:WU05:FS05:FahCore 0x22 started
21:39:41:WU01:FS01:Starting
21:39:41:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /fah/cores/cores.foldingathome.org/lin/64bit/22-0.0.11/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 1 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
21:39:41:WU01:FS01:Started FahCore on PID 22
21:39:41:WU01:FS01:Core PID:26
21:39:41:WU01:FS01:FahCore 0x22 started
21:39:41:WU04:FS04:Starting
21:39:41:WU04:FS04:Running FahCore: /usr/bin/FAHCoreWrapper /fah/cores/cores.foldingathome.org/lin/64bit/22-0.0.11/Core_22.fah/FahCore_22 -dir 04 -suffix 01 -version 706 -lifeline 1 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 3 -cuda-device 3 -gpu 3
21:39:41:WU04:FS04:Started FahCore on PID 29
21:39:41:WU04:FS04:Core PID:33
21:39:41:WU04:FS04:FahCore 0x22 started
21:39:42:WU08:FS08:Starting
21:39:42:WU08:FS08:Running FahCore: /usr/bin/FAHCoreWrapper /fah/cores/cores.foldingathome.org/lin/64bit/22-0.0.11/Core_22.fah/FahCore_22 -dir 08 -suffix 01 -version 706 -lifeline 1 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 7 -cuda-device 7 -gpu 7
21:39:42:WU08:FS08:Started FahCore on PID 36
21:39:42:WU08:FS08:Core PID:40
21:39:42:WU08:FS08:FahCore 0x22 started
21:39:42:WU07:FS07:Starting
21:39:42:WU07:FS07:Running FahCore: /usr/bin/FAHCoreWrapper /fah/cores/cores.foldingathome.org/lin/64bit/22-0.0.11/Core_22.fah/FahCore_22 -dir 07 -suffix 01 -version 706 -lifeline 1 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 6 -cuda-device 6 -gpu 6
21:39:42:WU07:FS07:Started FahCore on PID 43
21:39:42:WU07:FS07:Core PID:47
21:39:42:WU07:FS07:FahCore 0x22 started
21:39:42:WU06:FS06:Starting
21:39:42:WU06:FS06:Running FahCore: /usr/bin/FAHCoreWrapper /fah/cores/cores.foldingathome.org/lin/64bit/22-0.0.11/Core_22.fah/FahCore_22 -dir 06 -suffix 01 -version 706 -lifeline 1 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 5 -cuda-device 5 -gpu 5
21:39:42:WU06:FS06:Started FahCore on PID 50
21:39:42:WU06:FS06:Core PID:54
21:39:42:WU06:FS06:FahCore 0x22 started
21:39:42:WU03:FS03:Starting
21:39:42:WU03:FS03:Running FahCore: /usr/bin/FAHCoreWrapper /fah/cores/cores.foldingathome.org/lin/64bit/22-0.0.11/Core_22.fah/FahCore_22 -dir 03 -suffix 01 -version 706 -lifeline 1 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 2 -cuda-device 2 -gpu 2
21:39:42:WU03:FS03:Started FahCore on PID 57
21:39:42:WU03:FS03:Core PID:61
21:39:42:WU03:FS03:FahCore 0x22 started
21:39:42:WU02:FS02:Starting
21:39:42:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /fah/cores/cores.foldingathome.org/lin/64bit/22-0.0.11/Core_22.fah/FahCore_22 -dir 02 -suffix 01 -version 706 -lifeline 1 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 1 -cuda-device 1 -gpu 1
21:39:42:WU02:FS02:Started FahCore on PID 64
21:39:42:WU02:FS02:Core PID:68
21:39:42:WU02:FS02:FahCore 0x22 started
21:39:42:WU00:FS00:Starting
21:39:42:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /fah/cores/cores.foldingathome.org/lin/64bit-avx-256/a7-0.0.19/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 706 -lifeline 1 -checkpoint 15 -np 248
21:39:42:WU00:FS00:Started FahCore on PID 71
21:39:42:WU00:FS00:Core PID:75
21:39:42:WU00:FS00:FahCore 0xa7 started
21:39:42:WU05:FS05:0x22:*********************** Log Started 2020-09-06T21:39:41Z ***********************
21:39:42:WU05:FS05:0x22:*************************** Core22 Folding@home Core ***************************
21:39:42:WU05:FS05:0x22:       Core: Core22
21:39:42:WU05:FS05:0x22:       Type: 0x22
21:39:42:WU05:FS05:0x22:    Version: 0.0.11
21:39:42:WU05:FS05:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
21:39:42:WU05:FS05:0x22:  Copyright: 2020 foldingathome.org
21:39:42:WU05:FS05:0x22:   Homepage: https://foldingathome.org/
21:39:42:WU05:FS05:0x22:       Date: Jun 27 2020
21:39:42:WU05:FS05:0x22:       Time: 22:50:00
21:39:42:WU05:FS05:0x22:   Revision: cfc2940c5dd1aa80f60daa6e28d4a2a417f74edb
21:39:42:WU05:FS05:0x22:     Branch: core22-0.0.11
21:39:42:WU05:FS05:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
21:39:42:WU05:FS05:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
21:39:42:WU05:FS05:0x22:             -funroll-loops
21:39:42:WU05:FS05:0x22:   Platform: linux2 4.19.76-linuxkit
21:39:42:WU05:FS05:0x22:       Bits: 64
21:39:42:WU05:FS05:0x22:       Mode: Release
21:39:42:WU05:FS05:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
21:39:42:WU05:FS05:0x22:             <peastman@stanford.edu>
21:39:42:WU05:FS05:0x22:       Args: -dir 05 -suffix 01 -version 706 -lifeline 15 -checkpoint 15
21:39:42:WU05:FS05:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 4 -cuda-device
21:39:42:WU05:FS05:0x22:             4 -gpu 4
21:39:42:WU05:FS05:0x22:************************************ libFAH ************************************
21:39:42:WU05:FS05:0x22:       Date: Jun 27 2020
21:39:42:WU05:FS05:0x22:       Time: 22:11:04
21:39:42:WU05:FS05:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
21:39:42:WU05:FS05:0x22:     Branch: HEAD
21:39:42:WU05:FS05:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
21:39:42:WU05:FS05:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
21:39:42:WU05:FS05:0x22:             -funroll-loops
21:39:42:WU05:FS05:0x22:   Platform: linux2 4.19.76-linuxkit
21:39:42:WU05:FS05:0x22:       Bits: 64
21:39:42:WU05:FS05:0x22:       Mode: Release
21:39:42:WU05:FS05:0x22:************************************ CBang *************************************
21:39:42:WU05:FS05:0x22:       Date: Jun 27 2020
21:39:42:WU05:FS05:0x22:       Time: 22:10:11
21:39:42:WU05:FS05:0x22:   Revision: f8529962055b0e7bde23e429f5072ff758089dee

Re: GRID A100X not using full power

Posted: Mon Sep 07, 2020 8:07 pm
by luckenbach
That is the first 200 lines of the logs. I will note that I ended up splitting the CPU into 16-core slots, leaving around 32 cores for 'not folding' tasks.

Re: GRID A100X not using full power

Posted: Tue Sep 08, 2020 2:03 am
by JohnChodera
Hi!

We're super thrilled to have powerful GPUs like A100X pitching in!

The current 134xx projects for the COVID Moonshot perform well on the A100, but the systems aren't quite large enough to fill up the entire GPU to achieve full utilization. We're making some improvements in core22 over the next few weeks that you should keep an eye out for that will improve utilization substantially, and eventually we will be able to allow multiple simulations to take advantage of Ampere's multi-task scheduling to completely fill up the whole GPU.

I think you're the first volunteer with an A100 I've worked with, so if you'd like to be pulled into testing the latest core22 variant, send me a private message with your email and we'll rope you in!

Thanks again for contributing, and for bearing with us!

~ John Chodera // MSKCC

Re: GRID A100X not using full power

Posted: Tue Sep 08, 2020 9:29 pm
by FaaR
JohnChodera wrote: The current 134xx projects for the COVID Moonshot perform well on the A100, but the systems aren't quite large enough to fill up the entire GPU to achieve full utilization. We're making some improvements in core22 over the next few weeks that you should keep an eye out for that will improve utilization substantially
Hello John! :)

This is sort of off-topic I'm afraid (sorry sorry!), but I was wondering if you could quickly say whether these utilization improvements would also apply to the AMD Vega architecture. I think my Vegas should be able to give more than the approx. 1.2M PPD they currently yield, even with the large Moonshot WUs you're handing out nowadays.

Thank you!

Re: GRID A100X not using full power

Posted: Tue Sep 08, 2020 10:33 pm
by JohnChodera
@FaaR: A few bugfixes for AMD Vegas will go into core22 0.0.12, so we're hoping that you'll see improved PPD too!