Temperature/GPU Usage unstable; drops align with checkpoints

Posted: Sat Feb 27, 2021 10:31 pm
by PRP_R148H

OS: Kubuntu 18.04 LTS
FAHClient: 7.6.21
Nvidia Driver: 460.39
Folding for a few hours on my 3070 yields the following trend in the GPU temperature:
Image

And GPU utilisation
Image

Each dip corresponds to a checkpoint. Even though I have set the checkpoint interval to 30 minutes, my client seems to be checkpointing every 5-6 minutes. At first I thought this might be due to write latency on my HDD (USB 3.1), but I have seen other work units on the card run much more stably, e.g. the work unit on the far left of this graph from a separate GPU (another 3070):
Image

No error logs in the FAHClient console. Possibly still working through my first 10 WUs. Can provide nvidia log dump if needed or can provide a snippet if directed.
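
One way to put numbers on the dips is to log temperature and utilisation and measure the spacing between dips. Below is a rough sketch, assuming the samples were captured with `nvidia-smi --query-gpu=temperature.gpu,utilization.gpu --format=csv,noheader,nounits -l 5`. The 5-second sampling period, the 90% threshold, the function names and the sample values are all made up for illustration and are not taken from the graphs above.

```python
def parse_samples(text):
    """Parse 'temp, util' CSV lines into a list of (temp_c, util_pct) tuples."""
    samples = []
    for line in text.strip().splitlines():
        temp, util = (int(field.strip()) for field in line.split(","))
        samples.append((temp, util))
    return samples


def dip_starts(samples, threshold=90):
    """Return the sample indices where utilisation first falls below threshold."""
    starts = []
    below = False
    for i, (_, util) in enumerate(samples):
        if util < threshold and not below:
            starts.append(i)
        below = util < threshold
    return starts


# Invented readings, for illustration only.
example = """\
62, 99
62, 98
58, 41
57, 35
61, 97
62, 99
58, 40
61, 98
"""
samples = parse_samples(example)
print(dip_starts(samples))  # [2, 6]
```

Multiplying the gap between two indices by the sampling period gives the dip spacing, which can then be compared with the checkpoint times printed in the FAHClient log.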

Thoughts?

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Sat Feb 27, 2021 10:38 pm
by JimboPalmer
Welcome to Folding@Home!

I would comment that the GPU checkpoint code is run on the CPU, not the GPU, so a drop when checkpointing makes sense.

My understanding is that the GPU checkpoint interval is set by the developer rather than by the user, unlike CPU checkpoints, where the user's setting applies.

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Sat Feb 27, 2021 10:41 pm
by PRP_R148H
Thanks for the welcome Jimbo, and for the tip about the checkpoint flag.

I'm currently folding on a 980ti as well - that card folds with both temperature and GPU utilisation as a steady line. Should I be concerned about this fluctuation on the 3070s?

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Sat Feb 27, 2021 10:56 pm
by JimboPalmer
[None of my GPUs are as fast as your slower GPU, let alone your faster one, so this is theory]

I think your 3070 is so fast that the CPU is 'slow' by comparison, while your 980ti is so 'slow' that the CPU gets its stuff done quickly.

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Sat Feb 27, 2021 11:14 pm
by Joe_H
As mentioned, the checkpoint for a GPU WU is set by the researcher. When it happens is printed in the log file.

Utilization of a card like your 3070 will depend on the size of the WU in atoms. WUs with many atoms will have a higher utilization percentage, but the checkpoint, which is done on your CPU and includes a sanity check calculation, may take longer than it does for a WU with fewer atoms.

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Sun Feb 28, 2021 12:04 am
by Neil-B
I am not seeing anything like the same level of fluctuations tbh ... I am running Asus Strix rtx3070 OC on Win 10 ... At checkpoint I see maybe a 2C drop in temp from 55C to 53C on current WU ... Monitored by HWMonitor and confirmed by GPU Tweak II ... If your GPU is having time to cool as much as it is and show the utilisation drop you are seeing I have to ask is your CPU loaded up as well? ... for a variety of reasons at the moment I am just gpu folding with the gpu supported by i9-1850K, 64GB RAM and an nvme so basically doing nothing else with kit but making sure gpu is folding effectively ... I am wondering if the gpu WU checkpointing is having to queue for resource allocation if the cpu is loaded up - this might mean the gpu has to wait much longer, both giving the gpu a chance to cool significantly and showing significant drops in utilisation (mine only showed a drop from 100% to 96% before it ramped up again)

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Sun Feb 28, 2021 1:19 am
by PRP_R148H
Thanks Neil, I've checked my CPU and RAM usage (CPU: Ryzen 5 3500X. 2 cores at 100% for the GPU threads, other 4 cores are idle. RAM: 3.3GB of 7.7GB Used, no fluctuation). The only thing I can think of is that my HDD is a USB 3.1 stick. I am noticing in my System Monitor that during checkpointing the CPU thread's usage goes to zero and the thread is put into `disk sleep` for 3-4 seconds. This could explain the dip: the GPU is waiting with nothing to do while the write completes. Maybe I should go and pick up a cheap SSD and re-install the system.
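
The `disk sleep` state shown by System Monitor can also be read directly from `/proc` on Linux, where it appears as state code `D`. A minimal sketch, assuming the PID of the folding core process is already known; the function names are made up for this example.

```python
from pathlib import Path


def parse_state(status_text):
    """Extract the one-letter state code from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split()[1]
    return None


def thread_states(pid):
    """Map thread id -> state code for every thread of a process (Linux only)."""
    states = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            states[int(task.name)] = parse_state((task / "status").read_text())
        except OSError:
            # The thread exited between listing and reading; skip it.
            continue
    return states


print(parse_state("Name:\texample\nState:\tD (disk sleep)\n"))  # D
```

Polling `thread_states(pid)` a few times per second and noting when any thread reports `D` would show whether those periods line up with the utilisation dips.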

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Sun Feb 28, 2021 1:30 am
by Neil-B
That sounds as if you have identified the issue - If the GPU is waiting on the CPU which is waiting on the USB read/writes for that long then your GPU probably thinks it is on holiday ;) ... Hope you get it sorted :) - at least SSDs/NVMe drives are not artificially inflated in price at the moment - the lunacy of gpu prices at the moment is scary :(

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Sun Feb 28, 2021 1:39 am
by PRP_R148H
Well, if we can't go on holiday right now, at least our GPUs can! I'll see how the SSD fares. Yes it's quite a farce what's happened to the GPU market. I managed to grab a pair of fairly (?) well priced 3070s from a store and as soon as I bought them, they raised the price $200 for the next batch. Wow.

Also thanks Joe_H for that explanation of the WU:atom business and how that affects checkpointing.

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Sun Feb 28, 2021 4:24 am
by bruce
As has already been said, the checkpoint frequency for GPUs is defined by the researcher. So project M may be quite different than project N.

The client's checkpoint frequency setting does apply to CPU projects. The software in the FAHCores (GROMACS vs. OpenMM) has been developed by different teams, so there are significant differences. Closer to the users are FAHControl and FAHClient, which have to support whatever is available at the interface with the FAHCore.

Question: Is your CPU loaded doing the calculation of another WU or is it idle ... free to accept the workload of doing the sanity check of the GPU's assignment? The shapes of the dips may depend on that answer.

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Sun Feb 28, 2021 12:28 pm
by ipkh
What are your complete system specs? A larger amount of RAM could help, as Linux will use it as a buffer for HDD writes. Making sure there's always a free CPU core for checkpoints would also help. USB and spinning disks aren't great for random access, but writes should be cached in RAM or in the hard disk's onboard cache.
If your concern is the temp/utilization spikes, you can use the Nvidia X-Server Settings applet to prefer maximum performance and it won't enter idle clocks.
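
The difference between a write that lands in the RAM cache and one that has to reach the device can be measured directly. Below is a rough sketch that times a buffered write separately from the `fsync` that forces it to storage. Whether the folding core actually syncs its checkpoint files is an assumption not confirmed anywhere in this thread, and the 8 MiB size and function name are arbitrary.

```python
import os
import tempfile
import time


def time_write_and_sync(directory, size_bytes=8 * 1024 * 1024):
    """Return (write_seconds, fsync_seconds) for one temporary file in directory."""
    data = os.urandom(size_bytes)
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        t0 = time.perf_counter()
        f.write(data)
        f.flush()             # hand the data to the kernel (page cache)
        t1 = time.perf_counter()
        os.fsync(f.fileno())  # block until the device reports the data written
        t2 = time.perf_counter()
    return t1 - t0, t2 - t1


write_s, sync_s = time_write_and_sync(tempfile.gettempdir())
print(f"buffered write: {write_s:.3f}s, fsync: {sync_s:.3f}s")
```

Pointing `directory` at a folder on the USB stick and then at one on another drive gives a direct comparison; a large fsync time on the stick would be consistent with the multi-second stalls described earlier.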

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Sun Feb 28, 2021 12:36 pm
by Neil-B
The OP has posted that they have probably tracked down the issue (see middle of thread) .. Using a USB disk is causing the CPU to hang for a few seconds on checkpointing (so no hard disk onboard cache to worry about), which appears to be the cause of the significant drop in temp and utilisation of the GPU ... Probably isn't worth finessing any possible solutions until this part of the equation has been resolved?

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Mon Mar 01, 2021 8:51 am
by PRP_R148H
ipkh wrote:What are your complete system specs?
Here; but no HDD - just a Gen 3.1 USB stick running kubuntu:
me wrote:I've checked my CPU and RAM usage (CPU: Ryzen 5 3500X. 2 cores at 100% for the GPU threads, other 4 cores are idle. RAM: 3.3GB of 7.7GB Used [During folding], no fluctuation).
ipkh wrote:If your concern is the temp/utilization spikes, you can use the Nvidia X-Server Settings applet to prefer maximum performance and it won't enter idle clocks.
Good thinking. I'm currently running a soft power limit with persistence mode enabled, and a moderate overclock. But the problem persists regardless of what powermizer state or manual clock I put it in.
Neil-B wrote:Probably isn't worth finessing any possible solutions until this part of the equation has been resolved?
Right :) I'm buying an SSD as we speak and I'll report back later this week when I...

when I...

..reinstall linux again. And battle with the nvidia xorg configuration files to enable coolbits correctly.

Ohno.

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Wed Mar 03, 2021 3:48 am
by MeeLee
sudo nvidia-smi -i 0 -lgc 1835,1935
Here 1835 is the minimum and 1935 the maximum GPU clock speed in MHz.
It'll force GPU speeds to remain high, so the temperature drops less.
Setting the max too high won't damage the GPU, as it'll only go as high as the driver allows.
The same applies to setting the min value too high.
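
For anyone scripting this across several cards, the command above can be built programmatically. A small sketch, with made-up helper names; the reset helper uses `-rgc`, the `nvidia-smi` option that undoes a clock lock. The clock values are simply the ones quoted above and are not a recommendation for any particular card.

```python
import shlex


def lock_clocks_cmd(gpu_index, min_mhz, max_mhz):
    """Build the nvidia-smi argument list that locks GPU clocks to a range."""
    if min_mhz > max_mhz:
        raise ValueError("min clock must not exceed max clock")
    return ["sudo", "nvidia-smi", "-i", str(gpu_index),
            "-lgc", f"{min_mhz},{max_mhz}"]


def reset_clocks_cmd(gpu_index):
    """Build the nvidia-smi argument list that removes the clock lock again."""
    return ["sudo", "nvidia-smi", "-i", str(gpu_index), "-rgc"]


# Print the commands rather than running them, so nothing changes by accident.
print(shlex.join(lock_clocks_cmd(0, 1835, 1935)))
print(shlex.join(reset_clocks_cmd(0)))
```

Passing either list to `subprocess.run(cmd, check=True)` would actually apply it.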

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Wed Mar 03, 2021 7:31 am
by bruce
Data transfer speeds are an important factor in determining the WIDTH of the dip, but so is the speed of your CPU. Offloading some processing to the CPU is generally a good practice.

What project was running, and what was the checkpoint interval?

In this example project, that's every 2% (every 4 minutes) though it will depend on the GPU speed.

19:37:08:WU00:FS01:0x22: Checkpoint write interval: 25000 steps (2%) [50 total]
19:37:08:WU00:FS01:0x22: JSON viewer frame write interval: 12500 steps (1%) [100 total]
19:37:08:WU00:FS01:0x22: XTC frame write interval: 10000 steps (0.8%) [125 total]
19:37:08:WU00:FS01:0x22: Global context and integrator variables write interval: disabled

20:44:56:WU00:FS01:0x22:Checkpoint completed at step 25000
21:16:21:WU00:FS01:0x22:Completed 37500 out of 1250000 steps (3%)
21:47:46:WU00:FS01:0x22:Completed 50000 out of 1250000 steps (4%)
21:48:45:WU00:FS01:0x22:Checkpoint completed at step 50000
22:20:09:WU00:FS01:0x22:Completed 62500 out of 1250000 steps (5%)
22:51:36:WU00:FS01:0x22:Completed 75000 out of 1250000 steps (6%)
22:52:34:WU00:FS01:0x22:Checkpoint completed at step 75000
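
The real checkpoint spacing for a given WU can be read off the log by comparing the timestamps of the "Checkpoint completed" lines. A minimal sketch, assuming the `HH:MM:SS:` prefix format shown in the excerpt above; the function name is made up, and since the log carries no date, a negative gap is treated as a single midnight rollover.

```python
import re
from datetime import datetime, timedelta

CHECKPOINT_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):.*Checkpoint completed at step (\d+)"
)


def checkpoint_gaps(log_text):
    """Return the seconds between consecutive 'Checkpoint completed' lines."""
    times = []
    for line in log_text.splitlines():
        match = CHECKPOINT_RE.match(line.strip())
        if match:
            times.append(datetime.strptime(match.group(1), "%H:%M:%S"))
    gaps = []
    for earlier, later in zip(times, times[1:]):
        delta = later - earlier
        if delta < timedelta(0):
            delta += timedelta(days=1)  # no date in the log: assume one rollover
        gaps.append(int(delta.total_seconds()))
    return gaps


log = """\
20:44:56:WU00:FS01:0x22:Checkpoint completed at step 25000
21:16:21:WU00:FS01:0x22:Completed 37500 out of 1250000 steps (3%)
21:48:45:WU00:FS01:0x22:Checkpoint completed at step 50000
22:52:34:WU00:FS01:0x22:Checkpoint completed at step 75000
"""
print(checkpoint_gaps(log))  # [3829, 3829]
```

For this particular excerpt the gaps work out to roughly 64 minutes per 2%, which illustrates how strongly the wall-clock interval depends on the speed of the hardware running the WU.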