Temperature/GPU Usage unstable; drops align with checkpoints

It seems that a lot of GPU problems revolve around specific versions of drivers. Though NVidia has their own support structure, you can often learn from information reported by others who fold.


Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Temperature/GPU Usage unstable; drops align with checkpo

Post by Neil-B »

PRP_R148H wrote: Well, if we can't go on holiday right now, at least our GPUs can! I'll see how the SSD fares. Yes it's quite a farce what's happened to the GPU market. I managed to grab a pair of fairly (?) well priced 3070s from a store and as soon as I bought them, they raised the price $200 for the next batch. Wow.

Also thanks Joe_H for that explanation of the WU:atom business and how that affects checkpointing.

Just a heads up ... Have been watching my temperatures/usage patterns quite carefully and can advise the following for comparison:

Checkpointing on my system appears to last maybe a second - GPU usage shows a slight dip of maybe 5% but temperature and clock speeds don't change.

At the changeover of WUs my system preloads at 99% so next WU is ready to run - GPU usage drops to effectively 0 for maybe 3 seconds (occasionally possibly 10 seconds) as the client focuses on wrapping/packing/shutting down one core and spinning the next one up ... In this time the GPU gets a bigger chance to cool and drops maybe 10C (15C with the longer pause) with clocks also spinning down for a shorter while - this drop is some 30% to 50% of the difference between folding temp and ambient.
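To put rough numbers on that last figure: if a 10C drop is 30% to 50% of the gap between folding temperature and ambient, the gap itself works out to roughly 20C to 33C. A quick sketch of the arithmetic, assuming the percentages refer to the usual 10C drop and not the 15C one (the function name is purely illustrative):

```python
def implied_gap(drop_c, fraction_of_gap):
    """Gap between folding temp and ambient implied by a drop that is
    the given fraction of that gap (assumption: the fraction applies
    to this particular drop)."""
    return drop_c / fraction_of_gap

# A 10C drop at the two ends of the quoted 30%-50% range
smallest_gap = implied_gap(10, 0.50)
largest_gap = implied_gap(10, 0.30)
print(f"gap is roughly {smallest_gap:.0f}C to {largest_gap:.0f}C above ambient")
```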

Now my kit is fast and cools well - the changeover shows that temp drops do happen if the GPU is not loaded, but the minimal drops I am seeing on checkpointing imply it doesn't need to happen to the same extent as you have seen - the pausing that you have spotted will undoubtedly be a major factor in the spiking you have observed ... hopefully the move from a USB install to an SSD install resolved the majority of this ... Do let us know how this goes/has gone.
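For anyone wanting to make the same comparison on their own kit, something along these lines would do. It is only a sketch, not how I captured the numbers above: it assumes an NVIDIA card with nvidia-smi on the PATH, and the function names and the 50% threshold are made up for illustration. It samples usage, temperature and SM clock once a second and reports how long usage stayed low.

```python
import subprocess
import time

# Standard nvidia-smi --query-gpu properties
QUERY = "utilization.gpu,temperature.gpu,clocks.sm"

def parse_sample(line):
    """Turn one CSV line such as '97, 64, 1905' into a dict of ints.
    Assumes the card reports all three values (no '[N/A]' fields)."""
    util, temp, clock = (int(field.strip()) for field in line.split(","))
    return {"util_pct": util, "temp_c": temp, "sm_mhz": clock}

def read_gpu(index=0):
    """Query one GPU once via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={index}", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])

def find_dips(samples, threshold_pct=50):
    """Return (first, last) sample indexes of each run below the threshold."""
    dips, start = [], None
    for i, sample in enumerate(samples):
        low = sample["util_pct"] < threshold_pct
        if low and start is None:
            start = i
        elif not low and start is not None:
            dips.append((start, i - 1))
            start = None
    if start is not None:
        dips.append((start, len(samples) - 1))
    return dips

def monitor(seconds=60, index=0):
    """Sample once a second, then print the length of each usage dip."""
    samples = []
    for _ in range(seconds):
        samples.append(read_gpu(index))
        time.sleep(1)
    for first, last in find_dips(samples):
        coolest = min(s["temp_c"] for s in samples[first:last + 1])
        print(f"dip of about {last - first + 1}s, lowest temp {coolest}C")
    return samples
```

Calling monitor(600) across a checkpoint or a WU changeover should show whether the dips last a second or so, or much longer. One-second sampling can miss very short dips altogether.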
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Temperature/GPU Usage unstable; drops align with checkpo

Post by bruce »

Those dips are normal. There will always be dips, but the time spent in each dip depends on both the speed of the HD and the speed of your CPU.

Here's what's happening:
(1) The energy level of the WU computed by the GPU is checked against the energy level as computed by your CPU. (While this check is happening, progress toward 100% is briefly suspended.) If they differ by a large amount, something is wrong and the calculation will be aborted.

Errors are always possible, but they're relatively rare if your kit is functioning correctly and not overclocked. If an error has occurred, you really don't want to continue computing something that is certain to fail. Completing the rest of the WU before aborting the calculation and getting a notification would be a bad plan.

(2) The state of the WU is stored on disk so that if something goes wrong during the next segment of the computation (including a pause, which isn't really something "wrong") the computation can resume from that state.

You can PAUSE the calculation at any time and you really don't want to have to start the calculation from the beginning again. The checkpoints need to be frequent enough that you only have to repeat a small part of the calculation -- that part computed since the last checkpoint was stored.
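
The two steps can be sketched roughly as follows. This is an illustration only, not the actual FAHCore code: the tolerance, the JSON file format and every function name here are assumptions chosen to keep the example small. It checks the GPU energy against the CPU energy, aborts on a large mismatch, and otherwise writes the state to disk so that a later run resumes from it.

```python
import json
import os
import tempfile

def energies_agree(gpu_energy, cpu_energy, rel_tol=1e-3):
    """True if the two energies differ by no more than rel_tol (relative).
    The tolerance is a made-up value for illustration."""
    scale = max(abs(gpu_energy), abs(cpu_energy), 1.0)
    return abs(gpu_energy - cpu_energy) <= rel_tol * scale

def save_checkpoint(path, state):
    """Write state to a temp file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # waiting on the disk: a slow drive costs time here
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return the last stored state, or a fresh one if none exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)

def run(path, total_steps, checkpoint_every, gpu_energy_fn, cpu_energy_fn):
    """Advance in segments; verify and store the state after each one."""
    step = load_checkpoint(path)["step"]
    while step < total_steps:
        step = min(step + checkpoint_every, total_steps)  # stand-in for GPU work
        # Step (1): progress is suspended while the energies are compared
        if not energies_agree(gpu_energy_fn(step), cpu_energy_fn(step)):
            raise RuntimeError(f"energy mismatch at step {step}; aborting")
        # Step (2): store the state so a pause or failure loses one segment at most
        save_checkpoint(path, {"step": step})
    return step
```

If the run is paused or killed partway, calling run again with the same path repeats only the segment since the last stored checkpoint. A smaller checkpoint_every means less repeated work but more frequent dips.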