18601 checkpoints too often

Posted: Wed Nov 16, 2022 2:09 am
by Alex_Atkin
I'm noticing a waste of GPU resources on 18601, as it checkpoints every 25000 steps. On a 4090 that's under a minute; on a 3080 it's about every 2 minutes. Each checkpoint seems to take a few seconds, which surely adds up to a lot of wasted time over 24 hours.
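
As a rough back-of-envelope check of that claim, here is a small sketch. The 3-second pause and the 60 s / 120 s intervals are assumed round figures standing in for "a few seconds", "under a minute" and "about every 2 minutes"; they are not measurements, and `checkpoint_overhead` is just an illustrative helper.

```python
def checkpoint_overhead(interval_s: float, pause_s: float, hours: float = 24.0):
    """Return (fraction of wall time lost, minutes lost over `hours`).

    interval_s: compute time between two checkpoints (assumed value)
    pause_s:    time the GPU sits idle per checkpoint (assumed value)
    """
    cycle_s = interval_s + pause_s      # one compute stretch plus one pause
    fraction = pause_s / cycle_s        # share of wall time spent paused
    minutes_lost = fraction * hours * 60.0
    return fraction, minutes_lost


# Assumed figures only: 3 s pause, ~60 s (4090) and ~120 s (3080) between checkpoints.
for name, interval in (("4090", 60.0), ("3080", 120.0)):
    frac, lost = checkpoint_overhead(interval, 3.0)
    print(f"{name}: {frac:.1%} of wall time, roughly {lost:.0f} min per 24 h")
```

With those assumptions the faster card loses the larger share, since the pause is fixed while the interval shrinks.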

Why have the option to set the checkpointing frequency if it's ignored?

Conversely, I'm not seeing any checkpointing in the logs at all for aarch64 WUs, although the checkpoint files do seem to be written when I look in the data folder.

Re: 18601 checkpoints too often

Posted: Wed Nov 16, 2022 2:37 am
by Joe_H
The checkpoints on GPU projects are set by the researcher. They happen when important data is collected and retained for later analysis after the WU is returned. That is also when a sanity check is run on the CPU against the data computed up to that point, to verify that the GPU is calculating properly. This was found necessary because unstable GPUs may not give any indication that there are errors in the processing of the WU data.

The algorithms used in the CPU processing cores based on GROMACS are different and can be interrupted on a timed basis. In the latest versions they also will attempt to write out a checkpoint when folding is paused. The OpenMM code used in the GPU folding core needs to be interrupted at certain points to be able to write out a usable checkpoint.

Re: 18601 checkpoints too often

Posted: Wed Nov 16, 2022 6:28 am
by Alex_Atkin
Thanks, that's obviously more important than getting it done a little faster.

Re: 18601 checkpoints too often

Posted: Wed Nov 16, 2022 7:21 pm
by toTOW
The checkpoint interval is usually set so that low-end GPUs don't lose too much compute time when they are interrupted ...

Re: 18601 checkpoints too often

Posted: Thu Nov 17, 2022 10:46 am
by PaulTV
Would be nice if:
- Checkpoints could be written without interrupting calculations (dunno how hard that would be if at all possible), or
- There would be something like 'if last checkpoint was within x minutes, skip this one', with default of 5 or 15m, configurable with an advanced setting - that way there are still checkpoints on whole percentages but it would auto adjust to the speed of the card
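
The second idea could be sketched roughly like this. It is only an illustration of the "skip if the last one was recent" rule: `CheckpointThrottle`, `min_interval_s` and the injectable `clock` are hypothetical names, not anything in the client or the cores.

```python
import time


class CheckpointThrottle:
    """Skip a scheduled checkpoint if the previous one was written too recently."""

    def __init__(self, min_interval_s: float = 300.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s  # e.g. 5 min default, user-configurable
        self._clock = clock                   # injectable so it can be tested
        self._last_written = None             # nothing written yet

    def should_write(self) -> bool:
        """Call at each scheduled checkpoint; True means write it now."""
        now = self._clock()
        recent = (
            self._last_written is not None
            and now - self._last_written < self.min_interval_s
        )
        if recent:
            return False
        self._last_written = now
        return True


# Usage with a fake clock: scheduled checkpoints arrive every 2 minutes.
fake_now = [0.0]
throttle = CheckpointThrottle(300.0, clock=lambda: fake_now[0])
for t in (0.0, 120.0, 240.0, 360.0):
    fake_now[0] = t
    print(t, throttle.should_write())
```

Since the sanity check is apparently done at checkpoint time, a rule like this would presumably have to decide whether a skipped checkpoint also skips that check.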

I know, most effort is put in building the new client, so this might end up somewhere on the backlog with lower priority

Re: 18601 checkpoints too often

Posted: Wed Nov 23, 2022 8:50 pm
by toTOW
Unfortunately, the OpenMM core used on GPUs doesn't support triggered checkpoints: it only works at a predefined frequency. OpenMM also performs checks (we call them sanity checks) between data computed on the GPU and data computed on the CPU before it writes a checkpoint, which explains why there are some interruptions in GPU load.
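
To illustrate the shape of that loop (this is a toy model, not the actual core code): the run stops every fixed number of steps, compares a GPU value with a CPU value, and only then records a checkpoint. The function name, the use of an "energy" value as the compared quantity, and the tolerance are all assumptions made for the example.

```python
def run_with_checkpoints(total_steps, checkpoint_every, gpu_energy, cpu_energy,
                         rel_tol=1e-3):
    """Return the step numbers at which a checkpoint would be written.

    gpu_energy / cpu_energy are hypothetical callables (step -> float)
    standing in for the same quantity computed on each device.
    """
    written = []
    for step in range(checkpoint_every, total_steps + 1, checkpoint_every):
        # ... `checkpoint_every` steps of GPU work would run here ...
        gpu, cpu = gpu_energy(step), cpu_energy(step)
        # Sanity check first: refuse to checkpoint if the devices disagree.
        if abs(gpu - cpu) > rel_tol * max(abs(cpu), 1.0):
            raise RuntimeError(f"sanity check failed at step {step}")
        written.append(step)  # stand-in for writing the checkpoint file
    return written


# Example: a 100000-step run checkpointing every 25000 steps.
print(run_with_checkpoints(100000, 25000, lambda s: -1000.0, lambda s: -1000.0))
```

The point is only that the checkpoint positions are fixed by the step count, so nothing outside the loop can ask for one at an arbitrary moment.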

The GROMACS core used on CPUs is more flexible: you can set the checkpoint frequency, and it can write a checkpoint when the core is interrupted.