
AWS EC2 - backup checkpoints directory

Posted: Sun Feb 06, 2022 8:19 pm
by ouhman
Hello,

I am currently allocating some budget to fold on AWS with EC2. Sometimes the instance gets killed, and I would like to back up the work directory so that I can resume the work when I start another EC2 instance.

I tried backing up the /var/lib/fahclient/work directory, but I get errors after syncing it back and starting FAH.

Any help would be appreciated. Thank you!

Re: AWS EC2 - backup checkpoints directory

Posted: Mon Feb 07, 2022 12:54 am
by gunnarre
Are you backing up the /var/lib/fahclient/cores directory and /var/lib/fahclient/GPUs.txt too?
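
A minimal sketch of a backup that covers all three items, assuming the data directory is /var/lib/fahclient as in the first post. The function name and the archive location are made up for illustration, and nothing here is part of the FAH client itself:

```python
import tarfile
from pathlib import Path

# Items named in this thread, relative to the FAH data directory.
ITEMS = ("work", "cores", "GPUs.txt")


def backup_fah_state(data_dir, archive_path, items=ITEMS):
    """Pack the listed items into a gzip tarball and return the names stored.

    Items that do not exist in data_dir are skipped rather than treated
    as an error.
    """
    data_dir = Path(data_dir)
    stored = []
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for name in items:
            src = data_dir / name
            if src.exists():
                # arcname keeps paths in the archive relative to data_dir.
                tar.add(str(src), arcname=name)
                stored.append(name)
    return stored
```

You would call it with something like backup_fah_state("/var/lib/fahclient", "/tmp/fah-backup.tar.gz"), where the second path is only a placeholder.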

Re: AWS EC2 - backup checkpoints directory

Posted: Mon Feb 07, 2022 9:04 am
by ouhman
gunnarre wrote:Are you backing up the /var/lib/fahclient/cores directory and /var/lib/fahclient/GPUs.txt too?
In the end I backed up the entire /var/lib/fahclient directory, and it still doesn't work. This is the error I am getting:

Code:

09:03:01:Trying to access database...
09:03:01:Successfully acquired database lock
09:03:01:WARNING:FS01:Guessing ambiguous GPU to OpenCL device mapping for 01: gpu:0:30 TU104GL [Tesla T4].  Consider upgrading your graphics driver or manually setting ``opencl-index`` in this slot's configuration.
09:03:01:FS01:Initialized folding slot 01: gpu:0:30 TU104GL [Tesla T4]
09:03:01:WU00:FS01:Starting
09:03:01:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit/22-0.0.20/Core_22.fah/FahCore_22 -dir 00 -suffix 01 -version 706 -lifeline 4262 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
09:03:01:WU00:FS01:Started FahCore on PID 4274
09:03:01:WU00:FS01:Core PID:4278
09:03:01:WU00:FS01:FahCore 0x22 started
09:03:01:WARNING:WU00:FS01:FahCore returned: FAILED_3 (255 = 0xff)
09:03:01:WU00:FS01:Starting
09:03:01:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit/22-0.0.20/Core_22.fah/FahCore_22 -dir 00 -suffix 01 -version 706 -lifeline 4262 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
09:03:01:WU00:FS01:Started FahCore on PID 4279
09:03:01:WU00:FS01:Core PID:4283
09:03:01:WU00:FS01:FahCore 0x22 started
09:03:02:WARNING:WU00:FS01:FahCore returned: FAILED_3 (255 = 0xff)

Re: AWS EC2 - backup checkpoints directory

Posted: Mon Feb 07, 2022 10:23 am
by gunnarre
The PCI ID of the GPU likely changes between instances, so you could try overwriting the config.xml file with a fresh one, so that the client re-discovers the GPU and adds it as a slot with the correct OpenCL IDs, provided that this happens before the WU is dumped. If doing it that way dumps the WU, you might have to do it differently:

1: Start FAH with a config that has no GPU slot, but has gpu set to true (the default) for auto-configuring the GPU, and pause-on-start set to true to avoid picking up a new WU.
2: After the client has added the GPU slot successfully, stop fahclient.
3: Sync the work and cores folders from the backup, but do not overwrite config.xml.
4: Start fahclient.
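
Steps 1 and 3 above could be sketched roughly like this. I am assuming the v7 config.xml layout, where each option is an element with a v attribute, and the function names are made up for illustration. Stopping and starting fahclient (steps 2 and 4) is left out because it depends on how the service is managed on your image:

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def bootstrap_config_xml():
    """Step 1: config with no slot, GPU auto-configuration on, paused on start."""
    config = ET.Element("config")
    ET.SubElement(config, "gpu", v="true")
    ET.SubElement(config, "pause-on-start", v="true")
    return ET.tostring(config, encoding="unicode")


def restore_work_and_cores(backup_dir, data_dir):
    """Step 3: copy work/ and cores/ back, leaving config.xml in data_dir alone.

    Returns the names of the folders that were actually restored.
    Requires Python 3.8+ for dirs_exist_ok.
    """
    backup_dir, data_dir = Path(backup_dir), Path(data_dir)
    restored = []
    for name in ("work", "cores"):
        src = backup_dir / name
        if src.is_dir():
            shutil.copytree(src, data_dir / name, dirs_exist_ok=True)
            restored.append(name)
    return restored
```

Because only the two named folders are copied, the config.xml that the client wrote after re-discovering the GPU stays in place even if the backup contains an older one.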

Re: AWS EC2 - backup checkpoints directory

Posted: Mon Feb 07, 2022 10:32 am
by toTOW
I don't know AWS, but maybe you can find some inspiration in these tutorials (GCP/Azure): https://github.com/gitHu6-newb/FoldingAtAltitude

It's easier to use the older client (7.6.13) than the latest one (7.6.21) to avoid automatic GPU detection and the new config scheme with pci-bus and pci-slot settings ...

Also, you might need some persistent storage associated with your AWS instance ...

Edit: also, be careful with folder permissions after restoring the directory.
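
For the permissions point, a small check like the following could list anything in the restored tree that ended up with the wrong owner. This is only a sketch: the function name is made up, and which user should own the files depends on how the client was installed.

```python
from pathlib import Path


def paths_not_owned_by(root, uid):
    """Return every path under root (root included) whose owner is not uid."""
    root = Path(root)
    candidates = [root, *root.rglob("*")]
    # lstat() reports on symlinks themselves instead of following them.
    return [p for p in candidates if p.lstat().st_uid != uid]
```

If the client runs as a dedicated user (I am assuming one named fahclient here), the uid can be looked up with pwd.getpwnam("fahclient").pw_uid and passed in together with /var/lib/fahclient. An empty list means ownership looks consistent.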

Re: AWS EC2 - backup checkpoints directory

Posted: Mon Feb 07, 2022 4:26 pm
by gunnarre
One person who folds on a similar cloud setup reports that turning off automatic GPU detection, by setting "gpu" to "false" in the config, helps. So that is the opposite of what I suggested at first.
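
If you want to script that change on a restored config.xml, something along these lines might do. It assumes the v7 layout where options are elements with a v attribute, and the function name is made up for illustration:

```python
import xml.etree.ElementTree as ET


def set_gpu_option(config_xml, enabled):
    """Return config_xml with the top-level gpu option set to true or false.

    The option is added if it is missing. Note that ElementTree drops
    XML comments when parsing, so comments in the file are not preserved.
    """
    root = ET.fromstring(config_xml)
    node = root.find("gpu")
    if node is None:
        node = ET.SubElement(root, "gpu")
    node.set("v", "true" if enabled else "false")
    return ET.tostring(root, encoding="unicode")
```

Existing slot entries are left untouched, so a GPU slot already present in the file would still be there after the change.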

Re: AWS EC2 - backup checkpoints directory

Posted: Mon Feb 07, 2022 5:29 pm
by ouhman
gunnarre wrote:One person who folds on a similar cloud setup reports that turning off automatic GPU detection, by setting "gpu" to "false" in the config, helps. So that is the opposite of what I suggested at first.
Thank you for the feedback. I am currently testing multiple scenarios and am still not able to make it work. I will report back here when I get something working.

I appreciate the help though, thanks!

Re: AWS EC2 - backup checkpoints directory

Posted: Tue Feb 08, 2022 4:37 am
by Knish
I, too, recommend persistent storage. I never ran GCP or Azure without it, and it will likely be your easiest path to success. That's about all I can think of, good luck!