AWS EC2 - backup checkpoints directory

ouhman
Posts: 5
Joined: Sun Feb 06, 2022 7:30 pm

Post by ouhman »

Hello,

I am currently allocating some budget to fold on AWS with EC2. Sometimes the instance gets killed, and I would like to back up the work directory so that I can resume the work unit when I start another EC2 instance.

I tried to back up the /var/lib/fahclient/work directory, but I get errors when I sync it back and start FAH.
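For reference, the kind of backup and restore described here can be sketched in Python with the standard `tarfile` module. This is only a minimal sketch, not necessarily how the backup was actually taken: the function names and the archive location are hypothetical, and it assumes the client is stopped while the archive is written so that checkpoint files are not captured mid-write.

```python
import tarfile
from pathlib import Path


def backup_dir(src: Path, archive: Path) -> None:
    """Pack src into a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        # arcname="." stores paths relative to src, so the archive can be
        # unpacked into any destination directory later.
        tar.add(src, arcname=".")


def restore_dir(archive: Path, dest: Path) -> None:
    """Unpack an archive made by backup_dir into dest."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)


# Hypothetical usage (the archive path is an assumption):
# backup_dir(Path("/var/lib/fahclient"), Path("/tmp/fahclient-backup.tar.gz"))
# restore_dir(Path("/tmp/fahclient-backup.tar.gz"), Path("/var/lib/fahclient"))
```

The archive should be written outside the directory being backed up, otherwise it would try to include itself.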

Any help would be appreciated. Thank you!
gunnarre
Posts: 567
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Re: AWS EC2 - backup checkpoints directory

Post by gunnarre »

Are you backing up the /var/lib/fahclient/cores directory and /var/lib/fahclient/GPUs.txt too?
Online: GTX 1660 Super, GTX 1080, GTX 1050 Ti 4G OC, RX580 + occasional CPU folding in the cold.
Offline: Radeon HD 7770, GTX 960, GTX 950
ouhman
Posts: 5
Joined: Sun Feb 06, 2022 7:30 pm

Re: AWS EC2 - backup checkpoints directory

Post by ouhman »

gunnarre wrote: Are you backing up the /var/lib/fahclient/cores directory and /var/lib/fahclient/GPUs.txt too?
In the end I backed up the entire /var/lib/fahclient directory, and it still doesn't work. This is the error I am getting:

09:03:01:Trying to access database...
09:03:01:Successfully acquired database lock
09:03:01:WARNING:FS01:Guessing ambiguous GPU to OpenCL device mapping for 01: gpu:0:30 TU104GL [Tesla T4].  Consider upgrading your graphics driver or manually setting ``opencl-index`` in this slot's configuration.
09:03:01:FS01:Initialized folding slot 01: gpu:0:30 TU104GL [Tesla T4]
09:03:01:WU00:FS01:Starting
09:03:01:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit/22-0.0.20/Core_22.fah/FahCore_22 -dir 00 -suffix 01 -version 706 -lifeline 4262 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
09:03:01:WU00:FS01:Started FahCore on PID 4274
09:03:01:WU00:FS01:Core PID:4278
09:03:01:WU00:FS01:FahCore 0x22 started
09:03:01:WARNING:WU00:FS01:FahCore returned: FAILED_3 (255 = 0xff)
09:03:01:WU00:FS01:Starting
09:03:01:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit/22-0.0.20/Core_22.fah/FahCore_22 -dir 00 -suffix 01 -version 706 -lifeline 4262 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
09:03:01:WU00:FS01:Started FahCore on PID 4279
09:03:01:WU00:FS01:Core PID:4283
09:03:01:WU00:FS01:FahCore 0x22 started
09:03:02:WARNING:WU00:FS01:FahCore returned: FAILED_3 (255 = 0xff)
gunnarre
Posts: 567
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Re: AWS EC2 - backup checkpoints directory

Post by gunnarre »

The PCI ID of the GPU likely changes between instances, so you could overwrite the config.xml file with a fresh one that rediscovers the GPU and adds it as a slot with the correct OpenCL IDs, provided that this happens before the WU is dumped. If doing it that way dumps the WU, you might have to do it differently:

1: Start FAH with a config that has no GPU slot, but has gpu set to true (the default) so the GPU is auto-configured, and pause-on-start set to true to avoid picking up a new WU.
2: After the client has added the GPU slot successfully, stop fahclient.
3: Sync the work and cores folders from the backup, but do not overwrite config.xml.
4: Start fahclient.
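
Step 3 is the part that is easy to get wrong with a blanket sync, so here is a minimal Python sketch of it. The function name and the `FOLDERS_TO_RESTORE` tuple are hypothetical, and it assumes the backup has the same layout as /var/lib/fahclient, with `work` and `cores` as top-level folders. It needs Python 3.8 or newer for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path

# Folders named in step 3; config.xml is deliberately not in this list.
FOLDERS_TO_RESTORE = ("work", "cores")


def restore_work_and_cores(backup_root: Path, data_root: Path) -> list:
    """Copy only the work and cores folders from backup_root into data_root.

    Everything else in data_root, including the freshly generated
    config.xml, is left untouched. Returns the folder names that were copied.
    """
    copied = []
    for name in FOLDERS_TO_RESTORE:
        src = backup_root / name
        if not src.is_dir():
            continue  # nothing was backed up under this name
        # dirs_exist_ok=True merges into a folder that already exists.
        shutil.copytree(src, data_root / name, dirs_exist_ok=True)
        copied.append(name)
    return copied


# Hypothetical usage, run while fahclient is stopped (step 2):
# restore_work_and_cores(Path("/mnt/backup/fahclient"), Path("/var/lib/fahclient"))
```

Copying folder by folder, rather than syncing the whole directory with an exclude rule, means a config.xml that happens to be in the backup can never reach the new instance.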
toTOW
Site Moderator
Posts: 6296
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: AWS EC2 - backup checkpoints directory

Post by toTOW »

I don't know AWS, but maybe you can find some inspiration in these tutorials (GCP/Azure): https://github.com/gitHu6-newb/FoldingAtAltitude

It's easier to use the older client (7.6.13) than the latest one (7.6.21) to avoid automatic GPU detection and the new config scheme with pci-bus and pci-slot settings ...

Also, you might need some persistent storage associated with your AWS instance ...

Edit: also, be careful with folder permissions after restoring the directory.
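
As an illustration of the permissions point, here is a minimal Python sketch that resets ownership on a restored tree. The function name is hypothetical, and the `fahclient` user and group in the usage line are assumptions; check which account owns the directory on a working install before using them. Changing ownership normally requires root.

```python
import os
import shutil
from pathlib import Path


def chown_tree(root: Path, user: str, group: str, chown=shutil.chown) -> int:
    """Set owner and group on root and on everything below it.

    Returns the number of paths changed. The chown parameter exists so the
    directory walk can be exercised without root privileges.
    """
    chown(root, user=user, group=group)
    count = 1
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            chown(os.path.join(dirpath, name), user=user, group=group)
            count += 1
    return count


# Hypothetical usage as root (account names are assumptions):
# chown_tree(Path("/var/lib/fahclient"), "fahclient", "fahclient")
```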

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
gunnarre
Posts: 567
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Re: AWS EC2 - backup checkpoints directory

Post by gunnarre »

One person who folds in the cloud in a similar way reports that it helps to turn off automatic GPU detection by setting "gpu" to "false" in the config. So, the opposite of what I suggested at first.
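
A minimal Python sketch of that config change, using the standard `xml.etree.ElementTree` module. It assumes the v7 config.xml layout, where options are child elements of `<config>` with the value in a `v` attribute (for example `<gpu v='true'/>`); the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def disable_gpu_autodetect(config_xml: str) -> str:
    """Return config_xml with the gpu option under <config> set to false."""
    root = ET.fromstring(config_xml)
    gpu = root.find("gpu")
    if gpu is None:
        # The option is absent, so add it rather than rely on the default.
        gpu = ET.SubElement(root, "gpu")
    gpu.set("v", "false")
    return ET.tostring(root, encoding="unicode")


# Hypothetical usage with fahclient stopped:
# path = "/etc/fahclient/config.xml"  # location is an assumption
# with open(path) as f:
#     updated = disable_gpu_autodetect(f.read())
# with open(path, "w") as f:
#     f.write(updated)
```

The default parser discards XML comments, so any comments in the file will not survive the rewrite. Existing slot entries are kept as they are.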
ouhman
Posts: 5
Joined: Sun Feb 06, 2022 7:30 pm

Re: AWS EC2 - backup checkpoints directory

Post by ouhman »

gunnarre wrote: One person who folds on cloud similar to this reports that turning off automatic GPU detection, by setting "gpu" to "false" in the config helps. So the opposite of what I suggested at first.
Thank you for the feedback. I am currently testing multiple scenarios and am still not able to make it work. I will report back here when I get something working.

I appreciate the help though, thanks!
Knish
Posts: 232
Joined: Tue Mar 17, 2020 5:20 am

Re: AWS EC2 - backup checkpoints directory

Post by Knish »

I, too, recommend persistent storage. I never ran GCP or Azure without it, and it will likely be your easiest path to success. That's about all I can think of, good luck!