Question about AMD GPU and checkpoints

If you think it might be a driver problem, see viewforum.php?f=79

mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Question about AMD GPU and checkpoints

Post by mwroggenbuck »

Has anyone recovered from an AMD GPU checkpoint AND successfully submitted the result?

I can pause or exit FAH, and then restart it. The log file will show the GPU falling back to the checkpoint. It will continue until 100%. However, it will fail at the very end and not send.

Code:

20:15:12:WU00:FS00:0x22:Completed 2000000 out of 2000000 steps (100%)
20:15:14:WU00:FS00:0x22:Saving result file ..\logfile_01.txt
20:15:14:WU00:FS00:0x22:Saving result file checkpointState.xml
20:15:14:WU00:FS00:0x22:Saving result file checkpt.crc
20:15:14:WU00:FS00:0x22:Saving result file positions.xtc
20:15:14:WU00:FS00:0x22:Saving result file science.log
20:15:14:WU00:FS00:0x22:Folding@home Core Shutdown: FINISHED_UNIT
20:15:16:WARNING:WU00:FS00:FahCore returned an unknown error code which probably indicates that it crashed
20:15:16:WARNING:WU00:FS00:FahCore returned: UNKNOWN_ENUM (-1073740940 = 0xc0000374)

It will not send the file.
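
For anyone decoding the failure line above: the client prints the core's exit status as a signed 32-bit integer next to its hex form, and 0xC0000374 is the Windows NTSTATUS value STATUS_HEAP_CORRUPTION. Here is a minimal Python sketch of that conversion. The helper names and the small lookup table are illustrative only and are not part of FAHClient.

```python
# Illustrative helpers, not part of FAHClient: map the signed exit code
# printed in the log to the unsigned 32-bit NTSTATUS value Windows uses.
KNOWN_STATUS = {
    0xC0000374: "STATUS_HEAP_CORRUPTION",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def to_ntstatus_hex(exit_code: int) -> str:
    """Return the exit code as an 8-digit unsigned hex string."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


def describe(exit_code: int) -> str:
    """Return the hex form plus a status name if it is in the table."""
    unsigned = exit_code & 0xFFFFFFFF
    name = KNOWN_STATUS.get(unsigned, "unknown status")
    return f"{to_ntstatus_hex(exit_code)} ({name})"


print(describe(-1073740940))  # prints: 0xc0000374 (STATUS_HEAP_CORRUPTION)
```

The status name only says that the process detected a corrupted heap when it exited. It does not say what corrupted it.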

If I never pause or stop, so that no checkpoint is used, things work fine.

Here is a successful end:

Code:

01:03:48:WU00:FS00:0x22:Completed 990000 out of 1000000 steps (99%)
01:03:48:WU01:FS00:Connecting to 65.254.110.245:80
01:03:48:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:03:48:WU01:FS00:Connecting to 18.218.241.186:80
01:03:49:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:03:49:WU01:FS00:Connecting to 65.254.110.245:80
01:03:49:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:03:49:WU01:FS00:Connecting to 18.218.241.186:80
01:03:49:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:03:49:ERROR:WU01:FS00:Exception: Could not get an assignment
01:03:49:WU01:FS00:Connecting to 65.254.110.245:80
01:03:50:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:03:50:WU01:FS00:Connecting to 18.218.241.186:80
01:03:50:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:03:50:WU01:FS00:Connecting to 65.254.110.245:80
01:03:50:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:03:50:WU01:FS00:Connecting to 18.218.241.186:80
01:03:50:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:03:50:ERROR:WU01:FS00:Exception: Could not get an assignment
01:04:49:WU01:FS00:Connecting to 65.254.110.245:80
01:04:50:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:04:50:WU01:FS00:Connecting to 18.218.241.186:80
01:04:50:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:04:50:WU01:FS00:Connecting to 65.254.110.245:80
01:04:50:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:04:50:WU01:FS00:Connecting to 18.218.241.186:80
01:04:50:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:04:50:ERROR:WU01:FS00:Exception: Could not get an assignment
01:05:40:WU00:FS00:0x22:Completed 1000000 out of 1000000 steps (100%)
01:05:43:WU00:FS00:0x22:Saving result file ..\logfile_01.txt
01:05:43:WU00:FS00:0x22:Saving result file checkpointState.xml
01:05:44:WU00:FS00:0x22:Saving result file checkpt.crc
01:05:44:WU00:FS00:0x22:Saving result file positions.xtc
01:05:44:WU00:FS00:0x22:Saving result file science.log
01:05:44:WU00:FS00:0x22:Folding@home Core Shutdown: FINISHED_UNIT
01:05:44:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
01:05:44:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:11762 run:0 clone:8302 gen:23 core:0x22 unit:0x0000003180fccb0a5e7113ccda4722a1
01:05:44:WU00:FS00:Uploading 33.02MiB to 128.252.203.10
01:05:44:WU00:FS00:Connecting to 128.252.203.10:8080
01:06:00:WU00:FS00:Upload 0.19%
01:06:06:WU00:FS00:Upload 0.38%
01:06:27:WU01:FS00:Connecting to 65.254.110.245:80
01:06:27:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:06:27:WU01:FS00:Connecting to 18.218.241.186:80
01:06:27:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:06:27:WU01:FS00:Connecting to 65.254.110.245:80
01:06:27:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:06:27:WU01:FS00:Connecting to 18.218.241.186:80
01:06:28:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:06:28:ERROR:WU01:FS00:Exception: Could not get an assignment
01:07:04:WU00:FS00:Upload 0.95%
01:07:10:WU00:FS00:Upload 11.74%
01:07:16:WU00:FS00:Upload 24.80%
01:07:22:WU00:FS00:Upload 36.91%
01:07:28:WU00:FS00:Upload 49.21%
01:07:34:WU00:FS00:Upload 61.51%
01:07:40:WU00:FS00:Upload 74.01%
01:07:46:WU00:FS00:Upload 86.69%
01:07:52:WU00:FS00:Upload 96.53%
01:07:54:WU00:FS00:Upload complete
01:07:54:WU00:FS00:Server responded WORK_ACK (400)
01:07:54:WU00:FS00:Final credit estimate, 51365.00 points
01:07:54:WU00:FS00:Cleaning up

Some of these jobs are almost 8 hours long. It is frustrating that it looks like it completes, but does not send.

So I ask again: does anyone have a log file for an AMD GPU WU that shows recovery from a checkpoint AND successfully sending the result and getting credit?

Thanks for any information
Artemios
Posts: 42
Joined: Tue Dec 10, 2019 3:39 pm
Location: Athens, Greece

Re: Question about AMD GPU and checkpoints

Post by Artemios »

I use my desktop for folding with 2 Radeon RX5700XT GPUs, so I pause and continue often, and I have never had any issues.
I use Windows 10 and the latest AMD drivers.
(CPU=AMD 2700x) (GPU=2X RX 5700XT) (OS= win10)(client type=beta)
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: Question about AMD GPU and checkpoints

Post by mwroggenbuck »

Does your log show recovery from a checkpoint? I just want to make sure.
foldy
Posts: 2061
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Question about AMD GPU and checkpoints

Post by foldy »

What is your AMD GPU type and driver version? And which FAHClient version are you running?
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: Question about AMD GPU and checkpoints

Post by mwroggenbuck »

RX 570, 4 GB
Radeon 20.4.2
FAH 7.6.9
Artemios
Posts: 42
Joined: Tue Dec 10, 2019 3:39 pm
Location: Athens, Greece

Re: Question about AMD GPU and checkpoints

Post by Artemios »

This is a typical pause/unpause log extract...

Press the pause button

Code:

10:11:14:WU01:FS02:0x22:Completed 550000 out of 5000000 steps (11%)
10:12:16:FS02:Paused
10:12:16:FS02:Shutting core down
10:12:16:WU01:FS02:0x22:WARNING:Console control signal 1 on PID 2216
10:12:16:WU01:FS02:0x22:Exiting, please wait. . .
10:12:16:WU01:FS02:0x22:Folding@home Core Shutdown: INTERRUPTED
10:12:16:WU01:FS02:FahCore returned: INTERRUPTED (102 = 0x66)

Press the fold button

Code:

10:32:43:FS02:Unpaused
10:32:43:WU01:FS02:Starting
10:32:43:WU01:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Darth\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/beta/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 14872 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 1 -gpu 1
10:32:43:WU01:FS02:Started FahCore on PID 1840
10:32:43:WU01:FS02:Core PID:19568
10:32:43:WU01:FS02:FahCore 0x22 started
10:32:44:WU01:FS02:0x22:*********************** Log Started 2020-04-25T10:32:44Z ***********************
10:32:44:WU01:FS02:0x22:*************************** Core22 Folding@home Core ***************************
10:32:44:WU01:FS02:0x22:       Type: 0x22
10:32:44:WU01:FS02:0x22:       Core: Core22
10:32:44:WU01:FS02:0x22:    Website: https://foldingathome.org/
10:32:44:WU01:FS02:0x22:  Copyright: (c) 2009-2018 foldingathome.org
10:32:44:WU01:FS02:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
10:32:44:WU01:FS02:0x22:             <rafal.wiewiora@choderalab.org>
10:32:44:WU01:FS02:0x22:       Args: -dir 01 -suffix 01 -version 705 -lifeline 1840 -checkpoint 15
10:32:44:WU01:FS02:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 1 -gpu 1
10:32:44:WU01:FS02:0x22:     Config: <none>
10:32:44:WU01:FS02:0x22:************************************ Build *************************************
10:32:44:WU01:FS02:0x22:    Version: 0.0.2
10:32:44:WU01:FS02:0x22:       Date: Dec 6 2019
10:32:44:WU01:FS02:0x22:       Time: 21:30:31
10:32:44:WU01:FS02:0x22: Repository: Git
10:32:44:WU01:FS02:0x22:   Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
10:32:44:WU01:FS02:0x22:     Branch: HEAD
10:32:44:WU01:FS02:0x22:   Compiler: Visual C++ 2008
10:32:44:WU01:FS02:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
10:32:44:WU01:FS02:0x22:   Platform: win32 10
10:32:44:WU01:FS02:0x22:       Bits: 64
10:32:44:WU01:FS02:0x22:       Mode: Release
10:32:44:WU01:FS02:0x22:************************************ System ************************************
10:32:44:WU01:FS02:0x22:        CPU: AMD Ryzen 7 2700X Eight-Core Processor
10:32:44:WU01:FS02:0x22:     CPU ID: AuthenticAMD Family 23 Model 8 Stepping 2
10:32:44:WU01:FS02:0x22:       CPUs: 16
10:32:44:WU01:FS02:0x22:     Memory: 15.95GiB
10:32:44:WU01:FS02:0x22:Free Memory: 9.12GiB
10:32:44:WU01:FS02:0x22:    Threads: WINDOWS_THREADS
10:32:44:WU01:FS02:0x22: OS Version: 6.2
10:32:44:WU01:FS02:0x22:Has Battery: false
10:32:44:WU01:FS02:0x22: On Battery: false
10:32:44:WU01:FS02:0x22: UTC Offset: 3
10:32:44:WU01:FS02:0x22:        PID: 19568
10:32:44:WU01:FS02:0x22:        CWD: C:\Users\Darth\AppData\Roaming\FAHClient\work
10:32:44:WU01:FS02:0x22:         OS: Windows 10 Home
10:32:44:WU01:FS02:0x22:    OS Arch: AMD64
10:32:44:WU01:FS02:0x22:********************************************************************************
10:32:44:WU01:FS02:0x22:Project: 16435 (Run 755, Clone 0, Gen 1)
10:32:44:WU01:FS02:0x22:Unit: 0x0000000103854c135e9a4efa34d65490
10:32:44:WU01:FS02:0x22:Digital signatures verified
10:32:44:WU01:FS02:0x22:Folding@home GPU Core22 Folding@home Core
10:32:44:WU01:FS02:0x22:Version 0.0.2
10:32:44:WU01:FS02:0x22:  Found a checkpoint file
10:33:24:WU01:FS02:0x22:Completed 570000 out of 5000000 steps (11%)
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: Question about AMD GPU and checkpoints

Post by mwroggenbuck »

Ok, you are clearly recovering from a checkpoint. I appreciate you going back to check.

Now if I could figure out why mine does not work...
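
The check discussed so far (did a unit resume from a checkpoint, and what did the core return afterwards?) can be scripted against a FAHClient log. Here is a rough Python sketch. The marker strings are copied from the log excerpts in this thread, while the function name `summarize` and the shape of the result are my own assumptions, not anything the client provides.

```python
import re

# Markers copied from the log excerpts in this thread; the function name
# and the returned dictionary layout are hypothetical.
CHECKPOINT_MARK = "Found a checkpoint file"
WU_RE = re.compile(r":(WU\d+):(FS\d+):")
RETURN_RE = re.compile(r":(WU\d+):(FS\d+):FahCore returned: (\w+)")


def summarize(log_lines):
    """Map (WU, slot) -> {'resumed': bool, 'last_return': str or None}."""
    state = {}
    for line in log_lines:
        unit = WU_RE.search(line)
        if not unit:
            continue  # line does not belong to a work unit
        entry = state.setdefault(
            unit.groups(), {"resumed": False, "last_return": None}
        )
        if CHECKPOINT_MARK in line:
            entry["resumed"] = True
        returned = RETURN_RE.search(line)
        if returned:
            entry["last_return"] = returned.group(3)
    return state


# Sample lines taken from two different logs in this thread.
sample = [
    "10:12:16:WU01:FS02:FahCore returned: INTERRUPTED (102 = 0x66)",
    "10:32:44:WU01:FS02:0x22:  Found a checkpoint file",
    "20:15:16:WARNING:WU00:FS00:FahCore returned: UNKNOWN_ENUM (-1073740940 = 0xc0000374)",
]
for key, info in summarize(sample).items():
    print(key, info)
```

A unit that resumed and then finished cleanly would show `resumed` as True with `FINISHED_UNIT` as its last return value. Note that work unit slots such as WU00 are reused, so a long log should be split per unit before drawing conclusions.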
foldy
Posts: 2061
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Question about AMD GPU and checkpoints

Post by foldy »

Could a virus scanner somehow be killing the FahCore process while it is finishing? You can try the FAHClient 7.6.13 beta, but the FahCore will be the same. You can also delete the FahCore to force a download of the latest beta version.
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: Question about AMD GPU and checkpoints

Post by mwroggenbuck »

My virus scanner is not showing anything in quarantine and I am not seeing anything in the log.

The Windows event log shows the following:

Code:

Faulting application name: FahCore_22.exe, version: 0.0.0.0, time stamp: 0x5e9fcb57
Faulting module name: ntdll.dll, version: 10.0.18362.815, time stamp: 0xb29ecf52
Exception code: 0xc0000374
Fault offset: 0x00000000000f9229
Faulting process id: 0x2ce4
Faulting application start time: 0x01d61d865a1117fe
Faulting application path: D:\C_Alt\data\FAHClient\cores\cores.foldingathome.org\v7\win\64bit\Core_22.fah\FahCore_22.exe
Faulting module path: C:\windows\SYSTEM32\ntdll.dll
Report Id: 63483971-12a2-4c1f-8c74-c206bfc6e5e3
Faulting package full name: 
Faulting package-relative application ID: 

Of course, knowing it is ntdll.dll is just about useless. :roll:

About the only consistent thing (that I can see) is that the job works fine if it does not recover from a checkpoint, but fails at the done/send step if a checkpoint has been used.

Like I said, frustrating :twisted:
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: Question about AMD GPU and checkpoints

Post by PantherX »

Can you please post the first 100 lines of the log file, which show the system configuration and the client settings?
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: Question about AMD GPU and checkpoints

Post by mwroggenbuck »

I may have found the problem. I will know for sure in a couple of hours. The following was in my Windows event log:

Code:

C:\Program Files (x86)\FAHClient\FAHClient.exe
Exception: Failed to rename 'log.txt' to 'logs\log-20200415-125040.txt': The process cannot access the file because it is being used by another process.

I think that my anti-virus had the file locked for some reason. I have removed the anti-virus (I went back to Windows Defender).

The anti-virus log did not show an error because it did not actually stop anything. I have had other problems with files being locked, and I suspected the anti-virus, but I never proved it. People in this forum suggested adding an exemption for the FAH data directory. Unfortunately, McAfee only allows exempting specific files (something a lot of people complain about), so I cannot properly exempt the rename.
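
One way to test a suspicion like this is to probe whether a file can be renamed at all. Here is a minimal Python sketch, assuming the usual Windows behavior where a rename raises PermissionError (WinError 32) while another process holds the file open without sharing it. The function name `can_rename` and the `.rename-probe` suffix are made up for illustration. On Linux or macOS the probe will normally succeed even when the file is open elsewhere, so it is only informative on Windows.

```python
import os
import tempfile


def can_rename(path: str) -> bool:
    """Probe whether `path` can be renamed right now.

    On Windows, a rename typically fails with PermissionError (WinError 32)
    while another process holds the file open without sharing it.
    """
    probe = path + ".rename-probe"  # hypothetical temporary name
    try:
        os.rename(path, probe)
    except PermissionError:
        return False
    os.rename(probe, path)  # put the file back under its original name
    return True


if __name__ == "__main__":
    # Demo on a throwaway file rather than a live FAH log.
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "log.txt")
        with open(log, "w") as fh:
            fh.write("test\n")
        print(can_rename(log))
```

If you point this at the real log.txt, do it with the client stopped, since the client renames that file itself when it starts.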

Like I said, I should know in a few hours, but this smells like the issue.

I will let you know.
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: Question about AMD GPU and checkpoints

Post by mwroggenbuck »

Well, I only have one data point, but this appears to have fixed my problem. I don't know why it seemed to be related to checkpoints, but I will take what I can get.

I want to thank everyone who helped. It is much appreciated. Between bad hardware and this problem, it took me about a month to get things going. Hopefully, I can successfully fold from now on. :D
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Question about AMD GPU and checkpoints

Post by Neil-B »

Thank you for continuing to work with everyone and persevering with FAH. The current challenges make it hard for everyone, but fingers crossed, you should be folding smoothly from now on.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
Joe_H
Site Admin
Posts: 7876
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Question about AMD GPU and checkpoints

Post by Joe_H »

mwroggenbuck wrote:I think that my anti-virus had the file locked for some reason. I have removed the anti-virus (I went back to Windows Defender).
Which anti-virus program was this? It may help someone the next time.

We do recommend excluding the FAHClient data directory from being scanned; the random binary data in some of the files can trigger a false positive in many anti-virus products.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: Question about AMD GPU and checkpoints

Post by mwroggenbuck »

I was using some version of McAfee (my cable company gives it away). It is a mix of their enterprise and consumer stuff. I have never been very fond of it because it literally creates over a dozen processes. The final straw was when I could not exclude an entire directory. If people cannot exclude an entire directory from the scan, then they have a similar version to mine.

Like I said in the previous post, the McAfee log did not show anything. I found out about the locked file from the Windows event log. I guessed it was McAfee because of some other issues that I have intermittently (USB drives would not unmount because the system thought something was using them).

I have not had any problems with FAH since this, and I have paused a number of times and recovered from a checkpoint each time, so I think that we found the issue. I do want to get a few more days in before I am sure, though.

I did find something a little strange. I cannot put my computer into sleep mode unless I pause and exit FAH. Not a big deal, but it was funny to see my computer stay alive when I told it to sleep (the screens would go out, but the computer would not stop).

Again, thanks for all the help.