CUDA_ERROR_LAUNCH_FAILED

It seems that a lot of GPU problems revolve around specific versions of drivers. Though NVidia has their own support structure, you can often learn from information reported by others who fold.

Moderators: Site Moderators, FAHC Science Team

PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud
Contact:

Re: CUDA_ERROR_LAUNCH_FAILED

Post by PantherX »

Building on what bruce mentioned, there's one exception: if the error isn't recorded in the science.log file, it never gets reported to the Server, so it can't be identified or fixed. Thus, as long as the error message is recorded in science.log, which then gets uploaded, we have sufficient information and can use the PRCG to extract it from the Server.

science.log file is located within the WU directory inside the Work folder. On my system, it's located here:
C:\Users\PantherX-H\AppData\Roaming\FAHClient\work\00\01
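If you'd rather not open each file by hand, here is a minimal Python sketch that walks a work folder, finds every science.log, and lists the lines mentioning the error. The function name and the default path built from %APPDATA% are my own assumptions for illustration; point it at wherever your client actually keeps its work folder.

```python
import os
from pathlib import Path


def find_error_in_science_logs(work_dir, needle="CUDA_ERROR_LAUNCH_FAILED"):
    """Return {log path: [matching lines]} for every science.log under work_dir."""
    hits = {}
    for log_path in sorted(Path(work_dir).rglob("science.log")):
        # errors="replace" keeps the scan going if a log contains odd bytes.
        text = log_path.read_text(encoding="utf-8", errors="replace")
        matches = [line for line in text.splitlines() if needle in line]
        if matches:
            hits[log_path] = matches
    return hits


if __name__ == "__main__":
    # Assumed default location on Windows; adjust for your install.
    default = Path(os.environ.get("APPDATA", ".")) / "FAHClient" / "work"
    if default.is_dir():
        for path, lines in find_error_in_science_logs(default).items():
            print(path)
            for line in lines:
                print("   ", line)
```

The work folder sits under a hidden directory, so a script like this can be easier than browsing to it in File Explorer.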
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
Tuna_Ertemalp
Posts: 68
Joined: Sun Mar 22, 2020 8:54 pm
Hardware configuration: OS:Win10
GPUs: EVGA

CPU (cores), RAM, (GPU Core OC, Mem OC): GPU(s), Motherboard:

* AMD Ryzen 5 3600 (6C), 32G DDR4-2400, (+0,+0): 3090 FTW3 ULTRA, Gigabyte AB350M-D3H-CF
* Intel Core i7 5960X (8C), 32G DDR4-2400, (+0,+0): 3090 XC3 ULTRA HYBRID, ASUS X99-M WS
* Intel Core i7 5960X (8C), 32G DDR4-2400, (+100,+200): 2x 3090 FTW3 ULTRA, ASUS X99-E WS/USB 3.1
* Intel Core i7 970 (6C), 24G DDR3-1333, (+0,+0): 2x 3080 FTW3 ULTRA HYBRID, ASUS RAMPAGE III GENE
* Intel Core i7 5960X (8C), 16G DDR4-2400, (+100,+0): 1080 Ti FTW3 + HYBRID KIT, ASRock X99 OC Formula/3.1
* AMD Ryzen 7 2700X (8C), 16G DDR4-2666, (+100,+200): 3090 FTW3 ULTRA HYBRID, ASRock B450M Pro4
* AMD Ryzen TR 1950X (16C), 32G DDR4-2133, (+100,+200): 3x 3090 XC3 ULTRA HYBRID, ASRock X399 Taichi
* Intel Core i7 5960X (8C), 64G DDR4-2133, (+100,+0): 1080 Ti FTW3 + HYBRID KIT, 2x 1080 Ti SC2 HYBRID, MSI X99A XPOWER AC
Location: Seattle, WA, USA

Re: CUDA_ERROR_LAUNCH_FAILED

Post by Tuna_Ertemalp »

PantherX wrote:science.log file is located within the WU directory inside the Work folder.
Sadly, the five science.log files I have currently belong to the 5 WUs in progress on this quad GPU machine while running with -disable-cuda, and are of zero size, presumably because the WUs are not yet complete.

It seems there are no backup copies of past science.log files. Maybe that would be a good idea, to create a "....\FAHclient\sciencelogs" folder just like there seems to be a "....\FAHclient\logs" folder for log.txt backups, renamed with time stamp and all... They could all be stored in there as science-WU-SLOT-YYYYMMDD-HHMMSS.log. Just sayin'...
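To make the suggestion concrete, here is a rough Python sketch of what such a backup step could look like if done by hand outside the client. Nothing here is part of FAHClient: the function name, the archive folder, and the exact file-name layout are hypothetical, modeled on the naming scheme proposed above.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_science_log(src, archive_dir, wu, slot, when=None):
    """Copy src to archive_dir as science-WUxx-FSxx-YYYYMMDD-HHMMSS.log."""
    when = when or datetime.now()
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: WU number, slot number, then a timestamp.
    name = f"science-WU{wu:02d}-FS{slot:02d}-{when:%Y%m%d-%H%M%S}.log"
    dest = archive_dir / name
    # copy2 also preserves the modification time of the original file.
    shutil.copy2(src, dest)
    return dest
```

Note that copying a file the core is still writing may give you a partial snapshot, so treat any copy taken mid-run as best effort.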

If you'd like, I could FINISH the host, remove -disable-cuda, restart the host, and see what is in science.log for crashed WUs. Just let me know. No problem doing it; just don't want to do it if it won't be useful to someone.

Tuna
Small things make quality, but quality is no small thing. (Adapted from Henry Royce talking about perfection, not quality)
8 Win10 PCs/22 slots: 8x CPUs (3xAMD+5xIntel=68C/122T), 14x NVIDIA EVGA GPUs (8x 3090, 2x 3080, 4x 1080Ti) [Details in my profile]

Re: CUDA_ERROR_LAUNCH_FAILED

Post by PantherX »

Umm... the science.log file is written continuously even when the WU is being processed. You can open it up in Notepad and see that it should have data in it. If not, then that's something that needs to be investigated.

Given that the science.log file is part of the WU results, it is unlikely to be left behind. However, most of the information in it is duplicated in the log file, which is why storing a copy of it wouldn't be ideal.

Re: CUDA_ERROR_LAUNCH_FAILED

Post by Tuna_Ertemalp »

PantherX wrote:Umm... the science.log file is written continuously even when the WU is being processed. You can open it up in Notepad and see that it should have data in it. If not, then that's something that needs to be investigated.
This is weird. I checked all 9 of my hosts. If I didn't miscount, 5 of them have 0-sized science.log files; the rest are non-zero. Huh?

Re: CUDA_ERROR_LAUNCH_FAILED

Post by PantherX »

On my system, they show up as 0 sized but they still have data in them if you open them up.

Since I run my GPU Slot 24/7 without pausing, my guess is that the file is 0 bytes since nothing is saved to it when the slot is running. If you paused the GPU Slot, it would write that information to the file and the file size will not be 0.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: CUDA_ERROR_LAUNCH_FAILED

Post by bruce »

Each of my active WUs has a non-zero science log; both CPU and GPU projects. How many active WUs do you have? The work files are in hidden directories so maybe you're not finding them. Maybe you have some ancient files from broken WUs before the name "science" was adopted.

There's an option to keep unnecessary files, but by default FAHClient uploads finished WUs and then cleans up data that you don't need to keep.

06:46:08:WU02:FS01:Upload complete
06:46:08:WU02:FS01:Server responded WORK_ACK (400)
06:46:08:WU02:FS01:Final credit estimate, 1069.00 points
06:46:08:WU02:FS01:Cleaning up

Note the last message I quoted.

Re: CUDA_ERROR_LAUNCH_FAILED

Post by Tuna_Ertemalp »

PantherX wrote:On my system, they show up as 0 sized but they still have data in them if you open them up.

Since I run my GPU Slot 24/7 without pausing, my guess is that the file is 0 bytes since nothing is saved to it when the slot is running. If you paused the GPU Slot, it would write that information to the file and the file size will not be 0.
Sorry for the delay in responding. The weekend was busy watching the craziness going on in USA...

Yes! I revisited my 9 hosts, searched for and double-clicked on every science.log under %USERPROFILE%\AppData\Roaming\FAHclient, even if they were listed as 0 bytes, and sure enough, they opened with data. And here is a fun fact: after closing them all and refreshing the search in File Explorer, they all showed up as non-zero size! Somehow the file creation flags must be making the size of the file sort of unknowable, wait for it... ...until it is observed. Never knew the Windows 10 file system operated on Quantum Mechanics principles of observation affecting measurements... LOL!
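One plausible explanation (an assumption on my part, not something confirmed in this thread) is that the size kept in the NTFS directory listing can lag behind while another process still has the file open for writing, whereas reading the file returns what is really there. A small Python sketch to compare the two; on Windows, os.scandir() takes its stat information from the directory listing, which is why it is used for the "listed" size here. The function name is made up for illustration.

```python
import os


def compare_sizes(directory):
    """Return sorted (name, size per directory listing, bytes actually readable)."""
    rows = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            # On Windows this comes from the directory listing, not a fresh query.
            listed = entry.stat().st_size
            with open(entry.path, "rb") as fh:
                readable = len(fh.read())
            rows.append((entry.name, listed, readable))
    return sorted(rows)
```

If the guess is right, a science.log that is still being written could show a smaller listed size than readable size on the first run, and matching numbers afterwards.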

Re: CUDA_ERROR_LAUNCH_FAILED

Post by bruce »

LOL.
Yep: the Cheshire Cat is somewhere in the cache.

Re: CUDA_ERROR_LAUNCH_FAILED

Post by Tuna_Ertemalp »

bruce wrote:Each of my active WUs has a non-zero science log; both CPU and GPU projects. How many active WUs do you have? The work files are in hidden directories so maybe you're not finding them. Maybe you have some ancient files from broken WUs before the name "science" was adopted.

There's an option to keep unnecessary files, but by default FAHClient uploads finished WUs and then cleans up data that you don't need to keep.
Regarding science.log, see the response to PantherX I posted just before this one. Mea culpa...

I am hoping that my science.log files uploaded during these failures did include "CUDA_ERROR_LAUNCH_FAILED", and actually were uploaded. Is there a way to check? I didn't see any reference to my username/userid in the science.log, so I am guessing not specific to my failures, but possibly in aggregate across users/hosts/WUs.

In any case,

1) is there any benefit in me removing my -disable-cuda from two hosts, one Quad GPU and another Single GPU, to repro the problem again, and see if the science.log contains the CUDA_ERROR_LAUNCH_FAILED?

2) And, if I do that to confirm what is and is not in science.log, then I could also follow https://docs.microsoft.com/en-us/window ... istry-keys to make regkey changes to try to avoid the issue, possibly by TdrDelay=10, TdrLimitTime=120, TdrLimitCount=10. But this would mean, even if the issue causing these delays gets fixed in FAH, I won't really know, and the machines will continue to operate under these permissive settings. Is there a public list of bugs fixed and changes made in each FAH release? Maybe it is just in GIT... Would be nice to have a link next to the download link, pointing to that fix/change/add info.
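For reference, the values in a .reg file are written as 8-digit hexadecimal DWORDs, so 10 becomes 0000000a and 120 becomes 00000078. Here is a small Python sketch that builds the file text from decimal inputs. The function name is hypothetical, it only produces text and does not touch the registry, and the key path is the GraphicsDrivers key that the TDR settings live under.

```python
def tdr_reg_file(tdr_delay, tdr_limit_time, tdr_limit_count):
    """Build .reg file text for the three TDR values, given as decimal integers."""
    values = {
        "TdrDelay": tdr_delay,
        "TdrLimitTime": tdr_limit_time,
        "TdrLimitCount": tdr_limit_count,
    }
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers]",
    ]
    for name, value in values.items():
        if not 0 <= value <= 0xFFFFFFFF:
            raise ValueError(f"{name} does not fit in a DWORD: {value}")
        # .reg files store DWORDs as 8 lowercase hex digits.
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(tdr_reg_file(10, 120, 10))
```

Keeping the generated file around also gives a record of what was changed, which helps when reverting to the defaults later.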

Tuna

Re: CUDA_ERROR_LAUNCH_FAILED

Post by bruce »

Test whatever makes sense to you and report back.

If you report the PRCG, the development team can find your science log, though your name may or may not be there.

Check for a flag that contains something like "noclean" in its name.

Re: CUDA_ERROR_LAUNCH_FAILED

Post by Tuna_Ertemalp »

bruce wrote: Check for a flag that contains something like "noclean" in its name.
If only there were an easy-to-find list of "expert flags" accepted in that dialog... I knew of -disable-cuda from a post I ran into randomly in this forum, not from a maintained webpage. And using Google and Bing to search for that flag didn't point me at any webpage containing "disable-cuda". Then I searched everything, including the binaries, in the Program Files folder for FAH using "findstr /i" to see if there was any disable-cuda in any of them, aaaaaaand, nope.

There's got to be a list of those options somewhere... It shouldn't be that hard. Maybe I am missing something obvious here.

I did, however, remove my -disable-cuda from the Quad GPU machine, and it is running full speed. I am hoping to see the error over the next day or days.

Re: CUDA_ERROR_LAUNCH_FAILED

Post by Tuna_Ertemalp »

bruce wrote:Test whatever makes sense to you and report back.

If you report the PRCG, the development team can find your science log, though your name may or may not be there.
Aaaaaand, the results are in. Under the Quad GPU setup with CUDA enabled, every single one of the GPUs ran into the error at different % progress points. Three GPUs got hit just once (at 0%, 1%, 43%, respectively), and one got hit three times (at 0%, 7%, 15%) before hitting 100%, requiring me to click OK in the crash dialog every single time, 6 in total. For each problem WU+Slot run, I captured the PRCG at the start, and science.log and log.txt at failure points, and the full log.txt at the 100% mark. Since I couldn't find that "noclean" expert flag, the 100% science.log is missing. But, they all must have gotten reported & uploaded to the server.

For convenience, here again is the link to where I have uploaded the files: https://1drv.ms/u/s!AvC041C64j0eyINf-i6 ... Q?e=Nratne. They are all under the new "Quad GPU Detailed Logs" folder. They are all TXT files, and the file names include WU#, Slot#, Progress% and whether it is the Slot/WU-filtered log.txt, or that WU's results.log in progress at the time of the failure, or the full log.txt of the host right after the "upload & cleanup" of that WU/Slot, or the PRCG info. Simply sorting that folder by filename should make everything very obvious. Needless to say, the largest "*_full_log.txt" in there is the final full non-filtered log.txt that includes the entire run with all the CPU/GPU WUs, successes and failures, on all slots since I started this run yesterday.

For further convenience, here are the PRCGs that got hit:

01:16:56:WU01:FS01:0x22:Project: 16918 (Run 9, Clone 8, Gen 135)
01:16:56:WU01:FS01:0x22:Unit: 0x000000b40002894c5f0e45d461a9d663

21:13:21:WU02:FS04:0x22:Project: 16918 (Run 53, Clone 51, Gen 167)
21:13:21:WU02:FS04:0x22:Unit: 0x000000da0002894c5f1761ac40656c6b

13:41:05:WU03:FS02:0x22:Project: 13428 (Run 2398, Clone 6, Gen 0)
13:41:05:WU03:FS02:0x22:Unit: 0x0000000012bc7d9a00000000095e0006

03:15:43:WU04:FS03:0x22:Project: 14903 (Run 98, Clone 0, Gen 218)
03:15:43:WU04:FS03:0x22:Unit: 0x0000011c81d59d695f21c6fa793ea8c8
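In case anyone wants to pull PRCGs out of a long log.txt instead of copying them by hand, here is a minimal Python sketch. The regular expression is based only on the "Project: ... (Run ..., Clone ..., Gen ...)" lines shown above, and the names are my own, so treat it as an assumption about the log format rather than anything official.

```python
import re

# Matches e.g. "Project: 16918 (Run 9, Clone 8, Gen 135)"
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def extract_prcgs(log_text):
    """Return a list of (project, run, clone, gen) tuples in order of appearance."""
    return [tuple(int(g) for g in m.groups()) for m in PRCG_RE.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "01:16:56:WU01:FS01:0x22:Project: 16918 (Run 9, Clone 8, Gen 135)\n"
        "13:41:05:WU03:FS02:0x22:Project: 13428 (Run 2398, Clone 6, Gen 0)\n"
    )
    for prcg in extract_prcgs(sample):
        print(prcg)
```

A WU that restarts from a checkpoint logs its Project line again, so the same PRCG can show up more than once; wrap the result in set() if you only want unique entries.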
Now I will start experimenting with the regkeys...

Re: CUDA_ERROR_LAUNCH_FAILED

Post by Tuna_Ertemalp »

Tuna_Ertemalp wrote:Now I will start experimenting with the regkeys...
Sadly, the first attempt using:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000000a
"TdrLimitTime"=dword:00000078
"TdrLimitCount"=dword:0000000a
resulted in a crash at 0% in Project: 13428 (Run 2874, Clone 27, Gen 1), Unit: 0x0000000112bc7d9a000000000b3a001b:

*********************** Log Started 2020-11-10T21:43:31Z ***********************
******************************* Date: 2020-11-11 *******************************
04:58:40:WU00:FS03:Connecting to assign1.foldingathome.org:80
04:58:40:WU00:FS03:Assigned to work server 18.188.125.154
04:58:40:WU00:FS03:Requesting new work unit for slot 03: gpu:66:0 GP102 [GeForce GTX 1080 Ti] 11380 from 18.188.125.154
04:58:40:WU00:FS03:Connecting to 18.188.125.154:8080
04:58:52:WU00:FS03:Downloading 7.50MiB
04:58:56:WU00:FS03:Download complete
04:58:56:WU00:FS03:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13428 run:2874 clone:27 gen:1 core:0x22 unit:0x0000000112bc7d9a000000000b3a001b
05:01:00:WU00:FS03:Starting
05:01:00:WU00:FS03:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Master\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.13/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 14240 -checkpoint 5 -opencl-platform 0 -opencl-device 2 -cuda-device 2 -gpu-vendor nvidia -gpu 2 -gpu-usage 100
05:01:00:WU00:FS03:Started FahCore on PID 1888
05:01:00:WU00:FS03:Core PID:1444
05:01:00:WU00:FS03:FahCore 0x22 started
05:01:01:WU00:FS03:0x22:*********************** Log Started 2020-11-11T05:01:00Z ***********************
05:01:01:WU00:FS03:0x22:*************************** Core22 Folding@home Core ***************************
05:01:01:WU00:FS03:0x22:       Core: Core22
05:01:01:WU00:FS03:0x22:       Type: 0x22
05:01:01:WU00:FS03:0x22:    Version: 0.0.13
05:01:01:WU00:FS03:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
05:01:01:WU00:FS03:0x22:  Copyright: 2020 foldingathome.org
05:01:01:WU00:FS03:0x22:   Homepage: https://foldingathome.org/
05:01:01:WU00:FS03:0x22:       Date: Sep 19 2020
05:01:01:WU00:FS03:0x22:       Time: 02:35:58
05:01:01:WU00:FS03:0x22:   Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
05:01:01:WU00:FS03:0x22:     Branch: core22-0.0.13
05:01:01:WU00:FS03:0x22:   Compiler: Visual C++ 2015
05:01:01:WU00:FS03:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
05:01:01:WU00:FS03:0x22:             -DOPENMM_GIT_HASH="\"189320d0\""
05:01:01:WU00:FS03:0x22:   Platform: win32 10
05:01:01:WU00:FS03:0x22:       Bits: 64
05:01:01:WU00:FS03:0x22:       Mode: Release
05:01:01:WU00:FS03:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
05:01:01:WU00:FS03:0x22:             <peastman@stanford.edu>
05:01:01:WU00:FS03:0x22:       Args: -dir 00 -suffix 01 -version 706 -lifeline 1888 -checkpoint 5
05:01:01:WU00:FS03:0x22:             -opencl-platform 0 -opencl-device 2 -cuda-device 2 -gpu-vendor
05:01:01:WU00:FS03:0x22:             nvidia -gpu 2 -gpu-usage 100
05:01:01:WU00:FS03:0x22:************************************ libFAH ************************************
05:01:01:WU00:FS03:0x22:       Date: Sep 7 2020
05:01:01:WU00:FS03:0x22:       Time: 19:09:56
05:01:01:WU00:FS03:0x22:   Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
05:01:01:WU00:FS03:0x22:     Branch: HEAD
05:01:01:WU00:FS03:0x22:   Compiler: Visual C++ 2015
05:01:01:WU00:FS03:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
05:01:01:WU00:FS03:0x22:   Platform: win32 10
05:01:01:WU00:FS03:0x22:       Bits: 64
05:01:01:WU00:FS03:0x22:       Mode: Release
05:01:01:WU00:FS03:0x22:************************************ CBang *************************************
05:01:01:WU00:FS03:0x22:       Date: Sep 7 2020
05:01:01:WU00:FS03:0x22:       Time: 19:08:30
05:01:01:WU00:FS03:0x22:   Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
05:01:01:WU00:FS03:0x22:     Branch: HEAD
05:01:01:WU00:FS03:0x22:   Compiler: Visual C++ 2015
05:01:01:WU00:FS03:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
05:01:01:WU00:FS03:0x22:   Platform: win32 10
05:01:01:WU00:FS03:0x22:       Bits: 64
05:01:01:WU00:FS03:0x22:       Mode: Release
05:01:01:WU00:FS03:0x22:************************************ System ************************************
05:01:01:WU00:FS03:0x22:        CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
05:01:01:WU00:FS03:0x22:     CPU ID: AuthenticAMD Family 23 Model 1 Stepping 1
05:01:01:WU00:FS03:0x22:       CPUs: 32
05:01:01:WU00:FS03:0x22:     Memory: 31.88GiB
05:01:01:WU00:FS03:0x22:Free Memory: 26.12GiB
05:01:01:WU00:FS03:0x22:    Threads: WINDOWS_THREADS
05:01:01:WU00:FS03:0x22: OS Version: 6.2
05:01:01:WU00:FS03:0x22:Has Battery: false
05:01:01:WU00:FS03:0x22: On Battery: false
05:01:01:WU00:FS03:0x22: UTC Offset: -8
05:01:01:WU00:FS03:0x22:        PID: 1444
05:01:01:WU00:FS03:0x22:        CWD: C:\Users\Master\AppData\Roaming\FAHClient\work
05:01:01:WU00:FS03:0x22:************************************ OpenMM ************************************
05:01:01:WU00:FS03:0x22:   Revision: 189320d0
05:01:01:WU00:FS03:0x22:********************************************************************************
05:01:02:WU00:FS03:0x22:Project: 13428 (Run 2874, Clone 27, Gen 1)
05:01:02:WU00:FS03:0x22:Unit: 0x0000000112bc7d9a000000000b3a001b
05:01:02:WU00:FS03:0x22:Reading tar file core.xml
05:01:02:WU00:FS03:0x22:Reading tar file integrator.xml.bz2
05:01:02:WU00:FS03:0x22:Reading tar file state.xml.bz2
05:01:02:WU00:FS03:0x22:Reading tar file system.xml.bz2
05:01:02:WU00:FS03:0x22:Digital signatures verified
05:01:02:WU00:FS03:0x22:Folding@home GPU Core22 Folding@home Core
05:01:02:WU00:FS03:0x22:Version 0.0.13
05:01:02:WU00:FS03:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
05:01:02:WU00:FS03:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
05:01:02:WU00:FS03:0x22:  XTC frame write interval: 250000 steps (25%) [4 total]
05:01:02:WU00:FS03:0x22:  Global context and integrator variables write interval: 25000 steps (2.5%) [40 total]
05:01:02:WU00:FS03:0x22:There are 4 platforms available.
05:01:02:WU00:FS03:0x22:Platform 0: Reference
05:01:02:WU00:FS03:0x22:Platform 1: CPU
05:01:02:WU00:FS03:0x22:Platform 2: OpenCL
05:01:02:WU00:FS03:0x22:  opencl-device 2 specified
05:01:02:WU00:FS03:0x22:Platform 3: CUDA
05:01:02:WU00:FS03:0x22:  cuda-device 2 specified
05:01:27:WU00:FS03:0x22:Attempting to create CUDA context:
05:01:27:WU00:FS03:0x22:  Configuring platform CUDA
05:01:41:WU00:FS03:0x22:  Using CUDA and gpu 2
05:01:41:WU00:FS03:0x22:Completed 0 out of 1000000 steps (0%)
05:01:43:WU00:FS03:0x22:Checkpoint completed at step 0
05:02:45:WU00:FS03:0x22:An exception occurred at step 5313: Error downloading array buffer: CUDA_ERROR_LAUNCH_FAILED (719)
05:02:45:WU00:FS03:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
05:02:45:WU00:FS03:0x22:Folding@home Core Shutdown: CORE_RESTART
Will try with higher limits... I would have thought 10 s would be a generous enough timeout that no operation would take longer than that... :(
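For anyone comparing several of these failures, here is a small Python sketch that picks the "An exception occurred at step ..." lines out of a log and splits them into fields. The pattern is inferred from the single failure line in the log above, and the class and function names are made up for illustration, so it may need adjusting for other cores or client versions.

```python
import re
from typing import NamedTuple


class CoreException(NamedTuple):
    time: str
    wu: int
    slot: int
    step: int
    message: str


# Inferred from: "05:02:45:WU00:FS03:0x22:An exception occurred at step 5313: ..."
EXC_RE = re.compile(
    r"^(\d\d:\d\d:\d\d):WU(\d+):FS(\d+):0x[0-9a-fA-F]+:"
    r"An exception occurred at step (\d+): (.*)$",
    re.MULTILINE,
)


def find_core_exceptions(log_text):
    """Return one CoreException per matching log line, in order of appearance."""
    return [
        CoreException(m.group(1), int(m.group(2)), int(m.group(3)),
                      int(m.group(4)), m.group(5).strip())
        for m in EXC_RE.finditer(log_text)
    ]
```

Dividing the step by the total from the "Completed 0 out of 1000000 steps" line gives the progress at failure, about 0.5% in the run above.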

Re: CUDA_ERROR_LAUNCH_FAILED

Post by Tuna_Ertemalp »

Tuna_Ertemalp wrote:Will try with higher limits...
Tried allowing a 30 s delay, and still...

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000001e
"TdrLimitTime"=dword:00000190
"TdrLimitCount"=dword:0000000a
This time another GPU crashed at 68% for Project: 14905 (Run 399, Clone 4, Gen 149), Unit: 0x000000cf81d59d695f4ec9dab1d17b44:

*********************** Log Started 2020-11-11T06:26:02Z ***********************
08:39:25:WU01:FS04:Connecting to assign1.foldingathome.org:80
08:39:25:WU01:FS04:Assigned to work server 129.213.157.105
08:39:25:WU01:FS04:Requesting new work unit for slot 04: gpu:67:0 GP102 [GeForce GTX 1080 Ti] 11380 from 129.213.157.105
08:39:25:WU01:FS04:Connecting to 129.213.157.105:8080
08:39:29:WU01:FS04:Downloading 11.38MiB
08:39:32:WU01:FS04:Download complete
08:39:32:WU01:FS04:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:14905 run:399 clone:4 gen:149 core:0x22 unit:0x000000cf81d59d695f4ec9dab1d17b44
08:41:17:WU01:FS04:Starting
08:41:17:WU01:FS04:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Master\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.13/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 706 -lifeline 13496 -checkpoint 5 -opencl-platform 0 -opencl-device 3 -cuda-device 3 -gpu-vendor nvidia -gpu 3 -gpu-usage 100
08:41:17:WU01:FS04:Started FahCore on PID 3016
08:41:17:WU01:FS04:Core PID:10656
08:41:17:WU01:FS04:FahCore 0x22 started
08:41:18:WU01:FS04:0x22:*********************** Log Started 2020-11-11T08:41:17Z ***********************
08:41:18:WU01:FS04:0x22:*************************** Core22 Folding@home Core ***************************
08:41:18:WU01:FS04:0x22:       Core: Core22
08:41:18:WU01:FS04:0x22:       Type: 0x22
08:41:18:WU01:FS04:0x22:    Version: 0.0.13
08:41:18:WU01:FS04:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
08:41:18:WU01:FS04:0x22:  Copyright: 2020 foldingathome.org
08:41:18:WU01:FS04:0x22:   Homepage: https://foldingathome.org/
08:41:18:WU01:FS04:0x22:       Date: Sep 19 2020
08:41:18:WU01:FS04:0x22:       Time: 02:35:58
08:41:18:WU01:FS04:0x22:   Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
08:41:18:WU01:FS04:0x22:     Branch: core22-0.0.13
08:41:18:WU01:FS04:0x22:   Compiler: Visual C++ 2015
08:41:18:WU01:FS04:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
08:41:18:WU01:FS04:0x22:             -DOPENMM_GIT_HASH="\"189320d0\""
08:41:18:WU01:FS04:0x22:   Platform: win32 10
08:41:18:WU01:FS04:0x22:       Bits: 64
08:41:18:WU01:FS04:0x22:       Mode: Release
08:41:18:WU01:FS04:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
08:41:18:WU01:FS04:0x22:             <peastman@stanford.edu>
08:41:18:WU01:FS04:0x22:       Args: -dir 01 -suffix 01 -version 706 -lifeline 3016 -checkpoint 5
08:41:18:WU01:FS04:0x22:             -opencl-platform 0 -opencl-device 3 -cuda-device 3 -gpu-vendor
08:41:18:WU01:FS04:0x22:             nvidia -gpu 3 -gpu-usage 100
08:41:18:WU01:FS04:0x22:************************************ libFAH ************************************
08:41:18:WU01:FS04:0x22:       Date: Sep 7 2020
08:41:18:WU01:FS04:0x22:       Time: 19:09:56
08:41:18:WU01:FS04:0x22:   Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
08:41:18:WU01:FS04:0x22:     Branch: HEAD
08:41:18:WU01:FS04:0x22:   Compiler: Visual C++ 2015
08:41:18:WU01:FS04:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
08:41:18:WU01:FS04:0x22:   Platform: win32 10
08:41:18:WU01:FS04:0x22:       Bits: 64
08:41:18:WU01:FS04:0x22:       Mode: Release
08:41:18:WU01:FS04:0x22:************************************ CBang *************************************
08:41:18:WU01:FS04:0x22:       Date: Sep 7 2020
08:41:18:WU01:FS04:0x22:       Time: 19:08:30
08:41:18:WU01:FS04:0x22:   Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
08:41:18:WU01:FS04:0x22:     Branch: HEAD
08:41:18:WU01:FS04:0x22:   Compiler: Visual C++ 2015
08:41:18:WU01:FS04:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
08:41:18:WU01:FS04:0x22:   Platform: win32 10
08:41:18:WU01:FS04:0x22:       Bits: 64
08:41:18:WU01:FS04:0x22:       Mode: Release
08:41:18:WU01:FS04:0x22:************************************ System ************************************
08:41:18:WU01:FS04:0x22:        CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
08:41:18:WU01:FS04:0x22:     CPU ID: AuthenticAMD Family 23 Model 1 Stepping 1
08:41:18:WU01:FS04:0x22:       CPUs: 32
08:41:18:WU01:FS04:0x22:     Memory: 31.88GiB
08:41:18:WU01:FS04:0x22:Free Memory: 25.18GiB
08:41:18:WU01:FS04:0x22:    Threads: WINDOWS_THREADS
08:41:18:WU01:FS04:0x22: OS Version: 6.2
08:41:18:WU01:FS04:0x22:Has Battery: false
08:41:18:WU01:FS04:0x22: On Battery: false
08:41:18:WU01:FS04:0x22: UTC Offset: -8
08:41:18:WU01:FS04:0x22:        PID: 10656
08:41:18:WU01:FS04:0x22:        CWD: C:\Users\Master\AppData\Roaming\FAHClient\work
08:41:18:WU01:FS04:0x22:************************************ OpenMM ************************************
08:41:18:WU01:FS04:0x22:   Revision: 189320d0
08:41:18:WU01:FS04:0x22:********************************************************************************
08:41:18:WU01:FS04:0x22:Project: 14905 (Run 399, Clone 4, Gen 149)
08:41:18:WU01:FS04:0x22:Unit: 0x000000cf81d59d695f4ec9dab1d17b44
08:41:18:WU01:FS04:0x22:Reading tar file core.xml
08:41:18:WU01:FS04:0x22:Reading tar file integrator.xml
08:41:18:WU01:FS04:0x22:Reading tar file state.xml
08:41:19:WU01:FS04:0x22:Reading tar file system.xml
08:41:20:WU01:FS04:0x22:Digital signatures verified
08:41:20:WU01:FS04:0x22:Folding@home GPU Core22 Folding@home Core
08:41:20:WU01:FS04:0x22:Version 0.0.13
08:41:21:WU01:FS04:0x22:  Checkpoint write interval: 100000 steps (5%) [20 total]
08:41:21:WU01:FS04:0x22:  JSON viewer frame write interval: 20000 steps (1%) [100 total]
08:41:21:WU01:FS04:0x22:  XTC frame write interval: 50000 steps (2.5%) [40 total]
08:41:21:WU01:FS04:0x22:  Global context and integrator variables write interval: disabled
08:41:21:WU01:FS04:0x22:There are 4 platforms available.
08:41:21:WU01:FS04:0x22:Platform 0: Reference
08:41:21:WU01:FS04:0x22:Platform 1: CPU
08:41:21:WU01:FS04:0x22:Platform 2: OpenCL
08:41:21:WU01:FS04:0x22:  opencl-device 3 specified
08:41:21:WU01:FS04:0x22:Platform 3: CUDA
08:41:21:WU01:FS04:0x22:  cuda-device 3 specified
08:41:34:WU01:FS04:0x22:Attempting to create CUDA context:
08:41:34:WU01:FS04:0x22:  Configuring platform CUDA
08:41:41:WU01:FS04:0x22:  Using CUDA and gpu 3
08:41:41:WU01:FS04:0x22:Completed 0 out of 2000000 steps (0%)
08:41:42:WU01:FS04:0x22:Checkpoint completed at step 0
08:43:11:WU01:FS04:0x22:Completed 20000 out of 2000000 steps (1%)
08:44:40:WU01:FS04:0x22:Completed 40000 out of 2000000 steps (2%)
08:46:09:WU01:FS04:0x22:Completed 60000 out of 2000000 steps (3%)
08:47:37:WU01:FS04:0x22:Completed 80000 out of 2000000 steps (4%)
08:49:05:WU01:FS04:0x22:Completed 100000 out of 2000000 steps (5%)
08:49:06:WU01:FS04:0x22:Checkpoint completed at step 100000
08:50:35:WU01:FS04:0x22:Completed 120000 out of 2000000 steps (6%)
08:52:05:WU01:FS04:0x22:Completed 140000 out of 2000000 steps (7%)
08:53:34:WU01:FS04:0x22:Completed 160000 out of 2000000 steps (8%)
08:55:03:WU01:FS04:0x22:Completed 180000 out of 2000000 steps (9%)
08:56:31:WU01:FS04:0x22:Completed 200000 out of 2000000 steps (10%)
08:56:33:WU01:FS04:0x22:Checkpoint completed at step 200000
08:58:01:WU01:FS04:0x22:Completed 220000 out of 2000000 steps (11%)
08:59:30:WU01:FS04:0x22:Completed 240000 out of 2000000 steps (12%)
09:00:59:WU01:FS04:0x22:Completed 260000 out of 2000000 steps (13%)
09:02:27:WU01:FS04:0x22:Completed 280000 out of 2000000 steps (14%)
09:03:56:WU01:FS04:0x22:Completed 300000 out of 2000000 steps (15%)
09:03:58:WU01:FS04:0x22:Checkpoint completed at step 300000
09:05:26:WU01:FS04:0x22:Completed 320000 out of 2000000 steps (16%)
09:06:55:WU01:FS04:0x22:Completed 340000 out of 2000000 steps (17%)
09:08:24:WU01:FS04:0x22:Completed 360000 out of 2000000 steps (18%)
09:09:53:WU01:FS04:0x22:Completed 380000 out of 2000000 steps (19%)
09:11:21:WU01:FS04:0x22:Completed 400000 out of 2000000 steps (20%)
09:11:23:WU01:FS04:0x22:Checkpoint completed at step 400000
09:12:52:WU01:FS04:0x22:Completed 420000 out of 2000000 steps (21%)
09:14:21:WU01:FS04:0x22:Completed 440000 out of 2000000 steps (22%)
09:15:50:WU01:FS04:0x22:Completed 460000 out of 2000000 steps (23%)
09:17:19:WU01:FS04:0x22:Completed 480000 out of 2000000 steps (24%)
09:18:48:WU01:FS04:0x22:Completed 500000 out of 2000000 steps (25%)
09:18:50:WU01:FS04:0x22:Checkpoint completed at step 500000
09:20:18:WU01:FS04:0x22:Completed 520000 out of 2000000 steps (26%)
09:21:48:WU01:FS04:0x22:Completed 540000 out of 2000000 steps (27%)
09:23:17:WU01:FS04:0x22:Completed 560000 out of 2000000 steps (28%)
09:24:45:WU01:FS04:0x22:Completed 580000 out of 2000000 steps (29%)
09:26:15:WU01:FS04:0x22:Completed 600000 out of 2000000 steps (30%)
09:26:16:WU01:FS04:0x22:Checkpoint completed at step 600000
09:27:45:WU01:FS04:0x22:Completed 620000 out of 2000000 steps (31%)
09:29:12:WU01:FS04:0x22:Completed 640000 out of 2000000 steps (32%)
09:30:41:WU01:FS04:0x22:Completed 660000 out of 2000000 steps (33%)
09:32:09:WU01:FS04:0x22:Completed 680000 out of 2000000 steps (34%)
09:33:37:WU01:FS04:0x22:Completed 700000 out of 2000000 steps (35%)
09:33:39:WU01:FS04:0x22:Checkpoint completed at step 700000
09:35:07:WU01:FS04:0x22:Completed 720000 out of 2000000 steps (36%)
09:36:36:WU01:FS04:0x22:Completed 740000 out of 2000000 steps (37%)
09:38:04:WU01:FS04:0x22:Completed 760000 out of 2000000 steps (38%)
09:39:33:WU01:FS04:0x22:Completed 780000 out of 2000000 steps (39%)
09:41:01:WU01:FS04:0x22:Completed 800000 out of 2000000 steps (40%)
09:41:03:WU01:FS04:0x22:Checkpoint completed at step 800000
09:42:31:WU01:FS04:0x22:Completed 820000 out of 2000000 steps (41%)
09:43:57:WU01:FS04:0x22:Completed 840000 out of 2000000 steps (42%)
09:45:25:WU01:FS04:0x22:Completed 860000 out of 2000000 steps (43%)
09:46:54:WU01:FS04:0x22:Completed 880000 out of 2000000 steps (44%)
09:48:21:WU01:FS04:0x22:Completed 900000 out of 2000000 steps (45%)
09:48:23:WU01:FS04:0x22:Checkpoint completed at step 900000
09:49:51:WU01:FS04:0x22:Completed 920000 out of 2000000 steps (46%)
09:51:19:WU01:FS04:0x22:Completed 940000 out of 2000000 steps (47%)
09:52:47:WU01:FS04:0x22:Completed 960000 out of 2000000 steps (48%)
09:54:16:WU01:FS04:0x22:Completed 980000 out of 2000000 steps (49%)
09:55:45:WU01:FS04:0x22:Completed 1000000 out of 2000000 steps (50%)
09:55:46:WU01:FS04:0x22:Checkpoint completed at step 1000000
09:57:15:WU01:FS04:0x22:Completed 1020000 out of 2000000 steps (51%)
09:58:43:WU01:FS04:0x22:Completed 1040000 out of 2000000 steps (52%)
10:00:12:WU01:FS04:0x22:Completed 1060000 out of 2000000 steps (53%)
10:01:40:WU01:FS04:0x22:Completed 1080000 out of 2000000 steps (54%)
10:03:08:WU01:FS04:0x22:Completed 1100000 out of 2000000 steps (55%)
10:03:09:WU01:FS04:0x22:Checkpoint completed at step 1100000
10:04:36:WU01:FS04:0x22:Completed 1120000 out of 2000000 steps (56%)
10:06:25:WU01:FS04:0x22:Completed 1140000 out of 2000000 steps (57%)
10:08:18:WU01:FS04:0x22:Completed 1160000 out of 2000000 steps (58%)
10:10:09:WU01:FS04:0x22:Completed 1180000 out of 2000000 steps (59%)
10:12:00:WU01:FS04:0x22:Completed 1200000 out of 2000000 steps (60%)
10:12:03:WU01:FS04:0x22:Checkpoint completed at step 1200000
10:13:56:WU01:FS04:0x22:Completed 1220000 out of 2000000 steps (61%)
10:15:47:WU01:FS04:0x22:Completed 1240000 out of 2000000 steps (62%)
10:17:40:WU01:FS04:0x22:Completed 1260000 out of 2000000 steps (63%)
10:19:30:WU01:FS04:0x22:Completed 1280000 out of 2000000 steps (64%)
10:21:21:WU01:FS04:0x22:Completed 1300000 out of 2000000 steps (65%)
10:21:24:WU01:FS04:0x22:Checkpoint completed at step 1300000
10:23:16:WU01:FS04:0x22:Completed 1320000 out of 2000000 steps (66%)
10:25:06:WU01:FS04:0x22:Completed 1340000 out of 2000000 steps (67%)
10:26:57:WU01:FS04:0x22:Completed 1360000 out of 2000000 steps (68%)
10:29:50:WU01:FS04:0x22:An exception occurred at step 1378311: Error invoking kernel: CUDA_ERROR_LAUNCH_TIMEOUT (702)
10:29:50:WU01:FS04:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
10:29:50:WU01:FS04:0x22:Folding@home Core Shutdown: CORE_RESTART
******************************* Date: 2020-11-11 *******************************
This is sounding more like something is getting STUCK as opposed to taking TOO LONG.
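One way to check the "stuck vs. too long" impression against the log is to compute how long each stretch of steps took. The sketch below is only an illustration: it parses a few lines copied from the log above (an excerpt, not the whole log) and reports seconds per 1000 steps between consecutive progress points, treating the exception line as the final point. The function names are made up for this example.

```python
import re
from datetime import datetime

# Excerpt copied from the log above (55%-57%, 67%-68%, and the exception line).
SAMPLE = """\
10:03:08:WU01:FS04:0x22:Completed 1100000 out of 2000000 steps (55%)
10:03:09:WU01:FS04:0x22:Checkpoint completed at step 1100000
10:04:36:WU01:FS04:0x22:Completed 1120000 out of 2000000 steps (56%)
10:06:25:WU01:FS04:0x22:Completed 1140000 out of 2000000 steps (57%)
10:25:06:WU01:FS04:0x22:Completed 1340000 out of 2000000 steps (67%)
10:26:57:WU01:FS04:0x22:Completed 1360000 out of 2000000 steps (68%)
10:29:50:WU01:FS04:0x22:An exception occurred at step 1378311: Error invoking kernel: CUDA_ERROR_LAUNCH_TIMEOUT (702)
"""

# Timestamp, then either a progress line or the exception line.
LINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):\S*?"
    r"(?:Completed (\d+) out of|An exception occurred at step (\d+))"
)


def progress_points(log_text):
    """Return [(time, step)] for progress and exception lines; skip the rest."""
    points = []
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m:
            when = datetime.strptime(m.group(1), "%H:%M:%S")
            points.append((when, int(m.group(2) or m.group(3))))
    return points


def seconds_per_1000_steps(points):
    """Return [(end_step, seconds per 1000 steps)] for consecutive points."""
    rates = []
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        elapsed = (t1 - t0).total_seconds()
        if elapsed < 0:          # the log has no dates; assume one midnight wrap
            elapsed += 24 * 3600
        rates.append((s1, elapsed / (s1 - s0) * 1000))
    return rates


rates = seconds_per_1000_steps(progress_points(SAMPLE))
for end_step, rate in rates:
    print(f"up to step {end_step}: {rate:.2f} s per 1000 steps")
```

On this excerpt the frames go from 88 seconds per 20000 steps at 56% to about 110 seconds at 57% and later, and the final stretch took 173 seconds for only 18311 steps. That is consistent with the GPU stalling shortly before the timeout was reported, although the timing by itself does not prove the cause.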

I will go back to using the -disable-cuda expert flag on this machine.

I hope the data I provided so far was meaningful & helpful for someone who wants to investigate it.

If there is a coding solution, please update this thread so that others running into this can see the problem, the attempts, and the solution.
Small things make quality, but quality is no small thing. (Adapted from Henry Royce talking about perfection, not quality)
8 Win10 PCs/22 slots: 8x CPUs (3xAMD+5xIntel=68C/122T), 14x NVIDIA EVGA GPUs (8x 3090, 2x 3080, 4x 1080Ti) [Details in my profile]
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: CUDA_ERROR_LAUNCH_FAILED

Post by bruce »

All operating systems have commands which allow task X to start only after task Y has been started. Those commands need to be part of the startup script for FAHClient.
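As a rough illustration of the "start X only after Y" idea, here is a minimal Python sketch of a wrapper that polls a readiness check and launches a command only once the check passes. Everything specific in it is an assumption for the example: the function names are invented, using nvidia-smi's exit code as the "driver is up" test is just one possible check, and the FAHClient path in the usage comment is a guess based on the install directory shown in the log. On a real system the operating system's own service-dependency mechanism is probably the better tool.

```python
import subprocess
import time


def wait_until(is_ready, timeout_s=120.0, poll_s=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def gpu_driver_ready():
    """Assumed readiness check: nvidia-smi runs and exits with code 0."""
    try:
        result = subprocess.run(["nvidia-smi"], stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except OSError:              # nvidia-smi not found or not runnable yet
        return False
    return result.returncode == 0


def start_after(is_ready, command, **wait_kwargs):
    """Launch command only after is_ready() succeeds; raise on timeout."""
    if not wait_until(is_ready, **wait_kwargs):
        raise TimeoutError("prerequisite did not become ready in time")
    return subprocess.Popen(command)


# Hypothetical usage (path is an assumption, adjust for your install):
# start_after(gpu_driver_ready,
#             [r"C:\Program Files (x86)\FAHClient\FAHClient.exe"],
#             timeout_s=300)
```

The readiness check is passed in as a function so the same wrapper can wait on something else, such as a network service or another process.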