
Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Sat Oct 24, 2020 5:35 pm
by bruce
You're asking me to predict how Microsoft's basic logic and NVidia's drivers manage the timeout detection to achieve whatever they think is important. It's not possible to do that without actually looking at the internals or doing a lot of testing. Your question is valid, but I don't have an irrefutable answer ... and it might change in the next version of their code.

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Sat Oct 24, 2020 6:44 pm
by gunnarre
Is hardware accelerated GPU scheduling switched on?
https://www.windowslatest.com/2020/06/2 ... cheduling/

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Sat Oct 24, 2020 8:39 pm
by Tuna_Ertemalp
gunnarre wrote:Is hardware accelerated GPU scheduling switched on?
https://www.windowslatest.com/2020/06/2 ... cheduling/
Nope. And, looking at my other hosts, for the ones that have 1080/2080 GPUs, it seems the default value is OFF, and I have not turned it on for any host.

Did you suspect that an ON setting could be the culprit? Or, that an ON setting would be a better choice to avoid this situation?

Thanks!
Tuna

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Sun Oct 25, 2020 12:29 am
by bruce
The setting is new and there has been a lot of guessing about what it does. Hearing the results (if any) of changing that setting will add to the body of knowledge and maybe we'll come up with a recommendation.

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Sun Oct 25, 2020 3:50 pm
by Tuna_Ertemalp
bruce wrote:The setting is new and there has been a lot of guessing about what it does. Hearing the results (if any) of changing that setting will add to the body of knowledge and maybe we'll come up with a recommendation.
OK. I turned it ON on my 4 hosts that have 10XX/20XX/30XX GPUs and run Win10 version 2004 or 20H2 (it seems it is not supported on the Titan Z or the Titan X). If one of them starts behaving badly, I'll know.

Tuna

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Wed Oct 28, 2020 1:56 pm
by Tuna_Ertemalp
PantherX wrote:As a first step, I would finish all Slots and then only fold on 1 GPU. Assuming you initially encountered the CUDA_ERROR_LAUNCH_TIMEOUT (702) issue once a week, see if you can encounter this issue again while folding on a single GPU. If you can fold an entire week without having that issue, then start up a second GPU. Continue until you encounter that issue.
Tuna_Ertemalp wrote:Thanks for the tip about pause-on-start per GPU. Updated FAH, started GPU #1, muted others. I'll report back about the CUDA error here when I have something new...
So, for the last 5 days, I ran GPU#1 on its own for a few days. No problem. Then GPU#2 for a few days. No problem. Then GPU#3 for a few days, and the problem showed up.

Code:

05:56:11:WU02:FS03:0x22:An exception occurred at step 61949: Error downloading array energySum: CUDA_ERROR_LAUNCH_TIMEOUT (702)
05:56:11:WU02:FS03:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
05:56:11:WU02:FS03:0x22:Folding@home Core Shutdown: CORE_RESTART
But that doesn't necessarily mean that the GPU #3 hardware is at fault. In my original report on this thread, this happened on GPU #4, then on #1 and #2 in short order while GPU #4 was still in the failed state, so 3 of 4 GPUs were showing the crash report concurrently under my usual QuadGPU config. Now #3 has failed in a SingleGPU config.
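Since the failure moves between slots, tallying the CUDA errors per folding slot across a full log can help show the pattern. The following is only a minimal sketch: it assumes every relevant line has the same `time:WUxx:FSxx:0xNN:` prefix as the excerpt above, and the names `find_cuda_errors` and `errors_per_slot` are made up for illustration, not part of FAHClient.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log excerpt in this post:
# 05:56:11:WU02:FS03:0x22:... CUDA_ERROR_LAUNCH_TIMEOUT (702)
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WU(?P<wu>\d+):FS(?P<slot>\d+):0x(?P<core>[0-9a-fA-F]+):"
    r".*?(?P<error>CUDA_ERROR_[A-Z_]+) \((?P<code>\d+)\)"
)


def find_cuda_errors(lines):
    """Return one dict per log line that reports a CUDA error (hypothetical helper)."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            hits.append(
                {
                    "time": match.group("time"),
                    "wu": match.group("wu"),
                    "slot": match.group("slot"),
                    "error": match.group("error"),
                    "code": int(match.group("code")),
                }
            )
    return hits


def errors_per_slot(hits):
    """Count CUDA errors per folding slot (the FSxx field)."""
    return Counter(hit["slot"] for hit in hits)


# The three lines quoted above; only the first one names a CUDA error.
sample = [
    "05:56:11:WU02:FS03:0x22:An exception occurred at step 61949: "
    "Error downloading array energySum: CUDA_ERROR_LAUNCH_TIMEOUT (702)",
    "05:56:11:WU02:FS03:0x22:ERROR:98: Attempting to restart from last good "
    "checkpoint by restarting core.",
    "05:56:11:WU02:FS03:0x22:Folding@home Core Shutdown: CORE_RESTART",
]
hits = find_cuda_errors(sample)
counts = errors_per_slot(hits)
```

To run it over a whole log, pass the open file object instead of `sample`, since iterating a file yields its lines.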

I grabbed both the old "#4 failed on Quad" and the new "#3 failed in Single" logs, both full and filtered for the relevant GPU, and placed them under https://1drv.ms/u/s!AvC041C64j0eyINf-i6 ... Q?e=Nratne.

I am leaving my host in the failure state ("FahCore_22.exe has stopped working" for GPU#3) in case someone needs to harvest something else.

Thanks
Tuna

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Wed Oct 28, 2020 5:44 pm
by bruce
Which FAHClient 7.6.xx are you running? Have you updated recently?

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Wed Oct 28, 2020 5:53 pm
by Tuna_Ertemalp
bruce wrote:Which FAHClient 7.6.xx are you running? Have you updated recently?
I always try to use the latest release, so 7.6.21. But it had also been happening under 7.6.20. Unfortunately, I don't remember if it was happening before 7.6.20.

Tuna

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Thu Oct 29, 2020 3:37 pm
by Tuna_Ertemalp
Tuna_Ertemalp wrote:I am leaving my host in the failure state ("FahCore_22.exe has stopped working" for GPU#3) in case someone needs to harvest something else.
Given no takers for harvesting :), I have restarted the machine with only the GPU#4 running.

Tuna

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Sat Oct 31, 2020 2:48 pm
by Tuna_Ertemalp
Tuna_Ertemalp wrote:I have restarted the machine with only the GPU#4 running.
After running #4 on its own for more than a day with no problem, I restarted all GPUs in parallel and again got a crash overnight, this time on GPUs #1, #3, and #4. If you look at the early parts of this thread, it was #1, #2, and #4 at that time. And during the "single GPU" trials, it happened on #3. So, it jumps around.

I placed the full log into the same place: https://1drv.ms/u/s!AvC041C64j0eyINf-i6 ... Q?e=Nratne

Again, I am leaving the host in this state for a while to see if there is any interest in harvesting any other info.

I am thinking an instrumented debug binary given to me by a dev would serve much better in figuring this out. I don't think I am the only one running into this, although it might be rare...

Tuna

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Mon Nov 02, 2020 6:52 am
by Tuna_Ertemalp
Tuna_Ertemalp wrote:I am thinking an instrumented debug binary given to me by a dev would serve much better in figuring this out. I don't think I am the only one running into this, although it might be rare...
Since, again, no takers, I OK'd all crash dialogs, and all GPUs are now continuing.

Since this will keep happening, my current thinking is to "Finish" all GPU projects, and then disable CUDA on all of them, just to see if that will work: "FAHControl -> Configure -> Expert -> Click Add under Extra Core Options -> -disable-cuda -> OK -> Save"

I will report my findings.

Tuna

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Mon Nov 02, 2020 7:08 am
by PantherX
BTW, if you were to publish the PRCGs of the WUs that failed, we can potentially pass that on to the researchers for further inspection. That's assuming that all the information was captured in the upload. This is what a PRCG looks like in the log file:
Project: 16929 (Run 0, Clone 8, Gen 2)
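For anyone who wants to pull the PRCGs out of a log themselves, here is a small sketch. It assumes the log line looks exactly like the example above; `PRCG` and `parse_prcg` are illustrative names, not anything provided by FAHClient.

```python
import re
from typing import NamedTuple, Optional


class PRCG(NamedTuple):
    """Project, Run, Clone, Gen identifiers of one work unit."""
    project: int
    run: int
    clone: int
    gen: int


# Assumed format, based on the example line quoted in this post.
PRCG_RE = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s+(\d+),\s*Clone\s+(\d+),\s*Gen\s+(\d+)\)"
)


def parse_prcg(line: str) -> Optional[PRCG]:
    """Return the PRCG found in a log line, or None if the line has none."""
    match = PRCG_RE.search(line)
    if match is None:
        return None
    return PRCG(*(int(value) for value in match.groups()))


example = parse_prcg("Project: 16929 (Run 0, Clone 8, Gen 2)")
```

Because the pattern is matched with `search`, it should also work when the log line carries a timestamp or slot prefix in front of `Project:`.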

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Mon Nov 02, 2020 8:32 am
by Tuna_Ertemalp
Heading to bed in a minute, so I will do that tomorrow. I am guessing they are already in the logs I published, right? So, you'd like me to find them and post here?

Thanks
Tuna

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Mon Nov 02, 2020 8:53 am
by PantherX
Ah, you're right. I grabbed the data from the OneDrive link you posted and have notified a few people about this. Let's wait and see what happens :)

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Mon Nov 02, 2020 5:11 pm
by Tuna_Ertemalp
Awesome! Thank you for saving me the trouble... :)

In the meantime, for the last 7.5 hrs, all 4 GPUs have been ticking along without a crash, with CUDA disabled. Fingers crossed.

I'll report back IF there is a crash. Otherwise, it would be safe to assume it is working.

If there is anything from the developers/researchers, I am all ears and am open to testing anything they want to throw my way.

Tuna