Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Wed Nov 11, 2020 6:00 pm
by Tuna_Ertemalp
bruce wrote:All operating systems have commands which allow task X to start only after task Y has been started. Those commands need to be part of the startup script for FAHClient.
Ummm... ??

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Wed Nov 11, 2020 10:58 pm
by Tuna_Ertemalp
Tuna_Ertemalp wrote:I will go back to using -disable-cuda expert flag on this machine.
And, I just sliced & diced some data using HFM.NET, and my 1080Ti GPUs end up going from low 2.x million PPD to something like low-to-mid 1.x million PPD when CUDA is disabled. So, right now, I have five 1080Ti cards (out of my 9, i.e. 55% of my 1080Ti GPUs) that are running under -disable-cuda which is costing me about 2.5-5 million PPDs. :(
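
The estimate above can be reproduced with a quick back-of-the-envelope sketch. The per-card drop of 0.5 to 1.0 million PPD is an assumption inferred from the "low 2.x million" and "low-to-mid 1.x million" figures, not a measured value, and the function name is purely illustrative. It also shows that 5 of 9 cards is about 55.6%, which the post gives as 55%.

```python
def ppd_loss_range(cards_without_cuda, min_drop_per_card, max_drop_per_card):
    """Return the (low, high) total PPD lost across the affected cards."""
    return (cards_without_cuda * min_drop_per_card,
            cards_without_cuda * max_drop_per_card)


# Assumption: each 1080Ti loses between 0.5 and 1.0 million PPD without CUDA.
low, high = ppd_loss_range(5, 500_000, 1_000_000)

# Share of the nine 1080Ti cards that run with CUDA disabled.
share = 5 / 9

print(f"{low:,} to {high:,} PPD lost; {share:.1%} of the 1080Ti cards affected")
```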

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Sat Nov 14, 2020 10:23 pm
by bruce
Is this information up-to-date?

It seems like your GTX1080TI doesn't exist.

Have you left at least one CPU free per GPU to support data transfers to/from it? If that thread has to wait for resources it will definitely throttle your GPU.

Hardware configuration:
    (OS) CPU (cores), Memory, GPU(s), Motherboard:

    (Win10) AMD Ryzen 5 3600 (6C), 32G DDR4-1200, Titan X, Gigabyte AB350M-D3H-CF
    (Win10) Intel Core i7 5960X (8C), 32G DDR4-2133, 2080 Ti Hybrid, ASUS X99-M WS
    (Win10) Intel Core i7 5960X (8C), 32G DDR4-2400, 2x 3090 FTW3 Ultra, ASUS X99-E WS/USB 3.1
    (Win10) Intel Core i7 970 (6C), 24G DDR3-1333, 1080Ti Hybrid, ASUS RAMPAGE III GENE
    (Win10) Intel Core i7 5960X (8C), 16G DDR4-2400, 1080Ti Hybrid, ASRock X99 OC Formula/3.1
    (Win10) Intel Core i7 2600 (4C), 12G DDR3-1333, Titan X Hybrid, ASUS P8P67
    (Win10) AMD Ryzen TR 1950X (16C), 32G DDR4-2133, 4x 1080Ti Hybrid, ASRock X399 Taichi
    (Win10) Intel Core i7 5960X (8C), 64G DDR4-2400, 3x 1080Ti Hybrid, MSI X99A XPOWER AC
    (Win7) Intel Core i7 2600 (4C), 16G DDR3-1333, GTX 580, Intel DP67BG

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Sat Nov 14, 2020 11:08 pm
by Tuna_Ertemalp
bruce wrote:Is this information up-to-date?
Yes.
bruce wrote:It seems like your GTX1080TI doesn't exist.
I don't understand what you mean by "it doesn't exist": "(Win10) AMD Ryzen TR 1950X (16C), 32G DDR4-2133, 4x 1080Ti Hybrid, ASRock X399 Taichi"

Maybe it is because my list doesn't use the "GTX"/"RTX" prefixes and you searched for "GTX1080TI" instead of "1080Ti". Given the model of the card, those two are basically fixed marketing names (580-1080 are GTX="Giga Texel Shader eXtreme", 2080...3090 are RTX="Ray Tracing Texel eXtreme"). :)
bruce wrote:Have you left at least one CPU free per GPU to support data transfers to/from it? IF that thread has to wait for resources it will definitely throttle your GPU.
Yes. Without me doing anything, by default, due to the CPUs=-1 setting in the CPU category under the SLOTS tab of the CONFIGURATION dialog, FAH automatically reduces the thread count from 32 to 28, reserving 4 threads to the 4 GPUs. By the way, that is what happens on all my hosts. I don't play around with those "-1" settings, and let FAH decide what to do with the hardware.

So, there is plenty of CPU power, RAM, SSD and hybrid cooling (with temps always <50C for each card) to go around this host.
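
For illustration, the thread reservation described above works out as follows. This is only a sketch of the arithmetic (32 logical threads minus one per GPU), not FAHClient's actual logic, which may apply further adjustments; the function name is hypothetical.

```python
def cpu_threads_for_folding(logical_threads, gpu_count):
    """Threads left for the CPU slot after reserving one thread per GPU.

    Illustrative only: mirrors the 32 -> 28 reduction described in the post,
    not the real FAHClient implementation.
    """
    if logical_threads < 1 or gpu_count < 0:
        raise ValueError("need at least one thread and a non-negative GPU count")
    return max(0, logical_threads - gpu_count)


# Ryzen TR 1950X: 16 cores / 32 threads, with 4x 1080Ti installed.
print(cpu_threads_for_folding(32, 4))
```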

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Sat Nov 14, 2020 11:15 pm
by Tuna_Ertemalp
Something worth mentioning: Yesterday, I took the time to do one ultimate Hail Mary move and installed a fresh copy of Win10/Pro on a different, empty SSD in this machine, booted from there, and reinstalled everything on that SSD, from FAH to drivers to whatever else. Essentially, the exact same hardware, but a fresh clean install of every piece of software and the data it downloads. So far it has run for 1 day and 2 hours, as of this post, without a problem. Of course, that doesn't prove anything yet. I hope I didn't just jinx it. I am crossing my fingers that the problem was due to some Windows component triggering a TDR event by mistake, and that by running a fresh copy of everything, maybe that erroneous behavior goes away. I guess we'll see if this runs for a week untouched without any crashes.

Yet, I hope you would agree that this would be a drastic fix. The software should be able to deal with such errors without being stuck in the UI with a dialog waiting for human intervention.

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Mon Nov 16, 2020 6:49 am
by PantherX
Hopefully, the fresh installation has fixed it. I am curious as to how many other applications you installed since it could be an application that might be causing conflicts with F@H.

I do agree that the software should handle the error without user intervention... however, we need to figure out where the error occurs, before we can see who can fix it (Microsoft, Nvidia, F@H, something else).

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Mon Nov 16, 2020 3:19 pm
by Tuna_Ertemalp
PantherX wrote:Hopefully, the fresh installation has fixed it.
I hope so, too. So far, 2 days and 18 hours of uptime, and still ticking...
PantherX wrote:I am curious as to how many other applications you installed since it could be an application that might be causing conflicts with F@H.
I can tell you exactly since I keep an XLS to track what needs to be updated on each host as versions change. Out of my 9 hosts, 1 is also my regular office PC with other stuff installed, but the remaining 8 of them are basically dedicated to compute, running the same bare minimum of apps:

Windows 10: 20H2 19042.630
Chrome: auto-updated to latest
BOINC: 7.16.11 (inactive)
FAHclient: 7.6.21 (active)
VirtualBox: 6.1.16 (inactive)
MSI AfterBurner: 4.6.2 (active, to watch temperatures)
GPUz: 2.35.0 (launched on demand)
CPUz: 1.94.0 (launched on demand)
NZXT CAM: 4.15.0 (launched on demand)
TeamViewer: 15.11.6.0 (for remote access)
EVGA Precision X1: 1.1.1 (launched on demand, used to update GPU BIOS/Firmware)
Nvidia Experience/Driver: 3.20.5.70/457.30

No need to be irked by VBox/BOINC; they are not doing any compute jobs since I started contributing to F@H.
PantherX wrote:I do agree that the software should handle the error without user intervention... however, we need to figure out where the error occurs, before we can see who can fix it (Microsoft, Nvidia, F@H, something else).
If the refreshed host continues to work, we might have lost the repro case.

Re: CUDA_ERROR_LAUNCH_FAILED

Posted: Fri Nov 20, 2020 8:31 pm
by Tuna_Ertemalp
Tuna_Ertemalp wrote:If the refreshed host continues to work, we might have lost the repro case.
To report back on progress: After a full 6d 23h 30m run, the refreshed quad 1080Ti Hybrid host is still working without any CUDA errors.
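
One way to check a log for this error without reading it by hand is sketched below. It only counts lines containing the error name from the thread title; the sample lines are made up for illustration and do not reflect the real FAHClient log format.

```python
ERROR_TOKEN = "CUDA_ERROR_LAUNCH_FAILED"


def count_cuda_launch_failures(lines):
    """Count the lines that mention the CUDA launch failure."""
    return sum(1 for line in lines if ERROR_TOKEN in line)


# Hypothetical sample lines, invented for this sketch.
sample_lines = [
    "work unit started",
    "error: CUDA_ERROR_LAUNCH_FAILED",
    "work unit finished",
]

print(count_cuda_launch_failures(sample_lines))
```

For a real log, the same function could be fed an open file object, since iterating over a text file yields its lines.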