NVIDIA GPUs stuck at Send or Clear after completing WUs

Moderators: Site Moderators, FAHC Science Team

ostieca
Posts: 7
Joined: Sat Jan 09, 2021 12:01 am

NVIDIA GPUs stuck at Send or Clear after completing WUs

Post by ostieca »

Hi there,

I'm scratching my head wondering why, of all the CPUs and GPUs that I'm using for folding, only the NVIDIA GPUs are now problematic. Intel CPU folding is fine, as is AMD GPU folding, on all four Windows PCs (a mix of Windows 7 and 10, x64) I have tried so far.

Hoping someone can point me toward the reason why I have to reboot the PCs to clear that.

Here's the log from one where the upload completed and is now stuck at Cleanup state:

14:09:44:Saving configuration to config.xml
14:09:44:<config>
14:09:44: <!-- Folding Slot Configuration -->
14:09:44: <cause v='HIGH_PRIORITY'/>
14:09:44:
14:09:44: <!-- Network -->
14:09:44: <proxy v=':8080'/>
14:09:44:
14:09:44: <!-- Slot Control -->
14:09:44: <power v='full'/>
14:09:44:
14:09:44: <!-- User Information -->
14:09:44: <passkey v='*****'/>
14:09:44: <team v='198'/>
14:09:44: <user v='Daniel_Trudeau'/>
14:09:44:
14:09:44: <!-- Folding Slots -->
14:09:44: <slot id='0' type='CPU'>
14:09:44: <cpus v='4'/>
14:09:44: <paused v='True'/>
14:09:44: </slot>
14:09:44: <slot id='1' type='GPU'>
14:09:44: <pause-on-start v='True'/>
14:09:44: <pci-bus v='1'/>
14:09:44: <pci-slot v='0'/>
14:09:44: </slot>
14:09:44:</config>
14:09:44:Saving configuration to config.xml
14:09:44:<config>
14:09:44: <!-- Folding Slot Configuration -->
14:09:44: <cause v='HIGH_PRIORITY'/>
14:09:44:
14:09:44: <!-- Network -->
14:09:44: <proxy v=':8080'/>
14:09:44:
14:09:44: <!-- Slot Control -->
14:09:44: <power v='full'/>
14:09:44:
14:09:44: <!-- User Information -->
14:09:44: <passkey v='*****'/>
14:09:44: <team v='198'/>
14:09:44: <user v='Daniel_Trudeau'/>
14:09:44:
14:09:44: <!-- Folding Slots -->
14:09:44: <slot id='0' type='CPU'>
14:09:44: <cpus v='4'/>
14:09:44: <pause-on-start v='True'/>
14:09:44: <paused v='True'/>
14:09:44: </slot>
14:09:44: <slot id='1' type='GPU'>
14:09:44: <pause-on-start v='True'/>
14:09:44: <pci-bus v='1'/>
14:09:44: <pci-slot v='0'/>
14:09:44: </slot>
14:09:44:</config>
14:10:10:WU01:FS01:0x23:Completed 25000 out of 2500000 steps (1%)
14:10:45:Saving configuration to config.xml
14:10:45:<config>
14:10:45: <!-- Folding Slot Configuration -->
14:10:45: <cause v='HIGH_PRIORITY'/>
14:10:45:
14:10:45: <!-- Network -->
14:10:45: <proxy v=':8080'/>
14:10:45:
14:10:45: <!-- Slot Control -->
14:10:45: <power v='full'/>
14:10:45:
14:10:45: <!-- User Information -->
14:10:45: <passkey v='*****'/>
14:10:45: <team v='198'/>
14:10:45: <user v='Daniel_Trudeau'/>
14:10:45:
14:10:45: <!-- Folding Slots -->
14:10:45: <slot id='0' type='CPU'>
14:10:45: <cpus v='4'/>
14:10:45: <pause-on-start v='True'/>
14:10:45: <paused v='True'/>
14:10:45: </slot>
14:10:45: <slot id='1' type='GPU'>
14:10:45: <pause-on-start v='True'/>
14:10:45: <pci-bus v='1'/>
14:10:45: <pci-slot v='0'/>
14:10:45: </slot>
14:10:45:</config>
14:11:34:WU01:FS01:0x23:Completed 50000 out of 2500000 steps (2%)
14:11:35:WU01:FS01:0x23:Checkpoint completed at step 50000
14:12:58:WU01:FS01:0x23:Completed 75000 out of 2500000 steps (3%)
14:14:22:WU01:FS01:0x23:Completed 100000 out of 2500000 steps (4%)
14:14:22:WU01:FS01:0x23:Checkpoint completed at step 100000
14:15:47:WU01:FS01:0x23:Completed 125000 out of 2500000 steps (5%)
14:17:09:WU01:FS01:0x23:Completed 150000 out of 2500000 steps (6%)
14:17:09:WU01:FS01:0x23:Checkpoint completed at step 150000
14:18:32:WU01:FS01:0x23:Completed 175000 out of 2500000 steps (7%)
14:19:54:WU01:FS01:0x23:Completed 200000 out of 2500000 steps (8%)
14:19:54:WU01:FS01:0x23:Checkpoint completed at step 200000
14:21:17:WU01:FS01:0x23:Completed 225000 out of 2500000 steps (9%)
14:22:39:WU01:FS01:0x23:Completed 250000 out of 2500000 steps (10%)
14:22:40:WU01:FS01:0x23:Checkpoint completed at step 250000
14:24:02:WU01:FS01:0x23:Completed 275000 out of 2500000 steps (11%)
14:25:25:WU01:FS01:0x23:Completed 300000 out of 2500000 steps (12%)
14:25:25:WU01:FS01:0x23:Checkpoint completed at step 300000
14:26:47:WU01:FS01:0x23:Completed 325000 out of 2500000 steps (13%)
14:28:10:WU01:FS01:0x23:Completed 350000 out of 2500000 steps (14%)
14:28:10:WU01:FS01:0x23:Checkpoint completed at step 350000
14:29:32:WU01:FS01:0x23:Completed 375000 out of 2500000 steps (15%)
14:30:54:WU01:FS01:0x23:Completed 400000 out of 2500000 steps (16%)
14:30:54:WU01:FS01:0x23:Checkpoint completed at step 400000
14:32:17:WU01:FS01:0x23:Completed 425000 out of 2500000 steps (17%)
14:33:39:WU01:FS01:0x23:Completed 450000 out of 2500000 steps (18%)
14:33:39:WU01:FS01:0x23:Checkpoint completed at step 450000
14:35:01:WU01:FS01:0x23:Completed 475000 out of 2500000 steps (19%)
14:36:23:WU01:FS01:0x23:Completed 500000 out of 2500000 steps (20%)
14:36:23:WU01:FS01:0x23:Checkpoint completed at step 500000
14:37:46:WU01:FS01:0x23:Completed 525000 out of 2500000 steps (21%)
14:39:08:WU01:FS01:0x23:Completed 550000 out of 2500000 steps (22%)
14:39:08:WU01:FS01:0x23:Checkpoint completed at step 550000
14:40:30:WU01:FS01:0x23:Completed 575000 out of 2500000 steps (23%)
14:41:52:WU01:FS01:0x23:Completed 600000 out of 2500000 steps (24%)
14:41:53:WU01:FS01:0x23:Checkpoint completed at step 600000
14:43:15:WU01:FS01:0x23:Completed 625000 out of 2500000 steps (25%)
14:44:38:WU01:FS01:0x23:Completed 650000 out of 2500000 steps (26%)
14:44:38:WU01:FS01:0x23:Checkpoint completed at step 650000
14:46:01:WU01:FS01:0x23:Completed 675000 out of 2500000 steps (27%)
14:47:26:WU01:FS01:0x23:Completed 700000 out of 2500000 steps (28%)
14:47:26:WU01:FS01:0x23:Checkpoint completed at step 700000
14:48:48:WU01:FS01:0x23:Completed 725000 out of 2500000 steps (29%)
14:50:11:WU01:FS01:0x23:Completed 750000 out of 2500000 steps (30%)
14:50:11:WU01:FS01:0x23:Checkpoint completed at step 750000
14:51:33:WU01:FS01:0x23:Completed 775000 out of 2500000 steps (31%)
14:52:55:WU01:FS01:0x23:Completed 800000 out of 2500000 steps (32%)
14:52:55:WU01:FS01:0x23:Checkpoint completed at step 800000
14:54:17:WU01:FS01:0x23:Completed 825000 out of 2500000 steps (33%)
14:55:39:WU01:FS01:0x23:Completed 850000 out of 2500000 steps (34%)
14:55:39:WU01:FS01:0x23:Checkpoint completed at step 850000
14:57:01:WU01:FS01:0x23:Completed 875000 out of 2500000 steps (35%)
14:58:23:WU01:FS01:0x23:Completed 900000 out of 2500000 steps (36%)
14:58:24:WU01:FS01:0x23:Checkpoint completed at step 900000
14:59:46:WU01:FS01:0x23:Completed 925000 out of 2500000 steps (37%)
15:01:08:WU01:FS01:0x23:Completed 950000 out of 2500000 steps (38%)
15:01:08:WU01:FS01:0x23:Checkpoint completed at step 950000
15:02:30:WU01:FS01:0x23:Completed 975000 out of 2500000 steps (39%)
15:03:53:WU01:FS01:0x23:Completed 1000000 out of 2500000 steps (40%)
15:03:53:WU01:FS01:0x23:Checkpoint completed at step 1000000
15:05:16:WU01:FS01:0x23:Completed 1025000 out of 2500000 steps (41%)
15:06:42:WU01:FS01:0x23:Completed 1050000 out of 2500000 steps (42%)
15:06:43:WU01:FS01:0x23:Checkpoint completed at step 1050000
15:08:07:WU01:FS01:0x23:Completed 1075000 out of 2500000 steps (43%)
15:09:33:WU01:FS01:0x23:Completed 1100000 out of 2500000 steps (44%)
15:09:33:WU01:FS01:0x23:Checkpoint completed at step 1100000
15:10:58:WU01:FS01:0x23:Completed 1125000 out of 2500000 steps (45%)
15:12:23:WU01:FS01:0x23:Completed 1150000 out of 2500000 steps (46%)
15:12:23:WU01:FS01:0x23:Checkpoint completed at step 1150000
15:13:47:WU01:FS01:0x23:Completed 1175000 out of 2500000 steps (47%)
15:15:12:WU01:FS01:0x23:Completed 1200000 out of 2500000 steps (48%)
15:15:12:WU01:FS01:0x23:Checkpoint completed at step 1200000
15:16:39:WU01:FS01:0x23:Completed 1225000 out of 2500000 steps (49%)
15:18:03:WU01:FS01:0x23:Completed 1250000 out of 2500000 steps (50%)
15:18:03:WU01:FS01:0x23:Checkpoint completed at step 1250000
15:19:28:WU01:FS01:0x23:Completed 1275000 out of 2500000 steps (51%)
15:20:51:WU01:FS01:0x23:Completed 1300000 out of 2500000 steps (52%)
15:20:51:WU01:FS01:0x23:Checkpoint completed at step 1300000
15:22:13:WU01:FS01:0x23:Completed 1325000 out of 2500000 steps (53%)
15:23:37:WU01:FS01:0x23:Completed 1350000 out of 2500000 steps (54%)
15:23:38:WU01:FS01:0x23:Checkpoint completed at step 1350000
15:25:01:WU01:FS01:0x23:Completed 1375000 out of 2500000 steps (55%)
15:26:25:WU01:FS01:0x23:Completed 1400000 out of 2500000 steps (56%)
15:26:25:WU01:FS01:0x23:Checkpoint completed at step 1400000
15:27:49:WU01:FS01:0x23:Completed 1425000 out of 2500000 steps (57%)
15:29:11:WU01:FS01:0x23:Completed 1450000 out of 2500000 steps (58%)
15:29:11:WU01:FS01:0x23:Checkpoint completed at step 1450000
15:30:36:WU01:FS01:0x23:Completed 1475000 out of 2500000 steps (59%)
15:31:59:WU01:FS01:0x23:Completed 1500000 out of 2500000 steps (60%)
15:31:59:WU01:FS01:0x23:Checkpoint completed at step 1500000
15:33:28:WU01:FS01:0x23:Completed 1525000 out of 2500000 steps (61%)
15:34:55:WU01:FS01:0x23:Completed 1550000 out of 2500000 steps (62%)
15:34:56:WU01:FS01:0x23:Checkpoint completed at step 1550000
15:36:19:WU01:FS01:0x23:Completed 1575000 out of 2500000 steps (63%)
15:37:43:WU01:FS01:0x23:Completed 1600000 out of 2500000 steps (64%)
15:37:43:WU01:FS01:0x23:Checkpoint completed at step 1600000
15:39:08:WU01:FS01:0x23:Completed 1625000 out of 2500000 steps (65%)
15:40:32:WU01:FS01:0x23:Completed 1650000 out of 2500000 steps (66%)
15:40:33:WU01:FS01:0x23:Checkpoint completed at step 1650000
15:41:55:WU01:FS01:0x23:Completed 1675000 out of 2500000 steps (67%)
15:43:17:WU01:FS01:0x23:Completed 1700000 out of 2500000 steps (68%)
15:43:17:WU01:FS01:0x23:Checkpoint completed at step 1700000
15:44:40:WU01:FS01:0x23:Completed 1725000 out of 2500000 steps (69%)
15:46:07:WU01:FS01:0x23:Completed 1750000 out of 2500000 steps (70%)
15:46:07:WU01:FS01:0x23:Checkpoint completed at step 1750000
15:47:32:WU01:FS01:0x23:Completed 1775000 out of 2500000 steps (71%)
15:48:57:WU01:FS01:0x23:Completed 1800000 out of 2500000 steps (72%)
15:48:58:WU01:FS01:0x23:Checkpoint completed at step 1800000
15:50:23:WU01:FS01:0x23:Completed 1825000 out of 2500000 steps (73%)
15:51:47:WU01:FS01:0x23:Completed 1850000 out of 2500000 steps (74%)
15:51:47:WU01:FS01:0x23:Checkpoint completed at step 1850000
15:53:09:WU01:FS01:0x23:Completed 1875000 out of 2500000 steps (75%)
15:54:33:WU01:FS01:0x23:Completed 1900000 out of 2500000 steps (76%)
15:54:33:WU01:FS01:0x23:Checkpoint completed at step 1900000
15:56:05:WU01:FS01:0x23:Completed 1925000 out of 2500000 steps (77%)
15:57:38:WU01:FS01:0x23:Completed 1950000 out of 2500000 steps (78%)
15:57:38:WU01:FS01:0x23:Checkpoint completed at step 1950000
15:59:09:WU01:FS01:0x23:Completed 1975000 out of 2500000 steps (79%)
16:00:44:WU01:FS01:0x23:Completed 2000000 out of 2500000 steps (80%)
16:00:45:WU01:FS01:0x23:Checkpoint completed at step 2000000
16:02:14:WU01:FS01:0x23:Completed 2025000 out of 2500000 steps (81%)
16:03:38:WU01:FS01:0x23:Completed 2050000 out of 2500000 steps (82%)
16:03:39:WU01:FS01:0x23:Checkpoint completed at step 2050000
16:05:06:WU01:FS01:0x23:Completed 2075000 out of 2500000 steps (83%)
16:06:36:WU01:FS01:0x23:Completed 2100000 out of 2500000 steps (84%)
16:06:37:WU01:FS01:0x23:Checkpoint completed at step 2100000
16:08:06:WU01:FS01:0x23:Completed 2125000 out of 2500000 steps (85%)
16:09:33:WU01:FS01:0x23:Completed 2150000 out of 2500000 steps (86%)
16:09:34:WU01:FS01:0x23:Checkpoint completed at step 2150000
16:11:01:WU01:FS01:0x23:Completed 2175000 out of 2500000 steps (87%)
16:12:29:WU01:FS01:0x23:Completed 2200000 out of 2500000 steps (88%)
16:12:29:WU01:FS01:0x23:Checkpoint completed at step 2200000
16:13:57:WU01:FS01:0x23:Completed 2225000 out of 2500000 steps (89%)
16:15:23:WU01:FS01:0x23:Completed 2250000 out of 2500000 steps (90%)
16:15:23:WU01:FS01:0x23:Checkpoint completed at step 2250000
16:16:46:WU01:FS01:0x23:Completed 2275000 out of 2500000 steps (91%)
16:18:08:WU01:FS01:0x23:Completed 2300000 out of 2500000 steps (92%)
16:18:09:WU01:FS01:0x23:Checkpoint completed at step 2300000
16:19:31:WU01:FS01:0x23:Completed 2325000 out of 2500000 steps (93%)
16:20:54:WU01:FS01:0x23:Completed 2350000 out of 2500000 steps (94%)
16:20:54:WU01:FS01:0x23:Checkpoint completed at step 2350000
16:22:17:WU01:FS01:0x23:Completed 2375000 out of 2500000 steps (95%)
16:23:39:WU01:FS01:0x23:Completed 2400000 out of 2500000 steps (96%)
16:23:40:WU01:FS01:0x23:Checkpoint completed at step 2400000
16:25:02:WU01:FS01:0x23:Completed 2425000 out of 2500000 steps (97%)
16:26:25:WU01:FS01:0x23:Completed 2450000 out of 2500000 steps (98%)
16:26:25:WU01:FS01:0x23:Checkpoint completed at step 2450000
16:27:48:WU01:FS01:0x23:Completed 2475000 out of 2500000 steps (99%)
16:29:10:WU01:FS01:0x23:Completed 2500000 out of 2500000 steps (100%)
16:29:10:WU01:FS01:0x23:Average performance: 104.727 ns/day
16:29:11:WU01:FS01:0x23:Checkpoint completed at step 2500000
16:29:13:WU01:FS01:0x23:Saving result file ..\logfile_01.txt
16:29:13:WU01:FS01:0x23:Saving result file checkpointIntegrator.xml
16:29:13:WU01:FS01:0x23:Saving result file checkpointState.xml.bz2
16:29:13:WU01:FS01:0x23:Saving result file positions.xtc
16:29:13:WU01:FS01:0x23:Saving result file science.log
16:29:13:WU01:FS01:0x23:Saving result file xtcAtoms.csv.bz2
16:29:13:WU01:FS01:0x23:Folding@home Core Shutdown: FINISHED_UNIT
17:17:29:FS01:Shutting core down
17:17:29:WU01:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
17:17:41:WU01:FS01:Starting
17:17:41:ERROR:WU01:FS01:Failed to start core: Failed to rename 'work/01/logfile_01.txt' to 'work/01/logfile_01-20240103-171741.txt': The process cannot access the file because it is being used by another process.
17:19:18:WU01:FS01:Starting
17:19:18:ERROR:WU01:FS01:Failed to start core: Failed to rename 'work/01/logfile_01.txt' to 'work/01/logfile_01-20240103-171918.txt': The process cannot access the file because it is being used by another process.
17:21:55:WU01:FS01:Starting
17:21:55:ERROR:WU01:FS01:Failed to start core: Failed to rename 'work/01/logfile_01.txt' to 'work/01/logfile_01-20240103-172155.txt': The process cannot access the file because it is being used by another process.
17:26:09:WU01:FS01:Starting
17:26:09:ERROR:WU01:FS01:Failed to start core: Failed to rename 'work/01/logfile_01.txt' to 'work/01/logfile_01-20240103-172609.txt': The process cannot access the file because it is being used by another process.
17:33:01:WU01:FS01:Starting
17:33:01:ERROR:WU01:FS01:Failed to start core: Failed to rename 'work/01/logfile_01.txt' to 'work/01/logfile_01-20240103-173301.txt': The process cannot access the file because it is being used by another process.
17:33:01:WU01:FS01:Sending unit results: id:01 state:SEND error:FAILED project:12262 run:0 clone:207 gen:18 core:0x23 unit:0x000000cf0000001200002fe600000000
17:33:01:WU01:FS01:Uploading 5.73MiB to 158.130.118.23
17:33:01:WU01:FS01:Connecting to 158.130.118.23:8080
17:33:05:WU01:FS01:Upload complete
17:33:05:WU01:FS01:Server responded WORK_ACK (400)
17:33:05:WU01:FS01:Cleaning up
17:33:05:ERROR:WU01:FS01:Exception: Failed to remove 'work/01/logfile_01.txt': The process cannot access the file because it is being used by another process.
Gary480six
Posts: 91
Joined: Mon Jan 21, 2008 6:42 pm

Re: NVIDIA GPUs stuck at Send or Clear after completing WUs

Post by Gary480six »

This issue was covered in a group of posts here on the forum. viewtopic.php?t=40862

It is an issue with Windows 7 systems and the new GPU Core23.

Windows 7 systems cannot fold on that new core. Well, they fold to 100% and then fail to return the finished work, at which point the client is stuck.

So far, the only 'solution' is to upgrade to Windows 10/11 or switch to Linux.

Some folks who, for whatever reason, do not want to or cannot switch from Windows 7 have just shut off Folding on those systems.

It does not appear that the Folding techs have any intention of adjusting the Core23 code so it works with Windows 7.
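
For anyone who wants to check whether a machine is hitting this same pattern without reading the whole log by eye, here is a rough Python sketch. It is not part of the Folding@home client; the function name find_stuck_units and the two regular expressions are my own assumptions, written only against the log lines posted above (a FINISHED_UNIT line followed by "Failed to rename" or "Failed to remove" errors for the same WU and slot).

```python
import re

# Assumed line shapes, taken from the log posted earlier in this thread:
#   16:29:13:WU01:FS01:0x23:Folding@home Core Shutdown: FINISHED_UNIT
#   17:17:41:ERROR:WU01:FS01:Failed to start core: Failed to rename '...' ...
FINISHED = re.compile(
    r"^\d{2}:\d{2}:\d{2}:(WU\d+):(FS\d+):.*Core Shutdown: FINISHED_UNIT"
)
LOCK_ERROR = re.compile(
    r"^\d{2}:\d{2}:\d{2}:ERROR:(WU\d+):(FS\d+):.*"
    r"Failed to (rename|remove) '([^']+)'.*"
    r"being used by another process"
)


def find_stuck_units(lines):
    """Return {(wu, slot): [locked paths]} for units that reported
    FINISHED_UNIT and afterwards hit file-in-use errors."""
    finished = set()
    stuck = {}
    for line in lines:
        m = FINISHED.match(line)
        if m:
            finished.add((m.group(1), m.group(2)))
            continue
        m = LOCK_ERROR.match(line)
        if m and (m.group(1), m.group(2)) in finished:
            paths = stuck.setdefault((m.group(1), m.group(2)), [])
            if m.group(4) not in paths:  # report each locked file once
                paths.append(m.group(4))
    return stuck


if __name__ == "__main__":
    # Sample lines copied from the log in the first post.
    sample = [
        "16:29:13:WU01:FS01:0x23:Folding@home Core Shutdown: FINISHED_UNIT",
        "17:17:41:ERROR:WU01:FS01:Failed to start core: Failed to rename "
        "'work/01/logfile_01.txt' to 'work/01/logfile_01-20240103-171741.txt': "
        "The process cannot access the file because it is being used by "
        "another process.",
        "17:33:05:ERROR:WU01:FS01:Exception: Failed to remove "
        "'work/01/logfile_01.txt': The process cannot access the file because "
        "it is being used by another process.",
    ]
    for (wu, slot), paths in find_stuck_units(sample).items():
        print(wu, slot, "locked:", ", ".join(paths))
```

To run it against a real log, pass the lines of the client's log file to find_stuck_units instead of the sample list. It only reports the symptom; it does not release the file lock or fix the client.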
ostieca
Posts: 7
Joined: Sat Jan 09, 2021 12:01 am

Re: NVIDIA GPUs stuck at Send or Clear after completing WUs

Post by ostieca »

Thank you for the reply! At least I can stop searching for the reason.