Could not connect with OpenMM work servers for 15 hours

Moderators: Site Moderators, FAHC Science Team

DmitryKo
Posts: 29
Joined: Mon Apr 13, 2020 9:22 pm

Post by DmitryKo »

I had been consistently assigned GPU work units for the last 5 days. Yesterday morning there was a string of server connection errors. The delay between retry attempts grew to as much as one hour, and I could not get a single GPU work unit for the next 15 hours.
I tried pausing and unpausing to no avail. After I restarted FAHClient, the same work servers started functioning again.

Is that the expected behaviour during high load?

I've checked the logs and there were 24 cases of 'Failed to get assignment', which seems normal to me. However, the other 28 attempts resulted in an assignment to one of a few work servers, and three of those servers dropped almost all connections with some seemingly random error.
Here is a summary table of the affected work servers and their error messages (columns: time, IP address, host name, message):

Code:

10:07:33 128.252.203.10 orkney.seas.wustl.edu "Exception: 10002: Received short response, expected 512 bytes, got 0"
18:43:07 128.252.203.10 orkney.seas.wustl.edu "Exception: 10002: Received short response, expected 512 bytes, got 0"
06:11:24 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond."
11:07:55 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because..."
13:07:33 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because..."
16:07:34 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because..."
18:07:35 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because..."
19:38:31 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because..."
21:12:32 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because..."
06:51:32 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: No connection could be made because..."
06:37:18 128.252.203.10 orkney.seas.wustl.edu "Exception: Server did not assign work unit"
19:13:43 128.252.203.10 orkney.seas.wustl.edu "Exception: Server did not assign work unit"
19:27:26 128.252.203.10 orkney.seas.wustl.edu "Exception: Transfer failed"
21:03:04 128.252.203.10 orkney.seas.wustl.edu "Receive error: 10053: An established connection was aborted by the software in your host machine."
06:22:30 128.252.203.10 orkney.seas.wustl.edu "Received short response, expected 512 bytes, got 0"
06:04:55 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"
06:07:11 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"
08:07:34 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"
09:07:33 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"
20:25:30 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"
21:00:27 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"
18:41:52 3.133.76.19 aws1.foldingathome.org "Exception: Failed to connect: A connection attempt failed because..."
19:10:33 3.133.76.19 aws1.foldingathome.org "Exception: Failed to connect: A connection attempt failed because..."
12:07:33 13.82.98.119 fah3.eastus.cloudapp.azure.com "Exception: Server did not assign work unit"
12:07:33 13.82.98.119 fah3.eastus.cloudapp.azure.com "Exception: Server did not assign work unit"
15:07:34 13.82.98.119 fah3.eastus.cloudapp.azure.com "Exception: Server did not assign work unit"
06:08:48 52.224.109.74 fah4.eastus.cloudapp.azure.com "Exception: Server did not assign work unit"
17:07:35 52.224.109.74 fah4.eastus.cloudapp.azure.com "Exception: Server did not assign work unit"
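
For anyone who wants to tally a table like this from their own log extract, here is a minimal Python sketch. It assumes every line has the layout shown above (time, IP address, host name, quoted message). The names `LINE_RE`, `tally_errors` and `sample` are made up for illustration and are not part of FAHClient.

```python
import re
from collections import Counter

# Assumed layout of one summary line: time, IP address, host name, "message".
LINE_RE = re.compile(r'^(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(\S+)\s+"(.*)"$')

PREFIX = "Exception: "


def tally_errors(lines):
    """Count (host, message) pairs; lines that do not match the layout are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        _time, _ip, host, message = match.groups()
        # Drop the "Exception: " prefix so identical errors group together.
        if message.startswith(PREFIX):
            message = message[len(PREFIX):]
        counts[(host, message)] += 1
    return counts


# Three lines copied from the table above.
sample = [
    '06:04:55 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"',
    '06:07:11 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"',
    '12:07:33 13.82.98.119 fah3.eastus.cloudapp.azure.com "Exception: Server did not assign work unit"',
]

for (host, message), count in tally_errors(sample).most_common():
    print(f"{count:3d}  {host}  {message}")
```

Feeding the whole table to `tally_errors` gives a per-server breakdown, which makes it easier to see which hosts account for most of the failures.
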
Last edited by DmitryKo on Thu May 07, 2020 8:06 pm, edited 3 times in total.
DmitryKo
Posts: 29
Joined: Mon Apr 13, 2020 9:22 pm

Re: Could not connect with OpenMM work servers for 15 hours

Post by DmitryKo »

Never mind, it looks like most of the assignments happened, by coincidence, to go to faulty work servers that are known to have been experiencing problems lately:

128.252.203.10 : orkney.seas.wustl.edu
viewtopic.php?f=18&t=35076
viewtopic.php?f=18&t=34966
viewtopic.php?f=18&t=35117

140.163.4.231 : plfah1-1.mskcc.org
viewtopic.php?f=18&t=34908
viewtopic.php?f=18&t=34116

3.133.76.19 : aws1.foldingathome.org
viewtopic.php?f=18&t=35054
viewtopic.php?f=18&t=34967