Project: 2665 (Run 2, Clone 968, Gen 104) - 3 hrs since ckpt

Posted: Mon Aug 17, 2009 6:09 pm
by anko1
This is on Big Red, which hasn't had an EUE since April (iirc, and assuming my records are accurate). The client went 3 hrs without a checkpoint being written, so it ended the WU and started a new one (which is an improvement; iirc, it used to just hang until you found it).

Code:

[20:03:05] + Closed connections
[20:03:05] 
[20:03:05] + Processing work unit
[20:03:05] Work type a1 not eligible for variable processors
[20:03:05] Core required: FahCore_a1.exe
[20:03:05] Core found.
[20:03:05] Using generic mpiexec calls
[20:03:05] Working on queue slot 09 [August 15 20:03:05 UTC]
[20:03:05] + Working ...
[20:03:05] - Calling 'mpiexec -np 4 -channel auto -host 127.0.0.1 FahCore_a1.exe -dir work/ -suffix 09 -checkpoint 15 -verbose -lifeline 2496 -version 624'

[20:03:05] 
[20:03:05] *------------------------------*
[20:03:05] Folding@Home Gromacs SMP Core
[20:03:05] Version 1.74 (March 10, 2007)
[20:03:05] 
[20:03:05] Preparing to commence simulation
[20:03:05] - Looking at optimizations...
[20:03:06] .
[20:03:10] - Starting from initial work packet
[20:03:10] 
[20:03:10] Project: 2665 (Run 2, Clone 968, Gen 104)
[20:03:10] 
[20:03:11] Assembly optimizations on if available.
[20:03:11] Entering M.D.
[20:03:33] percent)
[20:03:33] - Starting from initial work packet
[20:03:33] 8, Gen 104)
[20:03:33] 
[20:03:33] Entering M.D.
[20:03:34] e 968, Gen 104)
[20:03:34] 
[20:03:34] Entering M.D.
[20:03:40] Rejecting checkpoint
[20:03:42] Protein: HGG with glycosylations
[20:03:42] Writing local files
[20:03:51] Extra SSE boost OK.
[20:03:51] Writing local files
[20:03:51] Completed 0 out of 250000 steps  (0 percent)
[20:18:52] Timered checkpoint triggered.
[20:19:36] Writing local files
[20:19:36] Completed 2500 out of 250000 steps  (1 percent)
                      {snip}
[10:33:35] Timered checkpoint triggered.
[10:34:27] Writing local files
[10:34:27] Completed 137500 out of 250000 steps  (55 percent)
[10:49:00] - Autosending finished units... [August 16 10:49:00 UTC]
[10:49:00] Trying to send all finished work units
[10:49:00] + No unsent completed units remaining.
[10:49:00] - Autosend completed
[10:49:28] Timered checkpoint triggered.
[10:50:18] Writing local files
[10:50:19] Completed 140000 out of 250000 steps  (56 percent)
[11:05:20] Timered checkpoint triggered.
[11:06:10] Writing local files
[11:06:11] Completed 142500 out of 250000 steps  (57 percent)
[14:06:12] At least 3 hours since checkpoint written...
[14:08:12] 
[14:08:12] Folding@home Core Shutdown: EARLY_UNIT_END
[14:08:12] 
[14:08:12] Folding@home Core Shutdown: EARLY_UNIT_END
[14:08:15] CoreStatus = 7B (123)
[14:08:15] Sending work to server
[14:08:15] Project: 2665 (Run 2, Clone 968, Gen 104)


[14:08:15] + Attempting to send results [August 16 14:08:15 UTC]
[14:08:15] - Reading file work/wuresults_09.dat from core
[14:08:15]   (Read 116435 bytes from disk)
[14:08:15] Connecting to http://171.64.65.64:80/
[14:08:16] Posted data.
[14:08:16] Initial: 0000; - Uploaded at ~114 kB/s
[14:08:16] - Averaged speed for that direction ~373 kB/s
[14:08:16] + Results successfully sent
[14:08:16] Thank you for your contribution to Folding@Home.

Re: Project: 2665 (Run 2, Clone 968, Gen 104) - 3 hrs since ckpt

Posted: Mon Aug 17, 2009 6:18 pm
by MtM
http://foldingforum.org/viewtopic.php?f=19&t=11048

Same occurrence (and there are 4 more threads about the same subject).

Re: Project: 2665 (Run 2, Clone 968, Gen 104) - 3 hrs since ckpt

Posted: Mon Aug 17, 2009 6:37 pm
by anko1
Thanks for the response, MtM. Maybe I edited too much of my log. The WU was actually progressing normally until it just stopped. Probably just one of those mysterious incidents that happen now and then, but I thought I'd report it anyway. If you'd like to see more of the log, just let me know.

Re: Project: 2665 (Run 2, Clone 968, Gen 104) - 3 hrs since ckpt

Posted: Mon Aug 17, 2009 9:40 pm
by P5-133XL
Whenever I have a WU stall like that, the first thing I do is check a process manager (in Windows, that would be the Task Manager) to see if the FAHCore_xx processes are still running. If they are not, I know that folding needs to be restarted.
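
For anyone who would rather script that check than open a process manager, here is a minimal Python sketch. It is not part of the F@H client: the function names `fahcore_names` and `list_process_names` are made up for illustration, and it assumes the core shows up with a name starting with "FahCore_" (as in `FahCore_a1.exe` from the log above) and that `tasklist` (Windows) or `ps` (Linux/macOS) is available.

```python
import platform
import subprocess


def fahcore_names(process_names):
    """Return the names that look like FahCore processes (case-insensitive)."""
    return [n for n in process_names
            if n.strip().lower().startswith("fahcore_")]


def list_process_names():
    """List running process names with the platform's standard tool.

    Assumption: `tasklist` exists on Windows and `ps` elsewhere.
    """
    if platform.system() == "Windows":
        out = subprocess.run(["tasklist", "/fo", "csv", "/nh"],
                             capture_output=True, text=True,
                             check=True).stdout
        # Each CSV row starts with the quoted image name.
        return [line.split('","')[0].strip('"')
                for line in out.splitlines() if line]
    out = subprocess.run(["ps", "-A", "-o", "comm="],
                         capture_output=True, text=True, check=True).stdout
    # Some systems print a full path; keep only the last component.
    return [line.strip().rsplit("/", 1)[-1]
            for line in out.splitlines() if line.strip()]
```

Calling `fahcore_names(list_process_names())` returns an empty list when no core is running, which is the "folding needs to be restarted" case. Note that a hung core can still be listed as running, so this only catches the case where the processes are gone.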

Re: Project: 2665 (Run 2, Clone 968, Gen 104) - 3 hrs since ckpt

Posted: Mon Aug 17, 2009 9:59 pm
by anko1
Yes. What was nice about this is that the client caught the stall and restarted itself. Otherwise it would have waited for my return this morning.

Re: Project: 2665 (Run 2, Clone 968, Gen 104) - 3 hrs since ckpt

Posted: Tue Aug 18, 2009 9:30 am
by susato
MtM wrote: http://foldingforum.org/viewtopic.php?f=19&t=11048

Same occurrence (and there are 4 more threads about the same subject).
Um, this looks different from the "folding on 1 core" problem in the link you cited.

Interesting, though, that the client noticed the delay between checkpoints and ended the unit. Good eye anko1.
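
The same kind of gap can be spotted after the fact by scanning the log. The sketch below is only an illustration and says nothing about how the client itself detects the delay. It assumes that "Completed ..." lines are a usable stand-in for "a checkpoint was written", takes the 3-hour limit from the log message, and uses made-up names (`parse_line`, `gap`, `find_stalls`).

```python
import re
from datetime import timedelta

# Matches the "[HH:MM:SS] message" prefix used in the client log.
STAMP = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s*(.*)$")


def parse_line(line):
    """Return (time_of_day, message), or None for lines without a stamp."""
    m = STAMP.match(line)
    if not m:
        return None
    h, mi, s, msg = m.groups()
    return timedelta(hours=int(h), minutes=int(mi), seconds=int(s)), msg


def gap(earlier, later):
    """Elapsed time between two stamps, assuming they are under 24 h apart.

    The log has no dates, so a wrap past midnight is handled with a modulo.
    """
    return (later - earlier) % timedelta(days=1)


def find_stalls(lines, limit=timedelta(hours=3)):
    """Return (last_progress, stamp, gap) for each gap that reaches limit."""
    last = None
    stalls = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, msg = parsed
        if last is not None and gap(last, stamp) >= limit:
            stalls.append((last, stamp, gap(last, stamp)))
            last = None  # report each stall only once
        if msg.startswith("Completed "):
            last = stamp
    return stalls
```

Run on the last few lines before the EARLY_UNIT_END above, this flags the jump from 11:06:11 to 14:06:12. It will also raise a false alarm across a snipped section of a log, since the missing lines look like a long silence.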

Re: Project: 2665 (Run 2, Clone 968, Gen 104) - 3 hrs since ckpt

Posted: Tue Aug 18, 2009 9:56 pm
by MtM
susato wrote:
MtM wrote: http://foldingforum.org/viewtopic.php?f=19&t=11048

Same occurrence (and there are 4 more threads about the same subject).
Um, this looks different from the "folding on 1 core" problem in the link you cited.

Interesting, though, that the client noticed the delay between checkpoints and ended the unit. Good eye anko1.
I did not read it carefully here. The behavior wasn't really news to me, and since this thread is in the "problems with a WU" section, I guess I didn't make the connection with the client/core that got updated.

Sorry, susato and OP. Reading it back, I can't believe I missed that; I must have posted in a hurry.