Project 3065 (Run 2, Clone 257, Gen 11)

Posted: Sat Mar 07, 2009 6:01 pm
by klasseng
Running on a 3 GHz octo-core Mac Pro.
Just got this WU and it's displaying unusual behaviour:

[16:38:56] Project: 306- Starting from initial work packet
[16:38:56] 
[16:38:56] Project: 3065 (Run 2, Clone 257, Gen 11)
[16:38:56] 
[16:38:56] Entering M.D.
NNODES=4, MYRANK=2, HOSTNAME=8Core.local
NNODES=4, MYRANK=0, HOSTNAME=8Core.local
NNODES=4, MYRANK=1, HOSTNAME=8Core.local
NNODES=4, MYRANK=3, HOSTNAME=8Core.local
NODEID=3 argc=15
NODEID=0 argc=15
NODEID=1 argc=15
NODEID=2 argc=15
      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2004, The GROMACS development team,
            check out http://www.gromacs.org for more information.

        This inclusion of Gromacs code in the Folding@Home Core is under
        a special license (see http://folding.stanford.edu/gromacs.html)
         specially granted to Stanford by the copyright holders. If you
          are interested in using Gromacs, visit www.gromacs.org where
                you can download a free version of Gromacs under
         the terms of the GNU General Public License (GPL) as published
       by the Free Software Foundation; either version 2 of the License,
                     or (at your option) any later version.

[16:39:02] cpfilenamepfilename: 
[16:39:02] Rejecting checkpoint
starting mdrun '66728 p3065_lambda5_99sb_big'
2500000 steps,   5000.0 ps.

[16:39:03] Protein: 66728 p3065_lambda5Extra SSE boost OK.
[16:39:03] oost OK.
[16:39:03] 
[16:39:03] Extra SSE boost OK.
[16:39:03] Writing local files
[16:39:03] Completed 0 out of 2500000 steps  (0 percent)
[16:59:03] Timered checkpoint triggered.
[17:19:03] Timered checkpoint triggered.
[17:39:05] Timered checkpoint triggered.
unitinfo.txt reports that it's still at 0% after an hour.

Activity Monitor shows that it is only sporadically using CPU time: the four instances of FahCore_a1.exe are jumping between 0% and 25% CPU utilization (most cores run steadily at around 95% to 96%). There's lots of idle time on the system.
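
For anyone who wants to catch this kind of stall without watching the log by hand, here is a rough sketch in Python. It assumes the log format shown in the excerpt above (`[HH:MM:SS]` timestamps and `Completed N out of M steps` progress lines). The function names and the 30-minute threshold are hypothetical choices for illustration, not part of the Folding@Home client.

```python
import re
from datetime import datetime, timedelta

# Matches lines like "[16:39:03] Completed 0 out of 2500000 steps  (0 percent)"
PROGRESS = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] Completed (\d+) out of (\d+) steps")
# Matches any timestamped line, e.g. "[16:59:03] Timered checkpoint triggered."
STAMPED = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]")

# A few lines copied from the log in this thread.
SAMPLE = [
    "[16:39:03] Writing local files",
    "[16:39:03] Completed 0 out of 2500000 steps  (0 percent)",
    "[16:59:03] Timered checkpoint triggered.",
    "[17:19:03] Timered checkpoint triggered.",
    "[17:39:05] Timered checkpoint triggered.",
]


def _parse(stamp):
    """Turn an HH:MM:SS string into a datetime (the date part is a dummy)."""
    return datetime.strptime(stamp, "%H:%M:%S")


def time_since_progress(log_lines):
    """Return (steps, elapsed): the step count on the last progress line and
    the time from that line to the last timestamped line in the log.
    Returns (None, None) if the log has no progress line at all."""
    last_progress_time = None
    last_steps = None
    last_seen = None
    for line in log_lines:
        stamped = STAMPED.match(line)
        if not stamped:
            continue  # untimestamped output from the core, ignore it
        last_seen = _parse(stamped.group(1))
        progress = PROGRESS.match(line)
        if progress:
            last_progress_time = last_seen
            last_steps = int(progress.group(2))
    if last_progress_time is None:
        return None, None
    elapsed = last_seen - last_progress_time
    if elapsed < timedelta(0):
        elapsed += timedelta(days=1)  # assume the log crossed midnight once
    return last_steps, elapsed


def looks_stalled(log_lines, threshold=timedelta(minutes=30)):
    """True if no new progress line appeared within `threshold` (hypothetical default)."""
    _, elapsed = time_since_progress(log_lines)
    return elapsed is not None and elapsed >= threshold


if __name__ == "__main__":
    steps, elapsed = time_since_progress(SAMPLE)
    print(steps, elapsed, looks_stalled(SAMPLE))  # prints: 0 1:00:02 True
```

A sensible threshold depends on how long one frame normally takes on the machine in question, so the 30 minutes here is only a placeholder.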

Re: Project 3065 (Run 2, Clone 257, Gen 11)

Posted: Sat Mar 07, 2009 6:14 pm
by susato
Hi Grant - Thanks for reporting this strange problem. Funny, I have a p3064 (r3, c192, g21) on a Mini, which was doing something similar this morning. It just quit working some time after the 13th frame, while still in 'running' status. The cores and client were all visible in Activity Monitor but using 0% CPU. I stopped the WU (cores and client all quit immediately) and restarted it. It came back up smoothly and is now cranking along well, though it will probably miss the deadline after spending around 24 hours stalled.

Try stopping and restarting yours, monitoring the client and core utilization in Activity Monitor as you do so. I'd be interested to see whether the simple restart is enough to get your work unit back on track.

No one else has turned in results, partial or complete, for this unit, and the preceding generation was submitted for credit early this morning, meaning that you are the first recipient for this generation of the unit.

Re: Project 3065 (Run 2, Clone 257, Gen 11)

Posted: Sun Mar 08, 2009 12:30 am
by klasseng
I didn't leave the WU running at reduced output for very long; there was no way it was going to complete on time at that rate.

So, as you suggested, I stopped the WU and restarted it. It started up at full speed and has now completed 31%.

Will let you know how it ends.

Re: Project 3065 (Run 2, Clone 257, Gen 11)

Posted: Sun Mar 08, 2009 6:36 pm
by klasseng
Seemed to complete and upload the result OK.