
Project: 2671 (Run 15, Clone 16, Gen 70) INT CoreStatus = 66

Posted: Tue Jul 21, 2009 12:43 pm
by Foxbat
This Work Unit will not start on my Mac Pro (2x dual-core 2.66 GHz Xeons, 10 GB RAM) running OS X 10.4.11. The log shows that it writes out the 0% message, then exits immediately:

Code:

--- Opening Log file [July 21 12:21:28 UTC] 


# Mac OS X SMP Console Edition ################################################
###############################################################################

                       Folding@Home Client Version 6.24R1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/Foxbat/Library/FAH-SMP-Term1
Executable: /Users/Foxbat/Library/FAH-SMP-Term1/fah6
Arguments: -local -advmethods -forceasm -verbosity 9 -smp 

[12:21:28] - Ask before connecting: No
[12:21:28] - User name: Foxbat (Team 55236)
[12:21:28] - User ID: 3DA6459B38FDAE1E
[12:21:28] - Machine ID: 1
[12:21:28] 
[12:21:28] Loaded queue successfully.
[12:21:28] - Autosending finished units... [July 21 12:21:28 UTC]
[12:21:28] Trying to send all finished work units
[12:21:28] + No unsent completed units remaining.
[12:21:28] - Autosend completed
[12:21:28] 
[12:21:28] + Processing work unit
[12:21:28] At least 4 processors must be requested; read 1.
[12:21:28] Core required: FahCore_a2.exe
[12:21:28] Core found.
[12:21:28] - Using generic ./mpiexec
[12:21:28] Working on queue slot 09 [July 21 12:21:28 UTC]
[12:21:28] + Working ...
[12:21:28] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 09 -priority 96 -checkpoint 8 -forceasm -verbose -lifeline 12841 -version 624'

[12:21:28] 
[12:21:28] *------------------------------*
[12:21:28] Folding@Home Gromacs SMP Core
[12:21:28] Version 2.07 (Sun Apr 19 14:29:51 PDT 2009)
[12:21:28] 
[12:21:28] Preparing to commence simulation
[12:21:28] - Assembly optimizations manually forced on.
[12:21:28] - Not checking prior termination.
[12:21:29] - Expanded 4838439 -> 24041233 (decompressed 496.8 percent)
[12:21:30] Called DecompressByteArray: compressed_data_size=4838439 data_size=24041233, decompressed_data_size=24041233 diff=0
[12:21:30] - Digital signature verified
[12:21:30] 
[12:21:30] Project: 2671 (Run 15, Clone 16, Gen 70)
[12:21:30] 
[12:21:30] Assembly optimizations on if available.
[12:21:30] Entering M.D.
[12:21:39] Completed 0 out of 250000 steps  (0%)
[12:21:40] 
[12:21:40] Folding@home Core Shutdown: INTERRUPTED
[12:21:44] CoreStatus = 66 (102)
[12:21:44] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[12:21:44] Killing all core threads

Folding@Home Client Shutdown.
I tried restarting this WU two or three times, each time with the same result. I blew it away and am now Folding on a P2677 WU.

Re: Project: 2671 (Run 15, Clone 16, Gen 70) INT CoreStatus = 66

Posted: Wed Jul 22, 2009 1:28 am
by Foxbat
Lucky me, got it again! After it died a few times, I tried running it from the terminal to see if I got any more information:

Code:

FoxMacPro266:~/Library/FAH-SMP-Term1 Foxbat$ ./mac_qfix
entry 3, status 0, address 0.0.0.0
entry 4, status 0, address 0.0.0.0
entry 5, status 0, address 0.0.0.0
entry 6, status 0, address 0.0.0.0
entry 7, status 0, address 0.0.0.0
entry 8, status 0, address 0.0.0.0
entry 9, status 0, address 0.0.0.0
entry 0, status 0, address 0.0.0.0
entry 1, status 0, address 171.64.65.56:8080
entry 2, status 1, address 171.67.108.24:8080
File is OK
FoxMacPro266:~/Library/FAH-SMP-Term1 Foxbat$ ./fah6 -smp 4 -verbosity 9 -local -advmethods -forceasm
Using local directory for configuration

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

Using local directory for work files
4 cores detected


--- Opening Log file [July 22 01:31:13 UTC] 


# Mac OS X SMP Console Edition ################################################
###############################################################################

                       Folding@Home Client Version 6.24R1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/Foxbat/Library/FAH-SMP-Term1
Executable: ./fah6
Arguments: -smp 4 -verbosity 9 -local -advmethods -forceasm 

[01:31:13] - Ask before connecting: No
[01:31:13] - User name: Foxbat (Team 55236)
[01:31:13] - User ID: 3DA6459B38FDAE1E
[01:31:13] - Machine ID: 1
[01:31:13] 
[01:31:13] Loaded queue successfully.
[01:31:13] 
[01:31:13] - Autosending finished units... [July 22 01:31:13 UTC]
[01:31:13] + Processing work unit
[01:31:13] Trying to send all finished work units
[01:31:13] Core required: FahCore_a2.exe
[01:31:13] + No unsent completed units remaining.
[01:31:13] - Autosend completed
[01:31:13] Core found.
[01:31:13] - Using generic ./mpiexec
[01:31:13] Working on queue slot 02 [July 22 01:31:13 UTC]
[01:31:13] + Working ...
[01:31:13] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -priority 96 -checkpoint 8 -forceasm -verbose -lifeline 13168 -version 624'

[01:31:13] 
[01:31:13] *------------------------------*
[01:31:13] Folding@Home Gromacs SMP Core
[01:31:13] Version 2.07 (Sun Apr 19 14:29:51 PDT 2009)
[01:31:13] 
[01:31:13] Preparing to commence simulation
[01:31:13] - Ensuring status. Please wait.
[01:31:14] Called DecompressByteArray: compressed_data_size=4838439 data_size=24041233, decompressed_data_size=24041233 diff=0
[01:31:15] - Digital signature verified
[01:31:15] 
[01:31:15] Project: 2671 (Run 15, Clone 16, Gen 70)
[01:31:15] 
[01:31:15] Assembly optimizations on if available.
[01:31:15] Entering M.D.
[01:31:24]  on if available.
[01:31:24] Entering M.D.
NNODES=4, MYRANK=2, HOSTNAME=FoxMacPro266.local
NNODES=4, MYRANK=0, HOSTNAME=FoxMacPro266.local
NNODES=4, MYRANK=1, HOSTNAME=FoxMacPro266.local
NNODES=4, MYRANK=3, HOSTNAME=FoxMacPro266.local
NODEID=0 argc=20
NODEID=2 argc=20
                         :-)  G  R  O  M  A  C  S  (-:

NODEID=1 argc=20
                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

NODEID=3 argc=20
Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22884 system in water'
17750002 steps,  35500.0 ps (continuing from step 17500002,  35000.0 ps).
[01:31:34] Completed 0 out of 250000 steps  (0%)
[01:31:34] 
[01:31:34] Folding@home Core Shutdown: INTERRUPTED
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 102) - process 0
[cli_1]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
[0]0:Return code = 102
[0]1:Return code = 1
[0]2:Return code = 0, signaled with Segmentation fault
[0]3:Return code = 0, signaled with Quit
[01:31:38] CoreStatus = 66 (102)
[01:31:38] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[01:31:38] Killing all core threads

Folding@Home Client Shutdown.
FoxMacPro266:~/Library/FAH-SMP-Term1 Foxbat$

Re: Project: 2671 (Run 15, Clone 16, Gen 70) INT CoreStatus = 66

Posted: Wed Jul 22, 2009 7:47 am
by parkut
Got this same WU on two different machines. Both machines appeared to have "hung", with system load dropping to zero, so my watchdog script attempted to restart the WU, whereupon it failed immediately. Repeated attempts to restart the WU all failed immediately with the same error. After trying to restart six or more times, I deleted queue.dat and the entire work folder contents, restarted FAH, and was assigned a different WU. (The exact cleanup commands are at the end of this post.)


[07:34:30] CoreStatus = 66 (102)
[07:34:30] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[07:34:30] Killing all core threads

Folding@Home Client Shutdown.
...
model name : Intel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
cpu MHz : 2508.429
cache size : 2048 KB
Memory: 1.96 GB physical, 1.94 GB virtual

Current Work Unit
-----------------
Name: p2671_IBX in water
Tag: P2671R15C16G70
Download time: July 21 20:36:23
Due time: July 24 20:36:23
Progress: 0% [__________]
...
Project: 2671 (Run 15, Clone 16, Gen 70) 1920.00 pts


NNODES=4, MYRANK=1, HOSTNAME=conroe5.parkut.com
NNODES=4, MYRANK=0, HOSTNAME=conroe5.parkut.com
NNODES=4, MYRANK=2, HOSTNAME=conroe5.parkut.com
NNODES=4, MYRANK=3, HOSTNAME=conroe5.parkut.com
NODEID=0 argc=22
NODEID=1 argc=22
NODEID=2 argc=22
NODEID=3 argc=22
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090425  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 65

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22884 system in water'
17750002 steps, 35500.0 ps (continuing from step 17500002, 35000.0 ps).

t = 35000.006 ps: Water molecule starting at atom 69652 can not be settled.
Check for bad contacts and/or reduce the timestep.
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 102) - process 0
[cli_1]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
[cli_3]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
[0]0:Return code = 102
[0]1:Return code = 1
[0]2:Return code = 0, signaled with Segmentation fault
[0]3:Return code = 1
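
For reference, the cleanup itself was just a couple of commands, run after stopping the client (the install directory and the restart flags below are only examples -- use your own paths and your usual options):

Code:

# stop fah6 (and any FahCore_a2.exe still running) before doing this
cd /opt/fah                # example install directory -- use your own
rm -f queue.dat            # drop the client's record of the stuck WU
rm -rf work/*              # clear the downloaded WU data and its checkpoints
./fah6 -smp -verbosity 9   # restart with your usual flags; a new WU is fetched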

Re: Project: 2671 (Run 15, Clone 16, Gen 70) INT CoreStatus = 66

Posted: Wed Jul 22, 2009 8:22 am
by bruce
parkut wrote: Got this same WU on two different machines. Both machines appeared to have "hung", with system load dropping to zero, so my watchdog script attempted to restart the WU, whereupon it failed immediately. Repeated attempts to restart the WU all failed immediately with the same error. After trying to restart six or more times, I deleted queue.dat and the entire work folder contents, restarted FAH, and was assigned a different WU.
The first time I read that, I understood something different from what you actually meant. There are two ways to interpret "restart the WU": I assumed you meant restarting from the beginning, but what you actually did was restart from the last checkpoint.

Some errors are accompanied by a message to the effect of "this simulation has reached a point from which processing cannot be continued," which describes the situation you've encountered. Trying repeatedly to resume from a point after the protein has gotten itself into an impossible configuration is futile. What we do not know is whether restarting the WU from the beginning would have reached the same impossible configuration. I suggest you modify your script to try both options rather than the same option six times.
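
To make that concrete, here is a rough, untested sketch of the kind of escalation I mean; the install directory, flags, retry count, and the 60-second threshold are all placeholders. It doesn't attempt a true restart from step zero -- the fallback here is simply to dump the WU (queue.dat plus the work folder) and fetch a fresh one, which is what you ended up doing by hand. The point is only that the fallback should differ from the first attempt rather than repeating the same resume six times:

Code:

#!/bin/sh
# Escalating watchdog sketch: let the client resume from its checkpoint a
# couple of times; if the core dies within a minute every time, stop
# resuming, dump the stuck WU (queue.dat + work/), and fetch a fresh one.
FAHDIR=/opt/fah      # placeholder install directory
MAX_RESUMES=2        # placeholder retry count

cd "$FAHDIR" || exit 1

tries=0
while [ "$tries" -lt "$MAX_RESUMES" ]; do
    start=$(date +%s)
    ./fah6 -smp -verbosity 9           # blocks; the core resumes from its last checkpoint
    elapsed=$(( $(date +%s) - start ))
    [ "$elapsed" -gt 60 ] && exit 0    # ran for a while, so not an instant failure
    tries=$((tries + 1))
done

# Every resume died almost immediately: give up on this WU and start clean.
rm -f queue.dat
rm -rf work/*
exec ./fah6 -smp -verbosity 9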

With the added information from a second machine, though, the answer to a key question would be helpful: did both machines hang at the same point in the simulation, or at different points?

Re: Project: 2671 (Run 15, Clone 16, Gen 70) INT CoreStatus = 66

Posted: Wed Jul 22, 2009 8:42 am
by parkut
The previous WU completed normally. This WU was assigned, failed immediately, and FAH exited, with zero progress recorded in either the normal logfile or unitinfo.txt.

Multiple attempts to restart the FAH client on both machines yielded the same result: immediate failure and FAH exiting.

I don't normally look at the error.log file, but I note that both of my Linux machines and the first poster's OS X machine have the same entry:

starting mdrun '22884 system in water'
17750002 steps, 35500.0 ps (continuing from step 17500002, 35000.0 ps).

Re: Project: 2671 (Run 15, Clone 16, Gen 70) INT CoreStatus = 66

Posted: Wed Jul 22, 2009 6:06 pm
by parkut
Project: 2671 (Run 15, Clone 16, Gen 70) CoreStatus = 66 (102)

The machine was assigned this WU again, and again it failed immediately.

Code:

[17:54:13] Thank you for your contribution to Folding@Home.
[17:54:13] + Number of Units Completed: 716

[17:54:14] - Warning: Could not delete all work unit files (1): Core file absent
[17:54:14] Trying to send all finished work units
[17:54:14] + No unsent completed units remaining.
[17:54:14] - Preparing to get new work unit...
[17:54:14] Cleaning up work directory
[17:54:14] + Attempting to get work packet
[17:54:14] - Will indicate memory of 2002 MB
[17:54:14] - Connecting to assignment server
[17:54:14] Connecting to http://assign.stanford.edu:8080/
[17:54:14] Posted data.
[17:54:14] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[17:54:14] + News From Folding@Home: Welcome to Folding@Home
[17:54:14] Loaded queue successfully.
[17:54:14] Connecting to http://171.67.108.24:8080/
[17:54:21] Posted data.
[17:54:21] Initial: 0000; - Receiving payload (expected size: 4838951)
[17:54:37] - Downloaded at ~295 kB/s
[17:54:37] - Averaged speed for that direction ~316 kB/s
[17:54:37] + Received work.
[17:54:37] Trying to send all finished work units
[17:54:37] + No unsent completed units remaining.
[17:54:37] + Closed connections
[17:54:37] 
[17:54:37] + Processing work unit
[17:54:37] At least 4 processors must be requested; read 1.
[17:54:37] Core required: FahCore_a2.exe
[17:54:37] Core found.
[17:54:37] Working on queue slot 02 [July 22 17:54:37 UTC]
[17:54:37] + Working ...
[17:54:37] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 02 -checkpoint 15 -verbose -lifeline 3012 -version 624'

[17:54:37] 
[17:54:37] *------------------------------*
[17:54:37] Folding@Home Gromacs SMP Core
[17:54:37] Version 2.08 (Mon May 18 14:47:42 PDT 2009)
[17:54:37] 
[17:54:37] Preparing to commence simulation
[17:54:37] - Ensuring status. Please wait.
[17:54:38] Called DecompressByteArray: compressed_data_size=4838439 data_size=24041233, decompressed_data_size=24041233 diff=0
[17:54:38] - Digital signature verified
[17:54:38] 
[17:54:38] Project: 2671 (Run 15, Clone 16, Gen 70)
[17:54:38] 
[17:54:38] Assembly optimizations on if available.
[17:54:38] Entering M.D.
[17:54:48] Run 15, Clone 16, Gen 70)
[17:54:48] 
[17:54:48] Entering M.D.
[17:54:56] lding@home Core Shutdown: INTERRUPTED
[17:55:00] CoreStatus = 66 (102)
[17:55:00] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[17:55:00] Killing all core threads

Folding@Home Client Shutdown.

Code:

Error encountered before initializing MPICH
NNODES=4, MYRANK=0, HOSTNAME=conroe5.parkut.com
NNODES=4, MYRANK=1, HOSTNAME=conroe5.parkut.com
NNODES=4, MYRANK=2, HOSTNAME=conroe5.parkut.com
NNODES=4, MYRANK=3, HOSTNAME=conroe5.parkut.com
NODEID=0 argc=22
NODEID=2 argc=22
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090425  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
NODEID=1 argc=22
NODEID=3 argc=22
Note: tpx file_version 48, software version 65

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22884 system in water'
17750002 steps,  35500.0 ps (continuing from step 17500002,  35000.0 ps).

t = 35000.006 ps: Water molecule starting at atom 69652 can not be settled.
Check for bad contacts and/or reduce the timestep.
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 102) - process 0
[cli_1]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
[cli_3]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
[0]0:Return code = 102
[0]1:Return code = 1
[0]2:Return code = 0, signaled with Segmentation fault
[0]3:Return code = 1

Re: Project: 2671 (Run 15, Clone 16, Gen 70) INT CoreStatus = 66

Posted: Wed Jul 22, 2009 7:59 pm
by kasson
Thanks--I terminated this one. (Standalone gromacs segfaults on this, which is probably what's happening to the core as well.)

Re: Project: 2671 (Run 15, Clone 16, Gen 70) INT CoreStatus = 66

Posted: Wed Jul 22, 2009 8:52 pm
by Foxbat
kasson, good to know. At 14:00 GMT, my Mac got assigned this WU one last time and has been sitting here waiting for me to fix it. I'm blowing away queue.dat and the Work folder and restarting FAH.