Project: 6012 (Run 0, Clone 235, Gen 30)
Posted: Sat Feb 20, 2010 5:44 pm
Had a power failure, and the UPS initiated a shutdown that stopped fah6 at 16:07:07. After power came back and the computer was restarted, fah6 starts OK, but the core errors out and just stops at the last point in the log below. Stopping and restarting fah6 gives the same result. The error does not terminate all cores: one core process stays alive (at 0% CPU), as you can see in the top extract below.
I am guessing this WU is lost and I should just issue a "fah6 -delete 01" and restart with a new WU.
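Before deleting the slot it may be worth keeping a copy of the work directory, in case the checkpoint file is useful to someone for diagnosis. A minimal sketch in Python, assuming the client has been stopped first and that the work files live in `work/` under the launch directory shown in the log (`/home/folder/fah6-1`). The function name `backup_work_dir` and the `work-backup-...` directory name are hypothetical and not part of fah6.

```python
import shutil
import time
from pathlib import Path


def backup_work_dir(launch_dir, stamp=None):
    """Copy <launch_dir>/work to a sibling backup directory and return its path."""
    launch = Path(launch_dir)
    work = launch / "work"
    if not work.is_dir():
        raise FileNotFoundError(f"no work directory under {launch}")
    # Default to a timestamp so repeated backups do not collide.
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    dest = launch / f"work-backup-{stamp}"
    # copytree refuses to write into an existing directory,
    # so an earlier backup is never overwritten.
    shutil.copytree(work, dest)
    return dest
```

Calling `backup_work_dir("/home/folder/fah6-1")` would leave the original `work/` untouched, after which `fah6 -delete 01` can be issued as planned.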
Code:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3427 folder 20 0 25936 504 376 S 0.0 0.0 0:00.00 fah6
3431 folder 39 19 123m 27m 2160 S 0.0 0.7 0:01.29 FahCore_a3.exe
Code:
[09:34:09] - Preparing to get new work unit...
[09:34:09] Cleaning up work directory
[09:34:09] + Attempting to get work packet
[09:34:09] Passkey found
[09:34:09] - Will indicate memory of 3816 MB
[09:34:09] - Connecting to assignment server
[09:34:09] Connecting to http://assign.stanford.edu:8080/
[09:34:13] Posted data.
[09:34:13] Initial: ED82; - Successful: assigned to (130.237.232.140).
[09:34:13] + News From Folding@Home: Welcome to Folding@Home
[09:34:13] Loaded queue successfully.
[09:34:13] Connecting to http://130.237.232.140:8080/
[09:34:17] Posted data.
[09:34:17] Initial: 0000; - Receiving payload (expected size: 1798241)
[09:35:01] - Downloaded at ~39 kB/s
[09:35:01] - Averaged speed for that direction ~46 kB/s
[09:35:01] + Received work.
[09:35:01] Trying to send all finished work units
[09:35:01] + No unsent completed units remaining.
[09:35:01] + Closed connections
[09:35:01]
[09:35:01] + Processing work unit
[09:35:01] Core required: FahCore_a3.exe
[09:35:01] Core found.
[09:35:01] Working on queue slot 01 [February 20 09:35:01 UTC]
[09:35:01] + Working ...
[09:35:01] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 2 -checkpoint 15 -verbose -lifeline 3414 -version 629'
[09:35:01]
[09:35:01] *------------------------------*
[09:35:01] Folding@Home Gromacs SMP Core
[09:35:01] Version 2.13 (Tue Aug 18 18:28:33 CEST 2009)
[09:35:01]
[09:35:01] Preparing to commence simulation
[09:35:01] - Looking at optimizations...
[09:35:01] - Created dyn
[09:35:01] - Files status OK
[09:35:01] - Expanded 1797729 -> 2078149 (decompressed 115.5 percent)
[09:35:01] Called DecompressByteArray: compressed_data_size=1797729 data_size=2078149, decompressed_data_size=2078149 diff=0
[09:35:01] - Digital signature verified
[09:35:01]
[09:35:01] Project: 6012 (Run 0, Clone 235, Gen 30)
[09:35:01]
[09:35:01] Assembly optimizations on if available.
[09:35:01] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_01.tpr, VERSION 4.0.99_development_20090605 (single precision)
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
15500001 steps, 31000.0 ps (continuing from step 15000001, 30000.0 ps).
[09:35:08] Completed 0 out of 500000 steps (0%)
[10:02:55] Completed 5000 out of 500000 steps (1%)
[10:29:48] Completed 10000 out of 500000 steps (2%)
[10:59:29] Completed 15000 out of 500000 steps (3%)
[11:28:36] Completed 20000 out of 500000 steps (4%)
[12:04:29] Completed 25000 out of 500000 steps (5%)
[12:38:42] Completed 30000 out of 500000 steps (6%)
[13:04:54] Completed 35000 out of 500000 steps (7%)
[13:08:44] - Autosending finished units... [February 20 13:08:44 UTC]
[13:08:44] Trying to send all finished work units
[13:08:44] + No unsent completed units remaining.
[13:08:44] - Autosend completed
[13:33:14] Completed 40000 out of 500000 steps (8%)
[13:58:53] Completed 45000 out of 500000 steps (9%)
[14:27:34] Completed 50000 out of 500000 steps (10%)
[14:57:34] Completed 55000 out of 500000 steps (11%)
[15:21:43] Completed 60000 out of 500000 steps (12%)
[15:43:18] Completed 65000 out of 500000 steps (13%)
[16:04:54] Completed 70000 out of 500000 steps (14%)
[16:07:07] ***** Got a SIGTERM signal (15)
[16:07:07] Killing all core threads
Folding@Home Client Shutdown.
Received the TERM signal, stopping at the next step
Received the TERM signal, stopping at the next step
Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.
2 cores detected
--- Opening Log file [February 20 16:27:19 UTC]
# Linux SMP Console Edition ###################################################
###############################################################################
Folding@Home Client Version 6.29
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: /home/folder/fah6-1
Executable: ./fah6
Arguments: -verbosity 9 -smp
[16:27:19] - Ask before connecting: No
[16:27:19] - User name: chrisretusn (Team 2291)
[16:27:19] - User ID: 83719AB3EDA1FA2
[16:27:19] - Machine ID: 1
[16:27:19]
[16:27:20] Loaded queue successfully.
[16:27:20] - Autosending finished units... [February 20 16:27:20 UTC]
[16:27:20] Trying to send all finished work units
[16:27:20] + No unsent completed units remaining.
[16:27:20] - Autosend completed
[16:27:20]
[16:27:20] + Processing work unit
[16:27:20] Core required: FahCore_a3.exe
[16:27:20] Core found.
[16:27:20] Working on queue slot 01 [February 20 16:27:20 UTC]
[16:27:20] + Working ...
[16:27:20] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 2 -checkpoint 15 -verbose -lifeline 3224 -version 629'
[16:27:20]
[16:27:20] *------------------------------*
[16:27:20] Folding@Home Gromacs SMP Core
[16:27:20] Version 2.13 (Tue Aug 18 18:28:33 CEST 2009)
[16:27:20]
[16:27:20] Preparing to commence simulation
[16:27:20] - Looking at optimizations...
[16:27:21] - Files status OK
[16:27:21] - Expanded 1797729 -> 2078149 (decompressed 115.5 percent)
[16:27:21] Called DecompressByteArray: compressed_data_size=1797729 data_size=2078149, decompressed_data_size=2078149 diff=0
[16:27:21] - Digital signature verified
[16:27:21]
[16:27:21] Project: 6012 (Run 0, Clone 235, Gen 30)
[16:27:21]
[16:27:21] Assembly optimizations on if available.
[16:27:21] Entering M.D.
[16:27:27] Using Gromacs checkpoints
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_01.tpr, VERSION 4.0.99_development_20090605 (single precision)
Reading checkpoint file work/wudata_01.cpt generated: Sun Feb 21 00:07:08 2010
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
15500001 steps, 31000.0 ps (continuing from step 15070512, 30141.0 ps).
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090927
Source code file: md.c, line: 1419
Fatal error:
Checkpoint error on step 0
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
: No such process
Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 2
gcq#0: Thanx for Using GROMACS - Have a Nice Day
tMPI_Abort called wiht error code -1 on thread 1
[16:27:31] fcSaveRestoreState: I/O failed dir=0, var=000000004265ABF0, varsize=20
[16:27:31] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[16:27:31] Resuming from checkpoint
[16:27:31] fcSaveRestoreState: I/O failed dir=0, var=0000000040E52E90, varsize=20
[16:27:31] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
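One thing the two runs in the log allow is a rough check of when the checkpoint was written. The first run continues from step 15000001 and the restart from step 15070512, so the checkpoint holds 70511 steps of this WU. The sketch below extrapolates from the last progress interval to the SIGTERM time to see whether that matches. It is only an estimate under stated assumptions: all timestamps fall on the same day, the step rate stays constant after the last progress line, and `checkpoint_vs_shutdown` is a hypothetical helper name, not part of the client or core.

```python
import re
from datetime import datetime

# A few lines copied from the log above; the full log text can be passed instead.
SAMPLE = """\
15500001 steps, 31000.0 ps (continuing from step 15000001, 30000.0 ps).
[15:43:18] Completed 65000 out of 500000 steps (13%)
[16:04:54] Completed 70000 out of 500000 steps (14%)
[16:07:07] ***** Got a SIGTERM signal (15)
15500001 steps, 31000.0 ps (continuing from step 15070512, 30141.0 ps).
"""

STAMP = r"\[(\d\d:\d\d:\d\d)\]"
PROGRESS = re.compile(STAMP + r" Completed (\d+) out of \d+ steps")
SIGTERM = re.compile(STAMP + r" .*Got a SIGTERM signal")
RESUME = re.compile(r"continuing from step (\d+)")


def _clock(text):
    """Parse an HH:MM:SS log timestamp (date is ignored)."""
    return datetime.strptime(text, "%H:%M:%S")


def checkpoint_vs_shutdown(log_text):
    """Return (steps_in_checkpoint, estimated_steps_at_sigterm)."""
    progress, resumes, sigterm = [], [], None
    for line in log_text.splitlines():
        m = PROGRESS.search(line)
        if m and sigterm is None:  # only progress before the shutdown
            progress.append((_clock(m.group(1)), int(m.group(2))))
            continue
        m = SIGTERM.search(line)
        if m and sigterm is None:
            sigterm = _clock(m.group(1))
            continue
        m = RESUME.search(line)
        if m:
            resumes.append(int(m.group(1)))
    if len(progress) < 2 or len(resumes) < 2 or sigterm is None:
        raise ValueError("log lacks progress, SIGTERM, or resume lines")
    # Steps of this WU contained in the checkpoint used for the restart.
    in_checkpoint = resumes[-1] - resumes[0]
    # Extrapolate from the last progress interval to the SIGTERM time.
    (t0, s0), (t1, s1) = progress[-2], progress[-1]
    rate = (s1 - s0) / (t1 - t0).total_seconds()  # steps per second
    estimated = s1 + rate * (sigterm - t1).total_seconds()
    return in_checkpoint, round(estimated)
```

For the excerpt, `checkpoint_vs_shutdown(SAMPLE)` returns `(70511, 70513)`. The two numbers are nearly equal, which suggests the checkpoint was written during the shutdown itself rather than at a regular 15 minute interval. That would be consistent with a checkpoint that was not completely flushed to disk, though the log alone does not prove it.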