I am guessing this WU is lost and that I should just run "fah6 -delete 01" and restart with a new WU.
Code:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3427 folder 20 0 25936 504 376 S 0.0 0.0 0:00.00 fah6
3431 folder 39 19 123m 27m 2160 S 0.0 0.7 0:01.29 FahCore_a3.exe
Code:
[09:34:09] - Preparing to get new work unit...
[09:34:09] Cleaning up work directory
[09:34:09] + Attempting to get work packet
[09:34:09] Passkey found
[09:34:09] - Will indicate memory of 3816 MB
[09:34:09] - Connecting to assignment server
[09:34:09] Connecting to http://assign.stanford.edu:8080/
[09:34:13] Posted data.
[09:34:13] Initial: ED82; - Successful: assigned to (130.237.232.140).
[09:34:13] + News From Folding@Home: Welcome to Folding@Home
[09:34:13] Loaded queue successfully.
[09:34:13] Connecting to http://130.237.232.140:8080/
[09:34:17] Posted data.
[09:34:17] Initial: 0000; - Receiving payload (expected size: 1798241)
[09:35:01] - Downloaded at ~39 kB/s
[09:35:01] - Averaged speed for that direction ~46 kB/s
[09:35:01] + Received work.
[09:35:01] Trying to send all finished work units
[09:35:01] + No unsent completed units remaining.
[09:35:01] + Closed connections
[09:35:01]
[09:35:01] + Processing work unit
[09:35:01] Core required: FahCore_a3.exe
[09:35:01] Core found.
[09:35:01] Working on queue slot 01 [February 20 09:35:01 UTC]
[09:35:01] + Working ...
[09:35:01] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 2 -checkpoint 15 -verbose -lifeline 3414 -version 629'
[09:35:01]
[09:35:01] *------------------------------*
[09:35:01] Folding@Home Gromacs SMP Core
[09:35:01] Version 2.13 (Tue Aug 18 18:28:33 CEST 2009)
[09:35:01]
[09:35:01] Preparing to commence simulation
[09:35:01] - Looking at optimizations...
[09:35:01] - Created dyn
[09:35:01] - Files status OK
[09:35:01] - Expanded 1797729 -> 2078149 (decompressed 115.5 percent)
[09:35:01] Called DecompressByteArray: compressed_data_size=1797729 data_size=2078149, decompressed_data_size=2078149 diff=0
[09:35:01] - Digital signature verified
[09:35:01]
[09:35:01] Project: 6012 (Run 0, Clone 235, Gen 30)
[09:35:01]
[09:35:01] Assembly optimizations on if available.
[09:35:01] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_01.tpr, VERSION 4.0.99_development_20090605 (single precision)
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
15500001 steps, 31000.0 ps (continuing from step 15000001, 30000.0 ps).
[09:35:08] Completed 0 out of 500000 steps (0%)
[10:02:55] Completed 5000 out of 500000 steps (1%)
[10:29:48] Completed 10000 out of 500000 steps (2%)
[10:59:29] Completed 15000 out of 500000 steps (3%)
[11:28:36] Completed 20000 out of 500000 steps (4%)
[12:04:29] Completed 25000 out of 500000 steps (5%)
[12:38:42] Completed 30000 out of 500000 steps (6%)
[13:04:54] Completed 35000 out of 500000 steps (7%)
[13:08:44] - Autosending finished units... [February 20 13:08:44 UTC]
[13:08:44] Trying to send all finished work units
[13:08:44] + No unsent completed units remaining.
[13:08:44] - Autosend completed
[13:33:14] Completed 40000 out of 500000 steps (8%)
[13:58:53] Completed 45000 out of 500000 steps (9%)
[14:27:34] Completed 50000 out of 500000 steps (10%)
[14:57:34] Completed 55000 out of 500000 steps (11%)
[15:21:43] Completed 60000 out of 500000 steps (12%)
[15:43:18] Completed 65000 out of 500000 steps (13%)
[16:04:54] Completed 70000 out of 500000 steps (14%)
[16:07:07] ***** Got a SIGTERM signal (15)
[16:07:07] Killing all core threads
Folding@Home Client Shutdown.
Received the TERM signal, stopping at the next step
Received the TERM signal, stopping at the next step
Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.
2 cores detected
--- Opening Log file [February 20 16:27:19 UTC]
# Linux SMP Console Edition ###################################################
###############################################################################
Folding@Home Client Version 6.29
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: /home/folder/fah6-1
Executable: ./fah6
Arguments: -verbosity 9 -smp
[16:27:19] - Ask before connecting: No
[16:27:19] - User name: chrisretusn (Team 2291)
[16:27:19] - User ID: 83719AB3EDA1FA2
[16:27:19] - Machine ID: 1
[16:27:19]
[16:27:20] Loaded queue successfully.
[16:27:20] - Autosending finished units... [February 20 16:27:20 UTC]
[16:27:20] Trying to send all finished work units
[16:27:20] + No unsent completed units remaining.
[16:27:20] - Autosend completed
[16:27:20]
[16:27:20] + Processing work unit
[16:27:20] Core required: FahCore_a3.exe
[16:27:20] Core found.
[16:27:20] Working on queue slot 01 [February 20 16:27:20 UTC]
[16:27:20] + Working ...
[16:27:20] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 2 -checkpoint 15 -verbose -lifeline 3224 -version 629'
[16:27:20]
[16:27:20] *------------------------------*
[16:27:20] Folding@Home Gromacs SMP Core
[16:27:20] Version 2.13 (Tue Aug 18 18:28:33 CEST 2009)
[16:27:20]
[16:27:20] Preparing to commence simulation
[16:27:20] - Looking at optimizations...
[16:27:21] - Files status OK
[16:27:21] - Expanded 1797729 -> 2078149 (decompressed 115.5 percent)
[16:27:21] Called DecompressByteArray: compressed_data_size=1797729 data_size=2078149, decompressed_data_size=2078149 diff=0
[16:27:21] - Digital signature verified
[16:27:21]
[16:27:21] Project: 6012 (Run 0, Clone 235, Gen 30)
[16:27:21]
[16:27:21] Assembly optimizations on if available.
[16:27:21] Entering M.D.
[16:27:27] Using Gromacs checkpoints
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_01.tpr, VERSION 4.0.99_development_20090605 (single precision)
Reading checkpoint file work/wudata_01.cpt generated: Sun Feb 21 00:07:08 2010
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
15500001 steps, 31000.0 ps (continuing from step 15070512, 30141.0 ps).
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090927
Source code file: md.c, line: 1419
Fatal error:
Checkpoint error on step 0
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
: No such process
Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 2
gcq#0: Thanx for Using GROMACS - Have a Nice Day
tMPI_Abort called wiht error code -1 on thread 1
[16:27:31] fcSaveRestoreState: I/O failed dir=0, var=000000004265ABF0, varsize=20
[16:27:31] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[16:27:31] Resuming from checkpoint
[16:27:31] fcSaveRestoreState: I/O failed dir=0, var=0000000040E52E90, varsize=20
[16:27:31] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
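
In case it helps anyone hitting the same thing, here is a rough sketch of how the log could be scanned for this failure before deciding whether to delete a slot. It assumes that the strings "Checkpoint error on step" and "fcCheckPointResume: failure" (both copied from the log above) are a reasonable signature for an unrecoverable checkpoint; that is my assumption, not something the client documents. The function name `find_checkpoint_failures` is made up for illustration.

```python
import re

# Patterns copied from the log above. Treating them as a
# "lost work unit" signature is an assumption, not documented behaviour.
SLOT_RE = re.compile(r"Working on queue slot (\d+)")
FAILURE_MARKERS = (
    "Checkpoint error on step",
    "fcCheckPointResume: failure",
)


def find_checkpoint_failures(log_text):
    """Return a sorted list of queue slots whose log section contains a failure marker."""
    current_slot = None
    failed = set()
    for line in log_text.splitlines():
        match = SLOT_RE.search(line)
        if match:
            # Remember which slot the following lines belong to.
            current_slot = match.group(1)
            continue
        if current_slot is not None and any(m in line for m in FAILURE_MARKERS):
            failed.add(current_slot)
    return sorted(failed)


# A few lines taken from the log in this post.
sample = """\
[16:27:20] Working on queue slot 01 [February 20 16:27:20 UTC]
Fatal error:
Checkpoint error on step 0
[16:27:31] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
"""

print(find_checkpoint_failures(sample))
```

To use it on a real log, read the file into a string (for example with `open("FAHlog.txt").read()`, assuming that is where your client writes its log) and pass it in. A slot it reports would be a candidate for the "fah6 -delete 01" approach mentioned at the top, but it only flags the log lines and does not prove the WU is unrecoverable.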