
Project: 6012 (Run 0, Clone 235, Gen 30)

Posted: Sat Feb 20, 2010 5:44 pm
by chrisretusn
Had a power failure; the UPS initiated a shutdown, and fah6 was stopped at 16:07:07. After the power came back and I restarted the computer, fah6 starts OK, but the core errors out and just stops at the last point in the log below. Stopping and restarting fah6 gives the same result. The error does not terminate all cores; one of them stays active, as you can see in the top extract below.

I am guessing this WU is lost and I should just issue a "fah6 -delete 01" and restart with a new WU.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3427 folder    20   0 25936  504  376 S  0.0  0.0   0:00.00 fah6
 3431 folder    39  19  123m  27m 2160 S  0.0  0.7   0:01.29 FahCore_a3.exe

[09:34:09] - Preparing to get new work unit...
[09:34:09] Cleaning up work directory
[09:34:09] + Attempting to get work packet
[09:34:09] Passkey found
[09:34:09] - Will indicate memory of 3816 MB
[09:34:09] - Connecting to assignment server
[09:34:09] Connecting to http://assign.stanford.edu:8080/
[09:34:13] Posted data.
[09:34:13] Initial: ED82; - Successful: assigned to (130.237.232.140).
[09:34:13] + News From Folding@Home: Welcome to Folding@Home
[09:34:13] Loaded queue successfully.
[09:34:13] Connecting to http://130.237.232.140:8080/
[09:34:17] Posted data.
[09:34:17] Initial: 0000; - Receiving payload (expected size: 1798241)
[09:35:01] - Downloaded at ~39 kB/s
[09:35:01] - Averaged speed for that direction ~46 kB/s
[09:35:01] + Received work.
[09:35:01] Trying to send all finished work units
[09:35:01] + No unsent completed units remaining.
[09:35:01] + Closed connections
[09:35:01] 
[09:35:01] + Processing work unit
[09:35:01] Core required: FahCore_a3.exe
[09:35:01] Core found.
[09:35:01] Working on queue slot 01 [February 20 09:35:01 UTC]
[09:35:01] + Working ...
[09:35:01] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 2 -checkpoint 15 -verbose -lifeline 3414 -version 629'
[09:35:01] 
[09:35:01] *------------------------------*
[09:35:01] Folding@Home Gromacs SMP Core
[09:35:01] Version 2.13 (Tue Aug 18 18:28:33 CEST 2009)
[09:35:01] 
[09:35:01] Preparing to commence simulation
[09:35:01] - Looking at optimizations...
[09:35:01] - Created dyn
[09:35:01] - Files status OK
[09:35:01] - Expanded 1797729 -> 2078149 (decompressed 115.5 percent)
[09:35:01] Called DecompressByteArray: compressed_data_size=1797729 data_size=2078149, decompressed_data_size=2078149 diff=0
[09:35:01] - Digital signature verified
[09:35:01] 

[09:35:01] Project: 6012 (Run 0, Clone 235, Gen 30)
[09:35:01] 
[09:35:01] Assembly optimizations on if available.
[09:35:01] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_01.tpr, VERSION 4.0.99_development_20090605 (single precision)
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
15500001 steps,  31000.0 ps (continuing from step 15000001,  30000.0 ps).
[09:35:08] Completed 0 out of 500000 steps  (0%)
[10:02:55] Completed 5000 out of 500000 steps  (1%)
[10:29:48] Completed 10000 out of 500000 steps  (2%)
[10:59:29] Completed 15000 out of 500000 steps  (3%)
[11:28:36] Completed 20000 out of 500000 steps  (4%)
[12:04:29] Completed 25000 out of 500000 steps  (5%)
[12:38:42] Completed 30000 out of 500000 steps  (6%)
[13:04:54] Completed 35000 out of 500000 steps  (7%)
[13:08:44] - Autosending finished units... [February 20 13:08:44 UTC]
[13:08:44] Trying to send all finished work units
[13:08:44] + No unsent completed units remaining.
[13:08:44] - Autosend completed
[13:33:14] Completed 40000 out of 500000 steps  (8%)
[13:58:53] Completed 45000 out of 500000 steps  (9%)
[14:27:34] Completed 50000 out of 500000 steps  (10%)
[14:57:34] Completed 55000 out of 500000 steps  (11%)
[15:21:43] Completed 60000 out of 500000 steps  (12%)
[15:43:18] Completed 65000 out of 500000 steps  (13%)
[16:04:54] Completed 70000 out of 500000 steps  (14%)
[16:07:07] ***** Got a SIGTERM signal (15)
[16:07:07] Killing all core threads

Folding@Home Client Shutdown.


Received the TERM signal, stopping at the next step



Received the TERM signal, stopping at the next step


Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [February 20 16:27:19 UTC] 


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.29

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/folder/fah6-1
Executable: ./fah6
Arguments: -verbosity 9 -smp 

[16:27:19] - Ask before connecting: No
[16:27:19] - User name: chrisretusn (Team 2291)
[16:27:19] - User ID: 83719AB3EDA1FA2
[16:27:19] - Machine ID: 1
[16:27:19] 
[16:27:20] Loaded queue successfully.
[16:27:20] - Autosending finished units... [February 20 16:27:20 UTC]
[16:27:20] Trying to send all finished work units
[16:27:20] + No unsent completed units remaining.
[16:27:20] - Autosend completed
[16:27:20] 
[16:27:20] + Processing work unit
[16:27:20] Core required: FahCore_a3.exe
[16:27:20] Core found.
[16:27:20] Working on queue slot 01 [February 20 16:27:20 UTC]
[16:27:20] + Working ...
[16:27:20] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 2 -checkpoint 15 -verbose -lifeline 3224 -version 629'

[16:27:20] 
[16:27:20] *------------------------------*
[16:27:20] Folding@Home Gromacs SMP Core
[16:27:20] Version 2.13 (Tue Aug 18 18:28:33 CEST 2009)
[16:27:20] 
[16:27:20] Preparing to commence simulation
[16:27:20] - Looking at optimizations...
[16:27:21] - Files status OK
[16:27:21] - Expanded 1797729 -> 2078149 (decompressed 115.5 percent)
[16:27:21] Called DecompressByteArray: compressed_data_size=1797729 data_size=2078149, decompressed_data_size=2078149 diff=0
[16:27:21] - Digital signature verified
[16:27:21] 
[16:27:21] Project: 6012 (Run 0, Clone 235, Gen 30)
[16:27:21] 
[16:27:21] Assembly optimizations on if available.
[16:27:21] Entering M.D.
[16:27:27] Using Gromacs checkpoints
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_01.tpr, VERSION 4.0.99_development_20090605 (single precision)

Reading checkpoint file work/wudata_01.cpt generated: Sun Feb 21 00:07:08 2010

Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
15500001 steps,  31000.0 ps (continuing from step 15070512,  30141.0 ps).

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090927
Source code file: md.c, line: 1419

Fatal error:
Checkpoint error on step 0

For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day
: No such process
Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 2

gcq#0: Thanx for Using GROMACS - Have a Nice Day

tMPI_Abort called wiht error code -1 on thread 1
[16:27:31] fcSaveRestoreState: I/O failed dir=0, var=000000004265ABF0, varsize=20
[16:27:31] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[16:27:31] Resuming from checkpoint
[16:27:31] fcSaveRestoreState: I/O failed dir=0, var=0000000040E52E90, varsize=20
[16:27:31] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.

Re: Project: 6012 (Run 0, Clone 235, Gen 30)

Posted: Fri Apr 09, 2010 8:59 am
by noorman

Seems to be a file (data) corruption problem; I fear the WU will be lost.

I would back up the Work folder and queue.dat to another folder, then delete those two items from your current F@H folder.
Then restart Folding@Home; the Work folder and queue.dat file will be recreated and F@H should start again as normal.
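
The backup-and-remove step above could be scripted roughly as follows. This is only a minimal sketch, not an official tool: the function name backup_and_reset and the timestamped backup folder are made up for illustration, and the folder name "work" is assumed from the "-dir work/" argument in the log above. Stop the client before running anything like this.

```python
import shutil
import time
from pathlib import Path


def backup_and_reset(fah_dir, backup_root):
    """Move work/ and queue.dat out of the client folder into a new backup folder.

    Hypothetical helper: the names "work" and "queue.dat" are assumptions
    taken from the log and the advice above, not from client documentation.
    Returns the backup folder and the list of items that were moved.
    """
    fah_dir = Path(fah_dir)
    # A timestamped folder keeps earlier backups from being overwritten.
    backup_dir = Path(backup_root) / time.strftime("fah-backup-%Y%m%d-%H%M%S")
    backup_dir.mkdir(parents=True, exist_ok=False)

    moved = []
    for name in ("work", "queue.dat"):
        src = fah_dir / name
        if src.exists():
            # Moving both backs the item up and removes it from the client folder.
            shutil.move(str(src), str(backup_dir / name))
            moved.append(name)
    return backup_dir, moved


# Example call (the backup path is a placeholder; adjust to your setup):
# backup_and_reset("/home/folder/fah6-1", "/home/folder/fah-backups")
```

Moving the items rather than copying and then deleting them means nothing is removed unless it has arrived in the backup folder first.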

