Project: 2619 (Run 6, Clone 644, Gen 1)

Moderators: Site Moderators, FAHC Science Team

Tigerbiten
Posts: 62
Joined: Sun Dec 02, 2007 6:02 am

Post by Tigerbiten »

The first time through, it errored at 100% with CoreStatus = FF (255).

Code: Select all

[14:32:08] Folding@Home Gromacs SMP Core
[14:32:08] Version 1.91 (2007)
[14:32:08] 
[14:32:08] Preparing to commence simulation
[14:32:08] - Ensuring status. Please wait.
[14:32:08] Finalizing output
[14:32:10] - Expanded 9147494 -> 48331685 (decompressed 58.8 percent)
[14:32:11] Project: 2619 (Run 6, Clone 644, Gen 1)
[14:32:12] Assembly optimizations on if available.
[14:32:30] Entering M.D.
[14:32:41]   (0%)
[14:32:41] ed 0 out of 125000 steps  (0%)
[14:40:59] ps  (1%)
[14:41:00]  630 out of 125000 steps  (1%)
[14:57:00] eps  (2%)
[14:57:00] 1880 out of 125000 steps  (2%)
[15:12:23] eps  (3%)
[15:12:23] 3130 out of 125000 steps  (3%)
....................................
[14:50:49] Completed 121880 out of 125000 steps  (98%)
[15:07:53] Completed 123130 out of 125000 steps  (99%)
[15:25:22] steps  (100%)
[15:25:22] 80 out of 125000 steps  (100%)
[15:33:56] CoreStatus = FF (255)
[15:33:56] Client-core communications error: ERROR 0xff
[15:33:56] Deleting current work unit & continuing...
[15:38:19] - Warning: Could not delete all work unit files (1): Core file absent
[15:38:19] Trying to send all finished work units
[15:38:19] + No unsent completed units remaining.
[15:38:19] - Preparing to get new work unit...
[15:38:19] + Attempting to get work packet
[15:38:19] - Will indicate memory of 999 MB
Then on the next two attempts to run this protein I got some weird errors I've never seen before.

Code: Select all

[15:42:29] + Processing work unit
[15:42:29] Core required: FahCore_a2.exe
[15:42:29] Core found.
[15:42:29] Working on Unit 02 [April 5 15:42:29]
[15:42:29] + Working ...
[15:42:29] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 30 -verbose -lifeline 2202 -version 601'

[15:42:29] 
[15:42:29] *------------------------------*
[15:42:29] Folding@Home Gromacs SMP Core
[15:42:29] Version 1.91 (2007)
[15:42:29] 
[15:42:29] Preparing to commence simulation
[15:42:29] - Ensuring status. Please wait.
[15:42:46] - Looking at optimizations...
[15:42:46] - Working with standard loops on this execution.
[15:42:46] - Previous termination of core was improper.
[15:42:46] - Going to use standard loops.
[15:42:46] - Files status OK
[15:42:46] Error: Work unit read from disk is invalid
[15:42:46] Finalizing output
[15:42:50] - Expanded 9147494 -> 48331685 (decompressed 58.8 percent)
[15:42:52] p619 (Run 6, Clone 644, Gen 1)
[15:42:52] 
[15:42:53] Error: Could not write local file.  Exiting.
[15:4[15:49:28] - Warning: Could not delete all work unit files (2): Core file absent
[15:49:28] Trying to send all finished work units
[15:49:28] + No unsent completed units remaining.
[15:49:28] - Preparing to get new work unit...
[15:49:28] + Attempting to get work packet
[15:49:28] - Will indicate memory of 999 MB
[15:49:28]

Code: Select all

[16:06:27] Working on Unit 03 [April 5 16:06:27]
[16:06:27] + Working ...
[16:06:27] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 03 -checkpoint 30 -verbose -lifeline 2202 -version 601'

[16:06:27] 
[16:06:27] *------------------------------*
[16:06:27] Folding@Home Gromacs SMP Core
[16:06:27] Version 1.91 (2007)
[16:06:27] 
[16:06:27] Preparing to commence simulation
[16:06:27] - Ensuring status. Please wait.
[16:06:44] - Looking at optimizations...
[16:06:44] - Working with standard loops on this execution.
[16:06:44] - Previous termination of core was improper.
[16:06:44] - Going to use standard loops.
[16:06:44] - Files status OK
[16:06:44] Error: Work unit read from disk is invalid
[16:06:44] Finalizing output
[16:06:50] - Expanded 9147494 -> 48331685 (decompressed 58.8 percent)
[16:06:53] 
[16:06:53] Project: 2619 (Run 6, Clone 644, Gen 1)
[16:06:53] 
[16:06:54] Entering M.D.
[16:07:05] Completed 0 out of 125000 steps  (0%)
[16:07:16] dir=1, var=00002AAAB37A9010, varsize=3696564
[16:07:16] fcSaveRestoreState: I/O failed dir=1, var=00002AAAB3B30010, varsize=3696564
[16:07:16] fcSaveRestoreState: I/O failed dir=1, var=00002AAAB46F2010, varsize=3696564
[16:07:16] fcSaveRestoreState: I/O failed dir=1, var=00002AAAB436B010, varsize=3696564
[16:17:31] Completed 630 out of 125000 steps  (1%)
[16:17:40] fcSaveRestoreState: I/O failed dir=1, var=00002AAAB37A9010, varsize=3696564
[16:17:40] fcSaveRestoreState: I/O failed dir=1, var=00002AAAB3B30010, varsize=3696564
[16:17:40] fcSaveRestoreState: I/O failed dir=1, var=00002AAAB46F2010, varsize=3696564
[16:17:40] fcSaveRestoreState: I/O failed dir=1, var=00002AAAB436B010, varsize=3696564
[16:37:16] Completed 1880 out of 125000 steps  (2%)
[16:37:25] fcSaveRestoreState: I/O failed dir=1, var=000000000094A120, varsize=54000
[16:37:25] fcSaveRestoreState: I/O failed dir=1, var=0000000000957410, varsize=21120
[16:37:25] fcSaveRestoreState: I/O failed dir=1, var=00002AAAB37A9010, varsize=3696564
[16:37:25] fcSaveRestoreState: I/O failed dir=1, var=00002AAAB3B30010, varsize=3696564
[16:37:25] fcSaveRestoreState: I/O failed dir=1, var=00002AAAB46F2010, varsize=3696564
[16:37:25] fcSaveRestoreState: I/O failed dir=1, var=00002AAAB436B010, varsize=3696564
[16:46:50] CoreStatus = 1 (1)
[16:46:50] Client-core communications error: ERROR 0x1
[16:46:50] - Attempting to download new core...
Luck .............. :D
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: Project: 2619 (Run 6, Clone 644, Gen 1)

Post by kasson »

It looks to me like the checkpoints got corrupted. What I'd suggest is the following:
Stop the client. Make sure all the core processes die out. Copy the wudata_xx.dat file out of the work directory. Delete all the other *_xx.* files (where xx corresponds to the queue position). Copy the wudata_xx.dat file back in. Restart the client. It will start the WU from the beginning again, unfortunately, but there shouldn't be any other residual nastiness.
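
For anyone who would rather script the cleanup than do it by hand, here is a minimal Python sketch of the steps above. It is not an official tool: the function name `reset_work_unit` and its parameters are made up for illustration, and it assumes the queue position is the two-digit suffix shown in the log (for example `-suffix 02`) and that the work directory is `work/` under the client directory. It does not stop the client or kill the core processes, so do that first.

```python
import shutil
import tempfile
from pathlib import Path


def reset_work_unit(work_dir, slot):
    """Keep wudata_<slot>.dat and delete every other *_<slot>.* file.

    Hypothetical helper: work_dir is the client's work directory and
    slot is the two-digit queue position, e.g. "02".
    Returns the names of the deleted files, sorted.
    """
    work = Path(work_dir)
    wudata = work / f"wudata_{slot}.dat"
    if not wudata.is_file():
        raise FileNotFoundError(f"{wudata} not found; nothing was changed")

    removed = []
    with tempfile.TemporaryDirectory() as tmp:
        backup = Path(tmp) / wudata.name
        # Copy the wudata file out of the work directory.
        shutil.copy2(wudata, backup)
        # Delete all the other files that belong to this queue position.
        for path in list(work.glob(f"*_{slot}.*")):
            if path.is_file() and path != wudata:
                path.unlink()
                removed.append(path.name)
        # Copy the wudata file back in.
        shutil.copy2(backup, wudata)
    return sorted(removed)


if __name__ == "__main__":
    # Assumed layout: run from the client directory, queue position 02.
    print(reset_work_unit("work", "02"))
```

The script refuses to touch anything if `wudata_xx.dat` is missing, since without it there is nothing to restart from. After it runs, restart the client; the WU starts again from the beginning.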
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: Project: 2619 (Run 6, Clone 644, Gen 1)

Post by VijayPande »

PS Peter and I met today and we have some ideas. No ETA, but there is some progress (new code that we're starting to test).
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University