Project: 2671 (Run 27, Clone 85, Gen 40) what happened here?

Posted: Mon Jun 01, 2009 2:02 am
by alpha754293
What happened here??? Any way of recovering/restoring it?

log:

[20:15:02] 
[20:15:02] *------------------------------*
[20:15:02] Folding@Home Gromacs SMP Core
[20:15:02] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[20:15:02] 
[20:15:02] Preparing to commence simulation
[20:15:02] - Ensuring status. Please wait.
[20:15:03] Called DecompressByteArray: compressed_data_size=4824526 data_size=24062605, decompressed_data_size=24062605 diff=0
[20:15:03] - Digital signature verified
[20:15:03] 
[20:15:03] Project: 2671 (Run 27, Clone 85, Gen 40)
[20:15:03] 
[20:15:03] Assembly optimizations on if available.
[20:15:03] Entering M.D.
[20:15:09] Using Gromacs checkpoints
[20:15:12] 
[20:15:13] Entering M.D.
[20:15:19] Using Gromacs checkpoints
NNODES=4, MYRANK=1, HOSTNAME=computenode
NNODES=4, MYRANK=2, HOSTNAME=computenode
NNODES=4, MYRANK=0, HOSTNAME=computenode
NNODES=4, MYRANK=3, HOSTNAME=computenode
NODEID=0 argc=23
NODEID=1 argc=23
NODEID=2 argc=23
NODEID=3 argc=23
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_00.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

Reading checkpoint file work/wudata_00.cpt generated: Sun May 31 16:12:37 2009


NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22911 system in water'
10250001 steps,  20500.0 ps (continuing from step 10135010,  20270.0 ps).
[20:15:22] data_00.log
[20:15:22] Verified work/wudata_00.trr
[20:15:22] Verified work/wudata_00.xtc
[20:15:22] Verified work/wudata_00.edr
[20:15:22] Completed 135009 out of 250000 steps  (54%)
[20:24:03] Completed 137500 out of 250000 steps  (55%)
[20:32:45] Completed 140000 out of 250000 steps  (56%)
[20:41:26] Completed 142500 out of 250000 steps  (57%)
[20:50:07] Completed 145000 out of 250000 steps  (58%)
[20:58:49] Completed 147500 out of 250000 steps  (59%)
[21:07:31] Completed 150000 out of 250000 steps  (60%)
[21:16:12] Completed 152500 out of 250000 steps  (61%)
[21:24:56] Completed 155000 out of 250000 steps  (62%)
[21:33:39] Completed 157500 out of 250000 steps  (63%)
[21:42:22] Completed 160000 out of 250000 steps  (64%)
[21:51:06] Completed 162500 out of 250000 steps  (65%)
[21:59:49] Completed 165000 out of 250000 steps  (66%)
[22:08:32] Completed 167500 out of 250000 steps  (67%)
[22:17:15] Completed 170000 out of 250000 steps  (68%)
[22:25:58] Completed 172500 out of 250000 steps  (69%)
[22:34:42] Completed 175000 out of 250000 steps  (70%)
[22:43:24] Completed 177500 out of 250000 steps  (71%)
[22:52:09] Completed 180000 out of 250000 steps  (72%)
[23:00:53] Completed 182500 out of 250000 steps  (73%)
[23:09:36] Completed 185000 out of 250000 steps  (74%)
[23:18:23] Completed 187500 out of 250000 steps  (75%)
[23:27:09] Completed 190000 out of 250000 steps  (76%)
[23:35:55] Completed 192500 out of 250000 steps  (77%)
[23:44:42] Completed 195000 out of 250000 steps  (78%)
[23:53:28] Completed 197500 out of 250000 steps  (79%)
[00:02:15] Completed 200000 out of 250000 steps  (80%)
[00:11:00] Completed 202500 out of 250000 steps  (81%)
[00:19:43] Completed 205000 out of 250000 steps  (82%)
[00:28:28] Completed 207500 out of 250000 steps  (83%)
[00:37:14] Completed 210000 out of 250000 steps  (84%)
[00:45:59] Completed 212500 out of 250000 steps  (85%)
[00:54:44] Completed 215000 out of 250000 steps  (86%)
[01:03:30] Completed 217500 out of 250000 steps  (87%)
[01:12:17] Completed 220000 out of 250000 steps  (88%)

DD cell 0 0 2: Neighboring cells do not have atoms: 19890 19892 19894 19888

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: domdec_con.c, line: 680

Fatal error:
DD cell 0 0 2 could only obtain 166 of the 170 atoms that are connected via constraints from the neighboring cells. This probably means your constraint lengths are too long compared to the domain decomposition cell size. Decrease the number of domain decomposition grid cells or lincs-order or use the -rcon option of mdrun.
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_2]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
[01:17:10] 
[01:17:10] Folding@home Core Shutdown: INTERRUPTED
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 102) - process 0
[cli_1]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
[cli_3]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
[01:17:14] CoreStatus = FF (255)
[01:17:14] Sending work to server
[01:17:14] Project: 2671 (Run 27, Clone 85, Gen 40)
[01:17:14] - Error: Could not get length of results file work/wuresults_00.dat
[01:17:14] - Error: Could not read unit 00 file. Removing from queue.
[01:17:14] Trying to send all finished work units
[01:17:14] + No unsent completed units remaining.
[01:17:14] + Closed connections
[01:17:14] + Paused after finishing unit
[01:17:14] Press Enter to continue, Ctrl-C to exit...

Re: Project: 2671 (Run 27, Clone 85, Gen 40) what happened here?

Posted: Mon Jun 01, 2009 2:13 am
by 7im
Many of those errors have been reported, but few of the threads have a reply from the Pande Group (PG). Here is one that does:

http://foldingforum.org/viewtopic.php?f=44&t=8730

Re: Project: 2671 (Run 27, Clone 85, Gen 40) what happened here?

Posted: Mon Jun 01, 2009 2:15 am
by alpha754293
No, I don't mean the CoreStatus = FF (255).

I mean this part:
DD cell 0 0 2: Neighboring cells do not have atoms: 19890 19892 19894 19888

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: domdec_con.c, line: 680

Fatal error:
DD cell 0 0 2 could only obtain 166 of the 170 atoms that are connected via constraints from the neighboring cells. This probably means your constraint lengths are too long compared to the domain decomposition cell size. Decrease the number of domain decomposition grid cells or lincs-order or use the -rcon option of mdrun.
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------
and then it kicked/killed 2 out of the 4 MPI processes.

THAT'S what I'm more interested in. Is there still some way for me to salvage the WU, or is it lost for good?

Re: Project: 2671 (Run 27, Clone 85, Gen 40) what happened here?

Posted: Mon Jun 01, 2009 2:28 am
by 7im
Oh, sorry. Guessing it's a malformed WU, whose logical conclusion falls outside the defined spec of the protein simulation, like when the molecules expand outside a spec'd boundary.

Yes, very interesting failure, not your typical run-of-the-mill EUE. I couldn't find that fatal error text anywhere else in this forum. Unfortunately, there's also no standard way to recover, though you are welcome to try.
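
For readers trying to picture what the fatal error text is complaining about, here is a minimal sketch of the condition it describes. This is a toy 1D model, not the actual Gromacs domain decomposition code: the function name, its parameters, and the slab geometry are all assumptions made for illustration. Only the idea of a reach distance (which the error message ties to the -rcon option) and the 1 x 1 x 4 grid come from the log. The box is cut into slabs along one axis, and each slab may only fetch constraint partners that sit within `rcon` of its boundary.

```python
def missing_constraint_partners(z, constraints, box_z, n_cells, rcon):
    """Toy 1D check: which constraint partners can a cell NOT fetch?

    z           -- z coordinate of each atom, 0 <= z < box_z
    constraints -- list of (i, j) atom index pairs joined by a constraint
    box_z       -- box length along the decomposed axis
    n_cells     -- number of slabs the box is cut into (4 for a 1 x 1 x 4 grid)
    rcon        -- how far past its own boundary a cell may look (hypothetical)
    """
    cell_len = box_z / n_cells
    missing = []
    for i, j in constraints:
        # Check the constraint from both ends: each atom's home cell
        # needs to see the atom at the other end.
        for home, partner in ((i, j), (j, i)):
            cell = int(z[home] // cell_len) % n_cells
            lo, hi = cell * cell_len, (cell + 1) * cell_len
            zp = z[partner]
            if lo <= zp < hi:
                continue  # partner already lives in the home cell
            # Periodic distance from the partner to the home slab.
            dist = min((lo - zp) % box_z, (zp - hi) % box_z)
            if dist > rcon:
                missing.append((cell, partner))
    return missing


# A short constraint that straddles a cell boundary is fine...
print(missing_constraint_partners([5.9, 6.1], [(0, 1)], 8.0, 4, 0.3))
# ...but a stretched one leaves both cells unable to find their partner.
print(missing_constraint_partners([5.0, 7.0], [(0, 1)], 8.0, 4, 0.3))
```

In this toy picture, fewer cells (larger slabs) or a larger reach both make the check easier to pass, which matches the remedies the error message itself suggests. It says nothing about why the atoms ended up that far apart in this WU.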

Re: Project: 2671 (Run 27, Clone 85, Gen 40) what happened here?

Posted: Mon Jun 01, 2009 2:30 am
by alpha754293
7im wrote: Oh, sorry. Guessing it's a malformed WU, whose logical conclusion falls outside the defined spec of the protein simulation, like when the molecules expand outside a spec'd boundary.

Yes, very interesting failure, not your typical run-of-the-mill EUE. I couldn't find that fatal error text anywhere else in this forum. Unfortunately, there's also no standard way to recover, though you are welcome to try.

qfix?

*edit*

As I'm sure you're well aware, I see a lot of various errors (somewhat unfortunately), but THIS...is a new one for me. Even by my standards.

The only thing that changed was that I stopped the run at about 54% to add the "pause" flag, so that the client would not download a new WU once this one completed, in preparation for the "-smp 8" tests.

Re: Project: 2671 (Run 27, Clone 85, Gen 40) what happened here?

Posted: Mon Jun 01, 2009 6:48 pm
by bruce
That's an error inside of Gromacs. (qfix will do nothing for errors reported by the science code.) I have no way of knowing whether it's a malformed WU, a bug in Gromacs itself, or even a hardware calculation error. Only the Pande Group would know.
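
To tell an error raised by the science code apart from a client-side problem, it can help to pull the Gromacs error block out of the log. Below is a minimal sketch that does this, assuming the log has the same layout as the one posted in this thread (a "Source code file: ..., line: ..." line shortly before "Fatal error:", with the message on the next non-empty line). The function name is hypothetical and other core versions may print the block differently.

```python
import re

SOURCE_RE = re.compile(r"Source code file: (\S+), line: (\d+)")


def find_gromacs_fatal_errors(log_text):
    """Return one dict per 'Fatal error:' block found in the log text."""
    lines = log_text.splitlines()
    errors = []
    for idx, line in enumerate(lines):
        if line.strip() != "Fatal error:":
            continue
        # Look a few lines back for the source file and line number.
        source = None
        for prev in reversed(lines[max(0, idx - 4):idx]):
            match = SOURCE_RE.match(prev.strip())
            if match:
                source = (match.group(1), int(match.group(2)))
                break
        # The message is the first non-empty line after the marker.
        message = next((l.strip() for l in lines[idx + 1:] if l.strip()), "")
        errors.append({"source": source, "message": message})
    return errors


if __name__ == "__main__":
    # FAHlog.txt is the client log name shown in the directory listing.
    with open("FAHlog.txt", errors="replace") as handle:
        for err in find_gromacs_fatal_errors(handle.read()):
            print(err["source"], err["message"][:80])
```

If this finds a block, the failure came from inside Gromacs, and the source file and line are the details worth including in a report.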

Re: Project: 2671 (Run 27, Clone 85, Gen 40) what happened here?

Posted: Mon Jun 01, 2009 6:58 pm
by alpha754293
bruce wrote: That's an error inside of Gromacs. (qfix will do nothing for errors reported by the science code.) I have no way of knowing whether it's a malformed WU, a bug in Gromacs itself, or even a hardware calculation error. Only the Pande Group would know.

Here's a list of the files that are currently in the directory. I don't know whether anybody can suggest something to do with them, or whether PG would like me to upload those files and the work folder via FTP.

It also doesn't help that I already started my "-smp 8" testing on the same client (as scheduled/planned), so I don't know. I'd be willing to go ahead with anything that anybody can instruct me on. If not, I'll just let the client take care of it and hope that it is able to clean itself up.

total 313608
-rwxr-x--- 1 share users     140 2009-06-01 10:25 client.cfg
-rwxr--r-- 1 share users  223957 2009-06-01 01:58 Copy of fah1.txt
-rw-r--r-- 1 share users 2534266 2009-06-01 02:43 dd_dump_err_0_n0.pdb
-rw-r--r-- 1 share users 2253402 2009-06-01 02:43 dd_dump_err_0_n1.pdb
-rw-r--r-- 1 share users 2545120 2009-06-01 02:43 dd_dump_err_0_n2.pdb
-rw-r--r-- 1 share users 2147207 2009-06-01 02:43 dd_dump_err_0_n3.pdb
-rw-r--r-- 1 share users 2242012 2009-06-01 02:43 dd_dump_err_0_n4.pdb
-rw-r--r-- 1 share users 2058298 2009-06-01 02:43 dd_dump_err_0_n5.pdb
-rw-r--r-- 1 share users 2283820 2009-06-01 02:43 dd_dump_err_0_n6.pdb
-rw-r--r-- 1 share users 2041414 2009-06-01 02:43 dd_dump_err_0_n7.pdb
-rwxr--r-- 1 share users  912040 2009-02-26 01:42 fah1
-rw-r--r-- 1 share users  788663 2009-06-01 14:48 fah1.txt
-rwxr-x--- 1 share users 3625104 2009-02-09 10:25 FahCore_a1.exe
-rwxr-x--- 1 share users 4341288 2009-04-22 11:43 FahCore_a2.exe
-rw-r--r-- 1 share users   96185 2009-05-31 16:14 FAHlog-Prev.txt
-rw-r--r-- 1 share users   32706 2009-06-01 14:48 FAHlog.txt
-rw-r--r-- 1 share users       0 2009-06-01 14:49 list.txt
-rwxr--r-- 1 share users       8 2009-01-23 17:13 machinedependent.dat
-rwxr--r-- 1 share users   68492 2009-01-23 17:12 mpiexec
-rw-r--r-- 1 share users    5236 2009-02-26 01:43 MyFolding.html
-rw-r--r-- 1 share users    7168 2009-06-01 10:28 queue.dat
-rw-r--r-- 1 share users 1708715 2009-04-06 16:29 step18533870b_n1.pdb
-rw-r--r-- 1 share users 1708726 2009-04-06 16:29 step18533870c_n1.pdb
-rw-r--r-- 1 share users 1708715 2009-04-06 16:29 step18533872b_n1.pdb
-rw-r--r-- 1 share users 1708726 2009-04-06 16:29 step18533872c_n1.pdb
-rw-r--r-- 1 share users 1708715 2009-04-06 16:29 step18533873b_n1.pdb
-rw-r--r-- 1 share users 1708726 2009-04-06 16:29 step18533873c_n1.pdb
-rw-r--r-- 1 share users 1708715 2009-04-06 16:29 step18533874b_n1.pdb
-rw-r--r-- 1 share users 1709263 2009-04-06 16:29 step18533874c_n1.pdb
-rw-r--r-- 1 share users 2778795 2009-03-07 22:08 step20061720b_n0.pdb
-rw-r--r-- 1 share users 2778806 2009-03-07 22:08 step20061720c_n0.pdb
-rw-r--r-- 1 share users 2778795 2009-03-07 22:08 step20061721b_n0.pdb
-rw-r--r-- 1 share users 2778806 2009-03-07 22:08 step20061721c_n0.pdb
-rw-r--r-- 1 share users 2778795 2009-03-07 22:08 step20061723b_n0.pdb
-rw-r--r-- 1 share users 2778806 2009-03-07 22:08 step20061723c_n0.pdb
-rw-r--r-- 1 share users 2778795 2009-03-07 22:08 step20061724b_n0.pdb
-rw-r--r-- 1 share users 2778806 2009-03-07 22:08 step20061724c_n0.pdb
-rw-r--r-- 1 share users 2778795 2009-03-07 22:08 step20061725b_n0.pdb
-rw-r--r-- 1 share users 2778806 2009-03-07 22:08 step20061725c_n0.pdb
-rw-r--r-- 1 share users 2778795 2009-03-07 22:08 step20061726b_n0.pdb
-rw-r--r-- 1 share users 2778806 2009-03-07 22:08 step20061726c_n0.pdb
-rw-r--r-- 1 share users 2778795 2009-03-07 22:08 step20061727b_n0.pdb
-rw-r--r-- 1 share users 2778806 2009-03-07 22:08 step20061727c_n0.pdb
-rw-r--r-- 1 share users 2778795 2009-03-07 22:08 step20061728b_n0.pdb
-rw-r--r-- 1 share users 2778806 2009-03-07 22:08 step20061728c_n0.pdb
-rw-r--r-- 1 share users 2801180 2009-03-07 22:08 step20061730b_n0.pdb
-rw-r--r-- 1 share users 2801191 2009-03-07 22:08 step20061730c_n0.pdb
-rw-r--r-- 1 share users 2801180 2009-03-07 22:08 step20061732b_n0.pdb
-rw-r--r-- 1 share users 2801191 2009-03-07 22:08 step20061732c_n0.pdb
-rw-r--r-- 1 share users 2801180 2009-03-07 22:08 step20061734b_n0.pdb
-rw-r--r-- 1 share users 2801191 2009-03-07 22:08 step20061734c_n0.pdb
-rw-r--r-- 1 share users 2801180 2009-03-07 22:08 step20061735b_n0.pdb
-rw-r--r-- 1 share users 2801191 2009-03-07 22:08 step20061735c_n0.pdb
-rw-r--r-- 1 share users 2801180 2009-03-07 22:08 step20061736b_n0.pdb
-rw-r--r-- 1 share users 2801191 2009-03-07 22:08 step20061736c_n0.pdb
-rw-r--r-- 1 share users 2801180 2009-03-07 22:08 step20061737b_n0.pdb
-rw-r--r-- 1 share users 2801191 2009-03-07 22:08 step20061737c_n0.pdb
-rw-r--r-- 1 share users 2841440 2009-03-07 22:08 step20061740b_n0.pdb
-rw-r--r-- 1 share users 2841451 2009-03-07 22:08 step20061740c_n0.pdb
-rw-r--r-- 1 share users 2841440 2009-03-07 22:08 step20061742b_n0.pdb
-rw-r--r-- 1 share users 2841451 2009-03-07 22:08 step20061742c_n0.pdb
-rw-r--r-- 1 share users 2841440 2009-03-07 22:08 step20061744b_n0.pdb
-rw-r--r-- 1 share users 2841451 2009-03-07 22:08 step20061744c_n0.pdb
-rw-r--r-- 1 share users 2841440 2009-03-07 22:08 step20061745b_n0.pdb
-rw-r--r-- 1 share users 2841451 2009-03-07 22:08 step20061745c_n0.pdb
-rw-r--r-- 1 share users 2841440 2009-03-07 22:08 step20061746b_n0.pdb
-rw-r--r-- 1 share users 2841451 2009-03-07 22:08 step20061746c_n0.pdb
-rw-r--r-- 1 share users 2841440 2009-03-07 22:08 step20061748b_n0.pdb
-rw-r--r-- 1 share users 2841451 2009-03-07 22:08 step20061748c_n0.pdb
-rw-r--r-- 1 share users 2871085 2009-03-07 22:08 step20061750b_n0.pdb
-rw-r--r-- 1 share users 2871096 2009-03-07 22:08 step20061750c_n0.pdb
-rw-r--r-- 1 share users 2871085 2009-03-07 22:08 step20061751b_n0.pdb
-rw-r--r-- 1 share users 2871096 2009-03-07 22:08 step20061751c_n0.pdb
-rw-r--r-- 1 share users 2871085 2009-03-07 22:08 step20061752b_n0.pdb
-rw-r--r-- 1 share users 2871096 2009-03-07 22:08 step20061752c_n0.pdb
-rw-r--r-- 1 share users 2871085 2009-03-07 22:08 step20061754b_n0.pdb
-rw-r--r-- 1 share users 2871096 2009-03-07 22:08 step20061754c_n0.pdb
-rw-r--r-- 1 share users 2871085 2009-03-07 22:08 step20061759b_n0.pdb
-rw-r--r-- 1 share users 2871096 2009-03-07 22:08 step20061759c_n0.pdb
-rw-r--r-- 1 share users 2868775 2009-03-07 22:09 step20061761b_n0.pdb
-rw-r--r-- 1 share users 2868786 2009-03-07 22:09 step20061761c_n0.pdb
-rw-r--r-- 1 share users 2868775 2009-03-07 22:09 step20061763b_n0.pdb
-rw-r--r-- 1 share users 2868786 2009-03-07 22:09 step20061763c_n0.pdb
-rw-r--r-- 1 share users 2868775 2009-03-07 22:09 step20061764b_n0.pdb
-rw-r--r-- 1 share users 2868786 2009-03-07 22:09 step20061764c_n0.pdb
-rw-r--r-- 1 share users 2868775 2009-03-07 22:09 step20061765b_n0.pdb
-rw-r--r-- 1 share users 2868786 2009-03-07 22:09 step20061765c_n0.pdb
-rw-r--r-- 1 share users 2868775 2009-03-07 22:09 step20061766b_n0.pdb
-rw-r--r-- 1 share users 2868786 2009-03-07 22:09 step20061766c_n0.pdb
-rw-r--r-- 1 share users 2868775 2009-03-07 22:09 step20061767b_n0.pdb
-rw-r--r-- 1 share users 2868786 2009-03-07 22:09 step20061767c_n0.pdb
-rw-r--r-- 1 share users 2868775 2009-03-07 22:09 step20061768b_n0.pdb
-rw-r--r-- 1 share users 2868786 2009-03-07 22:09 step20061768c_n0.pdb
-rw-r--r-- 1 share users 2868775 2009-03-07 22:09 step20061769b_n0.pdb
-rw-r--r-- 1 share users 2868786 2009-03-07 22:09 step20061769c_n0.pdb
-rw-r--r-- 1 share users 2847380 2009-03-07 22:09 step20061770b_n0.pdb
-rw-r--r-- 1 share users 2847391 2009-03-07 22:09 step20061770c_n0.pdb
-rw-r--r-- 1 share users 2847380 2009-03-07 22:09 step20061771b_n0.pdb
-rw-r--r-- 1 share users 2847391 2009-03-07 22:09 step20061771c_n0.pdb
-rw-r--r-- 1 share users 2847380 2009-03-07 22:09 step20061772b_n0.pdb
-rw-r--r-- 1 share users 2847391 2009-03-07 22:09 step20061772c_n0.pdb
-rw-r--r-- 1 share users 2847380 2009-03-07 22:09 step20061773b_n0.pdb
-rw-r--r-- 1 share users 2847391 2009-03-07 22:09 step20061773c_n0.pdb
-rw-r--r-- 1 share users 2847380 2009-03-07 22:09 step20061774b_n0.pdb
-rw-r--r-- 1 share users 2847391 2009-03-07 22:09 step20061774c_n0.pdb
-rw-r--r-- 1 share users 2847380 2009-03-07 22:09 step20061776b_n0.pdb
-rw-r--r-- 1 share users 2847391 2009-03-07 22:09 step20061776c_n0.pdb
-rw-r--r-- 1 share users 2847380 2009-03-07 22:09 step20061778b_n0.pdb
-rw-r--r-- 1 share users 2847391 2009-03-07 22:09 step20061778c_n0.pdb
-rw-r--r-- 1 share users 2876640 2009-03-07 22:09 step20061780b_n0.pdb
-rw-r--r-- 1 share users 2876651 2009-03-07 22:09 step20061780c_n0.pdb
-rw-r--r-- 1 share users 2876640 2009-03-07 22:09 step20061781b_n0.pdb
-rw-r--r-- 1 share users 2876651 2009-03-07 22:09 step20061781c_n0.pdb
-rw-r--r-- 1 share users 2876640 2009-03-07 22:09 step20061783b_n0.pdb
-rw-r--r-- 1 share users 2876651 2009-03-07 22:09 step20061783c_n0.pdb
-rw-r--r-- 1 share users 2876640 2009-03-07 22:09 step20061785b_n0.pdb
-rw-r--r-- 1 share users 2876651 2009-03-07 22:09 step20061785c_n0.pdb
-rw-r--r-- 1 share users 2876640 2009-03-07 22:09 step20061786b_n0.pdb
-rw-r--r-- 1 share users 2876651 2009-03-07 22:09 step20061786c_n0.pdb
-rw-r--r-- 1 share users 2876640 2009-03-07 22:09 step20061787b_n0.pdb
-rw-r--r-- 1 share users 2876651 2009-03-07 22:09 step20061787c_n0.pdb
-rw-r--r-- 1 share users 2876640 2009-03-07 22:09 step20061788b_n0.pdb
-rw-r--r-- 1 share users 2876651 2009-03-07 22:09 step20061788c_n0.pdb
-rw-r--r-- 1 share users 2876640 2009-03-07 22:09 step20061789b_n0.pdb
-rw-r--r-- 1 share users 2876651 2009-03-07 22:09 step20061789c_n0.pdb
-rw-r--r-- 1 share users 2901995 2009-03-07 22:09 step20061790b_n0.pdb
-rw-r--r-- 1 share users 2902006 2009-03-07 22:09 step20061790c_n0.pdb
-rw-r--r-- 1 share users     352 2009-06-01 14:48 unitinfo.txt
drwxr-x--- 2 share users    4096 2009-06-01 14:48 work
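
With this many dump files in the directory, a quick tally by family makes it easier to see what would need uploading or cleaning up. Here is a minimal sketch that works on `ls -l` text in the format shown above. The function name and the two family labels are assumptions for illustration, and the script only counts files; it does not delete anything.

```python
import re
from collections import defaultdict

STEP_RE = re.compile(r"step\d+[bc]_n\d+\.pdb$")


def tally_dump_files(ls_output):
    """Group .pdb dump files from `ls -l` text by family and sum their sizes."""
    totals = defaultdict(lambda: [0, 0])  # family -> [file count, total bytes]
    for line in ls_output.splitlines():
        # perms, links, owner, group, size, date, time, name (may hold spaces)
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[4].isdigit():
            continue  # skips the "total ..." line and anything malformed
        size, name = int(fields[4]), fields[7]
        if name.startswith("dd_dump_err") and name.endswith(".pdb"):
            family = "dd_dump_err"
        elif STEP_RE.match(name):
            family = "step"
        else:
            continue
        totals[family][0] += 1
        totals[family][1] += size
    return {family: tuple(counts) for family, counts in totals.items()}


if __name__ == "__main__":
    # list.txt is assumed to hold a saved copy of the `ls -l` output.
    with open("list.txt") as handle:
        for family, (count, size) in sorted(tally_dump_files(handle.read()).items()):
            print(f"{family}: {count} files, {size} bytes")
```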