
Project: 2669 (Run 1, Clone 21, Gen 13)

Posted: Fri Sep 19, 2008 6:06 pm
by 314159
Linux, current-release combined client, stock clock, stable machine.

I must admit that I am NOT in favor of having a so-called "combined client" for any platform.
It seems to me to be mixing eggs and apples.
I also suspect that resolving bugs will prove more difficult.

Whom would we offend if we started a "friendly petition" to reinstate the stand-alone console version? :D

As it is, the Linux SMP client is almost totally devoid of proper error trapping.
And yes, I do suspect that fixing this is not a top priority.
(I run the beta release posted in a thread here by Dr. Kasson on about half of my machines.
It appears to eliminate most or all of the stalls reflected in this WU, but it does not resolve the other issues.)
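
To illustrate what I mean by "proper error trapping": nothing exotic, just launch the core, look at the exit status, and never wait without a deadline. A rough sketch in Python follows; it is purely hypothetical on my part and not client code, the command line is lifted from the log below, and the four-day deadline is a number I invented.

Code:

import subprocess

# Hypothetical wrapper -- NOT the actual client code.  The command line is
# copied from the client log below; the deadline is an invented figure.
CORE_CMD = [
    "./mpiexec", "-np", "4", "-host", "127.0.0.1", "./FahCore_a2.exe",
    "-dir", "work/", "-suffix", "02", "-checkpoint", "15",
    "-forceasm", "-verbose", "-lifeline", "16305", "-version", "602",
]
DEADLINE_SECONDS = 4 * 24 * 3600  # give one WU at most four days

def run_core():
    try:
        # Wait for the core, but only up to the deadline.
        result = subprocess.run(CORE_CMD, timeout=DEADLINE_SECONDS)
    except subprocess.TimeoutExpired:
        print("Core exceeded the deadline -- treating the WU as failed.")
        return False
    if result.returncode != 0:
        # A CoreStatus of FF (255), as in the log below, would show up here as 255.
        print(f"Core exited with status {result.returncode} -- abandoning the WU.")
        return False
    return True

if __name__ == "__main__":
    raise SystemExit(0 if run_core() else 1)

The point is not the details; it is that every failure path ends in a definite action instead of an indefinite wait.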

Perhaps this one will assist in the process? :?: :)

Code:

[03:04:05] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -forceasm -verbose -lifeline 16305 -version 602'

[03:04:05] 
[03:04:05] *------------------------------*
[03:04:05] Folding@Home Gromacs SMP Core
[03:04:05] Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)
[03:04:05] 
[03:04:05] Preparing to commence simulation
[03:04:05] - Ensuring status. Please wait.
[03:04:14] - Assembly optimizations manually forced on.
[03:04:14] - Not checking prior termination.
[03:04:16] - Expanded 4837744 -> 23983821 (decompressed 495.7 percent)
[03:04:17] Called DecompressByteArray: compressed_data_size=4837744 data_size=23983821, decompressed_data_size=23983821 diff=0
[03:04:17] - Digital signature verified
[03:04:17] 
[03:04:17] Project: 2669 (Run 1, Clone 21, Gen 13)
[03:04:17] 
[03:04:17] Assembly optimizations on if available.
[03:04:17] Entering M.D.
NNODES=4, MYRANK=1, HOSTNAME=L12SMP
NNODES=4, MYRANK=2, HOSTNAME=L12SMP
NNODES=4, MYRANK=0, HOSTNAME=L12SMP
NNODES=4, MYRANK=3, HOSTNAME=L12SMP
NODEID=0 argc=19
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 3.3.99_development_200800503  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
NODEID=2 argc=19
NODEID=3 argc=19
NODEID=1 argc=19
Note: tpx file_version 48, software version 56
Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22884 system'
250000 steps,    500.0 ps.

Writing checkpoint, step 3251940 at Wed Sep 17 23:19:26 2008
[03:23:43] Completed 2500 out of 250000 steps  (1%)

<snip>

Writing checkpoint, step 3409780 at Thu Sep 18 19:34:29 2008
[23:36:10] Completed 160000 out of 250000 steps  (64%)

-------------------------------------------------------
Program mdrun, VERSION 3.3.99_development_200800503
Source code file: nsgrid.c, line: 358                <-------Would this be useful in ERROR TRAPPING?

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483487. It should have been within [ 0 .. 2250 ]

-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

-------------------------------------------------------
Program mdrun, VERSION 3.3.99_development_200800503
Source code file: nsgrid.c, line: 358            <-------Same comment.

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483485. It should have been within [ 0 .. 2601 ]

-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_3]: aborting job:
application called MP
-------------------------------------------------------
Program mdrun, VERSION 3.3.99_development_200800503
Source code file: nsgrid.c, line: 358

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483521. It should have been within [ 0 .. 2023 ]

-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_2]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
I_Abort(MPI_COMM_WORLD, -1) - process 2
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 255
[0]3:Return code = 255
[23:49:06] CoreStatus = FF (255)
[23:49:06] Client-core communications error: ERROR 0xff
[23:49:06] Deleting current work unit & continuing...

                                           <-----Note delay!  Over 1 Hour.

[00:51:20] ***** Got an Activate signal (2)<------User Intervention for obvious reasons.
[00:51:20] Killing all core threads

Folding@Home Client Shutdown.
jsc@L12SMP:~/folding/FAH$ 
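
As an aside, GROMACS itself traps the bad value; that is exactly what the nsgrid.c message above is saying. Reduced to a toy in Python (an illustration of my own, not GROMACS source), the check amounts to mapping a coordinate onto a grid cell and refusing to continue when the index lands outside the grid.

Code:

import math

# Toy illustration only -- not GROMACS source.
def cell_index(x, cell_size, ncells):
    # A NaN or infinite coordinate (a blown-up system) cannot be gridded at all.
    if not math.isfinite(x):
        raise ValueError(f"coordinate is {x}; the system has blown up")
    ci = int(x // cell_size)
    if not 0 <= ci < ncells:
        # The same condition that produced "Variable ci has value ..." above.
        raise ValueError(f"ci = {ci}, should have been within [0 .. {ncells - 1}]")
    return ci

The core even reports the source file and line where the check failed; the client just does nothing useful with that information.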
This particular one, and many like it, would have hung forever without user intervention. :(
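
Until the client does something along those lines, an outside watchdog seems to be the only remedy. Another rough Python sketch, again entirely hypothetical: the log file name, the stall window, and my reading of the -lifeline PID from the mpiexec call above as the client's PID are all assumptions on my part. It waits for a core failure to appear in the log and, once the client goes quiet, sends the interrupt I otherwise have to deliver by hand.

Code:

import os
import signal
import time

LOG_PATH = "FAHlog.txt"   # assumed name of the client's log file
CLIENT_PID = 16305        # the -lifeline PID from the mpiexec call above
STALL_WINDOW = 15 * 60    # seconds of silence after a core error before acting

def watch():
    last_size = os.path.getsize(LOG_PATH)
    last_change = time.time()
    core_error = False
    while True:
        time.sleep(30)
        size = os.path.getsize(LOG_PATH)
        if size != last_size:
            last_size, last_change = size, time.time()
            with open(LOG_PATH, "rb") as log:
                log.seek(max(size - 4096, 0))
                tail = log.read().decode("ascii", "replace")
            if "CoreStatus = FF" in tail:
                core_error = True
        elif core_error and time.time() - last_change > STALL_WINDOW:
            # The client reported a core failure and has written nothing since:
            # deliver the Ctrl-C it otherwise waits for indefinitely.
            os.kill(CLIENT_PID, signal.SIGINT)
            return

if __name__ == "__main__":
    watch()

Crude, but it would have spared me the hour-plus of dead air shown above.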