Project: 2671 (Run 51, Clone 11, Gen 82)

Moderators: Site Moderators, FAHC Science Team

alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Post by alpha754293 »

error:

[20:50:03] - Ask before connecting: No
[20:50:03] - User name: alpha754293 (Team 596)
[20:50:03] - User ID: 1909DAEB5DAE563A
[20:50:03] - Machine ID: 2
[20:50:03]
[20:50:04] Loaded queue successfully.
[20:50:04]
[20:50:04] + Processing work unit
[20:50:04] Core required: FahCore_a2.exe
[20:50:04] Core found.
[20:50:04] Working on queue slot 08 [February 6 20:50:04 UTC]
[20:50:04] + Working ...
[20:50:04]
[20:50:04] *------------------------------*
[20:50:04] Folding@Home Gromacs SMP Core
[20:50:04] Version 2.02 (Wed Aug 27 13:11:25 PDT 2008)
[20:50:04]
[20:50:04] Preparing to commence simulation
[20:50:04] - Ensuring status. Please wait.
[20:50:13] - Looking at optimizations...
[20:50:13] - Working with standard loops on this execution.
[20:50:13] - Files status OK
[20:50:14] - Expanded 4836186 -> 24033557 (decompressed 496.9 percent)
[20:50:14] Called DecompressByteArray: compressed_data_size=4836186 data_size=24033557, decompressed_data_size=24033557 diff=0
[20:50:15] - Digital signature verified
[20:50:15]
[20:50:15] Project: 2671 (Run 51, Clone 11, Gen 82)
[20:50:15]
[20:50:15] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=computenode
NNODES=4, MYRANK=2, HOSTNAME=computenode
NNODES=4, MYRANK=1, HOSTNAME=computenode
NNODES=4, MYRANK=3, HOSTNAME=computenode
NODEID=0 argc=19
NODEID=1 argc=19
NODEID=2 argc=19
NODEID=3 argc=19
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 3.3.99_development_200800503  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_08.tpr, VERSION 3.3.99_development_20070618 (single precision)
[20:50:21] Will resume from checkpoint file
Note: tpx file_version 48, software version 56
Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22887 system in water'
250000 steps,    500.0 ps.

-------------------------------------------------------
Program mdrun, VERSION 3.3.99_development_200800503
Source code file: md.c, line: 933

Fatal error:
Checkpoint error on step 16172510

-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[20:50:23] Resuming from checkpoint
[20:50:23] fcSaveRestoreState: I/O failed dir=0, var=00002AAAAC037010, varsize=578340
[cli_1]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
[0]0:Return code = 255
[0]1:Return code = 1
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[20:50:27] CoreStatus = FF (255)
[20:50:27] Sending work to server
[20:50:27] Project: 2671 (Run 51, Clone 11, Gen 82)
[20:50:27] - Error: Could not get length of results file work/wuresults_08.dat
[20:50:27] - Error: Could not read unit 08 file. Removing from queue.
[20:50:27] + Closed connections
[20:50:27] + Paused after finishing unit
[20:50:27] Press Enter to continue, Ctrl-C to exit...
I tried to re-run it. It failed, and the console output is above.

Memtest passed. The chipset heatsink is warm to the touch, but not super hot.
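
For anyone comparing several failed units, here is a minimal Python sketch that pulls the failure signals out of a pasted log like the one above. It only relies on line shapes that appear in this log ("Fatal error:" followed by the message on the next line, "CoreStatus = FF (255)", and "- Error: ..." lines). The function name `summarize_failures` and the returned dictionary layout are made up for illustration and are not part of the F@H client.

```python
import re


def summarize_failures(log_text):
    """Collect failure signals from a pasted client/core log (hypothetical helper)."""
    lines = log_text.splitlines()
    summary = {"fatal_errors": [], "core_status": None, "client_errors": []}
    for i, line in enumerate(lines):
        # Drop a leading "[HH:MM:SS]" timestamp if the line has one.
        text = re.sub(r"^\[\d{2}:\d{2}:\d{2}\]\s*", "", line).strip()
        # GROMACS prints "Fatal error:" alone, with the message on the next line.
        if text == "Fatal error:" and i + 1 < len(lines):
            summary["fatal_errors"].append(lines[i + 1].strip())
        # Client line of the form "CoreStatus = FF (255)"; keep the decimal code.
        match = re.match(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)", text)
        if match:
            summary["core_status"] = int(match.group(2))
        # Client-side errors of the form "- Error: ...".
        if text.startswith("- Error:"):
            summary["client_errors"].append(text[len("- Error:"):].strip())
    return summary
```

Run against the log above, this should pick out the checkpoint error, the core status of 255, and the two "Could not ..." client errors. It does not catch the MPI "Fatal error in MPI_Sendrecv" line, since that one has a different shape.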
toTOW
Site Moderator
Posts: 6433
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Project: 2671 (Run 51, Clone 11, Gen 82)

Post by toTOW »

Could you try the 2.04 core? It won't help with your stability issue, but it might help with corrupted checkpoints.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 51, Clone 11, Gen 82)

Post by alpha754293 »

toTOW wrote: Could you try the 2.04 core? It won't help with your stability issue, but it might help with corrupted checkpoints.
Well, those previous corrupted runs... I don't know what happened. The system right now is pretty much a dedicated F@H machine, running SLES 10 SP2 x64 on a Tyan B4882-D barebones server with 4x AMD Opteron 880, 16 GB of RAM, and two 146 GB 15k rpm U320 drives.

It's got an 850 W PSU in it, and running two "-smp 4" clients with a2 cores, it only draws about 510 W (measured with a Kill A Watt meter).

I know that some of my aborted runs corrupted the checkpoint file: if I terminate the client while it says "writing checkpoint file", I can't tell exactly when the write starts and stops. So I've learned to wait until the next frame is done (and shows up on the console) before terminating it.
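
That "wait for the next frame" rule can be sketched in a few lines of Python. This is only a rough sketch under assumptions: the checkpoint marker text is taken from the message quoted above, while the "Completed N out of M steps" frame pattern and the function name `safe_to_stop` are assumptions for illustration, so adjust both strings to whatever your own client log actually prints.

```python
import re

# Assumed marker strings; the real client log wording may differ.
CHECKPOINT_MARKER = "writing checkpoint"
FRAME_PATTERN = re.compile(r"completed\s+\d+\s+out of\s+\d+\s+steps", re.IGNORECASE)


def safe_to_stop(log_lines):
    """Return True only if the most recent relevant log event is a completed frame.

    Returns False if the latest relevant event is a checkpoint write, or if
    neither event appears at all (stay conservative when unsure).
    """
    # Walk the log backwards so the first hit is the most recent event.
    for line in reversed(log_lines):
        if CHECKPOINT_MARKER in line.lower():
            return False
        if FRAME_PATTERN.search(line):
            return True
    return False
```

Something like `safe_to_stop(open("FAHlog.txt").read().splitlines())` would be the typical call, where the log file name is again an assumption. The check is deliberately conservative: it cannot tell whether a checkpoint write has finished, only that a full frame has been reported since the last one started.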

But the other errors I've been getting seem to be computational in nature. So far, I think two of them failed to restart properly. The rest picked up where they left off after a restart and continued on, so the error wasn't persistent.

So I don't know what's going on.

*edit*
Oh, and one of the two clients has moved up to the 2.04 a2 core. The other one is waiting for the current WU to finish before I migrate it too.