Project: 2669 (Run 7, Clone 46, Gen 70)

Moderators: Site Moderators, FAHC Science Team

alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Project: 2669 (Run 7, Clone 46, Gen 70)

Post by alpha754293 »

error:

full console output:

[13:18:45] - Preparing to get new work unit...
[13:18:45] + Attempting to get work packet
[13:18:45] - Connecting to assignment server
[13:18:46] - Successful: assigned to (171.64.65.56).
[13:18:46] + News From Folding@Home: Welcome to Folding@Home
[13:18:46] Loaded queue successfully.
[13:19:03] + Closed connections
[13:19:03]
[13:19:03] + Processing work unit
[13:19:03] Core required: FahCore_a2.exe
[13:19:03] Core found.
[13:19:03] Working on queue slot 04 [February 4 13:19:03 UTC]
[13:19:03] + Working ...
[13:19:03]
[13:19:03] *------------------------------*
[13:19:03] Folding@Home Gromacs SMP Core
[13:19:03] Version 2.02 (Wed Aug 27 13:11:25 PDT 2008)
[13:19:03]
[13:19:03] Preparing to commence simulation
[13:19:03] - Ensuring status. Please wait.
[13:19:04] Called DecompressByteArray: compressed_data_size=4830853 data_size=23977801, decompressed_data_size=23977801 diff=0
[13:19:04] - Digital signature verified
[13:19:04]
[13:19:04] Project: 2669 (Run 7, Clone 46, Gen 70)
[13:19:04]
[13:19:04] Assembly optimizations on if available.
[13:19:04] Entering M.D.
[13:19:13] (Run 7, Clone 46, Gen 70)
[13:19:13]
[13:19:14] Entering M.D.
NNODES=4, MYRANK=1, HOSTNAME=computenode
NNODES=4, MYRANK=0, HOSTNAME=computenode
NNODES=4, MYRANK=2, HOSTNAME=computenode
NNODES=4, MYRANK=3, HOSTNAME=computenode
NODEID=0 argc=19
NODEID=1 argc=19
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 3.3.99_development_200800503  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_04.tpr, VERSION 3.3.99_development_20070618 (single precision)
NODEID=2 argc=19
NODEID=3 argc=19
Note: tpx file_version 48, software version 56
Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22869 system'
250000 steps,    500.0 ps.

Writing checkpoint, step 17504430 at Wed Feb  4 08:34:23 2009
[13:36:21] Completed 5008 out of 250000 steps  (2%)
[13:44:49] Completed 7508 out of 250000 steps  (3%)

Writing checkpoint, step 17508840 at Wed Feb  4 08:49:22 2009
[13:53:21] Completed 10008 out of 250000 steps  (4%)
[14:01:53] Completed 12508 out of 250000 steps  (5%)

Writing checkpoint, step 17513230 at Wed Feb  4 09:04:21 2009
[14:10:27] Completed 15008 out of 250000 steps  (6%)
[14:19:01] Completed 17508 out of 250000 steps  (7%)

Writing checkpoint, step 17517610 at Wed Feb  4 09:19:21 2009
[14:27:34] Completed 20008 out of 250000 steps  (8%)

Writing checkpoint, step 17522010 at Wed Feb  4 09:34:22 2009
[14:36:04] Completed 22508 out of 250000 steps  (9%)
[14:44:36] Completed 25008 out of 250000 steps  (10%)

Writing checkpoint, step 17526410 at Wed Feb  4 09:49:22 2009
[14:53:08] Completed 27508 out of 250000 steps  (11%)
[15:01:41] Completed 30008 out of 250000 steps  (12%)

Writing checkpoint, step 17530800 at Wed Feb  4 10:04:23 2009
[15:10:14] Completed 32508 out of 250000 steps  (13%)
[15:18:46] Completed 35008 out of 250000 steps  (14%)

Writing checkpoint, step 17535180 at Wed Feb  4 10:19:21 2009
[15:27:18] Completed 37508 out of 250000 steps  (15%)

Writing checkpoint, step 17539580 at Wed Feb  4 10:34:22 2009
[15:35:50] Completed 40008 out of 250000 steps  (16%)
[15:44:22] Completed 42508 out of 250000 steps  (17%)

Writing checkpoint, step 17543980 at Wed Feb  4 10:49:23 2009
[15:52:53] Completed 45008 out of 250000 steps  (18%)
[16:01:24] Completed 47508 out of 250000 steps  (19%)

Writing checkpoint, step 17548380 at Wed Feb  4 11:04:22 2009
[16:09:55] Completed 50008 out of 250000 steps  (20%)
[16:18:26] Completed 52508 out of 250000 steps  (21%)

Writing checkpoint, step 17552780 at Wed Feb  4 11:19:22 2009
[16:26:59] Completed 55008 out of 250000 steps  (22%)

Writing checkpoint, step 17557180 at Wed Feb  4 11:34:23 2009
[16:35:30] Completed 57508 out of 250000 steps  (23%)
[16:44:05] Completed 60008 out of 250000 steps  (24%)

Writing checkpoint, step 17561550 at Wed Feb  4 11:49:22 2009
[16:52:40] Completed 62508 out of 250000 steps  (25%)
[17:01:16] Completed 65008 out of 250000 steps  (26%)

Writing checkpoint, step 17565920 at Wed Feb  4 12:04:22 2009
[17:09:48] Completed 67508 out of 250000 steps  (27%)
[17:18:21] Completed 70008 out of 250000 steps  (28%)

Writing checkpoint, step 17570310 at Wed Feb  4 12:19:23 2009
[17:26:53] Completed 72508 out of 250000 steps  (29%)

Writing checkpoint, step 17574700 at Wed Feb  4 12:34:22 2009
[17:35:26] Completed 75008 out of 250000 steps  (30%)
[17:43:56] Completed 77508 out of 250000 steps  (31%)

Writing checkpoint, step 17579110 at Wed Feb  4 12:49:23 2009
[17:52:27] Completed 80008 out of 250000 steps  (32%)
[18:00:58] Completed 82508 out of 250000 steps  (33%)

Writing checkpoint, step 17583510 at Wed Feb  4 13:04:22 2009
[18:09:29] Completed 85008 out of 250000 steps  (34%)
[18:17:59] Completed 87508 out of 250000 steps  (35%)

Writing checkpoint, step 17587910 at Wed Feb  4 13:19:21 2009
[18:26:31] Completed 90008 out of 250000 steps  (36%)

Writing checkpoint, step 17592310 at Wed Feb  4 13:34:21 2009
[18:35:02] Completed 92508 out of 250000 steps  (37%)
[18:43:34] Completed 95008 out of 250000 steps  (38%)

Writing checkpoint, step 17596710 at Wed Feb  4 13:49:23 2009
[18:52:06] Completed 97508 out of 250000 steps  (39%)
[19:00:38] Completed 100008 out of 250000 steps  (40%)

Writing checkpoint, step 17601100 at Wed Feb  4 14:04:22 2009
[19:09:11] Completed 102508 out of 250000 steps  (41%)
[19:17:44] Completed 105008 out of 250000 steps  (42%)

Writing checkpoint, step 17605480 at Wed Feb  4 14:19:21 2009
[19:26:18] Completed 107508 out of 250000 steps  (43%)

Writing checkpoint, step 17609870 at Wed Feb  4 14:34:22 2009
[19:34:51] Completed 110008 out of 250000 steps  (44%)
[19:43:25] Completed 112508 out of 250000 steps  (45%)

Writing checkpoint, step 17614240 at Wed Feb  4 14:49:21 2009
[19:52:00] Completed 115008 out of 250000 steps  (46%)
[20:00:36] Completed 117508 out of 250000 steps  (47%)

Writing checkpoint, step 17618610 at Wed Feb  4 15:04:22 2009
[20:09:11] Completed 120008 out of 250000 steps  (48%)
[20:17:46] Completed 122508 out of 250000 steps  (49%)

Writing checkpoint, step 17622970 at Wed Feb  4 15:19:21 2009
[20:26:22] Completed 125008 out of 250000 steps  (50%)

Writing checkpoint, step 17627340 at Wed Feb  4 15:34:21 2009
[20:34:56] Completed 127508 out of 250000 steps  (51%)
[20:43:30] Completed 130008 out of 250000 steps  (52%)

Writing checkpoint, step 17631720 at Wed Feb  4 15:49:21 2009
Warning: 1-4 interaction between 4431 and 81172 at distance 3.686 which is larger than the 1-4 table size 2.200 nm
These are ignored for the rest of the simulation
This usually means your system is exploding,
if not, you should increase table-extension in your mdp file
or with user tables increase the table size

A list of missing interactions:
               LJ-14 of  60837 missing     -2

-------------------------------------------------------
Program mdrun, VERSION 3.3.99_development_200800503
Source code file: domdec_top.c, line: 87

Software inconsistency error:
Some interactions seem to be assigned multiple times

-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
[cli_0]: aborting job:
Fatal error in MPI_Allreduce: Error message texts are not available
[cli_2]: aborting job:
Fatal error in MPI_Allreduce: Error message texts are not available
[cli_3]: aborting job:
Fatal error in MPI_Allreduce: Error message texts are not available
[20:51:21] CoreStatus = FF (255)
[20:51:21] Sending work to server
[20:51:21] Project: 2669 (Run 7, Clone 46, Gen 70)
[20:51:21] - Error: Could not get length of results file work/wuresults_04.dat
[20:51:21] - Error: Could not read unit 04 file. Removing from queue.
[20:51:21] - Preparing to get new work unit...
[20:51:21] + Attempting to get work packet
[20:51:21] - Connecting to assignment server
[20:51:22] - Successful: assigned to (171.67.108.24).
[20:51:22] + News From Folding@Home: Welcome to Folding@Home
[20:51:22] Loaded queue successfully.
[20:51:23] + Could not connect to Work Server
[20:51:23] - Attempt #1  to get work failed, and no other work to do.
Waiting before retry.
[20:51:31] + Attempting to get work packet
[20:51:31] - Connecting to assignment server
[20:51:32] - Successful: assigned to (171.67.108.24).
[20:51:32] + News From Folding@Home: Welcome to Folding@Home
[20:51:32] Loaded queue successfully.
[20:51:50] + Closed connections
[20:51:55]
[20:51:55] + Processing work unit
[20:51:55] Core required: FahCore_a2.exe
[20:51:55] Core found.
[20:51:55] Working on queue slot 05 [February 4 20:51:55 UTC]
[20:51:55] + Working ...
[20:51:55]
[20:51:55] *------------------------------*
[20:51:55] Folding@Home Gromacs SMP Core
[20:51:55] Version 2.02 (Wed Aug 27 13:11:25 PDT 2008)
[20:51:55]
[20:51:55] Preparing to commence simulation
[20:51:55] - Ensuring status. Please wait.
[20:52:05] - Looking at optimizations...
[20:52:05] - Working with standard loops on this execution.
[20:52:05] - Files status OK
[20:52:06] - Expanded 4838203 -> 24030157 (decompressed 496.6 percent)
[20:52:06] Called DecompressByteArray: compressed_data_size=4838203 data_size=24030157, decompressed_data_size=24030157 diff=0
[20:52:06] - Digital signature verified
[20:52:06]
[20:52:06] Project: 2671 (Run 34, Clone 11, Gen 82)
[20:52:06]
[20:52:06] Entering M.D.
NNODES=4, MYRANK=1, HOSTNAME=computenode
NNODES=4, MYRANK=3, HOSTNAME=computenode
NNODES=4, MYRANK=0, HOSTNAME=computenode
NNODES=4, MYRANK=2, HOSTNAME=computenode
NODEID=0 argc=19
NODEID=1 argc=19
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 3.3.99_development_200800503  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_05.tpr, VERSION 3.3.99_development_20070618 (single precision)
NODEID=2 argc=19
NODEID=3 argc=19
[20:52:12] Will resume from checkpoint file
Note: tpx file_version 48, software version 56
Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22908 system in water'
250000 steps,    500.0 ps.

-------------------------------------------------------
Program mdrun, VERSION 3.3.99_development_200800503
Source code file: md.c, line: 933

Fatal error:
Checkpoint error on step 15525010

-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[20:52:14] Resuming from checkpoint
[20:52:14] fcSaveRestoreState: I/O failed dir=0, var=00002AAAAC037010, varsize=568788
[cli_1]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
attempting to restart it from last checkpoint now.
Last edited by alpha754293 on Thu Feb 05, 2009 8:47 am, edited 1 time in total.
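One thing the log above does show is the pace of the failed unit: the "Completed ... steps" lines arrive roughly eight and a half minutes apart right up to the failure. A minimal sketch for pulling those intervals out of a client log follows. The function name `frame_times` is hypothetical, and the pattern assumes the `[HH:MM:SS] Completed N out of M steps` format seen in this paste; because the timestamps carry no date, a negative difference is treated as a single midnight wrap.

```python
import re
from datetime import datetime

# Matches progress lines as they appear in the log pasted above (an assumption
# about the format; other client versions may differ).
PROGRESS = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] Completed (\d+) out of (\d+) steps")

def frame_times(log_lines):
    """Return the seconds elapsed between consecutive 'Completed' lines."""
    stamps = []
    for line in log_lines:
        m = PROGRESS.match(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%H:%M:%S"))
    deltas = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = (cur - prev).total_seconds()
        if delta < 0:  # timestamps have no date, so assume one midnight wrap
            delta += 24 * 3600
        deltas.append(delta)
    return deltas

sample = [
    "[13:36:21] Completed 5008 out of 250000 steps  (2%)",
    "[13:44:49] Completed 7508 out of 250000 steps  (3%)",
    "[13:53:21] Completed 10008 out of 250000 steps  (4%)",
]
print(frame_times(sample))  # [508.0, 512.0]
```

A sudden jump in these intervals shortly before an error could hint at throttling or contention, whereas a steady pace, as in this log, points elsewhere.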
toTOW
Site Moderator
Posts: 6433
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Project: 2671 (Run 34, Clone 11, Gen 82)

Post by toTOW »

There's no data for this WU in the DB yet ...

It's not the first time you posted similar report ... are they from the same machine ? Did you check your CPU and RAM stability ?

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 34, Clone 11, Gen 82)

Post by alpha754293 »

toTOW wrote:There's no data for this WU in the DB yet ...

It's not the first time you posted similar report ... are they from the same machine ? Did you check your CPU and RAM stability ?
Yes, it's all from the same machine.

CPU and RAM are known good (although no, I haven't run memtest on it). I don't know of a way to test CPU stability, but it's been able to do other units just fine (about 6k PPD on other WUs).

And sometimes, if I restart the run, it'll complete it.

I just report the errors as I come across them.

*edit*

I think that so far, of all the WUs that I have reported having problems, they seem to finish just fine when I restart it from where it left off.
toTOW
Site Moderator
Posts: 6433
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Project: 2671 (Run 34, Clone 11, Gen 82)

Post by toTOW »

alpha754293 wrote:I think that so far, of all the WUs that I have reported having problems, they seem to finish just fine when I restart it from where it left off.
That's what makes me think about a random memory error (bad chip ?) or something similar.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 34, Clone 11, Gen 82)

Post by alpha754293 »

toTOW wrote:
alpha754293 wrote:I think that so far, of all the WUs that I have reported having problems, they seem to finish just fine when I restart it from where it left off.
That's what makes me think about a random memory error (bad chip ?) or something similar.
I can run memtest to be sure, but I don't think that there's an issue with it though because I would think that if there was a problem, it would show up more consistently, but given the sporadic failures, it isn't enough of a cause for concern yet.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2669 (Run 7, Clone 46, Gen 70)

Post by alpha754293 »

oops. my bad. okay...there were actually TWO WUs in there that had problems.

Project: 2669 (Run 7, Clone 46, Gen 70)

AND

Project: 2671 (Run 34, Clone 11, Gen 82)

I was re-examining the logfile. Sorry about that.
toTOW
Site Moderator
Posts: 6433
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Project: 2671 (Run 34, Clone 11, Gen 82)

Post by toTOW »

alpha754293 wrote:
toTOW wrote:
alpha754293 wrote:I think that so far, of all the WUs that I have reported having problems, they seem to finish just fine when I restart it from where it left off.
That's what makes me think about a random memory error (bad chip ?) or something similar.
I can run memtest to be sure, but I don't think that there's an issue with it though because I would think that if there was a problem, it would show up more consistently, but given the sporadic failures, it isn't enough of a cause for concern yet.
That's exactly the kind of issue I was seeing when my overclocked X2 failed. It was a memory issue (more voltage, or lower clock, and the issue was gone).

FAH is very good at finding this kind of issue ... it might be something else, but I think it's worth checking.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 34, Clone 11, Gen 82)

Post by alpha754293 »

toTOW wrote:
alpha754293 wrote:
toTOW wrote: That's what makes me think about a random memory error (bad chip ?) or something similar.
I can run memtest to be sure, but I don't think that there's an issue with it though because I would think that if there was a problem, it would show up more consistently, but given the sporadic failures, it isn't enough of a cause for concern yet.
That's exactly the kind of issue I was seeing when my overclocked X2 failed. It was a memory issue (more voltage, or lower clock, and the issue was gone).

FAH is very good at finding this kind of issue ... it might be something else, but I think it's worth checking.
It's a Tyan server though. I don't have any options for overclocking or voltage control.

But like I said, I can run memtest when the current WUs are finished and then report back.

By the same token, I would expect that if there was a real issue with it, I wouldn't see it only sporadically.

I'm not entirely sure what the different core statuses mean (not sure if that's a F@H thing or a GROMACS thing), but what's weird is that when I restart the runs, MOST of them (I think so far, all except one or two) pick up from where they left off and continue on. Whether that has an adverse effect on the final results, I don't know. But if it was a consistent memory problem, then I would have expected it to fail at the same place because (presumably) the sequence of operations will be the same.

If there was some kind of consistency in either the type of failure, or the level or location of failure, then that would be easier for me to track.

At this point in time, unfortunately, there isn't, which makes it all the harder to track.

Who knows.
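The consistency check described in this post (does the failure recur at the same place, with the same status?) could be sketched as a small log scan. This is only an illustration under assumptions: the name `core_exits` is hypothetical, and the patterns assume the `Project: ...`, `Completed ... steps (N%)`, and `CoreStatus = XX (n)` line formats seen in the log in the first post. It records every CoreStatus line, and leaves the interpretation of each code to the reader.

```python
import re

# Patterns assume the client log format shown in the first post of this thread.
WU = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
DONE = re.compile(r"Completed \d+ out of \d+ steps\s+\((\d+)%\)")
STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")

def core_exits(log_lines):
    """Return (work_unit, last_percent_seen, core_status_hex) per CoreStatus line."""
    exits = []
    current_wu = None
    last_pct = None
    for line in log_lines:
        m = WU.search(line)
        if m:
            wu = tuple(int(g) for g in m.groups())
            if wu != current_wu:  # a different unit has started
                current_wu, last_pct = wu, None
            continue
        m = DONE.search(line)
        if m:
            last_pct = int(m.group(1))
            continue
        m = STATUS.search(line)
        if m:
            exits.append((current_wu, last_pct, m.group(1)))
    return exits

sample = [
    "[13:19:04] Project: 2669 (Run 7, Clone 46, Gen 70)",
    "[20:34:56] Completed 127508 out of 250000 steps  (51%)",
    "[20:43:30] Completed 130008 out of 250000 steps  (52%)",
    "[20:51:21] CoreStatus = FF (255)",
    "[20:51:21] Project: 2669 (Run 7, Clone 46, Gen 70)",
]
print(core_exits(sample))  # [((2669, 7, 46, 70), 52, 'FF')]
```

Collecting these tuples over several days of logs would show whether the exits cluster at one percentage, one project, or one status code, or whether they really are scattered.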
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2669 (Run 7, Clone 46, Gen 70)

Post by bruce »

Rather than a memory error on the RAM itself, you may have a chipset error, which looks just like a memory error. Does your memory chipset have a heatsink, and is it hot to touch?

. . . and do run Memtest.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2669 (Run 7, Clone 46, Gen 70)

Post by alpha754293 »

I believe so. I don't remember. Yes, there are heatsinks on it, but I don't know how hot they are, as they're in the 2U rackmount chassis that it came in and I don't really pop that open too often (if at all, actually) to check the temps.

memtest passed. (BTW...it takes a fairly long time to test 16 GB of DDR400!!!)

I don't expect there to be anything that's wrong with the system physically because it is a Tyan B4882-D 2U server and they're well designed.