p2671 -- all Gen 17

Moderators: Site Moderators, FAHC Science Team

alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

p2671 -- all Gen 17

Post by alpha754293 »

console:

Code: Select all

                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_09.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

Reading checkpoint file work/wudata_09.cpt generated: Fri Apr 24 08:19:50 2009


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: checkpoint.c, line: 1151

Fatal error:
Checkpoint file is for a system of 147225 atoms, while the current system consists of 146898 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[11:55:03] CoreStatus = FF (255)
[11:55:03] Sending work to server
[11:55:03] Project: 2671 (Run 22, Clone 85, Gen 17)
[11:55:03] - Error: Could not get length of results file work/wuresults_09.dat
[11:55:03] - Error: Could not read unit 09 file. Removing from queue.
[11:55:03] Trying to send all finished work units
[11:55:03] + No unsent completed units remaining.
[11:55:03] - Preparing to get new work unit...
[11:55:03] + Attempting to get work packet
[11:55:03] - Will indicate memory of 16003 MB
[11:55:03] - Connecting to assignment server
[11:55:03] Connecting to http://assign.stanford.edu:8080/
[11:55:03] Posted data.
[11:55:03] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[11:55:03] + News From Folding@Home: Welcome to Folding@Home
[11:55:03] Loaded queue successfully.
[11:55:03] Connecting to http://171.67.108.24:8080/
[11:55:10] Posted data.
[11:55:10] Initial: 0000; - Receiving payload (expected size: 4845286)
[11:55:31] - Downloaded at ~225 kB/s
[11:55:31] - Averaged speed for that direction ~330 kB/s
[11:55:31] + Received work.
[11:55:31] Trying to send all finished work units
[11:55:31] + No unsent completed units remaining.
[11:55:31] + Closed connections
[11:55:36]
[11:55:36] + Processing work unit
[11:55:36] Core required: FahCore_a2.exe
[11:55:36] Core found.
[11:55:36] Working on queue slot 00 [April 28 11:55:36 UTC]
[11:55:36] + Working ...
[11:55:36] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 00 -checkpoint 15 -verbose -lifeline 29452 -version 624'

[11:55:36]
[11:55:36] *------------------------------*
[11:55:36] Folding@Home Gromacs SMP Core
[11:55:36] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[11:55:36]
[11:55:36] Preparing to commence simulation
[11:55:36] - Ensuring status. Please wait.
[11:55:46] - Looking at optimizations...
[11:55:46] - Working with standard loops on this execution.
[11:55:46] - Files status OK
[11:55:47] - Expanded 4844774 -> 24012685 (decompressed 495.6 percent)
[11:55:47] Called DecompressByteArray: compressed_data_size=4844774 data_size=24012685, decompressed_data_size=24012685 diff=0
[11:55:47] - Digital signature verified
[11:55:47]
[11:55:47] Project: 2671 (Run 22, Clone 98, Gen 17)
[11:55:47]
[11:55:47] Entering M.D.
[11:55:53] Using Gromacs checkpoints
NNODES=4, MYRANK=0, HOSTNAME=computenode
NNODES=4, MYRANK=1, HOSTNAME=computenode
NNODES=4, MYRANK=2, HOSTNAME=computenode
NNODES=4, MYRANK=3, HOSTNAME=computenode
NODEID=0 argc=23
NODEID=1 argc=23
NODEID=3 argc=23
                         :-)  G  R  O  M  A  C  S  (-:

NODEID=2 argc=23
                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_00.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

Reading checkpoint file work/wudata_00.cpt generated: Mon Apr 27 13:07:16 2009


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: checkpoint.c, line: 1151

Fatal error:
Checkpoint file is for a system of 147117 atoms, while the current system consists of 146898 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[11:56:00] CoreStatus = FF (255)
[11:56:00] Sending work to server
[11:56:00] Project: 2671 (Run 22, Clone 98, Gen 17)
[11:56:00] - Error: Could not get length of results file work/wuresults_00.dat
[11:56:00] - Error: Could not read unit 00 file. Removing from queue.
[11:56:00] Trying to send all finished work units
[11:56:00] + No unsent completed units remaining.
[11:56:00] - Preparing to get new work unit...
[11:56:00] + Attempting to get work packet
[11:56:00] - Will indicate memory of 16003 MB
[11:56:00] - Connecting to assignment server
[11:56:00] Connecting to http://assign.stanford.edu:8080/
[11:56:01] Posted data.
[11:56:01] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[11:56:01] + News From Folding@Home: Welcome to Folding@Home
[11:56:01] Loaded queue successfully.
[11:56:01] Connecting to http://171.67.108.24:8080/
[11:56:01] Posted data.
[11:56:01] Initial: 0000; - Error: Bad packet type from server, expected work assignment
[11:56:02] - Attempt #1  to get work failed, and no other work to do.
Waiting before retry.
[11:56:19] + Attempting to get work packet
[11:56:19] - Will indicate memory of 16003 MB
[11:56:19] - Connecting to assignment server
[11:56:19] Connecting to http://assign.stanford.edu:8080/
[11:56:19] Posted data.
[11:56:19] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[11:56:19] + News From Folding@Home: Welcome to Folding@Home
[11:56:19] Loaded queue successfully.
[11:56:19] Connecting to http://171.67.108.24:8080/
[11:56:26] Posted data.
[11:56:26] Initial: 0000; - Receiving payload (expected size: 4825625)
[11:56:38] - Downloaded at ~392 kB/s
[11:56:38] - Averaged speed for that direction ~343 kB/s
[11:56:38] + Received work.
[11:56:38] Trying to send all finished work units
[11:56:38] + No unsent completed units remaining.
[11:56:38] + Closed connections
[11:56:43]
[11:56:43] + Processing work unit
[11:56:43] Core required: FahCore_a2.exe
[11:56:43] Core found.
[11:56:43] Working on queue slot 01 [April 28 11:56:43 UTC]
[11:56:43] + Working ...
[11:56:43] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 29452 -version 624'

[11:56:43]
[11:56:43] *------------------------------*
[11:56:43] Folding@Home Gromacs SMP Core
[11:56:43] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[11:56:43]
[11:56:43] Preparing to commence simulation
[11:56:43] - Ensuring status. Please wait.
[11:56:53] - Looking at optimizations...
[11:56:53] - Working with standard loops on this execution.
[11:56:53] - Files status OK
[11:56:54] - Expanded 4825113 -> 24057089 (decompressed 498.5 percent)
[11:56:54] Called DecompressByteArray: compressed_data_size=4825113 data_size=24057089, decompressed_data_size=24057089 diff=0
[11:56:54] - Digital signature verified
[11:56:54]
[11:56:54] Project: 2671 (Run 18, Clone 87, Gen 17)
[11:56:54]
[11:56:54] Entering M.D.
[11:57:00] Using Gromacs checkpoints
NNODES=4, MYRANK=1, HOSTNAME=computenode
NNODES=4, MYRANK=2, HOSTNAME=computenode
NNODES=4, MYRANK=3, HOSTNAME=computenode
NNODES=4, MYRANK=0, HOSTNAME=computenode
NODEID=0 argc=23
NODEID=1 argc=23
NODEID=3 argc=23
NODEID=2 argc=23
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_01.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

Reading checkpoint file work/wudata_01.cpt generated: Mon Apr 20 15:43:31 2009


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: checkpoint.c, line: 1151

Fatal error:
Checkpoint file is for a system of 147024 atoms, while the current system consists of 147246 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[11:57:07] CoreStatus = FF (255)
[11:57:07] Sending work to server
[11:57:07] Project: 2671 (Run 18, Clone 87, Gen 17)
[11:57:07] - Error: Could not get length of results file work/wuresults_01.dat
[11:57:07] - Error: Could not read unit 01 file. Removing from queue.
[11:57:07] Trying to send all finished work units
[11:57:07] + No unsent completed units remaining.
[11:57:07] - Preparing to get new work unit...
[11:57:07] + Attempting to get work packet
[11:57:07] - Will indicate memory of 16003 MB
[11:57:07] - Connecting to assignment server
[11:57:07] Connecting to http://assign.stanford.edu:8080/
[11:57:07] Posted data.
[11:57:07] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[11:57:07] + News From Folding@Home: Welcome to Folding@Home
[11:57:07] Loaded queue successfully.
[11:57:07] Connecting to http://171.67.108.24:8080/
[11:57:08] Posted data.
[11:57:08] Initial: 0000; - Error: Bad packet type from server, expected work assignment
[11:57:08] - Attempt #1  to get work failed, and no other work to do.
Waiting before retry.
What's going on here? Why are there so many bad WUs?

There's a bunch more before and after it as well.

I just dumped the queue.dat and *.pdb files in the client directory, along with the work directory, and that seems to have fixed things/got it moving along again.
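For reference, the cleanup was roughly the following (a sketch of what that amounted to; run it with the client stopped, from the client's install directory):

Code: Select all

# with the client stopped, from the client's directory:
rm -f queue.dat *.pdb      # drop the queue file and the stray structure files
rm -rf work/*              # clear out the work directory
# then restart the client so it fetches a fresh WU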

Any ideas as to what was wrong with it in the first place?
toTOW
Site Moderator
Posts: 6435
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: lots of WU issues -- see console output

Post by toTOW »

It looks like there was some remaining checkpoint file from previous failures, or that's a lot of bad WUs ...

I checked them :
Project: 2671 (Run 22, Clone 85, Gen 17) : no data in the DB yet.
Project: 2671 (Run 22, Clone 98, Gen 17) : no data in the DB yet.
Project: 2671 (Run 18, Clone 87, Gen 17) : no data in the DB yet.

That might be an issue with Gen 17 on this project ... Have you been able to complete some p2671 Gen 17 WUs ?

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: lots of WU issues -- see console output

Post by alpha754293 »

toTOW wrote:It looks like there was some remaining checkpoint file from previous failures, or that's a lot of bad WUs ...

I checked them :
Project: 2671 (Run 22, Clone 85, Gen 17) : no data in the DB yet.
Project: 2671 (Run 22, Clone 98, Gen 17) : no data in the DB yet.
Project: 2671 (Run 18, Clone 87, Gen 17) : no data in the DB yet.

That might be an issue with Gen 17 on this project ... Have you been able to complete some p2671 Gen 17 WUs ?
Uh....how do I check that?

I don't think that FahMon or the stats @ folding.stanford.edu can tell me.

And I'm pretty sure that the logs have rotated a few times (from me starting and restarting the clients), so they probably don't go back far enough to be able to tell.

*edit*
here's the console output:

Code: Select all

$ grep Gen\ 17 *.txt
FAHlog-Prev.txt:[21:35:16] Project: 2671 (Run 4, Clone 88, Gen 17)
FAHlog-Prev.txt:[01:10:03] Project: 2671 (Run 7, Clone 64, Gen 17)
FAHlog-Prev.txt:[01:10:16] Project: 2671 (Run 7, Clone 64, Gen 17)
FAHlog-Prev.txt:[01:11:03] Project: 2671 (Run 7, Clone 64, Gen 17)
FAHlog-Prev.txt:[01:11:16] Project: 2671 (Run 7, Clone 64, Gen 17)
FAHlog-Prev.txt:[01:12:13] Project: 2671 (Run 7, Clone 73, Gen 17)
FAHlog-Prev.txt:[01:12:26] Project: 2671 (Run 7, Clone 73, Gen 17)
FAHlog-Prev.txt:[01:13:08] Project: 2671 (Run 7, Clone 79, Gen 17)
FAHlog-Prev.txt:[01:13:21] Project: 2671 (Run 7, Clone 79, Gen 17)
FAHlog-Prev.txt:[01:13:55] Project: 2671 (Run 7, Clone 79, Gen 17)
FAHlog-Prev.txt:[01:14:09] Project: 2671 (Run 7, Clone 79, Gen 17)
FAHlog-Prev.txt:[01:15:03] Project: 2671 (Run 7, Clone 84, Gen 17)
FAHlog-Prev.txt:[01:15:16] Project: 2671 (Run 7, Clone 84, Gen 17)
FAHlog-Prev.txt:[11:53:02] Project: 2671 (Run 7, Clone 47, Gen 17)
FAHlog-Prev.txt:[11:53:15] Project: 2671 (Run 7, Clone 47, Gen 17)
FAHlog-Prev.txt:[11:53:59] Project: 2671 (Run 22, Clone 85, Gen 17)
FAHlog-Prev.txt:[11:54:13] Project: 2671 (Run 22, Clone 85, Gen 17)
FAHlog-Prev.txt:[11:54:50] Project: 2671 (Run 22, Clone 85, Gen 17)
FAHlog-Prev.txt:[11:55:03] Project: 2671 (Run 22, Clone 85, Gen 17)
FAHlog-Prev.txt:[11:55:47] Project: 2671 (Run 22, Clone 98, Gen 17)
FAHlog-Prev.txt:[11:56:00] Project: 2671 (Run 22, Clone 98, Gen 17)
FAHlog-Prev.txt:[11:56:54] Project: 2671 (Run 18, Clone 87, Gen 17)
FAHlog-Prev.txt:[11:57:07] Project: 2671 (Run 18, Clone 87, Gen 17)
FAHlog-Prev.txt:[11:58:04] Project: 2671 (Run 23, Clone 36, Gen 17)
FAHlog-Prev.txt:[11:58:17] Project: 2671 (Run 23, Clone 36, Gen 17)
FAHlog-Prev.txt:[11:58:53] Project: 2671 (Run 23, Clone 36, Gen 17)
FAHlog-Prev.txt:[11:59:06] Project: 2671 (Run 23, Clone 36, Gen 17)
FAHlog-Prev.txt:[11:59:38] Project: 2671 (Run 23, Clone 48, Gen 17)
FAHlog-Prev.txt:[11:59:52] Project: 2671 (Run 23, Clone 48, Gen 17)
FAHlog-Prev.txt:[12:00:35] Project: 2671 (Run 23, Clone 57, Gen 17)
FAHlog-Prev.txt:[12:00:48] Project: 2671 (Run 23, Clone 57, Gen 17)
FAHlog-Prev.txt:[12:01:27] Project: 2671 (Run 23, Clone 57, Gen 17)
FAHlog-Prev.txt:[12:01:40] Project: 2671 (Run 23, Clone 57, Gen 17)
Judging by the DTS, I don't think so. But then again, that says nothing of the 20+ day uptime of the system, as I'm quite certain the log isn't long enough to reflect that.

I think that you'd have to check the submission history in order to find that out.
toTOW
Site Moderator
Posts: 6435
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: p2671 -- all Gen 17

Post by toTOW »

I expected that you might have some in older FAHLog ... ;)

I suspect an issue here ... I've checked some few more WUs on the surrounding clones :

Project: 2671 (Run 22, Clone 85-88-89, Gen 17) : previous generation completed two days ago, none of them completed fine with Gen 17

Project: 2671 (Run 22, Clone 93-95-98-99, Gen 17) : previous generation completed two days ago, none of them completed fine with Gen 17

Project: 2671 (Run 18, Clone 87, Gen 17) : no data for Gen 16 :shock: ... Gen 15 completed yesterday
Project: 2671 (Run 18, Clone 88, Gen 17) : Gen 16 completed yesterday ... no data for Gen 17
Project: 2671 (Run 18, Clone 88, Gen 17) : Gen 16 completed yesterday ... no data for Gen 17

Keep watching your machine, and report how the other p2671 Gen 17 WUs you get will do ... I've sent a mail to kasson to ask him to look at the project.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: p2671 -- all Gen 17

Post by alpha754293 »

toTOW wrote:I expected that you might have some in older FAHLog ... ;)

I suspect an issue here ... I've checked some few more WUs on the surrounding clones :

Project: 2671 (Run 22, Clone 85-88-89, Gen 17) : previous generation completed two days ago, none of them completed fine with Gen 17

Project: 2671 (Run 22, Clone 93-95-98-99, Gen 17) : previous generation completed two days ago, none of them completed fine with Gen 17

Project: 2671 (Run 18, Clone 87, Gen 17) : no data for Gen 16 :eek: ... Gen 15 completed yesterday
Project: 2671 (Run 18, Clone 88, Gen 17) : Gen 16 completed yesterday ... no data for Gen 17
Project: 2671 (Run 18, Clone 88, Gen 17) : Gen 16 completed yesterday ... no data for Gen 17

Keep watching your machine, and report how the other p2671 Gen 17 WUs you get will do ... I've sent a mail to kasson to ask him to look at the project.
No, because FAHlog-prev only makes one copy. Anything older than that gets replaced, especially if you're testing/playing around with starting/restarting clients like I was yesterday.

Pity that there isn't an option to just append all logs.

And if I knew of a way to pipe the console output into a file, but still have it on display, so that I would be able to monitor the system headlessly, that would be ideal.

I'll keep an eye out for all p2671 g17s and will let you know if anything turns up.

Highly/strongly suggest that they change the way the program logs it though, because when I run the system headlessly, a LOT of errors like that don't get picked up on, esp. when it's just cycling through WUs like that.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: p2671 -- all Gen 17

Post by bruce »

alpha754293 wrote:And if I knew of a way to pipe the console output into a file, but still have it on display, so that I would be able to monitor the system headlessly, that would be ideal.
Some versions of *nix have a command called "tee" which does exactly that -- writes the input data to a file AND passes it to the output. That wouldn't be a difficult program to write. ;)
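Something along these lines should do it (an untested sketch; adjust the client binary name to whatever you actually run, and the 2>&1 is only needed if you also want error messages in the file):

Code: Select all

# show the console output on screen and append a copy to a log file
./fah6 2>&1 | tee -a console.log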
toTOW
Site Moderator
Posts: 6435
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: lots of WU issues -- see console output

Post by toTOW »

Thanks for the list ... here are the results for the DB :

[21:35:16] Project: 2671 (Run 4, Clone 88, Gen 17)
Completed by 3 donors.

[01:10:03] Project: 2671 (Run 7, Clone 64, Gen 17)
Completed by 1 donor.

[01:12:13] Project: 2671 (Run 7, Clone 73, Gen 17)
No data. Gen 16 completed two days ago.

[01:13:08] Project: 2671 (Run 7, Clone 79, Gen 17)
No data. Gen 16 completed yesterday.

[01:15:03] Project: 2671 (Run 7, Clone 84, Gen 17)
No data. Gen 16 completed two days ago.

[11:53:02] Project: 2671 (Run 7, Clone 47, Gen 17)
No data for both Gen 16 and 17 :shock:. Gen 15 completed two days ago.

[11:53:59] Project: 2671 (Run 22, Clone 85, Gen 17)
No data. Gen 16 completed yesterday.

[11:55:47] Project: 2671 (Run 22, Clone 98, Gen 17)
No data. Gen 16 completed yesterday.

[11:56:54] Project: 2671 (Run 18, Clone 87, Gen 17)
No data for both Gen 16 and 17 :shock:. Gen 15 completed yesterday.

[11:58:04] Project: 2671 (Run 23, Clone 36, Gen 17)
No data. Gen 16 completed yesterday.

[11:59:38] Project: 2671 (Run 23, Clone 48, Gen 17)
No data. Gen 16 completed yesterday.

[12:00:35] Project: 2671 (Run 23, Clone 57, Gen 17)
No data. Gen 16 completed yesterday.

What happens if you start your client with a command like : "./fah6 > terminaloutput.txt" (or "./fah6 >& terminaloutput.txt" to include error messages) ? Does it write the terminal output to a file called terminaloutput.txt ?

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: p2671 -- all Gen 17

Post by alpha754293 »

bruce wrote:
alpha754293 wrote:And if I knew of a way to pipe the console output into a file, but still have it on display, so that I would be able to monitor the system headlessly, that would be ideal.
Some versions of *nix have a command called "tee" which does exactly that -- writes the input data to a file AND passes it to the output. That wouldn't be a difficult program to write. ;)
For those that are programmers, absolutely. For those that aren't...it's a whole different story.
toTOW wrote:What happens if you start your client with a command like : "./fah6 > terminaloutput.txt" (or "./fah6 >& terminaloutput.txt" to include error messages) ? Does it write the terminal output to a file called terminaloutput.txt ?
Sadly, if you just use ">" to dump the output to a file, I can parse the file periodically, but the idea is to be as uninvolved in the whole process as possible. And even if there were failures, "top" wouldn't be sufficient to determine the status of the program, because I've had orphaned cores still running around taking up CPU time even though the master process was dead and/or defunct.
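(Something like this would at least show whether any of those cores are still hanging around after the master process dies -- just a rough check, assuming the core binaries are named FahCore_a2.exe as in the logs above:)

Code: Select all

# list any folding core processes still alive, with their parent PID and runtime
ps -eo pid,ppid,etime,args | grep '[F]ahCore'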

(to answer your questions, yes, it will do that, but...see above).

Because I'm running the system in headless mode right now, if I use ">" it will dump the output to a file, but that would likely mean that I wouldn't see it on screen either. (Even when I wasn't running it headlessly, I didn't check the console all that often.) So it is quite possible that there were MANY MANY MANY errors that were missed on account of it.

FahMon only reports those that are running, and those that have failed to run after proper initialization. For errors such as these, the native FAHlog doesn't sufficiently capture these errors (or trap them, as it were).

It'd be a whole different issue if there were a 'debug' mode you could put the core in that forced it to be a lot more explicit: display the error and also append it to a debug log, so that even if you HAD to restart the client you wouldn't lose all of the log. That would be better than the current "save one previous" behaviour (because sometimes I've had to restart the client a few times in order to clear the error).
bollix47
Posts: 2976
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: p2671 -- all Gen 17

Post by bollix47 »

Try:

Code: Select all

./fah6 | tee -a foldinglog.txt
I just tried it and it seems to be working .... output shows in console and is being written to a file called foldinglog.txt. The -a option will append to the file rather than overwrite it.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: p2671 -- all Gen 17

Post by alpha754293 »

BTW...here's something interesting for you:

on another system of mine (dual AMD Opteron 2220 on Tyan S2915WA2NRF, 8 GB, RHEL 4 WS):

Code: Select all

[share@opteron3 fah3]$ grep Gen\ 17 *.txt
FAHlog.txt:[11:28:16] Project: 2671 (Run 21, Clone 36, Gen 17)
[share@opteron3 fah3]$ tail -n 40 FAHlog.txt
[11:28:15] + Closed connections
[11:28:15]
[11:28:15] + Processing work unit
[11:28:15] Core required: FahCore_a2.exe
[11:28:15] Core found.
[11:28:15] Working on Unit 05 [April 28 11:28:15]
[11:28:15] + Working ...
[11:28:15] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 05 -np 4 -checkpoint 15 -verbose -lifeline 3747 -version 602'

[11:28:15]
[11:28:15] *------------------------------*
[11:28:15] Folding@Home Gromacs SMP Core
[11:28:15] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[11:28:15]
[11:28:15] Preparing to commence simulation
[11:28:15] - Ensuring status. Please wait.
[11:28:16] Called DecompressByteArray: compressed_data_size=4830951 data_size=24041389, decompressed_data_size=24041389 diff=0
[11:28:16] - Digital signature verified
[11:28:16]
[11:28:16] Project: 2671 (Run 21, Clone 36, Gen 17)
[11:28:16]
[11:28:16] Assembly optimizations on if available.
[11:28:16] Entering M.D.
[11:28:24] Multi-core optimizations on
[11:28:26] ntering M.D.
[11:28:33] Multi-core optimizations on
[11:28:35] Completed 0 out of 250001 steps  (0%)
[11:36:19] Completed 2501 out of 250001 steps  (1%)
[11:44:03] Completed 5001 out of 250001 steps  (2%)
[11:51:46] Completed 7501 out of 250001 steps  (3%)
[11:59:29] Completed 10001 out of 250001 steps  (4%)
[12:07:12] Completed 12501 out of 250001 steps  (5%)
[12:14:55] Completed 15001 out of 250001 steps  (6%)
[12:22:38] Completed 17501 out of 250001 steps  (7%)
[12:30:21] Completed 20001 out of 250001 steps  (8%)
[12:38:04] Completed 22501 out of 250001 steps  (9%)
[12:45:48] Completed 25001 out of 250001 steps  (10%)
[12:53:31] Completed 27501 out of 250001 steps  (11%)
[13:01:15] Completed 30001 out of 250001 steps  (12%)
[13:08:58] Completed 32501 out of 250001 steps  (13%)
Apparently Project: 2671 (Run 21, Clone 36, Gen 17) IS currently running on this system.

Picked up the WU Apr 28 06:28 EST 2009. Est. completion date: Apr 29 07:30 EST 2009.

I CAN, if you want, try to make a copy of it right now and put it on my other machine to see if it will run it or if it will give me the same type/class of error that we've been seeing, but I do find it weird how it is running on one of my systems, but not the other.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: p2671 -- all Gen 17

Post by alpha754293 »

And to confuse you even further, I just restarted the client on my quad AMD Opteron 880 (on Tyan B4882, 16 GB, SLES10 SP2).

Code: Select all

[13:14:59]
[13:14:59] *------------------------------*
[13:14:59] Folding@Home Gromacs SMP Core
[13:14:59] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[13:14:59]
[13:14:59] Preparing to commence simulation
[13:14:59] - Ensuring status. Please wait.
[13:15:09] - Looking at optimizations...
[13:15:09] - Working with standard loops on this execution.
[13:15:09] - Files status OK
[13:15:10] - Expanded 4822845 -> 24064269 (decompressed 498.9 percent)
[13:15:10] Called DecompressByteArray: compressed_data_size=4822845 data_size=24064269, decompressed_data_size=24064269 diff=0
[13:15:10] - Digital signature verified
[13:15:10]
[13:15:10] Project: 2671 (Run 28, Clone 33, Gen 17)
[13:15:10]
[13:15:10] Entering M.D.
[13:15:20] Completed 0 out of 250001 steps  (0%)
Seems like that's running now...
toTOW
Site Moderator
Posts: 6435
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: p2671 -- all Gen 17

Post by toTOW »

Well, the issue might affect only some particular Runs ... or you had some files left over from a previous WU which messed up the checkpoint check system (did you clean the /work folder of all files whose number doesn't match the active queue slot ?).

Runs with known successes : 4 - 7 - 21 - 28
Runs with reported issues : 18 - 22
Runs with not enough data to conclude : 7 - 18 - 22 - 23
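
On the /work folder point, something like this would show which unit numbers still have files lying around (a sketch, assuming the usual wudata_XX / wuresults_XX naming you can see in the logs):

Code: Select all

# list the two-digit unit numbers present in the work folder
ls work/ | sed -n 's/.*_\([0-9][0-9]\)\..*/\1/p' | sort -u
Anything that doesn't match the slot the client says it is currently working on in FAHlog.txt is left over from an earlier WU.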

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: p2671 -- all Gen 17

Post by alpha754293 »

toTOW wrote:Well, the issue might affect only some particular Runs ... or you had some files left over from a previous WU which messed up the checkpoint check system (did you clean the /work folder of all files whose number doesn't match the active queue slot ?).

Runs with known successes : 4 - 7 - 21 - 28
Runs with reported issues : 18 - 22
Runs with not enough data to conclude : 7 - 18 - 22 - 23
As I mentioned, I purged the entire work directory (rather than doing it slot by slot). Quicker, easier, faster, and it said that there were no outstanding results to be sent, so I figured that I was in the clear to do so.
tear
Posts: 254
Joined: Sun Dec 02, 2007 4:08 am
Hardware configuration: None
Location: Rocky Mountains

Re: p2671 -- all Gen 17

Post by tear »

I'm seeing something very peculiar with P2671/G17 units here too.

They are not crashing though. Instead, every single instance seems to be affected by huge-number-in-unitinfo bug.

Pipe chars not included for clarity.

Code: Select all

Tag: P2671R26C43G17
Progress: 1717995%  []
Tag: P2671R24C95G17
Progress: 1717997%  []
Tag: P2671R11C38G17
Progress: 1718031%  []
Tag: P2671R11C77G17
Progress: 1718030%  []
Tag: P2671R10C52G17
Progress: 1718038%  []
Tag: P2671R8C68G17
Progress: 1718058%  []
Tag: P2671R2C57G17
Progress: 1718051%  []
Tag: P2671R14C68G17
Progress: 1718026%  []
Tag: P2671R25C36G17
Progress: 1718005%  []
Tag: P2671R28C84G17
Progress: 1717991%  []
Tag: P2671R24C72G17
Progress: 1718000%  []
Tag: P2671R13C68G17
Progress: 1718036%  []
I wonder if all that has anything to do with a bug in fah core and its interaction
with address space randomization (I disabled it [AS randomization, not the bug :roll:]
long, long time ago).

Alpha -- if you add the following line to your /etc/sysctl.conf

Code: Select all

kernel.randomize_va_space = 0
and call

Code: Select all

sysctl -p
from root, does it make any difference? [it requires (re-)starting the client tho]
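You can double-check that the setting actually took with:

Code: Select all

sysctl kernel.randomize_va_space
it should report kernel.randomize_va_space = 0 after the change.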

Cheers,
tear
One man's ceiling is another man's floor.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: p2671 -- all Gen 17

Post by alpha754293 »

tear wrote:I'm seeing something very peculiar with P2671/G17 units here too.

They are not crashing though. Instead, every single instance seems to be affected by huge-number-in-unitinfo bug.

Pipe chars not included for clarity.

Code: Select all

Tag: P2671R26C43G17
Progress: 1717995%  []
Tag: P2671R24C95G17
Progress: 1717997%  []
Tag: P2671R11C38G17
Progress: 1718031%  []
Tag: P2671R11C77G17
Progress: 1718030%  []
Tag: P2671R10C52G17
Progress: 1718038%  []
Tag: P2671R8C68G17
Progress: 1718058%  []
Tag: P2671R2C57G17
Progress: 1718051%  []
Tag: P2671R14C68G17
Progress: 1718026%  []
Tag: P2671R25C36G17
Progress: 1718005%  []
Tag: P2671R28C84G17
Progress: 1717991%  []
Tag: P2671R24C72G17
Progress: 1718000%  []
Tag: P2671R13C68G17
Progress: 1718036%  []
I wonder if all that has anything to do with a bug in fah core and its interaction
with address space randomization (I disabled it [AS randomization, not the bug :roll:]
long, long time ago).

Alpha -- if you add the following line to your /etc/sysctl.conf

Code: Select all

kernel.randomize_va_space = 0
and call

Code: Select all

sysctl -p
from root, does it make any difference? [it requires (re-)starting the client tho]

Cheers,
tear
k...did that. added the line to /etc/sysctl.conf and restarted the clients.

How will I tell if there's a difference?

I restarted the clients (running two "-smp 4" clients, as I mentioned in my other threads).

client 1 is Project: 2676 (Run 2, Clone 76, Gen 82) - restart ok.
client 2 was Project: 2671 (Run 28, Clone 33, Gen 17) - seg fault on stop. Cleared on restart, picked up the same WU though from assign (171.67.108.24:8080). Currently running P2671R28C33G17 (again). Effectively restarting from scratch. New download time = Apr 28 10:33 EST 2009. New ETA N/A yet.