Project 6012: (Run 2 Clone 262 Gen 13)

Moderators: Site Moderators, FAHC Science Team

Aardvark
Posts: 143
Joined: Sat Jul 12, 2008 4:22 pm
Location: Team MacResource

Project 6012: (Run 2 Clone 262 Gen 13)

Post by Aardvark »

Another a3 core WU has failed early in the folding. This one did not make it to 4%. I will have to remember this as Black Sunday.

Log file is as follows:

Code: Select all

[19:14:57] - Will indicate memory of 1024 MB
[19:14:57] - Connecting to assignment server
[19:14:57] Connecting to http://assign.stanford.edu:8080/
[19:15:00] Posted data.
[19:15:00] Initial: ED82; - Successful: assigned to (130.237.232.140).
[19:15:00] + News From Folding@Home: Welcome to Folding@Home
[19:15:00] Loaded queue successfully.
[19:15:00] Sent data
[19:15:00] Connecting to http://130.237.232.140:8080/
[19:15:02] Posted data.
[19:15:02] Initial: 0000; - Receiving payload (expected size: 1797039)
[19:25:36] - Downloaded at ~2 kB/s
[19:25:36] - Averaged speed for that direction ~3 kB/s
[19:25:36] + Received work.
[19:25:36] Trying to send all finished work units
[19:25:36] + No unsent completed units remaining.
[19:25:36] + Connections closed: You may now disconnect
[19:25:41] 
[19:25:41] + Processing work unit
[19:25:41] Core required: FahCore_a3.exe
[19:25:41] Core found.
[19:25:41] Working on queue slot 07 [March 28 19:25:41 UTC]
[19:25:41] + Working ...
[19:25:41] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 07 -np 2 -checkpoint 15 -verbose -lifeline 6642 -version 629'

[19:25:41] 
[19:25:41] *------------------------------*
[19:25:41] Folding@Home Gromacs SMP Core
[19:25:41] Version 2.17 (Mar 7 2010)
[19:25:41] 
[19:25:41] Preparing to commence simulation
[19:25:41] - Looking at optimizations...
[19:25:41] - Created dyn
[19:25:41] - Files status OK
[19:25:41] - Expanded 1796527 -> 2078149 (decompressed 115.6 percent)
[19:25:41] Called DecompressByteArray: compressed_data_size=1796527 data_size=2078149, decompressed_data_size=2078149 diff=0
[19:25:41] - Digital signature verified
[19:25:41] 
[19:25:41] Project: 6012 (Run 2, Clone 262, Gen 13)
[19:25:41] 
[19:25:41] Assembly optimizations on if available.
[19:25:41] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_07.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
7000000 steps,  14000.0 ps (continuing from step 6500000,  13000.0 ps).
[19:25:48] Completed 0 out of 500000 steps  (0%)
[19:49:33] Completed 5000 out of 500000 steps  (1%)
[20:12:55] Completed 10000 out of 500000 steps  (2%)
[20:36:11] Completed 15000 out of 500000 steps  (3%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
24 particles communicated to PME node 1 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[20:39:04] mdrun returned 255
[20:39:04] Going to send back what have done -- stepsTotalG=500000
[20:39:04] Work fraction=0.0312 steps=500000.
[20:39:05] CoreStatus = 0 (0)
[20:39:05] Sending work to server
[20:39:05] Project: 6012 (Run 2, Clone 262, Gen 13)
[20:39:05] - Error: Could not get length of results file work/wuresults_07.dat
[20:39:05] - Error: Could not read unit 07 file. Removing from queue.
[20:39:05] Trying to send all finished work units
[20:39:05] + No unsent completed units remaining.
[20:39:05] - Preparing to get new work unit...
[20:39:06] > Press "c" to connect to the server to download unit

I attempted to return the "remains" to Stanford, but the upload failed. Apparently the Client is in such a rush to "fail" that it does not write the work files correctly. The outcome: no information transferred to PG by the formal route, and no possibility of any points being credited, however meager. The log covering the return attempt is as follows:

Code: Select all

[20:39:04] Going to send back what have done -- stepsTotalG=500000
[20:39:04] Work fraction=0.0312 steps=500000.
[20:39:05] CoreStatus = 0 (0)
[20:39:05] Sending work to server
[20:39:05] Project: 6012 (Run 2, Clone 262, Gen 13)
[20:39:05] - Error: Could not get length of results file work/wuresults_07.dat
[20:39:05] - Error: Could not read unit 07 file. Removing from queue.
[20:39:05] Trying to send all finished work units
[20:39:05] + No unsent completed units remaining.
[20:39:05] - Preparing to get new work unit...
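The telling line in that log is "Could not get length of results file work/wuresults_08.dat": the core exited before writing the results file, yet the client went straight to the upload step and then dropped the unit. The guard logic at issue can be sketched in Python; the function and file names here are hypothetical illustrations, not the actual client's code (which is C/C++):

```python
import os

def results_ready(path):
    """Return True only if the results file exists and is non-empty,
    mirroring the client's 'Could not get length of results file' check."""
    try:
        return os.path.getsize(path) > 0
    except OSError:  # file missing or unreadable
        return False

# Hypothetical usage: keep the unit queued instead of removing it
# when the core died before writing its output.
if results_ready("work/wuresults_07.dat"):
    print("results file present; attempting upload")
else:
    print("results file missing or empty; nothing to send back")
```

A client built this way would at least report *why* nothing could be returned, rather than silently removing the unit from the queue.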

Until the next time...... :-)
What is past is prologue!
Aardvark
Posts: 143
Joined: Sat Jul 12, 2008 4:22 pm
Location: Team MacResource

Re: Project 6012: (Run 2 Clone 262 Gen 13)

Post by Aardvark »

The AS, in its wisdom, resent the same WU to me. It failed at the same point (less than 4%) as it did on my first try.

The only change seems to be that this time the Client was able to write the work files, so a return of the remains to Stanford was possible.

The Log file follows:

Code: Select all

[20:55:06] - Connecting to assignment server
[20:55:06] Connecting to http://assign.stanford.edu:8080/
[20:55:09] Posted data.
[20:55:09] Initial: ED82; - Successful: assigned to (130.237.232.140).
[20:55:09] + News From Folding@Home: Welcome to Folding@Home
[20:55:09] Loaded queue successfully.
[20:55:09] Sent data
[20:55:09] Connecting to http://130.237.232.140:8080/
[20:55:12] Posted data.
[20:55:12] Initial: 0000; - Receiving payload (expected size: 1797039)
[21:04:34] - Downloaded at ~3 kB/s
[21:04:34] - Averaged speed for that direction ~3 kB/s
[21:04:34] + Received work.
[21:04:34] Trying to send all finished work units
[21:04:34] + No unsent completed units remaining.
[21:04:34] + Connections closed: You may now disconnect
[21:04:39] 
[21:04:39] + Processing work unit
[21:04:39] Core required: FahCore_a3.exe
[21:04:39] Core found.
[21:04:39] Working on queue slot 08 [March 28 21:04:39 UTC]
[21:04:39] + Working ...
[21:04:39] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 08 -np 2 -checkpoint 15 -verbose -lifeline 6642 -version 629'

[21:04:39] 
[21:04:39] *------------------------------*
[21:04:39] Folding@Home Gromacs SMP Core
[21:04:39] Version 2.17 (Mar 7 2010)
[21:04:39] 
[21:04:39] Preparing to commence simulation
[21:04:39] - Ensuring status. Please wait.
[21:04:48] - Looking at optimizations...
[21:04:48] - Working with standard loops on this execution.
[21:04:48] - Created dyn
[21:04:48] - Files status OK
[21:04:49] - Expanded 1796527 -> 2078149 (decompressed 115.6 percent)
[21:04:49] Called DecompressByteArray: compressed_data_size=1796527 data_size=2078149, decompressed_data_size=2078149 diff=0
[21:04:49] - Digital signature verified
[21:04:49] 
[21:04:49] Project: 6012 (Run 2, Clone 262, Gen 13)
[21:04:49] 
[21:04:49] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_08.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
7000000 steps,  14000.0 ps (continuing from step 6500000,  13000.0 ps).
[21:04:55] Completed 0 out of 500000 steps  (0%)
[21:29:03] Completed 5000 out of 500000 steps  (1%)
[21:52:09] Completed 10000 out of 500000 steps  (2%)
[22:16:08] Completed 15000 out of 500000 steps  (3%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
24 particles communicated to PME node 1 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[22:19:01] mdrun returned 255
[22:19:01] Going to send back what have done -- stepsTotalG=500000
[22:19:01] Work fraction=0.0312 steps=500000.
[22:19:05] logfile size=13365 infoLength=13365 edr=0 trr=25
[22:19:05] logfile size: 13365 info=13365 bed=0 hdr=25
[22:19:05] - Writing 13903 bytes of core data to disk...
[22:19:06]   ... Done.
[22:19:06] 
[22:19:06] Folding@home Core Shutdown: UNSTABLE_MACHINE
[22:19:06] CoreStatus = 7A (122)
[22:19:06] Sending work to server
[22:19:06] Project: 6012 (Run 2, Clone 262, Gen 13)


[22:19:06] + Attempting to send results [March 28 22:19:06 UTC]
[22:19:06] - Reading file work/wuresults_08.dat from core
[22:19:06]   (Read 13903 bytes from disk)
[22:19:07] > Press "c" to connect to the server to upload results
c[22:27:13] - Establishing connection
[22:27:16] Connecting to http://130.237.232.140:8080/
[22:27:21] Posted data.
[22:27:21] Initial: 0000; - Uploaded at ~2 kB/s
[22:27:21] - Averaged speed for that direction ~2 kB/s
[22:27:21] + Results successfully sent
[22:27:21] Thank you for your contribution to Folding@Home.
[22:27:22] Trying to send all finished work units
[22:27:22] + No unsent completed units remaining.
[22:27:22] - Preparing to get new work unit...
[22:27:22] Cleaning up work directory
I have no idea why my computer has been declared "UNSTABLE". I run a fan application that controls and reports the core temperature; I keep it below 150°F, I check it frequently, and it displays continuously. There does not seem to have been any problem.

I am not interested in going on a wild-goose chase, but are there any checks I should run at this point to prove or disprove this charge of instability?
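One crude check (a toy sketch only; no substitute for a proper memory test or stress test, and not anything FAH itself provides): genuine hardware instability usually shows up as *non-repeatable* arithmetic, so running the same deterministic floating-point workload twice and comparing the results bit-for-bit can at least catch gross flakiness.

```python
import hashlib
import struct

def workload(n=200_000, x=1.0):
    """Deterministic floating-point loop; hash every intermediate value
    so any run-to-run difference (bad RAM, overheating FPU) is visible."""
    h = hashlib.sha256()
    for i in range(1, n):
        x = (x * 1.0000001 + 1.0 / i) % 1000.0
        h.update(struct.pack("<d", x))
    return h.hexdigest()

run1, run2 = workload(), workload()
print("match" if run1 == run2 else "MISMATCH - suspect hardware")
```

On a healthy machine the two digests always match; a mismatch on repeated runs would be real evidence for the "unstable" verdict.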

Guidance requested.
What is past is prologue!
Aardvark
Posts: 143
Joined: Sat Jul 12, 2008 4:22 pm
Location: Team MacResource

Re: Project 6012: (Run 2 Clone 262 Gen 13)

Post by Aardvark »

As I think about it, the fact that my computer can run and rerun the same WU and bring it to failure at exactly the same point each time is prima facie evidence of ideal stability on the computer's part. An unstable computer, by definition, would produce somewhat randomized results. Perhaps I do not understand "UNSTABLE" in computerese....
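That argument can be made concrete with a toy sketch (pure illustration, nothing to do with FahCore internals; all names and numbers are invented): a deterministic computation fails at the identical step on every run, which is the signature of a software or work-unit problem, whereas random hardware faults would move the failure point around.

```python
import random

def toy_sim(seed=6012):
    """Deterministic toy 'MD run': same input -> same trajectory,
    so it crosses the failure threshold at the same step every time."""
    rng = random.Random(seed)
    x = 0.0
    for step in range(10_000):
        x += rng.random() - 0.49  # slight systematic drift per step
        if x > 10.0:              # stand-in for the PME out-of-cell error
            return step           # the "failure point"
    return None

# Two runs fail at the identical step, because the run is deterministic.
print(toy_sim() == toy_sim())  # prints True
```

By this logic, a repeatable failure at 3% on the same WU points at the work unit itself, not at the machine running it.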
What is past is prologue!