The first delivery of P6050R0C147G13 failed before reaching 1% completion. The Client could not write the work file properly (the log shows it could not get the length of work/wuresults_00.dat), so there was nothing to return to Stanford, either for the science or for ANY points.
So, back I went to the AS (Assignment Server) at Stanford looking for more work. The AS assumed that I was “cherry picking” because I already had an assigned WU (work unit) that I had not returned (it was not returnable). So the AS gave me the identical WU, P6050R0C147G13. I understand this is protocol, and for good reason. This time through, my Mac Mini failed before reaching 2%. However, the Client was able to write the work files properly. mdrun returned 255 and my Mini was identified as an UNSTABLE_MACHINE with CoreStatus = 7A (122). I returned the remains to the Stanford servers and received the obligatory “Thank you”, acknowledging receipt. I have not yet checked, but I am confident that four (or so) points will be awarded for the second part of this episode and ZIPPO for the first part.
The Server must have felt sorry for me because I was next assigned a Project 6040 WU. I have not had one of these before. Fresh meat is a good thing when they have been feeding you MREs for what seems an eternity.
The log file covering both runs of P6050R0C147G13 is as follows:
[11:47:26] + Sent 1 of 1 completed units to the server
[11:47:26] + Connections closed: You may now disconnect
[11:47:26]
[11:47:26] + Processing work unit
[11:47:26] Core required: FahCore_a3.exe
[11:47:26] Core found.
[11:47:26] Working on queue slot 00 [April 28 11:47:26 UTC]
[11:47:26] + Working ...
[11:47:26] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 00 -np 2 -checkpoint 15 -verbose -lifeline 5262 -version 629'
[11:47:26]
[11:47:26] *------------------------------*
[11:47:26] Folding@Home Gromacs SMP Core
[11:47:26] Version 2.17 (Mar 7 2010)
[11:47:26]
[11:47:26] Preparing to commence simulation
[11:47:26] - Looking at optimizations...
[11:47:26] - Created dyn
[11:47:26] - Files status OK
[11:47:27] - Expanded 1766537 -> 2253505 (decompressed 127.5 percent)
[11:47:27] Called DecompressByteArray: compressed_data_size=1766537 data_size=2253505, decompressed_data_size=2253505 diff=0
[11:47:27] - Digital signature verified
[11:47:27]
[11:47:27] Project: 6050 (Run 0, Clone 147, Gen 13)
[11:47:27]
[11:47:27] Assembly optimizations on if available.
[11:47:27] Entering M.D.
[11:47:34] Completed 0 out of 500000 steps (0%)
[11:47:51] mdrun returned 255
[11:47:51] Going to send back what have done -- stepsTotalG=500000
[11:47:51] Work fraction=0.0001 steps=500000.
[11:47:51] CoreStatus = 0 (0)
[11:47:51] Sending work to server
[11:47:51] Project: 6050 (Run 0, Clone 147, Gen 13)
[11:47:51] - Error: Could not get length of results file work/wuresults_00.dat
[11:47:51] - Error: Could not read unit 00 file. Removing from queue.
[11:47:51] Trying to send all finished work units
[11:47:51] + No unsent completed units remaining.
[11:47:51] - Preparing to get new work unit...
[11:47:52] > Press "c" to connect to the server to download unit
[12:01:12] - Establishing connection
[12:01:15] Cleaning up work directory
[12:01:16] + Attempting to get work packet
[12:01:16] Passkey found
[12:01:16] - Will indicate memory of 2048 MB
[12:01:16] - Connecting to assignment server
[12:01:16] Connecting to http://assign.stanford.edu:8080/
[12:01:19] Posted data.
[12:01:19] Initial: 40AB; - Successful: assigned to (171.64.65.54).
[12:01:19] + News From Folding@Home: Welcome to Folding@Home
[12:01:20] Loaded queue successfully.
[12:01:20] Sent data
[12:01:20] Connecting to http://171.64.65.54:8080/
[12:01:22] Posted data.
[12:01:22] Initial: 0000; - Receiving payload (expected size: 1767049)
[12:16:58] - Downloaded at ~1 kB/s
[12:16:58] - Averaged speed for that direction ~2 kB/s
[12:16:58] + Received work.
[12:16:58] Trying to send all finished work units
[12:16:58] + No unsent completed units remaining.
[12:16:58] + Connections closed: You may now disconnect
[12:17:03]
[12:17:03] + Processing work unit
[12:17:03] Core required: FahCore_a3.exe
[12:17:03] Core found.
[12:17:03] Working on queue slot 01 [April 28 12:17:03 UTC]
[12:17:03] + Working ...
[12:17:03] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 2 -checkpoint 15 -verbose -lifeline 5262 -version 629'
[12:17:03]
[12:17:03] *------------------------------*
[12:17:03] Folding@Home Gromacs SMP Core
[12:17:03] Version 2.17 (Mar 7 2010)
[12:17:03]
[12:17:03] Preparing to commence simulation
[12:17:03] - Ensuring status. Please wait.
[12:17:13] - Looking at optimizations...
[12:17:13] - Working with standard loops on this execution.
[12:17:13] - Created dyn
[12:17:13] - Files status OK
[12:17:13] - Expanded 1766537 -> 2253505 (decompressed 127.5 percent)
[12:17:13] Called DecompressByteArray: compressed_data_size=1766537 data_size=2253505, decompressed_data_size=2253505 diff=0
[12:17:13] - Digital signature verified
[12:17:13]
[12:17:13] Project: 6050 (Run 0, Clone 147, Gen 13)
[12:17:13]
[12:17:13] Entering M.D.
[12:17:20] Completed 0 out of 500000 steps (0%)
[12:40:11] Completed 5000 out of 500000 steps (1%)
[12:41:57] ***** Got an Activate signal (2)
[12:41:57] Killing all core threads
Folding@Home Client Shutdown.
--- Opening Log file [April 28 14:48:49 UTC]
# Mac OS X SMP Console Edition ################################################
###############################################################################
Folding@Home Client Version 6.29r3
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: /Users/tedkreuserIII/Library/Folding@home
Executable: ./fah6
Arguments: -local -verbosity 9 -smp -advmethods
[14:48:49] - Ask before connecting: Yes
[14:48:49] - User name: Aardvark (Team 48057)
[14:48:49] - User ID: XXXXXXXXXXXXXXXXXXX
[14:48:49] - Machine ID: 1
[14:48:49]
[14:48:49] Loaded queue successfully.
[14:48:49]
[14:48:49] - Autosending finished units... [April 28 14:48:49 UTC]
[14:48:49] + Processing work unit
[14:48:49] Trying to send all finished work units
[14:48:49] Core required: FahCore_a3.exe
[14:48:49] + No unsent completed units remaining.
[14:48:49] - Autosend completed
[14:48:49] Core found.
[14:48:49] Working on queue slot 01 [April 28 14:48:49 UTC]
[14:48:49] + Working ...
[14:48:49] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 2 -checkpoint 15 -verbose -lifeline 139 -version 629'
[14:48:50]
[14:48:50] *------------------------------*
[14:48:50] Folding@Home Gromacs SMP Core
[14:48:50] Version 2.17 (Mar 7 2010)
[14:48:50]
[14:48:50] Preparing to commence simulation
[14:48:50] - Looking at optimizations...
[14:48:50] - Files status OK
[14:48:50] - Expanded 1766537 -> 2253505 (decompressed 127.5 percent)
[14:48:50] Called DecompressByteArray: compressed_data_size=1766537 data_size=2253505, decompressed_data_size=2253505 diff=0
[14:48:50] - Digital signature verified
[14:48:50]
[14:48:50] Project: 6050 (Run 0, Clone 147, Gen 13)
[14:48:50]
[14:48:50] Assembly optimizations on if available.
[14:48:50] Entering M.D.
[14:48:56] Using Gromacs checkpoints
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_01.tpr, VERSION 4.0.99_development_20090605 (single precision)
Reading checkpoint file work/wudata_01.cpt generated: Wed Apr 28 07:41:58 2010
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Mutant_scan'
7000000 steps, 14000.0 ps (continuing from step 6505392, 13010.8 ps).
[14:48:57] Resuming from checkpoint
[14:48:57] Verified work/wudata_01.log
[14:48:57] Verified work/wudata_01.trr
[14:48:57] Verified work/wudata_01.edr
[14:48:58] Completed 5392 out of 500000 steps (1%)
-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563
Fatal error:
3 particles communicated to PME node 0 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
[15:08:09] mdrun returned 255
[15:08:09] Going to send back what have done -- stepsTotalG=500000
[15:08:09] Work fraction=0.0197 steps=500000.
[15:08:13] logfile size=13475 infoLength=13475 edr=0 trr=25
[15:08:13] logfile size: 13475 info=13475 bed=0 hdr=25
[15:08:13] - Writing 14013 bytes of core data to disk...
[15:08:14] ... Done.
[15:08:15]
[15:08:15] Folding@home Core Shutdown: UNSTABLE_MACHINE
[15:08:15] CoreStatus = 7A (122)
[15:08:15] Sending work to server
[15:08:15] Project: 6050 (Run 0, Clone 147, Gen 13)
[15:08:15] + Attempting to send results [April 28 15:08:15 UTC]
[15:08:15] - Reading file work/wuresults_01.dat from core
[15:08:15] (Read 14013 bytes from disk)
[15:08:16] > Press "c" to connect to the server to upload results
c[15:35:40] - Establishing connection
[15:35:43] Connecting to http://171.64.65.54:8080/
[15:35:48] Posted data.
[15:35:48] Initial: 0000; - Uploaded at ~2 kB/s
[15:35:48] - Averaged speed for that direction ~2 kB/s
[15:35:48] + Results successfully sent
[15:35:48] Thank you for your contribution to Folding@Home.
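For anyone who wants to pull the failure details out of a log like the one above without reading every line, here is a minimal Python sketch. It only looks for the "Work fraction=" and "CoreStatus =" lines, checks that the hex status code agrees with the decimal value printed in parentheses, and estimates steps completed as fraction times total steps. The function name `summarize` and the reading of "Work fraction" as the completed share of the steps are my own assumptions, not anything documented by the Client.

```python
import re

# A few lines copied verbatim from the log above, used as sample input.
SAMPLE = """\
[11:47:51] mdrun returned 255
[11:47:51] Work fraction=0.0001 steps=500000.
[11:47:51] CoreStatus = 0 (0)
[15:08:09] mdrun returned 255
[15:08:09] Work fraction=0.0197 steps=500000.
[15:08:15] CoreStatus = 7A (122)
"""

FRACTION = re.compile(r"Work fraction=([0-9.]+) steps=(\d+)")
STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def summarize(log_text):
    """Return one dict per CoreStatus line found in the log text.

    Hypothetical helper: pairs each CoreStatus line with the most recent
    'Work fraction' line seen before it.
    """
    runs = []
    fraction = steps = None
    for line in log_text.splitlines():
        m = FRACTION.search(line)
        if m:
            fraction, steps = float(m.group(1)), int(m.group(2))
            continue
        m = STATUS.search(line)
        if m:
            hex_code, dec_code = m.group(1), int(m.group(2))
            runs.append({
                "status_hex": hex_code,
                "status_dec": dec_code,
                # 0x7A should equal 122 if the line is intact.
                "hex_matches_dec": int(hex_code, 16) == dec_code,
                # Assumption: fraction * total steps approximates steps done.
                "approx_steps_done": (
                    None if fraction is None else round(fraction * steps)
                ),
            })
            fraction = steps = None
    return runs


if __name__ == "__main__":
    for run in summarize(SAMPLE):
        print(run)
```

Run against the full log text, this should report two core exits: the first with status 0 and almost no work done, the second with status 7A (122).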
That's all, until the next time. And there will be a next time, unfortunately.