
Project: 2671 (Run 51, Clone 50, Gen 89)

Posted: Mon Sep 14, 2009 8:10 pm
by HendricksSA
This WU has already been reported as one of the 1-CPU bad WUs with a small compressed data size. It seems to have come back out to cause me more trouble, now with NaNs right from the very first step and a fatal error in md.c at line 2169. I've included the relevant log below. I got it three times before it finally stopped; the client then downloaded a new a2 core and processing continued with another 2671 WU. I read in the 1-CPU thread that a script is now cleaning up 2671. Thank goodness. I wonder whether my NaN problem (which appears with this project on the very first step) is related? This is the third 2671 WU that has done this to me in three days. On the bright side, I'm stealing network bandwidth refreshing my a2 core every day! Posting FYI.


[08:33:22] - Preparing to get new work unit...
[08:33:22] + Attempting to get work packet
[08:33:22] - Will indicate memory of 7200 MB
[08:33:22] - Connecting to assignment server
[08:33:22] Connecting to http://assign.stanford.edu:8080/
[08:33:23] Posted data.
[08:33:23] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[08:33:23] + News From Folding@Home: Welcome to Folding@Home
[08:33:23] Loaded queue successfully.
[08:33:23] Connecting to http://171.67.108.24:8080/
[08:33:28] Posted data.
[08:33:28] Initial: 0000; - Receiving payload (expected size: 1498244)
[08:33:31] - Downloaded at ~487 kB/s
[08:33:31] - Averaged speed for that direction ~437 kB/s
[08:33:31] + Received work.
[08:33:31] + Closed connections
[08:33:36] 
[08:33:36] + Processing work unit
[08:33:36] Core required: FahCore_a2.exe
[08:33:36] Core found.
[08:33:36] Working on Unit 04 [September 14 08:33:36]
[08:33:36] + Working ...
[08:33:36] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 04 -checkpoint 15 -verbose -lifeline 3251 -version 602'

[08:33:36] 
[08:33:36] *------------------------------*
[08:33:36] Folding@Home Gromacs SMP Core
[08:33:36] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[08:33:36] 
[08:33:36] Preparing to commence simulation
[08:33:36] - Ensuring status. Please wait.
[08:33:45] - Looking at optimizations...
[08:33:45] - Working with standard loops on this execution.
[08:33:45] - Files status OK
[08:33:46] - Expanded 1497732 -> 24033557 (decompressed 1604.6 percent)
[08:33:46] Called DecompressByteArray: compressed_data_size=1497732 data_size=24033557, decompressed_data_size=24033557 diff=0
[08:33:46] - Digital signature verified
[08:33:46] 
[08:33:46] Project: 2671 (Run 51, Clone 50, Gen 89)
[08:33:46] 
[08:33:46] Entering M.D.
NNODES=4, MYRANK=1, HOSTNAME=tinalinux
NNODES=4, MYRANK=3, HOSTNAME=tinalinux
NNODES=4, MYRANK=0, HOSTNAME=tinalinux
NODEID=0 argc=20
NNODES=4, MYRANK=2, HOSTNAME=tinalinux
NODEID=1 argc=20
NODEID=2 argc=20
Reading file work/wudata_04.tpr, VERSION 3.3.99_development_20070618 (single precision)
NODEID=3 argc=20
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22887 system in water'
22500000 steps,  45000.0 ps (continuing from step 22250000,  44500.0 ps).

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
[08:34:13] Completed 0 out of 250000 steps  (0%)
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
[08:34:17] CoreStatus = FF (255)
[08:34:17] Client-core communications error: ERROR 0xff
[08:34:17] - Attempting to download new core...
[08:34:17] + Downloading new core: FahCore_a2.exe
[08:34:17] Downloading core (/~pande/Linux/x86/Core_a2.fah from www.stanford.edu)
[08:34:18] Initial: AFDE; + 10240 bytes downloaded
**********************new core downloaded here*******************************
[08:34:20] Initial: D8FF; + 2412195 bytes downloaded
[08:34:20] Verifying core Core_a2.fah...
[08:34:20] Signature is VALID
[08:34:20] 
[08:34:20] Trying to unzip core FahCore_a2.exe
[08:34:21] Decompressed FahCore_a2.exe (5509624 bytes) successfully
[08:34:21] + Core successfully engaged
[08:34:29] Deleting current work unit & continuing...
[0]3:Return code = 0, signaled with Quit
[08:34:43] - Warning: Could not delete all work unit files (4): Core file absent
[08:34:43] Trying to send all finished work units
[08:34:43] + No unsent completed units remaining.
[08:34:43] - Preparing to get new work unit...
[08:34:43] + Attempting to get work packet
[08:34:43] - Will indicate memory of 7200 MB
[08:34:43] - Connecting to assignment server
[08:34:43] Connecting to http://assign.stanford.edu:8080/
[08:34:44] Posted data.
[08:34:44] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[08:34:44] + News From Folding@Home: Welcome to Folding@Home
[08:34:44] Loaded queue successfully.
[08:34:44] Connecting to http://171.67.108.24:8080/
[08:34:44] Posted data.
[08:34:44] Initial: 0000; - Error: Bad packet type from server, expected work assignment
[08:34:44] - Attempt #1  to get work failed, and no other work to do.
             Waiting before retry.
[08:34:59] + Attempting to get work packet
[08:34:59] - Will indicate memory of 7200 MB
[08:34:59] - Connecting to assignment server
[08:34:59] Connecting to http://assign.stanford.edu:8080/
[08:34:59] Posted data.
[08:34:59] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[08:34:59] + News From Folding@Home: Welcome to Folding@Home
[08:34:59] Loaded queue successfully.
[08:34:59] Connecting to http://171.67.108.24:8080/
[08:35:05] Posted data.
[08:35:05] Initial: 0000; - Receiving payload (expected size: 4839947)
[08:35:16] - Downloaded at ~429 kB/s
[08:35:16] - Averaged speed for that direction ~436 kB/s
[08:35:16] + Received work.
[08:35:16] + Closed connections
[08:35:21] 
[08:35:21] + Processing work unit
[08:35:21] Core required: FahCore_a2.exe
[08:35:21] Core found.
[08:35:21] Working on Unit 05 [September 14 08:35:21]
[08:35:21] + Working ...
[08:35:21] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 05 -checkpoint 15 -verbose -lifeline 3251 -version 602'

[08:35:22] 
[08:35:22] *------------------------------*
[08:35:22] Folding@Home Gromacs SMP Core
[08:35:22] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[08:35:22] 
[08:35:22] Preparing to commence simulation
[08:35:22] - Ensuring status. Please wait.
[08:35:31] - Looking at optimizations...
[08:35:31] - Working with standard loops on this execution.
[08:35:31] - Files status OK
[08:35:32] - Expanded 4839435 -> 24005045 (decompressed 496.0 percent)
[08:35:32] Called DecompressByteArray: compressed_data_size=4839435 data_size=24005045, decompressed_data_size=24005045 diff=0
[08:35:32] - Digital signature verified
[08:35:32] 
[08:35:32] Project: 2671 (Run 43, Clone 86, Gen 101)
[08:35:32] 
[08:35:32] Entering M.D.
NNODES=4, MYRANK=2, HOSTNAME=tinalinux
NNODES=4, MYRANK=0, HOSTNAME=tinalinux
NODEID=0 argc=20
NNODES=4, MYRANK=1, HOSTNAME=tinalinux
NODEID=1 argc=20
NNODES=4, MYRANK=3, HOSTNAME=tinalinux
Reading file work/wudata_05.tpr, VERSION 3.3.99_development_20070618 (single precision)
NODEID=2 argc=20
NODEID=3 argc=20
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22866 system in water'
25500000 steps,  51000.0 ps (continuing from step 25250000,  50500.0 ps).
[08:35:42] Completed 0 out of 250000 steps  (0%)
[08:42:24] Completed 2500 out of 250000 steps  (1%)
[08:49:18] Completed 5000 out of 250000 steps  (2%)
[08:55:57] Completed 7500 out of 250000 steps  (3%)
[09:02:38] Completed 10000 out of 250000 steps  (4%)
[09:09:25] Completed 12500 out of 250000 steps  (5%)
[09:16:00] Completed 15000 out of 250000 steps  (6%)
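
For anyone triaging logs like the one above, here is a minimal sketch that pulls out each work unit's Project/Run/Clone/Gen, its decompression ratio (from the "Expanded" line), and the first "NaN detected" step, if any. The function name summarize_log and the regular expressions are my own assumptions, modeled only on the lines quoted in this post; other client or core versions may format these lines differently. The ratio is reported for comparison only and is not a confirmed test for a bad WU.

```python
import re

# Patterns modeled on the log lines quoted in this thread (assumption:
# other client/core versions may word these lines differently).
EXPANDED_RE = re.compile(r"Expanded (\d+) -> (\d+)")
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
NAN_RE = re.compile(r"NaN detected at step (\d+)")


def summarize_log(text):
    """Hypothetical helper: return one dict per work unit seen in the log."""
    units = []
    pending_ratio = None  # the "Expanded" line precedes the "Project" line
    for line in text.splitlines():
        m = EXPANDED_RE.search(line)
        if m:
            compressed, expanded = int(m.group(1)), int(m.group(2))
            pending_ratio = expanded / compressed
            continue
        m = PROJECT_RE.search(line)
        if m:
            units.append({
                "prcg": tuple(int(g) for g in m.groups()),
                "ratio": pending_ratio,
                "nan_step": None,
            })
            pending_ratio = None
            continue
        m = NAN_RE.search(line)
        # Each MPI rank repeats the error, so keep only the first report.
        if m and units and units[-1]["nan_step"] is None:
            units[-1]["nan_step"] = int(m.group(1))
    return units


# A few lines copied from the log in this post.
SAMPLE = """\
[08:33:46] - Expanded 1497732 -> 24033557 (decompressed 1604.6 percent)
[08:33:46] Project: 2671 (Run 51, Clone 50, Gen 89)
NaN detected at step 22250000
NaN detected at step 22250000
[08:35:32] - Expanded 4839435 -> 24005045 (decompressed 496.0 percent)
[08:35:32] Project: 2671 (Run 43, Clone 86, Gen 101)
"""

for unit in summarize_log(SAMPLE):
    print(unit)
```

On the excerpt above, the failing WU (Run 51, Clone 50, Gen 89) expands by roughly a factor of 16, while the WU that ran normally (Run 43, Clone 86, Gen 101) expands by roughly a factor of 5.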

Re: Project: 2671 (Run 51, Clone 50, Gen 89)

Posted: Mon Sep 14, 2009 10:32 pm
by bruce
According to this post, Dr. Kasson had started cleaning out problems like this in project 2671 but he probably hadn't finished at the time it was assigned to you.