Project: 2671 (Run 12, Clone 40, Gen 89)

HendricksSA
Posts: 339
Joined: Fri Jun 26, 2009 4:34 am

Post by HendricksSA »

This WU has already been reported as one of the one-CPU WUs with a small data size. It seems to have come back out to cause more trouble, now failing with NaNs right at the start and a fatal error in md.c at line 2169. I've included the relevant log here (a rough sketch of what that NaN check does follows the log). I got the unit three times before it finally stopped: after the third failure the client downloaded a new a2 core and processing continued with a new WU.

Code: Select all

[05:01:32] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[05:01:32] + News From Folding@Home: Welcome to Folding@Home
[05:01:32] Loaded queue successfully.
[05:01:32] Connecting to http://171.67.108.24:8080/
[05:01:40] Posted data.
[05:01:40] Initial: 0000; - Receiving payload (expected size: 1507339)
[05:01:43] - Downloaded at ~490 kB/s
[05:01:43] - Averaged speed for that direction ~457 kB/s
[05:01:43] + Received work.
[05:01:43] Trying to send all finished work units
[05:01:43] + No unsent completed units remaining.
[05:01:43] + Closed connections
[05:01:43] 
[05:01:43] + Processing work unit
[05:01:43] Core required: FahCore_a2.exe
[05:01:43] Core found.
[05:01:43] Working on Unit 00 [September 11 05:01:43]
[05:01:43] + Working ...
[05:01:43] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 00 -checkpoint 15 -verbose -lifeline 2780 -version 602'

[05:01:43] 
[05:01:43] *------------------------------*
[05:01:43] Folding@Home Gromacs SMP Core
[05:01:43] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[05:01:43] 
[05:01:43] Preparing to commence simulation
[05:01:43] - Ensuring status. Please wait.
[05:01:43] Files status OK
[05:01:43] - Expanded 1506827 -> 24012993 (decompressed 1593.6 percent)
[05:01:43] Called DecompressByteArray: compressed_data_size=1506827 data_size=24012993, decompressed_data_size=24012993 diff=0
[05:01:44] - Digital signature verified
[05:01:44] 
[05:01:44] Project: 2671 (Run 12, Clone 40, Gen 89)
[05:01:44] 
[05:01:44] Assembly optimizations on if available.
[05:01:44] Entering M.D.
[05:01:53] Run 12, Clone 40, Gen 89)
[05:01:53] 
[05:01:53] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=tinalinux
NNODES=4, MYRANK=1, HOSTNAME=tinalinux
NNODES=4, MYRANK=3, HOSTNAME=tinalinux
NODEID=0 argc=20
NODEID=1 argc=20
Reading file work/wudata_00.tpr, VERSION 3.3.99_development_20070618 (single precision)
NNODES=4, MYRANK=2, HOSTNAME=tinalinux
NODEID=2 argc=20
NODEID=3 argc=20
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22884 system in water'
22500000 steps,  45000.0 ps (continuing from step 22250000,  44500.0 ps).

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 4

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[05:02:19] Completed 0 out of 250000 steps  (0%)
[05:02:23] CoreStatus = FF (255)
[05:02:23] Client-core communications error: ERROR 0xff
[05:02:23] Deleting current work unit & continuing...
[0]0:Return code = 0, signaled with Quit
[0]1:Return code = 0, signaled with Quit
[05:02:37] - Warning: Could not delete all work unit files (0): Core file absent
[05:02:37] Trying to send all finished work units
[05:02:37] + No unsent completed units remaining.
[05:02:37] - Preparing to get new work unit...
[05:02:37] + Attempting to get work packet
[05:02:37] - Will indicate memory of 7200 MB
[05:02:37] - Connecting to assignment server
[05:02:37] Connecting to http://assign.stanford.edu:8080/
[05:02:37] Posted data.
[05:02:37] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[05:02:37] + News From Folding@Home: Welcome to Folding@Home
[05:02:37] Loaded queue successfully.
[05:02:37] Connecting to http://171.67.108.24:8080/
[05:02:45] Posted data.
[05:02:45] Initial: 0000; - Receiving payload (expected size: 1507339)
[05:02:48] - Downloaded at ~490 kB/s
[05:02:48] - Averaged speed for that direction ~463 kB/s
[05:02:48] + Received work.
[05:02:48] + Closed connections
[05:02:53] 
[05:02:53] + Processing work unit
[05:02:53] Core required: FahCore_a2.exe
[05:02:53] Core found.
[05:02:53] Working on Unit 01 [September 11 05:02:53]
[05:02:53] + Working ...
[05:02:53] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 2780 -version 602'

[05:02:53] 
[05:02:53] *------------------------------*
[05:02:53] Folding@Home Gromacs SMP Core
[05:02:53] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[05:02:53] 
[05:02:53] Preparing to commence simulation
[05:02:53] - Ensuring status. Please wait.
[05:03:02] - Looking at optimizations...
[05:03:02] - Working with standard loops on this execution.
[05:03:02] - Files status OK
[05:03:03] - Expanded 1506827 -> 24012993 (decompressed 1593.6 percent)
[05:03:03] Called DecompressByteArray: compressed_data_size=1506827 data_size=24012993, decompressed_data_size=24012993 diff=0
[05:03:03] - Digital signature verified
[05:03:03] 
[05:03:03] Project: 2671 (Run 12, Clone 40, Gen 89)
[05:03:03] 
[05:03:03] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=tinalinux
NNODES=4, MYRANK=2, HOSTNAME=tinalinux
NNODES=4, MYRANK=3, HOSTNAME=tinalinux
NNODES=4, MYRANK=1, HOSTNAME=tinalinux
NODEID=0 argc=20
NODEID=1 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
Reading file work/wudata_01.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22884 system in water'
22500000 steps,  45000.0 ps (continuing from step 22250000,  44500.0 ps).
[05:03:29] Completed 0 out of 250000 steps  (0%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
[05:03:33] CoreStatus = FF (255)
[05:03:33] Client-core communications error: ERROR 0xff
[05:03:33] Deleting current work unit & continuing...
[0]0:Return code = 0, signaled with Quit
[0]1:Return code = 0, signaled with Quit
[05:03:47] - Warning: Could not delete all work unit files (1): Core file absent
[05:03:47] Trying to send all finished work units
[05:03:47] + No unsent completed units remaining.
[05:03:47] - Preparing to get new work unit...
[05:03:47] + Attempting to get work packet
[05:03:47] - Will indicate memory of 7200 MB
[05:03:47] - Connecting to assignment server
[05:03:47] Connecting to http://assign.stanford.edu:8080/
[05:03:47] Posted data.
[05:03:47] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[05:03:47] + News From Folding@Home: Welcome to Folding@Home
[05:03:48] Loaded queue successfully.
[05:03:48] Connecting to http://171.67.108.24:8080/
[05:03:56] Posted data.
[05:03:56] Initial: 0000; - Receiving payload (expected size: 1507339)
[05:03:59] - Downloaded at ~490 kB/s
[05:03:59] - Averaged speed for that direction ~469 kB/s
[05:03:59] + Received work.
[05:03:59] + Closed connections
[05:04:04] 
[05:04:04] + Processing work unit
[05:04:04] Core required: FahCore_a2.exe
[05:04:04] Core found.
[05:04:04] Working on Unit 02 [September 11 05:04:04]
[05:04:04] + Working ...
[05:04:04] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 2780 -version 602'

[05:04:04] 
[05:04:04] *------------------------------*
[05:04:04] Folding@Home Gromacs SMP Core
[05:04:04] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[05:04:04] 
[05:04:04] Preparing to commence simulation
[05:04:04] - Ensuring status. Please wait.
[05:04:13] - Looking at optimizations...
[05:04:13] - Working with standard loops on this execution.
[05:04:13] - Files status OK
[05:04:14] - Expanded 1506827 -> 24012993 (decompressed 1593.6 percent)
[05:04:14] Called DecompressByteArray: compressed_data_size=1506827 data_size=24012993, decompressed_data_size=24012993 diff=0
[05:04:14] - Digital signature verified
[05:04:14] 
[05:04:14] Project: 2671 (Run 12, Clone 40, Gen 89)
[05:04:14] 
[05:04:14] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=tinalinux
NNODES=4, MYRANK=1, HOSTNAME=tinalinux
NNODES=4, MYRANK=2, HOSTNAME=tinalinux
NNODES=4, MYRANK=3, HOSTNAME=tinalinux
NODEID=0 argc=20
NODEID=1 argc=20
NODEID=2 argc=20
Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
NODEID=3 argc=20
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22884 system in water'
22500000 steps,  45000.0 ps (continuing from step 22250000,  44500.0 ps).

-----------------------------------[05:04:40] Completed 0 out of 250000 steps  (0%)
[05:04:40] 
[05:04:40] Folding@home Core Shutdown: INTERRUPTED
--------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 4

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
application called MPI_Abort(MPI_COMM_WORLD, 102) - process 0

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 22250000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
[05:04:44] CoreStatus = FF (255)
[05:04:44] Client-core communications error: ERROR 0xff
[05:04:44] - Attempting to download new core...
[05:04:44] + Downloading new core: FahCore_a2.exe
[05:04:44] Downloading core (/~pande/Linux/x86/Core_a2.fah from http://www.stanford.edu)
[05:04:44] Initial: AFDE; + 10240 bytes downloaded
[05:04:44] Initial: E2DE; + 20480 bytes downloaded
[05:04:45] Initial: 10DD; + 30720 bytes downloaded
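
For anyone wondering what the check at md.c line 2169 amounts to: each step the core verifies that freshly computed quantities are still finite and aborts the moment a NaN turns up, since NaN propagates and makes every later step meaningless. Below is a minimal sketch of that kind of guard in C. It is only an illustration, not the actual GROMACS source; the names md_fatal and check_finite are invented for the example.

Code: Select all

#include <math.h>    /* isnan(), isinf(), nan() */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for GROMACS's gmx_fatal(): print the error
 * and abort, which is what produces the "Fatal error: NaN detected
 * at step ..." block seen in the log above. */
static void md_fatal(long long step, const char *what)
{
    fprintf(stderr,
            "Fatal error:\n"
            "NaN detected at step %lld (%s)\n", step, what);
    exit(EXIT_FAILURE);
}

/* Per-step sanity check: bail out as soon as a non-finite value
 * appears, because NaN/Inf would silently poison every later step. */
static void check_finite(long long step, double epot)
{
    if (isnan(epot) || isinf(epot)) {
        md_fatal(step, "potential energy");
    }
}

int main(void)
{
    long long step = 22250000;  /* first step after the checkpoint in the log */
    double    epot = nan("");   /* a NaN standing in for the bad WU data */

    check_finite(step, epot);   /* aborts here, mirroring md.c line 2169 */
    return 0;
}

Note that the failure hits the very first step after the checkpoint (22250000) every time the unit is re-downloaded, which suggests the NaN is baked into the WU data itself; that would explain why retries couldn't get past it and the new a2 core only helped by moving the client on to a different unit.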