Project: 2671 (Run 50, Clone 97, Gen 92)

HendricksSA
Posts: 339
Joined: Fri Jun 26, 2009 4:34 am

Project: 2671 (Run 50, Clone 97, Gen 92)

Post by HendricksSA »

This WU has already been reported as one of the one-CPU WUs with a small data size. It seems to have come back to cause more trouble, this time with NaNs right at the start and a fatal error in md.c at line 2169. I've included the relevant log below. I got the same WU three times in a row; after the third failure the client downloaded a new a2 core and processing continued with a new WU.

Code: Select all

[13:34:06] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[13:34:06] + News From Folding@Home: Welcome to Folding@Home
[13:34:06] Loaded queue successfully.
[13:34:06] Connecting to http://171.67.108.24:8080/
[13:34:14] Posted data.
[13:34:14] Initial: 0000; - Receiving payload (expected size: 1493395)
[13:34:17] - Downloaded at ~486 kB/s
[13:34:17] - Averaged speed for that direction ~494 kB/s
[13:34:17] + Received work.
[13:34:17] Trying to send all finished work units
[13:34:17] + No unsent completed units remaining.
[13:34:17] + Closed connections
[13:34:17] 
[13:34:17] + Processing work unit
[13:34:17] Core required: FahCore_a2.exe
[13:34:17] Core found.
[13:34:17] Working on Unit 06 [September 12 13:34:17]
[13:34:17] + Working ...
[13:34:17] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 2780 -version 602'

[13:34:17] 
[13:34:17] *------------------------------*
[13:34:17] Folding@Home Gromacs SMP Core
[13:34:17] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[13:34:17] 
[13:34:17] Preparing to commence simulation
[13:34:17] - Ensuring status. Please wait.
[13:34:17] Called DecompressByteArray: compressed_data_size=1492883 data_size=24046869, decompressed_data_size=24046869 diff=0
[13:34:18] - Digital signature verified
[13:34:18] 
[13:34:18] Project: 2671 (Run 50, Clone 97, Gen 92)
[13:34:18] 
[13:34:18] Assembly optimizations on if available.
[13:34:18] Entering M.D.
[13:34:27] Run 50, Clone 97, Gen 92)
[13:34:27] 
[13:34:27] Entering M.D.
NNODES=4, MYRANK=2, HOSTNAME=tinalinux
NNODES=4, MYRANK=3, HOSTNAME=tinalinux
NNODES=4, MYRANK=1, HOSTNAME=tinalinux
NNODES=4, MYRANK=0, HOSTNAME=tinalinux
NODEID=0 argc=20
NODEID=1 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
Reading file work/wudata_06.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22887 system in water'
23250000 steps,  46500.0 ps (continuing from step 23000000,  46000.0 ps).

--------------------------------[13:34:53] Completed 0 out of 250000 steps  (0%)
[13:34:53] 
[13:34:53] Folding@home Core Shutdown: INTERRUPTED
-----------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 23000000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 4

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 23000000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
application called MPI_Abort(MPI_COMM_WORLD, 102) - process 0

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 23000000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
[13:34:57] CoreStatus = FF (255)
[13:34:57] Client-core communications error: ERROR 0xff
[13:34:57] Deleting current work unit & continuing...
[0]0:Return code = 0, signaled with Quit
[0]1:Return code = 0, signaled with Quit
[13:35:01] - Warning: Could not delete all work unit files (6): Core file absent
[13:35:01] Trying to send all finished work units
[13:35:01] + No unsent completed units remaining.
[13:35:01] - Preparing to get new work unit...
[13:35:01] + Attempting to get work packet
[13:35:01] - Will indicate memory of 7200 MB
[13:35:01] - Connecting to assignment server
[13:35:01] Connecting to http://assign.stanford.edu:8080/
[13:35:02] Posted data.
[13:35:02] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[13:35:02] + News From Folding@Home: Welcome to Folding@Home
[13:35:02] Loaded queue successfully.
[13:35:02] Connecting to http://171.67.108.24:8080/
[13:35:10] Posted data.
[13:35:10] Initial: 0000; - Receiving payload (expected size: 1493395)
[13:35:13] - Downloaded at ~486 kB/s
[13:35:13] - Averaged speed for that direction ~492 kB/s
[13:35:13] + Received work.
[13:35:13] + Closed connections
[13:35:18] 
[13:35:18] + Processing work unit
[13:35:18] Core required: FahCore_a2.exe
[13:35:18] Core found.
[13:35:18] Working on Unit 07 [September 12 13:35:18]
[13:35:18] + Working ...
[13:35:18] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 07 -checkpoint 15 -verbose -lifeline 2780 -version 602'

[13:35:18] 
[13:35:18] *------------------------------*
[13:35:18] Folding@Home Gromacs SMP Core
[13:35:18] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[13:35:18] 
[13:35:18] Preparing to commence simulation
[13:35:18] - Ensuring status. Please wait.
[13:35:18] Called DecompressByteArray: compressed_data_size=1492883 data_size=24046869, decompressed_data_size=24046869 diff=0
[13:35:19] - Digital signature verified
[13:35:19] 
[13:35:19] Project: 2671 (Run 50, Clone 97, Gen 92)
[13:35:19] 
[13:35:19] Assembly optimizations on if available.
[13:35:19] Entering M.D.
[13:35:28] Run 50, Clone 97, Gen 92)
[13:35:28] 
[13:35:28] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=tinalinux
NNODES=4, MYRANK=3, HOSTNAME=tinalinux
NNODES=4, MYRANK=2, HOSTNAME=tinalinux
NODEID=0 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
Reading file work/wudata_07.tpr, VERSION 3.3.99_development_20070618 (single precision)
NNODES=4, MYRANK=1, HOSTNAME=tinalinux
NODEID=1 argc=20
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22887 system in water'
23250000 steps,  46500.0 ps (continuing from step 23000000,  46000.0 ps).
[13:35:54] Completed 0 out of 250000 steps  (0%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 23000000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 23000000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 23000000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 23000000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
[13:35:58] CoreStatus = FF (255)
[13:35:58] Client-core communications error: ERROR 0xff
[13:35:58] Deleting current work unit & continuing...
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[13:36:12] - Warning: Could not delete all work unit files (7): Core file absent
[13:36:12] Trying to send all finished work units
[13:36:12] + No unsent completed units remaining.
[13:36:12] - Preparing to get new work unit...
[13:36:12] + Attempting to get work packet
[13:36:12] - Will indicate memory of 7200 MB
[13:36:12] - Connecting to assignment server
[13:36:12] Connecting to http://assign.stanford.edu:8080/
[13:36:12] Posted data.
[13:36:12] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[13:36:12] + News From Folding@Home: Welcome to Folding@Home
[13:36:13] Loaded queue successfully.
[13:36:13] Connecting to http://171.67.108.24:8080/
[13:36:21] Posted data.
[13:36:21] Initial: 0000; - Receiving payload (expected size: 1493395)
[13:36:24] - Downloaded at ~486 kB/s
[13:36:24] - Averaged speed for that direction ~491 kB/s
[13:36:24] + Received work.
[13:36:24] + Closed connections
[13:36:29] 
[13:36:29] + Processing work unit
[13:36:29] Core required: FahCore_a2.exe
[13:36:29] Core found.
[13:36:29] Working on Unit 08 [September 12 13:36:29]
[13:36:29] + Working ...
[13:36:29] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 08 -checkpoint 15 -verbose -lifeline 2780 -version 602'

[13:36:29] 
[13:36:29] *------------------------------*
[13:36:29] Folding@Home Gromacs SMP Core
[13:36:29] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[13:36:29] 
[13:36:29] Preparing to commence simulation
[13:36:29] - Ensuring status. Please wait.
[13:36:38] - Looking at optimizations...
[13:36:38] - Working with standard loops on this execution.
[13:36:38] - Files status OK
[13:36:39] - Expanded 1492883 -> 24046869 (decompressed 1610.7 percent)
[13:36:39] Called DecompressByteArray: compressed_data_size=1492883 data_size=24046869, decompressed_data_size=24046869 diff=0
[13:36:39] - Digital signature verified
[13:36:39] 
[13:36:39] Project: 2671 (Run 50, Clone 97, Gen 92)
[13:36:39] 
[13:36:39] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=tinalinux
NNODES=4, MYRANK=2, HOSTNAME=tinalinux
NNODES=4, MYRANK=1, HOSTNAME=tinalinux
NNODES=4, MYRANK=3, HOSTNAME=tinalinux
NODEID=0 argc=20
NODEID=1 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
Reading file work/wudata_08.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22887 system in water'
23250000 steps,  46500.0 ps (continuing from step 23000000,  46000.0 ps).

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 23000000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 4

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 23000000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 23000000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 23000000

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[13:37:05] Completed 0 out of 250000 steps  (0%)

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
[13:37:09] CoreStatus = FF (255)
[13:37:09] Client-core communications error: ERROR 0xff
[13:37:09] - Attempting to download new core...
[13:37:09] + Downloading new core: FahCore_a2.exe
[13:37:09] Downloading core (/~pande/Linux/x86/Core_a2.fah from www.stanford.edu)
[13:37:09] Initial: AFDE; + 10240 bytes downloaded
[13:37:09] Initial: E2DE; + 20480 bytes downloaded
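
In case it helps anyone checking their own log for the same pattern, here is a rough sketch that tallies how often a given WU shows up and how many NaN aborts and FF core exits were logged. The file name FAHlog.txt and the WU string are assumptions for a standard v6 Linux client install; adjust both for your setup.

Code: Select all

# Rough sketch: tally repeat appearances and NaN aborts for one WU.
# "FAHlog.txt" and the WU string below are assumptions -- point them at
# your own client log and the WU you are chasing.
from collections import Counter

WU = "Project: 2671 (Run 50, Clone 97, Gen 92)"

counts = Counter()
with open("FAHlog.txt", encoding="latin-1") as log:
    for line in log:
        if WU in line:
            counts["wu_lines"] += 1          # lines naming this exact WU
        if "NaN detected at step" in line:
            counts["nan_aborts"] += 1        # GROMACS fatal NaN messages
        if "CoreStatus = FF" in line:
            counts["core_status_ff"] += 1    # core exits with status FF (255)

print(counts)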