2675 (Run 2, Clone 93, Gen 141) dies on startup

Moderators: Site Moderators, FAHC Science Team

dschief
Posts: 163
Joined: Tue Dec 04, 2007 5:56 am
Location: California Wine country

2675 (Run 2, Clone 93, Gen 141) dies on startup

Post by dschief »

This one has messed up repeatedly. I deleted the work folder, queue, and unit-data log files, etc., and started over after a reboot. This is what I got:

Code:

[jim@localhost ~]$ cd fold2
[jim@localhost fold2]$ ./fah6 -smp 4 -verbosity 9

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

4 cores detected


--- Opening Log file [September 6 16:19:09 UTC] 


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/jim/fold2
Executable: ./fah6
Arguments: -smp 4 -verbosity 9 

[16:19:09] - Ask before connecting: No
[16:19:09] - User name: dschief (Team 13761)
[16:19:09] - User ID: ###############
[16:19:09] - Machine ID: 2
[16:19:09] 
[16:19:09] Work directory not found. Creating...
[16:19:09] Could not open work queue, generating new queue...
[16:19:09] - Preparing to get new work unit...
[16:19:09] - Autosending finished units... [16:19:09]
[16:19:09] Trying to send all finished work units
[16:19:09] + No unsent completed units remaining.
[16:19:09] - Autosend completed
[16:19:09] + Attempting to get work packet
[16:19:09] - Will indicate memory of 3924 MB
[16:19:09] - Connecting to assignment server
[16:19:09] Connecting to http://assign.stanford.edu:8080/
[16:19:09] Posted data.
[16:19:09] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[16:19:09] + News From Folding@Home: Welcome to Folding@Home
[16:19:09] Loaded queue successfully.
[16:19:09] Connecting to http://171.64.65.56:8080/
[16:19:18] Posted data.
[16:19:18] Initial: 0000; - Receiving payload (expected size: 1515973)
[16:19:27] - Downloaded at ~164 kB/s
[16:19:27] - Averaged speed for that direction ~164 kB/s
[16:19:27] + Received work.
[16:19:27] + Closed connections
[16:19:27] 
[16:19:27] + Processing work unit
[16:19:27] Core required: FahCore_a2.exe
[16:19:27] Core found.
[16:19:27] Working on queue slot 01 [September 6 16:19:27 UTC]
[16:19:27] + Working ...
[16:19:27] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 2494 -version 624'

[16:19:28] 
[16:19:28] *------------------------------*
[16:19:28] Folding@Home Gromacs SMP Core
[16:19:28] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[16:19:28] 
[16:19:28] Preparing to commence simulation
[16:19:28] - Ensuring status. Please wait.
[16:19:28] Files status OK
[16:19:28] - Expanded 1515461 -> 24004801 (decompressed 1583.9 percent)
[16:19:28] Called DecompressByteArray: compressed_data_size=1515461 data_size=24004801, decompressed_data_size=24004801 diff=0
[16:19:28] - Digital signature verified
[16:19:28] 
[16:19:28] Project: 2675 (Run 2, Clone 93, Gen 141)
[16:19:28] 
[16:19:28] Assembly optimizations on if available.
[16:19:28] Entering M.D.
[16:19:38] Run 2, Clone 93, Gen 141)
[16:19:38] 
[16:19:38] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=1, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=2, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=3, HOSTNAME=localhost.localdomain
NODEID=0 argc=20
NODEID=1 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
Reading file work/wudata_01.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22878 system in water'
35500004 steps,  71000.0 ps (continuing from step 35250004,  70500.0 ps).

------------------[16:20:00] Completed 0 out of 250000 steps  (0%)
[16:20:00] 
[16:20:00] Folding@home Core Shutdown: INTERRUPTED
-------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 35250004

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 4

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 35250004

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
application called MPI_Abort(MPI_COMM_WORLD, 102) - process 0

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169

Fatal error:
NaN detected at step 35250004

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
[16:20:04] CoreStatus = FF (255)
[16:20:04] Sending work to server
[16:20:04] Project: 2675 (Run 2, Clone 93, Gen 141)
[16:20:04] - Error: Could not get length of results file work/wuresults_01.dat
[16:20:04] - Error: Could not read unit 01 file. Removing from queue.
[16:20:04] Trying to send all finished work units
[16:20:04] + No unsent completed units remaining.
[16:20:04] - Preparing to get new work unit...
[16:20:04] + Attempting to get work packet
[16:20:04] - Will indicate memory of 3924 MB
[16:20:04] - Connecting to assignment server
[16:20:04] Connecting to http://assign.stanford.edu:8080/
[16:20:04] Posted data.
[16:20:04] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[16:20:04] + News From Folding@Home: Welcome to Folding@Home
[16:20:04] Loaded queue successfully.
[16:20:04] Connecting to http://171.64.65.56:8080/
[16:20:11] Posted data.
[16:20:11] Initial: 0000; - Receiving payload (expected size: 1515973)
[16:20:21] - Downloaded at ~148 kB/s
[16:20:21] - Averaged speed for that direction ~156 kB/s
[16:20:21] + Received work.
[16:20:21] Trying to send all finished work units
[16:20:21] + No unsent completed units remaining.
[16:20:21] + Closed connections
[16:20:26] 
[16:20:26] + Processing work unit
[16:20:26] Core required: FahCore_a2.exe
[16:20:26] Core found.
[16:20:26] Working on queue slot 02 [September 6 16:20:26 UTC]
[16:20:26] + Working ...
[16:20:26] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 2494 -version 624'

[16:20:26] 
[16:20:26] *------------------------------*
[16:20:26] Folding@Home Gromacs SMP Core
[16:20:26] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[16:20:26] 
[16:20:26] Preparing to commence simulation
[16:20:26] - Ensuring status. Please wait.
[16:20:27] Called DecompressByteArray: compressed_data_size=1515461 data_size=24004801, decompressed_data_size=24004801 diff=0
[16:20:27] - Digital signature verified
[16:20:27] 
[16:20:27] Project: 2675 (Run 2, Clone 93, Gen 141)
[16:20:27] 
[16:20:27] Assembly optimizations on if available.
[16:20:27] Entering M.D.
[16:20:36] Run 2, Clone 93, Gen 141)
[16:20:36] 
[16:20:37] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=1, HOSTNAME=localhost.localdomain
NODEID=1 argc=20
NNODES=4, MYRANK=2, HOSTNAME=localhost.localdomain
NODEID=2 argc=20
NNODES=4, MYRANK=3, HOSTNAME=localhost.localdomain
NODEID=0 argc=20
NODEID=3 argc=20
Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22878 system in water'
35500004 steps,  71000.0 ps (continuing from step 35250004,  70500.0 ps).
^C[16:20:56] ***** Got an Activate signal (2)
application called MPI_Abort(MPI_COMM_WORLD, 102) - process 0
[16:20:56] Killing all core threads

Folding@Home Client Shutdown.
[jim@localhost fold2]$ [0]0:Return code = 102
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit

uncle fuzzy
Posts: 460
Joined: Sun Dec 02, 2007 10:15 pm
Location: Michigan

Re: 2675 (Run 2, Clone 93, Gen 141) dies on startup

Post by uncle fuzzy »

The key is this line:
compressed_data_size=1515461
That's a bad WU that the new 2.10 a2 core is designed to dump. Look for data sizes around 1.5 MB. It SHOULD EUE (Early Unit End) and get a new WU.
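If you want to spot these without eyeballing the whole log, here's a quick sketch (my own, nothing official; it assumes the standard FAHlog.txt wording shown above and that the bad batch all lands near 1,515,461 compressed bytes):

Code:

#!/usr/bin/env python3
# Scan a FAHlog.txt for DecompressByteArray lines whose compressed size
# is suspiciously close to the known-bad ~1.5 MB payload quoted above.
# The tolerance is an assumption, not an official cutoff.
import re
import sys

BAD_SIZE = 1_515_461   # compressed_data_size from the bad WU in this thread
TOLERANCE = 2_000      # assumed slack around the known-bad size

pattern = re.compile(r"compressed_data_size=(\d+)")

path = sys.argv[1] if len(sys.argv) > 1 else "FAHlog.txt"
with open(path) as log:
    for lineno, line in enumerate(log, 1):
        m = pattern.search(line)
        if m and abs(int(m.group(1)) - BAD_SIZE) <= TOLERANCE:
            print(f"line {lineno}: suspect WU, compressed size {m.group(1)}")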
Proud to crash my machines as a Beta Tester!

dschief
Posts: 163
Joined: Tue Dec 04, 2007 5:56 am
Location: California Wine country

Re: 2675 (Run 2, Clone 93, Gen 141) dies on startup

Post by dschief »

After several more failures, I deleted the folder, downloaded a fresh client, and started from scratch. I finally got a different WU (2677), which is crunching along fine now.
uncle fuzzy
Posts: 460
Joined: Sun Dec 02, 2007 10:15 pm
Location: Michigan

Re: 2675 (Run 2, Clone 93, Gen 141) dies on startup

Post by uncle fuzzy »

Just watch for that ~1.5 MB size entry. I had one VM run through 7 bad ones before it got a normal WU and started folding. I just watched it, so zero effort was involved on my part. You'll keep seeing them until they get all the bad ones weeded out.
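If you'd rather be alerted than sit and watch, here's a rough watcher sketch (again mine, not official; it assumes the "(expected size: N)" line format from the log above, and the size window is a guess based on the WUs in this thread):

Code:

#!/usr/bin/env python3
# Poll FAHlog.txt and warn when a download's expected payload size lands
# in the ~1.5 MB range of the bad batch. Purely illustrative.
import re
import time

SIZE_RE = re.compile(r"expected size: (\d+)")
LOW, HIGH = 1_500_000, 1_530_000  # assumed window around the bad sizes

def watch(path="FAHlog.txt", interval=5.0):
    with open(path) as log:
        log.seek(0, 2)                 # start at the end of the file
        while True:
            line = log.readline()
            if not line:               # nothing new yet; wait and retry
                time.sleep(interval)
                continue
            m = SIZE_RE.search(line)
            if m and LOW <= int(m.group(1)) <= HIGH:
                print("WARNING: likely bad WU incoming:", line.strip())

if __name__ == "__main__":
    watch()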
Proud to crash my machines as a Beta Tester!

codysluder
Posts: 1024
Joined: Sun Dec 02, 2007 12:43 pm

Re: 2675 (Run 2, Clone 93, Gen 141) dies on startup

Post by codysluder »

dschief wrote: This one has messed up repeatedly. I deleted the work folder, queue, and unit-data log files, etc., and started over after a reboot. This is what I got
From your log, this WU wasted 55 seconds of wall-clock time each time it downloaded. I'm wondering if your efforts to force a new WU were worthwhile. It looks to me like it would have been more effective to just ignore it, let it download several times, and let it move on to a new WU naturally.

I'm guessing that many folks spend entirely too much effort (for a net change in processing time that might even be negative) in order to avoid errors like this one.
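To put rough numbers on that, a back-of-the-envelope sketch; the 55-second figure comes from the timestamps in dschief's log (download at 16:19:09, EUE at 16:20:04), and the seven-WU worst case from uncle fuzzy's post above:

Code:

# Rough cost of letting bad WUs cycle vs. intervening by hand.
# The two input figures come from this thread; the rest is assumption.
seconds_per_bad_wu = 55   # 16:19:09 download -> 16:20:04 EUE, per the log
worst_case_bad_wus = 7    # uncle fuzzy's VM, above

wasted = seconds_per_bad_wu * worst_case_bad_wus
print(f"worst case wasted: {wasted} s (~{wasted / 60:.1f} min)")

# Any manual fix (delete work dir, reboot, re-download the client) that
# takes longer than this comes out behind, which is the point above.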
dschief
Posts: 163
Joined: Tue Dec 04, 2007 5:56 am
Location: California Wine country

Re: 2675 (Run 2, Clone 93, Gen 141) dies on startup

Post by dschief »

I guess you're right, and I'm being too anal about it. The Linux farm runs so smoothly that I overreact on the rare occasion that it does throw a glitch. :mrgreen: