Project: 3062 (Run 2, Clone 101, Gen 11) crashing at 40%
Posted: Wed May 07, 2008 3:02 pm
This WU has died twice at 40%, and the client just downloaded the same WU again. I thought the server was set up to assign a different WU after two failed attempts?
Snipped, fixed code tags, updated title. -7im
[23:26:28] Project: 3062 (Run 2, Clone 101, Gen 11)
[23:26:28]
[23:26:28] Assembly optimizations on if available.
[23:26:28] Entering M.D.
[23:26:44] percent)
[23:26:45] - Sta
[23:26:45] Project: 3062 (Run 2, Clone 101, Gen 11)
[23:26:45]
[23:26:45] Entering M.D.
[23:26:45] ne 101, Gen 11)
[23:26:45]
[23:26:45] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=2, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=1, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=3, HOSTNAME=localhost.localdomain
NODEID=1 argc=15
NODEID=0 argc=15
NODEID=2 argc=15
NODEID=3 argc=15
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2004, The GROMACS development team,
check out http://www.gromacs.org for more information.
This inclusion of Gromacs code in the Folding@Home Core is under
a special license (see http://folding.stanford.edu/gromacs.html)
specially granted to Stanford by the copyright holders. If you
are interested in using Gromacs, visit http://www.gromacs.org where
you can download a free version of Gromacs under
the terms of the GNU General Public License (GPL) as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.
starting mdrun 'p3062_lambda5_99sb'
5000000 steps, 10000.0 ps.
[23:26:51] files
[23:26:51] Completed 0 out of 5000000 steps (0 percent)
[23:26:51] a SSE boost OK.
[23:41:51] nt triggered.
[23:42:52] Writing local files
[23:42:52] Completed 50000 out of 5000000 steps (1 percent)
[23:57:52] Timered checkpoint triggered.
[23:58:51] Writing local files
[23:58:51] Completed 100000 out of 5000000 steps (2 percent)
snip
[09:50:43] Completed 1950000 out of 5000000 steps (39 percent)
[10:05:44] Timered checkpoint triggered.
[10:06:42] Writing local files
[10:06:42] Completed 2000000 out of 5000000 steps (40 percent)
[10:21:42] Timered checkpoint triggered.
[10:22:21] Warning: long 1-4 interactions
[0]0:Return code = 0, signaled with Segmentation fault
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Segmentation fault
[0]3:Return code = 0, signaled with Segmentation fault
[10:22:25] CoreStatus = 0 (0)
[10:22:25] Client-core communications error: ERROR 0x0
[10:22:25] Deleting current work unit & continuing...
[0]0:Return code = 0, signaled with Quit
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 18
[0]3:Return code = 0, signaled with Quit
[10:26:46] - Warning: Could not delete all work unit files (7): Core returned invalid code
[10:26:46] Trying to send all finished work units
[10:26:46] + No unsent completed units remaining.
[10:26:46] - Preparing to get new work unit...
[10:26:46] + Attempting to get work packet
[10:26:46] - Will indicate memory of 2013 MB
[10:26:46] - Connecting to assignment server
[10:26:46] Connecting to http://assign.stanford.edu:8080/
[10:26:47] Posted data.
[10:26:47] Initial: 40AB; - Successful: assigned to (171.64.65.63).
[10:26:47] + News From Folding@Home: Welcome to Folding@Home
[10:26:47] Loaded queue successfully.
[10:26:47] Connecting to http://171.64.65.63:8080/
[10:26:48] Posted data.
[10:26:48] Initial: 0000; - Receiving payload (expected size: 610425)
[10:26:52] - Downloaded at ~149 kB/s
[10:26:52] - Averaged speed for that direction ~151 kB/s
[10:26:52] + Received work.
[10:26:52] + Closed connections
[10:26:57]
[10:26:57] + Processing work unit
[10:26:57] Core required: FahCore_a1.exe
[10:26:57] Core found.
[10:26:57] Working on Unit 08 [May 7 10:26:57]
[10:26:57] + Working ...
[10:26:57] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 08 -checkpoint 15 -verbose -lifeline 17884 -version 601'
[10:26:57]
[10:26:57] *------------------------------*
[10:26:57] Folding@Home Gromacs SMP Core
[10:26:57] Version 1.74 (November 27, 2006)
[10:26:57]
[10:26:57] Preparing to commence simulation
[10:26:57] - Ensuring status. Please wait.
[10:27:14] - Looking at optimizations...
[10:27:14] - Working with standard loops on this execution.
[10:27:14] - Previous termination of core was improper.
[10:27:14] - Going to use standard loops.
[10:27:14] - Files status OK
[10:27:14] arting from initial work packet
[10:27:14]
[10:27:14] Project: 3062 (Run 2, Clone 101, Gen 11)
[10:27:14]
[10:27:14] Entering M.D.
[10:27:14] cket
[10:27:14]
[10:27:14] Project: 3062 (Run 2, Clone 101, Gen 11)
[10:27:14]
[10:27:14] Entering M.D.
NNODES=4, MYRANK=2, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=3, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=0, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=1, HOSTNAME=localhost.localdomain
NODEID=2 argc=15
NODEID=3 argc=15
NODEID=0 argc=15
NODEID=1 argc=15
starting mdrun 'p3062_lambda5_99sb'
5000000 steps, 10000.0 ps.
[10:27:20] files
[10:27:20] Extra SSE boost OK.
[10:27:20] Writing local files
[10:27:20] Completed 0 out of 5000000 steps (0 percent)
[10:42:20] Timered checkpoint triggered.
[10:43:18] Writing local files
[10:43:18] Completed 50000 out of 5000000 steps (1 percent)
[10:58:18] Timered checkpoint triggered.
[10:59:14] Writing local files
[10:59:14] Completed 100000 out of 5000000 steps (2 percent)
snip
[20:49:32] Completed 1950000 out of 5000000 steps (39 percent)
[21:04:32] Timered checkpoint triggered.
[21:05:31] Writing local files
[21:05:31] Completed 2000000 out of 5000000 steps (40 percent)
[21:20:31] Timered checkpoint triggered.
[21:21:11] Warning: long 1-4 interactions
[0]0:Return code = 0, signaled with Segmentation fault
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Segmentation fault
[0]3:Return code = 0, signaled with Segmentation fault
[21:21:15] CoreStatus = 0 (0)
[21:21:15] Client-core communications error: ERROR 0x0
[21:21:15] Deleting current work unit & continuing...
[0]0:Return code = 18
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[21:25:36] - Warning: Could not delete all work unit files (8): Core returned invalid code
[21:25:36] Trying to send all finished work units
[21:25:36] + No unsent completed units remaining.
[21:25:36] - Preparing to get new work unit...
[21:25:36] + Attempting to get work packet
[21:25:36] - Will indicate memory of 2013 MB
[21:25:36] - Connecting to assignment server
[21:25:36] Connecting to http://assign.stanford.edu:8080/
[21:25:36] Posted data.
[21:25:36] Initial: 40AB; - Successful: assigned to (171.64.65.63).
[21:25:36] + News From Folding@Home: Welcome to Folding@Home
[21:25:36] Loaded queue successfully.
[21:25:36] Connecting to http://171.64.65.63:8080/
[21:25:37] Posted data.
[21:25:37] Initial: 0000; - Receiving payload (expected size: 610425)
[21:25:41] - Downloaded at ~149 kB/s
[21:25:41] - Averaged speed for that direction ~151 kB/s
[21:25:41] + Received work.
[21:25:41] + Closed connections
[21:25:46]
[21:25:46] + Processing work unit
[21:25:46] Core required: FahCore_a1.exe
[21:25:46] Core found.
[21:25:46] Working on Unit 09 [May 7 21:25:46]
[21:25:46] + Working ...
[21:25:46] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 09 -checkpoint 15 -verbose -lifeline 17884 -version 601'
[21:25:46]
[21:25:46] *------------------------------*
[21:25:46] Folding@Home Gromacs SMP Core
[21:25:46] Version 1.74 (November 27, 2006)
[21:25:46]
[21:25:46] Preparing to commence simulation
[21:25:46] - Ensuring status. Please wait.
[21:26:03] - Looking at optimizations...
[21:26:03] - Working with standard loops on this execution.
[21:26:03] - Created dyn
[21:26:03] - Files status OK
[21:26:03] as improper.
[21:26:03] - Going to use sta- Expanded 609913 -> 3263133 (decompressed 535.0 percent)
[21:26:04] cket
[21:26:04]
[21:26:04] Project: 3062 (Run 2, Clone 101, Gen 11)
[21:26:04]
[21:26:04] Entering M.D.
[21:26:04] cket
[21:26:04]
[21:26:04] Project: 3062 (Run 2, Clone 101, Gen 11)
[21:26:04]
[21:26:04] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=1, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=3, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=2, HOSTNAME=localhost.localdomain
NODEID=3 argc=15
NODEID=0 argc=15
NODEID=1 argc=15
NODEID=2 argc=15
starting mdrun 'p3062_lambda5_99sb'
5000000 steps, 10000.0 ps.
[21:26:10] t OK.
[21:26:10] n: p3062_laExtra SSE boost OK.
[21:26:10] ambda5_99sbExtra SSE boost OK.
[21:26:10]
[21:26:10] Extra SSE boost OK.
[21:26:10] Writing local files
[21:26:10] Completed 0 out of 5000000 steps (0 percent)
[21:41:10] Timered checkpoint triggered.
[21:42:02] Writing local files
[21:42:02] Completed 50000 out of 5000000 steps (1 percent)
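For anyone who wants to check their own log for the same pattern, here is a minimal sketch in Python. It is not part of the client or any official tool. It pairs each "Client-core communications error" line with the last project and progress lines seen before it. The regular expressions are assumptions based only on the lines quoted above, and the function names (`find_failures`, `repeated_failures`) are made up for illustration. Other client versions may format these lines differently.

```python
import re
from collections import Counter

# Patterns inferred from the log excerpt above (assumption: other
# client versions may word or space these lines differently).
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
PROGRESS_RE = re.compile(r"Completed (\d+) out of (\d+) steps\s+\((\d+) percent\)")
ERROR_MARK = "Client-core communications error"


def find_failures(lines):
    """Return one (project, run, clone, gen, percent) tuple per core failure.

    percent is the last progress value logged before the error line,
    or None if no progress line was seen for that attempt.
    """
    failures = []
    wu = None        # (project, run, clone, gen) of the current work unit
    percent = None   # last percent logged for the current attempt
    for line in lines:
        m = PROJECT_RE.search(line)
        if m:
            wu = tuple(int(g) for g in m.groups())
            continue
        m = PROGRESS_RE.search(line)
        if m:
            percent = int(m.group(3))
            continue
        if ERROR_MARK in line and wu is not None:
            failures.append(wu + (percent,))
            percent = None  # the next attempt starts from scratch
    return failures


def repeated_failures(failures, min_count=2):
    """Return the failure tuples that occurred at least min_count times."""
    counts = Counter(failures)
    return {key: n for key, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Hypothetical file name; point this at your own client log.
    with open("FAHlog.txt", encoding="utf-8", errors="replace") as fh:
        for key, n in repeated_failures(find_failures(fh)).items():
            print(f"P{key[0]} (R{key[1]}, C{key[2]}, G{key[3]}) "
                  f"failed {n} times at {key[4]} percent")
```

Because the SMP core's processes write to the log at the same time, some lines come out truncated or interleaved. The sketch simply skips anything that does not match, so a damaged progress line means the reported percent may lag the real one by a step.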