
Project: 5733 (Run 4, Clone 21, Gen 521)

Posted: Sun Dec 13, 2009 12:53 am
by Pette Broad
I'm on my 6th run of this and I've lost patience with it. It keeps failing at around the same point each time. How many times do these units get sent out before the server gives up? I'm going to delete it because I must be getting close to the dreaded 24-hour sleep.

EDIT: Too late. The next attempt failed while I was writing this and put me into a 24-hour sleep, and on restart I got it AGAIN. Absolutely ridiculous.

EDIT 2: Finally got something else after winning my battle with the server, even though it sent it to me a further 3 times. I noticed that after the 3rd delete there was no work available on the first attempt, so is it likely that this one's been hanging around for a while? Anyway, I think it's very unfair to put a machine into sleep if it fails the same RCG over and over again. I understand that Pande Group don't want an errant machine burning hundreds of WUs, but wouldn't it be better if the sleep mode only applied if 2 different WUs EUE'd?

Code:

[22:04:16] Project: 5733 (Run 4, Clone 21, Gen 521)
[22:04:16] 
[22:04:16] Assembly optimizations on if available.
[22:04:16] Entering M.D.
[22:04:22] Will resume from checkpoint file
[22:04:23] Working on Protein
[22:04:23] Client config found, loading data.
[22:04:23] Starting GUI Server
[22:04:30] Resuming from checkpoint
[22:04:30] Verified work/wudata_00.log
[22:04:30] Verified work/wudata_00.edr
[22:04:30] Verified work/wudata_00.xtc
[22:08:49] Completed 1%
[22:13:08] Completed 2%
[22:17:27] Completed 3%
[22:17:27] mdrun_gpu returned 
[22:17:27] NANs detected on GPU
[22:17:27] 
[22:17:27] Folding@home Core Shutdown: UNSTABLE_MACHINE
[22:17:31] CoreStatus = 7A (122)
[22:17:31] Sending work to server
[22:17:31] Project: 5733 (Run 4, Clone 21, Gen 521)
[22:17:31] - Error: Could not get length of results file work/wuresults_00.dat
[22:17:31] - Error: Could not read unit 00 file. Removing from queue.
[22:17:31] - Preparing to get new work unit...
[22:17:31] + Attempting to get work packet
[22:17:31] - Connecting to assignment server
[22:17:32] - Successful: assigned to (171.64.65.102).
[22:17:32] + News From Folding@Home: Welcome to Folding@Home
[22:17:32] Loaded queue successfully.
[22:17:35] + Closed connections
[22:17:40] 
[22:17:40] + Processing work unit
[22:17:40] Core required: FahCore_11.exe
[22:17:40] Core found.
[22:17:40] Working on queue slot 01 [December 12 22:17:40 UTC]
[22:17:40] + Working ...
[22:17:40] 
[22:17:40] *------------------------------*
[22:17:40] Folding@Home GPU Core - Beta
[22:17:40] Version 1.18 (Mon Oct 13 11:11:30 PDT 2008)
[22:17:40] 
[22:17:40] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[22:17:40] Build host: amoeba
[22:17:40] Board Type: AMD
[22:17:40] Core      : 
[22:17:40] Preparing to commence simulation
[22:17:40] - Looking at optimizations...
[22:17:40] - Created dyn
[22:17:40] - Files status OK
[22:17:40] - Expanded 98537 -> 492188 (decompressed 499.4 percent)
[22:17:40] Called DecompressByteArray: compressed_data_size=98537 data_size=492188, decompressed_data_size=492188 diff=0
[22:17:40] - Digital signature verified
[22:17:40] 
[22:17:40] Project: 5733 (Run 4, Clone 21, Gen 521)
[22:17:40] 
[22:17:40] Assembly optimizations on if available.
[22:17:40] Entering M.D.
[22:17:47] Working on Protein
[22:17:47] Client config found, loading data.
[22:17:47] Starting GUI Server
[22:22:18] Completed 1%
[22:26:43] Completed 2%
[22:31:06] Completed 3%
[22:31:06] mdrun_gpu returned 
[22:31:06] NANs detected on GPU
[22:31:06] 
[22:31:06] Folding@home Core Shutdown: UNSTABLE_MACHINE
[22:31:09] CoreStatus = 7A (122)
[22:31:09] Sending work to server
[22:31:09] Project: 5733 (Run 4, Clone 21, Gen 521)
[22:31:09] - Error: Could not get length of results file work/wuresults_01.dat
[22:31:09] - Error: Could not read unit 01 file. Removing from queue.
[22:31:09] - Preparing to get new work unit...
[22:31:09] + Attempting to get work packet
[22:31:09] - Connecting to assignment server
[22:31:09] - Successful: assigned to (171.64.65.102).
[22:31:09] + News From Folding@Home: Welcome to Folding@Home
[22:31:09] Loaded queue successfully.
[22:31:11] + Closed connections
[22:31:16] 
[22:31:16] + Processing work unit
[22:31:16] Core required: FahCore_11.exe
[22:31:16] Core found.
[22:31:16] Working on queue slot 02 [December 12 22:31:16 UTC]
[22:31:16] + Working ...
[22:31:17] 
[22:31:17] *------------------------------*
[22:31:17] Folding@Home GPU Core - Beta
[22:31:17] Version 1.18 (Mon Oct 13 11:11:30 PDT 2008)
[22:31:17] 
[22:31:17] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[22:31:17] Build host: amoeba
[22:31:17] Board Type: AMD
[22:31:17] Core      : 
[22:31:17] Preparing to commence simulation
[22:31:17] - Looking at optimizations...
[22:31:17] - Created dyn
[22:31:17] - Files status OK
[22:31:17] - Expanded 98537 -> 492188 (decompressed 499.4 percent)
[22:31:17] Called DecompressByteArray: compressed_data_size=98537 data_size=492188, decompressed_data_size=492188 diff=0
[22:31:17] - Digital signature verified
[22:31:17] 
[22:31:17] Project: 5733 (Run 4, Clone 21, Gen 521)
[22:31:17] 
[22:31:17] Assembly optimizations on if available.
[22:31:17] Entering M.D.
[22:31:23] Working on Protein
[22:31:23] Client config found, loading data.
[22:31:23] Starting GUI Server
[22:35:52] Completed 1%
[22:40:16] Completed 2%
[22:44:33] Completed 3%
[22:44:33] mdrun_gpu returned 
[22:44:33] NANs detected on GPU
[22:44:33] 
[22:44:33] Folding@home Core Shutdown: UNSTABLE_MACHINE
[22:44:37] CoreStatus = 7A (122)
[22:44:37] Sending work to server
[22:44:37] Project: 5733 (Run 4, Clone 21, Gen 521)
[22:44:37] - Error: Could not get length of results file work/wuresults_02.dat
[22:44:37] - Error: Could not read unit 02 file. Removing from queue.
[22:44:37] - Preparing to get new work unit...
[22:44:37] + Attempting to get work packet
[22:44:37] - Connecting to assignment server
[22:44:38] - Successful: assigned to (171.64.65.102).
[22:44:38] + News From Folding@Home: Welcome to Folding@Home
[22:44:38] Loaded queue successfully.
[22:44:40] + Closed connections
[22:44:45] 
[22:44:45] + Processing work unit
[22:44:45] Core required: FahCore_11.exe
[22:44:45] Core found.
[22:44:45] Working on queue slot 03 [December 12 22:44:45 UTC]
[22:44:45] + Working ...
[22:44:45] 
[22:44:45] *------------------------------*
[22:44:45] Folding@Home GPU Core - Beta
[22:44:45] Version 1.18 (Mon Oct 13 11:11:30 PDT 2008)
[22:44:45] 
[22:44:45] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[22:44:45] Build host: amoeba
[22:44:45] Board Type: AMD
[22:44:45] Core      : 
[22:44:45] Preparing to commence simulation
[22:44:45] - Looking at optimizations...
[22:44:45] - Created dyn
[22:44:45] - Files status OK
[22:44:45] - Expanded 98537 -> 492188 (decompressed 499.4 percent)
[22:44:45] Called DecompressByteArray: compressed_data_size=98537 data_size=492188, decompressed_data_size=492188 diff=0
[22:44:45] - Digital signature verified
[22:44:45] 
[22:44:45] Project: 5733 (Run 4, Clone 21, Gen 521)
[22:44:45] 
[22:44:45] Assembly optimizations on if available.
[22:44:45] Entering M.D.
[22:44:52] Working on Protein
[22:44:52] Client config found, loading data.
[22:44:52] Starting GUI Server
[22:49:17] Completed 1%
[22:53:38] Completed 2%
[22:58:01] Completed 3%
[22:58:01] mdrun_gpu returned 
[22:58:01] NANs detected on GPU
[22:58:01] 
[22:58:01] Folding@home Core Shutdown: UNSTABLE_MACHINE
[22:58:06] CoreStatus = 7A (122)
[22:58:06] Sending work to server
[22:58:06] Project: 5733 (Run 4, Clone 21, Gen 521)
[22:58:06] - Error: Could not get length of results file work/wuresults_03.dat
[22:58:06] - Error: Could not read unit 03 file. Removing from queue.
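
For anyone wanting to check whether a log like the one above shows one work unit failing repeatedly or several different ones, here is a minimal Python sketch. It is not part of the client; the function name `count_failures` and the idea of keying on the "Project:" line are assumptions made for illustration, based only on the line formats visible in this log.

```python
import re
from collections import Counter

# Matches lines such as "[22:04:16] Project: 5733 (Run 4, Clone 21, Gen 521)".
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
# Text the core prints when it gives up on a work unit (taken from the log above).
FAIL_MARKER = "Core Shutdown: UNSTABLE_MACHINE"


def count_failures(log_lines):
    """Return a Counter mapping (project, run, clone, gen) to failure count."""
    failures = Counter()
    current = None  # most recent work unit named in the log
    for line in log_lines:
        match = PRCG_RE.search(line)
        if match:
            current = tuple(int(g) for g in match.groups())
        elif FAIL_MARKER in line and current is not None:
            failures[current] += 1
    return failures


# A shortened excerpt of the log above, used as sample input.
sample = [
    "[22:04:16] Project: 5733 (Run 4, Clone 21, Gen 521)",
    "[22:17:27] NANs detected on GPU",
    "[22:17:27] Folding@home Core Shutdown: UNSTABLE_MACHINE",
    "[22:17:31] Project: 5733 (Run 4, Clone 21, Gen 521)",
    "[22:17:40] Project: 5733 (Run 4, Clone 21, Gen 521)",
    "[22:31:06] Folding@home Core Shutdown: UNSTABLE_MACHINE",
]
print(count_failures(sample))
```

A single key with a high count, as here, points at one work unit being reassigned and failing each time rather than at many different units failing.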

Re: Project: 5733 (Run 4, Clone 21, Gen 521)

Posted: Sun Dec 13, 2009 1:40 am
by bruce
Pette Broad wrote: I noticed that after the 3rd delete there was no work available on the first attempt, so is it likely that this one's been hanging around for a while?
I don't see any other reports of trouble with this WU. That would be the only way I could tell if it has been hanging around for a while. "No work available" is the Assignment Server saying it can't find another ATI server for you, so it doesn't necessarily mean what you think it means.
Pette Broad wrote: Anyway, I think it's very unfair to put a machine into sleep if it fails the same RCG over and over again. I understand that Pande Group don't want an errant machine burning hundreds of WUs, but wouldn't it be better if the sleep mode only applied if 2 different WUs EUE'd?
I've suggested the same improvement myself, and apparently there are some reasons why they can't do that with the present structure. Maybe something in V7?
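
To make the suggested rule concrete, here is a rough Python sketch of "only sleep if at least 2 different WUs have failed". This is purely illustrative and is not how the client or servers actually work: the function `should_sleep`, its parameters, and the threshold of 5 failures are all hypothetical, and the second work unit in the example is made up.

```python
def should_sleep(failed_units, min_distinct=2, min_failures=5):
    """Decide whether to pause, given the work units that recently failed.

    failed_units: list of (project, run, clone, gen) tuples, one per failure.
    min_failures is an illustrative threshold, not the client's real value.
    """
    if len(failed_units) < min_failures:
        return False
    # Repeated failures of one work unit count as a single distinct unit.
    return len(set(failed_units)) >= min_distinct


# Six failures, all on the work unit from this thread.
same_wu = [(5733, 4, 21, 521)] * 6
# Four failures on that unit plus one on a made-up second unit.
mixed = same_wu[:4] + [(5733, 0, 0, 1)]

print(should_sleep(same_wu))  # False: only one distinct work unit failed
print(should_sleep(mixed))    # True: two distinct work units failed
```

Under this rule a single bad work unit could never trigger the sleep by itself, while a machine that fails on different units still would.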