Project: 3043 (Run 3, Clone 55, Gen 54) fails at 82% 0 (0)

parkut
Posts: 365
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Post by parkut »

This WU has failed twice at 82% and has restarted for its third
run. I've killed the bad WU, and my machine (q7) has moved on to
something else.

model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
cpu MHz : 1596.000
cache size : 4096 KB
Memory: 1.94 GB physical, 1.94 GB virtual
...
Client Version 6.24beta
Core: FahCore_a1.exe
Core Version 1.74 (November 27, 2006)
Current Work Unit
-----------------
Name: p3043_p3029_SMP-emsv-03
Tag: P3043R3C55G54

[10:34:09] Project: 3043 (Run 3, Clone 55, Gen 54)
[10:34:09] 
[10:34:09] Assembly optimizations on if available.
[10:34:09] Entering M.D.
[10:34:26] ial work pa- Starting from initial work packet
[10:34:26] 
[10:34:26] Project: 3Entering M.D.
[10:34:26] one 55, Gen 54)
[10:34:26] 
[10:34:26] Entering M.D.
[10:34:32] cal files
[10:34:32] Completed 0 out of 10000000 steps  (0 percent)
[10:34:32]  SSE boost OK.
[10:43:26] iles
[10:43:26] Completed 100000 out of 10000000 steps  (1 percent)
[10:52:22] Completed 200000 out of 10000000 steps  (2 percent)
... snip ...
[22:35:33] Completed 8100000 out of 10000000 steps  (81 percent)
[22:44:26] Completed 8200000 out of 10000000 steps  (82 percent)
[22:50:11] Warning:  long 1-4 interactions

[22:50:15] CoreStatus = 0 (0)
[22:50:15] Sending work to server
[22:50:15] Project: 3043 (Run 3, Clone 55, Gen 54)
[22:50:15] - Error: Could not get length of results file work/wuresults_06.dat
[22:50:15] - Error: Could not read unit 06 file. Removing from queue.
[22:50:15] Trying to send all finished work units
[22:50:15] + No unsent completed units remaining.


[22:50:15] - Preparing to get new work unit...
[22:50:15] + Attempting to get work packet
[22:50:15] - Will indicate memory of 1985 MB
[22:50:15] - Connecting to assignment server
[22:50:15] Connecting to http://assign.stanford.edu:8080/
[22:50:16] Posted data.
[22:50:16] Initial: 40AB; - Successful: assigned to (171.64.65.63).
[22:50:16] + News From Folding@Home: Welcome to Folding@Home
[22:50:16] Loaded queue successfully.
[22:50:16] Connecting to http://171.64.65.63:8080/
[22:50:17] Posted data.
[22:50:17] Initial: 0000; - Receiving payload (expected size: 283317)
[22:50:17] Conversation time very short, giving reduced weight in bandwidth avg
[22:50:17] - Downloaded at ~553 kB/s
[22:50:17] - Averaged speed for that direction ~380 kB/s
[22:50:17] + Received work.
[22:50:17] Trying to send all finished work units
[22:50:17] + No unsent completed units remaining.
[22:50:17] + Closed connections
[22:50:22] 
[22:50:22] + Processing work unit
[22:50:22] Work type a1 not eligible for variable processors
[22:50:22] Core required: FahCore_a1.exe
[22:50:22] Core found.
[22:50:22] Working on queue slot 07 [March 22 22:50:22 UTC]
[22:50:22] + Working ...
[22:50:22] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 07 -checkpoint 15 -verbose -lifeline 15441 -version 624'

[22:50:22] 
[22:50:22] *------------------------------*
[22:50:22] Folding@Home Gromacs SMP Core
[22:50:22] Version 1.74 (November 27, 2006)
[22:50:22] 
[22:50:22] Preparing to commence simulation
[22:50:22] - Ensuring status. Please wait.
[22:50:39] - Looking at optimizations...
[22:50:39] - Working with standard loops on this execution.
[22:50:39] - Previous termination of core was improper.
[22:50:39] - Going to use standard loops.
[22:50:39] - Files status OK
[22:50:39] - Expanded 282805 -> 1508541 (decompressed 533.4 percent)
[22:50:40] - Data doesn't match checksum.
[22:50:40] - Starting from initial work packet
[22:50:40] 
[22:50:40] Project: 3043 (Run 3, Clone 55, Gen 54)
[22:50:40] 
[22:50:40] Entering M.D.
[22:50:47] Protein: 9684 p3029_SProtein: 9684 p3029_SMP-emsv-03Extra SSE boost OK.
[22:50:47] 
[22:50:47] Extra SSE boost OK.
[22:50:47] Completed 0 out of 10000000 steps  (0 percent)
[22:59:43] Completed 100000 out of 10000000 steps  (1 percent)
[23:08:42] Completed 200000 out of 10000000 steps  (2 percent)
... snip ...
[10:51:54] Completed 8100000 out of 10000000 steps  (81 percent)
[11:00:47] Completed 8200000 out of 10000000 steps  (82 percent)
[11:06:34] Warning:  long 1-4 interactions

[11:06:38] CoreStatus = 0 (0)
[11:06:38] Sending work to server
[11:06:38] Project: 3043 (Run 3, Clone 55, Gen 54)
[11:06:38] - Error: Could not get length of results file work/wuresults_07.dat
[11:06:38] - Error: Could not read unit 07 file. Removing from queue.
[11:06:38] Trying to send all finished work units
[11:06:38] + No unsent completed units remaining.
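
For anyone who wants to check their own logs for the same pattern, here is a minimal sketch (not an official tool) that scans client log text and reports the last "Completed ... percent" value seen before each "long 1-4 interactions" warning. The function name `find_failure_points` is made up for this example, and it assumes the log lines look like the ones pasted above.

```python
import re

# Patterns modeled on the log lines in this post; other client versions may differ.
STEP_RE = re.compile(
    r"^\[(\d\d:\d\d:\d\d)\] Completed (\d+) out of (\d+) steps\s+\((\d+) percent\)"
)
WARN_RE = re.compile(r"^\[(\d\d:\d\d:\d\d)\] Warning:\s+long 1-4 interactions")


def find_failure_points(log_text):
    """Return a list of (warning_timestamp, last_percent_completed) tuples.

    last_percent_completed is None if no progress line preceded the warning.
    """
    failures = []
    last_percent = None
    for line in log_text.splitlines():
        step = STEP_RE.match(line)
        if step:
            last_percent = int(step.group(4))
            continue
        warn = WARN_RE.match(line)
        if warn:
            failures.append((warn.group(1), last_percent))
            last_percent = None  # the next attempt starts counting again
    return failures
```

Feeding it the two runs above should give one entry per failed attempt, each at 82 percent.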
susato
Site Moderator
Posts: 511
Joined: Fri Nov 30, 2007 4:57 am
Location: Team MacOSX

Re: Project: 3043 (Run 3, Clone 55, Gen 54) fails at 82% 0 (0)

Post by susato »

Another user also had trouble with this WU. Would you do us a favor -- bump this thread in a few days if no one else posts problems with the WU?

WUs that fail due to long 1-4 interactions can often be salvaged if you catch them during their second (or a later) attempt, before they reach the failure point: stop Folding, wait until the cores shut themselves down, then restart the unit. The closer you are to the failure point, the better the strategy works, though it's not infallible. You can tell that the strategy is working if, after the restart, the unit either runs to completion or fails further on in the calculation than it did the first time.

Backing up the work folder when you shut the unit down gives you an opportunity to restart the backup copy and resave it closer to a failure point if this becomes necessary.
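
The backup step could be scripted along these lines. This is only a rough sketch under a few assumptions: the client directory contains a `work` subfolder (as the `-dir work/` argument in the log suggests), the client and all cores have already been stopped by hand, and the function name `backup_work_folder` and the timestamped naming scheme are invented for this example. It does not stop or restart Folding for you.

```python
import shutil
import time
from pathlib import Path


def backup_work_folder(client_dir, backup_root):
    """Copy <client_dir>/work to <backup_root>/work-YYYYmmdd-HHMMSS.

    Run this only after the client and its cores have shut down,
    so the files are not being written while they are copied.
    Returns the path of the new backup copy.
    """
    src = Path(client_dir) / "work"
    if not src.is_dir():
        raise FileNotFoundError(f"no work folder found at {src}")
    dest = Path(backup_root) / ("work-" + time.strftime("%Y%m%d-%H%M%S"))
    shutil.copytree(src, dest)  # creates backup_root if it does not exist yet
    return dest
```

Each call makes a new timestamped copy, so an earlier backup is never overwritten and you can keep a copy from further back if a later one turns out to be too close to the failure.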
parkut
Posts: 365
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Re: Project: 3043 (Run 3, Clone 55, Gen 54) fails at 82% 0 (0)

Post by parkut »

I was assigned this same WU over the weekend (I was out of town).
The WU failed with the same error at 82%. I did not get a chance to
interrupt it and retry.
susato
Site Moderator
Posts: 511
Joined: Fri Nov 30, 2007 4:57 am
Location: Team MacOSX

Re: Project: 3043 (Run 3, Clone 55, Gen 54) fails at 82% 0 (0)

Post by susato »

It failed for another donor at 82% (but sent back an EUE for partial credit).
It also failed for two more donors, one at 36% and one at 41%.
The other donors were using Windows as far as I could tell - you're on Linux, aren't you?

I'm marking this as a bad WU.
parkut
Posts: 365
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Re: Project: 3043 (Run 3, Clone 55, Gen 54) fails at 82% 0 (0)

Post by parkut »

Correct, CentOS Linux, version 5.2.