Project: 2662 (Run 2, Clone 5, Gen 20) - ERROR 0xff (etc.)

Posted: Fri Aug 29, 2008 3:39 pm
by 314159
Dual Core Intel - Stock Clock - First Failure in months. Linux Client

[04:34:51] Initial: 0000; - Receiving payload (expected size: 4923994)
[04:34:54] - Downloaded at ~1602 kB/s
[04:34:54] - Averaged speed for that direction ~1355 kB/s
[04:34:54] + Received work.
[04:34:54] Trying to send all finished work units
[04:34:54] + No unsent completed units remaining.
[04:34:54] + Closed connections
[04:34:54] 
[04:34:54] + Processing work unit
[04:34:54] Core required: FahCore_a2.exe
[04:34:54] Core found.
[04:34:54] Working on Unit 03 [August 29 04:34:54]
[04:34:54] + Working ...
[04:34:54] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 03 -checkpoint 15 -forceasm -verbose -lifeline 1051 -version 602'

[04:34:54] 
[04:34:54] *------------------------------*
[04:34:54] Folding@Home Gromacs SMP Core
[04:34:54] Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)
[04:34:54] 
[04:34:54] Preparing to commence simulation
[04:34:54] - Ensuring status. Please wait.
[04:35:04] - Assembly optimizations manually forced on.
[04:35:04] - Not checking prior termination.
[04:35:06] - Expanded 4923482 -> 24360573 (decompressed 494.7 percent)
[04:35:06] Called DecompressByteArray: compressed_data_size=4923482 data_size=24360573, decompressed_data_size=24360573 diff=0
[04:35:07] - Digital signature verified
[04:35:07] 
[04:35:07] Project: 2662 (Run 2, Clone 5, Gen 20)
[04:35:07] 
[04:35:07] Assembly optimizations on if available.
[04:35:07] Entering M.D.
[06:51:54] CoreStatus = FF (255)
[06:51:54] Client-core communications error: ERROR 0xff
[06:51:54] Deleting current work unit & continuing...
[06:52:09] - Warning: Could not delete all work unit files (3): Core file absent
[06:52:09] Trying to send all finished work units
[06:52:09] + No unsent completed units remaining.
[06:52:09] - Preparing to get new work unit...
[06:52:09] + Attempting to get work packet
[06:52:09] - Will indicate memory of 1003 MB
[06:52:09] - Connecting to assignment server
[06:52:09] Connecting to http://assign.stanford.edu:8080/
[06:52:09] Posted data.
[06:52:09] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[06:52:09] + News From Folding@Home: Welcome to Folding@Home
[06:52:09] Loaded queue successfully.
[06:52:09] Connecting to http://171.64.65.56:8080/
[06:52:14] Posted data.
[06:52:14] Initial: 0000; - Receiving payload (expected size: 4923994)
[06:52:18] - Downloaded at ~1202 kB/s
[06:52:18] - Averaged speed for that direction ~1324 kB/s
[06:52:18] + Received work.
[06:52:18] + Closed connections
[06:52:23] 
[06:52:23] + Processing work unit
[06:52:23] Core required: FahCore_a2.exe
[06:52:23] Core found.
[06:52:23] Working on Unit 04 [August 29 06:52:23]
[06:52:23] + Working ...
[06:52:23] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 04 -checkpoint 15 -forceasm -verbose -lifeline 1051 -version 602'

[06:52:23] 
[06:52:23] *------------------------------*
[06:52:23] Folding@Home Gromacs SMP Core
[06:52:23] Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)
[06:52:23] 
[06:52:23] Preparing to commence simulation
[06:52:23] - Ensuring status. Please wait.
[06:52:33] - Assembly optimizations manually forced on.
[06:52:33] - Not checking prior termination.
[06:52:35] - Expanded 4923482 -> 24360573 (decompressed 494.7 percent)
[06:52:35] Called DecompressByteArray: compressed_data_size=4923482 data_size=24360573, decompressed_data_size=24360573 diff=0
[06:52:36] - Digital signature verified
[06:52:36] 
[06:52:36] Project: 2662 (Run 2, Clone 5, Gen 20)
[06:52:36] 
[06:52:36] Assembly optimizations on if available.
[06:52:36] Entering M.D.
[10:05:30] - Autosending finished units...
[10:05:30] Trying to send all finished work units
[10:05:30] + No unsent completed units remaining.
[10:05:30] - Autosend completed
[13:08:15] Completed 47510 out of 4750001 steps  (1%)
NOTE: See the log between [06:52:36] and [10:05:30]: no progress lines at all, which has me confused.
I have noticed this anomaly on several other a2 WUs, but they ultimately completed within the preferred deadline.

Second run hung at [13:08:15].

Anyone else complete this one?
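
For anyone who wants to spot this kind of silence without reading the whole log, here is a minimal sketch that flags long gaps between consecutive timestamped lines. The function name `find_gaps` and the one-hour threshold are my own choices, not part of the client. Because the log carries no dates, the sketch assumes the lines are in order and that a clock going backwards means it rolled past midnight.

```python
import re
from datetime import timedelta

# Matches the "[HH:MM:SS]" prefix used by the client log shown above.
STAMP = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]")

def find_gaps(lines, threshold=timedelta(hours=1)):
    """Return (previous_line, next_line, gap) for each silence longer than threshold."""
    gaps = []
    prev_secs = None
    prev_line = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip untimestamped lines (blank lines, forum text)
        hours, minutes, seconds = (int(part) for part in match.groups())
        secs = hours * 3600 + minutes * 60 + seconds
        if prev_secs is not None:
            # Assumption: no dates in the log, so a negative difference
            # is treated as a wrap past midnight.
            gap = timedelta(seconds=(secs - prev_secs) % 86400)
            if gap > threshold:
                gaps.append((prev_line.rstrip(), line.rstrip(), gap))
        prev_secs, prev_line = secs, line
    return gaps

if __name__ == "__main__":
    sample = [
        "[06:52:36] Entering M.D.",
        "[10:05:30] - Autosending finished units...",
        "[10:05:30] - Autosend completed",
        "[13:08:15] Completed 47510 out of 4750001 steps  (1%)",
    ]
    for before, after, gap in find_gaps(sample):
        print(f"{gap} of silence between:\n  {before}\n  {after}")
```

On the log posted above, a one-hour threshold should flag three silences: the run that ended in ERROR 0xff, the stretch from [06:52:36] to [10:05:30], and the wait before the 1% line at [13:08:15].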

Re: Project: 2662 (Run 2, Clone 5, Gen 20) - ERROR 0xff (etc.)

Posted: Fri Aug 29, 2008 4:08 pm
by kasson
No one else has completed it, and it looks like a bad WU (quick check server-side). I've stopped new assigns of this work unit (if we can figure out how to unwind one and re-do gen 19, we might start back up again).

Re: Project: 2662 (Run 2, Clone 5, Gen 20) - ERROR 0xff (etc.)

Posted: Fri Aug 29, 2008 5:09 pm
by 314159
Thank you for the quick response and action, Dr. K. :)

I suspect that there are quite a few faulty WUs that remain unreported since many do not examine their queues or review WU progress.

Using InCrease on my main Mac server, I can pick these up readily and intend to continue reporting them.
I encourage others to do the same; i.e. the reporting part.
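
For those without a monitoring tool, a rough sketch of how to pull reportable failures out of a client log is below. It only relies on the two line formats visible in the log earlier in this thread ("Project: ... (Run, Clone, Gen)" and "CoreStatus = FF (255)"). The function name `core_failures` and the `bad_codes` parameter are hypothetical, and treating only FF as a failure is an assumption; add other codes if you know them to be errors.

```python
import re

# Line formats taken from the client log quoted earlier in the thread.
PROJECT = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")

def core_failures(lines, bad_codes=("FF",)):
    """Return ((project, run, clone, gen), code) for each failing CoreStatus line.

    The WU identity is the most recent "Project:" line seen before the
    status line, or None if no such line has appeared yet.
    """
    current = None
    failures = []
    for line in lines:
        project = PROJECT.search(line)
        if project:
            current = tuple(int(value) for value in project.groups())
            continue
        status = STATUS.search(line)
        if status and status.group(1).upper() in bad_codes:
            failures.append((current, status.group(1).upper()))
    return failures

if __name__ == "__main__":
    sample = [
        "[04:35:07] Project: 2662 (Run 2, Clone 5, Gen 20)",
        "[04:35:07] Entering M.D.",
        "[06:51:54] CoreStatus = FF (255)",
    ]
    for wu, code in core_failures(sample):
        print(f"Project/Run/Clone/Gen {wu} ended with CoreStatus {code}")
```

The output gives the Project, Run, Clone and Gen numbers needed for a report title like the one on this thread.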

I have "quite a few" machines folding as you probably know.
Most failures are being experienced on the Quad Linux machines (Q6600s and better).
The failure rate is about 1 or 2 per 30 to 40 WUs (roughly 2.5% to 7%), and I can live with that (in the name of science).

There is no discernible pattern to the failures, and they are not related to core type.