Project: 5506 (Run 6, Clone 324, Gen 171)

Moderators: Site Moderators, FAHC Science Team

Jester
Posts: 102
Joined: Sun Mar 30, 2008 1:03 pm

Project: 5506 (Run 6, Clone 324, Gen 171)

Post by Jester »

This one errored out before completing the first frame, 5 times..... :roll:
I discovered the client "sleeping" this morning and restarted it. It tried a sixth time, failed before the first frame again, and then finally picked up a different WU.
Folding without issue since.
Drugless
Posts: 58
Joined: Wed Jan 09, 2008 7:55 pm
Location: Durban, South Africa

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by Drugless »

I can report the same issue with this WU. Since the last EUE:
[12:32:25] Project: 5506 (Run 1, Clone 122, Gen 148) [14:14:45] Completed 77% [14:15:38] Run: exception thrown during GuardedRun

This GPU completed a number of WUs (5506, another 5506, then 5016) successfully, then loaded:
[20:53:43] Project: 5506 (Run 6, Clone 324, Gen 171) mdrun_gpu returned NANs detected on GPU
[20:54:05] Project: 5506 (Run 6, Clone 324, Gen 171) mdrun_gpu returned NANs detected on GPU
[20:54:27] Project: 5506 (Run 6, Clone 324, Gen 171) mdrun_gpu returned NANs detected on GPU
[20:54:48] Project: 5506 (Run 6, Clone 324, Gen 171) mdrun_gpu returned NANs detected on GPU
[20:55:10] Project: 5506 (Run 6, Clone 324, Gen 171) mdrun_gpu returned NANs detected on GPU
[20:55:20] EUE limit exceeded. Pausing 24 hours.
Each stopped before finishing a single frame.
Stopped Client, Restarted Client
Still a problem.
[21:41:10] Project: 5506 (Run 6, Clone 324, Gen 171) [21:41:17] mdrun_gpu returned [21:41:17] NANs detected on GPU
Client then loaded (without me fiddling)
[21:41:48] Project: 5506 (Run 0, Clone 264, Gen 213) and is working fine on that one.

Code: Select all

[21:41:10] Project: 5506 (Run 6, Clone 324, Gen 171)
[21:41:10] 
[21:41:10] Assembly optimizations on if available.
[21:41:10] Entering M.D.
[21:41:17] Working on p5506_supervillin_e1
[21:41:17] Client config found, loading data.
[21:41:17] mdrun_gpu returned 
[21:41:17] NANs detected on GPU
[21:41:17] 
[21:41:17] Folding@home Core Shutdown: UNSTABLE_MACHINE
[21:41:20] CoreStatus = 7A (122)
9800GX2 - AirCooled, Peak Temp:70C (other 5506's) Standard Clocks.
EDIT: As a matter of interest, has the 1.19 core been forced down yet to replace the 1.15? The client above is running the 1.15 core (as are a few of my other clients), and I have a mixture of 1.19 and 1.15 cores running 5506s right now. Not sure if this specific WU would fare better on a 1.19 core. :?:
Folding Tools:8 X PS3's, 5 x GTX280,1 x 8800GS, 8 x 9800GX2 GPU's
Jester
Posts: 102
Joined: Sun Mar 30, 2008 1:03 pm

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by Jester »

I had the 1.19 core forced onto one of my GPUs when it downloaded a p5800; after that, the following p5506 WUs ran 10-15% slower.....
My other question would be: how come a single "faulty" WU can be downloaded 5 times and shut down an otherwise stable rig for 24 hrs (lucky I was home over the weekend)?
Drugless
Posts: 58
Joined: Wed Jan 09, 2008 7:55 pm
Location: Durban, South Africa

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by Drugless »

Jester wrote:I had the 1.19 core forced onto one of my GPUs when it downloaded a p5800; after that, the following p5506 WUs ran 10-15% slower.....
I had the EXACT same experience on one of my other clients. Also a P5800. :o
Jester wrote:My other question would be: how come a single "faulty" WU can be downloaded 5 times and shut down an otherwise stable rig for 24 hrs (lucky I was home over the weekend)?
Now what's going on?
How about this:
I go to bed:
The same WU we reported earlier as failing has been downloaded again (same rig but a different client: GPU0 now, GPU3 last time).
After successfully completing: [23:53:15] Project: 5506 (Run 7, Clone 285, Gen 207) [02:05:04] Completed 100% [02:05:04] Successful run
I get: {Drum roll, please} Come on, take a guess. Man, this looks familiar. :twisted:
[02:05:58] Project: 5506 (Run 6, Clone 324, Gen 171) [02:06:05] mdrun_gpu returned [02:06:05] NANs detected on GPU
[02:06:19] Project: 5506 (Run 6, Clone 324, Gen 171) [02:06:26] mdrun_gpu returned [02:06:26] NANs detected on GPU
[02:06:41] Project: 5506 (Run 6, Clone 324, Gen 171) [02:06:48] mdrun_gpu returned [02:06:48] NANs detected on GPU
[02:07:03] Project: 5506 (Run 6, Clone 324, Gen 171) [02:07:10] mdrun_gpu returned [02:07:10] NANs detected on GPU
[02:07:24] Project: 5506 (Run 6, Clone 324, Gen 171) [02:07:31] mdrun_gpu returned [02:07:31] NANs detected on GPU
[02:07:34] EUE limit exceeded. Pausing 24 hours.
I wake up and find the client pausing again.
I stop and restart the client and it downloads a [03:04:35] Project: 5013 (Run 6, Clone 335, Gen 176) and is working fine.
What a waste of an hour. I know I'm producing a fair number of WUs (no brag intended), but what are the chances of receiving the EXACT SAME WU 4 hours later? Or has it been failing so many times for other users that it finally got back to me again?
NB. I currently have 12 clients busy with other 5506s.
Why doesn't the client load a different WU (R,C,G) after the first failure and download something else? That's how all the other failures seem to work, or am I mistaken?
As Jester said, '(lucky I was home over the weekend)'
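For anyone who wants to check their own log for this pattern, here is a minimal Python sketch that counts how often each Project/Run/Clone/Gen ends in "NANs detected on GPU". It assumes the log line format shown in the excerpts in this thread; the function name `count_nan_failures` and the regular expression are illustrative and not part of the client.

```python
import re
from collections import Counter

# Matches entries such as:
# [21:41:10] Project: 5506 (Run 6, Clone 324, Gen 171)
PRCG_RE = re.compile(
    r"\[\d\d:\d\d:\d\d\] Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)"
)

def count_nan_failures(log_text):
    """Return a Counter mapping (project, run, clone, gen) to NaN failures."""
    failures = Counter()
    current = None  # WU most recently announced in the log
    for line in log_text.splitlines():
        match = PRCG_RE.search(line)
        if match:
            current = tuple(int(g) for g in match.groups())
        # Not "elif": some pasted excerpts put both entries on one line.
        if "NANs detected on GPU" in line and current is not None:
            failures[current] += 1
            current = None  # count each attempt only once
    return failures

sample = """\
[21:41:10] Project: 5506 (Run 6, Clone 324, Gen 171)
[21:41:17] mdrun_gpu returned
[21:41:17] NANs detected on GPU
[21:41:20] CoreStatus = 7A (122)
[21:41:48] Project: 5506 (Run 0, Clone 264, Gen 213)
"""

for wu, n in count_nan_failures(sample).items():
    print(wu, "failed", n, "time(s)")
```

A WU that shows up with a count of 5 right before "EUE limit exceeded" is the kind of repeat offender described above.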
toTOW
Site Moderator
Posts: 6432
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by toTOW »

It looks like Project: 5506 (Run 6, Clone 324, Gen 171) is a bad WU. Nobody was able to get credit for it :(

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
Jester
Posts: 102
Joined: Sun Mar 30, 2008 1:03 pm

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by Jester »

Hopefully it's been removed from the server....
Oldhat
Posts: 30
Joined: Mon Dec 03, 2007 11:42 am
Location: Auckland

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by Oldhat »

It may still be floating around at the moment as I got it this morning, although it could possibly be just due to timezone variations. :)

AMD 64 4000 stock 1Gb RAM 2 x 8800GS Windows XP

[19:38:41] Folding@home Core Shutdown: UNSTABLE_MACHINE
[19:38:44] CoreStatus = 7A (122)

Code: Select all

[19:37:53] + Attempting to send results [November 10 19:37:53 UTC]
[19:38:10] + Results successfully sent
[19:38:10] Thank you for your contribution to Folding@Home.
[19:38:10] + Number of Units Completed: 412

[19:38:14] - Preparing to get new work unit...
[19:38:14] + Attempting to get work packet
[19:38:14] - Connecting to assignment server
[19:38:14] - Successful: assigned to (171.64.65.106).
[19:38:14] + News From Folding@Home: GPU folding beta
[19:38:14] Loaded queue successfully.
[19:38:16] + Closed connections
[19:38:16] 
[19:38:16] + Processing work unit
[19:38:16] Core required: FahCore_11.exe
[19:38:16] Core found.
[19:38:16] Working on queue slot 05 [November 10 19:38:16 UTC]
[19:38:16] + Working ...
[19:38:16] 
[19:38:16] *------------------------------*
[19:38:16] Folding@Home GPU Core - Beta
[19:38:16] Version 1.19 (Mon Nov 3 09:34:13 PST 2008)
[19:38:16] 
[19:38:16] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[19:38:16] Build host: amoeba
[19:38:16] Board Type: Nvidia
[19:38:16] Core      : 
[19:38:16] Preparing to commence simulation
[19:38:16] - Looking at optimizations...
[19:38:16] - Created dyn
[19:38:16] - Files status OK
[19:38:16] - Expanded 45479 -> 246249 (decompressed 541.4 percent)
[19:38:16] Called DecompressByteArray: compressed_data_size=45479 data_size=246249, decompressed_data_size=246249 diff=0
[19:38:16] - Digital signature verified
[19:38:16] 
[19:38:16] Project: 5506 (Run 6, Clone 324, Gen 171)
[19:38:16] 
[19:38:16] Assembly optimizations on if available.
[19:38:16] Entering M.D.
[19:38:22] Working on p5506_supervillin_e1
[19:38:23] Client config found, loading data.
[19:38:23] mdrun_gpu returned 
[19:38:23] NANs detected on GPU
[19:38:23] 
[19:38:23] Folding@home Core Shutdown: UNSTABLE_MACHINE
[19:38:26] CoreStatus = 7A (122)
[19:38:26] Sending work to server
[19:38:26] Project: 5506 (Run 6, Clone 324, Gen 171)
[19:38:26] - Read packet limit of 540015616... Set to 524286976.
[19:38:26] - Error: Could not get length of results file work/wuresults_05.dat
[19:38:26] - Error: Could not read unit 05 file. Removing from queue.
[19:38:26] - Preparing to get new work unit...
[19:38:26] + Attempting to get work packet
[19:38:26] - Connecting to assignment server
[19:38:27] - Successful: assigned to (171.64.65.106).
[19:38:27] + News From Folding@Home: GPU folding beta
[19:38:27] Loaded queue successfully.
[19:38:29] + Closed connections
[19:38:34] 
[19:38:34] + Processing work unit
[19:38:34] Core required: FahCore_11.exe
[19:38:34] Core found.
[19:38:34] Working on queue slot 06 [November 10 19:38:34 UTC]
[19:38:34] + Working ...
[19:38:34] 
[19:38:34] *------------------------------*
[19:38:34] Folding@Home GPU Core - Beta
[19:38:34] Version 1.19 (Mon Nov 3 09:34:13 PST 2008)
[19:38:34] 
[19:38:34] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[19:38:34] Build host: amoeba
[19:38:34] Board Type: Nvidia
[19:38:34] Core      : 
[19:38:34] Preparing to commence simulation
[19:38:34] - Looking at optimizations...
[19:38:34] - Created dyn
[19:38:34] - Files status OK
[19:38:34] - Expanded 45479 -> 246249 (decompressed 541.4 percent)
[19:38:34] Called DecompressByteArray: compressed_data_size=45479 data_size=246249, decompressed_data_size=246249 diff=0
[19:38:34] - Digital signature verified
[19:38:34] 
[19:38:34] Project: 5506 (Run 6, Clone 324, Gen 171)
[19:38:34] 
[19:38:34] Assembly optimizations on if available.
[19:38:34] Entering M.D.
[19:38:40] Working on p5506_supervillin_e1
[19:38:41] Client config found, loading data.
[19:38:41] mdrun_gpu returned 
[19:38:41] NANs detected on GPU
[19:38:41] 
[19:38:41] Folding@home Core Shutdown: UNSTABLE_MACHINE
[19:38:44] CoreStatus = 7A (122)
[19:38:44] Sending work to server
Cheers
Oldhat
Posts: 30
Joined: Mon Dec 03, 2007 11:42 am
Location: Auckland

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by Oldhat »

Oh joy, same PC and same card. Different time, Same result.
This is after completing two different work units.

Code: Select all

[04:48:32] Project: 5506 (Run 6, Clone 324, Gen 171)
[04:48:32] 
[04:48:32] Assembly optimizations on if available.
[04:48:32] Entering M.D.
[04:48:38] Working on p5506_supervillin_e1
[04:48:39] Client config found, loading data.
[04:48:39] mdrun_gpu returned 
[04:48:39] NANs detected on GPU
[04:48:39] 
[04:48:39] Folding@home Core Shutdown: UNSTABLE_MACHINE
[04:48:42] CoreStatus = 7A (122)
Jester
Posts: 102
Joined: Sun Mar 30, 2008 1:03 pm

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by Jester »

How many times does a WU have to be seen to fail instantly before it's removed from the server?
Not a rhetorical question: is there a set number of failures before it's officially flagged as "bad"...... 10, 20, 50??
susato
Site Moderator
Posts: 511
Joined: Fri Nov 30, 2007 4:57 am
Location: Team MacOSX

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by susato »

Heh, it just shut down my machine with an "Unstable Machine" failure, and it has failed 23 other times since Drugless first attempted it on October 14 2008.

I'll email the researcher in charge to request a quick recall of this troublemaker of a WU.
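
To make the "set number of failures" question concrete, here is a purely hypothetical sketch of what such a rule could look like. Nothing in this thread says how the Folding@home servers actually decide this; the class name `BadWorkUnitTracker`, the threshold of 3, and the idea of counting distinct donors are all assumptions for illustration.

```python
class BadWorkUnitTracker:
    """Hypothetical rule: flag a WU once enough distinct donors fail it."""

    def __init__(self, max_failed_donors=3):
        self.max_failed_donors = max_failed_donors
        self.failed_donors = {}  # WU -> set of donors that reported a failure
        self.completed = set()   # WUs that at least one donor finished

    def report(self, wu, donor, success):
        if success:
            self.completed.add(wu)
        else:
            self.failed_donors.setdefault(wu, set()).add(donor)

    def is_suspect(self, wu):
        # A WU that somebody completed is not treated as bad.
        if wu in self.completed:
            return False
        return len(self.failed_donors.get(wu, ())) >= self.max_failed_donors

tracker = BadWorkUnitTracker(max_failed_donors=3)
wu = (5506, 6, 324, 171)
# Repeat failures from the same donor count only once.
for donor in ("Jester", "Drugless", "Drugless", "Oldhat"):
    tracker.report(wu, donor, success=False)
print(tracker.is_suspect(wu))
```

Counting distinct donors instead of raw failures is a deliberate choice in this sketch: one unstable machine failing the same WU five times should not be enough to condemn it.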
Drugless
Posts: 58
Joined: Wed Jan 09, 2008 7:55 pm
Location: Durban, South Africa

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by Drugless »

As a matter of interest (probably the wrong thread for this), how does this type of 'bad' WU affect the science of 5506? I have searched through the threads for an answer but it's just too much to dig through. I assume this 'bad' WU is recalled, examined, fixed and put out to be processed again. If not, surely it makes the entire 5506 a dud / incomplete result? If someone can guide me to a thread with the answer, please do.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by bruce »

Drugless wrote:As a matter of interest (probably the wrong thread for this), how does this type of 'bad' WU affect the science of 5506? I have searched through the threads for an answer but it's just too much to dig through. I assume this 'bad' WU is recalled, examined, fixed and put out to be processed again. If not, surely it makes the entire 5506 a dud / incomplete result? If someone can guide me to a thread with the answer, please do.
I'm not sure if there is a single answer to your question.

One possible answer is that the atoms collided -- or got close enough to each other that the simulation failed. It may be easier to compare it to something we're more comfortable thinking about -- space travel. Newton's laws allow us to compute a trajectory to the moon or to Mars, but those same equations don't work correctly inside the atmosphere. Including atmospheric terms everywhere would add a lot of unnecessary calculation in deep space (where drag is zero), so the simpler model is used, and it breaks down when conditions fall outside what it was built for.
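
To illustrate how "atoms too close together" can turn into "NANs detected on GPU", here is a toy Python sketch. It is not the GPU core's code: it assumes a plain Lennard-Jones pair energy in reduced units and assumes single-precision arithmetic, which is emulated by rounding every intermediate result to float32. The helper names `to_float32` and `lj_energy_f32` are made up for this example.

```python
import math
import struct

def to_float32(x):
    """Round a Python float to single precision; overflow becomes +/-inf."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def lj_energy_f32(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones energy 4*eps*((sigma/r)^12 - (sigma/r)^6),
    rounded to float32 after every step."""
    s = to_float32(sigma / r)
    s2 = to_float32(s * s)
    s6 = to_float32(to_float32(s2 * s2) * s2)
    s12 = to_float32(s6 * s6)
    # If both terms have overflowed, this is inf - inf, which is NaN.
    return to_float32(4.0 * epsilon * (s12 - s6))

for r in (1.5, 1.0, 0.5, 1e-3, 1e-7):
    print(f"r = {r:g}: energy = {lj_energy_f32(r)}")
```

At ordinary separations the energy is a modest number, it grows very quickly as the two atoms approach, and at an absurdly small separation both terms overflow and their difference is no longer a number. Once a NaN appears in one atom's energy or force it spreads to everything that atom interacts with, so stopping the run is a reasonable response.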
Drugless
Posts: 58
Joined: Wed Jan 09, 2008 7:55 pm
Location: Durban, South Africa

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by Drugless »

Thanks Bruce, but I'm actually asking something slightly different (maybe I asked the wrong question). The RCG of 5506 above failed to be processed.
Scenario 1: The RCG above failed due to some scientific issue, as you describe.
Scenario 2: Let's say, for example, that there is NOT a science issue with that particular RCG, and 'somehow' it has a 'flaw' that prevented it from running (I dunno, e.g. the app that splits the 5506 project into RCGs added a ; instead of a - somewhere) <- Maybe wishful thinking, but hey, 'sht' happens. The fact that it failed before even processing a frame implies that 'possibly' there's a technical problem and not a scientific problem with this particular RCG?
There are probably 100s of other scenarios, but anyway:
I assume the scientists have a way of establishing what the cause was, and then:
1. Scenario 1: Science response: "Hey, so that's what happens with this particular project, therefore if we do this and that, etc." RESULT: Something learnt.
2. Scenario 2: Science response: "Hey, something messed with this RCG, let's fix it and send it out for processing again." RESULT: 5506 final results will be a little delayed.
Now my question: if scenario 2 is the reason and the particular RCG is not fixed and recycled, is 5506 then deemed a complete dud? Surely that particular RCG (if all the others returned a result) is what would make the whole 5506 project 'complete'?

Maybe I'm questioning something that my pea brain will never understand, but I just wanted to TRY to understand the PROCESS a little better, rather than the technical science.
( I gave up Biology the second they told me to cut open the frog! - Sorry Uncle Fungus) ;-)
Sorry, I KNEW I should have posted this in the other thread!
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by bruce »

I've seen something like scenario 2, caused by human error (a bad parameter) or by the RC splitter (illogical conditions) -- but it EUEd at the beginning of Gen 0 and would have been stopped during beta testing. I'm sure other things happen, but it's rare that we get a detailed explanation.
MstrBlstr
Posts: 578
Joined: Thu Nov 29, 2007 7:03 pm
Location: Texas

Re: Project: 5506 (Run 6, Clone 324, Gen 171)

Post by MstrBlstr »

Removed double posting.

Double posting is against site policy.

Other post here >> viewtopic.php?f=19&t=6984
-=MB=-