Project: 6051:(Run 0, Clone 130, Gen 4)

Posted: Fri Apr 23, 2010 8:10 am
by Aardvark
Another case of an "early fail" WU using the A3 core (2.17) while running OS X 10.6.3. However, things are getting more serious, if that is possible. In the past, these failures occurred before the WU reached 7% completion. This WU failed just before reaching 26%. The Client (6.29r3) was not able to package the work files, so there was nothing to return to Stanford. NO RETURN, NO POINTS FOR PARTIAL COMPLETION. Nada, Zilch, Nothing...

Is there going to be a fix for this Problem??? The Question MUST be asked Over and Over and Over until something effective is done. I had stopped posting entries in this Forum about these "early fail" WUs because they are just everyday occurrences. I understand the PG is working on the problem, but I think the active Folders are due something more than a casual comment. I thought this new group of 605X WUs was going to run Cleanly. IT ISN'T THE CASE.

Those who share my concern should return to posting about these failures in this Forum. Apparently the Squeaky Wheel syndrome is in effect here. If you want the Grease, you had better Squeak (loudly), because it's a long way to Palo Alto and it may be difficult to prioritize which wheel is gonna get Greased.

The Log File for this broken WU follows:

Code:

# Mac OS X SMP Console Edition ################################################
###############################################################################

                       Folding@Home Client Version 6.29r3

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/tedkreuserIII/Library/Folding@home
Executable: ./fah6
Arguments: -local -verbosity 9 -smp -advmethods 

[15:31:56] - Ask before connecting: Yes
[15:31:56] - User name: Aardvark (Team 48057)
[15:31:56] - User ID: XXXXXXXXXXXXXXXXXX
[15:31:56] - Machine ID: 1
[15:31:56] 
[15:31:56] Loaded queue successfully.
[15:31:56] 
[15:31:56] + Processing work unit
[15:31:56] Core required: FahCore_a3.exe
[15:31:56] Core found.
[15:31:56] - Autosending finished units... [April 22 15:31:56 UTC]
[15:31:56] Trying to send all finished work units
[15:31:56] + No unsent completed units remaining.
[15:31:56] - Autosend completed
[15:31:56] Working on queue slot 04 [April 22 15:31:56 UTC]
[15:31:56] + Working ...
[15:31:56] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 04 -np 2 -checkpoint 15 -verbose -lifeline 5262 -version 629'

[15:31:57] 
[15:31:57] *------------------------------*
[15:31:57] Folding@Home Gromacs SMP Core
[15:31:57] Version 2.17 (Mar 7 2010)
[15:31:57] 
[15:31:57] Preparing to commence simulation
[15:31:57] - Looking at optimizations...
[15:31:57] - Files status OK
[15:31:57] - Expanded 1766262 -> 2252021 (decompressed 127.5 percent)
[15:31:57] Called DecompressByteArray: compressed_data_size=1766262 data_size=2252021, decompressed_data_size=2252021 diff=0
[15:31:57] - Digital signature verified
[15:31:57] 
[15:31:57] Project: 6051 (Run 0, Clone 130, Gen 4)
[15:31:57] 
[15:31:57] Assembly optimizations on if available.
[15:31:57] Entering M.D.
[15:32:03] Using Gromacs checkpoints
Starting 2 threads
NNODES=2, MYRANK=0, HOSTNAME=thread #0
NNODES=2, MYRANK=1, HOSTNAME=thread #1
Reading file work/wudata_04.tpr, VERSION 4.0.99_development_20090605 (single precision)

Reading checkpoint file work/wudata_04.cpt generated: Thu Apr 22 10:13:44 2010

Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Mutant_scan'
2500000 steps,   5000.0 ps (continuing from step 2108251,   4216.5 ps).
[15:32:04] Resuming from checkpoint
[15:32:04] Verified work/wudata_04.log
[15:32:05] Verified work/wudata_04.trr
[15:32:05] Verified work/wudata_04.edr
[15:32:05] Completed 108251 out of 500000 steps  (21%)
[15:39:33] Completed 110000 out of 500000 steps  (22%)
[16:00:51] Completed 115000 out of 500000 steps  (23%)
[16:22:12] Completed 120000 out of 500000 steps  (24%)
[16:43:30] Completed 125000 out of 500000 steps  (25%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
3 particles communicated to PME node 1 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[17:01:37] mdrun returned 255
[17:01:37] Going to send back what have done -- stepsTotalG=500000
[17:01:37] Work fraction=0.2585 steps=500000.
[17:01:38] CoreStatus = 0 (0)
[17:01:38] Sending work to server
[17:01:38] Project: 6051 (Run 0, Clone 130, Gen 4)
[17:01:38] - Error: Could not get length of results file work/wuresults_04.dat
[17:01:38] - Error: Could not read unit 04 file. Removing from queue.
[17:01:38] Trying to send all finished work units
[17:01:38] + No unsent completed units remaining.
[17:01:38] - Preparing to get new work unit...
[17:01:39] > Press "c" to connect to the server to download unit


Re: Project: 6051:(Run 0, Clone 130, Gen 4)

Posted: Fri Apr 23, 2010 8:30 am
by bruce
Aardvark wrote: Is there going to be a fix for this Problem??? The Question MUST be asked Over and Over and Over until something effective is done. I had stopped posting entries in this Forum about these "early fail" WUs because they are just everyday occurrences. I understand the PG is working on the problem, but I think the active Folders are due something more than a casual comment.
Asking the question Over and Over and Over again is SPAM. If you get SPAM, does it make you purchase the sender's product any sooner than if you hadn't gotten the SPAM?

The Pande Group (Kasson) has said that they're working on it. Something changed in the latest updates to OS X that broke FAH, and apparently it's a very difficult bug to fix. You won't get a better answer than that until it's fixed.

Re: Project: 6051:(Run 0, Clone 130, Gen 4)

Posted: Fri Apr 23, 2010 8:45 am
by Aardvark
@Bruce,

Your point is well taken, but I was not suggesting SPAM. What I was suggesting, even if I stated it poorly, is that when a failure occurs it should be posted and a correction requested. I fully understand that Kasson is our point man on this effort. What I do not understand is how much support he is getting from the PG in cleaning out the Augean Stable.

Your comment seems to suggest that we are in some sort of market-based purchasing situation. That is NOT the case. We are partners in this prestigious Distributed Computing project. Should we not expect treatment commensurate with that position?

Re: Project: 6051:(Run 0, Clone 130, Gen 4)

Posted: Fri Apr 23, 2010 4:14 pm
by codysluder
FAH is not market-based. The science is primo.

Have you ever had a family member nag you about something they wanted you to do when you weren't able to do it? Does their asking over and over make you drop whatever else you're working on and do it? It's a matter of trust. I trust that the PG members really are dedicated to having FAH succeed. They know more about the priorities than we do. Nagging them doesn't help.

Re: Project: 6051:(Run 0, Clone 130, Gen 4)

Posted: Fri Apr 23, 2010 8:29 pm
by Aardvark
@codysluder,
Your faith is, if nothing else, admirable...

I have no way of knowing whether the PG/Stanford Gurus understand just what is going on in this particular situation. I have no way of knowing just how priorities are sorted out and matched with available resources. I don't come from a background that accepts "sit on your backside and let it sort itself out" as the correct path. Sorry about that, if it bothers you.

I will be so crass as to repeat a very basic principle. We are PARTNERS in a very admirable DISTRIBUTED COMPUTING PROJECT. I believe that the effort, resources, and EXPENSES that we bring to this PARTNERSHIP entitle us to a better flow of information than the PG seems comfortable with.