Re: Project: 2669 (Run 9, Clone 106, Gen 0) [problems sending]
Posted: Thu Sep 11, 2008 3:42 pm
Thank you for your beta test report regarding this persistent bug. Three A1 cores still running isn't right, but so far nobody knows what causes it or how to prevent it. Anything else that you discover may be the clue that allows the cause to be determined and a remedy to be programmed.
My theory is that once a WU reaches 100%, MPI is supposed to shut down all four copies of the core and return control to the client, but for some reason not all of the cores shut down. (In your case, one probably terminated normally and the other three continued to run in spite of being told to quit.) I do not believe this has anything to do with the specific WU being processed, so the title of this thread may be inappropriate: the generic problem is that FAH hung after reaching 100%. Either way, it's difficult to know.
Aardvark:
* Have you tried downloading the newest client (6.22 R3) from viewtopic.php?f=46&t=4913? This contains a fix that might help with this problem.
* I notice you've used the -pause parameter. I don't know if anybody has tested that parameter thoroughly yet. Does the same problem occur if you remove it?