Page 1 of 1

Funny behaviour with Project: 6069 (Run 0, Clone 1, Gen 0)

Posted: Mon Apr 19, 2010 9:15 pm
by Mactin
When I arrived from work, the SMP2 console was stuck at 11:20:04.
Sinse it looked like it was stuck for the past 9.5 hours, I then stopeed it with Ctrl-C.

Code: Select all

[10:46:10] Folding@Home Gromacs SMP Core
[10:46:10] Version 2.17 (Mar 12, 2010)
[10:46:10] 
[10:46:10] Preparing to commence simulation
[10:46:10] - Assembly optimizations manually forced on.
[10:46:10] - Not checking prior termination.
[10:46:11] - Expanded 1389930 -> 2252229 (decompressed 162.0 percent)
[10:46:11] Called DecompressByteArray: compressed_data_size=1389930 data_size=2252229, decompressed_data_size=2252229 diff=0
[10:46:11] - Digital signature verified
[10:46:11] 
[10:46:11] Project: 6069 (Run 0, Clone 1, Gen 0)
[10:46:11] 
[10:46:11] Assembly optimizations on if available.
[10:46:11] Entering M.D.
[10:46:17] Completed 0 out of 500000 steps  (0%)
[10:48:52] Completed 5000 out of 500000 steps  (1%)
[10:51:28] Completed 10000 out of 500000 steps  (2%)
[10:54:04] Completed 15000 out of 500000 steps  (3%)
[10:56:41] Completed 20000 out of 500000 steps  (4%)
[10:59:17] Completed 25000 out of 500000 steps  (5%)
[11:01:53] Completed 30000 out of 500000 steps  (6%)
[11:04:30] Completed 35000 out of 500000 steps  (7%)
[11:07:05] Completed 40000 out of 500000 steps  (8%)
[11:09:40] Completed 45000 out of 500000 steps  (9%)
[11:12:15] Completed 50000 out of 500000 steps  (10%)
[11:14:51] Completed 55000 out of 500000 steps  (11%)
[11:17:26] Completed 60000 out of 500000 steps  (12%)
[11:20:04] Completed 65000 out of 500000 steps  (13%)
[16:47:08] - Autosending finished units... [April 19 16:47:08 UTC]
[20:53:31] Completed 70000 out of 500000 steps  (14%)
[20:53:31] Trying to send all finished work units
[20:53:31] Killing all core threads
[20:53:31] Completed 75000 out of 500000 steps  (15%)
[20:53:31] + No unsent completed units remaining.
[20:53:31] Could not get process id information.  Please kill core process manually
[20:53:31] Completed 80000 out of 500000 steps  (16%)
[20:53:31] - Autosend completed

Folding@Home Client Shutdown at user request.
[20:53:31] Completed 85000 out of 500000 steps  (17%)
[20:53:31] Completed 90000 out of 500000 steps  (18%)
[20:53:31] Completed 95000 out of 500000 steps  (19%)
[20:53:31] Completed 100000 out of 500000 steps  (20%)
[20:53:31] Completed 105000 out of 500000 steps  (21%)
[20:53:31] Completed 110000 out of 500000 steps  (22%)
[20:53:31] Completed 115000 out of 500000 steps  (23%)
[20:53:31] Completed 120000 out of 500000 steps  (24%)
[20:53:31] Completed 125000 out of 500000 steps  (25%)
[20:53:31] Completed 130000 out of 500000 steps  (26%)
[20:53:31] ***** Got a SIGTERM signal (2)
[20:53:31] Killing all core threads
[20:53:31] Completed 135000 out of 500000 steps  (27%)
[20:53:31] Could not get process id information.  Please kill core process manually
[20:53:31] Completed 140000 out of 500000 steps  (28%)

Folding@Home Client Shutdown.
[20:53:31] Completed 145000 out of 500000 steps  (29%)
I then started the client again, and it gave me a MISSING_WORK_FILES error.
While I was copying the log to post it here, I noticed a whole bunch of stuff after 11:20:04 that was not on the screen.

Code: Select all

--- Opening Log file [April 19 20:53:41 UTC] 


# Windows SMP Console Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.29

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files (x86)\folding\m1 SMP
Executable: C:\Program Files (x86)\folding\m1 SMP\Folding@home-Win32-x86.exe
Arguments: -local -smp 10 -verbosity 9 -forceasm -config 

[20:53:41] - Ask before connecting: No
[20:53:41] - User name: Martin_i7-980x (Team 96377)
[20:53:41] - User ID: 6C6A12517E046D46
[20:53:41] - Machine ID: 1
[20:53:41] 
[20:53:41] Configuring Folding@Home...


[20:53:44] - Ask before connecting: No
[20:53:44] - User name: Martin_i7-980x (Team 96377)
[20:53:44] - User ID: 6C6A12517E046D46
[20:53:44] - Machine ID: 1
[20:53:44] 
[20:53:44] Loaded queue successfully.
[20:53:44] 
[20:53:44] - Autosending finished units... [April 19 20:53:44 UTC]
[20:53:44] + Processing work unit
[20:53:44] Trying to send all finished work units
[20:53:44] Core required: FahCore_a3.exe
[20:53:44] + No unsent completed units remaining.
[20:53:44] Core found.
[20:53:44] - Autosend completed
[20:53:44] Working on queue slot 08 [April 19 20:53:44 UTC]
[20:53:44] + Working ...
[20:53:44] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 08 -np 10 -checkpoint 10 -forceasm -verbose -lifeline 3336 -version 629'

[20:53:44] 
[20:53:44] *------------------------------*
[20:53:44] Folding@Home Gromacs SMP Core
[20:53:44] Version 2.17 (Mar 12, 2010)
[20:53:44] 
[20:53:44] Preparing to commence simulation
[20:53:44] - Assembly optimizations manually forced on.
[20:53:44] - Not checking prior termination.
[20:53:44] Error: Missing work file=<>
[20:53:44] 
[20:53:44] Folding@home Core Shutdown: MISSING_WORK_FILES
[20:53:48] CoreStatus = 74 (116)
[20:53:48] The core could not find the work files specified. Removing from queue
[20:53:48] Deleting current work unit & continuing...
[20:53:52] Trying to send all finished work units
[20:53:52] + No unsent completed units remaining.
[20:53:52] - Preparing to get new work unit...
[20:53:52] Cleaning up work directory
[20:53:52] + Attempting to get work packet
[20:53:52] Passkey found
[20:53:52] - Will indicate memory of 4096 MB
[20:53:52] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 12, Stepping: 2
[20:53:52] - Connecting to assignment server
[20:53:52] Connecting to http://assign.stanford.edu:8080/
[20:53:53] Posted data.
[20:53:53] Initial: ED82; - Successful: assigned to (130.237.232.140).
[20:53:53] + News From Folding@Home: Welcome to Folding@Home
[20:53:53] Loaded queue successfully.
[20:53:53] Connecting to http://130.237.232.140:8080/
[20:53:54] Posted data.
[20:53:54] Initial: 0000; - Receiving payload (expected size: 1797263)
[20:54:00] - Downloaded at ~292 kB/s
[20:54:00] - Averaged speed for that direction ~327 kB/s
[20:54:00] + Received work.
[20:54:00] + Closed connections
[20:54:05] 
[20:54:05] + Processing work unit
[20:54:05] Core required: FahCore_a3.exe
[20:54:05] Core found.
[20:54:05] Working on queue slot 09 [April 19 20:54:05 UTC]
[20:54:05] + Working ...
[20:54:05] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 09 -np 10 -checkpoint 10 -forceasm -verbose -lifeline 3336 -version 629'

[20:54:05] 
[20:54:05] *------------------------------*
[20:54:05] Folding@Home Gromacs SMP Core
[20:54:05] Version 2.17 (Mar 12, 2010)
[20:54:05] 
[20:54:05] Preparing to commence simulation
[20:54:05] - Assembly optimizations manually forced on.
[20:54:05] - Not checking prior termination.
[20:54:05] - Expanded 1796751 -> 2078149 (decompressed 115.6 percent)
[20:54:05] Called DecompressByteArray: compressed_data_size=1796751 data_size=2078149, decompressed_data_size=2078149 diff=0
[20:54:05] - Digital signature verified
[20:54:05] 
[20:54:05] Project: 6012 (Run 2, Clone 21, Gen 62)
[20:54:05] 
[20:54:05] Assembly optimizations on if available.
[20:54:05] Entering M.D.
[20:54:11] Completed 0 out of 500000 steps  (0%)
[20:56:50] Completed 5000 out of 500000 steps  (1%)
[20:59:28] Completed 10000 out of 500000 steps  (2%)
Two other WUs (one GPU and one Classic) on the same system are processing normally.

The system is a Core i7-980x with 12GB (OCed to 3.9GHZ) running Windows 7 Ultimate (french)

Re: Funny behaviour with Project: 6069 (Run 0, Clone 1, Gen 0)

Posted: Mon Apr 19, 2010 9:41 pm
by bruce
The Microsoft text window can be suspended in a way that prevents the application (FAH, in this case) from updating it. I don't remember the exact way to do it intentionally, but it sounds like what happened here. FAH continued to run and continued to send messages to the console but you didn't see them.

Re: Funny behaviour with Project: 6069 (Run 0, Clone 1, Gen 0)

Posted: Mon Apr 19, 2010 9:58 pm
by Mactin
Bruce,

I dont think so. If it had continued to run, the WU would have complete at around 15:00.

I checked the processes before stopping it and the "FahCore_a3.exe *32" process was at 0%.
And, the temps where consistent with this.
Wheter it was at 13% or 29%, it did freeze.

Re: Funny behaviour with Project: 6069 (Run 0, Clone 1, Gen 0)

Posted: Mon Apr 19, 2010 10:25 pm
by bruce
Maybe . . . maybe not.

Assuming that my guess is correct, FAH would have continued to write console data to the console window until the buffer filled up. I have no knowledge of Windows internals, but it seems logical to assume that the messages between 13% and 29% did, in fact fill up that buffer and FAH could no longer write to the screen, at which point it Windows would have suspended it until the buffer could be cleared.

This might be a way to recreate the issue. Rt-Click on the task bar at the top of the console output and select EDIT+Mark. The title will change to include the word "Mark" Leave it in that condition long enough for one or two messages about % Complete to appear. (I suspect they won't appear.) Now go back to that screen and press Enter, and the messages should appear. If you're really brave, :wink: :wink: leave it for much longer and see if FAH is eventually suspended but can be "fixed" it by hitting Enter and letting the messages appear on the screen.

Re: Funny behaviour with Project: 6069 (Run 0, Clone 1, Gen 0)

Posted: Mon Apr 19, 2010 11:30 pm
by Mactin
Permit me again to respecfully discount your (helpfull) suggestion.
I did what you suggested and waited 7 minutes (enough for 2% of processing).
When I hit "Enter" the line appeared all at once with their own times, not bunched up together at the same time.
Also note that the input not only stopped on the console but also in the log file.
Thanks

Re: Funny behaviour with Project: 6069 (Run 0, Clone 1, Gen 0)

Posted: Tue Apr 20, 2010 12:07 am
by bruce
Mactin wrote:Permit me again to respecfully discount your (helpfull) suggestion.
Respect is appreciated, but not necessary. ;)
When I hit "Enter" the line appeared all at once with their own times, not bunched up together at the same time.
Also note that the input not only stopped on the console but also in the log file.
Thanks
OK, but we don't have a one-to-one comparison. If, instead of pressing Enter, you had done a CTRL-c (or a shutdown), you would have never seen the screen update but (probably) the messages would have been written to the log file, producing the same sort of results that you reported earlier. The only thing you did NOT confirm was whether Windows would have suspended FAH eventually leaving you with two options (1) Press Enter and allow processing to resume or (2) kill FAH at what might have been an inopportune time. All we know for sure is that if your original problem was more or less like the one we're playing with, you picked 2 without knowing about 1.

Re: Funny behaviour with Project: 6069 (Run 0, Clone 1, Gen 0)

Posted: Tue Apr 20, 2010 1:37 am
by Mactin
Well, I have more to report.
I just completed the WU I got after restarting my client (I should get 3744 points, translating to 19798PPD, not bad :D )
The next WU I got was the same I had trouble with this morming : Project: 6069 (Run 0, Clone 1, Gen 0).
We will see how it does. For sure I wont go to bed until it gets to 30%.

Re: Funny behaviour with Project: 6069 (Run 0, Clone 1, Gen 0)

Posted: Tue Apr 20, 2010 2:55 am
by Mactin
It's now way past my bed time and the console shows 31% :-)
Whether the problem is like what Bruce is thinking or something else, I probably was a one time glich, they happen :?

Thank you and good night