2669 (Run 2, Clone 61, Gen 116)

Posted: Sun Apr 19, 2009 11:03 pm
by P5-133XL
This WU likes to spontaneously interrupt itself! I then restart it and it does it again ...

9550@3.2GHz, 2x Asus 9600GSO (512), 4GB RAM
Windows Server 2008 SP1, Hyper-V
VM: Ubuntu 8.04, dual-core, 1GB RAM

[00:37:49] CoreStatus = 64 (100)
[00:37:49] Unit 6 finished with 69 percent of time to deadline remaining.
[00:37:49] Updated performance fraction: 0.706218
[00:37:49] Sending work to server
[00:37:49] Project: 2669 (Run 0, Clone 107, Gen 112)


[00:37:49] + Attempting to send results [April 19 00:37:49 UTC]
[00:37:49] - Reading file work/wuresults_06.dat from core
[00:37:49]   (Read 25961552 bytes from disk)
[00:37:49] Connecting to http://171.64.65.56:8080/
[00:45:30] Posted data.
[00:45:30] Initial: 0000; - Uploaded at ~53 kB/s
[00:45:39] - Averaged speed for that direction ~52 kB/s
[00:45:39] + Results successfully sent
[00:45:39] Thank you for your contribution to Folding@Home.
[00:45:39] + Number of Units Completed: 280

[00:45:57] - Warning: Could not delete all work unit files (6): Core file absent
[00:45:57] Trying to send all finished work units
[00:45:57] + No unsent completed units remaining.
[00:45:57] - Preparing to get new work unit...
[00:45:57] + Attempting to get work packet
[00:45:57] - Will indicate memory of 1004 MB
[00:45:57] - Connecting to assignment server
[00:45:57] Connecting to http://assign.stanford.edu:8080/
[00:45:57] Posted data.
[00:45:57] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[00:45:57] + News From Folding@Home: Welcome to Folding@Home
[00:45:57] Loaded queue successfully.
[00:45:57] Connecting to http://171.64.65.56:8080/
[00:46:02] Posted data.
[00:46:02] Initial: 0000; - Receiving payload (expected size: 4836166)
[00:46:07] - Downloaded at ~944 kB/s
[00:46:07] - Averaged speed for that direction ~1138 kB/s
[00:46:07] + Received work.
[00:46:07] Trying to send all finished work units
[00:46:07] + No unsent completed units remaining.
[00:46:07] + Closed connections
[00:46:07] 
[00:46:07] + Processing work unit
[00:46:07] At least 4 processors must be requested.Core required: FahCore_a2.exe
[00:46:07] Core found.
[00:46:08] Working on queue slot 07 [April 19 00:46:08 UTC]
[00:46:08] + Working ...
[00:46:08] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 07 -checkpoint 15 -verbose -lifeline 4675 -version 624'

[00:46:08] 
[00:46:08] *------------------------------*
[00:46:08] Folding@Home Gromacs SMP Core
[00:46:08] Version 2.06 (Tue Mar 31 08:29:45 PDT 2009)
[00:46:08] 
[00:46:08] Preparing to commence simulation
[00:46:08] - Ensuring status. Please wait.
[00:46:09] Called DecompressByteArray: compressed_data_size=4835654 data_size=23977273, decompressed_data_size=23977273 diff=0
[00:46:09] - Digital signature verified
[00:46:09] 
[00:46:09] Project: 2669 (Run 2, Clone 61, Gen 116)
[00:46:09] 
[00:46:09] Assembly optimizations on if available.
[00:46:09] Entering M.D.
[00:46:19] Run 2, Clone 61, Gen 116)
[00:46:19] 
[00:46:19] Entering M.D.
[00:56:22] pleted 2500 out of 250000 steps  (1%)
[01:06:16] Completed 5000 out of 250000 steps  (2%)
[01:16:09] Completed 7500 out of 250000 steps  (3%)
[01:26:03] Completed 10000 out of 250000 steps  (4%)
[01:35:58] Completed 12500 out of 250000 steps  (5%)
[01:45:45] Completed 15000 out of 250000 steps  (6%)
[01:52:54] - Autosending finished units... [April 19 01:52:54 UTC]
[01:52:54] Trying to send all finished work units
[01:52:54] + No unsent completed units remaining.
[01:52:54] - Autosend completed
[01:55:32] Completed 17500 out of 250000 steps  (7%)
[02:05:19] Completed 20000 out of 250000 steps  (8%)
[02:15:07] Completed 22500 out of 250000 steps  (9%)
[02:24:54] Completed 25000 out of 250000 steps  (10%)
[02:34:41] Completed 27500 out of 250000 steps  (11%)
[02:44:28] Completed 30000 out of 250000 steps  (12%)
[02:54:17] Completed 32500 out of 250000 steps  (13%)
[03:04:04] Completed 35000 out of 250000 steps  (14%)
[03:13:50] Completed 37500 out of 250000 steps  (15%)
[03:23:37] Completed 40000 out of 250000 steps  (16%)
[03:33:23] Completed 42500 out of 250000 steps  (17%)
[03:35:16] 
[03:35:16] Folding@home Core Shutdown: INTERRUPTED
[07:52:54] - Autosending finished units... [April 19 07:52:54 UTC]
[07:52:54] Trying to send all finished work units
[07:52:54] + No unsent completed units remaining.
[07:52:54] - Autosend completed


--- Opening Log file [April 19 02:12:50 UTC] 


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/mtt/foldingathome/CPU1
Executable: /home/mtt/foldingathome/CPU1/fah6
Arguments: -smp -verbosity 9 

[02:12:50] - Ask before connecting: No
[02:12:50] - User name: P5_133XL (Team 10047)
[02:12:50] - User ID: 51F02262379F5BC8
[02:12:50] - Machine ID: 4
[02:12:50] 
[02:12:50] Loaded queue successfully.
[02:12:50] - Autosending finished units... [April 19 02:12:50 UTC]
[02:12:50] Trying to send all finished work units
[02:12:50] + No unsent completed units remaining.
[02:12:50] - Autosend completed
[02:12:50] 
[02:12:50] + Processing work unit
[02:12:50] At least 4 processors must be requested.Core required: FahCore_a2.exe
[02:12:50] Core found.
[02:12:50] Working on queue slot 07 [April 19 02:12:50 UTC]
[02:12:50] + Working ...
[02:12:50] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 07 -checkpoint 15 -verbose -lifeline 4647 -version 624'

[02:12:50] 
[02:12:50] *------------------------------*
[02:12:50] Folding@Home Gromacs SMP Core
[02:12:50] Version 2.06 (Tue Mar 31 08:29:45 PDT 2009)
[02:12:50] 
[02:12:50] Preparing to commence simulation
[02:12:50] - Ensuring status. Please wait.
[02:12:50] Files status OK
[02:12:52] - Expanded 4835654 -> 23977273 (decompressed 495.8 percent)
[02:12:52] Called DecompressByteArray: compressed_data_size=4835654 data_size=23977273, decompressed_data_size=23977273 diff=0
[02:12:52] - Digital signature verified
[02:12:52] 
[02:12:52] Project: 2669 (Run 2, Clone 61, Gen 116)
[02:12:52] 
[02:12:52] Assembly optimizations on if available.
[02:12:52] Entering M.D.
[02:12:58] Using Gromacs checkpoints
[02:13:02] 
[09:13:48] Entering M.D.
[09:13:55] Using Gromacs checkpoints
[09:14:01] Resuming from checkpoint
[09:14:01] Verified work/wudata_07.log
[09:14:02] Verified work/wudata_07.trr
[09:14:02] Verified work/wudata_07.xtc
[09:14:02] Verified work/wudata_07.edr
[09:14:02] Completed 42510 out of 250000 steps  (17%)
[09:23:52] Completed 45000 out of 250000 steps  (18%)
[09:33:42] Completed 47500 out of 250000 steps  (19%)
[09:43:40] Completed 50000 out of 250000 steps  (20%)
[09:53:32] Completed 52500 out of 250000 steps  (21%)
[10:03:24] Completed 55000 out of 250000 steps  (22%)
[10:13:22] Completed 57500 out of 250000 steps  (23%)
[10:23:14] Completed 60000 out of 250000 steps  (24%)
[10:33:07] Completed 62500 out of 250000 steps  (25%)
[10:42:58] Completed 65000 out of 250000 steps  (26%)
[10:52:49] Completed 67500 out of 250000 steps  (27%)
[11:02:40] Completed 70000 out of 250000 steps  (28%)
[11:12:30] Completed 72500 out of 250000 steps  (29%)
[11:22:22] Completed 75000 out of 250000 steps  (30%)
[11:32:20] Completed 77500 out of 250000 steps  (31%)
[11:42:14] Completed 80000 out of 250000 steps  (32%)
[11:52:06] Completed 82500 out of 250000 steps  (33%)
[12:01:58] Completed 85000 out of 250000 steps  (34%)
[12:11:50] Completed 87500 out of 250000 steps  (35%)
[12:21:42] Completed 90000 out of 250000 steps  (36%)
[12:31:34] Completed 92500 out of 250000 steps  (37%)
[12:41:25] Completed 95000 out of 250000 steps  (38%)
[12:45:19] 
[12:45:19] Folding@home Core Shutdown: INTERRUPTED
[15:13:33] - Autosending finished units... [April 19 15:13:33 UTC]
[15:13:33] Trying to send all finished work units
[15:13:33] + No unsent completed units remaining.
[15:13:33] - Autosend completed


--- Opening Log file [April 19 11:50:21 UTC] 


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/mtt/foldingathome/CPU1
Executable: /home/mtt/foldingathome/CPU1/fah6
Arguments: -smp -verbosity 9 

[11:50:21] - Ask before connecting: No
[11:50:21] - User name: P5_133XL (Team 10047)
[11:50:21] - User ID: 51F02262379F5BC8
[11:50:21] - Machine ID: 4
[11:50:21] 
[11:50:21] Loaded queue successfully.
[11:50:21] - Autosending finished units... [April 19 11:50:21 UTC]
[11:50:21] Trying to send all finished work units
[11:50:21] + No unsent completed units remaining.
[11:50:21] - Autosend completed
[11:50:21] 
[11:50:21] + Processing work unit
[11:50:21] At least 4 processors must be requested.Core required: FahCore_a2.exe
[11:50:21] Core found.
[11:50:21] Working on queue slot 07 [April 19 11:50:21 UTC]
[11:50:21] + Working ...
[11:50:21] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 07 -checkpoint 15 -verbose -lifeline 4645 -version 624'

[11:50:22] 
[11:50:22] *------------------------------*
[11:50:22] Folding@Home Gromacs SMP Core
[11:50:22] Version 2.06 (Tue Mar 31 08:29:45 PDT 2009)
[11:50:22] 
[11:50:22] Preparing to commence simulation
[11:50:22] - Ensuring status. Please wait.
[11:50:22] Files status OK
[11:50:23] - Expanded 4835654 -> 23977273 (decompressed 495.8 percent)
[11:50:23] Called DecompressByteArray: compressed_data_size=4835654 data_size=23977273, decompressed_data_size=23977273 diff=0
[11:50:23] - Digital signature verified
[11:50:23] 
[11:50:23] Project: 2669 (Run 2, Clone 61, Gen 116)
[11:50:23] 
[11:50:23] Assembly optimizations on if available.
[11:50:23] Entering M.D.
[11:50:29] Using Gromacs checkpoints
[18:51:45] 
[18:51:49] Entering M.D.
[18:51:55] Using Gromacs checkpoints
[18:52:00] data_07.log
[18:52:01] Verified work/wudata_07.trr
[18:52:02] Verified work/wudata_07.xtc
[18:52:02] Verified work/wudata_07.edr
[18:52:02] Completed 95010 out of 250000 steps  (38%)
[19:01:50] Completed 97500 out of 250000 steps  (39%)
[19:11:38] Completed 100000 out of 250000 steps  (40%)
[19:21:26] Completed 102500 out of 250000 steps  (41%)
[19:31:14] Completed 105000 out of 250000 steps  (42%)
[19:41:02] Completed 107500 out of 250000 steps  (43%)
[19:50:51] Completed 110000 out of 250000 steps  (44%)
[20:00:40] Completed 112500 out of 250000 steps  (45%)
[20:10:28] Completed 115000 out of 250000 steps  (46%)
[20:20:15] Completed 117500 out of 250000 steps  (47%)
[20:30:03] Completed 120000 out of 250000 steps  (48%)
[20:39:50] Completed 122500 out of 250000 steps  (49%)
[20:49:38] Completed 125000 out of 250000 steps  (50%)
[20:59:33] Completed 127500 out of 250000 steps  (51%)
[21:09:21] Completed 130000 out of 250000 steps  (52%)
[21:19:09] Completed 132500 out of 250000 steps  (53%)
[21:28:57] Completed 135000 out of 250000 steps  (54%)
[21:38:45] Completed 137500 out of 250000 steps  (55%)
[21:47:16] 
[21:47:16] Folding@home Core Shutdown: INTERRUPTED
[21:47:21] CoreStatus = FF (255)
[21:47:21] Sending work to server
[21:47:21] Project: 2669 (Run 2, Clone 61, Gen 116)
[21:47:21] - Error: Could not get length of results file work/wuresults_07.dat
[21:47:21] - Error: Could not read unit 07 file. Removing from queue.
[21:47:21] Trying to send all finished work units
[21:47:21] + No unsent completed units remaining.
[21:47:21] - Preparing to get new work unit...
[21:47:21] + Attempting to get work packet
[21:47:21] - Will indicate memory of 1004 MB
[21:47:21] - Connecting to assignment server
[21:47:21] Connecting to http://assign.stanford.edu:8080/
[21:47:21] Posted data.
[21:47:21] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[21:47:21] + News From Folding@Home: Welcome to Folding@Home
[21:47:21] Loaded queue successfully.
[21:47:21] Connecting to http://171.64.65.56:8080/
[21:47:27] Posted data.
[21:47:27] Initial: 0000; - Receiving payload (expected size: 4836166)
[21:47:30] - Downloaded at ~1574 kB/s
[21:47:30] - Averaged speed for that direction ~1225 kB/s
[21:47:30] + Received work.
[21:47:30] Trying to send all finished work units
[21:47:30] + No unsent completed units remaining.
[21:47:30] + Closed connections
[21:47:35] 
[21:47:35] + Processing work unit
[21:47:35] At least 4 processors must be requested.Core required: FahCore_a2.exe
[21:47:35] Core found.
[21:47:35] Working on queue slot 08 [April 19 21:47:35 UTC]
[21:47:35] + Working ...
[21:47:35] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 08 -checkpoint 15 -verbose -lifeline 4645 -version 624'

[21:47:35] 
[21:47:35] *------------------------------*
[21:47:35] Folding@Home Gromacs SMP Core
[21:47:35] Version 2.06 (Tue Mar 31 08:29:45 PDT 2009)
[21:47:35] 
[21:47:35] Preparing to commence simulation
[21:47:35] - Ensuring status. Please wait.
[21:47:36] Called DecompressByteArray: compressed_data_size=4835654 data_size=23977273, decompressed_data_size=23977273 diff=0
[21:47:36] - Digital signature verified
[21:47:36] 
[21:47:36] Project: 2669 (Run 2, Clone 61, Gen 116)
[21:47:36] 
[21:47:36] Assembly optimizations on if available.
[21:47:36] Entering M.D.
[21:47:47] Run 2, Clone 61, Gen 116)
[21:47:47] 
[21:47:47] Entering M.D.
[21:57:49] pleted 2500 out of 250000 steps  (1%)
[22:07:39] Completed 5000 out of 250000 steps  (2%)
[22:17:29] Completed 7500 out of 250000 steps  (3%)
[22:27:19] Completed 10000 out of 250000 steps  (4%)
[22:37:09] Completed 12500 out of 250000 steps  (5%)
[22:46:58] Completed 15000 out of 250000 steps  (6%)

Re: 2669 (Run 2, Clone 61, Gen 116)

Posted: Sun Apr 19, 2009 11:15 pm
by toTOW
There's no data for this WU in the DB yet ...

Re: 2669 (Run 2, Clone 61, Gen 116)

Posted: Sun Apr 19, 2009 11:26 pm
by P5-133XL
Well, it is claiming to be sending work to server, but there is no thank you ...

Re: 2669 (Run 2, Clone 61, Gen 116)

Posted: Mon Apr 20, 2009 2:58 am
by kasson
Hmm. That's strange. The following two messages really shouldn't happen at the same time:
[21:47:16] Folding@home Core Shutdown: INTERRUPTED
[21:47:21] CoreStatus = FF (255)
We'll have to look into this more.
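For anyone who wants to check their own FAHlog.txt for the same pairing, here is a minimal sketch in Python. It only assumes the `[HH:MM:SS] message` line format seen in the log above; the function name `find_conflicts` and the 10-second window are my own choices for illustration, not anything from the client.

```python
import re

# Matches client log lines of the form "[HH:MM:SS] message".
STAMP_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s*(.*)$")


def find_conflicts(lines, window_seconds=10):
    """Return (interrupt_time, status_time) pairs, in seconds since midnight,
    where "CoreStatus = FF" follows an INTERRUPTED shutdown within the window.

    window_seconds is an arbitrary illustrative threshold.
    """
    conflicts = []
    last_interrupt = None
    for line in lines:
        match = STAMP_RE.match(line.strip())
        if not match:
            continue  # banner lines, blank lines, etc.
        hours, minutes, seconds, message = match.groups()
        now = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        if "Core Shutdown: INTERRUPTED" in message:
            last_interrupt = now
        elif message.startswith("CoreStatus = FF"):
            if last_interrupt is not None and 0 <= now - last_interrupt <= window_seconds:
                conflicts.append((last_interrupt, now))
            last_interrupt = None
    return conflicts


if __name__ == "__main__":
    sample = [
        "[03:35:16] Folding@home Core Shutdown: INTERRUPTED",
        "[07:52:54] - Autosending finished units... [April 19 07:52:54 UTC]",
        "[21:47:16] Folding@home Core Shutdown: INTERRUPTED",
        "[21:47:21] CoreStatus = FF (255)",
    ]
    for interrupt_time, status_time in find_conflicts(sample):
        print("INTERRUPTED followed by CoreStatus = FF after",
              status_time - interrupt_time, "seconds")
```

The sketch compares times within a single day only, so a pair that straddles midnight would be missed.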

Re: 2669 (Run 2, Clone 61, Gen 116)

Posted: Mon Apr 20, 2009 3:08 am
by P5-133XL
I also noticed that the last interruption caused the WU to restart from the beginning, and it has now successfully gone past the point of the first spontaneous interruption, which occurred at 17% (it is now at 31%). I normally associate non-repeatability with a problem in the machine rather than in the individual WU.
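To compare where each interruption landed, a rough Python sketch like the one below can pull the last reported percentage before every INTERRUPTED line. It relies only on the "Completed N out of M steps (P%)" and "Core Shutdown: INTERRUPTED" lines as they appear in the log above; the function name `interruption_points` is hypothetical.

```python
import re

# Progress lines look like: "Completed 42500 out of 250000 steps  (17%)".
STEP_RE = re.compile(r"Completed (\d+) out of (\d+) steps\s+\((\d+)%\)")


def interruption_points(lines):
    """Return the last reported percentage before each INTERRUPTED shutdown.

    An entry is None if no progress line was seen before the interruption.
    """
    points = []
    last_percent = None
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            last_percent = int(match.group(3))
        elif "Core Shutdown: INTERRUPTED" in line:
            points.append(last_percent)
    return points


if __name__ == "__main__":
    sample = [
        "[03:33:23] Completed 42500 out of 250000 steps  (17%)",
        "[03:35:16] Folding@home Core Shutdown: INTERRUPTED",
        "[09:14:02] Completed 42510 out of 250000 steps  (17%)",
        "[12:41:25] Completed 95000 out of 250000 steps  (38%)",
        "[12:45:19] Folding@home Core Shutdown: INTERRUPTED",
    ]
    print(interruption_points(sample))
```

In the log from the first post the three interruptions follow the 17%, 38% and 55% progress lines, so they did not recur at a single step, which is what makes me suspect the machine rather than the WU.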

Re: 2669 (Run 2, Clone 61, Gen 116)

Posted: Mon Apr 20, 2009 4:57 am
by kasson
In general, yes. There may be some WUs that are more "sensitive": they may either pick up some subtle machine problems or may have a random crash on a machine that is within normal parameters. Checkpoint resumes should be mostly identical from the WU side (there are a couple of things relating to parallelization that are non-identical); starting a WU over is pretty similar, but I think a couple of the random number generator states can differ.
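The difference between a checkpoint resume and a fresh start can be illustrated with a toy sketch in Python. This is not how the Gromacs core works internally; it is only a random walk using the standard `random` module, and every name in it (`advance`, `uninterrupted`, `interrupted_then_resumed`) and both seeds are made up for the example. The assumption being illustrated is that a checkpoint which stores the generator state reproduces the uninterrupted run, while a restart with a different generator state does not.

```python
import random


def advance(position, rng, steps):
    """Toy "simulation": add one random displacement per step."""
    for _ in range(steps):
        position += rng.uniform(-1.0, 1.0)
    return position


def uninterrupted(seed, steps):
    """Run all steps in one go from a given seed."""
    return advance(0.0, random.Random(seed), steps)


def interrupted_then_resumed(seed, steps, stop_at):
    """Stop at stop_at, save position and RNG state, then resume from them."""
    rng = random.Random(seed)
    position = advance(0.0, rng, stop_at)
    checkpoint = (position, rng.getstate())  # what a checkpoint would store

    saved_position, saved_state = checkpoint
    resumed_rng = random.Random()
    resumed_rng.setstate(saved_state)        # restore the generator exactly
    return advance(saved_position, resumed_rng, steps - stop_at)


if __name__ == "__main__":
    full = uninterrupted(seed=2669, steps=1000)
    resumed = interrupted_then_resumed(seed=2669, steps=1000, stop_at=170)
    restarted = uninterrupted(seed=61116, steps=1000)  # different RNG state
    print("resume matches uninterrupted run:", resumed == full)
    print("restart matches uninterrupted run:", restarted == full)
```

A real parallel run has additional sources of non-identical behaviour, as noted above, which this single-threaded toy does not try to model.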

Re: 2669 (Run 2, Clone 61, Gen 116)

Posted: Mon Apr 20, 2009 5:37 pm
by P5-133XL
Whatever happened, it is probably irrelevant in that the WU successfully completed. Feel free to continue investigating though.

Re: 2669 (Run 2, Clone 61, Gen 116)

Posted: Tue Apr 21, 2009 12:16 am
by bruce
Are you running any systray clients? Were any other clients started at the time you got the INTERRUPTED message?

There have been some unconfirmed reports of INTERRUPTED associated with starting other clients.

Re: 2669 (Run 2, Clone 61, Gen 116)

Posted: Tue Apr 21, 2009 1:18 am
by P5-133XL
This was occurring inside a virtual machine running Ubuntu 8.04. There were two GPU systray clients on the native Windows Server 2008 host, but they should not, in theory, be able to affect any VM short of bringing the whole machine down or killing the Hyper-V service (which should bring down all the VMs). No such drastic event occurred.

If it matters, the method I used to restart the clients after detecting an interrupted state was to shut down the VM and restart it. I did not manually touch the individual processes.

9550@3.2GHz, 4GB RAM, 2x Asus 9600GSO (512)
Host: Windows Server 2008 (SP1), Nvidia 185.2 Drivers, Hyper-V

2x VMs, each running two cores with 1GB: Ubuntu 8.04 and the 6.24 SMP client.
2x 6.20R1 GPU clients running off the host: Windows Server 2008.

Re: 2669 (Run 2, Clone 61, Gen 116)

Posted: Wed Apr 22, 2009 12:31 pm
by susato
For the record:

Hi P5_133XL (team 10047),
Your WU (P2669 R2 C61 G116) was added to the stats database on 2009-04-20 07:54:59 for 1920 points of credit.

Re: 2669 (Run 2, Clone 61, Gen 116)

Posted: Wed Apr 22, 2009 1:14 pm
by P5-133XL
Thanks