Project 2669 (Run 9 Clone 140 Gen 56)

Moderators: Site Moderators, FAHC Science Team

58Enfield
Posts: 26
Joined: Sun Dec 02, 2007 1:35 pm
Location: Cedar Wilds of North Central Arizona

Project 2669 (Run 9 Clone 140 Gen 56)

Post by 58Enfield »

This wu is running fine I think, but it started at 22%. I have never seen one do that before.

The machine is a new configuration of old parts that I just put together last week and it has successfully completed 11 wu's before this one.

Machine config: Kubuntu 8.04, one instance of the Linux SMP client; Xeon 3060 (B3 E6600) C2D @ 3.15 GHz, 2 GB RAM, Gigabyte P965 MB, etc. Runs @ 40C doing A2 cores.

Other than the first two wu's, which I ran @ 2.9 GHz and 3.06 GHz respectively during the course of overclocking, the next 9 wu's ran @ 117 to 122 pph, as does this wu, and this machine's production is consistent with my other C2D's on A2 wu's.

My concern is whether the finished wu will give a valid scientific result. I will suspend the wu until this evening just in case, as I will be out and about until then.

TIA

Code:

[13:21:50] + Attempting to send results [January 2 13:21:50 UTC]
[13:21:50] - Reading file work/wuresults_02.dat from core
[13:21:50]   (Read 26037240 bytes from disk)
[13:21:50] Connecting to http://171.64.65.56:8080/
[13:25:59] Posted data.
[13:25:59] Initial: 0000; - Uploaded at ~97 kB/s
[13:26:12] - Averaged speed for that direction ~97 kB/s
[13:26:12] + Results successfully sent
[13:26:12] Thank you for your contribution to Folding@Home.
[13:26:12] + Number of Units Completed: 11

[13:26:13] - Warning: Could not delete all work unit files (2): Core file absent
[13:26:13] Trying to send all finished work units
[13:26:13] + No unsent completed units remaining.
[13:26:13] - Preparing to get new work unit...
[13:26:13] + Attempting to get work packet
[13:26:13] - Will indicate memory of 1900 MB
[13:26:13] - Connecting to assignment server
[13:26:13] Connecting to http://assign.stanford.edu:8080/
[13:26:13] Posted data.
[13:26:13] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[13:26:13] + News From Folding@Home: Welcome to Folding@Home
[13:26:13] Loaded queue successfully.
[13:26:13] Connecting to http://171.64.65.56:8080/
[13:26:20] Posted data.
[13:26:20] Initial: 0000; - Receiving payload (expected size: 4842860)
[13:26:35] - Downloaded at ~315 kB/s
[13:26:35] - Averaged speed for that direction ~301 kB/s
[13:26:35] + Received work.
[13:26:35] Trying to send all finished work units
[13:26:35] + No unsent completed units remaining.
[13:26:35] + Closed connections
[13:26:35] 
[13:26:35] + Processing work unit
[13:26:35] At least 4 processors must be requested.Core required: FahCore_a2.exe
[13:26:35] Core found.
[13:26:35] Working on queue slot 03 [January 2 13:26:35 UTC]
[13:26:35] + Working ...
[13:26:35] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 03 -checkpoint 15 -verbose -lifeline 12815 -version 623'

[13:26:35] 
[13:26:35] *------------------------------*
[13:26:35] Folding@Home Gromacs SMP Core
[13:26:35] Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)
[13:26:35] 
[13:26:35] Preparing to commence simulation
[13:26:35] - Ensuring status. Please wait.
[13:26:36] Called DecompressByteArray: compressed_data_size=4842348 data_size=23982265, decompressed_data_size=23982265 diff=0
[13:26:36] - Digital signature verified
[13:26:36] 
[13:26:36] Project: 2669 (Run 9, Clone 140, Gen 56)
[13:26:36] 
[13:26:36] Assembly optimizations on if available.
[13:26:36] Entering M.D.
[13:26:42] Will resume from checkpoint file
[13:26:46] ng M.D.
[13:26:52] Will resume from checkpoint file
[13:26:53] data_03.log
[13:26:53] Verified work/wudata_03.trr
[13:26:54] Verified work/wudata_03.xtc
[13:26:54] Verified work/wudata_03.edr
[13:26:54] Completed 55019 out of 250000 steps  (22%)
[13:28:31] - Autosending finished units... [January 2 13:28:31 UTC]
[13:28:31] Trying to send all finished work units
[13:28:31] + No unsent completed units remaining.
[13:28:31] - Autosend completed
[13:36:19] Completed 57509 out of 250000 steps  (23%)
[13:45:46] Completed 60009 out of 250000 steps  (24%)
[13:55:14] Completed 62509 out of 250000 steps  (25%)
[14:03:34] ***** Got an Activate signal (2)
[14:03:34] Killing all core threads

Folding@Home Client Shutdown.
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: Project 2669 (Run 9 Clone 140 Gen 56)

Post by kasson »

Thanks--this is definitely an odd one. We're looking into it.
Hyperlife
Posts: 192
Joined: Sun Dec 02, 2007 7:38 am

Re: Project 2669 (Run 9 Clone 140 Gen 56)

Post by Hyperlife »

Usually the problem of starting midway through a new WU is caused when the last WU using that same queue slot has some sort of interruption that leaves the old checkpoint files behind. I just got Project: 2669 (Run 16, Clone 156, Gen 49) starting out at 23% because my previous WU in that slot, Project: 2669 (Run 4, Clone 185, Gen 48), 10 WUs earlier, gave a CoreStatus = FF (255) error at 23% and failed to delete the checkpoint files. I deleted that new WU because it probably won't be scientifically accurate. (Fortunately I was reassigned that same WU again after deleting it, so it's now running from the beginning as it should.)

If your previous WU in queue slot 03 did successfully complete, then that is a new problem I haven't seen before.
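One quick way to tell which case you are in is to search the saved previous log for that queue slot and for core errors. A minimal sketch, run here against a synthetic sample since install paths vary ("FAHlog-Prev.txt" is an assumed name for the saved old log; substitute your own):

```shell
# Demo of checking a saved client log for a failed predecessor in the
# same queue slot. The sample lines mimic the log format quoted in this
# thread; "FAHlog-Prev.txt" is an assumed filename for the saved log.
cat > sample-FAHlog-Prev.txt <<'EOF'
[12:00:00] Working on queue slot 03 [January 1 12:00:00 UTC]
[12:45:00] CoreStatus = FF (255)
[13:26:35] Working on queue slot 03 [January 2 13:26:35 UTC]
EOF

# Show every use of slot 03 plus any FF (255) core errors between them.
# A CoreStatus = FF line between two uses of the same slot suggests the
# earlier WU died and may have left its checkpoint files behind.
grep "queue slot 03\|CoreStatus = FF" sample-FAHlog-Prev.txt
```

If the slot's previous WU shows an error rather than a clean finish, the mid-percentage restart is almost certainly the stale-checkpoint case described above.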
58Enfield
Posts: 26
Joined: Sun Dec 02, 2007 1:35 pm
Location: Cedar Wilds of North Central Arizona

Re: Project 2669 (Run 9 Clone 140 Gen 56)

Post by 58Enfield »

Hyperlife, thank you; you described the problem exactly. I had wrongly claimed that the machine had finished the prior 11 wu's correctly. The previous run through slot #3 had indeed crashed at 22% and was the impetus for my tearing the machine down and changing the cooling substantially. I was thinking the "new" machine had been working properly and did not check (and did not know to check) the saved old logs. Will remember to look in the future.

Thanks Again
58Enfield
Posts: 26
Joined: Sun Dec 02, 2007 1:35 pm
Location: Cedar Wilds of North Central Arizona

Re: Project 2669 (Run 9 Clone 140 Gen 56)

Post by 58Enfield »

Extra info: after deleting the work directory, queue.dat and unitinfo.txt, the assignment server gave this machine wu 2669 R9 C140 G56 to run again. It started clean from the beginning, ran normally and just finished uploading. Given Hyperlife's experience and mine, it's very likely that that little bit of assignment server logic is not left to chance.
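For anyone wanting to repeat the cleanup above, the steps can be sketched as follows. This is shown on a scratch directory ("fah-demo" is made up here) so nothing real gets deleted; point FAHDIR at your own client directory, and only run it with the client stopped:

```shell
# Sketch of the cleanup described above: with the client stopped, remove
# the work directory plus queue.dat and unitinfo.txt so the client asks
# the assignment server for a fresh WU on the next start.
# Demonstrated on a throwaway directory populated with dummy files.
FAHDIR="fah-demo"
mkdir -p "$FAHDIR/work"
touch "$FAHDIR/work/wudata_03.trr" "$FAHDIR/queue.dat" "$FAHDIR/unitinfo.txt"

# The actual cleanup steps. This discards the in-progress WU and all
# saved queue state, so only do it when the current WU is expendable.
rm -rf "$FAHDIR/work"
rm -f "$FAHDIR/queue.dat" "$FAHDIR/unitinfo.txt"
```

After restarting the client it should report "Preparing to get new work unit" and download a fresh assignment, as in the log quoted at the top of this thread.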
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: Project 2669 (Run 9 Clone 140 Gen 56)

Post by kasson »

Thanks again for the report. We're working on a fix. Has anyone noticed a similar behavior with other cores (uniprocessor or GPU)?
Hyperlife
Posts: 192
Joined: Sun Dec 02, 2007 7:38 am

Re: Project 2669 (Run 9 Clone 140 Gen 56)

Post by Hyperlife »

kasson wrote:Thanks again for the report. We're working on a fix. Has anyone noticed a similar behavior with other cores (uniprocessor or GPU)?
This has happened to me with the GPU client as well. Ivoshiee reported a similar problem in this thread. I think there may be another report of this elsewhere, but I can't find it.

Edit: Found it. I knew I had posted about this problem regarding a GPU WU in an earlier thread.

Edit #2: One more for good measure.