p10001 (Run 229, Clone 4, Gen 4) - UNKNOWN

Moderators: Site Moderators, FAHC Science Team

Tobit
Posts: 342
Joined: Thu Apr 17, 2008 2:35 pm
Location: Manchester, NH USA

p10001 (Run 229, Clone 4, Gen 4) - UNKNOWN

Post by Tobit »

Here's an odd one I haven't seen before here on my system. These generally run quite well here. In fact, this is the first abnormal result I've seen since ProtoMol started.

Code:

[08:07:53] *********************** Log Started 25/Dec/2009 08:07:53 ***********************
[08:07:53] ************************** ProtoMol Folding@Home Core **************************
[08:07:53]   Version: 21
[08:07:53]      Type: 180
[08:07:53]      Core: ProtoMol
[08:07:53]   Website: http://folding.stanford.edu/
[08:07:53] Copyright: (c) 2009 Stanford University
[08:07:53]    Author: Joseph Coffland <joseph@cauldrondevelopment.com>
[08:07:53]      Args: -dir work/ -suffix 00 -checkpoint 15 -lifeline 5116 -version 623
[08:07:53] ************************************ Build *************************************
[08:07:53]      Date: Dec 24 2009
[08:07:53]      Time: 14:36:31
[08:07:53]  Revision: 1748
[08:07:53]  Compiler: Intel(R) C++ MSVC 1500 mode 1110
[08:07:53]   Options: /TP /nologo /EHsc /wd4297 /wd4103 /wd1786 /arch:IA32 /Ox
[08:07:53]            /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qrestrict /MT
[08:07:53]  Platform: Windows XP
[08:07:53]      Bits: 32
[08:07:53] ************************************ System ************************************
[08:07:53]        OS: Microsoft(R) Windows(R) XP Professional x64 Edition
[08:07:53]       CPU: Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz
[08:07:53]    CPU ID: GenuineIntel Family 6 Model 15 Stepping 6
[08:07:53]      CPUs: 2 Logical, 1 Physical
[08:07:53]    Memory: 4.00 GB
[08:07:53] ********************************************************************************
[08:07:53] Project: 10001 (Run 229, Clone 4, Gen 4)
[08:07:53] Reading tar file par_all27_prot_lipid.inp
[08:07:53] Reading tar file scpismQuartic.inp
[08:07:53] Reading tar file ww_exteq_nowater1.pdb
[08:07:53] Reading tar file ww_exteq_nowater1.psf
[08:07:53] Reading tar file checkpt
[08:07:53] Reading tar file ww_exteq_nowater1.208.pos
[08:07:53] Reading tar file ww_exteq_nowater1.208.vel
[08:07:53] Reading tar file protomol.conf
[08:07:53] Reading tar file core.xml
[08:07:53] Digital signatures verified
[08:07:53] Completed 0 out of 200000 steps (0%)
[08:10:07] WARNING: UnexpectedExitHandler triggered
[08:10:07] WARNING: Unexpected exit from science code
[08:10:07] Saving result file logfile_00.txt
[08:10:07] Saving result file checkpt
[08:10:07] Saving result file log.txt
[08:10:07] Saving result file protomol.conf
[08:10:07] Saving result file ww.dcd
[08:10:07] Saving result file ww_exteq_nowater1.208.pos
[08:10:07] Saving result file ww_exteq_nowater1.208.vel
[08:10:07] Folding@home Core Shutdown: UNKNOWN
[08:10:11] CoreStatus = 7B (123)
[08:10:11] Sending work to server
[08:10:11] Project: 10001 (Run 229, Clone 4, Gen 4)
[08:10:11] + Attempting to send results [December 25 08:10:11 UTC]
[08:10:13] + Results successfully sent
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am
Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5GHz, 96GB G.Skill DDR3 1333MHz Ubuntu 10.10
2 - Asus P6X58D-E i7 980X 4.4GHz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
1 - Asus Rampage Gene III i7 970 4.3GHz DDR3 2000 2-500GB Seagate 7200.11 0-Raid Ubuntu 10.10
1 - Asus G73JH Laptop i7 740QM 1.86GHz ATI 5870M

Re: p10001 (Run 229, Clone 4, Gen 4) - UNKNOWN

Post by Grandpa_01 »

It looks like it did what it was supposed to do. Did you get the same WU again?
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9GHz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15GHz
2 - i7 980X 4.4GHz 2-GTX680
1 - 2700k 4.4GHz GTX680
Total = 464 cores folding
Tobit
Posts: 342
Joined: Thu Apr 17, 2008 2:35 pm
Location: Manchester, NH USA

Re: p10001 (Run 229, Clone 4, Gen 4) - UNKNOWN

Post by Tobit »

Grandpa_01 wrote: It looks like it did what it was supposed to do. Did you get the same WU again?
No, it moved on to a different WU. Did you see these lines in the log?

Code:

[08:07:53] Completed 0 out of 200000 steps (0%)
[08:10:07] WARNING: UnexpectedExitHandler triggered
[08:10:07] WARNING: Unexpected exit from science code
"Unexpected" tells me it didn't do something it was supposed to do. :o
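For anyone who wants to pull those lines out of a longer log, here is a minimal Python sketch. It assumes every entry starts with a `[HH:MM:SS]` stamp as in the excerpt above and that the log does not cross midnight. The names `LINE_RE` and `parse` are made up for this example and are not part of any FAH tool.

```python
import re
from datetime import datetime

# Excerpt copied from the log posted above.
LOG = """\
[08:07:53] Completed 0 out of 200000 steps (0%)
[08:10:07] WARNING: UnexpectedExitHandler triggered
[08:10:07] WARNING: Unexpected exit from science code
"""

# Hypothetical pattern: "[HH:MM:SS] message"
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s+(.*)$")


def parse(log_text):
    """Return a list of (time, message) tuples for every stamped line."""
    entries = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            entries.append((stamp, match.group(2)))
    return entries


entries = parse(LOG)
warnings = [(t, msg) for t, msg in entries if msg.startswith("WARNING:")]
progress = [(t, msg) for t, msg in entries if msg.startswith("Completed")]

# Seconds between the last progress report and the first warning.
# Assumes both stamps fall on the same day.
seconds_before_failure = int((warnings[0][0] - progress[-1][0]).total_seconds())

print(f"{len(warnings)} warnings, first one {seconds_before_failure} s after the last progress line")
```

On this excerpt it shows the core ran for a little over two minutes at 0% before the science code exited.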
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am

Re: p10001 (Run 229, Clone 4, Gen 4) - UNKNOWN

Post by Grandpa_01 »

Tobit wrote: No, it moved on to a different WU. Did you see these lines in the log?

Code:

[08:07:53] Completed 0 out of 200000 steps (0%)
[08:10:07] WARNING: UnexpectedExitHandler triggered
[08:10:07] WARNING: Unexpected exit from science code
"Unexpected" tells me it didn't do something it was supposed to do. :o
From what I understand, that is expected. When it happens with the new version (V21), the core is supposed to do what it did: send the WU back to the server so that you get a different WU rather than getting the same one over and over again. It looks like they got that bug fixed in V21.
Tobit
Posts: 342
Joined: Thu Apr 17, 2008 2:35 pm
Location: Manchester, NH USA

Re: p10001 (Run 229, Clone 4, Gen 4) - UNKNOWN

Post by Tobit »

Grandpa_01 wrote: From what I understand, that is expected
Ending early is expected with ProtoMol-based units, but ending early with errors like this one is not. I have had plenty that ended early, but I've never seen one end early with the errors this one had. Finishing with a CoreStatus of 7B is not normal.
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am

Re: p10001 (Run 229, Clone 4, Gen 4) - UNKNOWN

Post by Grandpa_01 »

I did not say it was. What I did say was that V21 is doing what it is supposed to do when a WU fails, which the other versions were not always doing.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: p10001 (Run 229, Clone 4, Gen 4) - UNKNOWN

Post by bruce »

I agree with Grandpa_01 . . . to a point.

To understand it fully, we need to identify several different components that make up the FAH system. Most of the time people break things up into two pieces -- the servers and the software on your PC, or three pieces -- the servers, the client, and a FahCore. To understand what's going on here we need to look one level deeper and split the FahCore into two separate logical pieces that are integrally combined before you ever see it.

Any FahCore is made up of code written mostly by Stanford and code written mostly by someone else. The Stanford developers can find and fix bugs in the code they wrote rather quickly, but if the code that somebody else wrote has an error, it will probably take longer to get it fixed. In this case, the message "Unexpected exit from science code" says that there was some kind of error in that other code. The Stanford code responds by reporting a CoreStatus = 7B (123) to the client. The client responds by sending an error report to the server, as it should, and the server gives you a new assignment.
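As a rough illustration of that hand-off, here is a small Python sketch that reads the tail of the log posted above, picks out the status the core reported, and checks that hex 7B and decimal 123 are the same code. It is only a sketch: the function name `core_result` and the dictionary keys are invented for this example, and it says nothing about how the client or server actually handles the code internally.

```python
import re

# Tail of the log from the first post.
LOG_TAIL = """\
[08:10:07] Folding@home Core Shutdown: UNKNOWN
[08:10:11] CoreStatus = 7B (123)
[08:10:11] Sending work to server
"""

# Patterns based only on the line formats visible in that log.
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")
SHUTDOWN_RE = re.compile(r"Core Shutdown: (\w+)")


def core_result(log_text):
    """Summarize what the core reported, or return None if no status line exists."""
    status = STATUS_RE.search(log_text)
    if status is None:
        return None
    shutdown = SHUTDOWN_RE.search(log_text)
    hex_code = int(status.group(1), 16)   # "7B" read as hexadecimal
    dec_code = int(status.group(2))       # "123" read as decimal
    return {
        "shutdown": shutdown.group(1) if shutdown else None,
        "code": hex_code,
        "consistent": hex_code == dec_code,
        "sent_to_server": "Sending work to server" in log_text,
    }


result = core_result(LOG_TAIL)
print(result)
```

The `sent_to_server` flag is the part that matters for this thread: the client returned the failed WU instead of holding on to it.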

Some of the other FahCores respond differently to an error in the science code, and this is the first example I've seen of one doing it right. Other FahCores make a different report to the client, and the undesirable result is that you may have the same WU reassigned, producing the same error repeatedly.

Versions 19 and 20 of ProtoMol were important developmental steps toward this solution, and I commend jcoffland for promptly moving to what appears to be an excellent solution for those unexpected problems that come up in the non-Stanford code.
Tobit
Posts: 342
Joined: Thu Apr 17, 2008 2:35 pm
Location: Manchester, NH USA

Re: p10001 (Run 229, Clone 4, Gen 4) - UNKNOWN

Post by Tobit »

Thanks Bruce, that helps me better understand what Grandpa_01 was trying to say. I would agree that error handling is greatly improved with v21. However, we should still report these unexpected errors as a possible bad WU, correct? This was clearly more than a simple case of ending early because no more computation was possible.