Project: 6014 (Run 0, Clone 106, Gen 86)

Moderators: Site Moderators, FAHC Science Team

padmavyuha
Posts: 11
Joined: Fri Apr 25, 2008 5:37 pm

Project: 6014 (Run 0, Clone 106, Gen 86)

Post by padmavyuha »

What's happening here? This is twice in a row now over the last couple of hours: FahCore_a3 runs, gets to around 2%, and then bails out with the same error. Is this a bug in the WUs themselves, or in the new FahCore_a3?

I see the /Documentation/Errors page says (helpfully) that my system is blowing up :?. Can I 'deflate' it at my end, or is this a problem caused by fah6, the fahcore, or the WU?

Code:

Last login: Tue Mar 23 12:14:04 on ttys000
Macintosh:~ yoxi$ /Users/yoxi/Applications/mym/fah.command ; exit;
COLUMNS=55;
LINES=6;
export COLUMNS LINES;
Using local directory for configuration

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

Using local directory for work files
2 cores detected


--- Opening Log file [March 23 12:38:27 UTC] 


# Mac OS X SMP Console Edition ################################################
###############################################################################

                       Folding@Home Client Version 6.29r1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/yoxi/Library/Folding@home
Executable: ./fah6
Arguments: -smp -local -verbosity 9 -oneunit -forceasm 

[12:38:27] - Ask before connecting: No
[12:38:27] - User name: yoxi (Team 3007)
[12:38:27] - User ID: 80C71DC069F7846
[12:38:27] - Machine ID: 1
[12:38:27] 
[12:38:28] Loaded queue successfully.
[12:38:28] - Preparing to get new work unit...
[12:38:28] Cleaning up work directory
[12:38:28] - Autosending finished units... [March 23 12:38:28 UTC]
[12:38:28] Trying to send all finished work units
[12:38:28] + No unsent completed units remaining.
[12:38:28] - Autosend completed
[12:38:28] + Attempting to get work packet
[12:38:28] Passkey found
[12:38:28] - Will indicate memory of 8192 MB
[12:38:28] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 7, Stepping: 10
[12:38:28] - Connecting to assignment server
[12:38:28] Connecting to http://assign.stanford.edu:8080/
[12:38:28] Posted data.
[12:38:28] Initial: ED82; - Successful: assigned to (130.237.232.140).
[12:38:28] + News From Folding@Home: Welcome to Folding@Home
[12:38:29] Loaded queue successfully.
[12:38:29] Sent data
[12:38:29] Connecting to http://130.237.232.140:8080/
[12:38:29] Posted data.
[12:38:29] Initial: 0000; - Receiving payload (expected size: 1799850)
[12:39:28] - Downloaded at ~29 kB/s
[12:39:28] - Averaged speed for that direction ~45 kB/s
[12:39:28] + Received work.
[12:39:28] + Closed connections
[12:39:28] 
[12:39:28] + Processing work unit
[12:39:28] Core required: FahCore_a3.exe
[12:39:28] Core found.
[12:39:28] Working on queue slot 01 [March 23 12:39:28 UTC]
[12:39:28] + Working ...
[12:39:28] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 2 -checkpoint 30 -forceasm -verbose -lifeline 3912 -version 629'

[12:39:28] 
[12:39:28] *------------------------------*
[12:39:28] Folding@Home Gromacs SMP Core
[12:39:28] Version 2.17 (Mar 7 2010)
[12:39:28] 
[12:39:28] Preparing to commence simulation
[12:39:28] - Assembly optimizations manually forced on.
[12:39:28] - Not checking prior termination.
[12:39:28] - Expanded 1799338 -> 2396877 (decompressed 133.2 percent)
[12:39:28] Called DecompressByteArray: compressed_data_size=1799338 data_size=2396877, decompressed_data_size=2396877 diff=0
[12:39:28] - Digital signature verified
[12:39:28] 
[12:39:28] Project: 6014 (Run 0, Clone 106, Gen 86)
[12:39:28] 
[12:39:28] Assembly optimizations on if available.
[12:39:28] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=0, HOSTNAME=thread #0
NNODES=2, MYRANK=1, HOSTNAME=thread #1
Reading file work/wudata_01.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
43500004 steps,  87000.0 ps (continuing from step 43000004,  86000.0 ps).
[12:39:35] Completed 0 out of 500000 steps  (0%)
[12:50:41] Completed 5000 out of 500000 steps  (1%)
[13:02:12] Completed 10000 out of 500000 steps  (2%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
16 particles communicated to PME node 1 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[13:02:25] mdrun returned 255
[13:02:25] Going to send back what have done -- stepsTotalG=500000
[13:02:25] Work fraction=0.0202 steps=500000.
[13:02:25] CoreStatus = 0 (0)
[13:02:25] Sending work to server
[13:02:25] Project: 6014 (Run 0, Clone 106, Gen 86)
[13:02:25] - Error: Could not get length of results file work/wuresults_01.dat
[13:02:25] - Error: Could not read unit 01 file. Removing from queue.
[13:02:25] Trying to send all finished work units
[13:02:25] + No unsent completed units remaining.
[13:02:25] + -oneunit flag given and have now finished a unit. Exiting.- Preparing to get new work unit...
[13:02:25] Cleaning up work directory
[13:02:25] ***** Got a SIGTERM signal (15)
[13:02:25] Killing all core threads

Folding@Home Client Shutdown.
logout

[Process completed]
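For context on that pme.c error: with the "1D domain decomposition 2 x 1 x 1" in the log, the box is split into two cells along x, and a charge group that has drifted more than one cell length out of its home cell can no longer be handed to the neighbouring PME rank, so mdrun aborts. A loose geometric sketch of that check (not GROMACS source; box size and coordinates are invented):

```python
def out_of_cell(x, home_cell, box_x, ncells):
    """True if a particle at coordinate x has left its domain-decomposition
    cell along x by more than one cell length (the fatal case in pme.c)."""
    cell = box_x / ncells                 # DD cell length along x
    lo, hi = home_cell * cell, (home_cell + 1) * cell
    overshoot = max(lo - x, x - hi, 0.0)  # distance outside the home cell
    return overshoot > cell

# A modest drift is still communicable to the neighbour rank; a numerical
# blow-up that flings atoms far across the box is not.
print(out_of_cell(x=5.0, home_cell=0, box_x=8.0, ncells=2))   # modest drift
print(out_of_cell(x=13.0, home_cell=0, box_x=8.0, ncells=2))  # blown up
```

In practice this error usually means the simulation has become numerically unstable ("blown up"), either because the WU itself is bad or because the hardware is producing wrong numbers.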
padmavyuha
Posts: 11
Joined: Fri Apr 25, 2008 5:37 pm

Re: fahcore3 breaks after 1-2%

Post by padmavyuha »

...and again. There's no point running any more of these until it's clear why they're not working now.

Code:

Last login: Tue Mar 23 12:38:27 on ttys000
Macintosh:~ yoxi$ /Users/yoxi/Applications/mym/fah.command ; exit;
COLUMNS=55;
LINES=6;
export COLUMNS LINES;
Using local directory for configuration

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

Using local directory for work files
2 cores detected


--- Opening Log file [March 23 14:36:20 UTC] 


# Mac OS X SMP Console Edition ################################################
###############################################################################

                       Folding@Home Client Version 6.29r1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/yoxi/Library/Folding@home
Executable: ./fah6
Arguments: -smp -local -verbosity 9 -oneunit -forceasm 

[14:36:20] - Ask before connecting: No
[14:36:20] - User name: yoxi (Team 3007)
[14:36:20] - User ID: 80C71DC069F7846
[14:36:20] - Machine ID: 1
[14:36:20] 
[14:36:20] Loaded queue successfully.
[14:36:20] - Preparing to get new work unit...
[14:36:20] Cleaning up work directory
[14:36:20] - Autosending finished units... [March 23 14:36:20 UTC]
[14:36:20] Trying to send all finished work units
[14:36:20] + No unsent completed units remaining.
[14:36:20] - Autosend completed
[14:36:20] + Attempting to get work packet
[14:36:20] Passkey found
[14:36:20] - Will indicate memory of 8192 MB
[14:36:20] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 7, Stepping: 10
[14:36:20] - Connecting to assignment server
[14:36:20] Connecting to http://assign.stanford.edu:8080/
[14:36:21] Posted data.
[14:36:21] Initial: ED82; - Successful: assigned to (130.237.232.140).
[14:36:21] + News From Folding@Home: Welcome to Folding@Home
[14:36:21] Loaded queue successfully.
[14:36:21] Sent data
[14:36:21] Connecting to http://130.237.232.140:8080/
[14:36:22] Posted data.
[14:36:22] Initial: 0000; - Receiving payload (expected size: 1799850)
[14:37:19] - Downloaded at ~30 kB/s
[14:37:19] - Averaged speed for that direction ~42 kB/s
[14:37:19] + Received work.
[14:37:19] + Closed connections
[14:37:19] 
[14:37:19] + Processing work unit
[14:37:19] Core required: FahCore_a3.exe
[14:37:19] Core found.
[14:37:19] Working on queue slot 02 [March 23 14:37:19 UTC]
[14:37:19] + Working ...
[14:37:19] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 02 -np 2 -checkpoint 30 -forceasm -verbose -lifeline 4168 -version 629'

[14:37:19] 
[14:37:19] *------------------------------*
[14:37:19] Folding@Home Gromacs SMP Core
[14:37:19] Version 2.17 (Mar 7 2010)
[14:37:19] 
[14:37:19] Preparing to commence simulation
[14:37:19] - Ensuring status. Please wait.
[14:37:28] - Assembly optimizations manually forced on.
[14:37:28] - Not checking prior termination.
[14:37:29] - Expanded 1799338 -> 2396877 (decompressed 133.2 percent)
[14:37:29] Called DecompressByteArray: compressed_data_size=1799338 data_size=2396877, decompressed_data_size=2396877 diff=0
[14:37:29] - Digital signature verified
[14:37:29] 
[14:37:29] Project: 6014 (Run 0, Clone 106, Gen 86)
[14:37:29] 
[14:37:29] Assembly optimizations on if available.
[14:37:29] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_02.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
43500004 steps,  87000.0 ps (continuing from step 43000004,  86000.0 ps).
[14:37:35] Completed 0 out of 500000 steps  (0%)
[14:48:53] Completed 5000 out of 500000 steps  (1%)
[15:00:09] Completed 10000 out of 500000 steps  (2%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
16 particles communicated to PME node 1 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[15:00:21] mdrun returned 255
[15:00:21] Going to send back what have done -- stepsTotalG=500000
[15:00:21] Work fraction=0.0202 steps=500000.
[15:00:21] CoreStatus = 0 (0)
[15:00:21] Sending work to server
[15:00:21] Project: 6014 (Run 0, Clone 106, Gen 86)
[15:00:21] - Error: Could not get length of results file work/wuresults_02.dat
[15:00:21] - Error: Could not read unit 02 file. Removing from queue.
[15:00:21] Trying to send all finished work units
[15:00:21] + No unsent completed units remaining.
[15:00:21] + -oneunit flag given and have now finished a unit. Exiting.- Preparing to get new work unit...
[15:00:21] Cleaning up work directory
[15:00:21] ***** Got a SIGTERM signal (15)
[15:00:21] Killing all core threads

Folding@Home Client Shutdown.
logout

[Process completed]
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: fahcore a3 breaks after 1-2%

Post by bruce »

padmavyuha wrote:...and again. There's no point running any more of these until it's clear why they're not working now.

Code:

[15:00:21] Project: 6014 (Run 0, Clone 106, Gen 86)
All of your reports are for the same WU. There's a reasonable chance that it's a bad WU and you just need to get past it and move on to something else. Nobody has successfully returned that WU yet.

I'm going to change the title and move this topic to the forum where reports of possible bad WUs are found.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB RAM; Gigabyte GA-X48-DS4 motherboard; PC Power and Cooling Q750 PSU; 2x GTX 460; Windows Server 2008 x64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB RAM; Gigabyte GA-X48-DS4 motherboard; PC Power and Cooling Q750 PSU; 2x GTX 460 video cards; Windows 7 x64.

Machine #3:

Dell Dimension 8400, 3.2 GHz P4, 4x512MB RAM, GTX 460 video card, Windows 7 x32.

I am currently folding just on the 5x GTX 460s for approx. 70K PPD
Location: Salem. OR USA

Re: Project: 6014 (Run 0, Clone 106, Gen 86)

Post by P5-133XL »

Both of these are the same WU. Seems like you have a reproducible error and that indicates a bad WU. Kasson may want a copy of the data files for this WU. I suggest that you PM him.

If you are not interested in that, you will need to force a new WU. You do that by deleting the work folder and the queue, then reconfiguring the client with a different Machine ID so that, from the server's point of view, it is a totally different install.
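As a sketch of that reset, assuming the Mac v6 client layout shown in the logs above (launch directory ~/Library/Folding@home; paths may differ on your install, and FAH_DIR here is just an assumption for illustration):

```shell
# Force a fresh WU: delete the cached work and the queue, then reconfigure.
# FAH_DIR is an assumption based on the "Launch directory" in the logs above.
FAH_DIR="${FAH_DIR:-$HOME/Library/Folding@home}"
rm -rf "$FAH_DIR/work"       # discard the downloaded work unit
rm -f  "$FAH_DIR/queue.dat"  # discard the queue entry for it
# Then re-run the client configuration and choose a different Machine ID, e.g.:
#   cd "$FAH_DIR" && ./fah6 -configonly
echo "reset $FAH_DIR"
```

Back up anything you want to keep (e.g. for sending to Kasson) before deleting the work folder.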
padmavyuha
Posts: 11
Joined: Fri Apr 25, 2008 5:37 pm

Re: Project: 6014 (Run 0, Clone 106, Gen 86)

Post by padmavyuha »

Oh, I didn't realise it was sending me the same WU over and over again. That's a bit weird. I'll archive my current work files, reset the system as you suggest, and PM Kasson.
padmavyuha
Posts: 11
Joined: Fri Apr 25, 2008 5:37 pm

Re: Project: 6014 (Run 0, Clone 106, Gen 86)

Post by padmavyuha »

Now running with a fresh WU, and have PM'd Kasson.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Location: Salem. OR USA

Re: Project: 6014 (Run 0, Clone 106, Gen 86)

Post by P5-133XL »

If you don't get a "Thank you ..." in the log, the server never received the WU back, so when your machine requests new work the server looks up the last WU that was assigned and hands you the same one again. It is designed that way to prevent people from cherry-picking WUs.
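A toy model of that rule (purely illustrative pseudologic, not the Stanford server code) shows why every retry in the logs above was Project 6014 (Run 0, Clone 106, Gen 86), and why changing the Machine ID side-steps it:

```python
# Toy sketch of the reassignment behaviour described above: until a machine
# returns a valid result for its last WU, the server keeps re-issuing it.
class WorkServer:
    def __init__(self):
        self.pending = {}  # machine_id -> WU still owed to the server

    def assign(self, machine_id, fresh_wu):
        if machine_id in self.pending:       # last result never arrived...
            return self.pending[machine_id]  # ...so re-issue the same WU
        self.pending[machine_id] = fresh_wu
        return fresh_wu

    def receive_result(self, machine_id, ok):
        if ok:  # the "Thank you ..." case: the debt is cleared
            self.pending.pop(machine_id, None)

server = WorkServer()
first = server.assign(1, "P6014 R0 C106 G86")
server.receive_result(1, ok=False)        # crash at 2%, nothing uploaded
retry = server.assign(1, "some other WU")
print(retry == first)                     # same WU is issued again
server.receive_result(1, ok=True)         # successful upload
print(server.assign(1, "some other WU"))  # now a fresh WU is issued
```

Since the pending assignment is keyed per machine, a new Machine ID looks like a brand-new install to the server, which is why P5-133XL's reset works.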
padmavyuha
Posts: 11
Joined: Fri Apr 25, 2008 5:37 pm

Re: Project: 6014 (Run 0, Clone 106, Gen 86)

Post by padmavyuha »

Smart. I did actually start to muse (in a half-awake sort of way) on why the WUs I was getting all seemed to be around the same size :), but I had no reason to suspect it was the same one over and over. This makes me feel better about the ones that crashed (in my other thread) after I hit Ctrl-C, as presumably I ended up getting them again and completing them. Closure!
padmavyuha
Posts: 11
Joined: Fri Apr 25, 2008 5:37 pm

Re: Project: 6014 (Run 0, Clone 106, Gen 86)

Post by padmavyuha »

Another one bites the dust :( - Project: 6015 (Run 0, Clone 8, Gen 109)

I haven't had a reply from Kasson about the last one yet, by the way.

Code:

Last login: Wed Mar 24 11:08:42 on console
/Users/yoxi/Applications/mym/fah.command ; exit;
Macintosh:~ yoxi$ /Users/yoxi/Applications/mym/fah.command ; exit;
COLUMNS=55;
LINES=6;
export COLUMNS LINES;
Using local directory for configuration

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

Using local directory for work files
2 cores detected


--- Opening Log file [March 25 01:00:25 UTC] 


# Mac OS X SMP Console Edition ################################################
###############################################################################

                       Folding@Home Client Version 6.29r1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/yoxi/Library/Folding@home
Executable: ./fah6
Arguments: -smp -local -verbosity 9 -oneunit -forceasm 

[01:00:25] - Ask before connecting: No
[01:00:25] - User name: yoxi (Team 3007)
[01:00:25] - User ID: 80C71DC069F7846
[01:00:25] - Machine ID: 2
[01:00:25] 
[01:00:26] Loaded queue successfully.
[01:00:26] - Preparing to get new work unit...
[01:00:26] Cleaning up work directory
[01:00:26] + Attempting to get work packet
[01:00:26] Passkey found
[01:00:26] - Will indicate memory of 8192 MB
[01:00:26] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 7, Stepping: 10
[01:00:26] - Connecting to assignment server
[01:00:26] Connecting to http://assign.stanford.edu:8080/
[01:00:26] - Autosending finished units... [March 25 01:00:26 UTC]
[01:00:26] Trying to send all finished work units
[01:00:26] + No unsent completed units remaining.
[01:00:26] - Autosend completed
[01:00:27] Posted data.
[01:00:27] Initial: ED82; - Successful: assigned to (130.237.232.140).
[01:00:27] + News From Folding@Home: Welcome to Folding@Home
[01:00:27] Loaded queue successfully.
[01:00:27] Sent data
[01:00:27] Connecting to http://130.237.232.140:8080/
[01:00:29] Posted data.
[01:00:29] Initial: 0000; - Receiving payload (expected size: 1799333)
[01:00:47] - Downloaded at ~97 kB/s
[01:00:47] - Averaged speed for that direction ~56 kB/s
[01:00:47] + Received work.
[01:00:47] + Closed connections
[01:00:47] 
[01:00:47] + Processing work unit
[01:00:47] Core required: FahCore_a3.exe
[01:00:47] Core found.
[01:00:47] Working on queue slot 02 [March 25 01:00:47 UTC]
[01:00:47] + Working ...
[01:00:47] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 02 -np 2 -checkpoint 30 -forceasm -verbose -lifeline 2422 -version 629'

[01:00:47] 
[01:00:47] *------------------------------*
[01:00:47] Folding@Home Gromacs SMP Core
[01:00:47] Version 2.17 (Mar 7 2010)
[01:00:47] 
[01:00:47] Preparing to commence simulation
[01:00:47] - Assembly optimizations manually forced on.
[01:00:47] - Not checking prior termination.
[01:00:47] - Expanded 1798821 -> 2392545 (decompressed 133.0 percent)
[01:00:47] Called DecompressByteArray: compressed_data_size=1798821 data_size=2392545, decompressed_data_size=2392545 diff=0
[01:00:47] - Digital signature verified
[01:00:47] 
[01:00:47] Project: 6015 (Run 0, Clone 8, Gen 109)
[01:00:47] 
[01:00:47] Assembly optimizations on if available.
[01:00:47] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_02.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
55000004 steps, 110000.0 ps (continuing from step 54500004, 109000.0 ps).
[01:00:54] Completed 0 out of 500000 steps  (0%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
8 particles communicated to PME node 1 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[01:04:35] mdrun returned 255
[01:04:35] Going to send back what have done -- stepsTotalG=500000
[01:04:35] Work fraction=0.0035 steps=500000.
[01:04:35] CoreStatus = 0 (0)
[01:04:35] Sending work to server
[01:04:35] Project: 6015 (Run 0, Clone 8, Gen 109)
[01:04:35] - Error: Could not get length of results file work/wuresults_02.dat
[01:04:35] - Error: Could not read unit 02 file. Removing from queue.
[01:04:35] Trying to send all finished work units
[01:04:35] + No unsent completed units remaining.
[01:04:35] + -oneunit flag given and have now finished a unit. Exiting.- Preparing to get new work unit...
[01:04:35] Cleaning up work directory
[01:04:35] + Attempting to get work packet
[01:04:35] Passkey found
[01:04:35] - Will indicate memory of 8192 MB
[01:04:35] - Connecting to assignment server
[01:04:35] Connecting to http://assign.stanford.edu:8080/
[01:04:35] ***** Got a SIGTERM signal (15)
[01:04:35] Killing all core threads

Folding@Home Client Shutdown.
logout

[Process completed]
padmavyuha
Posts: 11
Joined: Fri Apr 25, 2008 5:37 pm

Re: Project: 6014 (Run 0, Clone 106, Gen 86)

Post by padmavyuha »

...and another one - Project: 6015 (Run 0, Clone 30, Gen 108)

I deleted the fahcores as well this time, hence the new download. I suspect there may be a problem with FahCore_a3 itself: this problem began to appear only after the most recent update to the core a couple of weeks ago (or whenever it was). I remember running fah6 and seeing a new version of FahCore_a3 being downloaded, anyway, and it's only since then that these errors have started appearing.

Code:

Last login: Thu Mar 25 08:12:31 on ttys000
Macintosh:~ yoxi$ /Users/yoxi/Applications/mym/fah.command ; exit;
COLUMNS=55;
LINES=6;
export COLUMNS LINES;
Using local directory for configuration

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

Using local directory for work files
2 cores detected


--- Opening Log file [March 25 08:13:37 UTC] 


# Mac OS X SMP Console Edition ################################################
###############################################################################

                       Folding@Home Client Version 6.29r1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/yoxi/Library/Folding@home
Executable: ./fah6
Arguments: -smp -local -verbosity 9 -oneunit -forceasm 

[08:13:37] - Ask before connecting: No
[08:13:37] - User name: yoxi (Team 3007)
[08:13:37] - User ID: 80C71DC069F7846
[08:13:37] - Machine ID: 3
[08:13:37] 
[08:13:37] Work directory not found. Creating...
[08:13:37] Could not open work queue, generating new queue...
[08:13:37] - Autosending finished units... [March 25 08:13:37 UTC]
[08:13:37] Trying to send all finished work units
[08:13:37] + No unsent completed units remaining.
[08:13:37] - Autosend completed
[08:13:37] - Preparing to get new work unit...
[08:13:37] Cleaning up work directory
[08:13:37] + Attempting to get work packet
[08:13:37] Passkey found
[08:13:37] - Will indicate memory of 8192 MB
[08:13:37] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 7, Stepping: 10
[08:13:37] - Connecting to assignment server
[08:13:37] Connecting to http://assign.stanford.edu:8080/
[08:13:38] Posted data.
[08:13:38] Initial: ED82; - Successful: assigned to (130.237.232.140).
[08:13:38] + News From Folding@Home: Welcome to Folding@Home
[08:13:38] Loaded queue successfully.
[08:13:38] Sent data
[08:13:38] Connecting to http://130.237.232.140:8080/
[08:13:39] Posted data.
[08:13:39] Initial: 0000; - Receiving payload (expected size: 1798679)
[08:14:35] - Downloaded at ~31 kB/s
[08:14:35] - Averaged speed for that direction ~31 kB/s
[08:14:35] + Received work.
[08:14:35] + Closed connections
[08:14:35] 
[08:14:35] + Processing work unit
[08:14:35] Core required: FahCore_a3.exe
[08:14:35] Core not found.
[08:14:35] - Core is not present or corrupted.
[08:14:35] - Attempting to download new core...
[08:14:35] + Downloading new core: FahCore_a3.exe
[08:14:35] Downloading core (/~pande/OSX/x86/Core_a3.fah from www.stanford.edu)
[08:14:37] Initial: AFDE; + 10240 bytes downloaded
 [...snip...]
[08:16:39] Initial: 5486; + 2125653 bytes downloaded
[08:16:40] Verifying core Core_a3.fah...
[08:16:40] Signature is VALID
[08:16:40] 
[08:16:40] Trying to unzip core FahCore_a3.exe
[08:16:40] Decompressed FahCore_a3.exe (6392344 bytes) successfully
[08:16:40] + Core successfully engaged
[08:16:45] 
[08:16:45] + Processing work unit
[08:16:45] Core required: FahCore_a3.exe
[08:16:45] Core found.
[08:16:45] Working on queue slot 01 [March 25 08:16:45 UTC]
[08:16:45] + Working ...
[08:16:45] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 2 -checkpoint 5 -forceasm -verbose -lifeline 3259 -version 629'

[08:16:45] 
[08:16:45] *------------------------------*
[08:16:45] Folding@Home Gromacs SMP Core
[08:16:45] Version 2.17 (Mar 7 2010)
[08:16:45] 
[08:16:45] Preparing to commence simulation
[08:16:45] - Assembly optimizations manually forced on.
[08:16:45] - Not checking prior termination.
[08:16:46] - Expanded 1798167 -> 2392545 (decompressed 133.0 percent)
[08:16:46] Called DecompressByteArray: compressed_data_size=1798167 data_size=2392545, decompressed_data_size=2392545 diff=0
[08:16:46] - Digital signature verified
[08:16:46] 
[08:16:46] Project: 6015 (Run 0, Clone 30, Gen 108)
[08:16:46] 
[08:16:46] Assembly optimizations on if available.
[08:16:46] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_01.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
54500004 steps, 109000.0 ps (continuing from step 54000004, 108000.0 ps).
[08:16:52] Completed 0 out of 500000 steps  (0%)
[08:28:03] Completed 5000 out of 500000 steps  (1%)
[08:39:16] Completed 10000 out of 500000 steps  (2%)
[08:51:00] Completed 15000 out of 500000 steps  (3%)
[09:02:02] Completed 20000 out of 500000 steps  (4%)
[09:13:39] Completed 25000 out of 500000 steps  (5%)
[09:25:33] Completed 30000 out of 500000 steps  (6%)
[09:37:45] Completed 35000 out of 500000 steps  (7%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
3 particles communicated to PME node 1 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[09:47:59] mdrun returned 255
[09:47:59] Going to send back what have done -- stepsTotalG=500000
[09:47:59] Work fraction=0.0794 steps=500000.
[09:48:03] logfile size=15619 infoLength=15619 edr=0 trr=25
[09:48:03] logfile size: 15619 info=15619 bed=0 hdr=25
[09:48:03] - Writing 16157 bytes of core data to disk...
[09:48:03]   ... Done.
[09:48:03] 
[09:48:03] Folding@home Core Shutdown: UNSTABLE_MACHINE
[09:48:03] CoreStatus = 7A (122)
[09:48:03] Sending work to server
[09:48:03] Project: 6015 (Run 0, Clone 30, Gen 108)


[09:48:03] + Attempting to send results [March 25 09:48:03 UTC]
[09:48:03] - Reading file work/wuresults_01.dat from core
[09:48:03]   (Read 16157 bytes from disk)
[09:48:03] Connecting to http://130.237.232.140:8080/
[09:48:04] Posted data.
[09:48:04] Initial: 0000; - Uploaded at ~16 kB/s
[09:48:04] - Averaged speed for that direction ~16 kB/s
[09:48:04] + Results successfully sent
[09:48:04] Thank you for your contribution to Folding@Home.
[09:48:04] Trying to send all finished work units
[09:48:04] + No unsent completed units remaining.
[09:48:04] + -oneunit flag given and have now finished a unit. Exiting.- Preparing to get new work unit...
[09:48:04] Cleaning up work directory
[09:48:04] + Attempting to get work packet
[09:48:04] Passkey found
[09:48:04] - Will indicate memory of 8192 MB
[09:48:04] - Connecting to assignment server
[09:48:04] Connecting to http://assign.stanford.edu:8080/
[09:48:04] ***** Got a SIGTERM signal (15)
[09:48:04] Killing all core threads

Folding@Home Client Shutdown.
logout

[Process completed]
toTOW
Site Moderator
Posts: 6435
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Project: 6014 (Run 0, Clone 106, Gen 86)

Post by toTOW »

Is overclocking involved here? Did you check your system stability (StressCPU and Memtest)?

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
padmavyuha
Posts: 11
Joined: Fri Apr 25, 2008 5:37 pm

Re: Project: 6014 (Run 0, Clone 106, Gen 86)

Post by padmavyuha »

Just your basic dual-core iMac, no cheese, hold the overclocking. I've checked the memory, but will check it again. In the meantime, either there are suddenly some rogue WUs out there, or it's FahCore_a3 that's choking. All was working fine on the folding front until FahCore_a3 got updated. Of course, it could be coincidence and maybe something's up with my Mac, but I'm not having any other problems.
Aardvark
Posts: 143
Joined: Sat Jul 12, 2008 4:22 pm
Location: Team MacResource

Re: Project: 6014 (Run 0, Clone 106, Gen 86)

Post by Aardvark »

I am afraid this problem is more serious than the discussion to date seems to accept. Over the past 7-10 days I have experienced almost exactly the same failure: approximately 4 or 5 a3-core WUs have failed with the same error report at the 1 to 7% stage of folding. Some of the WUs turned into EUEs, and the reports on those should have been received and noted by the servers at Stanford at the time of return. They have not all been from the same project, but they have all been a3-core projects.

I am running these on a 1.83 GHz Mac mini with OS X 10.6.2. Nothing has changed recently with my rig other than the recent update of the a3 core to v2.17, though I did recently go from OS X 10.6.1 to 10.6.2. I have not posted anything about this problem before because it will happen to two WUs in a row and then another a3-core WU will show up and run without a problem. I am not into any overclocking; this is a strictly stock Mac mini rig.

I am not including any logs in this post since I am sending this from a different machine and I don't have immediate access to the log files. If there is any interest in my providing those, let me know.

I sense this is more insidious and widespread than is being accepted. Someone should give it some "serious" attention.
What is past is prologue!
padmavyuha
Posts: 11
Joined: Fri Apr 25, 2008 5:37 pm

Re: Project: 6014 (Run 0, Clone 106, Gen 86)

Post by padmavyuha »

Well, I'm glad to know it's not just me.
Aardvark
Posts: 143
Joined: Sat Jul 12, 2008 4:22 pm
Location: Team MacResource

Re: Project: 6014 (Run 0, Clone 106, Gen 86)

Post by Aardvark »

@ padmavyuha;

You are not alone in this, and I suspect there are others beyond you and me. I agree with your musing that it is probably associated with the new v2.17 a3 core. The timing would certainly seem to point in that direction.

It is important that word gets out about this problem so others can chime in and provide the "critical mass" necessary to get a significant PG response in time to save us from total collapse.....:-) If moving the thread again seems indicated, I hope one of the Moderators will do the deed.

Don't let the issue die. Unless it does just go away on its own.
What is past is prologue!