p2662 (Run 1 Clone 173 Gen 8) running too slow?

parkut
Posts: 365
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Post by parkut »

Long posting, sorry... Need advice... Hope this is useful information.

I have a Core2 machine that has had no problems completing 2662s (see the end of this post for the percentage of deadline time remaining on earlier units). However, it is now stuck on a WU that won't complete, and I keep getting re-assigned the same WU.

model name : Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz
cpu MHz : 2664.000
cache size : 4096 KB
Memory: 976.11 MB physical, 1.94 GB virtual

On August 9th it crashed at 67% with a 0xff error, and the client then downloaded the same WU again and started it over.

Code:

[08:04:28] Completed 812500 out of 1250000 steps  (65%)
[09:03:08] Completed 825000 out of 1250000 steps  (66%)
[10:01:49] Completed 837500 out of 1250000 steps  (67%)
[10:59:15] 
[10:59:15] Folding@home Core Shutdown: INTERRUPTED
[10:59:19] CoreStatus = FF (255)
[10:59:19] Client-core communications error: ERROR 0xff
[10:59:19] Deleting current work unit & continuing...
[10:59:27] - Warning: Could not delete all work unit files (1): Core file absent
[10:59:27] Trying to send all finished work units
[10:59:27] + No unsent completed units remaining.
[10:59:27] - Preparing to get new work unit...
[10:59:27] + Attempting to get work packet
[10:59:27] - Will indicate memory of 976 MB
[10:59:27] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 15, Stepping: 11
[10:59:27] - Connecting to assignment server
[10:59:27] Connecting to http://assign.stanford.edu:8080/
[10:59:28] Posted data.
[10:59:28] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[10:59:28] + News From Folding@Home: Welcome to Folding@Home
[10:59:28] Loaded queue successfully.
[10:59:28] Connecting to http://171.64.65.56:8080/
[10:59:33] Posted data.
[10:59:33] Initial: 0000; - Receiving payload (expected size: 5001321)
[11:00:01] - Downloaded at ~174 kB/s
[11:00:01] - Averaged speed for that direction ~172 kB/s
[11:00:01] + Received work.
[11:00:01] + Closed connections
[11:00:06] 
[11:00:06] + Processing work unit
[11:00:06] Core required: FahCore_a2.exe
[11:00:06] Core found.
[11:00:06] Working on Unit 02 [August 9 11:00:06]
[11:00:06] + Working ...
[11:00:06] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 2756 -version 602'

[11:00:06] 
[11:00:06] *------------------------------*
[11:00:06] Folding@Home Gromacs SMP Core
[11:00:06] Version 2.00 (Wed Jul 9 13:11:25 PDT 2008)
[11:00:06] 
[11:00:06] Preparing to commence simulation
[11:00:06] - Ensuring status. Please wait.
[11:00:07] Called DecompressByteArray: compressed_data_size=5000809 data_size=24742709, decompressed_data_size=24742709 diff=0
[11:00:07] - Digital signature verified
[11:00:07] 
[11:00:07] Project: 2662 (Run 1, Clone 173, Gen 8)

On August 12th it crashed again with a 0xff error, this time stating that the deadline had passed. The clients failed to shut down, so I had to manually kill the cores and delete the work folder contents and the queue.dat file.

Code:

[09:11:42] Completed 900000 out of 1250000 steps  (72%)
[10:10:16] Completed 912500 out of 1250000 steps  (73%)
[11:08:48] Completed 925000 out of 1250000 steps  (74%)
[11:08:48] Unit 2's deadline (August 12 11:00) has passed.
[11:08:48] Going to interrupt core and move on to next unit...
[11:08:52] CoreStatus = FF (255)
[11:08:52] Client-core communications error: ERROR 0xff
[11:08:52] Deleting current work unit & continuing...
[11:23:52] - Autosending finished units...
[11:23:52] Trying to send all finished work units
[11:23:52] + No unsent completed units remaining.
[11:23:52] - Autosend completed

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [August 12 12:01:31] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /root/fah6
Executable: ./fah6
Arguments: -advmethods -verbosity 9 -smp 

[12:01:31] - Ask before connecting: No
[12:01:31] - User name: parkut (Team 4)
[12:01:31] - User ID: 7B76FF2E050086E6
[12:01:31] - Machine ID: 1
[12:01:31] 

A potential conflict was detected:

Process 2756 is currently running and may also be a client with Mach. ID 1.
Program will now exit. Upon restart, this check will not be done -- 
you may wish to check that no client is currently running in
/root/fah6 before restarting.

Please press any key to exit.

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [August 12 12:08:53] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /root/fah6
Executable: ./fah6
Arguments: -advmethods -verbosity 9 -smp 

[12:08:53] - Ask before connecting: No
[12:08:53] - User name: parkut (Team 4)
[12:08:53] - User ID: 7B76FF2E050086E6
[12:08:53] - Machine ID: 1
[12:08:53] 
[12:08:53] Could not open work queue, generating new queue...
[12:08:53] - Preparing to get new work unit...
[12:08:53] + Attempting to get work packet
[12:08:53] - Will indicate memory of 976 MB
[12:08:53] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 15, Stepping: 11
[12:08:53] - Connecting to assignment server
[12:08:53] Connecting to http://assign.stanford.edu:8080/
[12:08:53] - Autosending finished units...
[12:08:53] Trying to send all finished work units
[12:08:53] + No unsent completed units remaining.
[12:08:53] - Autosend completed
[12:08:53] Posted data.
[12:08:53] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[12:08:53] + News From Folding@Home: Welcome to Folding@Home
[12:08:53] Loaded queue successfully.
[12:08:53] Connecting to http://171.64.65.56:8080/
[12:08:58] Posted data.
[12:08:58] Initial: 0000; - Receiving payload (expected size: 5001321)
[12:09:26] - Downloaded at ~174 kB/s
[12:09:26] - Averaged speed for that direction ~174 kB/s
[12:09:26] + Received work.
[12:09:26] + Closed connections
[12:09:26] 
[12:09:26] + Processing work unit
[12:09:26] Core required: FahCore_a2.exe
[12:09:26] Core found.
[12:09:26] Working on Unit 01 [August 12 12:09:26]
[12:09:26] + Working ...
[12:09:26] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 17679 -version 602'

[12:09:26] 
[12:09:26] *------------------------------*
[12:09:26] Folding@Home Gromacs SMP Core
[12:09:26] Version 2.00 (Wed Jul 9 13:11:25 PDT 2008)
[12:09:26] 
[12:09:26] Preparing to commence simulation
[12:09:26] - Ensuring status. Please wait.
[12:09:27] Called DecompressByteArray: compressed_data_size=5000809 data_size=24742709, decompressed_data_size=24742709 diff=0
[12:09:27] - Digital signature verified
[12:09:27] 
[12:09:27] Project: 2662 (Run 1, Clone 173, Gen 8)

Today, August 15th, it crashed again with the same 0xff error and the same deadline-passed notice. Once more the clients failed to shut down, so I had to manually kill the cores and delete the work folder contents and the queue.dat file.

Code:

[10:34:19] Completed 900000 out of 1250000 steps  (72%)
[11:33:03] Completed 912500 out of 1250000 steps  (73%)
[12:08:53] - Autosending finished units...
[12:08:53] Trying to send all finished work units
[12:08:53] + No unsent completed units remaining.
[12:08:53] - Autosend completed
[12:31:47] Completed 925000 out of 1250000 steps  (74%)
[12:31:47] Unit 1's deadline (August 15 12:09) has passed.
[12:31:47] Going to interrupt core and move on to next unit...
[12:31:51] CoreStatus = FF (255)
[12:31:51] Client-core communications error: ERROR 0xff
[12:31:51] Deleting current work unit & continuing...

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [August 15 13:01:31] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /root/fah6
Executable: ./fah6
Arguments: -advmethods -verbosity 9 -smp 

[13:01:31] - Ask before connecting: No
[13:01:31] - User name: parkut (Team 4)
[13:01:31] - User ID: 7B76FF2E050086E6
[13:01:31] - Machine ID: 1
[13:01:31] 

A potential conflict was detected:

Process 17679 is currently running and may also be a client with Mach. ID 1.
Program will now exit. Upon restart, this check will not be done -- 
you may wish to check that no client is currently running in
/root/fah6 before restarting.

Please press any key to exit.

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [August 15 13:31:31] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /root/fah6
Executable: ./fah6
Arguments: -advmethods -verbosity 9 -smp 

[13:31:31] - Ask before connecting: No
[13:31:31] - User name: parkut (Team 4)
[13:31:31] - User ID: 7B76FF2E050086E6
[13:31:31] - Machine ID: 1
[13:31:31] 
[13:31:31] Loaded queue successfully.
[13:31:31] Unit 1's deadline (August 15 12:09) has passed.
[13:56:47] ***** Got a SIGTERM signal (15)
[13:56:47] Killing all core threads

Folding@Home Client Shutdown.

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [August 15 13:58:05] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /root/fah6
Executable: ./fah6
Arguments: -advmethods -verbosity 9 -smp 

[13:58:05] - Ask before connecting: No
[13:58:05] - User name: parkut (Team 4)
[13:58:05] - User ID: 7B76FF2E050086E6
[13:58:05] - Machine ID: 1
[13:58:05] 
[13:58:06] Loaded queue successfully.
[13:58:06] Unit 1's deadline (August 15 12:09) has passed.

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [August 15 14:01:04] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /root/fah6
Executable: ./fah6
Arguments: -advmethods -verbosity 9 -smp 

[14:01:04] - Ask before connecting: No
[14:01:04] - User name: parkut (Team 4)
[14:01:04] - User ID: 7B76FF2E050086E6
[14:01:04] - Machine ID: 1
[14:01:04] 
[14:01:04] Could not open work queue, generating new queue...
[14:01:04] - Preparing to get new work unit...
[14:01:04] + Attempting to get work packet
[14:01:04] - Will indicate memory of 976 MB
[14:01:04] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 15, Stepping: 11
[14:01:04] - Connecting to assignment server
[14:01:04] Connecting to http://assign.stanford.edu:8080/
[14:01:04] - Autosending finished units...
[14:01:04] Trying to send all finished work units
[14:01:04] + No unsent completed units remaining.
[14:01:04] - Autosend completed
[14:01:04] Posted data.
[14:01:04] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[14:01:04] + News From Folding@Home: Welcome to Folding@Home
[14:01:05] Loaded queue successfully.
[14:01:05] Connecting to http://171.64.65.56:8080/
[14:01:10] Posted data.
[14:01:10] Initial: 0000; - Receiving payload (expected size: 5001321)
[14:01:40] - Downloaded at ~162 kB/s
[14:01:40] - Averaged speed for that direction ~162 kB/s
[14:01:40] + Received work.
[14:01:40] + Closed connections
[14:01:40] 
[14:01:40] + Processing work unit
[14:01:40] Core required: FahCore_a2.exe
[14:01:40] Core found.
[14:01:40] Working on Unit 01 [August 15 14:01:40]
[14:01:40] + Working ...
[14:01:40] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 16974 -version 602'

[14:01:40] 
[14:01:40] *------------------------------*
[14:01:40] Folding@Home Gromacs SMP Core
[14:01:40] Version 2.00 (Wed Jul 9 13:11:25 PDT 2008)
[14:01:40] 
[14:01:40] Preparing to commence simulation
[14:01:40] - Ensuring status. Please wait.
[14:01:41] Called DecompressByteArray: compressed_data_size=5000809 data_size=24742709, decompressed_data_size=24742709 diff=0
[14:01:41] - Digital signature verified
[14:01:41] 
[14:01:41] Project: 2662 (Run 1, Clone 173, Gen 8)
[14:01:41] 
[14:01:41] Assembly optimizations on if available.

This machine had been running 2662s with no problems prior to this particular WU:

Code:
[16:19:36] Project: 2662 (Run 1, Clone 173, Gen 8)
[16:16:22] Unit 0 finished with 73 percent of time to deadline remaining.
[20:46:13] Project: 2662 (Run 1, Clone 395, Gen 3)
[20:42:49] Unit 9 finished with 73 percent of time to deadline remaining.
[01:09:23] Project: 2662 (Run 1, Clone 428, Gen 1)
[01:06:09] Unit 8 finished with 73 percent of time to deadline remaining.
[05:39:29] Project: 2662 (Run 1, Clone 315, Gen 1)
[05:36:15] Unit 7 finished with 73 percent of time to deadline remaining.
[10:03:25] Project: 2662 (Run 1, Clone 163, Gen 5)
[10:00:15] Unit 6 finished with 74 percent of time to deadline remaining.
[15:38:20] Project: 2662 (Run 0, Clone 328, Gen 0)
[15:35:12] Unit 5 finished with 73 percent of time to deadline remaining.
[20:05:21] Project: 2662 (Run 1, Clone 191, Gen 3)
[20:01:59] Unit 4 finished with 73 percent of time to deadline remaining.
[00:33:14] Project: 2662 (Run 1, Clone 305, Gen 0)
[00:30:08] Unit 3 finished with 74 percent of time to deadline remaining.
[06:01:28] Project: 2662 (Run 0, Clone 235, Gen 1)
[05:58:23] Unit 2 finished with 74 percent of time to deadline remaining.
[11:32:58] Project: 2662 (Run 0, Clone 141, Gen 0)
[11:29:51] Unit 1 finished with 74 percent of time to deadline remaining.
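
For anyone who wants to put a number on "too slow", the frame time can be read straight off the progress lines. The following is only a minimal sketch: the helper name `frame_times` and the regular expression are illustrative and not part of the client, and it assumes at most one midnight rollover between consecutive lines because the timestamps carry no date. The sample lines are copied from the August 9th log above.

```python
import re
from datetime import datetime, timedelta

# Matches client progress lines such as:
# [08:04:28] Completed 812500 out of 1250000 steps  (65%)
PROGRESS = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] Completed (\d+) out of (\d+) steps\s+\((\d+)%\)"
)

def frame_times(log_text):
    """Return the seconds elapsed between consecutive progress lines."""
    stamps = [
        datetime.strptime(m.group(1), "%H:%M:%S")
        for m in PROGRESS.finditer(log_text)
    ]
    deltas = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = cur - prev
        if delta < timedelta(0):
            # Timestamps have no date; assume a single midnight rollover.
            delta += timedelta(days=1)
        deltas.append(delta.total_seconds())
    return deltas

sample = """\
[08:04:28] Completed 812500 out of 1250000 steps  (65%)
[09:03:08] Completed 825000 out of 1250000 steps  (66%)
[10:01:49] Completed 837500 out of 1250000 steps  (67%)
"""

times = frame_times(sample)
mean_frame = sum(times) / len(times)
print(f"mean frame time: {mean_frame / 60:.1f} min")
```

On these three lines the two intervals are 3520 and 3521 seconds, so a little under an hour per 1% frame.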
jonault
Posts: 218
Joined: Fri Dec 14, 2007 9:53 pm

Post by jonault »

I have a couple of 2662 work units (Run 2, Clone 213, Gen 7) that are also running really slowly and will not be completed before the due date (but they haven't crashed). At the current rate, they will require about 80 hours to complete. For comparison, the last two 2662s I ran earlier this week (Run 0, Clone 281, Gen 5 and Run 2, Clone 336, Gen 2) both finished in about 12 hours. They all used the same core (FahCore_a2). The core processes are all getting better than 90% of the cpu time, so there isn't anything else on the system slowing them down.

Code:

[12:51:25] Trying to send all finished work units
[12:51:25] + No unsent completed units remaining.
[12:51:25] - Preparing to get new work unit...
[12:51:25] + Attempting to get work packet
[12:51:25] - Will indicate memory of 4096 MB
[12:51:25] - Connecting to assignment server
[12:51:25] Connecting to http://assign.stanford.edu:8080/
[12:51:25] Posted data.
[12:51:25] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[12:51:25] + News From Folding@Home: Welcome to Folding@Home
[12:51:26] Loaded queue successfully.
[12:51:26] Connecting to http://171.64.65.56:8080/
[12:51:31] Posted data.
[12:51:31] Initial: 0000; - Receiving payload (expected size: 4920915)
[12:51:39] - Downloaded at ~600 kB/s
[12:51:39] - Averaged speed for that direction ~583 kB/s
[12:51:39] + Received work.
[12:51:39] Trying to send all finished work units
[12:51:39] + No unsent completed units remaining.
[12:51:39] + Closed connections
[12:51:39] 
[12:51:39] + Processing work unit
[12:51:39] Core required: FahCore_a2.exe
[12:51:39] Core found.
[12:51:39] - Using generic ./mpiexec
[12:51:39] Working on queue slot 04 [August 14 12:51:39 UTC]
[12:51:39] + Working ...
[12:51:39] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 04 -checkpoint 15 -forceasm -verbose -lifeline 767 -version 620'

[12:51:39] 
[12:51:39] *------------------------------*
[12:51:39] Folding@Home Gromacs SMP Core
[12:51:39] Version 2.01 (Wed Jul 16 08:26:53 PDT 2008)
[12:51:39] 
[12:51:39] Preparing to commence simulation
[12:51:39] - Ensuring status. Please wait.
[12:51:48] - Assembly optimizations manually forced on.
[12:51:48] - Not checking prior termination.
[12:51:49] - Expanded 4920403 -> 24360573 (decompressed 495.0 percent)
[12:51:49] Called DecompressByteArray: compressed_data_size=4920403 data_size=24360573, decompressed_data_size=24360573 diff=0
[12:51:50] - Digital signature verified
[12:51:50] 
[12:51:50] Project: 2662 (Run 2, Clone 213, Gen 7)
[12:51:50] 
[12:51:50] Assembly optimizations on if available.
[12:51:50] Entering M.D.
[12:51:56] Node 1 initialized
[13:39:38] Completed 17500 out of 1750000 steps  (1%)
[14:27:03] Completed 35000 out of 1750000 steps  (2%)
[15:14:35] Completed 52500 out of 1750000 steps  (3%)
[15:56:38] - Autosending finished units... [August 14 15:56:38 UTC]
[15:56:38] Trying to send all finished work units
[15:56:38] + No unsent completed units remaining.
[15:56:38] - Autosend completed
[16:02:01] Completed 70000 out of 1750000 steps  (4%)
[16:49:43] Completed 87500 out of 1750000 steps  (5%)
[17:36:49] Completed 105000 out of 1750000 steps  (6%)
[18:24:44] Completed 122500 out of 1750000 steps  (7%)
[19:16:29] Completed 140000 out of 1750000 steps  (8%)
[20:04:04] Completed 157500 out of 1750000 steps  (9%)
[20:51:54] Completed 175000 out of 1750000 steps  (10%)
[21:39:40] Completed 192500 out of 1750000 steps  (11%)
[21:56:39] - Autosending finished units... [August 14 21:56:39 UTC]
[21:56:39] Trying to send all finished work units
[21:56:39] + No unsent completed units remaining.
[21:56:39] - Autosend completed
[22:27:05] Completed 210000 out of 1750000 steps  (12%)
[23:14:30] Completed 227500 out of 1750000 steps  (13%)
[00:01:10] Completed 245000 out of 1750000 steps  (14%)
[00:48:48] Completed 262500 out of 1750000 steps  (15%)
[01:36:28] Completed 280000 out of 1750000 steps  (16%)
[02:24:04] Completed 297500 out of 1750000 steps  (17%)
[03:10:59] Completed 315000 out of 1750000 steps  (18%)
[03:56:40] - Autosending finished units... [August 15 03:56:40 UTC]
[03:56:40] Trying to send all finished work units
[03:56:40] + No unsent completed units remaining.
[03:56:40] - Autosend completed
[03:58:27] Completed 332500 out of 1750000 steps  (19%)
[04:45:44] Completed 350000 out of 1750000 steps  (20%)
[05:33:10] Completed 367500 out of 1750000 steps  (21%)
[06:20:31] Completed 385000 out of 1750000 steps  (22%)
[07:07:48] Completed 402500 out of 1750000 steps  (23%)
[07:55:09] Completed 420000 out of 1750000 steps  (24%)
[08:42:19] Completed 437500 out of 1750000 steps  (25%)
[08:44:25] ***** Got a SIGTERM signal (15)
[08:44:25] Killing all core threads

Folding@Home Client Shutdown.
(That SIGTERM is me rebooting the system.)
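
The "about 80 hours" estimate can be cross-checked from the log itself. This is a rough sketch, not anything the client does: `projected_hours` is a made-up helper, the year 2008 is an assumption taken from the core build date (progress lines carry no date), and the two timestamps are the 1% and 25% lines above, which are 24 frames apart.

```python
from datetime import datetime

def projected_hours(first, last, frames_between, total_frames=100):
    """Extrapolate the full run time from two dated progress timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S"
    elapsed = datetime.strptime(last, fmt) - datetime.strptime(first, fmt)
    per_frame = elapsed.total_seconds() / frames_between
    return per_frame * total_frames / 3600

# 1% line (August 14, 13:39:38 UTC) and 25% line (August 15, 08:42:19 UTC).
hours = projected_hours(
    "2008-08-14 13:39:38", "2008-08-15 08:42:19", frames_between=24
)
baseline_hours = 12  # "about 12 hours" for the earlier 2662 units
print(f"projected: {hours:.1f} h, "
      f"roughly {hours / baseline_hours:.1f}x the earlier units")
```

That works out to a bit over 79 hours, in line with the estimate in the post and several times slower than the earlier units.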
parkut
Posts: 365
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Post by parkut »

After it failed yet again, I ended up deleting the entire work folder and all associated files.
At roughly one hour per 1% frame, this particular WU cannot finish before its deadline.

[14:53:40] Deleting current work unit & continuing...
[14:53:40] Client-core communications error: ERROR 0xff
[14:53:40] CoreStatus = FF (255)
[14:53:36] Going to interrupt core and move on to next unit...
[14:53:36] Unit 1's deadline (August 18 14:01) has passed.
[14:53:36] Completed 912500 out of 1250000 steps (73%)
[13:55:00] Completed 900000 out of 1250000 steps (72%)
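
A quick back-of-the-envelope check of why the unit keeps dying in the low 70s. This is a sketch under stated assumptions: the 72-hour window is inferred from the logs in this thread (unit started August 15 14:01, deadline August 18 14:01), and the frame time is the gap between the 72% and 73% lines above.

```python
import math

deadline_hours = 72       # assumption: 3-day window inferred from the logs
seconds_per_frame = 3516  # 13:55:00 -> 14:53:36, the 72% and 73% lines
total_frames = 100

# How many 1% frames fit before the deadline, and how long a full run needs.
frames_in_window = math.floor(deadline_hours * 3600 / seconds_per_frame)
hours_needed = seconds_per_frame * total_frames / 3600

print(f"{frames_in_window} of {total_frames} frames fit in {deadline_hours} h; "
      f"a full run would need ~{hours_needed:.0f} h")
```

At this pace only about 73 frames fit in the window, which matches the point where the client reports the deadline as passed.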
bollix47
Posts: 2976
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Post by bollix47 »

Have you tried deleting the FahCore_a2.exe file to ensure that you have the latest one?

For reference, on an E6600 @ 2.4 GHz the frame times on P2662 here are between 12 and 13 minutes.
parkut
Posts: 365
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Post by parkut »

It was the latest version, Version 2.00 (Wed Jul 9 13:11:25 PDT 2008). I've been assigned a different 2662 WU, and it's running at "normal" speed:

Code:

Project: p2662 (Run 2 Clone 175 Gen 12)
...
[15:47:23] Completed 10000 out of 250000 steps  (4%)
[15:36:06] Completed 7500 out of 250000 steps  (3%)
[15:24:48] Completed 5000 out of 250000 steps  (2%)
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Post by kasson »

As noted in the other posts, we have a few bad work units for 2662. P2662R1C173 got past the bad segment and is now running fine, on Gen 18. I just stopped assignments of the bad copy of P2662R2C213G7. Thanks.