repeated failure with project 100nn [Too many cores]

Moderators: Site Moderators, FAHC Science Team

Sailer
Posts: 40
Joined: Thu Jan 13, 2011 2:55 am

repeated failure with project 100nn [Too many cores]

Post by Sailer »

The two topics reporting problems have been merged after it was determined that both had the same cause: WU's intended for a small number of cores were mis-assigned to many-core systems.


My SR-2 has been getting Project 10090 (Run 98, Clone 23, Gen 0) and then failing to run it. It then cycles through several attempts to upload and start the project, failing each time. Finally it loads a different project, which then works fine. The typical log entry is as follows:

21:36:40:WU01:FS01:0xa4:Project: 10090 (Run 98, Clone 23, Gen 0)
21:36:40:WU01:FS01:0xa4:
21:36:40:WU01:FS01:0xa4:Entering M.D.
21:36:46:WU01:FS01:0xa4:Mapping NT from 22 to 20
21:36:58:WARNING:WU01:FS01:FahCore returned an unknown error code which probably indicates that it crashed
21:36:58:WARNING:WU01:FS01:FahCore returned: UNKNOWN_ENUM (-1073741783 = 0xc0000029)
21:37:30:WU01:FS01:Starting
21:37:30:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Fred/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 01 -suffix 01 -version 704 -lifeline 2548 -checkpoint 15 -np 22
21:37:30:WU01:FS01:Started FahCore on PID 1068
21:37:30:WU01:FS01:Core PID:4124
21:37:30:WU01:FS01:FahCore 0xa4 started

Then it repeats the failure:

21:37:40:WU01:FS01:0xa4:- Files status OK
21:37:40:WU01:FS01:0xa4:- Expanded 45496 -> 206116 (decompressed 453.0 percent)
21:37:40:WU01:FS01:0xa4:Called DecompressByteArray: compressed_data_size=45496 data_size=206116, decompressed_data_size=206116 diff=0
21:37:40:WU01:FS01:0xa4:- Digital signature verified
21:37:40:WU01:FS01:0xa4:
21:37:40:WU01:FS01:0xa4:Project: 10090 (Run 98, Clone 23, Gen 0)
21:37:40:WU01:FS01:0xa4:
21:37:40:WU01:FS01:0xa4:Entering M.D.
21:37:46:WU01:FS01:0xa4:Mapping NT from 22 to 20
21:37:46:WU01:FS01:0xa4:mdrun returned 255
21:37:46:WU01:FS01:0xa4:Going to send back what have done -- stepsTotalG=10000000
21:37:46:WU01:FS01:0xa4:Work fraction=0.0000 steps=10000000.
21:37:50:WU01:FS01:0xa4:logfile size=0 infoLength=0 edr=0 trr=25
21:37:50:WU01:FS01:0xa4:logfile size: 0 info=0 bed=0 hdr=25
21:37:50:WU01:FS01:0xa4:- Writing 642 bytes of core data to disk...
21:37:50:WU01:FS01:0xa4:Done: 130 -> 143 (compressed to 110.0 percent)
21:37:50:WU01:FS01:0xa4: ... Done.
21:37:50:WU01:FS01:0xa4:
21:37:50:WU01:FS01:0xa4:Folding@home Core Shutdown: EARLY_UNIT_END
21:37:51:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
21:37:51:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:10090 run:98 clone:23 gen:0 core:0xa4 unit:0x000000000001329c546e75549e6f2853

Not sure what is wrong, but this project won't run on my computer. The computer is an SR-2 with E5649 CPUs. I also have a GTX780 Ti in it running a GPU client.
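
For anyone puzzled by the two forms of the crash code in the log above: the client prints the same 32-bit value twice, once as a signed decimal and once as unsigned hex. A minimal Python sketch of that conversion follows; the helper name `to_unsigned_hex` is made up for illustration and is not part of the client.

```python
def to_unsigned_hex(code: int, bits: int = 32) -> str:
    """Reinterpret a signed integer as an unsigned value and format it as hex."""
    mask = (1 << bits) - 1          # 0xFFFFFFFF for 32 bits
    return f"0x{code & mask:08x}"


# The crash code from the first log, then the BAD_WORK_UNIT code from the second.
print(to_unsigned_hex(-1073741783))  # 0xc0000029
print(to_unsigned_hex(114))          # 0x00000072
```

Values in the 0xC0000000 range are how Windows reports fatal NTSTATUS errors as process exit codes, which fits the client's guess that the core crashed instead of exiting with one of its own codes such as BAD_WORK_UNIT (114).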
Joe_H
Site Admin
Posts: 7875
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: repeated failure with project 10090 (Run 98, Clone 23, G

Post by Joe_H »

This is just one WU from Project 10090, and it has been successfully completed by another folder. Are you also getting other WU's from this project that fail to run? If so, please give us a list of those that do not work on your setup. It is possible that 20 threads is too many for WU's from this project in general, or that it has a problem with decomposition that involves 5 as a factor. If there is more than one example, we will bring it to the attention of the researcher running this project so the assignment settings can be modified and this project is no longer assigned to systems similar to yours.
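
To make the "5 as a factor" remark concrete, here is a small Python sketch that factors the two thread counts from the log (22 requested, 20 after the "Mapping NT from 22 to 20" line). It only illustrates the arithmetic; it is not how the core chooses its decomposition, and `prime_factors` is a hypothetical helper written for this post.

```python
def prime_factors(n: int) -> list[int]:
    """Return the prime factors of n in ascending order, with repeats."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:                       # whatever remains is itself prime
        factors.append(n)
    return factors


# Thread counts taken from the log quoted earlier in this topic.
for threads in (22, 20):
    print(threads, prime_factors(threads))
# 22 [2, 11]
# 20 [2, 2, 5]
```

So the reduced count of 20 threads is the one that carries a factor of 5, while the original 22 carries a factor of 11.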

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
007quick
Posts: 9
Joined: Fri Dec 05, 2014 12:37 am

Project 10085 Failed (48 core system)

Post by 007quick »

Hey all,
So I've been getting assigned these WUs regularly and they continue to fail on my machine. The error returned in this log is just one example; it seems to return different error messages all the time. I can successfully fold other projects, just this one continues to fail. Also, it is not just this WU but others in the same project: looking through the log, I can see that after a couple of failed attempts it would try a new WU, also from 10085 but with a different run, clone and gen, which would then fail again. If you need more info, please let me know and I'll try to get around to posting it.

Code:

22:39:25:WU00:FS00:0xa4:Project: 10085 (Run 5, Clone 214, Gen 3)
22:39:25:WU00:FS00:0xa4:
22:39:25:WU00:FS00:0xa4:Assembly optimizations on if available.
22:39:25:WU00:FS00:0xa4:Entering M.D.
22:39:31:WU00:FS00:0xa4:mdrun returned 255
22:39:31:WU00:FS00:0xa4:Going to send back what have done -- stepsTotalG=10000000
22:39:31:WU00:FS00:0xa4:Work fraction=12884901888.0000 steps=10000000.
22:39:35:WU00:FS00:0xa4:logfile size=7942 infoLength=7942 edr=25 trr=1
22:39:35:WU00:FS00:0xa4:logfile size: 7942 info=7942 bed=25 hdr=1
22:39:35:WU00:FS00:0xa4:- Writing 8480 bytes of core data to disk...
22:39:35:WU00:FS00:0xa4:Done: 7968 -> 2787 (compressed to 34.9 percent)
22:39:35:WU00:FS00:0xa4:  ... Done.
22:39:36:WU00:FS00:0xa4:
22:39:36:WU00:FS00:0xa4:Folding@home Core Shutdown: UNSTABLE_MACHINE
22:39:36:WARNING:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
007quick
Posts: 9
Joined: Fri Dec 05, 2014 12:37 am

Re: Project 10085 Failed (48 core system)

Post by 007quick »

I have to make a correction... I am also getting the errors on P10083, with various runs, clones and gens...
Sailer
Posts: 40
Joined: Thu Jan 13, 2011 2:55 am

Re: repeated failure with project 10090 (Run 98, Clone 23, G

Post by Sailer »

My SR-2 has had the same problem with projects 10090 (run 118, clone 9, gen 1), 10090 (run 161, clone 11, gen 0), 10090 (run 128, clone 13, gen 0) and a 10070, though I don't remember which run, clone and gen of the 10070. I'll watch it and copy down which ones fail during the next few days. I have other computers that are 6/12 CPU types and they have not run into any problems.
Joe_H
Site Admin
Posts: 7875
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: repeated failure with project 10090 (Run 98, Clone 23, G

Post by Joe_H »

The client by default keeps the 16 most recent logs in a folder in the F@H data directory along with the current log. If you could search those and post the PRCG's of the failing work units, that would help.

As for the ones you just mentioned, all have been successfully completed. The second one did have a couple failures as well.

P.S. A message has been sent to the researcher in charge of this project.
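
For anyone who would rather not search the logs by hand, the search described above can be sketched in Python. The patterns are based only on the log lines quoted in this topic. Two things are assumptions: that the saved logs are `.txt` files, and that `FINISHED_UNIT` is the result a successful unit reports. The function names are made up for this sketch.

```python
import re
from pathlib import Path

# e.g. "21:37:40:WU01:FS01:0xa4:Project: 10090 (Run 98, Clone 23, Gen 0)"
PRCG_RE = re.compile(
    r":(WU\d+):(FS\d+):0x\w+:Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)"
)
# e.g. "21:37:51:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)"
RETURN_RE = re.compile(r":(WU\d+):(FS\d+):FahCore returned: (\w+)")

SUCCESS = "FINISHED_UNIT"  # assumption: result name logged for a completed unit


def failed_prcgs(lines):
    """Return a set of (project, run, clone, gen, result) for failed attempts."""
    current = {}      # (WU, slot) -> PRCG most recently announced by the core
    failures = set()
    for line in lines:
        m = PRCG_RE.search(line)
        if m:
            current[m.group(1, 2)] = tuple(int(x) for x in m.group(3, 4, 5, 6))
            continue
        m = RETURN_RE.search(line)
        if m and m.group(3) != SUCCESS and m.group(1, 2) in current:
            failures.add(current[m.group(1, 2)] + (m.group(3),))
    return failures


def scan_log_dir(log_dir):
    """Scan every .txt file under log_dir (recursively) for failed work units."""
    failures = set()
    for path in sorted(Path(log_dir).rglob("*.txt")):
        with path.open(errors="replace") as fh:
            failures |= failed_prcgs(fh)
    return failures


# Three lines copied from the log in the first post.
sample = [
    "21:37:40:WU01:FS01:0xa4:Project: 10090 (Run 98, Clone 23, Gen 0)",
    "21:37:46:WU01:FS01:0xa4:mdrun returned 255",
    "21:37:51:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
]
print(failed_prcgs(sample))  # {(10090, 98, 23, 0, 'BAD_WORK_UNIT')}
```

Point `scan_log_dir` at the F@H data directory to cover the current log and the saved ones. Note that an `INTERRUPTED` result, which the client also logs in other situations, is counted as a failure here, so the output is a list of candidates to check and not a final verdict.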
Joe_H
Site Admin
Posts: 7875
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Project 10085 Failed (48 core system)

Post by Joe_H »

Please also post the beginning section of your log file that shows the version information, system info and the folding configuration. It would also help to see more of the log, including the point where the core starts up with the WU that failed.

For each of the other WU's that have failed on your system, could you post the Project, Run, Clone and Generation numbers? Those identify unique WU's that can be checked for problems, or to see if there is a pattern in which ones fail.

P.S. Since the failure appears similar to the one reported here - viewtopic.php?f=19&t=27097, a message has been sent to the researcher responsible for the server these projects come from.
AtwaterFS
Posts: 30
Joined: Wed Jan 21, 2009 9:08 pm

Re: Project 10085 Failed (48 core system)

Post by AtwaterFS »

10084 (Run 4, Clone 429, Gen 0) and also 10083 of various types - everything else runs fine.....
24 core system - the other lower core count systems I have are all humming along.

In all cases FahCore crashes and logs: FahCore returned: UNKNOWN_ENUM (-1073741783 = 0xc0000029)
007quick
Posts: 9
Joined: Fri Dec 05, 2014 12:37 am

Re: Project 10085 Failed (48 core system)

Post by 007quick »

I will try to get around to fishing out the logs tomorrow afternoon. I see that it failed a bunch more WUs last night as well, and I will confirm exactly which WUs are failing tomorrow.
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Project 10085 Failed (48 core system)

Post by 7im »

As a test, do these projects complete if you change to 24 cores?
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
007quick
Posts: 9
Joined: Fri Dec 05, 2014 12:37 am

Re: Project 10085 Failed (48 core system)

Post by 007quick »

I hope this is enough of the log... I think that is what you need anyways. I will now try and make a list of WU that have failed...

Code:

16:03:39:WU00:FS00:Starting
16:03:39:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 00 -suffix 01 -version 703 -lifeline 1597 -checkpoint 24 -np 48
16:03:39:WU00:FS00:Started FahCore on PID 21985
16:03:39:WU00:FS00:Core PID:21989
16:03:39:WU00:FS00:FahCore 0xa4 started
16:03:39:WU00:FS00:0xa4:
16:03:39:WU00:FS00:0xa4:*------------------------------*
16:03:39:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
16:03:39:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
16:03:39:WU00:FS00:0xa4:
16:03:39:WU00:FS00:0xa4:Preparing to commence simulation
16:03:39:WU00:FS00:0xa4:- Ensuring status. Please wait.
16:03:48:WU00:FS00:0xa4:- Looking at optimizations...
16:03:48:WU00:FS00:0xa4:- Working with standard loops on this execution.
16:03:48:WU00:FS00:0xa4:Examination of work files indicates 8 consecutive improper terminations of core.
16:03:48:WU00:FS00:0xa4:- Expanded 53806 -> 201448 (decompressed 374.3 percent)
16:03:48:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=53806 data_size=201448, decompressed_data_size=201448 diff=0
16:03:48:WU00:FS00:0xa4:- Digital signature verified
16:03:48:WU00:FS00:0xa4:
16:03:48:WU00:FS00:0xa4:Project: 10085 (Run 2, Clone 656, Gen 4)
16:03:48:WU00:FS00:0xa4:
16:03:48:WU00:FS00:0xa4:Entering M.D.
16:03:54:WU00:FS00:0xa4:mdrun returned 255
16:03:54:WU00:FS00:0xa4:Going to send back what have done -- stepsTotalG=10000000
16:03:54:WU00:FS00:0xa4:Work fraction=17179869184.0000 steps=10000000.
16:03:55:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
Sailer
Posts: 40
Joined: Thu Jan 13, 2011 2:55 am

Re: repeated failure with project 10090 (Run 98, Clone 23, G

Post by Sailer »

Yesterday (Friday) I received several more projects that failed. They are: 10090 (212, 24, 0), 10090 (211, 26, 0), 10090 (6, 34, 0), 10090 (47, 7, 0), 10090 (151, 35, 0), 10090 (18, 7, 1), 10090 (160, 35, 0), 10090 (165, 35, 0), 10090 (210, 34, 0), and 10084 (5, 151, 3).

Earlier, after one of the problem WUs failed, I would get switched to an older project which did run. Now I have started receiving nothing but WUs that won't run. I finally paused the client and gave up. I don't have the time to just sit at the computer, close the program and wait for a new WU to be assigned, only to get another one which won't run.
007quick
Posts: 9
Joined: Fri Dec 05, 2014 12:37 am

Re: Project 10085 Failed (48 core system)

Post by 007quick »

P10083,4,118,1
P10083,0,172,1
P10085,4,92,4
P10084,5,67,2
P10083,4,177,0
P10085,3,59,7
P10084,2,97,1
P10083,2,103,6
P10083,4,208,0
P10085,0,206,2
P10083,5,208,0
P10085,5,214,3
P10085,4,31,5
P10084,1,352,0
P10084,2,352,0
P10083,2,395,1
P10084,2,443,0
P10083,6,387,1
P10084,6,443,0
P10083,4,350,2
P10085,4,533,0
P10083,2,403,1
P10083,1,433,1
P10083,3,309,1
P10083,0,435,1
P10085,5,223,6
P10083,3,466,0
P10085,4,274,2
P10085,4,6,9
P10085,4,582,1
P10084,0,484,1
P10083,5,480,2
P10085,3,464,1
P10083,1,509,3
P10084,5,977,1
P10083,5,933,1
P10084,2,956,1
P10084,0,885,2
P10085,2,656,4
P10085,3,85,8
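
A list in this `P<project>,<run>,<clone>,<gen>` form is easy to tally by project when looking for a pattern. A minimal Python sketch, using only the first four entries above as sample input (`tally_by_project` is a hypothetical helper name):

```python
from collections import Counter


def tally_by_project(entries):
    """Count failed WUs per project from 'P<project>,<run>,<clone>,<gen>' strings."""
    counts = Counter()
    for entry in entries:
        project, run, clone, gen = entry.strip().lstrip("P").split(",")
        counts[int(project)] += 1
    return counts


# First four entries from the list above.
sample = ["P10083,4,118,1", "P10083,0,172,1", "P10085,4,92,4", "P10084,5,67,2"]
print(sorted(tally_by_project(sample).items()))
# [(10083, 2), (10084, 1), (10085, 1)]
```

Pasting the full list into `sample` gives the per-project totals for everything reported so far.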
007quick
Posts: 9
Joined: Fri Dec 05, 2014 12:37 am

Re: Project 10085 Failed (48 core system)

Post by 007quick »

That is the list so far. When my current WU finishes I will split the slot into 2 and see whether 24 cores allows it to fold. I will also add a 4-core slot, an 8-core slot and a 12-core slot, and I will watch them to the best of my ability.
Sailer
Posts: 40
Joined: Thu Jan 13, 2011 2:55 am

Re: Project 10085 Failed (48 core system)

Post by Sailer »

Similar to AtwaterFS, I have received a project 10084 (5, 151, 3) that refused to run on my SR-2. I only have 22 cores engaged, as I leave 2 cores for the GPU client. I've also received a project 10070 that wouldn't run and several project 10090 WUs that will not run, with the log entry: FahCore returned: UNKNOWN_ENUM (-1073741783 = 0xc0000029).