BAD_WORK_UNIT (114 = 0x72) -- cpu

Posted: Sun Oct 21, 2018 1:58 pm
by westk
10:58:26:WU00:FS00:0xa7:ERROR:
10:58:26:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
10:58:26:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20161122-4846b12ba-unknown
10:58:26:WU00:FS00:0xa7:ERROR:Source code file: C:\build\fah\core-a7-avx-release\windows-10-64bit-core-a7-avx-release\gromacs-core\build\gromacs\src\gromacs\mdlib\domdec.c, line: 6902
10:58:26:WU00:FS00:0xa7:ERROR:
10:58:26:WU00:FS00:0xa7:ERROR:Fatal error:
10:58:26:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
10:58:26:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
10:58:26:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
10:58:26:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
10:58:26:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
10:58:26:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
10:58:31:WU00:FS00:0xa7:WARNING:Unexpected exit() call
10:58:31:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
10:58:31:WU00:FS00:0xa7:WARNING:While cleaning up: Failed to remove directory '01': boost::filesystem::remove: The process cannot access the file because it is being used by another process: "01\md.log"
10:58:31:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
10:58:38:WARNING:WU03:FS00:AS lowered CPUs from 11 to 10

Re: BAD_WORK_UNIT (114 = 0x72)

Posted: Sun Oct 21, 2018 3:37 pm
by Joe_H
Without any information about your system configuration or which WU was being processed, there is nothing we can do with this error report. At most it indicates that some project may have an issue with folding on a system with a multiple of 5 CPU threads available for use.

Please read the section of the Welcome topic on how to post a log - viewtopic.php?p=261082&f=24#p261082. As mentioned there, the important parts are the first couple of pages, which give the system configuration and folding settings, and the sections showing the beginning and end of processing a WU with the associated errors.

Re: BAD_WORK_UNIT (114 = 0x72)

Posted: Mon Oct 22, 2018 1:20 am
by bruce
I'm assuming your CPU supports 12 threads and you have one or two GPUs. (That could be confirmed if you posted your log per Joe_H's comment above.)

Open FAHControl and click Configure + Slots + (edit the CPU slot). At the very top, adjust the number of CPU slots to either 8 or 9 and see if the problem stops happening.

Re: BAD_WORK_UNIT (114 = 0x72)

Posted: Mon Oct 22, 2018 2:24 am
by westk
Sorry, here is the missing info

Code:

*********************** Log Started 2018-10-22T02:21:45Z ***********************
02:21:45:************************* Folding@home Client *************************
02:21:45:        Website: https://foldingathome.org/
02:21:45:      Copyright: (c) 2009-2018 foldingathome.org
02:21:45:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
02:21:45:           Args: 
02:21:45:         Config: C:\Users\WeStK\AppData\Roaming\FAHClient\config.xml
02:21:45:******************************** Build ********************************
02:21:45:        Version: 7.5.1
02:21:45:           Date: May 11 2018
02:21:45:           Time: 13:06:32
02:21:45:     Repository: Git
02:21:45:       Revision: 4705bf53c635f88b8fe85af7675557e15d491ff0
02:21:45:         Branch: master
02:21:45:       Compiler: Visual C++ 2008
02:21:45:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
02:21:45:       Platform: win32 10
02:21:45:           Bits: 32
02:21:45:           Mode: Release
02:21:45:******************************* System ********************************
02:21:45:            CPU: AMD Ryzen 5 2600X Six-Core Processor
02:21:45:         CPU ID: AuthenticAMD Family 23 Model 8 Stepping 2
02:21:45:           CPUs: 12
02:21:45:         Memory: 15.93GiB
02:21:45:    Free Memory: 11.59GiB
02:21:45:        Threads: WINDOWS_THREADS
02:21:45:     OS Version: 6.2
02:21:45:    Has Battery: false
02:21:45:     On Battery: false
02:21:45:     UTC Offset: -3
02:21:45:            PID: 1124
02:21:45:            CWD: C:\Users\WeStK\AppData\Roaming\FAHClient
02:21:45:             OS: Windows 10 Enterprise
02:21:45:        OS Arch: AMD64
02:21:45:           GPUs: 1
02:21:45:          GPU 0: Bus:9 Slot:0 Func:0 NVIDIA:5 GM204 [GeForce GTX 970]
02:21:45:  CUDA Device 0: Platform:0 Device:0 Bus:9 Slot:0 Compute:5.2 Driver:10.0
02:21:45:OpenCL Device 0: Platform:0 Device:0 Bus:9 Slot:0 Compute:1.2 Driver:416.34
02:21:45:  Win32 Service: false
02:21:45:***********************************************************************
02:21:45:<config>
02:21:45:  <!-- Network -->
02:21:45:  <proxy v=':8080'/>
02:21:45:
02:21:45:  <!-- Slot Control -->
02:21:45:  <power v='FULL'/>
02:21:45:
02:21:45:  <!-- User Information -->
02:21:45:  <passkey v='********************************'/>
02:21:45:  <team v='142520'/>
02:21:45:  <user v='WeStK'/>
02:21:45:
02:21:45:  <!-- Folding Slots -->
02:21:45:  <slot id='0' type='CPU'/>
02:21:45:  <slot id='1' type='GPU'>
02:21:45:    <client-type v='beta'/>
02:21:45:  </slot>
02:21:45:</config>
Mod Edit: Added Code Tags - PantherX

Re: BAD_WORK_UNIT (114 = 0x72) -- cpu

Posted: Mon Oct 22, 2018 2:39 am
by bruce
Thank you. That confirmed what I had guessed.

It would also be helpful if you included a little more information surrounding the first part that you posted ... especially including the Project/Run/Clone/Gen numbers which would have been included a few lines earlier or a few lines later.

Re: BAD_WORK_UNIT (114 = 0x72) -- cpu

Posted: Mon Oct 22, 2018 5:49 am
by Joe_H
Additionally, unless you are a member of the beta test team, it is strongly recommended that you not use the Beta flag to get WUs.

Re: BAD_WORK_UNIT (114 = 0x72) -- cpu

Posted: Mon Oct 22, 2018 10:37 am
by westk
bruce wrote:Thank you. That confirmed what I had guessed.

It would also be helpful if you included a little more information surrounding the first part that you posted ... especially including the Project/Run/Clone/Gen numbers which would have been included a few lines earlier or a few lines later.
For example, 08:18:25:WU00:FS00:0xa4:Project: 14108 (Run 15, Clone 342, Gen 10)

Re: BAD_WORK_UNIT (114 = 0x72) -- cpu

Posted: Mon Oct 22, 2018 4:43 pm
by Joe_H
westk wrote:
bruce wrote:Thank you. That confirmed what I had guessed.

It would also be helpful if you included a little more information surrounding the first part that you posted ... especially including the Project/Run/Clone/Gen numbers which would have been included a few lines earlier or a few lines later.
For example, 08:18:25:WU00:FS00:0xa4:Project: 14108 (Run 15, Clone 342, Gen 10)
Your "for example" has no relation to the reported errors in your first post. It is for a different project using the A4 folding core and hours before. The errors you posted were for a WU from a project using the A7 core.

Re: BAD_WORK_UNIT (114 = 0x72) -- cpu

Posted: Mon Oct 22, 2018 7:02 pm
by bruce
In your original post, you reported two errors.
First: 10:58:26:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm.
That error is reporting a problem with a particular WU. We cannot have that error corrected unless we know the actual Project number producing that error.

Second: WARNING:While cleaning up: Failed to remove directory '01': boost::filesystem::remove: The process cannot access the file because it is being used by another process: "01\md.log"
That indicates that some other process (perhaps you, editing a file that belongs to FAH rather than to an external program) prevented a valid error recovery. Since that's a Warning, not an Error, it probably was able to correct itself later once the other process ended. The applicable information for that WU is repeated in log.txt and shown on the FAHControl screen.

Re: BAD_WORK_UNIT (114 = 0x72) -- cpu

Posted: Wed Oct 31, 2018 10:43 am
by westk
06:11:27:WU01:FS00:0xa7:ERROR:
06:11:27:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
06:11:27:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20161122-4846b12ba-unknown
06:11:28:WU01:FS00:0xa7:ERROR:Source code file: C:\build\fah\core-a7-avx-release\windows-10-64bit-core-a7-avx-release\gromacs-core\build\gromacs\src\gromacs\mdlib\domdec.c, line: 6902
06:11:28:WU01:FS00:0xa7:ERROR:
06:11:28:WU01:FS00:0xa7:ERROR:Fatal error:
06:11:28:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
06:11:28:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
06:11:28:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
06:11:28:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
06:11:28:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
06:11:28:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
06:11:32:WU01:FS00:0xa7:WARNING:Unexpected exit() call
06:11:32:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
06:11:32:WU01:FS00:0xa7:Saving result file ..\logfile_01.txt
06:11:32:WU01:FS00:0xa7:Saving result file md.log
06:11:32:WU01:FS00:0xa7:Saving result file science.log
06:11:32:WU01:FS00:0xa7:WARNING:While cleaning up: Failed to remove directory '01': boost::filesystem::remove: The process cannot access the file because it is being used by another process: "01\md.log"
06:11:32:WU01:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
06:11:33:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
06:11:33:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:14200 run:885 clone:2 gen:12 core:0xa7 unit:0x0000001180fccb045b8dcc6ad8fb718f

Re: BAD_WORK_UNIT (114 = 0x72) -- cpu

Posted: Wed Oct 31, 2018 8:42 pm
by bruce
> There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm

Domain decompositions are sort of a black art based on the size of the prime factors. The factor 5 will have a higher failure rate (10=5*2*1) than, say, 9 (9=3*3*1) or even 8 (8=2*2*2). With a CPU that supports 12 threads, no more than 11 can be allocated, but 11 is a terrible choice (11=11*1*1), so FAH reduces your setting to 10. Personally, I think it should automatically reduce it to 9, but you can do that manually yourself and your system will see fewer BAD_WORK_UNITs.
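
To make the factor arithmetic concrete, here is a minimal Python sketch that lists, for a given thread count, its largest prime factor and every way to split it into a 3D grid of cells. It is only an illustration of why 8 and 9 offer more balanced grids than 10 or 11. It is not how GROMACS actually picks a decomposition, which also depends on the box size and the minimum cell size reported in the error. The function names grid_shapes and largest_prime_factor are made up for this example.

```python
def grid_shapes(ranks):
    """Return every (nx, ny, nz) with nx >= ny >= nz and nx * ny * nz == ranks."""
    shapes = []
    for nz in range(1, ranks + 1):
        if ranks % nz:
            continue
        rest = ranks // nz
        for ny in range(nz, rest + 1):
            if rest % ny:
                continue
            nx = rest // ny
            if nx >= ny:
                shapes.append((nx, ny, nz))
    return shapes


def largest_prime_factor(n):
    """Return the largest prime factor of n (1 if n has none)."""
    largest = 1
    p = 2
    while p * p <= n:
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    if n > 1:
        largest = n
    return largest


# Illustration only: compare the thread counts discussed in this thread.
for ranks in (8, 9, 10, 11):
    print(ranks, largest_prime_factor(ranks), grid_shapes(ranks))
```

For 10 threads the only grids are 10x1x1 and 5x2x1, so at least one dimension must be cut into 5 or more cells. For 8 or 9 threads there are grids (2x2x2, 3x3x1) where no dimension is cut into more than 3 cells, which is easier to fit alongside a minimum cell size of about 1.4 nm.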

The project:14200 run:885 clone:2 gen:12 was reassigned and completed by someone else ... probably with a different number of CPU threads.

Re: BAD_WORK_UNIT (114 = 0x72)

Posted: Thu Apr 02, 2020 3:51 am
by Meneldur
bruce wrote:...
Open FAHControl and click Configure + Slots + (edit the CPU slot). At the very top, adjust the number of CPU slots to either 8 or 9 and see if the problem stops happening.
Thank you! This solution worked for me today.