Hello all,
I may have found the problem and a workaround. I will need more time to be sure, but my last 4 runs were fine.
In my experience, when a program dies with an intermittent unknown error near the exit of the process, it is usually a threading issue, or things are not being destructed properly.
Based on that theory, I did the following:
1) I bumped up the priority of the actual 22 core. Setting this value in the advanced control panel would only change the priority of the wrapper, so I set it on the core process directly. As the priority is still below normal, I doubt this is doing much, but I wanted to try it.
2) I also restricted the affinity of the actual 22 core to one CPU. I saw there were 21 threads associated with that process. Locking them to one CPU means they can never run at the same instant, which should keep these threads from stomping on each other.
This has had the following effects:
1) I have not seen the unknown error (although I really need some more run time).
2) The utilization of the GPU is a little lower (sometimes falling to 0%), but mostly good. I can accept lower utilization if I can be sure of a result that gets back to the server. Not having the GPU fully loaded might also be a reason my system has not generated the error so far.
I am not recommending that any user actually make these changes. I need more time, and this problem seems to be isolated to me for some reason. I just wanted to mention it here to see if this information makes the people who maintain the cores think of something.
If this continues to work, I will write myself a little program that wakes up every half hour or so, iterates through the running programs, and changes the priority and affinity of FAH core programs.
Of course, with my luck (since I am writing this message and making it public), everything will fail now, and I will be back to square one.
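For anyone curious what such a program might look like, here is a rough Python sketch. It is not the program I actually run. It assumes the third-party psutil package is installed, and it assumes the core executables are named something like "FahCore_22.exe" (check Task Manager for the real name on your machine). The function names are made up for illustration.

```python
import re
import sys

# Assumption: core executables are named like "FahCore_22.exe".
# Adjust this pattern to whatever name shows up on your system.
CORE_NAME = re.compile(r"^FahCore_\w+(\.exe)?$", re.IGNORECASE)


def is_fah_core(name):
    """Return True if a process name looks like a FAH core (not the wrapper)."""
    return bool(name) and CORE_NAME.match(name) is not None


def pin_fah_cores(cpu=0):
    """Pin every matching core process to one CPU; return the PIDs changed."""
    import psutil  # third-party package, imported here so the helper above works without it

    changed = []
    for proc in psutil.process_iter(["pid", "name"]):
        if not is_fah_core(proc.info["name"]):
            continue
        try:
            # Restrict all threads of the process to a single CPU.
            proc.cpu_affinity([cpu])
            # On Windows, raise the priority class to "below normal".
            if sys.platform == "win32":
                proc.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)
            changed.append(proc.info["pid"])
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            # The process exited, or we lack the rights to modify it.
            continue
    return changed
```

Calling pin_fah_cores() in a loop with a half-hour sleep between calls would give the "wake up and fix the settings" behavior. The settings have to be reapplied because each new core process starts with the defaults again.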
I will keep the forum informed.
==============================
Update. I wrote a program that checks the log file every three minutes. Once the job gets to 95 percent, the program changes the priority and affinity of the core process. This lets me get most of the job done using all CPUs, as opposed to changing the settings sometime in the first half hour.
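The log check can be sketched roughly like this in Python. This is an illustration, not my actual program. It assumes progress lines in the log look like "Completed 950000 out of 1000000 steps (95%)", so adjust the pattern if your log differs. The function names and the threshold default are hypothetical.

```python
import re

# Assumption: progress lines look like
#   "... Completed 950000 out of 1000000 steps (95%)"
PROGRESS = re.compile(r"Completed\s+\d+\s+out of\s+\d+\s+steps\s+\((\d+)%\)")


def latest_percent(log_text):
    """Return the most recent progress percentage in the log, or None."""
    matches = PROGRESS.findall(log_text)
    if not matches:
        return None
    # The last match is the newest line, so a fresh job at 0% wins
    # over an older job that finished at 100%.
    return int(matches[-1])


def should_pin(log_text, threshold=95):
    """True once the newest progress line is at or past the threshold."""
    percent = latest_percent(log_text)
    return percent is not None and percent >= threshold
```

A polling loop would read the log file every three minutes, call should_pin() on its contents, and only then change the priority and affinity of the core process.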