Occasional bursts of insane TPF that don't self-correct

Posted: Mon Dec 20, 2021 9:14 am
by FalconFour
I run a lot of F@H PCs around my house in the winter - if I'm going to heat it, it may as well do work first. Surprised this isn't more of a thing. ;)

Anyway, fairly often, I'll check in on one, and it'll have a slot constantly showing 99.99%, unknown ETA, and a completely insane TPF reading. From what I can tell, the client ought to measure the time between 1% increases in progress, call that "TPF", and project it into the future linearly. It seems to recalculate at every step, so it's always an "X->Y" calculation, not a T->U->V->W->X->Y average over multiple steps... just a really basic, 2+2=4 calculation. Except when it's not...

Image

y'all. :? This is a Core 2 Quad from like 2008. :lol:

It comes and goes. But when it "comes", it can just persist for hours... check back in on it again later, and it's just hangin' out at 8 million PPD on a Core 2 Quad again.

Obviously, the log isn't showing anything weird (I study it a lot). It's like this:

Code:

08:40:59:WU02:FS00:0xa8:Completed 415318 out of 500000 steps (83%)
08:48:51:WU00:FS01:0x22:Completed 25000 out of 2500000 steps (1%)
08:54:10:WU02:FS00:0xa8:Completed 420000 out of 500000 steps (84%)
08:59:07:WU00:FS01:0x22:Completed 50000 out of 2500000 steps (2%)
09:08:11:WU02:FS00:0xa8:Completed 425000 out of 500000 steps (85%)
09:09:21:WU00:FS01:0x22:Completed 75000 out of 2500000 steps (3%)
Can't find anyone else really making note of this, but most people seem to not be using the advanced controls, so maybe it's just going unnoticed?
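
For anyone who wants to sanity-check the displayed TPF against their own log, here is a rough Python sketch that works out the time per 1% frame straight from "Completed ... steps" lines like the ones above. It is not how FAHClient does it internally (that is unknown here); the function name `frame_times` and the regular expression are made up for illustration, and the line format is assumed to match the excerpt exactly. It scales by steps rather than by the rounded percentage, because the first FS00 line (415318 steps) does not sit on a frame boundary.

```python
import re
from datetime import datetime

# Log excerpt copied from the post above.
LOG = """\
08:40:59:WU02:FS00:0xa8:Completed 415318 out of 500000 steps (83%)
08:48:51:WU00:FS01:0x22:Completed 25000 out of 2500000 steps (1%)
08:54:10:WU02:FS00:0xa8:Completed 420000 out of 500000 steps (84%)
08:59:07:WU00:FS01:0x22:Completed 50000 out of 2500000 steps (2%)
09:08:11:WU02:FS00:0xa8:Completed 425000 out of 500000 steps (85%)
09:09:21:WU00:FS01:0x22:Completed 75000 out of 2500000 steps (3%)
"""

# Assumed line layout: time, work unit, folding slot, core id, progress message.
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):WU(\d+):FS(\d+):0x[0-9a-fA-F]+:"
    r"Completed (\d+) out of (\d+) steps \((\d+)%\)$"
)


def frame_times(log_text):
    """Return {slot: [(percent, seconds_per_1_percent), ...]} from progress lines."""
    last = {}  # slot -> (timestamp, steps done) of the previous progress line
    out = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a progress line
        stamp = datetime.strptime(m.group(1), "%H:%M:%S")
        slot = "FS" + m.group(3)
        done, total, percent = int(m.group(4)), int(m.group(5)), int(m.group(6))
        if slot in last:
            prev_stamp, prev_done = last[slot]
            elapsed = (stamp - prev_stamp).total_seconds()
            if elapsed < 0:
                elapsed += 86400  # the log carries no date; assume one midnight rollover
            if done > prev_done:  # skip resets, e.g. a new WU starting on the slot
                # Scale the elapsed time to a full 1% of the WU's total steps.
                tpf = elapsed * (total / 100) / (done - prev_done)
                out.setdefault(slot, []).append((percent, tpf))
        last[slot] = (stamp, done)
    return out


for slot, frames in frame_times(LOG).items():
    for percent, tpf in frames:
        print(f"{slot} reached {percent}%: about {tpf:.0f} s per frame")
```

If the numbers this prints are steady while the viewer shows a TPF of a fraction of a second, that would support the idea that the glitch is in the reporting and not in the work itself.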

(btw, don't mind the 4-core with a GPU slot. I have a script looping externally that pushes the FahCore_22.exe process to "Normal" priority so it keeps well-fed, and the CPU WU neatly gobbles up the excess 95% of the otherwise-wasted GPU core!)

Re: Occasional bursts of insane TPF that don't self-correct

Posted: Mon Dec 20, 2021 10:16 am
by Neil-B
I use advanced control 24/7/365 and tbh have never seen such behaviour .. I wonder if your script and maxing the cpu as you do is maybe the cause of the issues with advanced control?

Re: Occasional bursts of insane TPF that don't self-correct

Posted: Mon Dec 20, 2021 11:27 am
by toTOW
I never trusted FAHControl over a long period of time ... and in the past, it used to leak memory and crash after a while ...

I use HFM to monitor my clients: viewtopic.php?f=14&t=9903

Re: Occasional bursts of insane TPF that don't self-correct

Posted: Tue Dec 21, 2021 6:37 pm
by FalconFour
Neil-B wrote: I use advanced control 24/7/365 and tbh have never seen such behaviour .. I wonder if your script and maxing the cpu as you do is maybe the cause of the issues with advanced control?
Irrelevant; I shouldn't've mentioned it (people always jump to blame the one odd thing). Happened long before I did that, and it happens on computers that don't even have a GPU.

HFM might be worth a look, if just as a change of scenery. Screenshot from XP though isn't very ... endearing :lol: (heck, this 2003 phpBB board isn't far off either, so it all matches, haha) and I don't think FAHControl is the one doing the TPF calculation. So it'd probably affect any viewer.

edit: oooh, but I like HFM. Looks a lot nicer in person. They ought to update those screenshots if that's the primary "front door" to learn about the app! ;)

Re: Occasional bursts of insane TPF that don't self-correct

Posted: Tue Dec 21, 2021 6:56 pm
by Neil-B
FalconFour wrote:Irrelevant; I shouldn't've mentioned it (people always jump to blame the one odd thing). Happened long before I did that, and it happens on computers that don't even have a GPU.
OK, in that case I can't help, I'm afraid, as my kit never has this issue ... Wasn't trying to blame anything tbh - when troubleshooting one does tend to look for what might be different from stable configurations ... Hope someone is able to help you track down what is causing it.

Re: Occasional bursts of insane TPF that don't self-correct

Posted: Wed Dec 22, 2021 6:17 am
by BobWilliams757
I had a work unit hang up at 99.9% (or close, I forget the exact figure) some time within the last few weeks, but didn't notice it do anything to change the PPD. It took at least 10-15 minutes for it to complete, but everything still finished and transferred just fine.

The delay showed in the logs, but that was the only indication that anything was weird. I don't think I've seen anything similar either before or since.

And agreed on HFM. It's a great addition, and so far has been completely painless to use.

Re: Occasional bursts of insane TPF that don't self-correct

Posted: Wed Dec 22, 2021 10:29 am
by toTOW
When FAHControl shows 99.9% completion and doesn't move, it's a sign that the WU is taking more time than expected to progress ... it could be because of an issue (core crash) or because something is using the system more than usual.

Gromacs doesn't like it when something disturbs its threads ... progress will run at the pace of the slowest one.

Re: Occasional bursts of insane TPF that don't self-correct

Posted: Thu Dec 23, 2021 10:03 am
by FalconFour
toTOW wrote: When FAHControl shows 99.9% completion and doesn't move, it's a sign that the WU is taking more time than expected to progress ... it could be because of an issue (core crash) or because something is using the system more than usual.

Gromacs doesn't like it when something disturbs its threads ... progress will run at the pace of the slowest one.
See, that's what's weird here. There's no stall, no slowdown - the WUs work at a predictable and expected rate -- and the one here was really 84...85...86% complete. The problem here is just a visual one, though. Take a look at the TPF and PPD in that screenshot. Sometimes (as is my concern), FAHClient will start picking up an insane TPF not reflective of the actual progress in the log, and extrapolate it into madness. Don't know what causes it or what the pattern to it is, but I've seen it crop up numerous times when checking status on various machines. (not every day, just often enough over the months/years to remember)

When it happens, the 99.99% is just a result of the bogus TPF being used to interpolate multi-percent jumps in the displayed percentage, though the logs still show a much lower percent. Since the client logs/breaks up reporting into 1% chunks, I'm surprised FAHClient wouldn't constrain the percentage interpolation to 0.99% (e.g. it'd only interpolate 84.00% to 84.99%, then hang there 'til the WU itself reports 85%). Instead, it barrels through from 84.00% to 99.99% if it thinks TPF is only a fraction of a second, as here... then jumps back to 85.00% and barrels back up to 99.99% if the TPF glitch still persists.
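
To make the clamping idea concrete, here is a minimal Python sketch of what I mean. To be clear, this is a guess at the shape of the logic, not FAHClient's actual code: the function `displayed_percent`, its parameters, and the 99.99 ceiling are all assumptions based only on the behavior described above.

```python
def displayed_percent(last_logged_percent, seconds_since_log, tpf_seconds, clamp=True):
    """Interpolate the on-screen percentage between two logged 1% frames.

    last_logged_percent: last whole percent the core reported in the log
    seconds_since_log:   wall-clock time since that log line
    tpf_seconds:         current time-per-frame estimate (possibly bogus)
    clamp:               if True, never run more than 0.99% past the logged value
    """
    if tpf_seconds <= 0:
        # No usable estimate: show only what the log has confirmed.
        return float(last_logged_percent)
    gained = seconds_since_log / tpf_seconds  # frames "completed" per the estimate
    if clamp:
        gained = min(gained, 0.99)  # wait for the next logged frame
    return min(last_logged_percent + gained, 99.99)  # assumed display ceiling


# Sane TPF (840 s), halfway through the frame: both variants agree.
print(displayed_percent(84, 420, 840, clamp=False))
print(displayed_percent(84, 420, 840, clamp=True))

# Glitched TPF (0.5 s), five minutes after the 84% log line.
print(displayed_percent(84, 300, 0.5, clamp=False))  # runs away to the ceiling
print(displayed_percent(84, 300, 0.5, clamp=True))   # holds just under 85%
```

With the clamp, a bad TPF would still give a silly PPD, but at least the progress bar would stay within 1% of what the log says.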

Mostly what I'm hoping to achieve here is to put one datapoint in the board, as I wasn't able to find a report of this kind of thing happening. If anyone else sees the same thing, maybe we could get more samples and see what's up - maybe months later or more. It's not a functional problem, but just an odd cosmetic glitch. Best I can tell, nothing is stalling or failing - it's just being reported weirdly. Maybe some obscure typing or scanning issue.

Re: Occasional bursts of insane TPF that don't self-correct

Posted: Thu Dec 23, 2021 11:34 am
by toTOW
It would be interesting to compare the numbers between FAHControl and Webcontrol when it happens ... if both are wrong, the issue might be from the client, but if only one is wrong, it might be located in FAHControl only ...