Occasional bursts of insane TPF that don't self-correct

FalconFour
Posts: 29
Joined: Fri Sep 05, 2008 11:57 am

Post by FalconFour »

I run a lot of F@H PCs around my house in the winter - if I'm going to heat it, it may as well do work first. Surprised this isn't more of a thing. ;)

Anyway, fairly often I'll check in on one and find a slot constantly showing 99.99%, unknown ETA, and a completely insane TPF reading. From what I can tell, the client ought to measure the time between 1% increases in progress, call that "TPF", and project it into the future linearly. Each step always seems to recalculate, so it looks like an "X->Y" calculation, not a T->U->V->W->X->Y average over multiple steps... just a really basic, 2+2=4 calculation. Except when it's not...
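For illustration, here is a minimal Python sketch of the two estimates contrasted above: a TPF taken from only the last pair of frames versus one averaged over several frames, plus the linear ETA projection. The function names and the sample timestamps are hypothetical, and this is not taken from the FAHClient source; it only shows the arithmetic being described.

```python
def tpf_last_interval(frame_times):
    """Seconds between the two most recent 1% frames (the "X->Y" estimate)."""
    if len(frame_times) < 2:
        raise ValueError("need at least two frame timestamps")
    return frame_times[-1] - frame_times[-2]


def tpf_averaged(frame_times):
    """Mean seconds per frame across all recorded frames (the multi-step estimate)."""
    if len(frame_times) < 2:
        raise ValueError("need at least two frame timestamps")
    return (frame_times[-1] - frame_times[0]) / (len(frame_times) - 1)


def eta_seconds(current_percent, tpf):
    """Linear projection: remaining 1% frames times seconds per frame."""
    return (100 - current_percent) * tpf


# Hypothetical timestamps (in seconds) at which a WU logged 1%, 2%, 3% and 4%.
frames = [0, 600, 1210, 1800]
print(tpf_last_interval(frames))                   # 590
print(tpf_averaged(frames))                        # 600.0
print(eta_seconds(4, tpf_last_interval(frames)))   # 56640
```

A single bad timestamp pair would throw off the last-interval estimate completely, while the averaged one would only drift, which is one reason a glitch like this could look so extreme.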

Image

y'all. :? This is a Core 2 Quad from like 2008. :lol:

It comes and goes. But when it "comes", it can just persist for hours... check back in on it again later, and it's just hangin' out at 8 million PPD on a Core 2 Quad again.

Obviously, the log isn't showing anything weird (I study it a lot). It's like this:

08:40:59:WU02:FS00:0xa8:Completed 415318 out of 500000 steps (83%)
08:48:51:WU00:FS01:0x22:Completed 25000 out of 2500000 steps (1%)
08:54:10:WU02:FS00:0xa8:Completed 420000 out of 500000 steps (84%)
08:59:07:WU00:FS01:0x22:Completed 50000 out of 2500000 steps (2%)
09:08:11:WU02:FS00:0xa8:Completed 425000 out of 500000 steps (85%)
09:09:21:WU00:FS01:0x22:Completed 75000 out of 2500000 steps (3%)

I can't find anyone else reporting this, but most people don't seem to use the advanced controls, so maybe it's just going unnoticed?

(btw, don't mind the 4-core with a GPU slot. I have a script looping externally that pushes the FahCore_22.exe process to "Normal" priority so it keeps well-fed, and the CPU WU neatly gobbles up the excess 95% of the otherwise-wasted GPU core!)
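As a cross-check on whatever the viewer displays, the real TPF can be recomputed straight from log lines like the ones quoted above. The sketch below is a hedged example: the regular expression only assumes the line layout visible in this post, `tpf_from_log` is a hypothetical helper name, and the result is normalised by step count so that a first line that does not sit exactly on a 1% boundary (such as 415318 of 500000) does not skew the figure.

```python
import re
from collections import defaultdict

# Matches lines shaped like the log excerpt in this post (assumed layout).
LINE_RE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}):(WU\d+):(FS\d+):0x[0-9a-fA-F]+:"
    r"Completed (\d+) out of (\d+) steps \((\d+)%\)$"
)


def tpf_from_log(lines):
    """Return {(wu, slot): [seconds per 1% for each pair of consecutive lines]}."""
    last = {}
    result = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore anything that is not a progress line
        hh, mm, ss, wu, slot, steps, total, _pct = m.groups()
        t = int(hh) * 3600 + int(mm) * 60 + int(ss)
        steps, total = int(steps), int(total)
        key = (wu, slot)
        if key in last:
            prev_t, prev_steps = last[key]
            elapsed = (t - prev_t) % 86400  # tolerate a wrap past midnight
            done = steps - prev_steps
            if done > 0:
                # Scale the elapsed time to the number of steps in one 1% frame.
                result[key].append(elapsed * (total / 100) / done)
        last[key] = (t, steps)
    return dict(result)


LOG = """\
08:40:59:WU02:FS00:0xa8:Completed 415318 out of 500000 steps (83%)
08:48:51:WU00:FS01:0x22:Completed 25000 out of 2500000 steps (1%)
08:54:10:WU02:FS00:0xa8:Completed 420000 out of 500000 steps (84%)
08:59:07:WU00:FS01:0x22:Completed 50000 out of 2500000 steps (2%)
09:08:11:WU02:FS00:0xa8:Completed 425000 out of 500000 steps (85%)
09:09:21:WU00:FS01:0x22:Completed 75000 out of 2500000 steps (3%)
""".splitlines()

for key, values in tpf_from_log(LOG).items():
    print(key, [round(v) for v in values])
```

On the excerpt above this gives roughly 845 s and 841 s per frame for WU02 and 616 s and 614 s for WU00, so the log itself shows steady progress of around 10 to 14 minutes per frame.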
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Occasional bursts of insane TPF that don't self-correct

Post by Neil-B »

I use advanced control 24/7/365 and tbh have never seen such behaviour .. I wonder if your script and maxing the CPU as you do is maybe the cause of the issues with advanced control?
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
toTOW
Site Moderator
Posts: 6296
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Occasional bursts of insane TPF that don't self-correct

Post by toTOW »

I never trusted FAHControl over a long period of time ... and in the past, it used to leak memory and crash after a while ...

I use HFM to monitor my clients : viewtopic.php?f=14&t=9903

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
FalconFour
Posts: 29
Joined: Fri Sep 05, 2008 11:57 am

Re: Occasional bursts of insane TPF that don't self-correct

Post by FalconFour »

Neil-B wrote: I use advanced control 24/7/365 and tbh have never seen such behaviour .. I wonder if your script and maxing the CPU as you do is maybe the cause of the issues with advanced control?
Irrelevant; I shouldn't've mentioned it (people always jump to blame the one odd thing). Happened long before I did that, and it happens on computers that don't even have a GPU.

HFM might be worth a look, if just as a change of scenery. Screenshot from XP though isn't very ... endearing :lol: (heck, this 2003 phpBB board isn't far off either, so it all matches, haha) and I don't think FAHControl is the one doing the TPF calculation. So it'd probably affect any viewer.

edit: oooh, but I like HFM. Looks a lot nicer in person. They ought to update those screenshots if that's the primary "front door" to learn about the app! ;)
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Location: UK

Re: Occasional bursts of insane TPF that don't self-correct

Post by Neil-B »

FalconFour wrote:Irrelevant; I shouldn't've mentioned it (people always jump to blame the one odd thing). Happened long before I did that, and it happens on computers that don't even have a GPU.
OK, in that case I can't help, I'm afraid, as my kit never has this issue ... Wasn't trying to blame anything tbh - when troubleshooting, one does tend to look for what might be different from stable configurations ... Hope someone is able to help you track down what is causing it.
BobWilliams757
Posts: 493
Joined: Fri Apr 03, 2020 2:22 pm
Hardware configuration: ASRock X370M PRO4
Ryzen 2400G APU
16 GB DDR4-3200
MSI GTX 1660 Super Gaming X

Re: Occasional bursts of insane TPF that don't self-correct

Post by BobWilliams757 »

I had a work unit hang at 99.9% (or close; I forget the exact figure) some time within the last few weeks, but didn't notice it do anything to change the PPD. It took at least 10-15 minutes to complete, but everything still finished and transferred just fine.

The delay showed in the logs, but that was the only indication that anything was weird. I don't think I've seen anything similar either before or since.

And agreed on HFM. It's a great addition, and so far has been completely painless to use.
Fold them if you get them!
toTOW
Site Moderator
Posts: 6296
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Occasional bursts of insane TPF that don't self-correct

Post by toTOW »

When FAHControl shows 99.9% completion and doesn't move, it's a sign that the WU is taking more time than expected to progress ... it could be because of an issue (core crash) or because something is using the system more than usual.

Gromacs doesn't like it when something disturbs its threads ... progress will run at the pace of the slowest one.
FalconFour
Posts: 29
Joined: Fri Sep 05, 2008 11:57 am

Re: Occasional bursts of insane TPF that don't self-correct

Post by FalconFour »

toTOW wrote: When FAHControl shows 99.9% completion and doesn't move, it's a sign that the WU is taking more time than expected to progress ... it could be because of an issue (core crash) or because something is using the system more than usual.

Gromacs doesn't like it when something disturbs its threads ... progress will run at the pace of the slowest one.
See, that's what's weird here. There's no stall, no slowdown - the WUs work at a predictable and expected rate -- and the one here was really 84...85...86% complete. The problem here is just a visual one, though. Take a look at the TPF and PPD in that screenshot. Sometimes (as is my concern), FAHClient will start picking up an insane TPF not reflective of the actual progress in the log, and extrapolate it into madness. Don't know what causes it or what the pattern to it is, but I've seen it crop up numerous times when checking status on various machines. (not every day, just often enough over the months/years to remember)

When it happens, the 99.99% is just a result of the TPF being used for multi-percent jumps in the reported percentage, though the logs still show a much lower percentage. Since the client logs/breaks up reporting into 1% chunks, I'm surprised FAHClient doesn't constrain the percentage interpolation to 0.99% (e.g. it'd only interpolate 84.00% to 84.99%, then hang there 'til the WU itself reports 85%). Instead, it barrels through from 84.00% to 99.99% if it thinks TPF is only a fraction of a second, as here... then jumps back to 85.00% and barrels back up to 99.99% if the TPF glitch still persists.
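The clamping idea can be sketched in a few lines of Python. This is only an illustration of the suggestion above, not FAHClient's actual code: `displayed_percent` and its parameters are hypothetical names, and the unclamped branch models the behaviour inferred from the screenshot rather than anything confirmed in the client.

```python
def displayed_percent(logged_percent, seconds_since_frame, tpf_seconds, clamp=True):
    """Interpolate progress between log frames.

    With clamp=True the estimate never runs more than 0.99% past the last
    logged frame; with clamp=False it is only capped at 99.99% overall.
    """
    if tpf_seconds <= 0:
        # No usable TPF: show the last logged value rather than guessing.
        return float(logged_percent)
    advance = seconds_since_frame / tpf_seconds
    if clamp:
        advance = min(advance, 0.99)
    return min(logged_percent + advance, 99.99)


# Last logged frame was 84%, 60 s ago, with a bogus TPF of half a second.
print(round(displayed_percent(84, 60, 0.5, clamp=False), 2))  # 99.99
print(round(displayed_percent(84, 60, 0.5, clamp=True), 2))   # 84.99
```

With a sane TPF the clamp never triggers before the next frame arrives, so it would only change what is shown when the TPF estimate itself has gone wrong.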

Mostly what I'm hoping to achieve here is to put one datapoint in the board, as I wasn't able to find a report of this kind of thing happening. If anyone else sees the same thing, maybe we could get more samples and see what's up - maybe months later or more. It's not a functional problem, but just an odd cosmetic glitch. Best I can tell, nothing is stalling or failing - it's just being reported weirdly. Maybe some obscure typing or scanning issue.
toTOW
Site Moderator
Posts: 6296
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Occasional bursts of insane TPF that don't self-correct

Post by toTOW »

It would be interesting to compare the numbers between FAHControl and Webcontrol when it happens ... if both are wrong, the issue might be from the client, but if only one is wrong, it might be located in FAHControl only ...