WU and crashing systems (inefficient)

DamianT
Posts: 3
Joined: Sat Jun 04, 2022 1:40 am

WU and crashing systems (inefficient)

Post by DamianT »

Hello,
Is there an issue with FAH where a GPU crash resets the whole WU to 0? In other words, am I getting a new WU after each crash?
I'm undervolting and optimizing my system, and while doing so I get penalized for making it more efficient.
The WU was at 87% when it crashed while I was testing my GPU at full turbo and 1000 mV. Do I now need to start from scratch again?
Why doesn't the FAH software create a backup every 20%? I'm using an NVMe drive, so I see no problem with writing backups, and there could be an option to disable them for more performance. People could then enable backups while optimizing their systems, so that a crash doesn't feel so bad.

My PC, with only the GPU optimized, draws 165 W instead of 225 W. I still need to optimize the RAM and the CPU, so there are many crashes ahead of me. Does that mean that, because of the crashes and the lack of backups for WUs, I won't get any points? :(
PaulTV
Posts: 179
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Re: WU and crashing systems (inefficient)

Post by PaulTV »

Hi,

The client does make backups, in C:\ProgramData\FAHClient\work or /var/lib/fahclient/work, depending on the OS. The client probably dropped the WU for another reason, possibly because too many errors occurred during the calculation. The log (see FAH Control) may tell you why.

System stability and system accuracy are critical for FAH. If you have a minor glitch in a video game, it's no big deal, but an error in the calculations for a FAH WU is an issue. There are safeguards which will hopefully catch those errors, but they're not something to rely on. When you want to optimize your system, you should first do so using performance test tools that just put load on your system. Please don't use FAH to test the stability of your system. Most people don't mess around with voltage settings, undervolting, etc. Using MSI Afterburner or nvidia-smi to limit the power draw of the GPU is fairly common (setting a max power percentage and letting the software do the rest), but that is as far as most people go.

Again, the number one priority for a system running FAH is stability. If your computer isn't stable, just don't run FAH.

Ryzen 5800X / RTX 4090 / Windows 11
Ryzen 5600X / RTX 3070 Ti / Ubuntu 20.04
Ryzen 5600 / RTX 3060 Ti / Windows 11
aetch
Posts: 447
Joined: Thu Jun 25, 2020 3:04 pm
Location: Between chair and keyboard

Re: WU and crashing systems (inefficient)

Post by aetch »

Warning - Long post and a steep learning curve.

Backups
In FAH, backups are called checkpoints.
For CPUs the interval seems to be set to 15 minutes. The advanced control panel has a slider to adjust this, but I don't know if it has any effect.
For GPUs it is set by the researcher and is typically either 2% or 5%.
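
To put those GPU intervals in perspective, here is a rough sketch of how much progress a single crash would cost. It assumes checkpoints land on exact multiples of the interval, which is a simplification (the real core may time them differently), and the function name `last_checkpoint` is made up for illustration.

```python
def last_checkpoint(progress_pct: float, interval_pct: float) -> float:
    """Return the most recent checkpoint at or below progress_pct.

    Assumes checkpoints are written at exact multiples of interval_pct,
    which is a simplification of what the FAH core actually does.
    """
    if interval_pct <= 0:
        raise ValueError("interval_pct must be positive")
    return (progress_pct // interval_pct) * interval_pct


# A WU that crashes at 87%, with the two typical GPU checkpoint intervals.
for interval in (2, 5):
    resume_at = last_checkpoint(87, interval)
    print(f"{interval}% checkpoints: resume at {resume_at}%, lose {87 - resume_at}%")
```

Under that assumption a crash at 87% costs a few percent of progress at most, not the whole work unit, as long as the core manages to resume.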

System stability
Don't undervolt or overclock. FAH is not a stability testing tool, although there are some here who argue that it is and treat it as such. Every work unit that FAH sends to your system is experimental in nature, and it's important that your system is stable to ensure the results are valid. That's not simply "your system doesn't crash", it's "your system carries out the calculations accurately", and there's a big difference between the two.
I always tell people that the CPU and GPU have stable operating windows which were predetermined by the manufacturer. By manually undervolting or overclocking you're pushing those components to the edge of, if not outside, their stable operating window.
There are things you can do to reduce power draw without affecting stability. There are also some tools you can install/run to check the health of your system.

System monitoring tools
Motherboard monitoring tools - these vary between manufacturers, but they should give you an overview of the temperatures of the main chips on your motherboard. Most will also give you control of the CPU/system fans.
HWMonitor/HWiNFO64 - the important thing about these tools is that they give you detailed monitoring of your CPU cores - temp/power/current.
MSI Afterburner - this is a GPU tweaking tool. It can both monitor the temps and adjust the power limit and fan speeds of your GPU (it can do a lot more, but that's what we're interested in).
These tools can give too much information, so don't be afraid to hide anything you don't need.
The main thing here is that you're checking the individual components are getting adequate cooling and nothing is running too hot.

Power tweaks
CPU - disable turbo/PBO (Precision Boost Overdrive). This can be done in the BIOS of your motherboard.
For Windows you can also lower the frequency of your CPU by adjusting the "maximum processor state".
Start -> Windows System -> Control Panel (large/small icons) -> Power options -> Change Plan Settings -> Change Advanced Power Settings -> Processor power management -> Maximum processor state -> Plugged in.
It's worth noting that setting this to 99% or lower will also disable turbo/PBO, so a trip into the BIOS is not required.

GPU - I recommend MSI Afterburner. The important sliders here are "Power limit (%)" and "Fan Speed (%)". I would suggest a power limit of about 70% (you still get about 80-90% of the performance). I'd also suggest manually setting the fan speed slightly higher to keep the GPU slightly cooler and to stop it ramping up and down with load.
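
To show why a 70% power limit is attractive, here is a small sketch of the arithmetic. The 220 W default limit is a hypothetical figure, the helper names are made up, and it assumes the card actually runs at its limit under FAH load. The `nvidia-smi -pl` command (NVIDIA only, takes watts, needs admin rights) is only printed, not executed.

```python
def power_limit_watts(default_limit_w: float, percent: float) -> int:
    """Convert a percentage power target into whole watts."""
    return round(default_limit_w * percent / 100)


def relative_efficiency(perf_fraction: float, power_fraction: float) -> float:
    """Work done per watt, relative to stock settings (stock = 1.0)."""
    return perf_fraction / power_fraction


# Hypothetical card with a 220 W default limit, capped at 70%.
watts = power_limit_watts(220, 70)
# nvidia-smi takes the limit in watts; shown as a string only, not executed.
print(f"nvidia-smi -i 0 -pl {watts}")

# 80-90% of stock performance at 70% of stock power.
for perf in (0.80, 0.90):
    gain = relative_efficiency(perf, 0.70)
    print(f"{perf:.0%} performance: {gain:.2f}x work per watt")
```

In other words, if the 80-90% estimate holds, each watt does noticeably more work than at stock, without touching voltages or clocks by hand.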

Stability testing
This is where things get awkward. FAH uses instruction sets which are not generally used in everyday software or games, which makes testing them harder.
CPU - my go-to program for this is Prime95. You'll want the latest version, of course. Test with small FFTs (maximum heat); you're looking to run the test for a number of hours, preferably 24, without error. Note - make sure the "disable AVX-512" box is checked, as FAH only uses up to AVX-256.
An unstable core could take anywhere from a few minutes to a few hours to show itself. My Ryzen had an unstable core. Initially I increased the VCore to stabilise it, but eventually I had to replace the motherboard because the VRMs (voltage regulator modules) on the original board were too weak for the processor.

GPU - TBH, I don't have a good stress test for the GPU. Also, some tests favour different cards.
FurMark - pretty much ubiquitous, but for our purposes it's really only producing heat. Beyond visual glitches it's not really an error checker.
GPUmemtest - this does seem to error check memory, but the last time I looked it was only 32-bit, so it is limited to 3.5-4GB.
Others may have good suggestions on tools to test your GPU.

CUDA - this is an NVIDIA-only technology. The biggest problem is that CUDA isn't really used in games, so it doesn't have a specific stress test. There are tests/benchmarks which use it as part of a larger test, but nothing that really hammers it constantly.

About the best you can do here is to make sure your GPU isn't glitching and the card isn't getting too hot.
Folding Rigs - None (25-Jun-2022)

DamianT
Posts: 3
Joined: Sat Jun 04, 2022 1:40 am

Re: WU and crashing systems (inefficient)

Post by DamianT »

Thank you for all the posts, they were really helpful. I checked the logs and this is what they say:
******************************* Date: 2022-06-04 *******************************
07:53:52:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
07:53:53:WARNING:WU02:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
07:59:54:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
07:59:54:WARNING:WU02:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
08:03:45:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
08:03:45:WARNING:WU02:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
08:07:14:WU02:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
08:07:17:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
08:35:08:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
08:35:09:WARNING:WU01:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
11:00:51:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
11:00:52:WARNING:WU01:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
12:49:39:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
12:49:40:WARNING:WU01:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
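
A log like this is easier to read once the core returns are tallied per work unit and slot. The following is a minimal sketch, assuming the WARNING lines always follow the format shown above; `tally_core_returns` is an illustrative name, not part of the FAH client.

```python
import re
from collections import Counter

# Matches client warnings such as:
# 07:53:53:WARNING:WU02:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
PATTERN = re.compile(
    r"WARNING:(?P<wu>WU\d+):(?P<slot>FS\d+):FahCore returned: (?P<status>\w+)"
)


def tally_core_returns(log_text: str) -> Counter:
    """Count FahCore return statuses per (work unit, slot, status)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[(match["wu"], match["slot"], match["status"])] += 1
    return counts


sample = """\
07:53:53:WARNING:WU02:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
07:59:54:WARNING:WU02:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
08:03:45:WARNING:WU02:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
08:07:17:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
08:35:09:WARNING:WU01:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
"""

for key, count in sorted(tally_core_returns(sample).items()):
    print(key, count)
```

For the excerpt above, this makes it plain that WU02 was dumped after three restarts on slot FS01, while WU01 was still being retried.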

I don't use FAH as a benchmark or stress test; with FAH I'm donating my computing power.
But as I'm running FAH 24/7, leaving the GPU at stock, meaning 1.15 V at 2700 MHz (RX 6600), results in a total system power draw of 220 W.
220 W * 24 h * 365 days = 1,927,200 Wh = 1,927 kWh; at 0.33 EUR per kWh that is about 636 euros per year.
If miners can optimize their systems, so can FAH users.
Running the GPU now at 2648 MHz @ 993 mV results in 160 W, meaning about 1,402 kWh * 0.33 EUR per kWh = about 463 euros per year.
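
The same yearly cost calculation as a short script, for anyone who wants to plug in their own numbers. It assumes a constant draw around the clock and a flat price of 0.33 EUR per kWh; the function name is just for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annual_cost_eur(watts: float, eur_per_kwh: float = 0.33) -> tuple[float, float]:
    """Return (kWh per year, cost in euros) for a constant power draw."""
    kwh = watts * HOURS_PER_YEAR / 1000
    return kwh, kwh * eur_per_kwh


for label, watts in (("stock", 220), ("undervolted", 160)):
    kwh, cost = annual_cost_eur(watts)
    print(f"{label}: {kwh:.0f} kWh/year, about {cost:.0f} EUR")
```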

For testing I use Prime95, FurMark, Cinebench R23 and Super PI.

My GPU now runs at 51°C and my CPU at 59°C, with 26°C in my room right now.
aetch
Posts: 447
Joined: Thu Jun 25, 2020 3:04 pm
Location: Between chair and keyboard

Re: WU and crashing systems (inefficient)

Post by aetch »

To be honest, with it dumping work units I'm not sure it's contributing anything other than heat and an electric bill.
FAH does have a limit on how many times a folding core can crash before it dumps the work unit; I'm not sure if it's 5 or 10.
Other people will be tasked with processing the work units you were assigned.
Something worth noting: it's not always your system, sometimes it's just a bad work unit. Your log extract shows at least two distinct work units; are there more?

You can go to the WU stats page to see what happened with your work units -> https://apps.foldingathome.org/wu
You will have various log entries like these:-

07:13:33:WU01:FS01:0x22:Project: 18213 (Run 13445, Clone 0, Gen 20)
07:08:33:WU01:FS00:Sending unit results: id:01 state:SEND error:NO_ERROR project:16995 run:117 clone:36 gen:162 core:0xa8 unit:0x00000024000000a20000426300000075
02:26:38:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:16969 run:6 clone:961 gen:282 core:0xa8 unit:0x000003c10000011a0000424900000006

Copy one of these lines into the PRCG box on the WU page. As long as you have copied everything from the start of the word "Project" to the number after "gen", you're golden.
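
If you have many log lines to check, the PRCG numbers (project, run, clone, gen) can be pulled out with a small script. This is a sketch that only covers the two spellings shown in the log entries above; `extract_prcg` is an illustrative helper, not something the client provides.

```python
import re

# Two spellings seen in the client log entries above:
#   Project: 18213 (Run 13445, Clone 0, Gen 20)
#   project:16995 run:117 clone:36 gen:162
PRCG_PATTERNS = (
    re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)"),
    re.compile(r"project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)"),
)


def extract_prcg(line: str):
    """Return (project, run, clone, gen) as ints, or None if not found."""
    for pattern in PRCG_PATTERNS:
        match = pattern.search(line)
        if match:
            return tuple(int(group) for group in match.groups())
    return None


line = "07:13:33:WU01:FS01:0x22:Project: 18213 (Run 13445, Clone 0, Gen 20)"
print(extract_prcg(line))
```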

It looks like it's your GPU that is crashing.
When the client sets itself up, it automatically assigns the CPU slot to FS00; the GPU slots start at FS01 and increment from there.
Try following what I suggested about MSI Afterburner.

FAH has limited resources and normally assigns each work unit to a single folder for processing. There are very few reasons for assigning a work unit to a second folder:-
1). it was dumped
2). it wasn't returned in time (the timeout was exceeded)
3). a researcher is doing something specific to cause a work unit to be assigned to multiple folders
toTOW
Site Moderator
Posts: 6296
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France
Contact:

Re: WU and crashing systems (inefficient)

Post by toTOW »

DamianT wrote: Sat Jun 04, 2022 1:15 pm I checked the logs and this is what it says: [...]
Undervolting requires the same adjustments as overclocking. Your GPU is unstable with its current settings; increase the voltage or reduce the clock.

It's better to optimize your GPU by reducing the power target and letting the manufacturer's settings do the job in a safe way.

FAH is a scientific project that requires hardware stability.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
DamianT
Posts: 3
Joined: Sat Jun 04, 2022 1:40 am

Re: WU and crashing systems (inefficient)

Post by DamianT »

Also, how do I compare GPUs based on FAH?
For example, I can buy a 3060 or an RX 6600 XT. How do I compare them if I also care about FAH?
I don't care about 5 FPS more, but I do care about efficiency and about how well the card performs in FAH.

Also, I just dropped my GPU frequency by 20% and my power consumption dropped by about 50% in games and in FAH. It was 100 W and now it's 52 W in games and 42 W in FAH. Crazy, and I didn't even touch the voltages at all.
aetch
Posts: 447
Joined: Thu Jun 25, 2020 3:04 pm
Location: Between chair and keyboard

Re: WU and crashing systems (inefficient)

Post by aetch »

DamianT wrote: Mon Jun 06, 2022 2:52 pm Also, how do I compare GPUs based on FaH?
Use a testing database.
https://folding.lar.systems
PaulTV
Posts: 179
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Re: WU and crashing systems (inefficient)

Post by PaulTV »

Nvidia cards are much better for FAH than AMD cards, mostly due to drivers and to Nvidia working with FAH to optimize the software.
Which Nvidia card to go for depends on several variables and on what you are aiming for.
BobWilliams757
Posts: 493
Joined: Fri Apr 03, 2020 2:22 pm
Hardware configuration: ASRock X370M PRO4
Ryzen 2400G APU
16 GB DDR4-3200
MSI GTX 1660 Super Gaming X

Re: WU and crashing systems (inefficient)

Post by BobWilliams757 »

toTOW wrote: Sat Jun 04, 2022 8:23 pm

FAH is a scientific project that requires hardware stability.
Bolded for emphasis.

The only hard part about getting set up for better folding is accepting that it will tax a system much more than most benchmarks. I ran multiple benchmarks at once for days, but when it came to folding the system was still not up to the task.

We need a mega intensive work unit simulation version of F@H Bench.
Fold them if you get them!