Failed GPU slot daily :(

It seems that a lot of GPU problems revolve around specific versions of drivers. Though AMD has their own support structure, you can often learn from information reported by others who fold.

Moderators: Site Moderators, FAHC Science Team

bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Failed GPU slot daily :(

Post by bruce »

I edited my earlier post so it contains new information.

Your GPU is clearly having trouble with that WU. I would not leave it running. Let's start by simply pausing your GPU. Go to FAHControl and in the middle of the initial screen, you'll see a small chart called "Folding Slots" and another called "Work Queue". In the upper chart, you'll see two green "Running" statuses, one for the cpu slot and one for the gpu slot. Right-click on the green Status flag of the gpu slot and select Pause.
crimson1077
Posts: 33
Joined: Sat Aug 22, 2020 4:37 pm

Re: Failed GPU slot daily :(

Post by crimson1077 »

bruce wrote:
... So far, what's weird is, I'm still noticing that PRCG is changing from 13422 (2975, 69, 2) to PRCG 16918 (4, 50, 39) then right back to PRCG 13422 (2975, 69, 2) again and repeat. Is that a sign of anything?
Yes, and it's a good sign. FAH is running two WUs that run independently of each other, one on your CPU and one on your GPU. The output logs are intermixed, so you have to learn to read the combined output or use the filtering function that's built into FAHControl.

20:18:34:WU00:FS00:0xa7:       SIMD: avx_256
20:18:34:WU00:FS00:0xa7:Project: 14824 (Run 1225, Clone 3, Gen 52)
...
20:18:37:WU01:FS01:0x22:Project: 13422 (Run 3163, Clone 20, Gen 0)
FAHCore 0xa7 is running WU00 on slot FS00 using the avx_256 hardware feature. Independently, FAHCore 0x22 is running (or trying to run) WU01 on slot FS01.
Thanks bruce.
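
For anyone who wants to separate the intermixed log outside of FAHControl, here is a minimal sketch in Python. It assumes every work-unit line follows the `HH:MM:SS:WUnn:FSnn:0xNN:message` layout seen in the excerpt above; the names `LogEntry`, `parse_entries` and `entries_for_slot` are made up for this illustration and are not part of FAH.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple

# Layout inferred from the sample lines in this thread (an assumption):
# HH:MM:SS:WUnn:FSnn:0xNN:message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):"
    r"(?P<wu>WU\d+):(?P<slot>FS\d+):(?P<core>0x[0-9a-fA-F]+):(?P<msg>.*)$"
)


class LogEntry(NamedTuple):
    time: str
    wu: str
    slot: str
    core: str
    msg: str


def parse_entries(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield an entry for each line matching the WU/FS layout; skip all others."""
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m:
            yield LogEntry(m["time"], m["wu"], m["slot"], m["core"], m["msg"].strip())


def entries_for_slot(lines: Iterable[str], slot: str) -> List[LogEntry]:
    """Return only the entries that belong to one folding slot, e.g. 'FS01'."""
    return [e for e in parse_entries(lines) if e.slot == slot]


# The lines quoted in this thread, including the elided "..." line.
SAMPLE = [
    "20:18:34:WU00:FS00:0xa7:       SIMD: avx_256",
    "20:18:34:WU00:FS00:0xa7:Project: 14824 (Run 1225, Clone 3, Gen 52)",
    "...",
    "20:18:37:WU01:FS01:0x22:Project: 13422 (Run 3163, Clone 20, Gen 0)",
]

if __name__ == "__main__":
    for entry in entries_for_slot(SAMPLE, "FS01"):
        print(entry.time, entry.core, entry.msg)
```

Filtering on `FS00` keeps the CPU lines and `FS01` keeps the GPU lines, which is roughly what the built-in filter does for you.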
crimson1077
Posts: 33
Joined: Sat Aug 22, 2020 4:37 pm

Re: Failed GPU slot daily :(

Post by crimson1077 »

Paused!
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Failed GPU slot daily :(

Post by bruce »

I don't have a good explanation as to why the Hawaii [Radeon R9 200/300 Series] is having trouble with Project 13422 (2975, 69, 2) but it is ... and it looks like it's looping. That's not good so I suggested the Pause. The log should continue to show (only) what's happening to the CPU assignment and it will be a lot easier to follow.

Then we'll figure out what to do with the GPU.

Your GPU is running Core: Core22 Version: 0.0.11. It has been going through a process of bug fixing and I think you've found a new one. The developer is on the east coast, so I don't think we can contact him at 01:00 EST. :(

That leaves us 2 choices. 1) tell you how to dump the WU and hope that it's replaced with something your GPU can process or 2) wait until morning in NYC and let him recommend a way to figure out what's going on.

If you choose 2, it's actually better than leaving it running (wasting GPU power and gathering useless repeated messages).

https://apps.foldingathome.org/cpu shows that your AMD GPU has completed several other WUs from Project 13422 but this one is somehow different.
crimson1077
Posts: 33
Joined: Sat Aug 22, 2020 4:37 pm

Re: Failed GPU slot daily :(

Post by crimson1077 »

bruce, I just can't shake this one! I hope I found something! Because before this, I was downing 13,000-point WUs in an hour with this bad puppy. If it were you, how would you go about it?
crimson1077
Posts: 33
Joined: Sat Aug 22, 2020 4:37 pm

Re: Failed GPU slot daily :(

Post by crimson1077 »

bruce wrote:I don't have a good explanation as to why the Hawaii [Radeon R9 200/300 Series] is having trouble with Project 13422 (2975, 69, 2) but it is ... and it looks like it's looping. That's not good so I suggested the Pause. The log should continue to show (only) what's happening to the CPU assignment and it will be a lot easier to follow.

Then we'll figure out what to do with the GPU.

Your GPU is running Core: Core22 Version: 0.0.11. It has been going through a process of bug fixing and I think you've found a new one. The developer is on the east coast, so I don't think we can contact him at 01:00 EST. :(

That leaves us 2 choices. 1) tell you how to dump the WU and hope that it's replaced with something your GPU can process or 2) wait until morning in NYC and let him recommend a way to figure out what's going on.

If you choose 2, it's actually better than leaving it running (wasting GPU power and gathering useless repeated messages).

https://apps.foldingathome.org/cpu shows that your AMD GPU has completed several other WUs from Project 13422 but this one is somehow different.
It's up to you, I don't mind taking a 1am dump! If that fails we can unleash the devs.

(Funny, I was actually reading up about dumping a WU, but I definitely need a walkthrough there.)
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Failed GPU slot daily :(

Post by bruce »

You have to choose. Personally, I'd rather help fix a bug than complete more WUs, but both are important. Others don't always have the same preferences that I do.
crimson1077
Posts: 33
Joined: Sat Aug 22, 2020 4:37 pm

Re: Failed GPU slot daily :(

Post by crimson1077 »

bruce wrote:You have to choose. Personally, I'd rather help fix a bug than complete more WUs, but both are important. Others don't always have the same preferences that I do.
I'd rather wait for the dev, my friend. Thank you for your help.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Failed GPU slot daily :(

Post by bruce »

I'll send him an email. If you change your mind, or if the dev takes Sunday off, this should allow you to dump it.

FAH's data files are at C:\Users\Crimson\AppData\Roaming\FAHClient
The work files are in \work\0n where n is the queue position. (01, in your case).

I'd make a backup of 01 somewhere. With the WU paused, I think you can delete enough of the contents of 01 to force it to abort itself, if that's your choice. There's no guarantee that the same thing won't happen to another WU.
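
If you'd rather script the backup step than copy the folder by hand, a minimal sketch follows. It assumes the `work\0n` layout described above and that the slot is paused (or the client stopped) before copying. The function name `backup_work_dir` and the backup folder name are hypothetical, and the sketch only copies; it deletes nothing.

```python
import shutil
from pathlib import Path


def backup_work_dir(data_dir: Path, queue_index: int, backup_root: Path) -> Path:
    """Copy <data_dir>/work/<NN> to <backup_root>/work-<NN>-backup and return the copy.

    Assumes the slot is paused so the files are not changing while they are copied.
    """
    src = Path(data_dir) / "work" / f"{queue_index:02d}"
    if not src.is_dir():
        raise FileNotFoundError(f"no work folder at {src}")
    dest = Path(backup_root) / f"work-{queue_index:02d}-backup"
    # copytree raises FileExistsError if dest already exists, so an older
    # backup is never silently overwritten.
    shutil.copytree(src, dest)
    return dest


# Example call (not executed here). The data path is the one from this thread;
# the backup location D:\fah-backups is only a placeholder.
# backup_work_dir(
#     Path(r"C:\Users\Crimson\AppData\Roaming\FAHClient"), 1, Path(r"D:\fah-backups")
# )
```

Removing the contents of 01 afterwards is left as a manual step on purpose, so nothing gets deleted by accident.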
crimson1077
Posts: 33
Joined: Sat Aug 22, 2020 4:37 pm

Re: Failed GPU slot daily :(

Post by crimson1077 »

This is definitely Mr. Mcbuggy buggerton's bug house going on here.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Failed GPU slot daily :(

Post by bruce »

I take it from your handle that you're a dedicated Red-box (AMD) fan. There are a number of unexplained AMD bugs that Green-box (nV) fans don't encounter. All the red-box fans will thank you for your dedication if we can fix this one.
crimson1077
Posts: 33
Joined: Sat Aug 22, 2020 4:37 pm

Re: Failed GPU slot daily :(

Post by crimson1077 »

bruce wrote:I take it from your handle that you're a dedicated Red-box (AMD) fan. There are a number of unexplained AMD bugs that Green-box (nV) fans don't encounter. All the red-box fans will thank you for your dedication if we can fix this one.
Not necessarily. I'm not a fan of either; at the time, the R9 390 was a better buy than the GTX 970 for me and for gaming. (I never thought I'd fold on my gaming rig.)
I don't see this ever happening on the two 660s I have folding right now. I just got my first Nvidia cards this year and I'm loving both! I have to say, up until this, it was kicking butt with the 390! Easy 200k 24 avg. But I'm not so sure about AMD folding now... sheesh. I do wanna at least get it going again; she's a beaut of a card... she's worth it. 8-)

hint, hint, Roll Tide.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Failed GPU slot daily :(

Post by bruce »

Oh, that Crimson 8-)
crimson1077
Posts: 33
Joined: Sat Aug 22, 2020 4:37 pm

Re: Failed GPU slot daily :(

Post by crimson1077 »

bruce wrote:Oh, that Crimson 8-)
Thanks for your help, bruce. I did as you said with the 01 folder: backed it up and deleted it. As I fired up FAH, I watched in Windows File Explorer as 01 reappeared, as it should, but it's still on PRCG 13422.
Maybe it's too soon to tell. We shall see!
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Failed GPU slot daily :(

Post by Neil-B »

Project 13422 is the P part of PRCG: R is Run, C is Clone, and G is Generation. A PRCG identifies a specific WU within a Project, so hopefully you have actually got a new PRCG within Project 13422. Related to the buggy PRCG: was the couple of days' pause you mentioned (iirc) during the folding of that WU? And did you pause the slot and exit the client before shutting down? I am simply wondering if the initial failure of the WU may have been linked to some corruption caused at that point. For a safe shutdown it can be worth pausing slots (there are threads discussing the best time to do this relative to checkpoints), quitting the client, then waiting a bit to ensure everything is saved properly before shutting down the system. This seems to minimise the chance of issues in that respect.
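
To check whether a replacement WU is really a different one and not just the same Project number, the full PRCG can be compared. Below is a small Python sketch that assumes the `Project: P (Run R, Clone C, Gen G)` wording from the log lines quoted earlier in the thread; `PRCG` and `parse_prcg` are names invented for the example.

```python
import re
from typing import NamedTuple, Optional


class PRCG(NamedTuple):
    project: int
    run: int
    clone: int
    gen: int


# Wording assumed from the log excerpt in this thread:
# "Project: 13422 (Run 3163, Clone 20, Gen 0)"
PRCG_RE = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s*(\d+),\s*Clone\s*(\d+),\s*Gen\s*(\d+)\)"
)


def parse_prcg(text: str) -> Optional[PRCG]:
    """Extract the PRCG from a log line, or return None if the line has none."""
    m = PRCG_RE.search(text)
    return PRCG(*(int(g) for g in m.groups())) if m else None


if __name__ == "__main__":
    # The WU reported as looping, and a Project line from the quoted log.
    stuck = PRCG(project=13422, run=2975, clone=69, gen=2)
    seen = parse_prcg("20:18:37:WU01:FS01:0x22:Project: 13422 (Run 3163, Clone 20, Gen 0)")
    if seen is not None:
        print("same project:", seen.project == stuck.project)
        print("same WU:", seen == stuck)
```

Two WUs are the same only when all four numbers match, so seeing Project 13422 again does not by itself mean the buggy WU came back.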
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
