
Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 6:46 am
by bruce
Now that time has passed since the last post by UofM.MartinK, we can look again at what happened to his 16600 WU. It has been assigned to another machine and completed successfully.

https://apps.foldingathome.org/wu#proje ... 12&gen=402

Now that we know another machine has completed it successfully, it is more likely that this is a local problem or a driver problem. Hopefully, by comparing your error report with the successful completion by someone else, we can see what differences caused the crash. Any suggestions?

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 7:03 am
by PantherX
gunnarre wrote:...Is it possible to get FAHBench to work on a chosen good work unit? (project:13421 run:3765 clone:27 gen:1 works on the RX580 under Windows here...
Once FAHBench has been updated to support FahCore_22 then yes, you can run individual WUs.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 1:47 pm
by foldy
I can provide a prebuilt Windows FAHBench with FahCore_22 if anyone is interested. Or you can build OpenMM and FAHBench from source yourself.
viewtopic.php?f=38&t=24225&p=327396&hilit=fahbench#p327396

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 2:46 pm
by UofM.MartinK
Out of the blue, two units succeeded (one 16600, one 13421), now a lot of fails again:

Code:

******************************* Date: 2020-08-12 *******************************
16:32:58:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:3286 clone:5 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd63082338
16:33:07:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:3286 clone:11 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd9e814fe6
17:39:29:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1509 gen:126 core:0x22 unit:0x000000988f59f36f5ec36911abc746db
17:39:46:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:3240 clone:69 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d52793d8a1
17:39:56:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:13421 run:3240 clone:83 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d5e5dd3bdc
18:40:44:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:16600 run:0 clone:692 gen:280 core:0x22 unit:0x000001448f59f36f5ec36911ee8b859f
18:40:56:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:3200 clone:39 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d518f7b783
20:16:42:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:16600 run:0 clone:112 gen:402 core:0x22 unit:0x000001bb8f59f36f5ec36912518a1dea
23:20:29:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1450 gen:151 core:0x22 unit:0x000000aa8f59f36f5ec369114358cbbf
23:36:43:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:16600 run:0 clone:1745 gen:109 core:0x22 unit:0x000000798f59f36f5ec3691054460df8
08:43:33:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:16600 run:0 clone:1189 gen:420 core:0x22 unit:0x000001d48f59f36f5ec369117e373557
******************************* Date: 2020-08-13 *******************************
11:01:26:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:13421 run:7982 clone:63 gen:2 core:0x22 unit:0x0000000212bc7d9a5f26fb5a703f266e
12:59:37:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13421 run:7828 clone:80 gen:2 core:0x22 unit:0x0000000212bc7d9a5f26fb55a92eb41f
12:59:52:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:7695 clone:22 gen:2 core:0x22 unit:0x0000000312bc7d9a5f224a430ccd3540
13:00:03:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13421 run:7695 clone:46 gen:2 core:0x22 unit:0x0000000312bc7d9a5f224a41fbe4d9f2
All of the failed 16600 WUs have since been completed by others:
https://apps.foldingathome.org/wu#proje ... 09&gen=126 (completed by NVIDIA)
https://apps.foldingathome.org/wu#proje ... 92&gen=280 (failed by another AMD, completed by NVIDIA)
https://apps.foldingathome.org/wu#proje ... 12&gen=402 (completed by NVIDIA)
https://apps.foldingathome.org/wu#proje ... 50&gen=151 (completed by NVIDIA)
https://apps.foldingathome.org/wu#proje ... 45&gen=109 (completed by NVIDIA)

The 13421 WUs, by contrast, show no results yet other than my failures.

I have yet to find a WU that my RX580 failed and another AMD GPU completed, but I am sure they are out there, since this all seems to be a statistics game.
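
To make the "statistics game" easier to follow, the results could be tallied per project straight from the client log. The following is only a minimal Python sketch, assuming the log lines keep the "Sending unit results" format shown in the excerpt above; the function name tally_results and the SAMPLE text (three lines copied from that excerpt) are just for illustration.

```python
import re
from collections import Counter

# Matches the "Sending unit results" lines as they appear in the log excerpt.
RESULT_RE = re.compile(
    r"Sending unit results: id:\d+ state:SEND error:(?P<error>\w+) "
    r"project:(?P<project>\d+) run:(?P<run>\d+) "
    r"clone:(?P<clone>\d+) gen:(?P<gen>\d+)"
)


def tally_results(log_text):
    """Count (project, error) pairs found in FAHClient log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RESULT_RE.search(line)
        if match:
            key = (int(match.group("project")), match.group("error"))
            counts[key] += 1
    return counts


# Three lines copied from the log excerpt, used here as sample input.
SAMPLE = """\
17:39:29:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1509 gen:126 core:0x22 unit:0x000000988f59f36f5ec36911abc746db
08:43:33:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:16600 run:0 clone:1189 gen:420 core:0x22 unit:0x000001d48f59f36f5ec369117e373557
12:59:37:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13421 run:7828 clone:80 gen:2 core:0x22 unit:0x0000000212bc7d9a5f26fb55a92eb41f
"""

counts = tally_results(SAMPLE)
for (project, error), n in sorted(counts.items()):
    print(project, error, n)
```

Feeding it the whole log file instead of SAMPLE would give a FAULTY vs. NO_ERROR count per project, which is easier to compare across days than reading the raw lines.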

In what regard would FAHBench'ing help in this case? This rig runs Ubuntu 20.04. If running FAHBench makes sense, is there anything I can help with, such as providing some WU work directory snapshots?

I will try underclocking the GPU next if I have the time.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 2:54 pm
by muziqaz
I believe we might make a decision to ban AMD folding on Linux :D fahbenching will not help anything on Linux. It is clear as day that AMD on Linux is like winning a lottery. Does more harm than good. :(

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 4:32 pm
by gunnarre
I've seen some reports of ROCm working when the regular AMD Pro drivers don't work. Could that be something to try?

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 5:47 pm
by bruce
muziqaz wrote:I believe we might make a decision to ban AMD folding on Linux :D fahbenching will not help anything on Linux. It is clear as day that AMD on Linux is like winning a lottery. Does more harm than good. :(
That would be a real shame but it's better than continuing to do more harm than good. I wonder if it's possible to (A) find dependable drivers and (B) (somehow) ban the "lottery" drivers. :idea:

It might be possible that all the WUs that fail with these AMD drivers are later completed with nV drivers on nV hardware but from the limited information I can gather, I have no way to explore the validity of such a guess.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 6:09 pm
by muziqaz
The problem with Linux is that it has 5 billion different flavours, with half a billion different drivers, and another quarter of a billion OpenCL packages. Nearly everything is handpicked and DIY created. On Windows you get a single official driver package, and if that doesn't work, you go, throw a stone at MS's window and hope for the best. With Linux there are so many variables, it's insane. Combine that with catastrophic hit-or-miss stability from AMD, and we have a recipe for disaster.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 6:41 pm
by UofM.MartinK
I don't know if it's the moon phase or something, but my rig returned another 16600 WU successfully, and is going strong on the next one.

Keep in mind, nothing changed on the rig - even temperature is pretty constant.
bruce wrote:It might be possible that all the WUs that fail with these AMD drivers are later completed with nV drivers on nV hardware but from the limited information I can gather, I have no way to explore the validity of such a guess.
I checked into some more of the 16600 WUs which failed on my RX580 on August 4th, the same picture:

project:16600 run:0 clone:391 gen:243 (completed by NVIDIA)
project:16600 run:0 clone:1826 gen:16 (completed by NVIDIA)
project:16600 run:0 clone:1154 gen:368 (failed by another AMD, completed by NVIDIA)
project:16600 run:0 clone:1724 gen:53 (completed by NVIDIA)

But keep in mind, my RX580 also completed some other 16600 WUs in the meantime, although some with "restarting".

It is pretty clear that this is a very specific GPU(architecture)<>WU combination issue, perhaps facilitated by the driver. And then a lot of rolling dice - once I get a lot of fails again, I will update the driver - but even if it works better for two days in a row without a fail, it won't tell us much?
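
As a rough illustration of why a short failure-free stretch proves little: if each WU is treated as failing independently with some fixed probability, the chance of a clean streak can be estimated directly. This is only a back-of-the-envelope Python sketch; the independence assumption, the function names, and the example failure rate of 0.5 are all hypothetical and not measured from this rig.

```python
import math


def prob_clean_streak(failure_rate, n_units):
    """Chance of n_units consecutive successes if each WU fails
    independently with probability failure_rate (an assumption)."""
    return (1.0 - failure_rate) ** n_units


def units_needed(failure_rate, alpha):
    """Smallest streak length whose chance probability under the old
    failure rate drops below alpha. Requires 0 < failure_rate < 1."""
    return math.ceil(math.log(alpha) / math.log(1.0 - failure_rate))


# Hypothetical numbers: a 50% failure rate and a 5% threshold.
print(prob_clean_streak(0.5, 4))
print(units_needed(0.5, 0.05))
```

The lower the real failure rate, the longer the clean streak has to be before a driver change can be credited with anything, which is why a couple of good days on their own would not tell us much.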

I really wonder how we could get enough statistics around that... the client doesn't track driver version, I assume?

For reference, the driver since the creation of this rig in April is amdgpu-pro 20.10-1048554 on Ubuntu 20.04.

Also, it might be that individual cards with the same chip, or from different brands (mine is a Sapphire Nitro+) behave differently.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 8:33 pm
by NormalDiffusion
UofM.MartinK wrote: Also, it might be that individual cards with the same chip, or from different brands (mine is a Sapphire Nitro+) behave differently.
The Sapphire nitro+ is factory overclocked. Could you try to lower it to the default rx580 clock?
Had to do this on my Sapphire 290x...

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 9:36 pm
by bruce
I really wonder how we could get enough statistics around that... the client doesn't track driver version, I assume?
The client knows the driver version but I don't believe that the FAHCore places that information in the error report. That would potentially be an enhancement for the FAHCore.

The fact that your GPU is (factory) overclocked is potentially another important piece of information, so we have to ask you to make that change as NormalDiffusion has suggested and report back to us. :!:

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 11:36 pm
by _r2w_ben
bruce wrote:
I really wonder how we could get enough statistics around that... the client doesn't track driver version, I assume?
The client knows the driver version but I don't believe that the FAHCore places that information in the error report. That would potentially be an enhancement for the FAHCore.
On Windows, science.log contains a version reported by the OpenCL implementation, and it is sent to the server when a work unit is returned. (2348.4) corresponds to the Product version attribute of amdocl64.dll.

Code:

  PROFILE = FULL_PROFILE
  VERSION = OpenCL 2.0 AMD-APP (2348.4)
  NAME = AMD Accelerated Parallel Processing
  VENDOR = Advanced Micro Devices, Inc.
It's the same version number reported when FAHClient starts up.

Code:

19:09:45:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:2348.4
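
For anyone who wants to collect that number from many logs, here is a minimal Python sketch that extracts it. It assumes the startup line always has the field order shown above; the function name opencl_driver_versions is made up for this example, and the second sample line is the Linux startup line reported elsewhere in this thread.

```python
import re

# Field order taken from the FAHClient startup line shown above.
DEVICE_RE = re.compile(
    r"OpenCL Device (?P<index>\d+): Platform:(?P<platform>\d+) "
    r"Device:(?P<device>\d+) Bus:(?P<bus>\d+) Slot:(?P<slot>\d+) "
    r"Compute:(?P<compute>[\d.]+) Driver:(?P<driver>[\d.]+)"
)


def opencl_driver_versions(log_text):
    """Return the Driver: value of every OpenCL device line, in log order."""
    return [m.group("driver") for m in DEVICE_RE.finditer(log_text)]


# Both sample lines are quoted from this thread.
SAMPLE = """\
19:09:45:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:2348.4
04:34:20:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:3075.10
"""

versions = opencl_driver_versions(SAMPLE)
print(versions)
```

The version is kept as a string on purpose, so that 3075.10 is not silently turned into the number 3075.1.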

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Fri Aug 14, 2020 12:42 am
by ViTe
NormalDiffusion wrote:
UofM.MartinK wrote: Also, it might be that individual cards with the same chip, or from different brands (mine is a Sapphire Nitro+) behave differently.
The Sapphire nitro+ is factory overclocked. Could you try to lower it to the default rx580 clock?
Had to do this on my Sapphire 290x...
All RX580 Nitro+ cards have an unchanged base core clock (1257 MHz). Only the boost clock is a bit higher.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Fri Aug 14, 2020 5:11 am
by UofM.MartinK
Okay, found the driver in the log file, just have to figure out what the number refers to, thanks, _r2w_ben ! :)

Code:

04:34:20:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:3075.10
The thing with the boost clock is interesting, I did some more data mining, and got munin to display all temperature and fan speeds the "sensors" package found for that rig:

[Image: munin graph of the temperatures and fan speeds reported by "sensors"]

Update 8/15: I originally wrote "Temp1" instead of "Edge" was the GPU temperature, it was late in the day, sorry for the confusion.
"Edge" seems to be the GPU temperature, and whenever it is "high", most WUs fail, and when it is "medium", most WUs complete, and "low" might be no GPU WU active.

So something puts the card into either state (high temp perhaps boost clock state?) - the driver being a top candidate.

I will update the driver, and keep reporting. Perhaps also playing with downclocking later, but I will prioritize understanding the causes (since other AMD cards seem affected as well) and focusing on finding the "simplest" fix. An option to tell the driver to disable "boosting", for example.

Update: For now, instead of updating the driver, I just enabled the "POWER SAVING" profile instead of the default "3D_FULL_SCREEN" profile in pp_power_profile_mode. I still see the SCLK boost to 1411 MHz occasionally (the card seems to support the following frequencies: 300, 600, 900, 1145, 1215, 1257, 1300, and 1411 MHz). I will disallow the 1411 MHz state next if WUs are still failing in "POWER SAVING" mode.
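
A minimal Python sketch of how the list of allowed SCLK states could be worked out, for reference. It assumes the state list has the usual amdgpu pp_dpm_sclk layout ("index: frequency", with a "*" marking the active state); the SAMPLE text is reconstructed from the frequencies listed above, and the position of the "*" in it is made up. The function names are hypothetical. The sketch only computes the index list and does not touch sysfs.

```python
import re

# One state per line, e.g. "7: 1411Mhz *" ("*" marks the active state).
STATE_RE = re.compile(
    r"^(?P<index>\d+):\s*(?P<mhz>\d+)\s*MHz\s*(?P<active>\*)?\s*$",
    re.IGNORECASE,
)


def parse_sclk_states(text):
    """Return a list of (index, mhz, is_active) tuples."""
    states = []
    for line in text.splitlines():
        match = STATE_RE.match(line.strip())
        if match:
            states.append((
                int(match.group("index")),
                int(match.group("mhz")),
                match.group("active") is not None,
            ))
    return states


def allowed_indices(states, max_mhz):
    """Space-separated indices of all states at or below max_mhz."""
    return " ".join(str(i) for i, mhz, _ in states if mhz <= max_mhz)


# Reconstructed from the frequencies listed above; the "*" is hypothetical.
SAMPLE = """\
0: 300Mhz
1: 600Mhz
2: 900Mhz
3: 1145Mhz
4: 1215Mhz
5: 1257Mhz
6: 1300Mhz
7: 1411Mhz *
"""

states = parse_sclk_states(SAMPLE)
mask = allowed_indices(states, max_mhz=1300)
print(mask)
```

As far as I understand the amdgpu sysfs interface, the resulting index list would be written back to pp_dpm_sclk as root after setting power_dpm_force_performance_level to "manual"; treat that step as an assumption to verify against the kernel documentation before trying it.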

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Fri Aug 14, 2020 9:33 am
by muziqaz
OpenCL v3075 is quite recent. So OpenCL is up to date and shouldn't be a factor.