
Tesla P4 failing

Posted: Fri Aug 19, 2022 10:47 pm
by TheDevil
I let this sit for 3 hours, and this is all I get on this device. I've tried more than a few driver sets. Any advice? I also tried it in Linux, and it would not enable.

20:59:14:WU00:FS05:0x22: Version: 7.7.0
20:59:14:WU00:FS05:0x22:********************************************************************************
20:59:14:WU00:FS05:0x22:Project: 17918 (Run 929, Clone 0, Gen 58)
20:59:14:WU00:FS05:0x22:Reading tar file core.xml
20:59:15:WU00:FS05:0x22:Reading tar file integrator.xml
20:59:15:WU00:FS05:0x22:Reading tar file state.xml
20:59:15:WU00:FS05:0x22:Reading tar file system.xml
20:59:18:WU00:FS05:0x22:Digital signatures verified
20:59:18:WU00:FS05:0x22:Folding@home GPU Core22 Folding@home Core
20:59:18:WU00:FS05:0x22:Version 0.0.20
20:59:18:WU00:FS05:0x22: Checkpoint write interval: 50000 steps (5%) [20 total]
20:59:18:WU00:FS05:0x22: JSON viewer frame write interval: 10000 steps (1%) [100 total]
20:59:18:WU00:FS05:0x22: XTC frame write interval: 25000 steps (2.5%) [40 total]
20:59:18:WU00:FS05:0x22: Global context and integrator variables write interval: disabled
20:59:18:WU00:FS05:0x22:There are 4 platforms available.
20:59:18:WU00:FS05:0x22:Platform 0: Reference
20:59:18:WU00:FS05:0x22:Platform 1: CPU
20:59:18:WU00:FS05:0x22:Platform 2: OpenCL
20:59:18:WU00:FS05:0x22: opencl-device 0 specified
20:59:18:WU00:FS05:0x22:Platform 3: CUDA
20:59:18:WU00:FS05:0x22: cuda-device 0 specified
21:00:25:WU00:FS05:0x22:Attempting to create CUDA context:
21:00:25:WU00:FS05:0x22: Configuring platform CUDA
21:00:42:WU00:FS05:0x22:ERROR:Discrepancy: Forces are blowing up! 683 0
21:00:42:WU00:FS05:0x22:Saving result file ..\logfile_01.txt
21:00:42:WU00:FS05:0x22:Saving result file science.log
21:00:42:WU00:FS05:0x22:Saving result file state.xml
21:00:47:WU00:FS05:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT


*********************** Log Started 2022-08-19T19:12:18Z ***********************
20:05:32:WU00:FS05:0x22:WARNING:Console control signal 1 on PID 9796
20:06:33:WARNING:FS05:Killing WU00
20:59:08:WU00:FS05:0x22:ERROR:exception: Error loading CUDA module: CUDA_ERROR_ILLEGAL_ADDRESS (700)
20:59:13:WARNING:WU00:FS05:FahCore returned an unknown error code which probably indicates that it crashed
20:59:13:WARNING:WU00:FS05:FahCore returned: UNKNOWN_ENUM (-1073740791 = 0xc0000409)
21:00:42:WU00:FS05:0x22:ERROR:Discrepancy: Forces are blowing up! 683 0
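
For reference, the two numbers on the "FahCore returned" line are the same exit code written two ways: a signed 32-bit integer and its unsigned hex form. The short Python sketch below shows the conversion; the helper name `to_unsigned32` is made up for illustration and is not part of the Folding@home client. Windows lists 0xC0000409 as STATUS_STACK_BUFFER_OVERRUN, a status that fail-fast aborts also raise, so on its own it probably says more about how the core died than why.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as unsigned (hypothetical helper)."""
    return code & 0xFFFFFFFF


# Signed value copied from the "FahCore returned" line in the log above.
exit_code = -1073740791
print(hex(to_unsigned32(exit_code)))
```

The same mask works for any negative exit code the client logs, which makes it easier to look the value up in a list of Windows NTSTATUS codes.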

Re: Tesla P4 failing

Posted: Sat Aug 20, 2022 1:37 pm
by toTOW
Did you test the GPU with other applications? I don't know if OCCT would work on a Tesla card ...

Did you check your system RAM for errors with Memtest86+?

Are temperatures and voltages fine on the GPU and the CPU?

Which drivers did you use?

Re: Tesla P4 failing

Posted: Sat Aug 20, 2022 6:24 pm
by TheDevil
toTOW wrote: Sat Aug 20, 2022 1:37 pm Did you test the GPU with other applications? I don't know if OCCT would work on a Tesla card ...

Did you check your system RAM for errors with Memtest86+?

Are temperatures and voltages fine on the GPU and the CPU?

Which drivers did you use?
Currently 516.94 with CUDA 11.7
Tried:
412.36 - CUDA 10.0
453.64 - CUDA 11.0

Idle temp is 43 °C, hotspot is 53 °C.
When I send a job to it, temp is 58 °C, hotspot 70 °C.

Idle draw is 1 W at 0% load. With a task assigned to it, it's at 25 W and only pulling 30% of TDP.
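
Readings like these can also be logged while a work unit is running, so they can be lined up against the timestamps in the client log. The following is a minimal Python sketch, assuming `nvidia-smi` is on the PATH and that the driver reports all four fields for this card (some boards return "[N/A]" for power fields, which this sketch does not handle). The function names are hypothetical, and the enforced power limit is used here as a stand-in for TDP.

```python
import subprocess

# Fields requested from nvidia-smi, in the order parse_row() expects.
QUERY = "temperature.gpu,power.draw,power.limit,utilization.gpu"


def parse_row(row: str) -> dict:
    """Parse one CSV row produced with --format=csv,noheader,nounits."""
    temp, draw, limit, util = (float(x) for x in row.split(","))
    return {
        "temp_c": temp,
        "power_w": draw,
        "limit_w": limit,
        "util_pct": util,
        # Assumption: the power limit approximates the card's TDP.
        "pct_of_limit": 100.0 * draw / limit,
    }


def query_gpu(index: int = 0) -> dict:
    """Run nvidia-smi for one GPU and return the parsed readings."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={index}", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_row(out.strip().splitlines()[0])
```

Calling `query_gpu(0)` every few seconds while the slot is folding would show whether power draw and utilization stay low right up to the point where the core fails.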

System RAM: HPE 752369-081 16GB 2Rx4 DDR4 2133MHz PC4-17000 ECC x8 (128GB). The server would know if any of this RAM was bad, IIRC.

FYI this is a server, so there is no OC or silliness about stability as far as I know. The device is an HP DL360 G9.

The device worked in ESXi too and was able to do graphics duty on my VMs.

GPU-Z validated it - https://www.techpowerup.com/gpuz/details/g3uau

And then JUST for giggles, I ran it on an ETH miner, got 16-17 MH/s, and it ran fine for 45 mins purely as a test.

Re: Tesla P4 failing

Posted: Mon Aug 22, 2022 8:38 pm
by JimboPalmer
One idea: are you getting the drivers directly from Nvidia?

https://www.nvidia.com/Download/driverR ... 588/en-us/