Multiple Failures

Moderators: Site Moderators, FAHC Science Team

theteofscuba
Posts: 96
Joined: Wed Dec 05, 2007 7:15 am
Hardware configuration: PS3, Phenom II X4, QX9775, HD 8570

Multiple Failures

Post by theteofscuba »

Lately, I've been getting quite a few MACHINE_UNSTABLE results. In fact, one work unit ("Project: 5736 (Run 1, Clone 43, Gen 485)") barfed at 12%, but when the client restarted, the WU completed successfully! I don't know what is going on here: I'm getting a mixed bag of work units that fail with an EUE, work units that fail and then complete after a restart, and work units that simply complete without any trouble. Judging by the fact that there were so many different EUEs for projects such as:

Project: 5735 (Run 3, Clone 114, Gen 513)
Project: 5737 (Run 3, Clone 9, Gen 550)
Project: 5741 (Run 3, Clone 35, Gen 629)
Project: 5742 (Run 0, Clone 19, Gen 689)
Project: 5743 (Run 0, Clone 98, Gen 539)
Project: 5744 (Run 3, Clone 95, Gen 249)


I am led to suspect that someone is intentionally feeding corrupt data back to the WU servers in an effort to get points at the expense of the science. On one hand, I hope that is the case, because it would mean that the failures are not due to a hardware problem. I am also hoping that this could just be a software bug in the GPU2 client. I am using -forcegpu ati_r700 because I am using an HD 5870, which is currently unsupported.
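
For anyone who wants to tally failures like these by project, here is a minimal Python sketch. It assumes the identifiers have been copied out by hand as plain text in the "Project: N (Run N, Clone N, Gen N)" form used above; the function name `tally_failures_by_project` is hypothetical and nothing here reads the client's own log files.

```python
import re
from collections import Counter

# Matches identifiers written as in the list above, e.g.
# "Project: 5735 (Run 3, Clone 114, Gen 513)".
WU_PATTERN = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s*(\d+),\s*Clone\s*(\d+),\s*Gen\s*(\d+)\)"
)


def tally_failures_by_project(text):
    """Count how many of the listed work units belong to each project number."""
    return Counter(int(m.group(1)) for m in WU_PATTERN.finditer(text))


# The six failed work units quoted in this post.
failed = """
Project: 5735 (Run 3, Clone 114, Gen 513)
Project: 5737 (Run 3, Clone 9, Gen 550)
Project: 5741 (Run 3, Clone 35, Gen 629)
Project: 5742 (Run 0, Clone 19, Gen 689)
Project: 5743 (Run 0, Clone 98, Gen 539)
Project: 5744 (Run 3, Clone 95, Gen 249)
"""

counts = tally_failures_by_project(failed)
for project, n in sorted(counts.items()):
    print(project, n)
```

For the list above, this reports six different projects with one failure each, so the failures are not concentrated in any single project.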
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Multiple Failures

Post by bruce »

I don't think your conspiracy theory is too likely. The servers are really very good at rejecting corrupt data. In fact, we get a number of complaints when the server rejects an upload as corrupt and the donor claims that it couldn't possibly be corrupt.

Nevertheless, whether the servers have an issue or not, you're saying that your hardware does not calculate the same results twice. The fact that a WU barfed at 12% and then finished normally after a restart points to a problem in your hardware. The most common causes of non-repeatability are a GPU or CPU that is overclocked, overheated, or underpowered. What temperature is your GPU running at? How are your 12 V rails doing?
theteofscuba

Re: Multiple Failures

Post by theteofscuba »

I'm ruling CPU overheating out. My other 7 CPU cores are running normally without error. Also, this system automatically shuts down and sounds an alarm when overheating, so I'm fairly confident that it's not the CPU; the cores are not overclocked or in the red zone temperature-wise. As for the GPU, I'm pretty sure it is getting plenty of juice from this 1000 W PSU and is plugged in properly. I'm not sure how to get a temperature reading on it, though. Also, no overclocking on the GPU. Any possibility it is a driver bug? I updated to Catalyst 9.11 for Windows 7 32-bit.

edit:

ATI Catalyst Control Center reports a steady ~80 C temperature. The gauge is pretty much halfway between super cool and max temp, so temperature does not appear to be the issue at this time.
theteofscuba

Re: Multiple Failures

Post by theteofscuba »

Ah, found the problem. The GPU was not fully seated in its PCIe slot. I expected a problem like that to be detected at boot time with some kind of alarm or obvious sign that something was wrong, instead of the subtle, unpredictable, and unreproducible errors that I was encountering.
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Multiple Failures

Post by 7im »

I'm glad to see that a more realistic cause of the problem (and a solution!) has been found. ;)