When good work units go bad

Moderators: Site Moderators, FAHC Science Team

Bobcat
Posts: 24
Joined: Fri Dec 10, 2010 12:08 am
Hardware configuration: i7-860 2.8 GHz, 4 GB RAM, 1 TB HD, Win7 64-bit
Sapphire (ATI) 5670 825 MHz, 1 GB DDR5 RAM
Location: New Jersey

When good work units go bad

Post by Bobcat »

I don't understand why there are "bad" work units. In particular:

1. Exactly what is a bad work unit?

2. Why do they occur?

3. Why can't they be detected and handled automatically, instead of people having to start threads about them and moderators manually marking them as "bad"?

You can get somewhat technical in your responses, as I have an electrical engineering degree. However, I'm not a statistics weenie, so complicated statistical answers may go over my head.

Thanks...
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: When good work units go bad

Post by PantherX »

You can read this -> viewtopic.php?f=19&t=16526
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: When good work units go bad

Post by bruce »

The complete answer to your question is actually statistical.

To put it in non-technical terms that most everybody can understand, let's talk about simulating the solar system.

It is relatively easy to simulate 9 planets and a considerable number of moons. The masses and forces are pretty well known and nothing is really statistically uncertain. It's easy to pick a set of initial conditions because we know where everything is and at what velocity each mass is moving. Now try to simulate Saturn's rings, where there are lots and lots of tiny particles that are not only influenced by gravity, but also by the pressure of light on each particle and perhaps by the irregular shape of nearby particles.

When we replace planets and moons with atoms, and the laws of gravitation with atomic force fields, several significant changes have to be made. First, every atom moves both under the influence of a deterministic force field (that alone would make simulation easy) and also gets kicked around by a statistical thermal component (unless everything is at absolute zero, and that wouldn't be a useful simulation for proteins). Second, we have to make some assumptions about initial positions and initial velocities. There are conditions on those assumptions (e.g., they have to be consistent with the desired temperature).
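
To make that first point a bit more concrete, here is a minimal sketch of a single Langevin-style integration step, with a deterministic force plus a random thermal kick. None of this is FAH's actual core code; the force function, friction constant, and temperature here are made-up placeholders:

Code: Select all
import numpy as np

def langevin_step(x, v, force, dt, mass=1.0, gamma=0.5, kT=1.0, rng=None):
    """One step of Langevin dynamics: a deterministic force plus friction
    and a random thermal kick (the 'statistical thermal component')."""
    rng = rng or np.random.default_rng()
    f = force(x)                                                   # deterministic part
    kick = rng.normal(0.0, np.sqrt(2.0 * gamma * kT * mass / dt), size=x.shape)
    a = (f - gamma * mass * v + kick) / mass
    v = v + a * dt
    x = x + v * dt
    return x, v

# Toy usage: one particle in a harmonic well, jostled by thermal noise.
spring = lambda x: -1.0 * x
x, v = np.array([1.0]), np.array([0.0])
for _ in range(1000):
    x, v = langevin_step(x, v, spring, dt=0.01)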

Going back to the solar system model for a moment, suppose you had to ASSUME a set of initial positions and initial velocities for each planet/moon. (No peeking allowed.) Chances are pretty good that some choices of assumptions would result in a planet shooting right out of the solar system or at least a moon breaking away from its planet. How would you know which set of assumptions to discard? You'd have to run the simulation for a while and eliminate those WUs that are inconsistent with the concept of a (relatively) stable solar system.

[FAH does weed out most unstable WUs by beta-testing new projects. Unstable WUs generally show up rather quickly and they're weeded out before the project is released. As each simulated trajectory is extended, instabilities are less and less likely.]
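
In code terms, the "assume starting conditions, run for a while, and throw out the unstable ones" idea looks roughly like this. It's only a sketch; the temperature constant, box size, and function names are arbitrary:

Code: Select all
import numpy as np

def sample_initial_velocities(n_bodies, kT=1.0, mass=1.0, rng=None):
    """Draw random initial velocities consistent with a target temperature
    (Maxwell-Boltzmann); the constants here are placeholders."""
    rng = rng or np.random.default_rng()
    return rng.normal(0.0, np.sqrt(kT / mass), size=(n_bodies, 3))

def is_unstable(positions, box_size=100.0):
    """Crude screen: discard a run if anything went non-finite or flew far
    outside the region of interest (a 'planet shooting out of the system')."""
    return (not np.all(np.isfinite(positions))) or np.any(np.abs(positions) > box_size)

# A short trial run of each candidate start would then call is_unstable()
# on the resulting positions and drop the starting conditions that blow up.
v0 = sample_initial_velocities(n_bodies=10)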

OK, now back to Saturn's rings. We need a more detailed simulation than you'd get if you assumed every particle is moving in an orderly progression around the planet. There are random disturbances that are important at a detailed level. Even if you know the average speed and direction of a group of particles, you need to start with a number of different random distributions of initial velocities and see which ones establish known patterns like the gaps and twists that have been photographed by spacecraft.

At many levels, I've ignored a lot of details in this description. For a more accurate and more complete explanation, get a good book on physical chemistry or, better yet, on molecular simulation.

I think that answers your questions 1 and 2. Question 3 is more difficult. Writing software that can detect instabilities without a software crash is a real challenge. There have been some real improvements in capturing data from a software crash and reporting it back to the server, and future versions of the software will continue to improve. Also (see the topic referenced by PantherX above), instabilities in the calculations or software crashes can be caused by "bad WUs," but they can also be caused by "bad hardware," which includes overclocked or overheated hardware components. The FAH system (including the client software, the FAHCores that do the simulation, and the servers that try to make sense of the various crash reports) still has to figure out which kind of crash it was before a WU can be marked as bad. That software, too, is improving, but it still has room for improvement.
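
As a rough illustration (a sketch of the idea only, not how the FAHCores are actually written), "detecting an instability before it becomes a crash" amounts to sanity checks like these inside the simulation loop:

Code: Select all
import numpy as np

class UnstableSimulation(Exception):
    """Raised so a bad trajectory ends with a clean, reportable error
    instead of a hard crash."""

def sanity_check(positions, velocities, energy, start_energy, drift_limit=1000.0):
    # Non-finite numbers mean the integration has already gone off the rails.
    if not (np.all(np.isfinite(positions)) and np.all(np.isfinite(velocities))):
        raise UnstableSimulation("non-finite coordinates or velocities")
    # A huge jump in energy points to a bad starting structure or bad hardware;
    # which one it was still has to be sorted out after the report comes back.
    if abs(energy - start_energy) > drift_limit:
        raise UnstableSimulation("energy drifted far beyond its starting value")

# Example: a NaN anywhere stops the run cleanly.
try:
    sanity_check(np.array([np.nan]), np.array([0.0]), energy=1.0, start_energy=1.0)
except UnstableSimulation as e:
    print("stopped cleanly:", e)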
Bobcat
Posts: 24
Joined: Fri Dec 10, 2010 12:08 am
Hardware configuration: i7-860 2.8 GHz, 4 GB RAM, 1 TB HD, Win7 64-bit
Sapphire (ATI) 5670 825 MHz, 1 GB DDR5 RAM
Location: New Jersey

Re: When good work units go bad

Post by Bobcat »

So you're assuming various sets of initial conditions, some of which do not lead to valid solutions.

For the moment, assume the hardware being used is operating correctly: are bad WUs cases where some boundary condition is exceeded (e.g., Venus crashes into the Sun), or cases where mathematical errors occur (e.g., divide by zero or the tangent of 90 degrees), or both?
Qinsp
Posts: 216
Joined: Sun Oct 17, 2010 2:34 pm

Re: When good work units go bad

Post by Qinsp »

Random Ramblin':

I used to write CAD software in a previous life, and one of my areas was importing files created by other systems.

You always use some kind of input validation, but when the math gets hairy (non-uniform rational B-spline surfaces or bi-cubic parametric boundary curves come to mind), it can be really hard to spot bad data. Processing bad data can crash the app. Not just divide-by-zero or infinite loops, but stack overflows and non-resolving loops (they don't repeat, but they don't finish either). Calculus formulas are often solved by looping in a computer program, but there are types of problems where the loop goes nuts.
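
To give a flavor of what "the loop goes nuts" means, here's a guarded iterative solver. The guards are exactly the kind of thing that's easy to leave out; this is plain illustrative Python and has nothing to do with any particular CAD kernel:

Code: Select all
import math

def newton_solve(f, df, x0, tol=1e-10, max_iter=100):
    """Newton's method with guards so pathological input fails cleanly
    instead of dividing by zero, overflowing, or looping forever."""
    x = x0
    for _ in range(max_iter):
        d = df(x)
        if d == 0 or not math.isfinite(d):
            raise ValueError("derivative vanished or went non-finite")
        step = f(x) / d
        x -= step
        if not math.isfinite(x):
            raise ValueError("iteration diverged")
        if abs(step) < tol:
            return x
    raise ValueError("no convergence: the non-resolving loop case")

root = newton_solve(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)   # ~1.4142135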

The trick is to figure out how to give a program an iron stomach: no matter what kind of data it's given, it will exit gracefully. While that's an admirable goal, I'm not sure even Microsoft can guarantee it.

Ideally, PG is evaluating bad WUs to figure out how to exit gracefully from each particular issue, but there are hundreds of ways to crash an app.
Quality Inspection - Corona, CA, USA
Dimensional Inspection Laboratory
Pat McSwain, President
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: When good work units go bad

Post by bruce »

Strictly speaking, initial conditions which cause Venus to crash into the sun are valid -- it's just not one of the cases Stanford is interested in. The simulation would be valid right up to the point that the two bodies got close enough to need equations for how the atmospheres behave when they start to interact. ["splash" ;) ] Those equations would never be needed when running a "normal" simulation so nobody is going to spend any time writing and testing that part of the code. Fortunately it's pretty easy to stop a solar system simulation and say "Simulation cannot continue." The results are valid right up to that point. [I've seen a few equivalent cases in FAH reports, and those simulations stop cleanly and award points proportional to how much of the WU was actually completed.]
Leonardo
Posts: 261
Joined: Tue Dec 04, 2007 5:09 am
Hardware configuration: GPU slots on home-built, purpose-built PCs.
Location: Eagle River, Alaska

Re: When good work units go bad

Post by Leonardo »

Chances are pretty good that some choices of assumptions would result in a planet shooting right out of the solar system....
But what if we downgrade it, and strip its 'planet' title from it, like poor, poor, Pluto! :cry: Actually, 7im, thanks for the excellent analogy.
Non-uniform rational B-spline surfaces or bi-cubic parametric boundary curves comes to mind
I had a girlfriend back in the 70s who matched that description. She was all looks and no brains. We parted ways.

I am continually impressed by dedicated Folders. :D
Leonardo
Posts: 261
Joined: Tue Dec 04, 2007 5:09 am
Hardware configuration: GPU slots on home-built, purpose-built PCs.
Location: Eagle River, Alaska

Re: When good work units go bad

Post by Leonardo »

Sorry, my attribution was to the wrong person.

Bruce, thanks for the excellent analogy. :)
Bobcat
Posts: 24
Joined: Fri Dec 10, 2010 12:08 am
Hardware configuration: i7-860 2.8 GHz, 4 GB RAM, 1 TB HD, Win7 64-bit
Sapphire (ATI) 5670 825 MHz, 1 GB DDR5 RAM
Location: New Jersey

Re: When good work units go bad

Post by Bobcat »

Qinsp wrote:I used to write CAD software in a previous life, and one of my areas was importing files created by other systems.
I've been writing embedded software for over 30 years. I started to write a lengthy post about validating user input, but... Let's just say that if a bad value will cause a system reset or hang-up, I'll correct the value to something which will allow the system to run (e.g., if the valid range is 1 to 255, and 0 will cause a crash, I'll change the value to 1). I don't check for values that will cause the system to do things that don't make sense, as long as the system will keep running. That's what testing is for, and the user should test the system after making changes to the input data.
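
In other words, something like this (the 1-255 range just mirrors the example above; it isn't any real setting):

Code: Select all
def sanitize(value, lo=1, hi=255):
    """Clamp an out-of-range setting to the nearest valid value so the
    system keeps running; whether the result makes sense is left to testing."""
    return max(lo, min(hi, value))

sanitize(0)     # -> 1   (the value that would have caused the crash)
sanitize(300)   # -> 255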

Back on topic: I guess the only bad WU reports needed are ones that cause a program crash and results are not automatically sent by the client.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: When good work units go bad

Post by bruce »

Bobcat wrote:
Qinsp wrote:I used to write CAD software in a previous life, and one of my areas was importing files created by other systems.
I've been writing embedded software for over 30 years. I started to write a lengthy post about validating user input, but... Let's just say that if a bad value will cause a system reset or hang-up, I'll correct the value to something which will allow the system to run (e.g., if the valid range is 1 to 255, and 0 will cause a crash, I'll change the value to 1). I don't check for values that will cause the system to do things that don't make sense, as long as the system will keep running. That's what testing is for, and the user should test the system after making changes to the input data.

Back on topic: I guess the only bad WU reports needed are ones that cause a program crash and results are not automatically sent by the client.
Validating user input is simple compared to the problems we're talking about here. Qinsp's point is that if data from one system that uses B-splines has to be converted to/from a system that doesn't support B-splines, it is easy to write code that converts all of your test cases and then later to run into rare cases that simply don't convert in any reasonable way.

In the case of Venus crashing into the Sun, there will be cases where Venus is orbiting in a rather distorted ellipse. The first 20 or 30 Gens will all seem reasonable, and then Venus will happen to pass too close to Earth and its orbit will be distorted enough to send it into the sun.

There are no user inputs that are automatically "bad" or "good". The Pande Group cannot detect bad WUs except by running them. As any serious astronomer knows, eventually the planetary orbits will decay and they'll crash into the sun -- if you wait long enough -- or unless something else unexpected happens (the sun goes nova or some large object drops in from outside the solar system or ....)
Amaruk
Posts: 254
Joined: Fri Jun 20, 2008 3:57 am
Location: Watching from the Woods

Re: When good work units go bad

Post by Amaruk »

Bobcat wrote:Back on topic: I guess the only bad WU reports needed are ones that cause a program crash and results are not automatically sent by the client.
Since PG has no way of knowing about WUs that fail without reporting to the server, I would think it a good idea to report them regardless of their effect on the client or core.

Statistically speaking, I think the percentage of bad WUs is pretty low. In the last two and a half years I've folded 134850 WUs, give or take. About 40 of those have been bad. That's 1 bad WU for every 3371 completed.

That works out to 0.0003 - that's three hundredths of one percent. Given the nature of this project and how it's constantly pushing the boundaries of distributed computing, I find this rather remarkable. :D
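
A quick check of the arithmetic, using the numbers above:

Code: Select all
completed, bad = 134850, 40
rate = bad / completed                 # 0.000297..., i.e. about 0.03 %
print(f"{rate:.4%}  (~1 bad WU per {completed // bad} completed)")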