Project: 2671 (Run 36, Clone 26, Gen 73)

Moderators: Site Moderators, FAHC Science Team

alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Project: 2671 (Run 36, Clone 26, Gen 73)

Post by alpha754293 »

error/bombed out.

Here's the fahlog.txt

[15:47:02] - Preparing to get new work unit...
[15:47:02] + Attempting to get work packet
[15:47:02] - Connecting to assignment server
[15:47:02] - Successful: assigned to (171.67.108.24).
[15:47:02] + News From Folding@Home: Welcome to Folding@Home
[15:47:02] Loaded queue successfully.
[15:47:19] + Closed connections
[15:47:19] 
[15:47:19] + Processing work unit
[15:47:19] Core required: FahCore_a2.exe
[15:47:19] Core found.
[15:47:19] Working on queue slot 04 [January 26 15:47:19 UTC]
[15:47:19] + Working ...
[15:47:19] 
[15:47:19] *------------------------------*
[15:47:19] Folding@Home Gromacs SMP Core
[15:47:19] Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)
[15:47:19] 
[15:47:19] Preparing to commence simulation
[15:47:19] - Ensuring status. Please wait.
[15:47:20] Called DecompressByteArray: compressed_data_size=4840724 data_size=24028493, decompressed_data_size=24028493 diff=0
[15:47:20] - Digital signature verified
[15:47:20] 
[15:47:20] Project: 2671 (Run 36, Clone 26, Gen 73)
[15:47:20] 
[15:47:21] Assembly optimizations on if available.
[15:47:21] Entering M.D.
[15:47:30] Run 36, Clone 26, Gen 73)
[15:47:30] 
[15:47:30] Entering M.D.
[15:56:26] Completed 5008 out of 250000 steps  (2%)
[16:00:49] Completed 7508 out of 250000 steps  (3%)
[16:05:13] Completed 10008 out of 250000 steps  (4%)
[16:09:37] Completed 12508 out of 250000 steps  (5%)
[16:14:01] Completed 15008 out of 250000 steps  (6%)
[16:18:25] Completed 17508 out of 250000 steps  (7%)
[16:22:49] Completed 20008 out of 250000 steps  (8%)
[16:27:13] Completed 22508 out of 250000 steps  (9%)
[16:31:37] Completed 25008 out of 250000 steps  (10%)
[16:36:02] Completed 27508 out of 250000 steps  (11%)
[16:40:26] Completed 30008 out of 250000 steps  (12%)
[16:44:50] Completed 32508 out of 250000 steps  (13%)
[16:49:15] Completed 35008 out of 250000 steps  (14%)
[16:53:39] Completed 37508 out of 250000 steps  (15%)
[16:58:08] Completed 40008 out of 250000 steps  (16%)
[17:02:37] Completed 42508 out of 250000 steps  (17%)
[17:07:05] Completed 45008 out of 250000 steps  (18%)

Here's what it says in the console:

------------------------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 8

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_3]: aborting job:
application called MP
------------------------------------------------------------------------
Program mdrun. VERSION 3.3.99_development_200800503
Source code file: nsgrid.c , line: 358

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +- Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483649. It should have been within [ 0 .. 1540 ]
------------------------------------------------------------------------

Thanx for using GROMACS - Have a Nice Day

Error on node 5, will try to stop all the nodes
Halting parallel program mdrun on CPU 5 out of 8

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_5]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
I_Abort(MPI_COMM_WORLD, -1) - process 3
I_Abort(MPI_COMM_WORLD, -1) - process 5
Run stopped. No prompt. F@H Halted. Abnormal program termination.

Suggestions?
toTOW
Site Moderator
Posts: 6433
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Post by toTOW »

Isn't this the second time you've gotten this kind of error?

There's no data for this WU in the DB yet.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Post by alpha754293 »

toTOW wrote: Isn't this the second time you've gotten this kind of error?

There's no data for this WU in the DB yet.
Uh...honestly. Don't know.

The first time it said that it was because the molecule was unstable.

This time, I think it's ever so slightly different, in the sense that I may have encountered a diverging solution which caused the velocities of the molecules to go to inf./NaN.

So they're computationally different (if I understand what it's reporting/saying correctly, or at least interpreting it correctly).
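To make the grid check in that error message a bit more concrete, here is a minimal Python sketch of the idea. It is not the GROMACS nsgrid.c code: the function name grid_cell_index and the parameters box_length and n_cells are hypothetical, and it only handles one dimension. The point is simply that a coordinate that has gone to inf or NaN cannot be mapped to any cell, so the grid assignment is a natural place for the problem to surface.

```python
import math


def grid_cell_index(x, box_length, n_cells):
    """Map a 1-D coordinate to a cell index in [0, n_cells - 1].

    Hypothetical helper for illustration only; a real neighbor-search
    grid is 3-D and handles periodic boundaries.
    """
    # A diverged simulation can hand us inf or NaN; no cell exists for those.
    if not math.isfinite(x):
        raise ValueError(f"coordinate {x!r} is not finite, cannot grid it")

    ci = math.floor(x / box_length * n_cells)

    # The range check: the index must name a real cell.
    if not 0 <= ci < n_cells:
        raise ValueError(
            f"cell index {ci} should have been within [0 .. {n_cells - 1}]"
        )
    return ci
```

A particle sitting at x = 2.5 in a box of length 10.0 split into 4 cells lands in cell 1, while a NaN coordinate is rejected before any index is computed.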

Instabilities can be detected via the FFT (I'm like...making a wild guess here) within the code.

Velocity is usually derived from position (as its first time derivative), so while the two cases may end up with similar errors, the causes can be very different and mean very different things altogether.

I couldn't scroll back through the console output because I was running a text-only console.

I miss the good old days of the DEC/Alpha/VT100 terminals where you could scroll back up and then copy&paste. *sigh*
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Post by bruce »

alpha754293 wrote: Instabilities can be detected via the FFT (I'm like...making a wild guess here) within the code.
Only if FFT is part of the FahCore you're running. Different cores use different computational methods.
alpha754293 wrote: Velocity is usually derived from position (as its first time derivative), so while the two cases may end up with similar errors, the causes can be very different and mean very different things altogether.
Not to belabor the point, but positions and velocities are both obtained by numerical integration, not numerical differentiation -- with suitable adjustments for Brownian motion.
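As a rough illustration of the integration point, here is a minimal sketch of a velocity Verlet loop for a single particle on a spring. This is a toy under stated assumptions, not what FahCore_a2 or GROMACS actually runs: the function names, the 1-D harmonic force, and the parameters k and m are made up for the example. It shows position and velocity both being advanced by integrating the acceleration, and how a time step that is too large makes the solution diverge until the values stop being finite.

```python
import math


def velocity_verlet(x, v, dt, n_steps, k=1.0, m=1.0):
    """Integrate a 1-D harmonic oscillator (force = -k * x).

    Illustrative only: k, m and the force law are assumptions.
    """
    a = -k * x / m
    for _ in range(n_steps):
        # Position comes from integrating velocity and acceleration.
        x += v * dt + 0.5 * a * dt * dt
        a_new = -k * x / m
        # Velocity comes from integrating the (averaged) acceleration.
        v += 0.5 * (a + a_new) * dt
        a = a_new
    return x, v


def energy(x, v, k=1.0, m=1.0):
    """Total energy; roughly constant when the integration is stable."""
    return 0.5 * m * v * v + 0.5 * k * x * x


if __name__ == "__main__":
    for dt in (0.01, 2.5):
        x, v = velocity_verlet(1.0, 0.0, dt, 1000)
        print(f"dt={dt}: x={x}, finite={math.isfinite(x)}, E={energy(x, v)}")
```

With dt = 0.01 the energy stays close to its starting value of 0.5. With dt = 2.5 the same loop grows without bound and ends up at inf or NaN, which is the kind of state a later range check would trip over.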
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Post by alpha754293 »

bruce wrote:
alpha754293 wrote: Instabilities can be detected via the FFT (I'm like...making a wild guess here) within the code.
Only if FFT is part of the FahCore you're running. Different cores use different computational methods.
alpha754293 wrote: Velocity is usually derived from position (as its first time derivative), so while the two cases may end up with similar errors, the causes can be very different and mean very different things altogether.
Not to belabor the point, but positions and velocities are both obtained by numerical integration, not numerical differentiation -- with suitable adjustments for Brownian motion.
Integral ONLY if velocities are calculated first.

But considering that it's supposed to match up to some sort of grid, I would think that the program's probably tracking the time-dependent positions, and taking the derivative in order to obtain the velocity.

On the other hand, if it is calculating the velocities first, then yes, you are absolutely correct. I have no idea how they would solve the momentum equations (if that's indeed what they're using) to obtain the velocities.

I would think that FFTs would be one of the quicker ways (again, wild guess here) to determine if there are any vibrational characteristics. You track the molecule's position as a function of time, and given that we're talking time scales of 10^-12, I would think that it wouldn't take long to get FFT results.
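To show what "FFT the position over time" would look like in practice, here is a small self-contained Python sketch. Nothing in this thread confirms that the FahCore or GROMACS detects instabilities this way, so treat it purely as an illustration of the guess above. The function names fft and dominant_frequency are hypothetical, the FFT is a plain recursive radix-2 version (input length must be a power of two), and the units of dt are whatever the position samples use.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; length must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(positions, dt):
    """Return the strongest non-zero frequency in a position trace."""
    n = len(positions)
    mean = sum(positions) / n
    # Remove the average position so the zero-frequency bin does not dominate.
    spectrum = fft([p - mean for p in positions])
    # Only the first half of the spectrum is unique for real-valued input.
    k_peak = max(range(1, n // 2), key=lambda k: abs(spectrum[k]))
    return k_peak / (n * dt)


if __name__ == "__main__":
    dt = 1.0 / 64
    trace = [math.cos(2 * math.pi * 8 * i * dt) for i in range(64)]
    print(dominant_frequency(trace, dt))
```

The demo feeds in a clean cosine with 8 cycles per unit time, sampled 64 times, and the peak lands on that frequency. A real trajectory would be far noisier and would need many more samples before a peak meant anything.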

I have no idea if F@H even took the FFT part out of the GROMACS core. *shrug* Who knows. Based on the GROMACS v4 user's manual, that's all I can find out. That, and apparently they're still working on implementing FFTW (3D decomposition for FFTs, needed for the PME algorithm). Source: wiki.gromacs.org

*shrug*

I'm a mechanical engineer by training, so this stuff pertaining to MD and F@H and biochemistry/computational chemistry/coding/programming -- it's over my head.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Post by alpha754293 »

Getting back on topic:

Do I keep the WU? Send it back to Pande Group? Purge it? Please let me know whenever you can. Thanks.
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760W Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicone fan gaskets and silicone mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Post by 7im »

No specific user intervention is needed. The client is designed to handle errors as appropriate. Restart the client. It will either dump the WU, and download a new WU, or it will upload partial results, and then get a new WU. In either case, the server sees that you requested a new WU, and that is noted in the server logs. That's enough for Pande Group to act on that WU if they so choose.

Speaking from past experience, NaN errors are typically related to hardware problems in the computer. It doesn't mean the hardware is bad, but it might be. A loose DIMM is just as problematic as having the incorrect RAM voltage set in the BIOS, or having the RAM timings set too aggressively.

And if another user completes these work units to 100 percent, that would be another indication there is a system problem. If others error out at the same place, then it's likely a bad WU. Be nice to the Mods and Admins, and they might check the WU logs for you again in a few days to see which way it went. :twisted:
Tell me and I forget. Teach me and I remember. Involve me and I learn.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Post by alpha754293 »

7im wrote: No specific user intervention is needed. The client is designed to handle errors as appropriate. Restart the client. It will either dump the WU, and download a new WU, or it will upload partial results, and then get a new WU. In either case, the server sees that you requested a new WU, and that is noted in the server logs. That's enough for Pande Group to act on that WU if they so choose.

Speaking from past experience, NaN errors are typically related to hardware problems in the computer. It doesn't mean the hardware is bad, but it might be. A loose DIMM is just as problematic as having the incorrect RAM voltage set in the BIOS, or having the RAM timings set too aggressively.

And if another user completes these work units to 100 percent, that would be another indication there is a system problem. If others error out at the same place, then it's likely a bad WU. Be nice to the Mods and Admins, and they might check the WU logs for you again in a few days to see which way it went. :twisted:
I am nice. :D lol. j/k. sorta.

Interestingly enough, I had ONE case where I was running the client and stopped it. I re-ran it and it bombed out. Then I re-ran it again, and it worked.

I'm so used to just freezing any/all transactions on something that's failed/bombed out computationally, in case there's something that can be read/processed, etc. within whatever it was that failed in order to try and pinpoint the cause of failure.

It's like computational forensics after the program has died, you know?

*edit*
Here's something for your reading pleasure -- I just restarted that client right now (without purging the failed WU) AND so far it's running. Granted, it's only been 15 minutes or so, but....*shrug* *sigh* who knows what's going on there.

*throws arms up in air* I have NOOOO idea.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Post by alpha754293 »

It's been running for 9 hours since I restarted the client. No further hitches so far.

I wonder what the heck happened originally.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Post by alpha754293 »

WU completed successfully. Anybody here as confused as I am?
uncle_fungus
Site Admin
Posts: 1288
Joined: Fri Nov 30, 2007 9:37 am
Location: Oxfordshire, UK

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Post by uncle_fungus »

It could have been a random computational error. These things happen ;)