p2671 -- all Gen 17


uncle_fungus
Site Admin
Posts: 1288
Joined: Fri Nov 30, 2007 9:37 am
Location: Oxfordshire, UK

Re: p2671 -- all Gen 17

Post by uncle_fungus »

bollix47 wrote:Try:

Code:

./fah6 | tee -a foldinglog.txt
I just tried it and it seems to be working .... output shows in console and is being written to a file called foldinglog.txt. The -a option will append to the file rather than overwrite it.
This being Linux, you can do it another way too, which I find slightly more flexible:

Code:

./fah6 1>>foldinglog.txt 2>&1 &
That appends stdout and stderr into foldinglog.txt

I then attach to and detach from that logfile at will with

Code:

tail -f foldinglog.txt
The -f is key.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: p2671 -- all Gen 17

Post by alpha754293 »

uncle_fungus wrote:
bollix47 wrote:Try:

Code:

./fah6 | tee -a foldinglog.txt
I just tried it and it seems to be working .... output shows in console and is being written to a file called foldinglog.txt. The -a option will append to the file rather than overwrite it.
This being Linux, you can do it another way too, which I find slightly more flexible:

Code:

./fah6 1>>foldinglog.txt 2>&1 &
That appends stdout and stderr into foldinglog.txt

I then attach to and detach from that logfile at will with

Code:

tail -f foldinglog.txt
The -f is key.
Good to know. tee seems to be working for the time being though.

Would there be any real functional difference between using tee and the method described above?
tear
Posts: 254
Joined: Sun Dec 02, 2007 4:08 am
Hardware configuration: None
Location: Rocky Mountains

Re: p2671 -- all Gen 17

Post by tear »

alpha754293 wrote:
tear wrote:Alpha -- if you add the following line to your /etc/sysctl.conf

Code:

kernel.randomize_va_space = 0
and call

Code:

sysctl -p
from root, does it make any difference? [it requires (re-)starting the client tho]

Cheers,
tear
k...did that. added the line to /etc/sysctl.conf and restarted the clients.

How will I tell if there's a difference?
I'd say the difference would be a better WU completion rate than before (assuming what you
encountered was something other than the checkpoint issue).

tear
One man's ceiling is another man's floor.
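
For anyone following along, a quick way to confirm the sysctl change actually took effect is to read the value back. This is only a minimal sketch, assuming a Linux system that exposes /proc/sys/kernel/randomize_va_space; describe_aslr is a made-up helper name, and the value meanings are the ones given in the kernel's sysctl documentation.

```shell
# describe_aslr is a hypothetical helper: it maps a kernel.randomize_va_space
# value to the meaning given in the Linux kernel sysctl documentation.
describe_aslr() {
  case "$1" in
    0) echo "0: address space randomization off" ;;
    1) echo "1: mmap base, stack and VDSO randomized" ;;
    2) echo "2: as 1, plus heap randomized" ;;
    *) echo "$1: unrecognized value" ;;
  esac
}

# Read the live value back (Linux only; the file does not exist elsewhere).
aslr_file=/proc/sys/kernel/randomize_va_space
if [ -r "$aslr_file" ]; then
  describe_aslr "$(cat "$aslr_file")"
fi
```

The same value is reported by `sysctl kernel.randomize_va_space`. A process keeps the address layout it was started with, which is why the clients have to be restarted after the change.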
uncle_fungus
Site Admin
Posts: 1288
Joined: Fri Nov 30, 2007 9:37 am
Location: Oxfordshire, UK

Re: p2671 -- all Gen 17

Post by uncle_fungus »

alpha754293 wrote:Would there be any real functional difference between using tee and the method described above?
Front-end wise, yes. The method I described will allow you to attach to and detach from the output at will, i.e. you don't need to have the output on the screen all the time as with tee. My method redirects everything into a file, which you then read with `tail -f`.
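
To make the difference concrete, here is a small sketch of the redirect-and-tail approach. fake_client is a hypothetical stand-in for ./fah6 so the idea can be tried without the real client, and run_logged and LOGGED_PID are names invented for this example.

```shell
# Hypothetical stand-in for ./fah6: one line to stdout, one to stderr.
fake_client() {
  echo "progress line on stdout"
  echo "warning line on stderr" >&2
}

# Run a command in the background, appending stdout and stderr to a log.
# The order matters: ">>file" must come before "2>&1"; the other way round,
# stderr is duplicated from the original stdout (usually the terminal).
run_logged() {
  logfile=$1
  shift
  "$@" >>"$logfile" 2>&1 &
  LOGGED_PID=$!   # remember the background job's PID
}

log=$(mktemp)
run_logged "$log" fake_client
wait "$LOGGED_PID"   # the real client keeps running; the stand-in exits at once
cat "$log"           # with a long-running client you would use: tail -f "$log"
rm -f "$log"
```

Pressing Ctrl-C on `tail -f` only stops the viewer, while the backgrounded client carries on writing to the log. In a foreground `./fah6 | tee -a foldinglog.txt` pipeline, Ctrl-C interrupts the client as well.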
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: p2671 -- all Gen 17

Post by alpha754293 »

tear wrote:
I'd say the difference would be a better WU completion rate than before (assuming what you
encountered was something other than the checkpoint issue).

tear
I have no recorded history of completing P2671G17 WUs. Therefore, I wouldn't be able to tell whether it's faster or slower.

From what I can also see, the issue isn't whether it's completing them or not, it's whether they're starting properly or not.

It might, however, reduce some of the segfaults that I've been getting with this and other WUs. We shall have to wait and see.

And to complicate matters: if it weren't running on my quad Opteron 880, we could say that it was hardware-limited. But now that it's running on both that and my other system, that rules out a hardware-related issue.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: p2671 -- all Gen 17

Post by alpha754293 »

P.S. There's got to be something else wrong with it too.

I'm watching the PPD numbers and they're about HALF of what I normally get on the systems despite it being an a2 WU.

There's definitely something up with these WUs.

e.g.
on quad 880
computenode CPU1: P2676R2C76G82 = 3141 PPD
computenode CPU2: P2671R28C33G17 = 1570 PPD

both are a2 WUs. both clients are running with:
-smp 4 -verbosity 9
tear
Posts: 254
Joined: Sun Dec 02, 2007 4:08 am
Hardware configuration: None
Location: Rocky Mountains

Re: p2671 -- all Gen 17

Post by tear »

As a datapoint --

So far I have been assigned and successfully completed (at normal speed)
the following P2671/G17 WUs:

Project: 2671 (Run 0, Clone 75, Gen 17)
Project: 2671 (Run 6, Clone 70, Gen 17)
Project: 2671 (Run 7, Clone 68, Gen 17)
Project: 2671 (Run 7, Clone 75, Gen 17)
Project: 2671 (Run 8, Clone 26, Gen 17)
Project: 2671 (Run 8, Clone 86, Gen 17)
Project: 2671 (Run 23, Clone 28, Gen 17)

Currently crunching:
Project: 2671 (Run 2, Clone 57, Gen 17) (normal speed)
Project: 2671 (Run 8, Clone 68, Gen 17)
Project: 2671 (Run 10, Clone 52, Gen 17)
Project: 2671 (Run 11, Clone 38, Gen 17)
Project: 2671 (Run 11, Clone 77, Gen 17)
Project: 2671 (Run 13, Clone 68, Gen 17)
Project: 2671 (Run 14, Clone 68, Gen 17)
Project: 2671 (Run 24, Clone 72, Gen 17)
Project: 2671 (Run 24, Clone 95, Gen 17)
Project: 2671 (Run 25, Clone 36, Gen 17)
Project: 2671 (Run 26, Clone 43, Gen 17)
Project: 2671 (Run 28, Clone 84, Gen 17) (normal speed)


tear
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: p2671 -- all Gen 17

Post by alpha754293 »

I'm guessing that unless otherwise specified, the WUs are running at something OTHER than normal speed?
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: p2671 -- all Gen 17

Post by kasson »

We have 2248 successfully returned gen17's for P2671. That said, it's always possible that there's something wrong with more recently completed gen 16's.
From the log posted, the checkpoint read is definitely a problem. Make sure all the checkpoints are cleared before loading. That could also be causing the unitinfo/percent complete problem.

If people have any sense of whether the _prev.cpt causes this problem, that would be helpful to know.
tear
Posts: 254
Joined: Sun Dec 02, 2007 4:08 am
Hardware configuration: None
Location: Rocky Mountains

Re: p2671 -- all Gen 17

Post by tear »

alpha754293 wrote:I'm guessing that unless otherwise specified, the WUs are running at something OTHER than normal speed?
I haven't checked other ones TBH (just the "Runs" you are folding). Pardon me for not being clear enough.
It's just easier for me to check the speed once WU has completed.

tear
tear
Posts: 254
Joined: Sun Dec 02, 2007 4:08 am
Hardware configuration: None
Location: Rocky Mountains

Re: p2671 -- all Gen 17

Post by tear »

kasson wrote:We have 2248 successfully returned gen17's for P2671. That said, it's always possible that there's something wrong with more recently completed gen 16's.
From the log posted, the checkpoint read is definitely a problem. Make sure all the checkpoints are cleared before loading. That could also be causing the unitinfo/percent complete problem.

If people have any sense of whether the _prev.cpt causes this problem, that would be helpful to know.
Upon completion of the last couple of units (-oneunit), I removed the work/*_prev.cpt files and started the clients again.
Each client picked up a fresh unit, and unitinfo eventually* got populated with progress way greater than 100%.

*) read: the moment the client (or is it the core?) wrote unitinfo.txt for the new unit for the first time

There were no .cpt files in clients' root directories.


tear
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: p2671 -- all Gen 17

Post by alpha754293 »

kasson,
The initial P2671G17 starts were NOT from a clean start, i.e. there were multiple client stops and restarts prior to that.

As mentioned before, I cleared *.pdb, queue.dat, and the work/ folder recursively in order to "unclog" the client. That seems to have worked, and it does seem to be running now. However, it appears to be running at approximately half of its usual speed compared to other projects/WUs.

Frame times for computenodeCPU1 for P2671R50C61G16:

Code:

[22:00:20] Completed 97500 out of 250000 steps  (39%)
[22:09:04] Completed 100000 out of 250000 steps  (40%)
[22:17:48] Completed 102500 out of 250000 steps  (41%)
[22:26:29] Completed 105000 out of 250000 steps  (42%)
Frame times for computenodeCPU2 for P2671R28C33G17:

Code:

[13:15:20] Completed 0 out of 250001 steps  (0%)
[13:29:11] Completed 2501 out of 250001 steps  (1%)
[13:43:01] Completed 5001 out of 250001 steps  (2%)
[13:56:53] Completed 7501 out of 250001 steps  (3%)
[14:10:42] Completed 10001 out of 250001 steps  (4%)
However, because I had stopped the client, modified /etc/sysctl.conf, and reloaded that file, these are the current frame times for P2671R28C33G17:

Code:

[14:20:05] Completed 0 out of 250001 steps  (0%)
[14:28:56] Completed 2501 out of 250001 steps  (1%)
[14:37:47] Completed 5001 out of 250001 steps  (2%)
[14:46:38] Completed 7501 out of 250001 steps  (3%)
[14:55:29] Completed 10001 out of 250001 steps  (4%)
...
[22:00:40] Completed 130001 out of 250001 steps  (52%)
[22:09:36] Completed 132501 out of 250001 steps  (53%)
[22:18:30] Completed 135001 out of 250001 steps  (54%)
[22:27:26] Completed 137501 out of 250001 steps  (55%)
It is becoming more consistent with what it should be getting but according to the FahMon calculations, the PPD rate is still about half. So, at this point I don't know if it's something to do with the WU itself, the changes in the configuration, or FahMon.
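
For reference, the frame times can be pulled straight out of log excerpts like the ones above instead of being read off by eye. A rough sketch, assuming the log lines look exactly like the "Completed N out of M steps" lines quoted in this thread; frame_times is a hypothetical helper name, not part of the client or FahMon.

```shell
# Print the seconds elapsed between consecutive "Completed ..." log lines.
# Expects lines such as: [13:15:20] Completed 0 out of 250001 steps  (0%)
frame_times() {
  awk '
    /Completed [0-9]+ out of [0-9]+ steps/ {
      # $1 is "[HH:MM:SS]"; take the 8 characters after the bracket.
      split(substr($1, 2, 8), t, ":")
      now = t[1] * 3600 + t[2] * 60 + t[3]
      if (seen) {
        d = now - prev
        if (d < 0) d += 86400   # the log crossed midnight
        print d
      }
      prev = now
      seen = 1
    }
  '
}

# First three lines of the P2671R28C33G17 log from before the sysctl change.
frame_times <<'EOF'
[13:15:20] Completed 0 out of 250001 steps  (0%)
[13:29:11] Completed 2501 out of 250001 steps  (1%)
[13:43:01] Completed 5001 out of 250001 steps  (2%)
EOF
# prints 831 and 830 (about 13mn 50s per frame)
```

Fed the excerpt taken after the restart, the same helper gives 531 seconds per frame (8mn 51s). Note that it simply compares consecutive lines, so a client restart in the middle of a log would show up as one bogus interval.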

Needless to say, the slower frame rates can't be artificially generated, which seems to be more indicative of an underlying problem.

On my other system that's also currently working on a P2671G17 WU, the frame times are close:

Code:

[22:00:37] Completed 205001 out of 250001 steps  (82%)
[22:08:17] Completed 207501 out of 250001 steps  (83%)
[22:15:55] Completed 210001 out of 250001 steps  (84%)
[22:23:35] Completed 212501 out of 250001 steps  (85%)
but FahMon is also showing that to be lower PPD. No previous log is available for other WUs.

Here are some other statistics via FahMon:

Code:

 -- computenode CPU1 --

 Min. Time / Frame : 4mn 23s  - 3153.76 ppd
 Avg. Time / Frame : 6mn 00s  - 2304.00 ppd
 Cur. Time / Frame : 8mn 42s  - 1588.97 ppd
 R3F. Time / Frame : 8mn 42s  - 1588.97 ppd
 Eff. Time / Frame : 10mn 19s  - 1339.97 ppd


 -- computenode CPU2 --

 Min. Time / Frame : 4mn 46s  - 2900.14 ppd
 Avg. Time / Frame : 8mn 32s  - 1620.00 ppd
 Cur. Time / Frame : 8mn 56s  - 1547.46 ppd
 R3F. Time / Frame : 8mn 55s  - 1550.36 ppd
 Eff. Time / Frame : 5mn 03s  - 2737.43 ppd


 -- OPTERON3 CPU --

 Min. Time / Frame : 7mn 23s  - 1872.33 ppd
 Avg. Time / Frame : 7mn 24s  - 1868.11 ppd
 Cur. Time / Frame : 7mn 39s  - 1807.06 ppd
 R3F. Time / Frame : 7mn 39s  - 1807.06 ppd
 Eff. Time / Frame : 4mn 14s  - 3265.51 ppd
Hope it helps.

And for comparison, stats (via FahMon) for P2669 on computenode CPU1:

Code:

 -- computenode CPU1 --

 Min. Time / Frame : 4mn 21s  - 6355.86 ppd
 Avg. Time / Frame : 7mn 29s  - 3694.61 ppd
 Cur. Time / Frame : 8mn 58s  - 3083.42 ppd
 R3F. Time / Frame : 8mn 55s  - 3100.71 ppd
 Eff. Time / Frame : 10mn 48s  - 2560.00 ppd
Note the difference in PPD despite the frame times.
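
One way to read those numbers: if FahMon derives PPD as (credit per WU) x (frames per day) / 100, then each frame time / PPD pair implies a per-WU credit. That formula is an assumption here, since the thread never states it, and implied_credit is a hypothetical helper name. A minimal sketch that back-solves the credit from the figures quoted above:

```shell
# Back-solve the per-WU credit implied by a frame time and a PPD figure,
# assuming PPD = credit * (86400 / frame_seconds) / 100  (100 frames per WU).
# implied_credit is a hypothetical helper name.
implied_credit() {
  awk -v s="$1" -v ppd="$2" 'BEGIN { printf "%.0f\n", ppd * s * 100 / 86400 }'
}

implied_credit 522 1588.97   # P2671, computenode CPU1, 8mn 42s -> prints 960
implied_credit 538 3083.42   # P2669, computenode CPU1, 8mn 58s -> prints 1920
```

Under that assumption the P2671 figures come out near 960 and the P2669 figure near 1920, so roughly half the PPD at similar frame times could follow from the credit value FahMon is using rather than from the WU running slowly. This is an inference from the FahMon output quoted here, not a confirmed point value.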
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: p2671 -- all Gen 17

Post by alpha754293 »

tear wrote:
alpha754293 wrote:I'm guessing that unless otherwise specified, the WUs are running at something OTHER than normal speed?
I haven't checked other ones TBH (just the "Runs" you are folding). Pardon me for not being clear enough.
It's just easier for me to check the speed once WU has completed.

tear
I'm not sure I understand this correctly.

Of the WUs that you have listed that are currently running, P2671R28C33G17 isn't listed, and neither is P2671R21C36G17.

So I am not sure what you mean by "the runs that I am folding."
tear
Posts: 254
Joined: Sun Dec 02, 2007 4:08 am
Hardware configuration: None
Location: Rocky Mountains

Re: p2671 -- all Gen 17

Post by tear »

alpha754293 wrote:
tear wrote:
alpha754293 wrote:I'm guessing that unless otherwise specified, the WUs are running at something OTHER than normal speed?
I haven't checked other ones TBH (just the "Runs" you are folding). Pardon me for not being clear enough.
It's just easier for me to check the speed once WU has completed.

tear
I'm not sure I understand this correctly.

Of the WUs that you have listed that are currently running, P2671R28C33G17 isn't listed, and neither is P2671R21C36G17.

So I am not sure what you mean by "the runs that I am folding."
I was referring to R+G numbers (not including clone numbers).
ToTow had initially suggested problem might be tied to particular "runs"
and that's how I interpreted his words; incorrectly perhaps.
alpha754293 wrote:e.g.
on quad 880
computenode CPU1: P2676R2C76G82 = 3141 PPD
computenode CPU2: P2671R28C33G17 = 1570 PPD
I mistook P2676 for P2671 (oops).
Still, P2671R28 is running at normal speed (your C = 33, mine = 84).
Anyway, it clearly seems I should rest for a while.


tear
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: p2671 -- all Gen 17

Post by alpha754293 »

tear

What's weird is that they seem to be running at the proper speed, but the PPD calculations are showing it as being half speed. So I don't know.

Hard to say. Hard to tell.

That, and running P2671R28C33G17 two different times -- one had a frame time of ~14 minutes, and the other ~9 minutes, despite it being on the same hardware, same everything except for the modification of /etc/sysctl.conf.

And it only seems to affect that particular run so far; everything else seems to be running normally. *shrug* Don't know.