Tuesday, November 21, 2006

CLOSE_WAIT strikes again!

Since Wednesday, the number of TCP connections on one of our dCache door nodes that remain in a CLOSE_WAIT state has been steadily increasing. Once the number of CLOSE_WAITs reached the MaxLogin value (100) for the door, there was no further increase. For information, I upgraded to v1.7.0-19 of dcache-server on Wednesday. UKI-NORTHGRID-LANCS-HEP are also experiencing the CLOSE_WAIT problem.
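
For anyone wanting to track this without ganglia, here is a minimal sketch of counting sockets per TCP state from the text of /proc/net/tcp on Linux. The function name, the sample rows (trimmed to their first four columns) and the choice of port 2811 (the usual GridFTP port) are assumptions for illustration, not taken from our setup.

```python
from collections import Counter

# A few of the state codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "08": "CLOSE_WAIT", "0A": "LISTEN"}


def count_tcp_states(proc_net_tcp_text, local_port=None):
    """Count sockets per TCP state, optionally restricted to one local port."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_addr, state_hex = fields[1], fields[3]
        port = int(local_addr.rsplit(":", 1)[1], 16)  # port is hex after the colon
        if local_port is not None and port != local_port:
            continue
        counts[TCP_STATES.get(state_hex.upper(), state_hex)] += 1
    return counts


# Hypothetical sample, trimmed to the first four columns of the real file.
SAMPLE = """\
  sl  local_address rem_address   st
   0: 0100007F:0AFB 0200007F:C350 08
   1: 0100007F:0AFB 0200007F:C351 08
   2: 0100007F:0AFB 0200007F:C352 01
   3: 0100007F:0016 0200007F:C353 01
"""

print(count_tcp_states(SAMPLE, local_port=2811))
```

On a real node the input would be the contents of /proc/net/tcp. Note that this file only lists IPv4 sockets; sockets opened on IPv6 appear in /proc/net/tcp6 instead.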

I should also add that at ~0409 this morning all of the dCache processes on one of my door nodes and the head node died simultaneously. You can see the sharp drop in CLOSE_WAITs in the ganglia plot as the door process stopped. Initially I thought this was connected to the CLOSE_WAIT issue, but deeper investigation showed that dcache-server v1.7.0-20 was released yesterday; it was automatically installed on my nodes, which caused the running services to stop working. I know that it must have been the update since the dCache services continued to run on the one node which did not have automatic updates enabled.

Before the services stopped, the log files contained references to runIO errors:

11/21 04:06:08 Cell(c-100@gridftp-pool1Domain) : runIO :
java.lang.InterruptedException: runIo has been interrupted
11/21 04:06:08 Cell(c-100@gridftp-pool1Domain) : connectionThread

Automatic updates of dCache have been turned off. After an upgrade you must always re-run the YAIM configure_node function or run the dCache install.sh script. Otherwise tomcat will not start up due to a problem with the definition of the JAVA_HOME variable (even if it has not changed during the update).

So, in summary:

1. The CLOSE_WAIT issue is very much alive and kicking in v1.7.0 of dCache.

2. It is highly recommended to update your dCache by hand to make sure that everything comes back up.

Friday, November 17, 2006

We're back!

I had cfengine installing and configuring the CE and the batch system by early Wednesday, but neither the gatekeeper nor the information system was working at all. At that point I had to leave things (teaching, writing exam questions and having teeth removed (ouch!)), so I didn't get a chance to look at the gatekeeper until Thursday afternoon.

I couldn't find any significant difference between the files on the old production site and the new system, so what was wrong? Eventually I attached an strace to the gatekeeper which revealed it could not open the jobmanager-fork file. Checked the permissions and they were 0600!

Turns out that cfengine runs shell scripts with a very aggressive umask of 077, so it had created many configuration files only readable by root. Gatekeeper forks the process as the pool account and the job falls flat on its face...
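
To make the mechanism concrete, here is a small sketch (POSIX only) showing how the umask alone decides the final permissions of a newly created file. The helper name and the file name are made up for illustration; the point is only that a request for 0666 under umask 077 ends up as 0600.

```python
import os
import stat
import tempfile


def mode_created_under(umask, requested=0o666):
    """Create a file under the given umask and return its permission bits."""
    old = os.umask(umask)  # os.umask returns the previous value
    try:
        with tempfile.TemporaryDirectory() as d:
            path = os.path.join(d, "example.conf")  # hypothetical file name
            fd = os.open(path, os.O_CREAT | os.O_WRONLY, requested)
            os.close(fd)
            return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(old)  # always restore the caller's umask


print(oct(mode_created_under(0o077)))  # the aggressive umask
print(oct(mode_created_under(0o022)))  # a more typical default
```

A script that needs world-readable output can set its own umask at the top rather than relying on whatever it inherits from the process that launched it.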

This is definitely a cfengine gotcha!

Judicious use of find for non-world readable files allowed me to fix things up.
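
I did this with find, but the same check can be sketched in Python: walk a directory tree and report anything that does not have the "other" read bit set. The function name is an assumption for illustration, and which trees to scan is left to the reader.

```python
import os
import stat


def not_world_readable(root):
    """Yield paths under root whose 'other' read permission bit is not set."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode):
                continue  # a symlink's own mode bits are not meaningful
            if not mode & stat.S_IROTH:
                yield path
```

Calling `list(not_world_readable("/some/config/dir"))` gives a list to review before changing any permissions; some files (keys, for instance) are supposed to stay private.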

The information system was suffering from the same problem, as it was unable to execute the lcg-info-generic script.

It seems this was also causing job aborts for grid jobs on the WNs, as they too were running YAIM with the bad umask. I blasted them with kickstart and cfengine this morning and reinstalled the whole cluster in < 1 hour. Cool!

Tuesday, November 14, 2006

Subject: batch system completely screwed
Date: Mon, 13 Nov 2006 16:19:31 +0000

David, Tony

I cannot seem to get the batch system stable after this weekend's mess (see http://scotgrid.blogspot.com/2006/11/cluster-had-its-first-weekend-down.html).

torque and maui will run for some 10s of minutes and then lock up and need restarted. Then they run for a while more and die again. I have suspicions that some of the worker nodes are bouncing jobs, but with the general unreliability of the system at the moment this is hard to demonstrate one way or the other.

Clearly there's something deeply screwed up here. The rate of dying nodes with MCE errors probably isn't helping anyone.

I don't seem to have any option but to put the site into downtime and completely gut the batch system.

However, I will also take the opportunity to use cfengine to redo the CE and incorporate local type accounts into the batch system as well.

Various deeply offensive words should be said about the unhelpfulness of torque and maui in this situation.

On the bright side, cfengine is now doing a splendid job on the worker nodes and this should be extended.



Monday, November 13, 2006

The cluster had its first weekend down - / filled up on the CE so that jobs could not start or end properly as sandboxes could not be gridftped. The information system also went belly up, as the plugins could not write their data. Some processes were clearly hanging as ganglia was reporting a load average of 100+ by this morning, so the node needed power cycled to come back to life.

Since then I have spent the most frustrating time cleansing the batch system of all the dead/zombified batch jobs. This was made worse by the fact that torque seemed to hang when it detected a job which had exceeded its wallclock time. It seemed to try and delete it on the worker node, but of course the job was long since gone, and it did not return. I wasted a lot of time thinking it was maui which was hanging - as maui does hang if torque is not responding properly.

By the time I realised all of this I had rebooted several times and even put the site into downtime in an attempt to purge things.

There seems to be no good way to clear up this mess. Manually checking a few of the jobs which were in state 0 cpu used but running, it was clear they were no longer on the worker nodes, but there seemed to be no good way to automatically probe this status. In the end I had to pick all the old jobs (id < 26800) which were showing 0 cpu time used, and were in state "R", and force their deletion with "qdel -p".
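
The selection rule I ended up applying can be written down as a small filter. This is only a sketch: the job records are hypothetical dictionaries (how you get them out of the batch system is not shown), and the field names are my own. It prints the "qdel -p" commands rather than running them, so the list can be checked first.

```python
def stale_job_ids(jobs, max_id=26800):
    """Return ids of jobs that are 'R'unning, have used 0 CPU time,
    and have an id below max_id (i.e. predate the outage)."""
    return [
        job["id"]
        for job in jobs
        if job["id"] < max_id and job["state"] == "R" and job["cput_seconds"] == 0
    ]


# Hypothetical records for illustration only.
jobs = [
    {"id": 26512, "state": "R", "cput_seconds": 0},     # stale: selected
    {"id": 26790, "state": "R", "cput_seconds": 5400},  # genuinely running
    {"id": 26795, "state": "W", "cput_seconds": 0},     # waiting, handled separately
    {"id": 26910, "state": "R", "cput_seconds": 0},     # new job, just started
]

for job_id in stale_job_ids(jobs):
    print(f"qdel -p {job_id}")
```

The id cutoff matters: a freshly started job also shows 0 CPU time in state "R", so without it healthy new jobs would be purged too.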

Finally, there were a number of jobs in state "W" which must have come in this morning when things were flaky, which could not be started properly so they had to be qdeled too (see last week's post).

As remedial action, I have moved /var onto the large disk partition and made a soft link in /. Compression of logfiles has been enabled, and at the next downtime /home will also be moved.

Friday, November 10, 2006

For those of you who are interested, ScotGRID-Edinburgh is now consistently passing the SAM/SFT replica management tests. Looking at the SAM results we have been 'green' for the past week. Digging into the logs it looks like the remote SE being used by the tests has been changing more frequently of late; RAL, CNAF, SARA as well as our old friend lxn1183.cern.ch are mentioned. Even for lxn1183 the tests are passing. Nothing has changed in the configuration of our dCache so (unfortunately) I can't take the credit for solving the mystery.

Hopefully this will be the end of the problem; however, with this being WLCG I won't be surprised if it rears its ugly head again. You have been warned...

Why do we need cfengine?

I got an email from Rod Walker (ATLAS) alerting us to the fact that two workers had become job blackholes. Investigation showed they didn't have CA certificates installed, so they would not globus-url-copy a job's output back to the gatekeeper.

Now, I did run a distributed shell install of lcg-CA on these nodes, but with 100 batch workers I failed to notice that this had somehow failed on 2 of the nodes.

With cfengine we have built into the system:

lcg-CA action=install
glite-yaim action=install elsedefine=runyaim

So it will check every hour if lcg-CA is properly installed and install it for us if it is not.

Note also how we can define the "runyaim" class when yaim is first installed - this will run yaim automatically after the metapackage is installed (and then triggers the switching on of the batch system).
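
As a toy model of the behaviour described above (not cfengine's actual implementation), each run checks every package, installs only what is missing, and defines a class only when an install actually happened. The function and parameter names are assumptions for illustration.

```python
def converge(packages, installed, install):
    """One convergence pass.

    packages:  list of (package_name, class_to_define_on_install_or_None)
    installed: set of package names currently installed (updated in place)
    install:   callable that performs the installation of one package
    Returns the set of classes defined during this pass.
    """
    classes = set()
    for name, class_on_install in packages:
        if name in installed:
            continue  # already in the desired state: do nothing
        install(name)
        installed.add(name)
        if class_on_install:
            classes.add(class_on_install)
    return classes


policy = [("lcg-CA", None), ("glite-yaim", "runyaim")]
installed = {"lcg-CA"}  # hypothetical starting state of a node
log = []

print(converge(policy, installed, log.append))  # first pass installs glite-yaim
print(converge(policy, installed, log.append))  # second pass changes nothing
```

The second pass defining no classes is the property that makes it safe to run the check every hour: yaim is only triggered on the run where the metapackage was actually installed.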

More details in the wiki soon...

Thursday, November 09, 2006

I have made very good progress with cfengine over the last 2 days. I am now able to:

  • Manage cfengine from a central location (on the cluster master)
  • Use cfengine in the usual way to control the copying of files across the cluster
  • Trigger daemon reloads/restarts when their configuration files change
  • Ensure that rpm packages are installed, and install them if they are not

The next step is to automate the running of YAIM for worker nodes, and the installation and setup of the batch system.

I should probably have a chat with Colin Morey at Manchester to find out how he does this - share best practice!

Wednesday, November 08, 2006

Some 20 atlas jobs went into a funny state in the batch system, being in "hold". I eventually figured that this was an abnormal condition caused by the jobs not starting correctly.

Interestingly the fact that 20 atlas jobs were waiting, even though this was an abnormal wait, caused the GIP plugin to report a very high ERT/WRT for atlas. So there was nothing for it but to cancel these jobs (qdel). As soon as this was done the ERT returned to 0 and more atlas jobs arrived.

Didn't manage to get to the bottom of why they were failing to start though!

Tuesday, November 07, 2006

I have now installed a site level BDII on svr021. This was dead easy:

  1. Use YAIM to setup a top level BDII (metapackage glite-BDII, node type BDII).
  2. Copy the site level BDII configuration files from the CE.
  3. Update the GIIS URL in the GOCDB.

Tomorrow I will shut down the CE's BDII.

We're suffering from information system dropouts. This seems to happen particularly when jobs come into the site - I can see that the load on the CE jumped to about 10 as the number of jobs ramped up last night. This corresponds exactly to the time that we start to drop out of the BDII.

So it's really urgent that we have a separate site BDII to stop this happening (we failed a few SFTs because of this).

It is a pretty poor show from the BDII to be so sensitive about the CE's load.

Wednesday, November 01, 2006

Access for "local users" is now being considered at the new cluster. After looking at the methods supported by NGS partner sites, it would seem that this is a reasonable level of "gridness" to ask of users.

This is then based around:

  1. All of our users have a grid certificate.
  2. gsissh login to svr020, to allow seeding of files and preparation of job environment.
  3. globus-job-submit or edg-job-submit to the gatekeeper (from svr020 or from any other UI type machine) to submit jobs - i.e., no qsub!

At the site level we then need to ensure:

  • That certificates from this type of user are mapped to the same user account on the UI and CE.
  • That home directories for this class of user are shared across the cluster.

We will dedicate one of the disk servers to this purpose - together with a suitable automount map.

Investigating our empty queues a little more, we have quite a large number of jobs which seem to exit immediately (0s CPU and walltime). However, many more have better usage patterns, e.g. 25h09m cput, 25h16m wallclock.

Everything has exit status 0, so I really can't see anything that the site is doing wrong.


There seems to be a problem with publishing SAM/SFT tests on our site. Although we were passing every time until 2006-10-25, suddenly the publication of the tests on the SAM website stopped.

Checking the torque logs it's clear that these tests are still running - just not being published. I have raised this as a (now urgent) GGUS ticket: https://gus.fzk.de/pages/ticket_details.php?ticket=14737.

This seems to be affecting the number of jobs running through the cluster, which keeps dropping and dropping. After peaking at 200+ we have dropped to less than 20 running jobs after 10 days!