Wednesday, December 20, 2006

I have finally fixed the problem where the BDII was refusing to publish the VO view for biomed on UKI-SCOTGRID-GLASGOW. This had been going on since the cluster was reconfigured. At that time the biomed torque queue was renamed biom (I'm sure I had a good reason for doing this, although I now cannot recall it...), which was somehow causing the dynamic scheduler wrapper to fail to publish for the biomed queue. I had manually hacked LDIF files, rerun YAIM, etc., etc., without success.

This afternoon I simply renamed the biom queue back to biomed, redefined the queue in site-info.def and reran YAIM's config_gip - finally, the information system started to do the right thing.
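For future reference, the published VO view can be checked by querying the information system directly. A sketch only - the host below is a placeholder for our site BDII, and port 2170 with the mds-vo-name base is the usual convention:

    # Query the site BDII for the biomed VO view (host is a placeholder).
    ldapsearch -x -H ldap://site-bdii.example.org:2170 \
        -b "mds-vo-name=UKI-SCOTGRID-GLASGOW,o=grid" \
        '(&(objectClass=GlueVOView)(GlueVOViewLocalID=biomed))' \
        GlueCEStateRunningJobs GlueCEStateWaitingJobs GlueChunkKey

If the GlueCE entry for the queue is there but no VO view comes back, the dynamic plugin output is the first place to look.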

How annoying that I wasn't able to really understand the problem.

Tuesday, December 19, 2006

The old scotgrid production site (scotgrid-gla) has been failing SAM rm tests for a few days because we have completely run out of storage! After the failure of the large RAID array we are left with only a single 1.6TB array. It looks from the SRM logs as though both ATLAS and ZEUS have been trying to add more files than we can cope with.

I have managed to secure a few tens of GB by deleting some old dteam test files. I have also reactivated a small pool on se2-gla itself and reserved it for ops.
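As an aside, the quickest check of what space the SE is actually publishing is lcg-infosites; a rough example (the VO is arbitrary and the grep pattern assumes the SE hostname contains se2-gla):

    # Show the available/used space each SE publishes for the VO, then
    # filter for our host (pattern is illustrative).
    lcg-infosites --vo atlas se | grep se2-gla

That at least shows whether the freed space is making it into the information system.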

I really need to do an EGEE broadcast today announcing the closure of the old site. Unfortunately the queue length is 200 hours - so we cannot close before Xmas.

Monday, December 18, 2006

First TPM Shift

Finished my first week on TPM duty. It was a far easier task than I had anticipated. It could just be that everything is winding down for Xmas, but the volume of tickets needing to be processed was much lower than I had expected.

I found it much easier to find tickets for processing via the web interface than by trying to track things via email. By email, every single ticket change is sent: I got well over 500 emails in the course of the week, most of which were changes to tickets already being processed or solved. Dealing with that volume would have been unmanageable, but via the web it's an absolute doddle to find the tickets currently assigned to TPM and process them.

I solved two tickets directly: one about DPM pool accounts and the other about DTEAM VOMS/LDAP servers.

Wednesday, December 13, 2006

We started failing SAM tests yesterday with the rather unhelpful "Unspecified job manager error". This usually means there was a problem starting the job at all. As usual, debugging this is a black art - nothing useful at all in globus-gatekeeper.log or the torque logs. Finally found emails from the batch system to the pool account users stating that there was no space in /home, so their jobs could not be started. Poking around, it was clear that /home on the WNs was the problem.

Investigating, I found that /home had been filled up by ATLAS jobs which had seemingly unpacked >3GB of software into their torque working directories, filling /home and leaving the node dead to the batch system. Eventually we had built up such a stack of these that SAM jobs found their way there and started to fail. It took quite some time to identify all the affected WNs and clear out the mess.
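The hunt boiled down to something like the sketch below - a loop over the WNs known to torque, reporting /home usage and the largest leftover job directories (the pbsnodes parsing and the paths are assumptions about our setup):

    #!/bin/bash
    # For every WN torque knows about, show /home usage and the three
    # biggest directories left under the pool accounts' homes.
    for wn in $(pbsnodes -a | grep -E '^[^ ]'); do
        echo "== $wn =="
        ssh -o ConnectTimeout=5 "$wn" \
            'df -h /home | tail -1; du -sk /home/*/* 2>/dev/null | sort -rn | head -3'
    done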

Raised a ticket on ATLAS - got a quick response from Fredric and then from Rod. It seems that EDG_WL_SCRATCH was wrong. Arggg, so partly my fault after all! I note that in the environment there were 4 environment variables pointing to /tmp (TMPDIR, EDG_TMP, GLITE_LOCATION_TMP, LCG_TMP), but EDG_WL_SCRATCH was pointing to the non-existent /local/glite (initially I had intended to put the large "scratch" partition on the WNs there, but later changed my mind and used /tmp).
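A sanity check I should have run earlier is to compare what a login shell on a WN exports with what actually exists on disk. A sketch - the idea that EDG_WL_SCRATCH ultimately comes from site-info.def is an assumption about our YAIM setup:

    #!/bin/bash
    # Run on a WN: list the temp/scratch variables a login shell picks up
    # from the profile scripts and flag any pointing at a missing directory.
    for v in TMPDIR EDG_TMP GLITE_LOCATION_TMP LCG_TMP EDG_WL_SCRATCH; do
        dir=$(bash -l -c "echo \"\$$v\"")
        if [ -z "$dir" ]; then
            status="unset"
        elif [ -d "$dir" ]; then
            status="exists"
        else
            status="MISSING"
        fi
        printf '%-22s %-22s %s\n' "$v" "${dir:--}" "$status"
    done
    # The fix in our case: point EDG_WL_SCRATCH at /tmp in site-info.def
    # and re-run YAIM on the WNs.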

However, how can ATLAS jobs run with any sort of efficiency if they try and unpack GBs of software?

Things look a bit better for recent jobs, but there is a stack of ~400 ATLAS jobs which seem stuck in the cluster - 20s CPU time but up to days of wall time. I will email Fredric and see if I can delete them to unblock things.
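Spotting these is easy enough from qstat; a rough sketch (the thresholds are arbitrary and the parsing assumes standard qstat -f output):

    # Flag jobs with more than an hour of wall time but under a minute of CPU.
    qstat -f | awk '
        /^Job Id:/                { id = $3; cput = -1 }
        /resources_used.cput/     { split($3, c, ":"); cput = c[1]*3600 + c[2]*60 + c[3] }
        /resources_used.walltime/ {
            split($3, w, ":"); wall = w[1]*3600 + w[2]*60 + w[3]
            if (cput >= 0 && wall > 3600 && cput < 60)
                print id, "cput=" cput "s", "wall=" wall "s"
        }'

Once the VO confirms they are dead, a qdel over that list should free the slots.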

Monday, December 04, 2006

I installed the beta DPM plugins onto scotgrid-gla and UKI-SCOTGRID-GLASGOW today. My instructions were really clear ;-)

Two minor problems:

  1. On UKI-SCOTGRID-GLASGOW the DPMINFO user was added with a new-style MySQL password, which is not supported by the SL3 version of perl-DBI. I had to rewrite the password hash using the old_password() function (see the sketch after this list).
  2. I edited /opt/lcg/var/gip/plugin/lcg-info-dynamic-se with emacs, which left an editor backup file behind. Unfortunately the /opt/lcg/bin/lcg-info-generic master script also runs editor backup files in the plugin directory, so the dynamic SE entries were generated twice and both copies were published in the BDII. Deleting the backup files and restarting MDS cured the problem.
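For the record, the fix for item 1 was along these lines - the account name, host and password are placeholders for whatever YAIM configured on the DPM head node:

    # Rewrite the info account's password hash in the pre-4.1 format that
    # the SL3 perl MySQL driver understands (names/password are placeholders).
    mysql -u root -p \
        -e "SET PASSWORD FOR 'dpminfo'@'localhost' = OLD_PASSWORD('info_password'); FLUSH PRIVILEGES;"
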
I have been working with a local ATLAS user who had to replicate 20,000 files of a PhD student's thesis data onto ScotGrid for backup - about 1.8TB in total.

We discussed the various tools available and he opted for replicating the local directory structure into the ATLAS LFC, then manually generating the catalog entries for each of the files.

I helped him to write loops around the various lcg-utils and lfc client commands, and we tried to do a bit of error catching and sanity checking. First of all, it is indicative of the very poor state of data management end-user tools that such loops have to be written at all - why isn't there a recurse option on lcg-cr? Why don't commands retry properly?
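The loops looked roughly like the sketch below. The LFC host, SE, local path and LFN prefix here are placeholders, and the three-attempt retry is an arbitrary choice:

    #!/bin/bash
    # Walk the local tree, mirror each directory into the LFC, then copy
    # and register every file with a few retries.
    export LFC_HOST=lfc.example.org
    SE=se.example.org
    LFN_BASE=/grid/atlas/users/thesis-backup
    SRC=/data/thesis

    find "$SRC" -type f | while read -r f; do
        rel=${f#$SRC/}
        lfc-mkdir -p "$LFN_BASE/$(dirname "$rel")"
        for attempt in 1 2 3; do
            if lcg-cr --vo atlas -d "$SE" -l "lfn:$LFN_BASE/$rel" "file:$f"; then
                break
            fi
            echo "attempt $attempt failed for $f" >&2
            sleep 30
        done
    done

Even then, a failed lcg-cr can leave a half-registered entry behind, which is where the forced de-registrations in point 2 below come from.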

Running these loops we have discovered that:

  1. LFC is good - no commands failed here. It helps a lot that the server is specified manually and not subject to an information system lookup.
  2. File copying over SRM is pretty robust, but not 100% perfect. A fair few files failed to copy and we had to force a catalog de-registration for them.
  3. The information system is dreadful - forcing an LDAP lookup from the BDII for every single file copy is madness. Many, many BDII timeouts, refused connections, etc.

Why don't we have a decent information system, one which can at least fail over and/or retry, and why on earth is there no caching? Every lcg-cr requires a fresh lookup of the global LFC for the ATLAS VO - information which pretty much never changes. Think HTTP queries, think squid, and we would at least be in the right infrastructure area for robust information systems with failover (redirects) and caching (TTLs).