Monday, December 04, 2006

I have been working with a local ATLAS user who had to replicate 20,000 files of a PhD student's thesis onto ScotGrid for backup - it was about 1.8TB of data in total.

We discussed the various tools available and he opted for replicating the local directory structure into the ATLAS LFC, then manually generating the catalog entries for each of the files.
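
Mirroring the directory tree itself is the easy part: walk the local directories and call lfc-mkdir for each one. A minimal sketch of that step, with the hostname and paths as placeholders rather than the real ones:

```bash
#!/bin/bash
# Sketch only: mirror a local directory tree into the LFC namespace.
# Hostname and paths are placeholders, not the ones actually used.
export LFC_HOST=lfc.example.org                # catalogue server, set by hand
SRC=/data/thesis                               # local root of the files to back up
LFC_ROOT=/grid/atlas/users/phd-student/thesis  # target root in the LFC

find "$SRC" -type d | while read -r dir; do
    rel=${dir#$SRC}                            # path relative to the local root
    lfc-mkdir -p "$LFC_ROOT$rel" || echo "lfc-mkdir failed for $rel" >&2
done
```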

I helped him write loops around the various lcg-utils and lfc client commands and we tried to do a bit of error catching and sanity checking. First of all, it is indicative of the very poor state of data management end-user tools that such loops have to be written at all - why isn't there a recurse option on lcg-cr? Why don't commands retry properly?
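
The copy-and-register loop itself was little more than lcg-cr inside a retry wrapper, roughly like this (the SE name, paths and retry policy here are illustrative, not exactly what we ran):

```bash
#!/bin/bash
# Sketch only: copy-and-register loop with a crude retry, roughly the shape of
# what we wrapped around lcg-cr. SE, VO paths and retry count are examples.
export LFC_HOST=lfc.example.org
SE=se.example.ac.uk                            # destination storage element
SRC=/data/thesis
LFC_ROOT=/grid/atlas/users/phd-student/thesis

find "$SRC" -type f | while read -r file; do
    lfn="lfn:$LFC_ROOT${file#$SRC}"
    for attempt in 1 2 3; do
        if lcg-cr --vo atlas -d "$SE" -l "$lfn" "file:$file"; then
            break                              # copied and registered, move on
        fi
        echo "attempt $attempt failed for $file" >&2
        sleep 30                               # crude back-off before retrying
    done
done
```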

Running these loops we have discovered that:

  1. The LFC is good - no commands failed here. It helps a lot that the catalogue server is specified manually rather than looked up in the information system.
  2. File copying over SRM is pretty robust, but not 100% perfect. A fair few files failed to copy and we had to force a catalog de-registration (see the sketch after this list).
  3. The information system is dreadful - forcing an LDAP lookup from the BDII for every single file copy is madness. We saw many, many BDII timeouts, refused connections and so on.
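
Forcing those de-registrations was itself a little dance of catalogue commands, something along these lines (the LFN is a made-up example):

```bash
#!/bin/bash
# Sketch only: forcibly de-register a catalogue entry left behind by a failed
# copy. The LFN is just an example.
export LFC_HOST=lfc.example.org
lfn="lfn:/grid/atlas/users/phd-student/thesis/chapter1/fig001.dat"

guid=$(lcg-lg --vo atlas "$lfn")               # look up the GUID for this entry
for surl in $(lcg-lr --vo atlas "$lfn"); do    # list whatever replicas are registered
    lcg-uf --vo atlas "$guid" "$surl"          # unregister the broken replica
done
```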

Why don't we have a decent information system, one which can at least fail over and/or retry, and why on earth is there no caching? Every lcg-cr requires a fresh lookup of the global LFC for the ATLAS VO - information which pretty much never changes. Think HTTP queries, think Squid, and we would at least be in the right infrastructure area for robust information systems with failover (redirects) and caching (TTLs).
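
For completeness, the only mitigation the current tools really offer is to pin things via environment variables, which is presumably all that "given manually" in point 1 above amounts to. A sketch, with placeholder hostnames and, of course, no client-side caching behind any of it:

```bash
# Pin the catalogue and the information system for the lcg-utils/lfc clients.
# Hostnames are placeholders; these settings avoid some lookups but cache nothing.
export LFC_HOST=lfc.example.org                  # catalogue host, given by hand
export LCG_GFAL_INFOSYS=bdii.example.org:2170    # top-level BDII the tools query
export LCG_CATALOG_TYPE=lfc                      # use the LFC rather than the old RLS
```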
