Wednesday, December 03, 2008

All go at ScotGrid

This is a quick update to make up for the fact that we've been too busy to blog here in ScotGrid land - lack of activity in the blog rather indicates a frenzy of activity on the ground!

  • The new Viglen hardware arrived, was installed and passed its acceptance test without any problems. However, we did have severe air conditioning issues in the new computer room which prevented us from actually switching on the new kit in anger (we didn't want it to cook itself!). These were cured at the end of last week, when a failover between the two chilled water pumps was installed. Since then Mike has been proving the new worker nodes in the batch system and we're on the point of bringing the new nodes online.
  • Meanwhile, in ATLAS land, I have been helping to organise the UK Distributed Analysis Challenge. This has been hammering our system with 100s of ATLAS user analysis jobs. In the first round we had inherited a bad rfio readahead setting, so we delivered GBs of data to the jobs which they did not want. Second time around this was cured, but it looked like we had serious load issues on the DPM headnode and some files could not be opened by jobs. What's worrying here is that we peaked at about 110 user analysis jobs running simultaneously, yet DPM really struggled to keep up with the rate of opens - to be investigated later.
  • On the middleware front I installed a new CE (svr026) to provide redundant access to the batch system and a 'hot spare' DPM (svr025) which is there to (a) investigate peculiar client timeout errors we see with svr018 (do they repeat? initial answer seems to be no) and (b) provide a 'ready to go' DPM headnode if anything unfortunate happens to svr018.
  • ECDF has been working much better using mw05, the new SL4 SGE CE. Also, thanks to continual pressure from Phil, we nailed the last of the VSZ problems (the sgm accounts had low VSZ limits, which caused the installation of software to fall over in very peculiar ways). Since then ATLAS has run very well at ECDF.
  • Continuing the CE improvements, Sam and Steve hope to introduce a second ECDF CE and retire the old SL3 CE very soon.
  • Durham's new kit (all 1MSI2K of it) should arrive very soon now, so they will revamp the whole cluster and dump the old kit. They will be in downtime for a while as this happens. They are taking the ScotGrid lead on virtualising services, which we see as a really important step towards providing rapid recovery from equipment failures and lots of flexibility in deployment.
Finally, we have seen a welcome return of LHCb production jobs, had some serious gripes with biomed (I think they are disabled on all our SEs now) and seen some excellent SAM test figures for all the sites, despite generally being full to the gunnels with jobs.

