Wednesday, April 28, 2010

batch migrations

A week ago we finally migrated our batch system to better hardware. This had been on the cards for a while but was expedited because we need to start a series of server moves out of an old rack, which will be removed when our new kit is installed. We also took this opportunity to bring the pbs server version in line with our mom version, which was a little out of step. As if that wasn't enough change at once, we also built the latest MAUI 3.3 to test how it performs. So far so good.

Next up will be the two WMS. Both will be put in downtime, drained and then moved out of the old rack.

serious multi core

who would have thought it possible ....



Well, with a large SSD and 24 cores running file stager analysis, it survives! More soon on our testing with some cutting edge equipment.

Tuesday, April 13, 2010

The User Forum in Uppsala continues with lots of interesting talks today. It is more user focussed today, with sessions from Bioinformatics, Earth Science and Computational Chemistry. Again the buzz words of cloud, EC2, Eucalyptus and Open Nebula continue to be mentioned during the Novel Technologies and Architectures sessions.


The cathedral in Uppsala.

Uppsala Begins

The last EGEE User Conference kicked off yesterday in Uppsala, Sweden. In fact this will be the last EGEE event ever, as the project finally shuts its doors at the end of the month. Even with this sad event looming, everyone is in high spirits about the transition to EGI and the change that this will bring. Monday saw the conference begin with some interesting plenaries, including the history of Uppsala University. The 'old' building was the only building to be saved when the entire town burnt down. The 'new' building is actually constructed from the remnants of an old boat, bought by the builder, who was later made bankrupt by stumping up the cash out of his own pocket to complete the building. You could never tell that this incredibly ornate main auditorium has columns made of cast iron and an incredibly useful bulletproof ceiling of solid steel plates!

The rest of the day followed with sessions on security, user support and application porting, plus the first of two poster sessions. Tuesday will start with two technical plenaries and the first and last EGEE photo call.

Thursday, April 08, 2010

Take my outputs, damn you...




We recently ran up a very large backlog of production output files waiting to go from Glasgow back to the Tier-1 (reminder, panda doesn't consider a job finished until the outputs are safely stored at the T1). This is clearly seen in the red line on the panglia plot above, which reaches very high values. As we recently cut the timeout for the UK cloud to 2 days for transferring jobs, to improve the responsiveness of the production system, we started to leak out failed jobs (light green line) as panda gave up and decided to rerun.

Fortunately we got a big boost in the number of FTS slots from Glasgow to RAL, increasing from 10 to 25 active transfers (see the bottom FTS monitoring plot). Even so it clearly takes 24 hours for all the backlogs to drain down.

One of the problems here is that the output files are small from simulation (a tiny log file and a 20-50MB HITS file), so the overheads of FTS + SRM are very considerable and the actual bandwidth achieved is quite low. One possibility we are considering in ATLAS is introducing a pre-merge of outputs on the T2, which will allow us to send much bigger files back to the T1 (although a final "super-merge" will probably still be necessary). For this we are waiting for the generic Athena merge transform and then we will need to test integrating this into the mainline production workflow.

Until then we just have to take the operational load of tweaking the FTS settings when necessary.
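To put a rough shape on the small-file problem, here is a minimal sketch of how a fixed per-file overhead eats into the achieved rate. The numbers are purely illustrative assumptions (10 seconds of FTS + SRM overhead per file, 50 MB/s once a transfer is actually moving data, a 1 GB merged file), not measurements from our system, and the function name is made up for this example.

```python
def effective_throughput_mb_s(file_size_mb, overhead_s, link_mb_s):
    """Achieved rate for one transfer that pays a fixed overhead per file."""
    transfer_time_s = file_size_mb / link_mb_s
    return file_size_mb / (overhead_s + transfer_time_s)


# Assumed, illustrative values only (not measured at Glasgow or RAL).
OVERHEAD_S = 10.0   # hypothetical FTS + SRM cost per file
LINK_MB_S = 50.0    # hypothetical rate while data is flowing

small = effective_throughput_mb_s(25.0, OVERHEAD_S, LINK_MB_S)     # a single HITS-sized file
merged = effective_throughput_mb_s(1000.0, OVERHEAD_S, LINK_MB_S)  # a pre-merged file

print(f"small file:  {small:.1f} MB/s")
print(f"merged file: {merged:.1f} MB/s")
```

With these made-up inputs the small file manages roughly 2.4 MB/s against about 33 MB/s for the merged one, which is the sort of gap that makes a pre-merge on the T2 attractive.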

Tuesday, April 06, 2010

CREAM gets an upgrade

The CREAM instance at Glasgow has now been upgraded to the latest SL5 version. This continues the push to migrate those services that can be moved from SL4 to SL5 and should also make it easier to upgrade to the new 1.6 instance when it is released. The only hitch to a relatively painless upgrade was cfengine tweaking LCAS and replacing 64 bit path names with 32 bit paths.

Thursday, April 01, 2010

Où est le site bdii

Our upgrade to the SL5 gLite3.2 site bdii has been tormenting me of late: even though the BDII was installed, it was only returning data to a local ldapsearch.

It was listening on port 2170 and the bdii process was running. An ldapsearch from a local machine worked, but from an external machine it could not connect.

First thought was the firewall, but iptables was not working. Then what about the campus firewall? Nope, nothing had changed there. I checked the configs from SL4 to SL5 and they were the same. I turned on logging for slapd and turned up the verbosity. You could then see the DENYs being made by slapd itself.

After much googling I tried adding slapd to /etc/hosts.allow and this worked! It looks like with the transition to SL5 there is a requirement to add the slapd service to hosts.allow. This looks to have been a bug with openldap in SL4.
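For anyone wanting to check other nodes for the same problem, a small sketch along these lines can tell you whether a hosts.allow file mentions slapd at all. The sample entry `slapd: ALL` is an assumption for illustration (the exact line we used is not shown here), the function name is made up, and the parsing is deliberately simplified: it ignores continuation lines, `EXCEPT` clauses and `daemon@host` patterns.

```python
def allows_daemon(hosts_allow_text, daemon="slapd"):
    """Return True if any rule's daemon list names `daemon` or the wildcard ALL.

    Simplified reading of the TCP wrappers format (daemon_list : client_list).
    """
    for raw in hosts_allow_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines and anything without a rule separator.
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Daemon names before the first ':' may be separated by blanks or commas.
        daemons = line.split(":", 1)[0].replace(",", " ").split()
        if daemon in daemons or "ALL" in daemons:
            return True
    return False


# Hypothetical file contents, for illustration only.
sample = """# /etc/hosts.allow
sshd: 10.0.0.0/255.0.0.0
slapd: ALL
"""

print(allows_daemon(sample))
print(allows_daemon("sshd: ALL\n"))
```

On a real node you would pass in the contents of /etc/hosts.allow rather than the sample string.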

With the site bdii upgraded, the changeover took place yesterday.