Tuesday, October 24, 2006


We were certified at 17:50 this evening. Immediately, ATLAS jobs started coming in and running.

As of 23:00 we're running 152 jobs: 141 ATLAS, 11 Biomed.

The gstat plot shows us coming online really nicely!

Monday, October 23, 2006

I feel like Victor Hugo.

"?"
"!"


ui1-gla:~$ edg-job-status --config rb/test.conf https://gm02.hep.ph.ic.ac.uk:9000/o6G2hiRip1t_smJPwk7QYA


******************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://gm02.hep.ph.ic.ac.uk:9000/o6G2hiRip1t_smJPwk7QYA
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: svr016.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-dteam
reached on: Mon Oct 23 14:14:23 2006
******************************************************
ui1-gla:~$ edg-job-get-output https://gm02.hep.ph.ic.ac.uk:9000/o6G2hiRip1t_smJPwk7QYA

Retrieving files from host: gm02.hep.ph.ic.ac.uk ( for
https://gm02.hep.ph.ic.ac.uk:9000/o6G2hiRip1t_smJPwk7QYA )

******************************************************
JOB GET OUTPUT OUTCOME

Output sandbox files for the job:
- https://gm02.hep.ph.ic.ac.uk:9000/o6G2hiRip1t_smJPwk7QYA
have been successfully retrieved and stored in the directory:
/tmp/jobOutput/graeme_o6G2hiRip1t_smJPwk7QYA

******************************************************

ui1-gla:~$ cat /tmp/jobOutput/graeme_o6G2hiRip1t_smJPwk7QYA/hw.out
Hello World
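
For the record, the job was just a trivial "Hello World". A submission along these lines would produce the output above (the JDL is a minimal sketch, not the exact file used, and the submit flags are assumed to mirror the status command):

# hw.jdl -- minimal illustrative JDL (not the exact file used)
Executable    = "/bin/echo";
Arguments     = "Hello World";
StdOutput     = "hw.out";
StdError      = "hw.err";
OutputSandbox = {"hw.out", "hw.err"};

ui1-gla:~$ edg-job-submit --config rb/test.conf hw.jdl
ui1-gla:~$ edg-job-status --config rb/test.conf <returned job URL>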

Saturday, October 21, 2006

The Glasgow DPM was tested by Jamie last week.

The summary is:

The I/O and network internal to the cluster are working very, very well. When we patched a direct link across from grid08 (as the source SE), we got a rate of 800+Mb/s, and that with only 3 of our 10 new disk servers. (These were transfers managed by FTS, so exercising the full data transfer hierarchy.)

However, the network rate from both inside and outside the university seems to be very low. iperf tests from the PPE research network to the new cluster struggle to reach 300Mb/s; from outside we only reach about 150Mb/s.
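
For reference, the iperf runs were of the standard server/client form, something like this (host names are placeholders):

# on a node in the new cluster (server side)
iperf -s

# from a machine on the PPE research network, or off campus (client side)
iperf -c <cluster-node> -t 30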

Because of the excellent test above, we are convinced that the problem is the networking on the campus backbone.

We have raised this issue with the network team in the university, but have yet to get a response.

ClusterVision have more or less admitted that their system cannot be made secure any more quickly than the alternative kickstart installer can be completed.

While kickstart has some disadvantages for managing large numbers of worker nodes, this is clearly the install method which we will use going into production next week.

The installer is coming along nicely - see http://www.gridpp.ac.uk/wiki/Glasgow_New_Cluster_Installer

Monday, October 16, 2006

This morning, Edinburgh passed an ops CE-sft-lcg-rm-rep test for the first time in weeks! The reason for the success: a different destination SE was being used in the 3rd party replication. Instead of lxn1183.cern.ch, the test used castorsrm.cern.ch. Unsurprisingly, we immediately started failing the test again once the default SE switched back to lxn1183. This one really is proving difficult to debug.

Friday, October 13, 2006

Last night the first batch job ran through the Glasgow cluster successfully!

I had a maddening time getting ssh host-based authentication to work, which turned out to be because /opt/globus/bin had been added to root's PATH: when I ran "sshd -d -p 8022" I was actually running the Globus version, which is configured from /opt/globus/etc/ssh instead of /etc/ssh. Argggg!
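
The quick way to catch this sort of PATH confusion is to ask the shell which binary it will actually run, for example:

# show every sshd on the PATH, in resolution order
type -a sshd

# run the system daemon explicitly so there is no doubt which config it reads
/usr/sbin/sshd -d -p 8022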

Thankfully it is sorted properly now.

I have also found out how to put in the static routes to force the WNs to speak to the disk and grid servers directly over their eth0 interfaces - this was necessary because sshd on svr016 was not too happy about the WNs talking to it through the NAT gateway on the masternode.
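
The routes are of this general shape (the subnet and addresses are placeholders, not our actual addressing):

# on each WN: send traffic for the disk/grid server subnet straight out of
# eth0 instead of via the NAT gateway on the masternode
route add -net 10.141.1.0 netmask 255.255.255.0 dev eth0

# equivalent iproute2 form
ip route add 10.141.1.0/24 dev eth0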

The last of Glasgow's hardware arrived on Wednesday and was installed yesterday and today.

This includes the last worker nodes, the 3 remaining disk servers and the UPS.

Picture soon...

I've made some significant steps towards improving the security of installation on the cluster.

I have adopted a "watcher" method, where a process running on the master node looks for a signal that a client is ready to receive secrets, then checks an authorisation database (sqlite) to see whether this is allowed. If it's not, the request is ignored. If it is, the watcher pushes the node's ssh keys and restarts its ssh server, then pushes out its grid certificate (if applicable).

This is easy enough to patch into CVOS, which can send the signal (the signal is in fact just a side effect of requesting "firstboot.php" from the master's web server). However, after the initial install we still want the rsh server turned off!
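
Purely to illustrate the idea, the watcher loop looks roughly like this (paths, table layout and the push mechanism are all placeholders, not the actual implementation):

#!/bin/bash
# Illustrative watcher loop on the master node (not the real script).
# Requesting firstboot.php drops the client's hostname into a spool
# directory; we check the sqlite authorisation DB before pushing secrets.
# Assumes the master's public key was installed on the node at build time.
SPOOL=/var/spool/watcher
AUTHDB=/var/lib/watcher/auth.db

while true; do
    for req in "$SPOOL"/*; do
        [ -e "$req" ] || continue
        node=$(basename "$req")
        allowed=$(sqlite3 "$AUTHDB" \
            "SELECT count(*) FROM hosts WHERE name='$node' AND authorised=1;")
        if [ "$allowed" -ge 1 ]; then
            # push the node's ssh host keys and restart its ssh server
            scp /secrets/"$node"/ssh_host_* root@"$node":/etc/ssh/
            ssh root@"$node" '/sbin/service sshd restart'
            # push a grid host certificate if this node needs one
            [ -f /secrets/"$node"/hostcert.pem ] && \
                scp /secrets/"$node"/host{cert,key}.pem root@"$node":/etc/grid-security/
        fi
        rm -f "$req"   # unauthorised requests are simply ignored and cleared
    done
    sleep 10
done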

Thursday, October 12, 2006

On Friday, Glasgow's DPM storage was configured to work in a reduced form: 6TB spread over 3 disk servers.

This took much longer than it should have because of an odd 'feature' in YUM: if, in yum.repos.d, a repository is named twice, then the second definition is ignored. I had copied the CERN YUM repository definitions and modified the baseurl to point to our local mirror, then disabled the CERN repos. However, YUM just ignored my mirror definitions, as it only read the (now disabled) CERN definition.

YUM then couldn't resolve any dependencies. Maddening!
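
The gotcha looks like this: two files in yum.repos.d defining the same repository id, with only one of them actually honoured (contents are schematic, not our real files):

# /etc/yum.repos.d/cern.repo -- the copied CERN definition, later disabled
[lcg]
baseurl=http://cern.example.org/lcg/repo/
enabled=0

# /etc/yum.repos.d/local.repo -- same repo id, so YUM silently ignores it
[lcg]
baseurl=http://local-mirror.example.org/lcg/repo/
enabled=1

The fix is simply to give the mirror definition a different repo id, or to edit the original file in place.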

Tuesday, October 10, 2006

The disk servers were stress tested for ~4 days using "stress", from the end of last week until Monday morning (when we had to shut down for electrical work). These tests consisted of 500 readers and writers, each writing 1GB files, reading them back, then deleting them (so the machines had a load average of 500+).

None of the disk servers seemed to have significant problems. disk032 did produce some odd kernel messages about being unable to read the partition table of /dev/sdb, but seems to have suffered no ill effects from this.

After the next stage of tests, which uses dt to write and read across the whole "disk" surface, we will do a RAID integrity check as a final validation of the suitability of this disk solution.

So far, ARECA + 500GB Hitachi disks is looking good.

No real response from ClusterVision about the insecurities in CVOS, so we are forging ahead with an alternative kickstart-based install scheme. The key security aspect of this scheme is that the master server pushes secrets onto the clients; the clients do not have the ability to provoke the server into giving away a secret.

In fact, based on the improvements already made to the kickstart scheme this is almost ready. I shall install a few worker nodes today and start running jobs through the batch system.

Tuesday, October 03, 2006

We will start stressing the disk servers today, so that they are thoroughly tested before we get a bill from CV.

Tools which have been suggested are (example invocations are sketched after the list):


  • stress: http://weather.ou.edu/~apw/projects/stress/
  • dt: http://home.comcast.net/~SCSIguy/SCSI_FAQ/RMiller_Tools/dt.html
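
For stress, the sort of invocation in mind is along these lines (worker counts and duration are illustrative, not the final parameters):

# spawn 8 disk workers, each repeatedly writing and removing 1GB files,
# plus 4 CPU spinners, for 24 hours
stress --hdd 8 --hdd-bytes 1G --cpu 4 --timeout 24h

dt will come later for the write/read passes across the whole disk surface.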

Very worryingly, we've realised that CVOS is rather insecure - the master node runs an rsync daemon that allows anything running inside the cluster to read any file from the disk images. This would include ssh private keys and even grid certificates.

I have emailed CV with my concerns but no reply yet.

This could be the show stopper for CVOS, which is a great disappointment after putting nearly 4 weeks of effort into seeing this as the way to manage the new cluster.

Monday, October 02, 2006

Cluster update:


  • We have a DPM up and running as of Friday morning. srmcp of a single file from Glasgow's UI works, as do all of the DPM/DPNS administration commands
  • I have set up torque and maui (the Steve Traylen build) and managed to get the first jobs through the dteam queue (a trivial hand test is sketched below)! The output from the jobs isn't coming back to me yet; hopefully a minor thing to resolve.
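
A hand test of the batch system need be nothing more than something like this (queue name as configured here):

# push a trivial job through the dteam queue and watch it
echo "hostname; date" | qsub -q dteam
qstat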

Friday, September 29, 2006

More Edinburgh storage-related news: One of our disks died during the night. Similar to the problem of failing SAM tests (see below) this happened during the Edinburgh<->Durham simultaneous read/write transfer test. It appears this has been quite a stressful operation for our site, although it should be noted that we lost power to the ScotGRID machines this week when the computing facility went down for maintenance. Unfortunately this happened a day earlier than we had originally been told. You can draw your own conclusions about that one.

The client tools for analysing the RAID setup confirm that a disk has broken, but also show that even though we have hot spares available to replace it, none of them have been used. The software is also highlighting a couple of other problems. Looks like some manual intervention is required.

Problems, problems and more RM'ing problems. As of yesterday, we were still failing the ops lcg-rm SAM tests with the same issue as before (timing out of the 3rd party copy to the remote SE). The ops SAM SRM tests were all passing, as were the dteam SAM SRM and CE tests. However, yesterday afternoon things got worse and we started failing all SRM tests (dteam, ops) and all lcg-rm tests (dteam, ops). This appears to have been correlated with the start of the simultaneous Edinburgh<->Durham transfer tests which ran until ~0800 this morning. Even after the transfers stopped, the SAM tests continued to fail. Note, the dCache was still operational during this time.

This afternoon I started up the Edinburgh->Cambridge transfers. Initially everything was OK, but then I noticed a strange load pattern on the dCache head node which was causing the transfer rate to drop to 0 for periods of ~10 mins. Digging around, it appeared that the high load was due to the dcache-pool process on the head node (I had set up a small pool on the head node a few weeks ago). After turning this off, then back on again, then off again, the relationship was confirmed. See the attached ganglia plot. The pool has now been switched off permanently. This is a warning to anyone attempting to run dCache with a pool on the SRM/PNFS head node. Presumably the high load generated by simultaneous reads and writes had started to cause the SAM failures. I am still waiting for further SAM tests to run, but hopefully this will return our state to how it was prior to the Durham test (i.e. just failing ops lcg-rm).

In an attempt to solve the ops lcg-rm problem I have stripped the dCache pool configuration back to basics. Again, I'm waiting on SAM tests to run before I can find out if this has been successful. Watch this space...

Thursday, September 21, 2006

As hinted at in previous posts, I have created a kickstart build environment within CVOS. So as not to upset the CVOS tools, this pretends to be another CVOS category and slave image called "alt". However, it boots PXE installer images and passes the magic kickstart file "autokick.php" to the installer kernel.

Different classes for each machine are supported, as well as the copying of skeleton files, etc.

The intention here is not to run a fully fledged installation from kickstart, but rather to get a base from which to then work.

From those around GridPP to whom I've spoken, it seems that most people don't try to do too much within a kickstart environment, but instead ensure an ssh root login from the installer to finish the install after the machine has rebooted.
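
The usual trick for that (sketched here with a placeholder key, not our actual kickstart) is a %post stanza that installs the master's public key for root:

%post
# allow root ssh logins from the install server after the first reboot,
# so the rest of the configuration can be driven over ssh
mkdir -p /root/.ssh
chmod 700 /root/.ssh
cat >> /root/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAA...installer-public-key... root@master
EOF
chmod 600 /root/.ssh/authorized_keys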

I'm now thinking not to bother with CVOS for the grid nodes - at least not initially. There seems to be too much pain for no real gain in this. Single nodes are on their own and might as well remain there.


Rain stopped play today on the cluster.

The weather was so horrible I stayed at home. Unfortunately this coincided with my PXE boot environment breaking, so that no machine would install itself. As IPMI is not working, I couldn't find out why, despite putting some debugging traces into the kickstart file. I suspect the disk formatting might be going wrong - building RAID partitions in kickstart has always seemed flaky.
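
For reference, the software RAID part of the kickstart is of this general shape (sizes, filesystem and mount point are schematic, not our actual layout):

# pair up partitions on sda and sdb and build an md device over them
part raid.01 --size=20000 --ondisk=sda
part raid.02 --size=20000 --ondisk=sdb
raid / --fstype ext3 --level=1 --device=md0 raid.01 raid.02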

Wednesday, September 20, 2006

Some notes on switching to the OPS VO for reporting.

I found that Glasgow's WNs were missing a default SE entry for the OPS VO, which caused us to be basically down for the whole week. That was easy to fix.
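
Assuming the standard YAIM convention for per-VO variables, the fix amounts to a single line of this form in site-info.def (the SE name is a placeholder):

# default storage element advertised to OPS jobs
VO_OPS_DEFAULT_SE=<site-default-SE>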

However, more puzzlingly, Glasgow and Durham's LFCs are also failing, for no good reason that I can see. Glasgow's certainly has a valid host certificate and the permissions on the catalog seem to be OK. As I'm so pressed for time, with the new cluster being our most urgent priority, I have just removed our LFC from the site BDII. This should stop SAM from trying to test this element of the site. The catalog is unused, of course, as ALICE are the only LCG VO who require a site-local catalog, and they have a very different notion of what a T2 should provide, so they never send us any jobs.

Perhaps Durham should do the same, until the problem is better understood?

One really annoying thing about the SAM tests is that, as there are so many more of them, there's no nice single-page summary which shows the site's status. And selecting the individual tests is quite awkward in the way it's rendered, in Safari at any rate.

I will change my magic safari scotgrid status tab to show at least the SE and CE tests for the sites.

Edinburgh have been failing the replica management SFTs for over a week now. Initially, only one of the RM sub-tests (CE-sft-lcg-rm-rep) was failing, due to a server timeout problem (which I am sure was not a problem at our end, since the dCache has been used almost continuously for the ongoing inter-T2 transfer tests). However, on Monday afternoon we started failing 4 of the sub-tests, and were getting an error that indicated a permissions problem with the /pnfs/epcc.ed.ac.uk/data/ops directory on our dCache (the SFTs are now running as ops). At the same time, the ops SAM tests started to fail with a similar error. Meantime, the dteam SAM tests were all green.

In order to try and work out what the problem was, I used /opt/edg/etc/grid-mapfile-local and mapped my DN to the ops VO. I could then use srmcp to copy files into and out of the ops directory of the dCache. There were some problems in using the lcg-cr command, but it was unclear if this was to do with me trying to interact with the ops file catalog when my DN would map me to dteam. I also changed the dCache configuration to something more basic, just to check that this was not causing some problem, but this did not have an impact on the SFT results. Note, it can be very tricky trying to debug a problem with a VO that you are not a member of.
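
For anyone else trying this, the grid-mapfile-local entry is just one line in the usual grid-mapfile form (the DN below is a placeholder); the leading dot maps the DN onto the ops pool accounts:

"/C=UK/O=eScience/OU=SomeSite/L=SomeLab/CN=Some Person" .ops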

However, at about 2230 last night the SFTs switched back to only failing on the single RM sub-test, and at about 0100 the ops SAM tests started passing again. Strange, I know. Checking LCG-ROLLOUT this morning, there had already been a few postings about the RM tests failing with dCaches at other sites. It appeared that the cause of this was that Judit Novak (who helps run the SFTs) had recently joined the ops VO, but her DN was still being mapped to dteam within the grid-mapfile. She has now unregistered from ops and has stopped (I think) the SFTs for today, to ensure that the grid-mapfiles are up to date.

I'll update tomorrow if I see that things have changed.