Tuesday, October 24, 2006


We were certified at 17:50 this evening. Immediately, ATLAS jobs started coming in and running.

As of 23:00 we're running 152 jobs: 141 ATLAS, 11 Biomed.

The gstat plot shows us coming online really nicely!

Monday, October 23, 2006

I feel like Victor Hugo.

"?"
"!"


ui1-gla:~$ edg-job-status --config rb/test.conf https://gm02.hep.ph.ic.ac.uk:9000/o6G2hiRip1t_smJPwk7QYA


******************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://gm02.hep.ph.ic.ac.uk:9000/o6G2hiRip1t_smJPwk7QYA
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: svr016.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-dteam
reached on: Mon Oct 23 14:14:23 2006
******************************************************
ui1-gla:~$ edg-job-get-output https://gm02.hep.ph.ic.ac.uk:9000/o6G2hiRip1t_smJPwk7QYA

Retrieving files from host: gm02.hep.ph.ic.ac.uk ( for
https://gm02.hep.ph.ic.ac.uk:9000/o6G2hiRip1t_smJPwk7QYA )

******************************************************
JOB GET OUTPUT OUTCOME

Output sandbox files for the job:
- https://gm02.hep.ph.ic.ac.uk:9000/o6G2hiRip1t_smJPwk7QYA
have been successfully retrieved and stored in the directory:
/tmp/jobOutput/graeme_o6G2hiRip1t_smJPwk7QYA

******************************************************

ui1-gla:~$ cat /tmp/jobOutput/graeme_o6G2hiRip1t_smJPwk7QYA/hw.out
Hello World

Saturday, October 21, 2006

The Glasgow DPM was tested by Jamie last week.

The summary is:

The i/o and network internal to the cluster are working very, very well. When we patched a direct link across from grid08 (as the source SE), we got a rate of 800+Mb/s. This was with only 3 of our 10 new disk servers. (These were transfers managed by FTS, so exercising the full data transfer hierarchy.)

However, the network rate both from within and from outside the university seems to be very low. iperf tests from the PPE research network to the new cluster struggle to reach 300Mb/s. From outside we seem to reach only 150Mb/s.
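
The sort of test involved looks roughly like this (hostnames are illustrative, not the actual machines used):

# On the receiving machine (e.g. one of the new disk servers) start an iperf server:
iperf -s

# On the sending machine (a PPE desktop, or an external host) run a 30 second TCP test:
iperf -c <server-hostname> -t 30

# iperf then reports the achieved bandwidth, which is what the figures above refer to.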

Because the internal results above were so good, we are convinced that the problem is the networking on the campus backbone.

We have raised this issue with the network team in the university, but have yet to get a response.
ClusterVision have more or less admitted that their system cannot be made secure any faster than the alternative kickstart installer can be finished.

While kickstart has some disadvantages for managing large numbers of worker nodes, this is clearly the install method which we will use going into production next week.

The installer is coming along nicely - see http://www.gridpp.ac.uk/wiki/Glasgow_New_Cluster_Installer

Monday, October 16, 2006

This morning, Edinburgh passed an ops CE-sft-lcg-rm-rep test for the first time in weeks! The reason for the success: a different destination SE was being used in the 3rd party replication. Instead of lxn1183.cern.ch, the test used castorsrm.cern.ch. Unsurprisingly, we immediately started failing the test again once the default SE switched back to lxn1183. This one really is proving difficult to debug.
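
For anyone wanting to poke at this by hand, the replication step can be approximated with lcg-rep; the file name below is purely illustrative:

# Third-party replicate an already-registered file to a named destination SE.
# With -d castorsrm.cern.ch this works for us; with -d lxn1183.cern.ch it fails.
lcg-rep -v --vo ops -d castorsrm.cern.ch lfn:/grid/ops/some-test-file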

Friday, October 13, 2006

Last night the first batch job ran through the Glasgow cluster successfully!

I had a maddening time getting ssh host based authentication to work, which turned out to be because root's PATH had had /opt/globus/bin put in it, so when I ran "sshd -d -p 8022" I was running the globus version, which is configured from /opt/globus/etc/ssh instead of /etc/ssh. Argggg!
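
For the record, the check that gives this sort of thing away is simple:

# Which sshd does root's PATH actually resolve to?
which sshd
echo $PATH

# When debugging, run the system daemon by its full path so there is no ambiguity:
/usr/sbin/sshd -d -p 8022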

Thankfully it is sorted properly now.

I have also found out how to put in the static routes to force the WNs to speak to the disk and grid servers directly over their eth0 interfaces - this was necessary because sshd on svr016 was not too happy about the WNs talking to it through the NAT gateway on the masternode.
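
The routes themselves are nothing exotic - something like this on each WN (the subnet shown is a placeholder, not our real addressing):

# Send traffic for the disk/grid server network out of eth0 directly,
# bypassing the default route via the NAT gateway on the masternode:
route add -net 10.141.0.0 netmask 255.255.0.0 dev eth0

# or, equivalently, with iproute2:
ip route add 10.141.0.0/16 dev eth0
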
The last of Glasgow's hardware arrived on Wednesday and was installed yesterday and today.

This includes the last worker nodes, the 3 remaining disk servers and the UPS.

Picture soon...
I've made some significant steps towards improving the security of the cluster installation.

I have adopted a "watcher" method, where a process running on the master node looks for a signal that a client is ready to receive secrets, then checks an authorisation database (sqlite) to see if this is allowed. If it's not, the request is ignored. If it is, the watcher pushes the node's ssh keys and restarts its ssh server, then pushes out its grid certificate (if applicable).
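
In outline the watcher amounts to something like the loop below. The helper that spots new firstboot requests, the database layout and the use of scp/ssh for the push are all simplifications for illustration, not the real code:

# Runs on the master node. new_firstboot_requests is a hypothetical helper
# that returns the hostnames of nodes which have just fetched firstboot.php.
while true; do
    for node in $(new_firstboot_requests); do
        allowed=$(sqlite3 /var/lib/watcher/auth.db \
            "SELECT count(*) FROM hosts WHERE name='$node' AND allowed=1;")
        if [ "$allowed" -ge 1 ]; then
            # Push the node's pre-generated ssh host keys and restart its sshd,
            scp /var/lib/watcher/keys/$node/ssh_host_* root@$node:/etc/ssh/
            ssh root@$node /etc/init.d/sshd restart
            # then push out the grid host certificate, if the node needs one.
        fi
    done
    sleep 30
done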

This is easy enough to patch into CVOS, which can send the signal (the signal is in fact just a side effect of requesting "firstboot.php" from the master's web server). However, after the initial install we still want the rsh server turned off!

Thursday, October 12, 2006

On Friday, Glasgow's DPM storage was configured to work in a reduced form: 6TB spread over 3 disk servers.

This took much longer than it should have because of an odd 'feature' in YUM: if, in yum.repos.d, a repository is named twice, then the second definition is ignored. I had copied the CERN YUM repository definitions and modified the baseurl to point to our local mirror. I then disabled the CERN repos. However, YUM just ignored my mirror definitions, as it had only read the, now disabled, CERN definition.

As a result, YUM couldn't resolve any dependencies. Maddening!
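
The trap looks like this: two files in /etc/yum.repos.d defining the same repository id (contents abbreviated and purely illustrative):

# /etc/yum.repos.d/cern-glite.repo  (the original CERN definition, later disabled)
[glite]
baseurl=http://linuxsoft.cern.ch/glite/
enabled=0

# /etc/yum.repos.d/local-glite.repo  (our copy, pointing at the local mirror)
[glite]
baseurl=http://mirror.gla.scotgrid.ac.uk/glite/
enabled=1

# Because the [glite] id has already been seen, YUM silently ignores the second
# stanza. The fix is to give the mirror definition its own id, e.g. [glite-local].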

Tuesday, October 10, 2006

The disk servers were stress tested for ~4 days using "stress", from the end of last week until Monday morning (when we had to shut down for electrical work). These tests consisted of 500 readers and writers, each writing 1GB files, reading them back, then deleting them (so the machines had a load average of 500+).
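
Something along the following lines produces that kind of load. Note this is indicative rather than the exact command we ran: stress's hdd workers write and delete files but do not read them back, so the read phase needs a little extra harness around it.

# 500 workers, each repeatedly writing a 1GB file to disk and unlinking it,
# left running for four days; this keeps the load average pinned at 500+.
stress --hdd 500 --hdd-bytes 1G --timeout 96h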

None of the disk servers seemed to have significant problems. disk032 did produce some odd kernel messages about being unable to read the partition table of /dev/sdb, but seems to have suffered no ill effects from this.

After the next stage of tests, which uses dt to write and read across the whole "disk" surface, we will do a RAID integrity check as a final validation of the suitability of this disk solution.
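
The dt pass will be something like the command below; the device name and size are placeholders, and the options should be checked against the dt documentation before being trusted:

# Destructive whole-surface test: write a pattern across all of /dev/sdb,
# then read it back and verify every block. Only for an empty array!
dt of=/dev/sdb bs=1m limit=500g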

So far, ARECA + 500GB Hitachi disks is looking good.
No real response from ClusterVision about the insecurities in CVOS. So, we are forging ahead with an alternative kickstart based install scheme. The key security aspect of this scheme will be that the master server pushes secrets onto the clients - the clients do not have the ability to provoke the server into giving away a secret.

In fact, based on the improvements already made to the kickstart scheme this is almost ready. I shall install a few worker nodes today and start running jobs through the batch system.

Tuesday, October 03, 2006

We will start stressing the disk servers today, so that they are thoroughly tested before we get a bill from CV.

Tools which have been suggested are:

  • stress: http://weather.ou.edu/~apw/projects/stress/
  • dt: http://home.comcast.net/~SCSIguy/SCSI_FAQ/RMiller_Tools/dt.html
Very worryingly, we've realised that CVOS is rather insecure - it runs an unauthenticated rsync daemon on the master node, which allows anything running internally to the cluster to read any file from the disk images. This would include ssh private keys and even grid certificates.
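
The problem is easy to demonstrate from any machine inside the cluster (the module and path below are made up; the point is that nothing asks for a password):

# List the modules exported by the master's rsync daemon:
rsync rsync://master/

# ...then pull arbitrary files straight out of a node's disk image:
rsync -av rsync://master/images/node001/etc/ssh/ssh_host_rsa_key /tmp/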

I have emailed CV with my concerns but no reply yet.

This could be the show stopper for CVOS, which is a great disappointment after putting nearly 4 weeks of effort into seeing this as the way to manage the new cluster.

Monday, October 02, 2006

Cluster update:


  • We have a DPM up and running as of Friday morning. srmcp of a single file from Glasgow's UI works, as do all of the DPM/DPNS administration commands.
  • I have set up torque and maui (the Steve Traylen build). I have managed to get the first jobs through the dteam queue! The output from the jobs isn't coming back to me. Hopefully a minor thing to resolve. (The hand tests involved are sketched just below.)
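
By way of illustration, the hand tests amount to roughly the following (hostnames and paths are placeholders for our real ones):

# Copy a local file from the UI into the dteam area of the new DPM over SRM:
srmcp file:////tmp/testfile \
  srm://<dpm-headnode>.gla.scotgrid.ac.uk:8443/dpm/gla.scotgrid.ac.uk/home/dteam/testfile

# Push a trivial job through the dteam queue on the new torque/maui setup:
echo "/bin/hostname" | qsub -q dteam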