Friday, June 24, 2011

The Grid is a hungry, hungry beast....

... and it eats networks. From here begins a long, convoluted story, ending, as these stories often do, in something that seems like it should have been obvious all along.

We've been noticing some 'blips', during which Maui fights bravely but ultimately fails to schedule jobs. This is generally considered rather sub-optimal.

The root of it was that Maui was failing with an error:

ERROR:    cannot get node info: Premature end of message


That Maui error results in Maui taking a break for 15 minutes, before trying to schedule anything again. Which is fair enough, in the face of communication errors. Only ... Maui doesn't speak to anything except the Torque server. Which is running on the same host.

So what's actually happening here is that Torque can't talk to some node or other and is reporting that to Maui, which then breaks. It didn't seem right that a single communication failure to one node should stop jobs from starting everywhere else, which prompted some deeper investigation.
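
For reference - a quick spot check rather than how we tracked this down - the server's view of which nodes it can't reach can be poked at from the Torque host with the standard tools (assuming the default Torque log layout):

# run on the Torque server host
pbsnodes -l                               # nodes pbs_server currently marks as down
grep -i "end of message" /var/spool/torque/server_logs/$(date +%Y%m%d)   # comms errors in today's server log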

Looking for obvious correlations, we noticed that the scheduling blips happened right when we were running lots of analysis jobs - exactly when we don't want scheduler blips! However, it wasn't a clean correlation, in that sometimes running 1000 jobs at once was fine, while other times 400 were enough to gum things up.

More worrisome than sub-optimal scheduling was that, during the same periods, we got occasional errors from the CEs, of the form:

BLAH error: submission command failed (exit code = 1)
(stdout:)
(stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server svr016.gla.scotgrid.ac.uk (errno=15007) Unauthorized Request


Dissecting that, the BLAH part is CREAM saying it can't submit the job, so we're really looking at the pbs_iff part. The purpose of pbs_iff is to authenticate the current user to the Torque server, so that the job runs with the correct user id (and can be checked against the ACLs on the server, if appropriate). The next part, with qsub, is just reporting that it's not able to talk to the server.

The root problem is pbs_iff being unable to communicate, after which the rest of the qsub fails for lack of authentication. This is a problem, because these are jobs that have already been accepted by the CREAM CE and shouldn't be failed at this point. (If a site can't cope with the jobs, the CE should be disabled so it never accepts them - that's the signal to the submitter/WMS to try elsewhere.)
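
As a rough way to reproduce the symptom by hand (a sketch, not how these errors were produced), any Torque client command run from a CE exercises the same pbs_iff authentication path to the server:

# from a CE node; the client authenticates to pbs_server via pbs_iff
qstat -B svr016.gla.scotgrid.ac.uk    # server status - should fail in much the same way when pbs_iff can't get a reply through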

How does all this link back to the network issues? Well, our cluster is split into two rooms - linked by a couple of fibres.

During analysis, we can see 2 GB per second (yes, that's in bytes) of traffic leaving the disk servers. Roughly half the disk and about half of the CPUs [see later!] are in each room, which implies that, given a random distribution, about half of that traffic has to pass through the fibre link.

And, yep, that's the problem right there. The Torque server can't shout loud enough to talk to the nodes when the link is full, or to be heard by some of the CEs. Digging into the stats shows that the link has been running at 83% average utilisation over the past month. So when analysis hits, it wipes out any other traffic.

For the moment, then, as mitigation I've put a cap on the number of analysis jobs until we can resolve this properly. And sent Mark off to find some more fibre and ports on the switches!
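
For the record, a cap like that is a one-line limit in maui.cfg - something along these lines, assuming the analysis work arrives through a queue/class called 'analysis' (a placeholder name) and a limit of 400:

# maui.cfg - illustrative cap on concurrently running jobs in one class
CLASSCFG[analysis] MAXJOB=400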

Some interesting sums: it turns out we have nearer 1/3 of the CPU upstairs and 2/3 (1200 job slots) downstairs, while disk is close to 1/2 in each room. Matching this up with the planning figure of 5 MB per second of 'disk spindle to analysis CPU' bandwidth per job slot suggests that we need 3 GB per second, or 24 Gb/s, of bandwidth between the rooms to run at full capacity - compared to 10 Gb/s at the moment.
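
A back-of-the-envelope version of those sums, using the numbers above:

# 1200 slots downstairs, 5 MB/s per analysis slot, ~half the disk upstairs
slots_downstairs=1200
mb_per_slot=5
demand=$(( slots_downstairs * mb_per_slot ))   # 6000 MB/s drawn by the downstairs CPUs
cross=$(( demand / 2 ))                        # ~half of that served from upstairs disk: 3000 MB/s over the link
echo "~$(( cross * 8 / 1000 )) Gb/s needed between rooms"   # ~24 Gb/s, against the 10 Gb/s installed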

Hrm. No wonder we were having difficulty! On the other hand, it's probably been this link that's the limiting factor in our analysis throughput, so we should be able to roughly double our peak throughput of analysis jobs once that link is upgraded.

That, and not have the scheduler taking a wee nap during peak times.

Wednesday, June 01, 2011

Side Effects may include ...

On Wednesday the 25th, the Glasgow Scotgrid site took part in the wider SSC5 Security Challenge, and during the course of the challenge we encountered several issues with the network security configuration on our core switch.

The configuration changes which caused issues were specifically:
1) Access List Configuration for inbound services
2) ICMP dos-control settings

The Access Control List (ACL) configuration did not accept a global default permit with a wildcard mask for both IP address ranges and subnets. The key issue here is that when the access list was applied on an access port for inbound traffic it worked correctly; however, when applied to the primary egress port on our network switch it disabled remote connectivity into the cluster, while not affecting internal machine-to-machine traffic within the cluster. The access list was removed and remote access was restored. The root cause of the failure was traced to an incorrectly set ACL ANY permit within the list. On further investigation, however, it appears that each network requiring access to and from the cluster will need its own unique entry, rather than a default network range with a series of denied services. The central IT group at the University also runs a series of access lists and firewalls within the edge routing and switching network to the JANET environment, which can be adapted to fit our requirements within the cluster setup at Glasgow.

A secondary issue:

A dos-control setting that limits the maximum payload of ICMP packets also caused unusual network behaviour after it was implemented. With the payload capped at 512 bytes, Maui and Torque ran into problems when attempting to communicate with one another, which in turn impacted other services within the cluster environment. While this slowed Torque and Maui down it did not completely stop the cluster; however, removing the setting immediately improved data connectivity within the cluster. The issue is being referred back to the manufacturer, as the payload limit can currently only be raised as far as 1023 bytes.
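
A crude way to see a payload cap like that in action is to ping across the affected path with payloads either side of the limit (node001 is a placeholder node name; -s sets the ICMP data size in bytes):

ping -c 3 -s 256  node001    # below the 512-byte cap: replies come back
ping -c 3 -s 1024 node001    # above the cap: dropped while the dos-control setting is active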

Once we have an update on this issue we will post it up on the blog.

EGEE to EGI

We were recently asked to make sure that we were tagging our site as belonging to EGI rather than EGEE, since the latter project ended some time ago. This would typically involve changing a line entry in our site-info.def file and rerunning YAIM on the appropriate servers. However, as rerunning YAIM is a complete reconfiguration of a service, we decided to look into the exact alteration required, to keep the impact of the change low.

As of June 2011, using a glite installation, the information that is published through the site bdii is stored in the /opt/glite/etc/gip/ldif directory on each server (this would be different using an EMI installation). The exact files that are in that directory depend on the type of service that is publishing, but in this case we're interested in the glite-info-site.ldif file which is on the site bdii itself. We have (or had) 3 entries mentioning EGEE:

GlueSiteOtherInfo: EGEE_ROC=UK/I
GlueSiteOtherInfo: EGEE_SERVICE=prod
GlueSiteOtherInfo: GRID=EGEE
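
(For anyone checking their own site, a one-line grep over that directory is enough to turn these up:)

grep -r EGEE /opt/glite/etc/gip/ldif/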

Of these, we have updated

GlueSiteOtherInfo: GRID=EGEE

to

GlueSiteOtherInfo: GRID=EGI

and restarted the site bdii. After a small wait for the update to appear, we are now appropriately tagged as belonging to EGI as opposed to EGEE. Discussions are now underway as to the appropriate values for the other two variables.
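
To check the published value directly, rather than waiting for the top-level BDIIs to pick it up, the site bdii can be queried over LDAP; a sketch, with placeholder host and site names:

ldapsearch -x -LLL -h sitebdii.example.ac.uk -p 2170 \
  -b "mds-vo-name=YOUR-SITE-NAME,o=grid" \
  '(objectClass=GlueSite)' GlueSiteOtherInfo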

In the site-info.def file itself (which should be updated to make sure that a future run of YAIM on the site BDII does not reverse this change) the corresponding change in our case is:

SITE_OTHER_GRID="EGEE|WLCG|SCOTGRID|GRIDPP"

to

SITE_OTHER_GRID="EGI|WLCG|SCOTGRID|GRIDPP"

For more information see https://wiki.egi.eu/wiki/MAN01