Monday, October 01, 2007

Torque Queue ACLs

I got an email from Rod Walker on Friday night. He was having trouble submitting to the ATLAS queue on the cluster - once again with the infamously unhelpful "Unspecified gridmanager error".

I checked his mapping, and LCMAPS was correctly mapping him to one of the new atlas production accounts. However, when I looked at the queue, the torque queue configuration had lost the ACL which allowed sgm and prd accounts to submit to it.

I corrected that and all was well again.
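
For reference, a fix of this kind is normally applied with qmgr. The snippet below is only a sketch of how the directives could be generated, not the exact commands I ran: it assumes the queue uses group ACLs (torque's acl_group_enable and acl_groups attributes), and the queue and group names are made up for illustration.

```python
def acl_commands(queue, groups):
    """Build qmgr directives that enable group ACLs on a queue and add groups.

    "+=" is used for every group so that existing ACL entries are kept
    rather than overwritten.
    """
    cmds = [f"set queue {queue} acl_group_enable = True"]
    for group in groups:
        cmds.append(f"set queue {queue} acl_groups += {group}")
    return cmds


# Hypothetical queue and group names, for illustration only.
for directive in acl_commands("atlas", ["atlassgm", "atlasprd"]):
    print(f'qmgr -c "{directive}"')
```

Each printed line can then be checked by eye before being run on the torque server.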

I think these ACLs were probably never correctly set on the cluster, as they seemed to be missing on most of the queues (the ops queue, notably, was not affected). The cfengine script to set up the queues had the correct ACL setup in it, but I guess it had never been run.
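
A check like this could be scripted rather than done by eye. Here is a rough sketch, assuming the queue configuration has been dumped as text in the "create queue" / "set queue" form that qmgr prints, and that the ACLs are group ACLs. The sample dump, the group names and the function name are all invented for the example.

```python
import re


def missing_acl_groups(config_text, required):
    """Return {queue: [missing groups]} for queues lacking required ACL groups.

    config_text: queue settings in qmgr's "set queue ..." text form.
    required: mapping of queue name to the set of groups it should allow.
    """
    found = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        created = re.match(r"create queue (\S+)", line)
        if created:
            found.setdefault(created.group(1), set())
            continue
        acl = re.match(r"set queue (\S+) acl_groups \+?= (\S+)", line)
        if acl:
            found.setdefault(acl.group(1), set()).add(acl.group(2))
    report = {}
    for queue, wanted in required.items():
        missing = wanted - found.get(queue, set())
        if missing:
            report[queue] = sorted(missing)
    return report


# Invented sample dump: the atlas queue lacks its sgm and prd groups.
SAMPLE = """
create queue atlas
set queue atlas queue_type = Execution
set queue atlas acl_group_enable = True
set queue atlas acl_groups = atlas
create queue ops
set queue ops acl_group_enable = True
set queue ops acl_groups = ops
set queue ops acl_groups += opssgm
"""

REQUIRED = {
    "atlas": {"atlas", "atlassgm", "atlasprd"},
    "ops": {"ops", "opssgm"},
}

print(missing_acl_groups(SAMPLE, REQUIRED))
```

Run periodically, something like this would flag a queue that has lost its ACLs before a user has to email about it.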

The effect on ATLAS services was notable - we ran an awful lot more ATLAS production this weekend (blue jobs).

