Wednesday, February 18, 2009

Draining an lcg-CE

Well, draining an lcg-CE should be fairly straightforward if you have n CEs and n separate queues. However, in Glasgow's case we have two CEs that share queues, so it is just not as simple as running qmgr -c 'set queue enabled=false' on torque/maui, as this would have put both CEs into drain!

After much playing around, breaking the site bdii and getting Graeme's help to fix it, the only way I can see to do this is to hack the gip plugins on the CE you wish to drain. The plugin in question appeared to be /opt/glite/etc/gip/plugin/glite-info-dynamic-ce, which subsequently called /opt/lcg/libexec/lcg-info-dynamic-pbs

This script contains the dynamic qstat queries that determine the state of the queues. So, to drain a specific CE where queues are shared, one possible solution is to hack this file and change the line:

push @output, "GlueCEStateStatus: $Status\n";

to force the drain with:

push @output, "GlueCEStateStatus: Draining\n";
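That edit can be applied as a quick shell sketch - the path is the one given above, and the backup copy is there so the CE can be un-drained by restoring the original script:

```shell
# Sketch only: force the info provider to report Draining for every queue.
PLUGIN=/opt/lcg/libexec/lcg-info-dynamic-pbs
cp "$PLUGIN" "$PLUGIN.orig"   # keep a copy so the drain can be reverted later
sed -i 's/GlueCEStateStatus: \$Status/GlueCEStateStatus: Draining/' "$PLUGIN"
```

Restoring $PLUGIN.orig over the edited file puts the CE back into production once the drain is finished.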

This worked, as the LDAP query to svr027 now showed 4 queues in drain:

svr021:/opt/glite/etc/gip/ldif# ldapsearch -xLLL -b mds-vo-name=UKI-SCOTGRID-GLASGOW,o=grid -p 2170 -h svr027.gla.scotgrid.ac.uk | grep Dra
GlueCEStateStatus: Draining
GlueCEStateStatus: Draining
GlueCEStateStatus: Draining
GlueCEStateStatus: Draining

and glite-wms-job-list-match no longer displays the drained CE's queues as available through the WMS:

-bash-3.00$ glite-wms-job-list-match -a --vo vo.scotgrid.ac.uk hello.jdl
Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
==========================================================================
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found:

*CEId*
....
- svr026.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q1d
- svr026.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q2d
- svr026.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q30m
- svr026.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q6h
....
==========================================================================

However, this does not stop direct job submission via Globus. After much playing around we found that if you edit the hosts.equiv file on the Torque server, you can stop job submission from the desired CE while still allowing running jobs to finish. Handy, that - it's just what we need, as we were seeing some users still submitting directly even though we were trying to drain them.
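For the record, a minimal sketch of what hosts.equiv on the Torque server ends up looking like, assuming our two CE hostnames (svr021 being drained, svr026 staying active) - pbs_server only accepts qsub from hosts listed in this file, so dropping the drained CE's entry blocks new submissions:

```
svr026.gla.scotgrid.ac.uk
```

The submitting host is only checked at qsub time, so jobs already queued or running on the batch system are unaffected - which is exactly why this drains gracefully. Re-adding the svr021 line brings that CE back.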
