Wednesday, April 02, 2008

Only one bite at the cherry...

I have modified the default RetryCount on our UIs so that it is now set to zero retries. Automatic retries worked quite well for us when we were losing a lot of nodes to MCE errors (in the days before the upgrade to SL4, x86_64): users' jobs would automatically rerun if they got lost, and there was no need for them to worry about failures. Recently, however, we have seen users submitting more problematic jobs to the cluster: some fail to start at all, some run into wallclock limits, and others stall halfway through. Often we have to gut the batch system with our special spoon, and having to do it four times because the RB/WMS keeps resubmitting the job is less than helpful.

For once cfengine's editfiles stanza was useful and a simple:

ui::
    { /opt/glite/etc/glite_wmsui_cmd_var.conf
        ReplaceFirst "RetryCount\s+=\s+[1-9];" With "RetryCount = 0;"
    }
    { /opt/edg/etc/edg_wl_ui_cmd_var.conf
        ReplaceFirst "RetryCount\s+=\s+[1-9];" With "RetryCount = 0;"
    }

got the job done.
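
To see what that regular expression does and does not catch, here is a minimal Python sketch that applies the same pattern to a few lines. It is only an illustration, not part of the cfengine setup: the helper name zero_retry_count and the sample lines are made up for the example, and it assumes the configuration file holds a single RetryCount setting on one line.

```python
import re

# Same pattern as in the editfiles stanza: a single non-zero digit followed by ';'.
RETRY_PATTERN = re.compile(r"RetryCount\s+=\s+[1-9];")


def zero_retry_count(line: str) -> str:
    """Replace the first RetryCount setting of 1-9 on a line with zero.

    Hypothetical helper for illustration only.
    """
    return RETRY_PATTERN.sub("RetryCount = 0;", line, count=1)


# Hypothetical sample lines, not copied from a real UI configuration file.
samples = [
    "RetryCount = 3;",      # rewritten
    "RetryCount   =   1;",  # rewritten, extra whitespace is allowed
    "RetryCount = 0;",      # already zero, left alone
    "RetryCount = 10;",     # two digits, the pattern does not match
    "RetryCount=3;",        # no whitespace around '=', the pattern does not match
]

for sample in samples:
    print(repr(sample), "->", repr(zero_retry_count(sample)))
```

Note that a two-digit value, or a setting written without spaces around the equals sign, would slip past the pattern unchanged, so it is worth checking the files after the first cfengine run.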
