Saturday, January 05, 2008

Happy New Year ScotGrid - now with added ECDF...

Well, we didn't get it quite as a Christmas present, but the combined efforts of the ScotGrid team have managed to get ECDF green for New Year.

In the week before Christmas Greig and I went through a period of intensive investigation as to why normal jobs would run, but SAM jobs would not. Finding that jobs which fork a lot, like SAM jobs, would fail was the first clue. However, it turned out that the problem was not a fork or process limit, but a limit on the virtual memory size. SGE can set a VSZ limit on jobs, and the ECDF team have set this to 2GB, which is the amount of memory they have per core. Alas, for jobs which fork, virtual memory is a huge overestimate of their actual memory usage (my 100 child python fork job registers ~2.4GB of virtual memory, but uses only 60MB of resident memory). That's roughly a 40-fold overestimate of memory usage!

As SAM jobs do fork a lot, they hit this 2GB limit and are killed by the batch system, leading to the failures we were plagued by.
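To make the gap between virtual and resident memory concrete, here is a minimal sketch of how the two figures could be totted up for a set of forked processes. It assumes a Linux node where /proc/<pid>/status exposes VmSize and VmRSS in kB; the function names are made up for illustration, and this is not the script used for the measurement above. Note that summing per-process figures counts pages shared between parent and children once per process, which is exactly why a forking job looks so much bigger than it is.

```python
import re

# Matches the two memory lines of interest in /proc/<pid>/status.
_VM_LINE = re.compile(r"(VmSize|VmRSS):\s+(\d+)\s+kB")


def parse_vm_status(status_text):
    """Return (vsz_kb, rss_kb) parsed from the text of a /proc/<pid>/status file."""
    fields = {}
    for line in status_text.splitlines():
        match = _VM_LINE.match(line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    # Kernel threads have no VmSize/VmRSS lines, so default to zero.
    return fields.get("VmSize", 0), fields.get("VmRSS", 0)


def read_status(pid):
    """Read /proc/<pid>/status for a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return handle.read()


def total_memory(status_texts):
    """Sum VSZ and RSS (in kB) over several processes, e.g. a parent and its children."""
    total_vsz = total_rss = 0
    for text in status_texts:
        vsz, rss = parse_vm_status(text)
        total_vsz += vsz
        total_rss += rss
    return total_vsz, total_rss


def overestimate_factor(vsz, rss):
    """How many times larger the virtual size is than the resident size."""
    return vsz / rss
```

With the figures quoted above (about 2.4GB virtual against 60MB resident), overestimate_factor(2400, 60) gives 40.0.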

A workaround, suggested by the systems team, was to submit ops jobs to the ngs queue, a special short-running test queue (15 min wall time) which has no VSZ limit on it.

Greig modified the information system to publish the ngs queue and ops jobs started to be submitted to this queue on the last day before the holidays.

Alas, this was not quite enough to get us running. We didn't find out until after New Year that we also needed to specify a run time limit of 15 minutes on the jobs and submit them to a non-standard project. The last step required me to hack the job manager in a frightful manner, as I really couldn't fathom how the Perl (yuk!) job manager was supposed to set the project; in fact, even though project methods existed, they didn't seem to emit anything into the job script.
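For illustration only, the lines that needed to end up in the job script look something like the output of the sketch below. It relies on SGE's embedded "#$" directives with the standard qsub options -q (queue), -l h_rt (hard run time limit) and -P (project). The function name and the project name "example_project" are placeholders, since the real ECDF project name is not given here, and this is Python rather than the Perl hack actually made to the job manager.

```python
def sge_directives(queue, runtime_minutes, project):
    """Build the '#$' header lines that ask SGE for a queue, a run time limit and a project."""
    hours, minutes = divmod(runtime_minutes, 60)
    return [
        f"#$ -q {queue}",                            # target queue
        f"#$ -l h_rt={hours:02d}:{minutes:02d}:00",  # hard run time limit, hh:mm:ss
        f"#$ -P {project}",                          # project to charge the job to
    ]


if __name__ == "__main__":
    # 'example_project' is a made-up name standing in for the real project.
    print("\n".join(sge_directives("ngs", 15, "example_project")))
```

Lines like these sit at the top of the job script, before any commands, so that SGE picks them up when the script is submitted.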

Finally, with that hack made this morning, ECDF started to pass SAM tests. A long time a coming, that one.

The final question, however, is what to do about this VSZ limit. The various wrappers and accoutrements which grid jobs bring mean that before a line of user code runs there are about 10 processes running, and 600MB of VSZ has been grabbed. This is proving to be a real problem for local LHCb users, because Ganga forks a lot and also gets killed off. Expert opinion is that VSZ limits are just wrong.

We have a meeting with the ECDF team, I hope, in a week, and this will be our hot topic.

Big thanks go to Greig for a lot of hard work on this, as well as Steve Traylen, for getting us on the right track, and Kostas Georgiou, for advice about the perils of VSZ in SGE.
