Well, we now know how much it takes to kill our CREAM instance. Yesterday it stopped working completely and appeared to be caught in a tailspin with the Lease and Proxy Renew processes within CREAM. Grepping the logs indicated that most of the Renewal and Lease Manager entries were related to Condor submission from ATLAS.
Speaking to Massimo at INFN, I learned that Proxy and Lease renewals are operations which are executed with higher priority than other commands. One hypothesis is that the CREAM CE was so overloaded with these commands that it was unable to deal with basic job submission, since none of the test jobs I submitted made it out of the REGISTERED state.
It looked bad on Ganglia:
The first course of action was to disable job submission using the command line tool glite-ce-disable-submission and try to deal with the renewals. This worked for a time, but the problem reoccurred later that evening.
The timestamps on these ATLAS CREAM jobs seemed to be very old and hinted at stale jobs, so the next course of action was to manually purge the database using the tool provided by the CREAM developers: here. The easiest way I could see to do this was to connect to the creamdb, select out the IDs and create a script that called the purger for each ID. Note: you need JDK 1.6 in order to run the purger!
This ended up removing around 3000 CREAM entries.
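For anyone wanting to repeat the per-ID purge, here is a rough Python sketch of the "create a script that calls the purger for each ID" step. It is not the script I actually used. It assumes the job IDs have already been selected out of the creamdb into a text file, one per line. The command name ./purger.sh and the file names are hypothetical placeholders, so substitute whatever invocation the purger from the CREAM developers really needs.

```python
import shlex
import sys

# Hypothetical placeholder: replace with the real purger invocation
# (the command name and any arguments here are assumptions, not the tool's API).
PURGER_CMD = ["./purger.sh"]


def build_purge_script(job_ids, purger_cmd=PURGER_CMD):
    """Return shell script text that calls the purger once per job ID."""
    lines = ["#!/bin/sh"]
    seen = set()
    for raw in job_ids:
        job_id = raw.strip()
        # Skip blank lines and IDs we have already emitted.
        if not job_id or job_id in seen:
            continue
        seen.add(job_id)
        # Quote every part so an odd-looking ID cannot break the shell line.
        parts = list(purger_cmd) + [job_id]
        lines.append(" ".join(shlex.quote(p) for p in parts))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reads one job ID per line on stdin, writes the script to stdout.
    sys.stdout.write(build_purge_script(sys.stdin))
```

Something like `python make_purge.py < ids.txt > purge_all.sh` would then give a script you can read through before running it, which is worth doing when you are about to delete a few thousand entries.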
Ganglia looked much happier:
So I think you have to be careful when accepting submissions from Condor at the moment, as it looks to be quite easy to cause a denial of service on your CREAM CE.
Roll on CREAM 1.6
- That proxy renewal is not very efficient in the release now in production (already addressed in the coming CREAM CE: see here)
- When there are too many pending commands, new job submissions will be disabled by the limiter: see here