Monday, June 25, 2007

Tails and Spikes

Tony and I have been trying to draft a policy on killing off jobs that fail to start properly, so I pulled some stats out of our local accounting MySQL database and plotted a histogram of job efficiencies. It is a very interesting plot: a clear "decay" down from high efficiency into a long tail, then a significant spike of very low efficiency jobs (< 0.02).
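For anyone wanting to make a similar plot, here is a rough sketch of the binning step. It assumes efficiency means CPU time divided by wall-clock time, and that each job has already been pulled out of the accounting database as a (cpu_seconds, wall_seconds) pair; the function names and the 0.02 bin width are illustrative choices, not our actual script.

```python
def efficiency(cpu_seconds, wall_seconds):
    """Assumed definition: CPU time / wall-clock time, clamped to [0, 1]."""
    if wall_seconds <= 0:
        # No recorded wall time: treat the job as having done no useful work.
        return 0.0
    return min(max(cpu_seconds / wall_seconds, 0.0), 1.0)


def efficiency_histogram(jobs, n_bins=50):
    """Count jobs per efficiency bin.

    jobs   -- iterable of (cpu_seconds, wall_seconds) pairs (hypothetical layout)
    n_bins -- number of equal-width bins over [0, 1]; 50 gives 0.02-wide bins
    """
    counts = [0] * n_bins
    for cpu_seconds, wall_seconds in jobs:
        eff = efficiency(cpu_seconds, wall_seconds)
        # An efficiency of exactly 1.0 goes in the last bin.
        index = min(int(eff * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    sample = [(0, 3600), (36, 3600), (1800, 3600), (3600, 3600)]
    for i, count in enumerate(efficiency_histogram(sample)):
        if count:
            print(f"{i / 50:.2f}-{(i + 1) / 50:.2f}: {count}")
```

With 0.02-wide bins, the spike of near-idle jobs all lands in the first bin, which makes it easy to count separately from the tail.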

Jeremy said that Dario was quite sanguine about killing these sorts of jobs off - things which fail to consume CPU after 6 hours are probably never going to get anywhere.
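As a sketch of what such a rule might look like, the check below flags a job once it has been running for at least 6 hours with an efficiency under 0.02. Both thresholds are assumptions lifted from the numbers above (the 6 hour figure and the edge of the low-efficiency spike), not an agreed policy, and `is_stalled` is a made-up name.

```python
SIX_HOURS = 6 * 3600  # seconds


def is_stalled(cpu_seconds, wall_seconds,
               min_wall_seconds=SIX_HOURS, max_efficiency=0.02):
    """Return True if a job looks like it never really started.

    A job is only judged once it has been running for min_wall_seconds;
    younger jobs are given the benefit of the doubt.
    """
    if wall_seconds < min_wall_seconds:
        return False
    return cpu_seconds / wall_seconds < max_efficiency


if __name__ == "__main__":
    print(is_stalled(10, 7 * 3600))        # idle for 7 hours
    print(is_stalled(10, 3600))            # too young to judge
    print(is_stalled(6 * 3600, 7 * 3600))  # busy job
```

Keeping the two thresholds as parameters means they can be tuned against the histogram before anything is actually killed.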

However, it turns out this is a bit of a can of worms. The RB (Resource Broker) will resubmit the job (up to 3 times), and the same thing might happen again at a different site. Then again, jobs that run out of wall clock time already look the same to the user, and the RB resubmits those too! If we do kill off jobs, should we email the user? Is this scalable in terms of our time? How much information do we provide to the end user? Will they even care?

It will be an interesting discussion.
