

From our two phenogrid DoS attacks, it seems that the maximum number of queued jobs the system can cope with is about 2500. Beyond that the system slides into a crisis: it runs out of CPU with too many gatekeeper processes active, and a context-switch storm starts, from which the system can rarely recover spontaneously.
So I have set the max_queuable parameter to 1000 on every queue, which seems a reasonable limit for any single VO or queue.
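As a rough illustration of applying that cap everywhere, the sketch below prints one qmgr command per queue. The queue names are hypothetical placeholders, and the attribute is written with Torque's documented spelling, `max_queuable`; treat it as a starting point rather than the exact commands run here.

```python
# Hypothetical queue names; substitute the queues defined on your own server.
QUEUES = ["pheno", "atlas", "dteam"]
MAX_QUEUABLE = 1000  # per-queue cap described in the post


def qmgr_commands(queues, limit):
    """Return one qmgr invocation per queue that sets its max_queuable cap."""
    return [f'qmgr -c "set queue {q} max_queuable = {limit}"' for q in queues]


if __name__ == "__main__":
    # Print the commands for review instead of executing them directly.
    for cmd in qmgr_commands(QUEUES, MAX_QUEUABLE):
        print(cmd)
```

Printing the commands rather than running them leaves a chance to check the queue list before anything changes on the server.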
It seems to be a limitation of Torque that there is no equivalent global cap on queued jobs (at 2500, for instance); the limit can only be set per queue.
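To see why a global cap would still be useful, here is a small back-of-the-envelope check. It assumes the worst case in which every queue fills to its cap at the same time; the function names are made up for illustration.

```python
GLOBAL_LIMIT = 2500   # approximate number of queued jobs the system coped with
PER_QUEUE_CAP = 1000  # per-queue cap set on every queue


def worst_case_queued(n_queues, per_queue_cap=PER_QUEUE_CAP):
    """Total queued jobs if every queue is filled up to its cap."""
    return n_queues * per_queue_cap


def exceeds_global_limit(n_queues, per_queue_cap=PER_QUEUE_CAP,
                         global_limit=GLOBAL_LIMIT):
    """True if the per-queue caps alone cannot keep the total under the limit."""
    return worst_case_queued(n_queues, per_queue_cap) > global_limit


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, worst_case_queued(n), exceeds_global_limit(n))
```

With a cap of 1000 per queue, two full queues stay under 2500 but three do not, so the per-queue limit only bounds the total if few queues fill up simultaneously.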