As the cluster has been really full recently, I have reduced the maxwallclock available to grid VOs from 148 to 100 hours. The maxcpu time stays the same at 96 hours. I'm growing very frustrated with jobs which just stall at the start: we had 9 atlas jobs which consumed 1s of CPU time in 9 hours, hanging on an lcg-cp.
I also increased the maxcpu and wallclock on the gridpp queue to 168 hours, to make sure that Swetha's bio jobs run through ok - 96 hours was probably too close to the wire. We can cut our local users a bit more slack as their jobs, when they do run, tend to be almost 100% efficient.
I had to reduce the maxprocs on the glee queue to 400. We can't really afford to get the whole cluster filled with EE jobs, as their maxcpu/wallclock is so high at 28 days that a full cluster of them would completely mess up fairsharing.
We've been suffering this weekend from two very large job surges from pheno and glee. As these groups have a large, but underused, fairshare, they get to start an awful lot of jobs in a short time. As those jobs then run for a very long time, very few job slots come free and the cluster can't run anything for anyone else. At 100 hr across 560 job slots, a slot would come free about every 11 minutes if job starts were evenly spread, but the job inflow, when local users are involved, is far from uniform.
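The 11-minute figure can be checked with a quick back-of-the-envelope sketch. It assumes that 560 is the total number of job slots, that every slot runs a job for the full wallclock limit, and that job starts are evenly staggered; the function name is just for illustration.

```python
def mean_minutes_between_free_slots(max_wallclock_hours: float, job_slots: int) -> float:
    """Average gap between slots coming free.

    Assumes every slot holds a job for the full wallclock limit and that
    job start times are spread evenly (the idealised best case).
    """
    return max_wallclock_hours * 60 / job_slots


# New 100-hour limit versus the previous 148-hour limit, on 560 slots (assumed).
new_gap = mean_minutes_between_free_slots(100, 560)
old_gap = mean_minutes_between_free_slots(148, 560)

print(f"100 hr limit: one slot every {new_gap:.1f} min")
print(f"148 hr limit: one slot every {old_gap:.1f} min")
```

This is only the uniform case. A surge that starts hundreds of long jobs at once means those slots all come free together much later, with almost nothing in between.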
I would like to move the *sgm jobs from atlas and lhcb into the dteam/ops reserved job slot. I will have to ask Sam how to do this.