Gosh, I'm tickled pink at how well MonAMI coped with Glasgow's job storm last Thursday.
To put this in context, we had a CE running at a constant 100% CPU usage: 50% in the kernel (context-switching) and 50% in user-land (running Perl scripts). The ssh daemon was no longer working properly: the ssh client would (almost always) time out because the ssh server took too long to fork. The machine's 1-minute load average peaked at around 300!
All in all, this was an unhappy computer.
Despite all this, MonAMI just kept on going. As matters got progressively worse, it took longer and longer to gather data, particularly from Maui: from its normal value of less than a second, the Maui acquisition time peaked at around 15 minutes. Torque fared better, peaking at around 30 seconds (still far longer than normal).
Despite this, MonAMI didn't flood Torque or Maui. It only ever issued one request at a time and enforced a one-minute gap between successive requests. MonAMI also altered its output to Ganglia to compensate for taking 15 times longer than normal, which prevented Ganglia from (mistakenly) purging the metrics.
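For the curious, the behaviour described above boils down to two simple ideas: serialise the polling (one outstanding request per target, with a fixed gap between requests) and stretch the metric lifetime announced to Ganglia when acquisition runs slow, so the metric isn't purged. Here's a minimal sketch of that pattern in Python rather than MonAMI's own code; the polled command, metric name, output parsing and the gmetric invocation are all placeholders of my own, not anything taken from MonAMI.

    import subprocess
    import time

    POLL_GAP = 60          # minimum seconds between successive requests to one target
    BASE_TTL = 180         # metric lifetime (dmax) announced when acquisition is fast

    def acquire(command):
        """Run one acquisition (e.g. querying Maui or Torque) and time how long it took."""
        start = time.monotonic()
        output = subprocess.run(command, capture_output=True, text=True, check=False).stdout
        return output, time.monotonic() - start

    def publish(name, value, ttl):
        """Push one metric to Ganglia via the gmetric CLI, announcing a lifetime (dmax)
        long enough that slow acquisition doesn't get the metric purged."""
        subprocess.run(["gmetric", "--name", name, "--value", str(value),
                        "--type", "uint32", "--dmax", str(int(ttl))], check=False)

    def monitor(command, metric_name):
        """Poll one target serially: one request at a time, a fixed gap between
        requests, and a metric lifetime scaled to the observed acquisition time."""
        while True:
            output, took = acquire(command)
            jobs_waiting = output.count(" W ")          # placeholder parse of the job list
            ttl = max(BASE_TTL, int(3 * (took + POLL_GAP)))
            publish(metric_name, jobs_waiting, ttl)
            time.sleep(POLL_GAP)                        # never issue overlapping requests

The key point is the ttl calculation: if one acquisition takes 15 minutes, the announced lifetime grows with it, so Ganglia keeps the last known value instead of dropping the metric.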
So, although everything was running very, very slowly and it was difficult to log into the machine, the monitoring kept working and we have a record of what was happening to the jobs.
Incidentally, the failing ssh is why most (all?) of the jobs were going into wait-state: the worker-node mom daemons couldn't stage in, via scp, the files the jobs needed. This meant the WN couldn't accept the job, causing Torque (or Maui?) to reschedule it for some time in the future, putting the job into wait-state.