Paul has installed MonAMI onto our DPM, which has been very useful (and will become more so when we get nagios running again). However, we started to report zero storage over the weekend, which was tracked down to MySQL running out of connections (as DPM doesn't have a monitoring API, we have to query the database directly, which is not ideal). When I looked in detail I (eventually) found that MonAMI had eaten all of the MySQL connections by swallowing sockets.
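For context, getting space figures out of DPM means going straight at the dpm_db database. The sketch below is roughly the sort of query involved; the credentials and account name are placeholders, and it assumes the dpm_fs table carries capacity and free columns (names may differ between DPM versions):

    import MySQLdb

    # Placeholder credentials; in practice use a read-only account.
    conn = MySQLdb.connect(host="localhost", user="dpm_reader",
                           passwd="secret", db="dpm_db")
    try:
        cur = conn.cursor()
        # Total and free space per pool, summed over its filesystems.
        cur.execute("SELECT poolname, SUM(capacity), SUM(free) "
                    "FROM dpm_fs GROUP BY poolname")
        for poolname, capacity, free in cur.fetchall():
            print("%s: %s bytes total, %s bytes free" % (poolname, capacity, free))
    finally:
        conn.close()  # always hand the connection back, even on error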
Paul is investigating and seems to have found at least one place where connections could leak (although it's not yet clear what triggered it).
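MonAMI itself is C, so the details will differ, but the general shape of this kind of leak is easy to sketch: if something goes wrong between opening a connection and closing it, and the error path doesn't clean up, that connection sits allocated on the MySQL server until wait_timeout eventually reaps it. An illustration (not MonAMI's actual code), with the fixed version alongside:

    import MySQLdb

    def query_leaky(sql):
        conn = MySQLdb.connect(host="localhost", user="monami",
                               passwd="secret", db="dpm_db")
        cur = conn.cursor()
        cur.execute(sql)       # if this raises, close() is never reached...
        rows = cur.fetchall()
        conn.close()           # ...and the connection lingers until wait_timeout
        return rows

    def query_fixed(sql):
        conn = MySQLdb.connect(host="localhost", user="monami",
                               passwd="secret", db="dpm_db")
        try:
            cur = conn.cursor()
            cur.execute(sql)
            return cur.fetchall()
        finally:
            conn.close()       # runs whether the query succeeded or not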
However, even stopping MonAMI at 11pm last night didn't entirely resolve the situation. At some point in the early hours MySQL seemed to have run out of connections again. This caused some of the DPM threads to go mad, writing to disk as fast as they could. By 6am there was a 2.5GB DPM log file and / was full. Yikes.
This morning I had to stop all of DPM and MySQL, move the giant logfile out of the way, and then do a restart.
Paul will try the fix soon, but this time we'll keep a much closer eye on things.
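One simple check worth having (and something to hook into nagios once it's back) is to compare the number of live connections against max_connections. A rough sketch, again with placeholder credentials and an arbitrary warning margin:

    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="root", passwd="secret")
    try:
        cur = conn.cursor()
        cur.execute("SHOW STATUS LIKE 'Threads_connected'")
        connected = int(cur.fetchone()[1])
        cur.execute("SHOW VARIABLES LIKE 'max_connections'")
        max_conn = int(cur.fetchone()[1])
        print("MySQL connections: %d of %d in use" % (connected, max_conn))
        if connected >= max_conn - 5:   # arbitrary warning threshold
            print("WARNING: nearly out of MySQL connections")
    finally:
        conn.close()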
I believe we should also make sure that, in future, /var/log on the servers is a separate, large partition. Although we have enough space in / during normal running, clearly an abnormal situation can fill things up pretty quickly, and running out of space on the root file system is not desirable!
2 comments:
Ah well, I guess all software has its problems. The problem is diagnosed and fixed now. A new RPM will be released imminently.