Wednesday, March 12, 2008

Another ECDF/Grid requirement mismatch.

While ECDF is, in principle, functional and capable of running jobs, this is a bit useless if no-one can see if you're doing it. So, in the face of APEL accounting still not working for the cluster, I had another look.

There were two problems:
Firstly, the SGE accounting parser was looking in the wrong directory for the SGE accounting logs. This fails silently with "no new records found", so I hadn't noticed before. The configured location was correct when I set the thing up, but the mount point has been moved since, with no indication that anything was wrong. (As the CE is not on the same box as the SGE Master, we export the SGE accounting directory over NFS to the CE so that it can parse the logs.)

Secondly, after I fixed this...
It turns out that the APEL java.lang.OutOfMemoryError strikes again for ECDF.
The ECDF systems team configure the SGE accounting to roll over accounting logs on a monthly basis. Unfortunately, this leads to rather large accounting files:
# ls --size /opt/sge/default/common/acc* --block-size=1M
1543 /opt/sge/default/common/accounting

(Yes, gentlemen and ladies, that is a one and a half gig accounting file... and we're only half-way through the month. The archived accounting logs tip the scales at around a quarter to half a gig each, but that is compressed, and they compress rather efficiently: the "true" size is up to 10x larger.)

I suspect the next step is to arrange some way of chopping the accounting file into bite-sized chunks that the APEL log parser is capable of swallowing.
The irony is that we already parse the accounting logs internally using a thing called ARCo - I've not seen any indication that it would be easy to get APEL to understand the resulting database, though.
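
A minimal sketch of the chopping idea, in Python. It assumes the accounting file holds one record per line, so that cutting only at line boundaries keeps every record intact. The function name split_accounting, the chunk file naming and the max_lines default are all made up for illustration, and whether the APEL parser will then accept the pieces is untested.

```python
import os


def split_accounting(src_path, out_dir, max_lines=100_000):
    """Split src_path into files of at most max_lines lines each.

    Returns the list of chunk paths, in order. Files are handled in
    binary mode so the bytes of each record are copied unchanged.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None
    count = 0
    with open(src_path, "rb") as src:
        for line in src:
            # Start a new chunk at the first line, or when the current one is full.
            if out is None or count >= max_lines:
                if out is not None:
                    out.close()
                # Hypothetical naming scheme: accounting.0000, accounting.0001, ...
                path = os.path.join(out_dir, "accounting.%04d" % len(chunk_paths))
                chunk_paths.append(path)
                out = open(path, "wb")
                count = 0
            out.write(line)
            count += 1
    if out is not None:
        out.close()
    return chunk_paths
```

Reading line by line means the one and a half gig file never has to sit in memory at once, which is rather the point. Each chunk could then be handed to the parser separately, assuming it copes with being pointed at several smaller files.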
