Monday, July 30, 2007

CIC Portal Reports


As I was on holiday on Friday, I tried to fill in the CIC portal report quite late on, only to find it had been locked.

Frankly the interface on the report section of the CIC is really rather rubbish (no aggregation, unreliable locking, difficult to review) and the 1 day time window has always been restrictive.

As it's now clear that the CIC "availability", where sites get to mark up failures as relevant/non-relevant/unknown, is a thing of the past (gridview will be used, warts and all...), the whole thing looks rather broken as a way for us to tell Jeremy and Phillipa about issues to report in the Ops meeting.

However, I checked the gridview page. Looks like a quiet week. Our one CE test failure was the infamous Globus 79...

Report: Quiet week. Ran lots of jobs ;-)

Monday, July 23, 2007

Job submission with DIRAC

In order to get some "real user" experience of performing physics analysis on the Grid, I have been doing a lot of reading and playing with the LHCb computing software. First of all, there's a lot of it so it takes a while to get to understand what each component does, how they can be linked together, how they are configured and built and how the applications can be run locally or on the Grid to do some real physics.

I was particularly interested in getting some basic jobs running on the Grid, so I quickly started playing with Ganga, the user interface for job configuration and submission. At first I was quite impressed. It was very simple to use Ganga to submit small jobs to the local system, CERN batch or the Grid (via the LHCb DIRAC workload management system). However, a few problems quickly appeared:

1. Jobs were continually failing on the Grid due to poorly configured software installations on the sites. Missing libraries were the main source of problems. It also seems that the latest version of Gauss, v30r3 (LHCb MC generation), is a bit broken due to a mis-configured path. These things weren't a problem with Ganga as such, but using it meant that another layer potentially had to be debugged.

2. I found bulk job submission very difficult in Ganga. Writing the python code to loop over the jobs is easy, but the client just couldn't handle the hundreds of jobs going through it. It became very slow and eventually just hung. Even just starting up the client is slow. Maybe running on a non-lxplus machine would be better. There were also inconsistencies between the Ganga job monitoring and that reported by DIRAC.

As an alternative, I decided to bypass Ganga and use the DIRAC API directly. This proved to be quite successful, being much faster for bulk submission. I put together some notes on this, which can be found here:

http://twiki.cern.ch/twiki/bin/view/Main/LHCbEdinburghGroupDIRACAPI

Using DIRAC didn't help with the site mis-configurations (although it is easy to get the job output and check the log files for problems), but I found it a more efficient way of working. I'll try again with Ganga once I understand better the problems that keep on appearing on the Grid.

From my brief foray into running jobs on the Grid, it appears that Ganga/DIRAC do insulate users from malfunctioning middleware; however, there are still real problems when it comes to poorly installed software on the sites. From a deployment point of view, maybe this should be taken as encouragement, as the problem is at the application level and not so much with the middleware. I think we would need a more systematic study to find this out (much like Steve's ATLAS jobs).

What is needed is better testing of the sites through VO-specific SAM tests. This information then has to be fed back into DIRAC (or whatever) so that mis-configured sites can be ignored until their problems are resolved. Users will then find running jobs on the Grid a much easier and more pleasant experience.

Return of Globus Error 79...


Glasgow suffered a bit from the infamous Globus Error 79 last week - the one we think might be the unexplained gatekeeper identity error. In fact it was a bit of a flaky week altogether - Steve Lloyd's tests seemed to be suffering from some RB issues and most sites dropped into the 70% efficiency range last week.

Overall though, even the "bad" weeks are not so bad: as the gridview availability plot shows (remember this is a "warts and all" plot - no excuses or chances to mark things as non-relevant), we still seem to be 95% plus.

However, the gatekeeper error is a real pest and I still don't have a good way of even trying to get a handle on it. I checked in the gatekeeper logs, against some known error events of this type (David found these on 11 June), but alas there's no clear signature.

Holiday Time: Durham Survives


Phil's been away for the last three weeks, with responsibility for Durham falling between myself (to respond to tickets and advise on grid problems) and Lydia (on the ground to press the buttons). This has worked pretty well - we have managed to deal with helmsley locking up (week 3 in the graph) and needing a reboot, and a period of scheduled downtime (week 4, which revealed a problem in downtime synchronisation between SAM and the GOC).

A sterner test will come at Glasgow in a fortnight when Andrew and I are both on holiday and are more or less uncontactable. Time for icons and prayers?

ECDF Update

The procurement of our grid front end nodes for ECDF has been held up for about 2 weeks. Sam had tried to set up one of the old Edinburgh worker nodes as a trial CE to iron out any glitches in the gatekeeper scripts, job manager and accounting chain, but has hit an ACL in some router between ECDF and KB which prevents him from even being able to qsub.

The systems team are going to give him a requisitioned worker node to test with instead, which should happen this week - although as they are going live today and it's holiday time this has also been somewhat held up.

Friday, July 20, 2007

SAM Gets Downtime Wrong?



Durham were in downtime from Wednesday -> Thursday, but SAM thinks they were in downtime from Thursday -> Friday. Doubtless this happened because I made the initial mistake (out by 1 day!), then edited the downtime to bring it forward. But SAM did not pick up the change.

I've raised a GGUS ticket - after all I'm always editing downtimes!

Glasgow to Edinburgh Lightpath Approved

After messing about for ages with application forms and revised procedures, our application, when finally submitted to JANET, was approved within 24 hours! They are now conducting a technical feasibility study, but no news on how long that might take.

Update: JANET people say this study will take "the shorter end of 'a few weeks'", which is good news.

GridICE Ate My CPU...


After upgrading the CE yesterday, the CPU and load were rather high. Re-running YAIM had re-enabled the GridICE monitoring system, which had merrily decided to swallow an entire CPU itself.

When I switched it off, CPU load on the CE dropped from 70% to 20%. (See ganglia plot - the difference is pretty obvious.)

Although GridICE gives some interesting monitoring information and aggregation at the Grid level, it's a duplication of information elsewhere (like gstat) and consuming a whole CPU is absurd.

I put in a GGUS ticket about this, but for the moment GridICE is disabled on our CE.

Postscript:

I've added this to cfagent.conf:

processes:
ce::
"/opt/gridice/monitoring/bin/*" signal=term
"/opt/edg/sbin/edg-fmon-agent" signal=term

Thursday, July 19, 2007

Glasgow Updated to gLite 3.0.1 r27

Upgrade notes:

Basically I'm following https://www.gridpp.ac.uk:443/wiki/UKI-SCOTGRID-GLASGOW_enabling_VO, but also being aware that when services are updated, YAIM needs to be rerun.

Preparations
------------

Added new groups for sgm and prd pool accounts - even though these will not yet be enabled.

Modified the poolacct.py script - now much improved: it does the new type of accounts (but can still do the old type!) and also generates users.conf fragments as well.

Using this, added relevant entries to passwd, group, shadow and
users.conf.

Went through site-info.def again. Added supernemo stanzas and the
vo.d/supernemo.vo.eu-egee.org definitions.

All set!


Disk Servers
------------

Modify update.conf to clear cruft out of the system (purge=true)
Clear and restore yum.repos.d
yum update
Remove local config_mkgridmap function
Run /opt/glite/yaim/bin/yaim -r -s /opt/glite/yaim/etc/site-info.def -f config_mkgridmap
Remove MySQL-server, which had bizarrely been installed on some of the nodes

Noticed that config_lcgenv is broken for DNS VOs. Submitted ticket
24917, with patch. However, for us this is controlled by cfengine, so
defined appropriate things here: e.g.,
VO_SUPERNEMO_VO_EU_EGEE_ORG_DEFAULT_SE.


Worker Nodes
------------

Changed passwd/group files have already triggered home directory creation via cfengine, so there was no need to run YAIM.

Ran yum update on WNs (using pdsh). Removed the glite-SE_dpm_disk, DPM-gridftp-server and DPM-rfio-server RPMs - a relic of !disk037 (woops). The WNs will be blown away and rebuilt at the transition to SL4 anyway.


MON/Top Level BDII
------------------

Checked svr019 (MON + Top BDII). Nothing to update here. (Top level
BDII was done last week:
http://scotgrid.blogspot.com/2007/07/top-level-bdii-updated-to-glue-13yaim.html)


Site BDII/svr021
----------------

Did yum update then reran YAIM. Got errors:

SITE_SUPPORT_EMAIL not set
chown: failed to get attributes of `/opt/lcg/var/gip/ldif': No such file or directory
chmod: failed to get attributes of `/opt/lcg/var/gip/ldif': No such file or directory

???

An odd entry has also appeared:

GIP file:///opt/lcg/libexec/lcg-info-wrapper

which is not active on a standalone site BDII (would it work on a CE?).

Note that DPM not yet upgraded, so still polling GRIS on svr018 until
this is done.


UI / svr020
-----------

yum update
rerun yaim - discovered that I needed to define the supernemo queue to be snemo in $QUEUES (which is still used). This caused the gLite python to throw an exception (my mistake, but crap code nonetheless...)

Corrected site-info.def and reran YAIM.


CE / svr016
-----------

Checked the list of RPMs to update. The potentially dangerous one is vdt_globus_jobmanager_pbs (we have a patched pbs job manager). It seems that there are patches to the pbs jobmanager to support DGAS accounting. I have commented out the cfengine job manager replacement and will diff and repatch as necessary after configuration.

Created cdf and snemo queues using torque_queue_cfg script (note this
now adds access for sgm and prd groups, even if they are not used).
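Under the hood this is just a handful of qmgr calls; roughly the following sketch (queue and group names are illustrative - the real details live in torque_queue_cfg):

# rough sketch of what torque_queue_cfg sets up for a new VO queue
qmgr -c "create queue snemo queue_type=execution"
qmgr -c "set queue snemo acl_group_enable = true"
qmgr -c "set queue snemo acl_groups = snemo"
# new behaviour: the sgm and prd groups get access too, even if unused for now
qmgr -c "set queue snemo acl_groups += snemosgm"
qmgr -c "set queue snemo acl_groups += snemoprd"
qmgr -c "set queue snemo enabled = true"
qmgr -c "set queue snemo started = true"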

Ran YAIM.

Immediate sanity check:
GRIS is ok - information system is up.
Gatekeeper dead. Help! Restarted. globus.conf had been rewritten, blowing away the pbs job manager.
Re-enabled the pbs job manager and restarted. OK, so I have now moved this under the control of cfengine, but again the caveats about running YAIM apply.
Gatekeeper restarted again. Tailing the gatekeeper logs, everything looks ok. Whew!

Diffing the pbs and lcgpbs job managers, YAIM has added DGAS support for them. I used these as new template modules and repatched the "completed" job state (https://savannah.cern.ch/bugs/?7874). Helper.pm was unchanged, so it still has the correct patch for staging via globus.



svr023 / RB
-----------

Ran yum update. The actual RB has not changed, so did not run YAIM.


svr018 / DPM
------------

Upgraded YAIM to check on config_DPM_upgrade. Looks quite simple.

Booked downtime for 2pm to do this.

At 2pm: Stopped DPM
yum update
run config_DPM_upgrade YAIM function (updated db). Took ~8 minutes.
Start DPM again
run config_gip YAIM function (publish access details for cdf/supernemo in the info system)
run config_mkgridmap YAIM function (add additional certificates into the gridmap files)
run config_BDII YAIM function (redo the information system servers)
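For the record, the whole sequence boils down to something along these lines (a sketch only - the YAIM invocation style is the same as on the disk servers above; the exact DPM daemon/service names are from memory and may differ):

service dpm stop        # plus dpnsdaemon, srmv1/srmv2 and dpm-gsiftp as appropriate
yum -y update
/opt/glite/yaim/bin/yaim -r -s /opt/glite/yaim/etc/site-info.def -f config_DPM_upgrade
service dpm start       # and the other DPM daemons again
/opt/glite/yaim/bin/yaim -r -s /opt/glite/yaim/etc/site-info.def -f config_gip
/opt/glite/yaim/bin/yaim -r -s /opt/glite/yaim/etc/site-info.def -f config_mkgridmap
/opt/glite/yaim/bin/yaim -r -s /opt/glite/yaim/etc/site-info.def -f config_BDII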

Checked BDII was ok. It is.

Then went back to site BDII, changing URL to
BDII_DPM_URL="ldap://$DPM_HOST:2170/mds-vo-name=resource,o=grid" -
restarted site BDII. Checked ldap info was ok.
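(That check is just an ldapsearch against the DPM's resource BDII - roughly the following, assuming the head node's full hostname is svr018.gla.scotgrid.ac.uk:)

ldapsearch -x -H ldap://svr018.gla.scotgrid.ac.uk:2170 -b "mds-vo-name=resource,o=grid" | grep GlueSE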

Came up from downtime (took 15 minutes). Damn - we got SAM tested in the interval!


After Lunch
-----------

Wary of the APEL changes, I read Yves' notes at http://www.gridpp.ac.uk/wiki/GLite_Update_27. I couldn't see the same problems. Ran the APEL publisher on the CE and MON and things seem to be ok, so I let things lie here.

Finally able to lock the atlas pool down to atlas members! When I looked in Cns_groupinfo though, there had been rather an explosion of atlas groups:

mysql> select * from Cns_groupinfo where groupname like 'atlas%';
+-------+-----+-----------------------+
| rowid | gid | groupname             |
+-------+-----+-----------------------+
|     2 | 103 | atlas                 |
|    16 | 117 | atlas/Role=lcgadmin   |
|    17 | 118 | atlas/Role=production |
|    55 | 156 | atlas/lcg1            |
|    57 | 158 | atlas/usatlas         |
+-------+-----+-----------------------+

Hmmm, I have given the pool to all these gids. Is this really
necessary?

Done! Whew!

Wednesday, July 18, 2007

cfengine cruft

Updating the disk servers today, the older ones were coming up with bizarre errors and refusing to update themselves at all. Eventually I tracked this back to stale files in the cfengine cache on the disk servers themselves - old repo definitions which were in conflict with the newer mirrors.

I found that cfengine's copy stanza has a flag, purge, which needs to be set to remove files which are not present in the source. I have now set this in update.conf and the disk servers are busily crunching their way through the backlog of RPMs.

Tuesday, July 17, 2007

Load on CE

I wish I understood what caused these periods of increased load on the CE. They can happen during a long period of job submission into the gatekeeper, as jobmanagers are forked off and then wilt, but that wasn't the case here. Odd...

Thursday, July 12, 2007

Queues Cut Back

I have now cut all the queues on UKI-SCOTGRID-GLASGOW to 36 hours of CPU and wall. The exceptions are:

  • gridpp: Our bio user's code, which she didn't write, takes up to 6 days to run
  • glee: The engineers claim they need a 28 day queue - we will have to talk to them about that, because it's ridiculous.
  • dteam and ops: 6 hours - even that's a bit long...
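Mechanically this is just a couple of qmgr settings per queue, along these lines (atlas and dteam used as example queue names):

# 36 hour CPU and wallclock limits for a normal VO queue
qmgr -c "set queue atlas resources_max.cput = 36:00:00"
qmgr -c "set queue atlas resources_max.walltime = 36:00:00"
# dteam/ops get the short limit
qmgr -c "set queue dteam resources_max.cput = 06:00:00"
qmgr -c "set queue dteam resources_max.walltime = 06:00:00"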

Clearing Out The Gatekeeper Cruft

As part of the general clean up after this morning's pheno crisis, I cleaned out the stuck gatekeeper processes on the cluster. There were about 50 of these processes, all in "T" state. This is "traced or stopped" - and I presume stopped, as nothing would be tracing them. Most of them had an associated child in zombie state.

Wonder why that happens?
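Spotting them is easy enough with ps - a rough sketch (the match on the process name is just an assumption about what the stuck gatekeeper/jobmanager processes are called):

# list processes sitting in "T" (stopped) state, plus any zombie children
ps -eo pid,ppid,stat,etime,user,comm | awk '$3 ~ /^[TZ]/'
# then clear out the stopped globus processes
ps -eo pid,stat,comm | awk '$2 ~ /^T/ && $3 ~ /globus/ {print $1}' | xargs -r kill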

Pheno goes Bang!



Crisis on the cluster this morning. After a long night of job submission by a phenogrid user (putting in more than 1000 jobs) the cluster went into a spasm, where the pheno jobs started to hit wait state en masse. Then what I think happened was that as torque saw each pheno job hit wait, failing to start, it immediately picked the next pheno job, tried to start that, failed, tried to start the next, and so on. This resulted in a load storm within torque (loads >100), which was then not even able to answer normal client queries - so maui locked up and the gip plugin started to time out.

When I realised what was happening (and the pheno jobs were still coming in) I added the user's DN to the LCAS ban_users.db file. I then carried out some debugging tests, restarting maui, clearing out maui stats files, etc. In the end I saw no option but to qdel the user's waiting jobs, to attempt to take the pressure off torque.
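For the record, the two interventions look roughly like this (the ban_users.db path is from memory, and the DN and pool account name are placeholders):

# ban the DN at the gatekeeper via LCAS (path may vary by install)
echo '"/C=UK/O=eScience/OU=SomeSite/L=SomeDept/CN=some pheno user"' >> /opt/glite/etc/lcas/ban_users.db
# flush the user's waiting jobs out of torque to take the pressure off
qselect -u pheno001 -s W | xargs -r qdel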

Once the jobs were flushed out of the system, torque quite quickly started to recover. Maui started to respond again and the GIP plugin could get sensible answers.

Why were the jobs going into waiting state? The error the user seemed to be getting back was "Globus error 158: the job manager could not lock the state lock file." This seems to be an error which crops up when the job is being cancelled. There was a strange mix of jobs from this user - some with VOMS extensions, some vanilla proxy. Was this a problem with proxy renewal and the gatekeeper trying to cancel jobs which it no longer had the right to? The problem kicked in at almost exactly the time that the user's original submission proxy expired and the RB would have renewed it from the RAL MyProxy server. The wrong proxy might well also have affected the ability of the jobs to start - hence the wait crisis being sparked.
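(On the submission side, the quickest way to see which flavour of proxy is in play, and how long it has left, is something like the following - standard VOMS client tools assumed:)

# shows timeleft for the proxy; the VO/attribute (AC) lines only appear
# for a VOMS proxy - a vanilla proxy has no attributes at all
voms-proxy-info -all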

Once I was satisfied that the cluster was stable again, I took the user out of the banned list. Their jobs are now flowing back into the cluster - interestingly, all with the vanilla proxy.

I will keep a close eye on things and check that things don't go wrong again.

Postscript: VOMS proxy renewal is broken: http://savannah.cern.ch/bugs/?func=detailitem&item_id=15208

Tuesday, July 10, 2007

Multiple VO Woes

Steve Lloyd and I sat down after lunch today to try and get to the bottom of why his dteam-submitted jobs always fail. Strangely this seems to be an RB-specific problem: IC always works, Glasgow always fails and RAL seems to come and go.

Using the Glasgow RB we submitted a job to Edinburgh, so that we could trace things through the batch system. The job arrived at Edinburgh, and ran through the batch system. However, it continued to be considered by the RB as

Current Status: Scheduled
Status Reason: Job successfully submitted to Globus

Clearly this was not the case.

We had a good look through the logs on the RB, but there's no particular sign of things going wrong there - although it must be said that the logs are both dense and impenetrable.

When it became clear that there was no easy solution I decided to try and reproduce the problem myself. Now, recall I had joined gridpp a while ago to help our local users and never had any trouble. However, now I can't seem to get a single job running through as a gridpp member - even on the Glasgow cluster. And things are in fact even worse than for Steve, because my gatekeeper process dies almost instantly, so the job never even goes into the batch system:

grep 2007-07-10.14:49:10.0000028268.0000113028 /var/log/messages
Jul 10 14:49:16 svr016 gridinfo[9672]: JMA 2007/07/10 14:49:16 GATEKEEPER_JM_ID 2007-07-10.14:49:10.0000028268.0000113028 for /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart on 130.209.239.23
Jul 10 14:49:16 svr016 gridinfo[9672]: JMA 2007/07/10 14:49:16 GATEKEEPER_JM_ID 2007-07-10.14:49:10.0000028268.0000113028 mapped to gridpp001 (17601, 10016)
Jul 10 14:49:16 svr016 gridinfo[9672]: JMA 2007/07/10 14:49:16 GATEKEEPER_JM_ID 2007-07-10.14:49:10.0000028268.0000113028 has GRAM_SCRIPT_JOB_ID 1184075356:lcgpbs:internal_434559272:9672.1184075355 manager type lcgpbs
Jul 10 14:49:16 svr016 gridinfo[9672]: JMA 2007/07/10 14:49:16 GATEKEEPER_JM_ID 2007-07-10.14:49:10.0000028268.0000113028 JM exiting

I'll now try and poke around inside the gatekeeper logs and see if I can come up with any indication why things are going wrong.

And what the hell's this got to do with the RB anyway? It's deeply puzzling and frustrating in equal measure.

New Disk Servers Deployed

We've now completed the deployment of our 5 additional disk servers for ATLAS. This takes our SRM space total to 84TB, with 77TB for ATLAS.

There is about 12TB used so far.
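(For anyone wanting to check the numbers themselves, dpm-qryconf on the DPM head node lists each pool with its member filesystems, capacity and free space - assuming the standard DPM admin tools are installed.)

# run on the DPM head node
dpm-qryconf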

Monday, July 09, 2007

Top Level BDII Updated to Glue 1.3/yaim 3.0.1

I upgraded our top level BDII today, so it should now happily understand the Glue 1.3 schema. This was the deadline for the upgrade set in the EGEE operations meeting - BDIIs have to be upgraded from the top down, because they don't like information provided by lower levels to be in a schema they don't understand.

This was a more significant change than it might first seem, as I also went very carefully through our site-info.def file and used the latest 3.0.1 yaim to configure the node.

It turned out that there were significantly fewer changes than I had thought there would be. The main additions seemed to be exposing new BDII configuration variables, and changing the way that queues are configured, to allow for queues accessible to FQANs. Of course, much of this is not relevant to the BDII/MON box, which doesn't care about users or VOs, but it proves that there are no major errors in the new configuration.

I have noticed that the startup time for the BDII is slower, presumably as it's creating all of its index files; however, the performance should be much better.

Friday, July 06, 2007

Success with MPI

Cracked it! I can now get MPI jobs running on the Glasgow cluster.

First thing to note is that the gatekeeper does not invoke mpirun for the job - this is very good, because it would be almost impossible to get this to work if it did.

The key file is the NODELIST file which the CE will generate and add to the executable's argument list. When this is given as the argument of the -p4pg option, mpirun will ssh to all of the "slave" nodes and start the binary which is given in the NODELIST file.

By default this breaks for 2 reasons:
  1. The gatekeeper only copies the job's sandbox into the working directory of the "master" worker node. So on the "slave" nodes the executable isn't present. (N.B. Even though we have a shared data area for our glaNNN accounts, the working directory is always in /tmp and local to the worker node.)
  2. The executable listed really needs to be a wrapper script, so it's the wrong thing for mpirun to be starting anyway.
So, the wrapper script really has to do the following:
  1. Change to a more sensible shared directory (like $CLUSTER_SHARED).
  2. Rewrite the NODELIST file so that the name of the correct mpi binary to run is given, instead of the wrapper script itself.
  3. Invoke mpirun, giving the new NODELIST file.
Here's an example (with a lot of debugging hooks) which works:
#! /bin/sh
#
# Argument list is: BINARY -p4pg NODELIST -p4wd PATH
# What's really important for us is the NODELIST file, i.e., $3
cd $CLUSTER_SHARED/mpi
export MYBIN=$1
PGFILE=`pwd`/pgfile.`hostname -s`.$$
echo My Args: $@
echo "----"
echo "Original NODELIST file:
cat $3
echo "----"
cat $3 | perl -ne 'print "$1 $2 /cluster/share/gla012/mpi/$ENV{\"MYBIN\"}\n" if /^([\w\.]+)\s+(\d+)/;' > $PGFILE
echo "----"
echo "New NODELIST file:
cat $PGFILE
echo "----"
/opt/mpich-1.2.7p1/bin/mpirun $MYBIN -p4pg $PGFILE

There are, however, two problems which I can see.
  1. Accounting. Looking at the torque logs it's clear that only the master node's process is being accounted for. The slave node MPI processes are not accounted for. Do we multiply the master node's CPU and Wall by the node number as an interim measure?
  2. Orphaned and stray processes. As ssh is used to start the binary on the slave nodes, what happens if the code leaves them behind or they run away?
I wonder if there's a way we can modify mpirun to do things in a torque friendly way? I shall enquire of the MPI gurus.

(For more formal documentation, watch this wiki page....)

Thursday, July 05, 2007

MPI Progress

I am making progress with MPI jobs. I can now get MPICH jobs into the batch system via edg-job-submit and they do get a batch system reservation.

It turned out I had to add MPICH as a GlueHostApplicationSoftwareRunTimeEnvironment in the information system. It's also essential to have GlueCEInfoLRMSType set to pbs. It doesn't work if you put torque (it must be the only thing on the grid that actually cares!).
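To check what actually gets published, an ldapsearch against the site BDII does the job - a sketch, assuming the site BDII is svr021.gla.scotgrid.ac.uk and the usual mds-vo-name base for the site:

# is the MPICH tag published on the subcluster?
ldapsearch -x -H ldap://svr021.gla.scotgrid.ac.uk:2170 -b "mds-vo-name=UKI-SCOTGRID-GLASGOW,o=grid" \
  '(objectClass=GlueSubCluster)' GlueHostApplicationSoftwareRunTimeEnvironment | grep -i mpich
# and is the LRMS type pbs rather than torque?
ldapsearch -x -H ldap://svr021.gla.scotgrid.ac.uk:2170 -b "mds-vo-name=UKI-SCOTGRID-GLASGOW,o=grid" \
  '(objectClass=GlueCE)' GlueCEInfoLRMSType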

The job wrapper then adds some interesting arguments to the executable:

-p4pg NODELIST -p4wd PATH

Where NODELIST looks like this:

node067.beowulf.cluster 0 /tmp/.mpi/https_3a_2f_2fsvr023.gla.scotgrid.ac.uk_3a9000_2fPSn7TiiAJeV6R6w-0vQjtA/./dummy.sh
node070 1 /tmp/.mpi/https_3a_2f_2fsvr023.gla.scotgrid.ac.uk_3a9000_2fPSn7TiiAJeV6R6w-0vQjtA/./dummy.sh
node102 1 /tmp/.mpi/https_3a_2f_2fsvr023.gla.scotgrid.ac.uk_3a9000_2fPSn7TiiAJeV6R6w-0vQjtA/./dummy.sh
node139 1 /tmp/.mpi/https_3a_2f_2fsvr023.gla.scotgrid.ac.uk_3a9000_2fPSn7TiiAJeV6R6w-0vQjtA/./dummy.sh

and PATH is just the working directory for the job. Note the magic number "0" seems to be the place where the job executable runs and "1" are all the nodes where other job slots are reserved for this job.

So clearly the NODELIST file then needs to be taken by mpirun and used to start all the mpi subprocesses. From the EGEE MPI Wiki, the standard method seems to be to use the i2g mpi-start command, so the arguments must be in a form appropriate for it. Open questions remain, though:
  1. How to get i2g mpi-start to work. When I give it an MPI binary it seems determined to compile it - however this falls over, even though MPICH 1.2.7 is in the path.
  2. How do I ignore mpi-start and run a pre-prepared MPI binary, which will be what a Glasgow user wants to do?
  3. How on earth will torque account for all of this properly?

Further reading: EGEE-II-MPI-WG-TEC.doc.

LHCb Stuck Jobs

Coincidentally, just as the stalled jobs document was being drafted, we got 23 stalled LHCb jobs last Friday. These jobs had consumed about a minute of CPU and then just stopped.

I reported them to lhcb-production@cern.ch and the response from LHCb was very swift and helpful. We did quite a bit of debugging on them - although in the end we had to confess that exactly why these ones had stalled was something of a mystery. At first LHCb thought that NFS might have gone wobbly at our end, so the jobs got stuck reading the VO software. From what I could see this was unlikely, and when NIKHEF, RAL and IN2P3 reported similar problems we were off the hook.

Some useful tools for stuck jobs:
  • lsof - see what file handles are open
  • strace - what's the job doing right now
  • gdb - attach a debugger to the code
In fact, a lot of simple diagnostics also help: what's in the job's running directory? What STDOUT/STDERR has been produced so far? Etc.
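Typical usage against a stalled process looks something like this (12345 standing in for the stuck job's PID on the worker node):

# what files and sockets does the stuck process have open?
lsof -p 12345
# is it making any system calls at all right now?
strace -p 12345
# attach a debugger and pull a backtrace for the VO's experts
gdb -p 12345
(gdb) bt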

When these jobs are killed it's helpful to poke the stalled process - that way information gets back to the VO. A qdel will see the outputs all lost and the job resubmitted elsewhere, which is far less helpful.

In the end, whatever the bug is, it's down at the 10^-6 level!

Thanks to LHCb for being so responsive.

I also must take my hat off to Paul and his MonAMI torque plugin. His live efficiency plots for the batch system queues made spotting this very easy. In the past this sort of thing would have been noticed on a very hit or miss basis.

Tuesday, July 03, 2007

Health and Efficiency



As part of investigating the problems of stalled jobs, I have plotted Wall vs. CPU time for ATLAS and LHCb on our cluster.

LHCb jobs are generally quite efficient (as evidenced by their 93% efficiency from the EGEE accounting pages). What's interesting is seeing the cluster of jobs at 11 and 22 hours of CPU time, with a smear in wall clock from perfect efficiency to ~50% (data management strikes again?).

ATLAS jobs have a far more variable profile, with many more short jobs of high efficiency, with a more general, and flatter line out to lower efficiencies. There's a very distinct line of problematic jobs (the spike on the tail).

It really seems that with our new fast CPUs our queue time limits are much too long (inherited from the old cluster, if I remember). LHCb and ATLAS both seem happy for queues to be reduced from 96/100 hours to 36/36 hours.

ATLAS Software Week

Last week I was at ATLAS Software Week at CERN.

It was a useful meeting (as ever meeting people and chatting is most important!). Some issues I picked up for ATLAS sites were:
  1. Although 13.0.10 has been released there are quite a few things known to be broken (event generation, for instance). This means we are stuck with having a lot of "old" ATLAS software releases on our sites. At Glasgow we have 86GB of ATLAS software - more than 60% of the total for all VOs.
  2. Preparations for Computing System Commissioning and the Final Dress Rehearsal are underway. The start date seems to have slipped (was meant to start this week)? Actually, I must find out what the site involvement schedule actually is.
  3. The DQ2 data management system was upgraded to 0.3 last week. There were a few teething troubles, but the next release should handle many common problems much better.
  4. There's pressure not to run too many simulations as part of each job sent to a site - so keep the wallclock down (< 24 hours), but this reduces the file sizes. Small files are a big problem - they are inefficient to transfer and gunge up any tape system. So they should really be merged before any migration to tape. (A problem for CASTOR though, which even puts T0D1 stuff onto tape?)
  5. Event sizes keep going up. Computing TDR had ESD at 0.5MB, but currently this is 1.6MB (1.8 for MC). Probably a realistic target will be 1.3MB files.
  6. Memory footprints are rising too. 2GB necessary for simulation and probably a subset of reconstruction jobs too.
  7. To deal with merging and pile-up jobs, worker nodes should now be specced with at least 20GB of disk space per core. At the moment, however, jobs will try and limit their ambitions to 10GB. This requirement only ever seems to go up, though, so make sure it's accounted for in forthcoming purchases.
  8. Queues for ATLAS production should be around 24 to 36 hours of cpu and wall time (N.B. this is on modern CPUs). NIKHEF are currently at 24/36 and I'm going to cut Glasgow back to 36 hours.
  9. If you see stuck ATLAS jobs try and investigate the problem and report to atlas-comp-oper@cern.ch. This will help cut off the nasty tail in the ATLAS efficiency curve.