Friday, September 15, 2006

It's been a while since I regaled the world with our continuing ClusterVision saga, so grab a mug of tea, wrap a blanket around your shoulders and prepare for a meaty read...

I finally managed to speak to Alex on Monday 11th. We discussed some of the reasons why 32 bit software might break when run in a 64 bit environment, which was technically interesting. However, the bottom line is that if software breaks on our site and runs fine everywhere else then it's us who look bad, even if it is the developer's fault. So we really need a 32 bit installation. He said that it was possible for them to do this - it would require rebuilding the CVOS installer, after which building images would be easy. This would take until late Tuesday, so it was going to be Wednesday before we could get started.

The 32 bit image was duly delivered on Tuesday. It was nice to get something done on time. When I started working with it on Wednesday I found that it was reporting x86_64 via uname. I discovered that this was because it is a 32 bit OS running a 64 bit kernel. This means:

  • Applications which want to know their architecture (like YUM) need to be run within a linux32 environment, which makes uname report a 32 bit x86 machine type (i686) instead of x86_64. Otherwise YUM gets terribly confused, even when its configuration files have been manually set to the i386 repository.
  • Therefore we probably need the job wrapper to execute in this environment, just to be on the safe side.
  • Module loading is broken - only modules loaded within initrd are available. I consider this to be a security advantage.
  • Memory handling is much better, because the kernel is 64 bit - it gets full bandwidth from the CPU to the memory and each running 32 bit app can access 4GB.
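
To make the job wrapper idea concrete, here is a minimal sketch in Python of how a command could be pushed through linux32 only when the kernel claims to be x86_64. The function names (linux32_command, run_job) are made up for illustration and this is not the wrapper we actually run; the only things assumed to exist are the linux32 command itself and the standard library.

```python
import os
import shutil
import subprocess


def linux32_command(argv, kernel_machine=None):
    """Return argv, prefixed with `linux32` if the kernel reports x86_64.

    kernel_machine can be passed in explicitly (handy for testing);
    by default it is taken from uname.
    """
    if kernel_machine is None:
        kernel_machine = os.uname().machine
    if kernel_machine == "x86_64":
        # 32 bit userland on a 64 bit kernel: hide the kernel's architecture
        return ["linux32"] + list(argv)
    return list(argv)


def run_job(argv):
    """Run a job (hypothetical wrapper entry point) inside linux32 if needed."""
    cmd = linux32_command(argv)
    if cmd[0] == "linux32" and shutil.which("linux32") is None:
        raise RuntimeError("linux32 not found on PATH")
    return subprocess.call(cmd)
```

Something like run_job(["uname", "-m"]) on a worker node is then a quick check that jobs see a 32 bit machine type rather than x86_64.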


We had a meeting that Wednesday (13th). In addition to clarifying the above, the following was discussed:

  • The rest of the cluster will be delivered in 2 to 3 weeks.
  • David and I were to get remote training in CVOS on Thursday.
  • The SL307 worker node image on the master node has hostname lookup broken because resolv.conf is not correct.
  • I would work on getting a base image for the grid servers (SL307 i386) and the disk servers (SL43 i386).


I still felt that the division of responsibilities was very unclear - on the one hand CV were offering to do the base install for the worker nodes, leaving me to layer on the grid software, yet I was to do the grid nodes and disk servers.

On Thursday and Friday I customised the worker node image on node001. This basically consisted of:


  • Enabling YUM, pointed at a repository on the master node. Because of the x86_64 kernel issues I told YUM to leave the kernel well alone.
  • Patching in YAIM with a suitable site-info.defs.
  • Manually adding Steve Traylen's torque packages for workers.
  • Adding suitable ssh allowed keys.
  • Enabling r* services (yes, we will run rsh inside the cluster)
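
For the YUM step, a rough sketch of the kind of repository stanza involved, generated from Python so it can be templated per image. The repository id, name and URL below are placeholders and not our real configuration; the one real ingredient is YUM's exclude option, which is one way of telling it to leave kernel packages alone.

```python
import configparser
import io


def render_repo(repo_id, name, baseurl, exclude="kernel*"):
    """Render a single yum repository stanza as text.

    exclude="kernel*" keeps yum away from kernel packages in this repo.
    """
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.optionxform = str  # preserve option names exactly as given
    cfg[repo_id] = {
        "name": name,
        "baseurl": baseurl,
        "enabled": "1",
        "exclude": exclude,
    }
    buf = io.StringIO()
    cfg.write(buf, space_around_delimiters=False)
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder id and URL, purely illustrative
    print(render_repo("sl-local", "SL local mirror", "http://master/sl/i386"))
```

Whether the stanza lives in yum.conf or in its own file depends on the YUM version in the image, so treat the layout as an assumption to check.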


After I had done this, I emailed Lowrens at ClusterVision to grab this image and patch in whatever RPMs CV wanted.
