Spent more time on the new infrastructure for svr031 again today. I wrote a script to generate a dhcpd.conf file from the cluster database, which was quite straightforward to do. With that in place, dhcpd could function again.
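For the curious, the generation step can be sketched roughly like this in Python. This is not the actual script: the row layout (name, MAC, IP), the function names and the sample addresses are all made up for illustration, and the real thing reads its rows from the cluster database. Only the host stanza layout is standard ISC dhcpd syntax.

```python
def render_host(name, mac, ip):
    """Return one ISC dhcpd host stanza that pins an IP to a MAC address."""
    return (
        f"host {name} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {ip};\n"
        f"}}\n"
    )


def render_dhcpd_hosts(rows):
    """Render a stanza for every (name, mac, ip) row, sorted by name."""
    return "\n".join(render_host(*row) for row in sorted(rows))


# Hypothetical rows standing in for the result of a cluster database query.
rows = [
    ("node002", "00:16:3e:00:00:02", "10.0.0.2"),
    ("node001", "00:16:3e:00:00:01", "10.0.0.1"),
]
conf = render_dhcpd_hosts(rows)
print(conf)
```

Sorting the rows keeps the generated file stable between runs, which makes it easier to see what changed when the database does.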
I tried a test install and found that tftp was, amazingly, still working even though xinetd had lost all of its configuration.
Then the install fell over slightly. The kickstart installer was using the old internal hostname of svr031, master.beowulf.cluster, while pointing name resolution at the university DNS servers. As a machine being installed does not yet have the big cluster /etc/hosts file, it could not resolve this name. So I changed the kickstart setup to hard-code the IP address of svr031, so that no name resolution was needed. This allowed the install to proceed, but I then discovered some cleverness in the kickstart post-install script: it ensures that even on first boot the machine sets its hostname correctly (in particular, to the routed hostname for grid and disk servers), and it relies on DNS to do so.
So I have to either rewrite the clever code or go back to running DNS on svr031.
I decided that running DNS on svr031 was no bad thing. There's a lovely lightweight DNS server called dnsmasq. It answers DNS queries from the contents of /etc/hosts, which is already generated from the cluster database, so it can provide name resolution during the install process.
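Since dnsmasq reads /etc/hosts by default, the only input it needs is the hosts file itself. A minimal sketch of rendering one from database rows might look like the following. The (IP, name) row layout, the function name and the IP addresses are assumptions for illustration, not the real generator.

```python
def render_hosts(entries):
    """Render /etc/hosts text from (ip, fqdn) pairs.

    Each fully qualified name also gets its short name as an alias,
    so both forms resolve.
    """
    lines = ["127.0.0.1\tlocalhost.localdomain localhost"]
    for ip, fqdn in entries:
        short = fqdn.split(".", 1)[0]
        names = fqdn if short == fqdn else f"{fqdn} {short}"
        lines.append(f"{ip}\t{names}")
    return "\n".join(lines) + "\n"


# Hypothetical rows; the addresses are invented for this example.
entries = [
    ("10.0.0.1", "master.beowulf.cluster"),
    ("10.0.0.31", "node031"),
]
hosts_text = render_hosts(entries)
print(hosts_text)
```

Any name that appears in the rendered file is then answerable by dnsmasq without a separate zone file to maintain.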
After a machine has first booted, cfengine will copy in the global cluster /etc/hosts and set the DNS to the university servers to remove the single point of failure. But having internal DNS for the installs is nice.
Even better, I found that dnsmasq is built for RHEL4 x86_64 in the DAG repository, so installing it was a cinch.
Finally, dnsmasq also includes a dhcp server and a tftp server, so I will have a look at running these and possibly simplifying life even more.
Installs should be back up and running tomorrow.