Monday, October 26, 2009


HEPiX Workshop

Site Reports Session

Getting serious about ITIL. Solaris being phased out. Getting serious about 10GigE.
Lustre pilot project. New purchases discussed.

New LQCD Cluster "2009 Quad Infiniband - ARRA Cluster"
Storage - Whitebox, 14 AMAX servers, Solaris w/ZFS or Lustre
Compute - Dell PowerEdge R4102 x4 Ghz QDR InfiniBand, 24GB RAM

Auger Cluster Upgraded
Nehalems - Intel x5530, dual CPU, quad core, 24GB RAM, 500GB SATA
(seeing I/O contention on disk when running 14-16 jobs per node)
OS switch from Fedora 8 32-bit to CentOS 5.3 64-bit

No real Grid Computing
IBM TS3500 tape library installed. StorageTek Powderhorn silos replaced.
80 production VMs on VMware ESX 3.5, with a planned move to vSphere 4.0

FAIR - new accelerator discussion. The futuristic talk!
The Cube data centre building: 1000 water-cooled 19" racks held in a 26x26x26 cube building. Lifts to reach the machines. Iron structure for the racks to sit on.

T1 4LHC & D0, BaBar. SL5 migration in Q2 2010 for both Main Cluster and MPI Cluster. New purchases and new server building.


Your File System, next-gen OpenAFS (Jeffrey Altman):
YFS is now funded by the US Gov to create a next-generation OpenAFS. 2 year funding. Deliverables include an assessment of current AFS and a 2 year upgrade plan for the client and server. Still open source.

StoRM and Lustre:
IOzone discussion, HammerCloud tests discussion, benchmarking summary. Good results, but performance below the IOzone tests. WMS jobs and PanDA jobs differ. file:// protocol support performs well but requires the VO to support it. Open questions: Lustre striping (yes or no?); performance (RAID config?); monitoring (still work to be done); support (kernel upgrades can take a while to be made available); benchmarks (are they realistic?). Tuning still to do.

Lustre at GSI:
Users - ALICE analysis for Tier 2, GSI experiments, FAIR simulations. Still at 1 PByte, > 3000 nodes. Foundry RX32 ethernet switch. MDS HA pair, one standby. 84 OSS, 200 OSTs. MDS: 8 core, 3GHz Xeon, 32GB RAM. Real throughput testing with the ALICE analysis train: 50Gbit/s using 2000 cores. Hardware and software issues. Complex system and vulnerable to network communication problems. Using the Robinhood filesystem monitor for audit and management. This protects the MDS by directing queries (e.g. top ten users, file moves) to a MySQL instance instead. Using this rather than e2scan.
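To put the 50Gbit/s figure in perspective, here is a rough back-of-the-envelope split in Python. It assumes the load was spread evenly over the 2000 cores and the 84 OSS, which the talk did not claim, and uses decimal units (1 Gbit = 1000 Mbit).

```python
# Figures taken from the notes above.
AGGREGATE_GBIT_S = 50
CORES = 2000
OSS_COUNT = 84


def per_unit_mbyte_s(aggregate_gbit_s: float, units: int) -> float:
    """Even share of an aggregate rate in MByte/s (assumes a perfectly even split)."""
    return aggregate_gbit_s * 1000 / 8 / units


per_core = per_unit_mbyte_s(AGGREGATE_GBIT_S, CORES)
per_oss = per_unit_mbyte_s(AGGREGATE_GBIT_S, OSS_COUNT)

print(f"per core: {per_core:.3f} MByte/s")  # 3.125
print(f"per OSS:  {per_oss:.1f} MByte/s")   # 74.4
```

So each analysis core reads only about 3 MByte/s, while each OSS has to sustain roughly 74 MByte/s.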

Hadoop on your worker nodes using local hard drives & FUSE:
Hadoop compared against Lustre. Performed well when 8 jobs ran. Replication of files provides redundancy. Cost and maintenance factor very favourable to small sites. Deployed at some sites in the US. Not really a Tier 1 deployable solution. Name node redundancy exists (will lose at most one transaction) but requires additional software.
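The redundancy comes at the price of raw disk. A minimal Python sketch of the trade-off follows; the node count and disk size are made-up illustration values, not numbers from the talk, and the replication factor of 3 is simply the HDFS default.

```python
def usable_capacity_tb(nodes: int, disk_tb_per_node: float, replication: int) -> float:
    """Usable space when every block is stored `replication` times."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return nodes * disk_tb_per_node / replication


# Hypothetical small site: 100 worker nodes with 0.5 TB of local disk each.
raw_tb = usable_capacity_tb(100, 0.5, replication=1)
usable_tb = usable_capacity_tb(100, 0.5, replication=3)

print(f"raw: {raw_tb:.1f} TB, usable at 3x replication: {usable_tb:.1f} TB")
```

With three copies of every block, only a third of the raw worker-node disk ends up as usable storage.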

Virtualization Session

lxcloud at CERN:
CERN has developed a proof of concept for virtualized worker nodes. 'Golden nodes' serve images to the Xen hypervisors using OpenNebula. Also looked at Platform's VMO. Production lxcloud being built: 10 machines, dual Nehalem, 24GB RAM, 2TB disk. Starting with Xen. Production release by March 2010. Memory is an issue as the hypervisor itself requires some memory, i.e. with 16GB RAM you cannot run 8 2GB VMs.
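The memory point is simple arithmetic. In the Python sketch below, the 1GB hypervisor reserve is an assumed illustration value; the talk did not quote the actual overhead.

```python
def max_vms(host_ram_gb: float, vm_ram_gb: float, hypervisor_ram_gb: float) -> int:
    """Number of equally sized VMs that fit once the hypervisor has taken its share."""
    usable = host_ram_gb - hypervisor_ram_gb
    if usable <= 0:
        return 0
    return int(usable // vm_ram_gb)


# Assumed reserve of 1 GB for the hypervisor (hypothetical value).
print(max_vms(16, 2, hypervisor_ram_gb=1))  # 7, not 8
print(max_vms(24, 2, hypervisor_ram_gb=1))  # 11 on the 24GB lxcloud machines
```

Any non-zero reserve drops a 16GB host from 8 to 7 VMs of 2GB each.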

Has moved much of its infrastructure to the Xen hypervisor. Looks like a solid infrastructure. Investigating KVM, with the possibility of a move in the next few years if it proves to be better. INFN covered Xen vs KVM at HEPiX Spring 2009 for a discussion of the differences.
