• Present
    • MAH, JB, RF, CG, JV, PT, BTK
  • Apologies
    • TJVB, SF, TG, SJM, PA

LCG cluster report

  1. The system is shut down for most of this week to allow repairs and an upgrade to the FORCE10 switch. The new cabling layout to the racks will be completed tomorrow, ready for the FORCE10 engineer to come on Wednesday to install the system and run some training sessions.
  2. Problems: The second water-cooling unit on the roof has been stripped down by DM and an engineer; a leak was found and repaired, and the unit has undergone pressure tests. The system is being re-filled with gas and should be back up soon. This fix has saved us another £5K. DM will also try to replace the two faulty valves on racks 9 and 12 when the replacements arrive. Clip-on units to read temperatures and flow rates will be added to the pipework. DM continues to work his way through the racks looking for problems, doing hardware fixes and putting information labels on each node.
  3. dCache and SE status: dCache still has problems, as the software running on hepgrid5 still gets out-of-memory errors and crashes. An upgrade to the Java SDK that might cure this is being tested. A quick-and-dirty (but basically unsatisfactory) workaround of regular restarts, as done at other sites, will be tried: hepgrid5 will be rebooted weekly until the problem is cured. It is also planned to monitor memory usage on hepgrid5 to try to spot when a problem is arising. The new RAID has arrived and is undergoing soak tests; it will be added to the system once the FORCE10 upgrade is working.
  4. Internal network issues: none.
  5. Plans:
    1. A review of the whole system has been done to identify potential points of failure, ensure there is adequate redundancy in the system, and confirm that spares are available on demand. The results can be seen on the new blog.
    2. Single-supplier quotes are being sought for critical spares for the upgraded FORCE10.
  6. ATLAS software: V 12.0.5 has been installed but not yet fully tested or validated. Installation of V 12.0.6 is ongoing: it hangs part-way through, which is symptomatic of an NFS problem. It has, however, installed OK on the batch cluster.
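The monitoring proposed for hepgrid5 (item 3 above) could be as simple as a periodic check of the dCache JVM's resident memory, run from cron ahead of the weekly reboot. This is only a sketch; the process selection (`java`) and the 2 GB threshold are assumptions, not the actual site configuration:

```shell
#!/bin/sh
# Sketch of a memory check for hepgrid5: warn when the dCache JVM's
# total resident memory passes a threshold, so a restart can be
# scheduled before it hits out-of-memory errors and crashes.

# Decide whether the measured usage exceeds the limit.
check_mem() {
    used_kb=$1
    threshold_kb=$2
    if [ "$used_kb" -gt "$threshold_kb" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# Assumed limit: 2 GB resident, expressed in KB.
THRESHOLD_KB=$((2 * 1024 * 1024))

# Sum the resident set size (KB) of all java processes
# (dCache runs on the JVM); 0 if none are running.
USED_KB=$(ps -C java -o rss= 2>/dev/null | awk '{s += $1} END {print s + 0}')

echo "dCache JVM memory: ${USED_KB} KB, status: $(check_mem "$USED_KB" "$THRESHOLD_KB")"
```

Run hourly from cron, a WARNING line in the log would show when memory is climbing and roughly how long before a restart becomes necessary.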

Non-LCG cluster report

  1. The number of racks in use is now 5. They are being used by BTK for large amounts of MC generation for local groups and some external groups, e.g. GMSB. A request from Oxford for some W → eν events is being processed now. Users of these events are reminded that they should include Barry's name in author lists on talks, as he puts a lot of time and effort into setting up the system and generating these events.
  2. Status of users' software: no news
  3. Statistics: no news.

Batch Cluster

  1. Working very smoothly. Users are encouraged to exploit this facility and report problems to the helpdesk.

Network Issues

  1. OK. Discussions are ongoing with CSD to see whether a second connection to the PoP can be installed, as there is only a single 1 Gbit/s link at present.


  1. Documentation: The new blog (set up to log all the issues connected with the cluster) is working.
  2. Clean Room PC upgrades. JB/RF plan to visit clean rooms soon to investigate what is needed.
  3. PP web page support. No further news.
  4. GRIDPP Site Visit. A small committee has been set up to plan for this visit. Progress is satisfactory. The heat and power loads in the cluster room are being carefully studied so we can plan future upgrades/additions to the hardware.

Any Other Business

  1. The PPARC GRIDPP3 application to supply new hardware and continue posts until 2011 has been approved, at reduced funding. We await details, but could expect about £80K to spend on new hardware each year, starting October 2007. We have to greatly increase the disk space in the LCG SE, to ~80 TB.

Date of next meeting

  1. 16 April at 14:00.
Topic revision: r1 - 10 Apr 2007, JohnBland