
MeetingMinutes19Mar2007

Present
- MAH, JB, RF, CG, PA, GB, JV
Apologies
- TJVB, SF, TG, SJM, PT (on way to GRIDPP 18 meeting).

LCG cluster report

Status of infrastructure: 11 of the water-cooled racks are back on at present; racks 9 and 12 are still off. A repair to a valve did not work (the replacement valve failed). Rack 2 is at 50% capacity because one of the DELL switches has failed. There are no other known network issues within the cluster and it should be stable.
Problems: There are indications of an instability developing in one of the water cooling units on the roof. It is being monitored daily and will be shut down and inspected/repaired around the time the FORCE10 switch is upgraded. There remains a general problem that there are no sensors on the air con and water cooling systems, so their state cannot be monitored by our own computers and a controlled shutdown of the system cannot be triggered if necessary. DM continues to work his way through the racks looking for problems, doing hardware fixes, and putting information labels on each node.
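A minimal sketch of the kind of check such sensors would make possible, assuming temperature readings were available to our own machines. The sensor names, the thresholds and the `decide_action` helper are all hypothetical and do not describe anything currently installed:

```python
from dataclasses import dataclass

# Hypothetical thresholds in degrees Celsius; real limits would have to come
# from the cooling-unit and rack specifications.
WARN_TEMP_C = 30.0
SHUTDOWN_TEMP_C = 40.0


@dataclass
class Reading:
    sensor: str    # hypothetical sensor name, e.g. "rack9-water-inlet"
    temp_c: float  # measured temperature in degrees Celsius


def decide_action(readings):
    """Return "shutdown", "warn" or "ok" based on the hottest reading.

    An empty list is treated as "warn": losing all sensor data is itself
    a condition that someone should look at.
    """
    if not readings:
        return "warn"
    hottest = max(r.temp_c for r in readings)
    if hottest >= SHUTDOWN_TEMP_C:
        return "shutdown"
    if hottest >= WARN_TEMP_C:
        return "warn"
    return "ok"
```

A monitoring job could call `decide_action` at regular intervals and start a controlled shutdown of the affected racks only when it returns "shutdown".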
Dcache and SE Status: The Dcache situation was discussed with NorthGRID technical rep AF. In principle it should work now that the system is stable. It was apparently brought up over the weekend and showed 34 TB as our SE, but then stopped sometime on Sunday. No more details are available as PT is on his way to the GRIDPP 18 meeting.
Internal network issues: none
Plans:
- The FORCE10 switch will be upgraded in the week beginning 2 April (probably on April 4). An engineer from FORCE10 will be here and will bring the software updates.
- A review of the whole system is underway to identify potential points of failure and ensure there is adequate redundancy in the system, and that spares are available on demand.
- A RAID array of ~11 TB has been approved and is being procured. In the first instance it will be added to the LCG SE until the Dcache is stable.
ATLAS software: installation of 12.0.5 and 12.0.6 is ongoing. It continues to fail due to file problems at remote sites involved in the installation; CG will contact the author of this part of the code to resolve this.

Non-LCG cluster report

Number of racks in use is now 5. They are being used by BTK for lots of MC generation.
Status of users' software: no news
Statistics: no news

Network Issues

OK, CSD upgrades to the PoP have finished.

Batch (BaBar) Cluster

This will be renamed the BATCH CLUSTER. It is working fine: users are requested to exploit this facility and report problems to Helpdesk@hepREMOVETHIS.ph.liv.ac.uk.
JV and GB from Cockcroft will discuss their needs for MPI on this cluster after this meeting.

Plans

Documentation: The Twiki is up and in use.
Clean Room PC upgrades: JB/RF plan to visit the clean rooms soon to investigate what is needed.
PP Web page support: KS has finished and asked for feedback. She has now left to start a new job. SJM and JV will keep up the maintenance and the new site should go public as soon as possible.
GRIDPP Site Visit: A small committee has been set up to plan for this visit.

Any Other Business

RF will go to the GRIDPP 18 meeting with PT on 20/21 March in Glasgow.

Date of next meeting

02 April 2007 at 14:00. This will clash with the start of the IoP conference.

Topic revision: r1 - 20 Mar 2007, JohnBland
