MeetingMinutes05Feb2007

Present
- PA, MAH, CG, JB, RF, JV, NMcC
Apologies
- PA, BTK, TJVB

LCG cluster report

Status of infrastructure. A refrigerant leak has occurred in the units on the roof of the OL, perhaps due to storm damage. This reduced the cooling of the water-cooled racks, so they slowly heated up. GS temperature-monitoring hardware shut down 4 racks; the others were powered off manually. The fridge unit is being investigated to see if it can be repaired; otherwise a replacement from Rittal will cost £4.5K. The 13 water-cooled racks remain off at present; one air-cooled rack is running on the GRID. There is now a pressing need to implement the USB port connection on the GS boxes so that the racks can be powered down in a controlled way, avoiding the damage that can occur if the power is switched off without warning. GS is modifying his USB DOS software to run under Windows, and JB and RF will assist in producing a Linux version. Two air-conditioning units in the room have also failed; B&E are investigating.
Problems: There was a software failure on the YP server last Friday that required a reboot.
Dcache and SE Status. The SE is nearly full and PT has circulated a request for VOs to remove unwanted files. Local ATLAS users will need more local disk space if they are to work within the approved ATLAS computing model. The Dcache logging software that caused Hepgrid5 and Hepgrid6 to crash (see last minutes) has been disabled by PT. In principle Dcache could be implemented on some of the existing nodes (giving ~30TB) as soon as the internal network problems are cured (see next item). The meeting requested that a cost comparison be made between installing a new 10TB RAID on the SE and installing extra disks in the nodes to give more Dcache space.
Internal network issues: FORCE10/DELL switch connectivity. Investigations of this intermittent fault will continue as soon as the whole cluster is back up. JB and RF said they could make use of the FORCE10 line cards as part of the solution to the connectivity issue. MAH requested that they produce a document detailing this request.
Plans: Memory has been ordered and awaits delivery. PS’s laptop is approved and is being sourced.
ATLAS software: installation of V12.0.4 and V12.0.5 is underway. The new auto-install facility might work. Official ATLAS MC has been running on the (reduced) cluster with an efficiency of 54%; the problems are not due to local issues.

Non-LCG cluster report

Number of racks in use: 3 racks are still in use at present. Four nodes that were causing problems (sucking in jobs) have been disconnected. ATLAS V12.0.4 and V12.0.5 have been installed.
Status of users' software: Some software has been installed on the T2K front end; there are some problems using SL3.
Statistics: no news

Trash/CDF and SAM

No news

Network issues

Ok, no problems.

Plans

Documentation: The Twiki is up and in use.
Clean Room PC upgrades. JB/RF plan to visit the clean rooms soon to investigate what is needed.
PP Web page support. KS has started work and will convert the top layer of pages to the official University format and migrate these to CSD machines.
The current list of staff on the web pages is badly out of date. Users suggested that key information like this, and internal phone numbers, could reside in a database that Jackie could maintain, given a suitably simple and robust interface. JB/RF will investigate.

Any Other Business

An agreement on GRID Site Operations is under discussion, see https://edms.cern.ch/file/726129/1/AgreementSiteOperations-20060808-0.9.doc . This has to be signed by each site. It’s not clear who could sign for us.

Date of next meeting

19th Feb at 14:00.

Topic revision: r2 - 18 Oct 2012, JohnBland
