TJVB, SF, TG, SJM, PT (on way to GRIDPP 18 meeting).
LCG cluster report
Status of infrastructure: 11 of the water-cooled racks are back on at present; racks 9 and 12 are still off. A repair to a valve did not work (the replacement valve failed). Rack 2 is at 50% capacity because of the failure of one of the DELL switches. There are no other known network issues within the cluster and it should be stable.
Problems: There are indications of an instability developing in one of the water cooling units on the roof. It is being monitored daily and will be shut down and inspected/repaired around the time the FORCE10 switch is upgraded. A general problem remains: there are no sensors on the air con and water cooling systems that our own computers can read, so their state cannot be monitored and a controlled shutdown of the system cannot be triggered if necessary. DM continues to work his way through the racks looking for problems, doing hardware fixes, and putting information labels on each node.
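To illustrate the kind of monitoring that sensors on the cooling systems would make possible, the following is a minimal sketch only. No such sensors exist at present, so the sensor names, the temperature limits and the three-state result are all hypothetical assumptions and do not describe any installed system.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Limit:
    warn: float      # reading at or above this value raises a warning
    shutdown: float  # reading at or above this value calls for a controlled shutdown


# Hypothetical sensor names and limits, chosen only for illustration.
LIMITS: Dict[str, Limit] = {
    "water_inlet_temp_c": Limit(warn=20.0, shutdown=25.0),
    "room_air_temp_c": Limit(warn=28.0, shutdown=35.0),
}


def assess(readings: Dict[str, float],
           limits: Optional[Dict[str, Limit]] = None) -> str:
    """Return 'ok', 'warn' or 'shutdown' for one set of sensor readings."""
    if limits is None:
        limits = LIMITS
    state = "ok"
    for name, limit in limits.items():
        value = readings.get(name)
        if value is None:
            # A missing reading is treated as a warning, not as healthy.
            state = "warn"
            continue
        if value >= limit.shutdown:
            return "shutdown"
        if value >= limit.warn:
            state = "warn"
    return state
```

A polling script could call assess() on each new set of readings and start an orderly power-down of the racks when it returns 'shutdown'. How the readings are obtained and how the nodes are halted is left open here.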
Dcache and SE Status: There was a discussion of the Dcache situation with NorthGRID technical rep AF. In principle it should work now that the system is stable. It was apparently brought up over the weekend and showed 34 TB for our SE, but then stopped sometime on Sunday. No more details are available as PT is on his way to the GRIDPP 18 meeting.
Internal network issues: none
The FORCE10 switch will be upgraded in the week beginning 2 April (probably on 4 April). An engineer from FORCE10 will be here and will bring the software updates.
A review of the whole system is underway to identify potential points of failure and ensure there is adequate redundancy in the system, and that spares are available on demand.
A RAID array of ~11 TB has been approved and is being procured. In the first instance it will be added to the LCG SE until the Dcache is stable.
ATLAS software: installation of 12.0.5 and 12.0.6 is ongoing. It continues to fail due to file problems at remote sites that are involved in the installation; CG will contact the author of this part of the code to solve this.
Non-LCG cluster report
The number of racks in use is now 5. They are being used by BTK for a large amount of MC generation.