Status of infrastructure: 12 of the water-cooled racks are back on at present; rack 9 had cooling problems and had to be shut down quickly. This causes problems for PBS, as the server seems to hang if large numbers of nodes running jobs suddenly vanish. That in turn causes problems for the GStat monitoring software: the graph of site performance shows a lot of dips and spikes, and the site is often classified 'error'. After the meeting it was found that solenoid valves in water-cooled racks 9 and 12 were stuck shut. A repair is being planned by DM.
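One way to soften the PBS hang would be to drain a rack's worker nodes before a cooling shutdown, so the server sees an orderly offline rather than a sudden mass disappearance. A minimal sketch, assuming Torque/PBS's standard `pbsnodes` command; the hostnames (node901 etc.) are hypothetical, not the site's real node names, and the commands are echoed as a dry run rather than executed:

```shell
# Sketch (an assumption, not the site's actual procedure): mark each node in a
# rack offline before powering it down. 'pbsnodes -o <node>' stops new jobs
# being scheduled on the node while existing jobs drain.
drain_rack() {
    for n in "$@"; do
        # Dry run: print the command that an operator would run.
        echo "pbsnodes -o $n"
    done
}

drain_rack node901 node902 node903
```

Running the real commands (without the `echo`) before the emergency power-off would let PBS retire the nodes cleanly instead of hanging when they vanish mid-job.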
Problems: There are difficulties retrieving data from the SE when it is nearly full.
dCache and SE Status: There was a long discussion about the problems installing dCache on the site, and whether we should go back to earlier versions of the software that seem to be running successfully at other sites. PT says the basic problem is that the site is unstable, with internal network problems and cooling problems causing racks to drop out. He estimated it would take many months to install dCache. The meeting decided this was much too long a timescale and that it would be worth contacting AF, the NorthGrid technical rep, to see if expert help could be brought to bear on this problem. ACTION MAH.
Internal network issues: the FORCE10/Dell problem is ongoing. We have registered on the FORCE10 website and hope to download the updates soon.
JB and RF have proposed that the FORCE10 switch be upgraded just before Easter and will circulate an email warning users of the break in service they can expect while this work is done. It is hoped this will also cure the internal network problem (4) above.
There will be some outages of the main campus PoP (point of presence) in the near future, caused by upgrades to SuperJANET. Adequate warning will be sent around.
There was a discussion about the SE and the lack of disk space. The meeting agreed it would be a good idea to buy a 10 TB RAID (cost ~£10K + VAT) that could be used as part of the LCG SE until dCache was working properly. It could then migrate to become part of the general local working disk space for users, as many of the current disk drives in the cluster room are becoming old and will soon start to become unreliable. JB and RF will present this request to the next hardware meeting.
ATLAS software: installation of 12.0.5 and 12.0.6 is ongoing. It was crashing with an NFS problem part way through the install. CG has now understood the new ATLAS install website and PT has tweaked NFS. The install will be tried again soon.
Non-LCG cluster report
Number of racks in use is now 5: BTK is installing the ATLAS code on these extra nodes. A new version of Torque has been installed.
Status of users' software: no news.
Statistics: no news.
Overall status is OK; users should expect minor breaks in service as the various upgrades are made.