Status of infrastructure: 12 of the water-cooled racks are back on at present; rack 9 had cooling problems and had to be shut down quickly. This causes problems for PBS, as the server seems to hang if large numbers of nodes running jobs suddenly vanish. This in turn causes problems for the Gstat monitoring software: the graph of site performance shows a lot of dips and spikes, and the site is then often classified as 'error'. After the meeting it was found that the solenoid valves in water-cooled racks 9 and 12 were stuck shut. A repair is being planned by DM.
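(Note for the record: where there is time for a controlled power-down, the usual mitigation is to mark the affected worker nodes offline in Torque/PBS first, so the batch server drains them rather than seeing them vanish mid-job. A minimal sketch is given below; it assumes the standard pbsnodes client is on the path and uses a hypothetical "rackNN-nodeMM" naming scheme, since the real node names are not recorded in these minutes.)

#!/usr/bin/env python
# Sketch only: mark every worker node in a rack offline in Torque/PBS ahead of
# a planned shutdown, so the batch server drains the nodes instead of seeing
# them disappear mid-job.  The "rackNN-nodeMM" naming scheme is hypothetical.
import subprocess
import sys

def offline_rack(rack, nodes_per_rack=32):
    """Run `pbsnodes -o <node>` for each node in the given rack."""
    for i in range(1, nodes_per_rack + 1):
        node = "rack%02d-node%02d" % (rack, i)
        subprocess.call(["pbsnodes", "-o", node])

if __name__ == "__main__":
    offline_rack(int(sys.argv[1]))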
Problems: There are problems getting data out of the SE when it is nearly full.
dCache and SE Status: There was a long discussion about the problems installing dCache on the site, and whether we should go back to earlier versions of the software that seem to be running successfully at other sites. PT said the basic problem is that the site is unstable, with internal network problems and cooling problems causing racks to drop out. He estimated it would take many months to install dCache. The meeting decided this was much too long a timescale and that it would be worth contacting AF, the NorthGRID technical rep, to see if expert help could be brought to bear on this problem. ACTION MAH.
Internal network issues: the FORCE10/DELL problem is ongoing. We have registered on the FORCE10 website and hope to download the updates soon.
JB and RF have proposed that the FORCE10 switch be upgraded just before Easter and will circulate an email warning users of the break in service they can expect while this work is done. It is hoped this will also cure the internal network problem (4) above.
There will be some outages of the main campus PoP in the near future caused by upgrades to SUPERJANET. Adequate warning will be sent around.
There was a discussion about the SE and the lack of disk space. The meeting agreed it would be a good idea to buy a 10 TB RAID array (cost ~£10K + VAT) that could be used as part of the LCG SE until dCache is working properly; it could then migrate to become part of the general local working disk space for users, as many of the current disk drives in the cluster room are becoming old and will soon start to become unreliable. JB and RF will present this request to the next hardware meeting.
ATLAS software: installation of 12.0.5 and 12.0.6 is ongoing. The install was crashing part way through with an NFS problem. CG has now understood the new ATLAS install website and PT has tweaked NFS. The install will be tried again soon.
Non-LCG cluster report
The number of racks in use is now 5; BTK is installing the ATLAS code on the extra nodes. A new version of Torque has been installed.
Status of users' software: no news.
Statistics: no news.
Network Issues
The network is OK; users should expect minor breaks as various upgrades are made.
Clean Room PC upgrades. JB/RF plan to visit clean rooms soon to investigate what is needed.
PP Web page support. KS continues to work on this.
GRIDPP Site Visit. The date has been moved and the visit will take place on Tuesday 15th May. A copy of the questionnaire was tabled; the one for Liverpool can be seen at http://www.gridpp.ac.uk/tier2/ReadinessReview.zip
Any Other Business
It was suggested that either RF or JB should go to the GRIDPP 18 meeting with PT on 20/21 March in Glasgow.
PT has prepared a detailed site configuration map that could be useful input to the site visit.