
MeetingMinutes11Jun2007

Present
- MAH, JB, CG, PT, RF
Apologies
- TJVB, SF, TG, SJM, PA, BTK, JV

GRIDPP Site Visit

This was successful, and thanks go to all who took part. The main recommendations to us were that we improve liaison with NORTHGRID, improve the cross-campus network connections, and continue installing local monitoring systems for all the hardware and air-con. The dCache is seen as experimental and, until it is proven, data should be stored on RAIDs, which should be the preferred way to supply disk space on the GRID in the future. CG has also agreed to be the NORTHGRID ATLAS contact.

LCG cluster report

About 11 racks are running at present, with about 350 active nodes. Rack 9 has had its switches replaced and is in the process of being brought back up. Rack 12, containing repaired nodes, will be worked on when all the others are fully working; its switches have been configured and are ready for use. Because the memory needed by jobs is increasing, the meeting decided that, for the time being, only one job per node will be allowed across all the VOs. Action: PT to change the PBS setting.
FORCE10: multiple VLANs are now established over the LCG and non-LCG clusters. The new HEPWALL machine, which will optimise the routing, will be commissioned shortly; this means changing the IP addresses in the switches.
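The one-job-per-node decision follows from simple arithmetic on node memory. The sketch below is illustrative only: the function name `slots_per_node` and all the memory figures are hypothetical, since the minutes do not record the actual node or job memory sizes, and it is not the PBS configuration itself.

```python
def slots_per_node(node_mem_mb, job_mem_mb, reserve_mb, cpu_cores):
    """Return how many jobs fit on one node without exceeding its memory.

    All arguments are hypothetical illustration values, not measured figures.
    reserve_mb is memory held back for the operating system and daemons.
    """
    if job_mem_mb <= 0:
        raise ValueError("job_mem_mb must be positive")
    usable_mb = max(0, node_mem_mb - reserve_mb)
    by_memory = usable_mb // job_mem_mb
    # A node never runs more jobs than it has cores.
    return min(cpu_cores, by_memory)


if __name__ == "__main__":
    # Assumed example: a dual-core node with 2 GB of RAM.
    print(slots_per_node(2048, 700, 256, 2))   # smaller jobs
    print(slots_per_node(2048, 1500, 256, 2))  # larger jobs
```

With these assumed figures, a job needing about 1.5 GB leaves room for only one slot on a 2 GB dual-core node, which is the situation the meeting's decision addresses.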
Problems: none outstanding.
dCache and SE status: as dCache is viewed by GRIDPP as being experimental, work on it has been given a lower priority than bringing up the whole cluster.
Internal network issues: see above.
Plans:
- The 4 servers that make up Hepstore are old, nearly full, heavily used and show increasing signs of failure. It is recommended that these 4 be replaced very soon, before another major failure occurs, with one new RAID; this will also give an order of magnitude improvement in performance. A request will be made to the hardware committee.
- System resources have been monitored for the last few weeks with Nagios. This is open-source software that is used at the majority of HEP sites across the UK and can be customised for local use. We have decided to adopt this system. It will be configured to send text messages to mobile phones in emergency situations. Monitors for the water cooling are being purchased.
- Improving network links: we are pressing CSD to implement the second 1 Gb connection down the existing cable, as their Cisco and our FORCE10 switches can communicate using a standard protocol. This will mean buying a little extra equipment, which will be requested once it has been costed. We would also prefer to set up a second, independent link to the PoP in the Randell Building, which means pulling a new cable from the OL through Senate House. We are investigating whether this can be done more cheaply than the current budgetary quote from CSD of ~£8K. It is alleged that 4 people are needed to pull the cable.
ATLAS jobs: over this weekend the cluster has been full, mostly with ATLAS jobs. CG reports that only 13% of these jobs were successful. The main cause of failure (53%) was the so-called ‘Emptyout’ error, where the job cannot even start. There is evidence that this problem has got worse here as the number of nodes we put on the LCG has increased. Diagnosis is difficult because of the lack of log information; one doesn’t even know which node the job was trying to run on. CG will work with PT and others to chase this problem. PT was asked to run the validation job that CG has provided on each rack in the cluster, to check that the ATLAS code has been replicated correctly. This takes a few hours. Action: CG and PT. Also, ATLAS V13.0.1 is now available in kit form, so expect to see it released for production in the coming weeks.
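Success and failure percentages of this kind can be tallied from a list of per-job outcomes. The following is a minimal sketch, assuming each job has been reduced to a single status label; the function name `summarise` and the labels `ok`, `emptyout` and `other` are hypothetical, and the sample data is invented for illustration rather than taken from the weekend's job records.

```python
from collections import Counter


def summarise(outcomes):
    """Return the percentage of jobs per status label, rounded to 1 decimal.

    `outcomes` is an iterable of status strings (hypothetical labels).
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: round(100.0 * n / total, 1) for status, n in counts.items()}


if __name__ == "__main__":
    # Invented sample: 200 jobs with made-up outcomes.
    sample = ["ok"] * 30 + ["emptyout"] * 110 + ["other"] * 60
    for status, pct in sorted(summarise(sample).items()):
        print(f"{status}: {pct}%")
```

If the node name could be recovered for each job, the same tally grouped by node or rack would show whether failures cluster on particular hardware.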

Non-LCG cluster report

Number of racks in use is now 5; they are being used to generate more ATLAS V12.0.6 MC for testing b-tagging code.
Status of users' software: no news.
Statistics: none.

CDF and SAM

The disk array on CDFs matrix has been rebuilt by JB and should work. He needs to look at the nodes before the system is fully restored.
CDFJIF1 is only used as a server for CDF software; moving this service to another server so that CDFJIF1 can be retired is still being investigated.

Batch Cluster

The cluster is working smoothly and is heavily used. 26 spare fans have been acquired but not yet fitted; the cluster is up with reduced capacity.

Network Issues

See above.

Plans

Documentation: JB and RF will, in the next few months, produce a document defining what procedures are needed.
Clean Room PC upgrades: JB/RF plan to visit the clean rooms soon to investigate what is needed.
PP Web page support: no more news.

Any Other Business

We may be driven to install SL 4 on the cluster if ATLAS RL 13 can’t be made to work on SL 3. This could be tricky and needs to be planned.
As part of the improvement of NORTHGRID liaison, the NORTHGRID Technical Coordinator, AF, will visit all day on Wednesday 20th June. A summary meeting at 16:00 was proposed, at which the results of the day can be discussed and plans for the development of the LCG cluster over the summer, including the dCache, can be agreed.

Date of next meeting

25 June at 14:00.

Topic revision: r2 - 18 Oct 2012, JohnBland
