Minutes of HEP Computing Users Meeting 28: 7th July 2008
Present: MAH, CG, JB.
Apologies: TG, SJM, JB, CT, MK, TJVB, PA, CG, JNJ, N Mc, RF, DH, BTK
(1) LCG Cluster report:
(a) Cluster is running with 14 racks (481 nodes). Now getting lots of LHCb
jobs. Some outstanding upgrades to gLite need installing. ATLAS jobs have
run with 100% efficiency in the last 24 hours.
(b) The FORCE10: running OK.
(c) Plans and news: A new CE to which the NW Grid cluster nodes will be
attached is on order, using GridPP3 money.
The GridPP DB has announced a new accounting period to share out the next
tranche of Grid hardware money. For each Tier-2 site this will be its
best quarter in 2008 plus the first two quarters of 2009. Currently we are
the best-performing UK Tier-2, delivering about 11% of the total CPU.
(d) dCache and SE status:
dCache needs some software upgrades, and the firmware in the 3ware RAID
card needs updating. This will mean a reboot at some stage.
Testing of DPM as a possible replacement for dCache has started.
(2) Plans and news:
(a) Network within the OL: The special HEP computing meeting last week
concluded that the University should be approached to fund an upgrade of
the entire building network, since this needs specialist contractors to
deal with asbestos in the building and cable ducts. The present HEP
network is in a delicate state: the installed cables seem not to work with
new switches and hubs, which will cause major problems if the current ones fail.
(b) The interactive nodes as UIs need more testing with Ganga: action CG.
(c) The GridPP3 funding announcement has arrived: we got £72.7k, of which
about £54k is left. Most of this will be spent on RAID.
(d) Interviews for the sys admin post will take place this Thursday 10th
July; there are seven on the shortlist.
(3) ATLAS jobs: only a few ATLAS jobs in the last week, but these ran with
high efficiency.
(4) Non-LCG cluster report:
(a) Currently 177 nodes are running flat out with T2K MC. Data will be
stored on T2K-FE and hepstore.
(b) Cockcroft: no news
(5) CDF and SAM: The CDF disks have been powered off, and the CDF rack
will soon be stripped out and moved, ready to be refurbished. The UPS
and network switch etc. for this work are on order.
(6) Network Issues: see above
(7) BATCH CLUSTER: The 40-node DELL SL4 batch cluster is running well,
with hundreds of jobs in the queue. The new /scratch RAID will be installed
again soon, as the firmware fix to the 3ware RAID card seems to work.
(a) Documentation: no news.
(b) Clean room PC upgrades: ongoing but delayed due to DM being off ill.
(c) PP web page support: no more news.
(a) FM (Dave Dutton) promised once again to install the cable from the
chiller units on the roof down to the cluster room so that the voltages
can be monitored. This request was first made in Nov 2007.
An FM rep attended the special HEP computer planning meeting last week and
is trying to get a power audit of the whole building done as a matter of
urgency. There is a need to understand the power implications of adding
new multi-core nodes to the cluster room.
(c) Current ongoing tasks are:
(1) Finish the roll-out of the MON system with auto warnings of node failure;
(2) Continue testing DPM;
(3) Update the 3ware cards in the RAIDs;
(4) Repair failed non-LCG system nodes;
(5) Continue clean room upgrades; a software upgrade to the DAGE machine
was due this afternoon;
(6) Install the new CE for the NW-GRID CSD hardware;
(7) Look at the LUSTRE file system.
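Task (1), the MON auto-warning roll-out, amounts to probing each node and raising a warning when one stops responding. A minimal sketch of that step, assuming a hypothetical node list and an injectable reachability probe (the real MON system's checks, node names, and alert delivery are not specified in these minutes):

```python
# Sketch of an auto-warning pass over cluster nodes.
# The node list and the single-ping probe are illustrative assumptions,
# not the actual MON configuration.
import subprocess

def default_probe(host):
    """Return True if a single one-second ping to the host succeeds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def check_nodes(nodes, probe=default_probe):
    """Return the subset of nodes that fail the reachability probe."""
    return [node for node in nodes if not probe(node)]

def format_warning(failed):
    """Build the warning text that would be sent to the admins, or None."""
    if not failed:
        return None
    return "MON alert: %d node(s) unreachable: %s" % (
        len(failed), ", ".join(failed))
```

In a cron-driven deployment the returned text would be mailed to the on-call list; keeping the probe injectable makes the check easy to test without touching the network.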
(11) Date of next meeting: Monday 21st July 2008.