UNIBE-LHEP site report

Status progress since last meeting (1/2)
•   120 TB added to DPM (new raw capacity: 502TB after RAID6)
    ➡ 1x 36-bay server (4 TB disks) with 3x ARECA 1880 RAID6 controllers, 4 GB cache,
       4 (bonded) + 2 1Gb NICs, 6-core X5650 2.66 GHz, 12 GB RAM
    ➡ Installed with SLC6.3, emi-dpm_disk-1.8.6-3.el6.x86_64 (EMI-2)

    ➡ Upgraded all SLC5/gLite 3.2 DPM disk servers to SLC6/EMI-2
    ➡ Upgraded DPM head node emi-dpm_mysql (security) from 1.8.4 to 1.8.6

    ➡ ATLASDATADISK 350 TB (pledged), ATLASLOCALGROUPDISK, ops and SMSCG VOs on dedicated 22 TB server

•   Commissioned SunBlade 6048 cluster (online last week of May, 95% of nodes up)
    ➡ Fronted by ARC CE (ce01.lhep.unibe.ch): ROCKS 6.1, SLC6.3, nordugrid-arc-2.0.1-1.el6.x86_64

    ➡ Teething issue:
      NIC locks up on Thumper servers (5 of them hit already) => not solved; re-installation attempts have failed so far

    ➡ Recurring issues:
    ➡ Full partition on CVMFS cache (cvmfs-2.1.11-1.el6.x86_64; cache sized to have flash card ~50% populated)

     When the partition is full, CVMFS fails and the node becomes a black hole. CVMFS must then be re-installed,
     as it cannot be recovered in that condition. Will follow up with the CVMFS community; as mitigation, trim the cache periodically:

     # rocks run host wn "cvmfs_config probe; cvmfs_talk cleanup 8000"

    ➡ Random issue:
    ➡ Lost power to one PDU during central cluster relocation => lost network to all nodes
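
The CVMFS cache-trimming mitigation above is run by hand "every now and then". As a rough sketch only, the same cleanup could be triggered when the cache partition crosses a fill threshold. The 80% threshold, the function names, and the idea of running this per node (e.g. from cron) are assumptions for illustration; only the `cvmfs_talk cleanup 8000` command comes from this report.

```python
import shutil
import subprocess


def used_fraction(path):
    """Return the fraction (0..1) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def needs_trim(fraction_used, threshold=0.8):
    """True when the partition is at or above the (assumed) fill threshold."""
    return fraction_used >= threshold


def cleanup_command(target_mb=8000):
    """Build the cleanup command quoted in the report: cvmfs_talk cleanup <MB>."""
    return ["cvmfs_talk", "cleanup", str(target_mb)]


def trim_if_needed(cache_path, threshold=0.8, target_mb=8000, runner=subprocess.run):
    """Run the cleanup only if the cache partition is too full.

    `runner` is injectable so the logic can be exercised without CVMFS installed.
    Returns True if the cleanup command was issued.
    """
    if needs_trim(used_fraction(cache_path), threshold):
        runner(cleanup_command(target_mb))
        return True
    return False
```

On a worker node this might be called as `trim_if_needed("/path/to/cvmfs/cache")`, where the path is whatever partition holds the local cache at the site.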
EGI-InSPIRE/NGI-CH - 02-09-2013                                                                 Gianfranco Sciacca - LHEP Universität Bern
Status progress since last meeting (2/2)
  • Operation of older cluster (ce.lhep.unibe.ch, CentOS 5)
  ➡ Upgraded ARC CE: nordugrid-arc-2.0.3.el5.x86_64; service grid-infosys is now split into bdii and slapd

   Issues during upgrade:
  ➡ Backlog of outdated jobs in controldir caused the infoprovider to fail => cleaned up the directory
  ➡ Service slapd silently failed to start => due to a malformed entry in arc.conf

   Random operational issues:
  ➡ Cooling issues in the server room
  ➡ Kernel panic on all/many nodes => possibly connected to the temperature increase (not confirmed)
  ➡ Local job area becomes read-only on some nodes => black hole
  ➡ ARC instabilities observed since upgrade to newest version 2.0.3 => CE drops in and out of GIIS
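
A node whose local job area has gone read-only keeps accepting jobs and failing them. One possible way to catch this, sketched here under assumptions, is a small write probe run on each node before it is allowed to take jobs. The function name and the choice of a temporary-file probe are illustrative, not something described in this report.

```python
import tempfile


def job_area_writable(path):
    """Return True if a small file can be created and written under `path`.

    A read-only filesystem, a missing directory, or a permission problem
    all raise OSError subclasses, which are reported as "not writable".
    """
    try:
        # The probe file is deleted automatically when the context exits.
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"probe")
            probe.flush()
        return True
    except OSError:
        return False
```

A node failing this check could then be drained or taken offline in the batch system rather than left to fail jobs.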

Immediate plans
  ➡ Adding 55 8-core nodes to older cluster (decommissioned by CERN/ATLAS, Xeon E5420 @2.5 GHz, 16GB RAM)
    May operate them in hyperthreading mode if RAM is doubled
    Scale Lustre up accordingly
  ➡ Upgrade older cluster ce.lhep.unibe.ch to ROCKS 6.1, SLC6, nordugrid-arc-2.0.3-1.el6.x86_64
  ➡ Upgrade ARC to nordugrid-arc-2.0.3-1.el6.x86_64 on ce01.lhep.unibe.ch
  ➡ Re-install failed Thumpers and deploy them
  ➡ Implement ambient temperature monitoring in the room
  ➡ IPMI scripts to monitor machine temperatures and raise alarms
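
As a minimal sketch of what the planned IPMI temperature scripts might look like, the snippet below parses pipe-separated sensor lines of the kind `ipmitool sdr type Temperature` typically prints and flags sensors above a limit. The sample sensor names, the exact column layout, and the 60 °C limit are assumptions for illustration; real output varies between BMCs and should be checked on the actual hardware.

```python
import re

# Matches a reading such as "22 degrees C" in the last column.
_READING = re.compile(r"(-?\d+(?:\.\d+)?)\s+degrees C")

# Hypothetical sample in the assumed format: name | id | status | entity | reading
SAMPLE = """\
Ambient Temp     | 0Eh | ok  |  7.1 | 22 degrees C
CPU1 Temp        | 01h | ok  |  3.1 | 71 degrees C
CPU2 Temp        | 02h | ns  |  3.2 | No Reading
"""


def parse_temperatures(sdr_output):
    """Map sensor name to temperature in degrees C; sensors without a reading are skipped."""
    readings = {}
    for line in sdr_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2:
            continue
        match = _READING.search(fields[-1])
        if match:
            readings[fields[0]] = float(match.group(1))
    return readings


def over_limit(readings, limit_c):
    """Return the sorted names of sensors reading strictly above `limit_c`."""
    return sorted(name for name, temp in readings.items() if temp > limit_c)
```

With the sample above, `over_limit(parse_temperatures(SAMPLE), 60.0)` returns `["CPU1 Temp"]`; an alarm hook (mail, Nagios, etc.) would be attached at that point.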
