					           DPM Monitoring

              Wahid Bhimji
         University of Edinburgh,




Apr-10         Wahid Bhimji – Files access   1
                            Intro
• New DPM developer Alejandro Álvarez Ayllón is
working on new Nagios-based DPM monitoring
List of Probes:
https://twiki.cern.ch/twiki/bin/view/EGEE/LCGDMMonitoring
Bridge to examples running at CERN:
http://aalvarez.web.cern.ch/aalvarez/cgi/bridge.py/gt-septic/nagios3/
• He’s happy to add more probes (very responsive). He also
   wants feedback on sensible WARN / FAIL values
• We can also contribute our own probes
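
For anyone thinking about contributing a probe: a Nagios plugin is essentially a program that prints one status line and exits with a conventional code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is only a minimal sketch, assuming a metric where higher values are worse. The names `classify` and `main` and the command-line layout are hypothetical and are not taken from the LCGDM probes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measurement to a Nagios state, assuming higher values are worse."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def main(argv):
    """Hypothetical command line: probe.py VALUE WARN CRIT."""
    try:
        value, warn, crit = (float(arg) for arg in argv[1:4])
    except ValueError:
        # Raised for non-numeric input and for a missing argument.
        print("UNKNOWN - usage: probe.py VALUE WARN CRIT")
        return UNKNOWN
    state = classify(value, warn, crit)
    print(f"{LABELS[state]} - value={value} (warn={warn}, crit={crit})")
    return state
```

A real plugin would finish with `sys.exit(main(sys.argv))` so that Nagios sees the state as the process exit code.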

                                    LCGDM plugins
•    Check validity of host certificates.
      –    check_hostcert
      –    Warning and critical configurable: Days until the certificate expires
•    DB password lifetime
      –    check_oracle_expiration
      –    Warning and critical configurable: Days until the password expires
      –    Connection string, user and password can be specified
•    Disk partitions activity (bytes/s in and out)
      –    check_partition_activity
      –    No warning or critical criteria.
      –    Individual disks can be selected.
•    CPU utilization (System/Idle/IOwait/IRQ)
      –    check_cpu
      –    Warning and critical configurable: Upper limit of CPU percentage per category
•    Network activity: bytes/s in and out (and error percentage)
      –    check_network
      –    No warning or critical criteria.
      –    Individual interfaces can be selected
•    Pool free space plus filesystem status
       –   check_dpm_pool
       –   Warning and critical configurable: Free space per subsystem or per pool. Specified as bytes (with suffixes K,M,G,T,P).
       –   Individual pools can be selected, but no filesystems.
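
The free-space thresholds with K/M/G/T/P suffixes could be handled along these lines. This is a sketch only: `parse_size` and `pool_state` are hypothetical helpers, and the code assumes binary multiples (1K = 1024 bytes). The actual check_dpm_pool probe may interpret the suffixes differently.

```python
import re

# Assumption: binary multiples. The real probe may use powers of 1000 instead.
_MULTIPLIERS = {
    "": 1,
    "K": 1024,
    "M": 1024 ** 2,
    "G": 1024 ** 3,
    "T": 1024 ** 4,
    "P": 1024 ** 5,
}


def parse_size(text):
    """Convert a string such as '500G' or '2T' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGTP]?)\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, suffix = match.groups()
    return int(float(number) * _MULTIPLIERS[suffix])


def pool_state(free_bytes, warn, crit):
    """Return 'CRITICAL', 'WARNING' or 'OK'; less free space is worse."""
    if free_bytes <= parse_size(crit):
        return "CRITICAL"
    if free_bytes <= parse_size(warn):
        return "WARNING"
    return "OK"
```

Note the comparison direction: for free space the critical threshold is the smaller number, the opposite of a CPU or load check.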
                           LCGDM probes (cont.)
•   Collecting information about disk server activity (network, disk I/O, memory, number of connections),
    split between sequential I/O (gridFTP and rfcp) and random I/O (rfio and xroot)
         –   check_process can be used for that, except for disk I/O and network usage (apparently a kernel patch is needed for
             that)
         –   Warning and critical configurable: Number of instances, % of CPU, % of memory, number of threads, number of
             connections, number of file descriptors.
         –   Individual processes can be selected.
•   DPNS ping
         –   check_dpns
         –   Warning and critical configurable: ping time in milliseconds.
         –   Can be used remotely.
•   GridFTP
         –   check_gridftp
         –   No warning criteria. Critical if a file cannot be uploaded, downloaded, or the comparison is not successful.
         –   Can be used remotely.
•   Published information
         –   check_dpm_infosys
         –   No warning criteria. Critical if any of the requested information is not being published.
         –   Can be used remotely.
•   RFIO
         –   check_rfio
         –   Everything that applies to the GridFTP probe applies here too. Can NOT be executed locally.
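
The upload / download / compare logic described for check_gridftp can be sketched as below. The transfer steps are passed in as callables because the real probe's transfer mechanism is not shown in these slides. `roundtrip_check`, `upload` and `download` are hypothetical names, and comparing SHA-256 digests is an assumption about how the comparison might be done.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of a file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def roundtrip_check(upload, download, size=1024):
    """Upload a random file, fetch it back and compare.

    upload(local_path) and download(local_path) are caller-supplied callables.
    Returns (exit_code, message). There is no warning state: any failure
    is reported as CRITICAL (2).
    """
    workdir = tempfile.mkdtemp()
    source = os.path.join(workdir, "probe.src")
    fetched = os.path.join(workdir, "probe.dst")
    try:
        with open(source, "wb") as handle:
            handle.write(os.urandom(size))
        try:
            upload(source)
            download(fetched)
        except Exception as exc:  # any transfer failure is critical
            return 2, f"CRITICAL - transfer failed: {exc}"
        if sha256_of(source) != sha256_of(fetched):
            return 2, "CRITICAL - downloaded file differs from uploaded file"
        return 0, "OK - upload, download and comparison succeeded"
    finally:
        # Always remove the scratch files, whatever the outcome.
        shutil.rmtree(workdir, ignore_errors=True)
```

For a dry run, `upload` and `download` can be plain local copies (for example `shutil.copy` to and from a scratch path). A remote probe would wrap the site's actual transfer client instead.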
                     From NAGIOS itself
• DB activity and size
         – NAGIOS: check_oracle, check_mysql
• Number of processes and threads in use
         – NAGIOS: check_procs (not threads, though)
• Check if filesystem correctly mounted
         – NAGIOS: check_disk already does this
• Disk partitions: used and free
         – NAGIOS: check_disk
• Memory: swap, free and used
         – NAGIOS: check_swap
• Load average
         – NAGIOS: check_load
                   From grid-monitoring
• Check validity of CRLs
         – crls from org.sam.sec
• Check validity of CAs
         – check_ca_dist
• Number of sockets used for RFIO and number of sockets used
  for gridFTP
         – check_netstat.pl from Nagios Exchange can be used for that.
• Socket count
         – check_netstat.pl does that and much more.
• Directory size
         – check_dirsize.sh may be useful.
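
Counting sockets per service comes down to grouping established connections by local port. Below is a minimal sketch that parses `netstat -tn` style text. The function name is hypothetical, and the port numbers (2811 for the gridFTP control channel, 5001 for RFIO) are assumed defaults that should be checked against the site configuration. It does not reflect how check_netstat.pl is implemented.

```python
from collections import Counter

# Assumed default ports; verify against the local configuration.
SERVICE_PORTS = {2811: "gridftp", 5001: "rfio"}


def count_sockets(netstat_text, ports=SERVICE_PORTS):
    """Count ESTABLISHED TCP connections per service from `netstat -tn` output."""
    counts = Counter({name: 0 for name in ports.values()})
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local-address foreign-address state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "ESTABLISHED":
            continue
        # The port is whatever follows the last ':' of the local address.
        port_text = fields[3].rsplit(":", 1)[-1]
        if port_text.isdigit() and int(port_text) in ports:
            counts[ports[int(port_text)]] += 1
    return dict(counts)
```

Only the local port is inspected, so this counts inbound connections to the services on a disk server. Data-channel sockets on other ports are not included.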

         Can plot stuff with pnp4nagios
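
pnp4nagios plots the performance data that a plugin appends after a `|` in its output, in the standard Nagios form `label=value[UOM];warn;crit;min;max`. A small sketch of a formatter follows; the function names and any example metrics are hypothetical.

```python
def perfdata(label, value, uom="", warn="", crit="", minimum="", maximum=""):
    """Format one metric as Nagios performance data."""
    # Labels containing spaces must be wrapped in single quotes.
    if " " in label:
        label = f"'{label}'"
    fields = [f"{label}={value}{uom}", str(warn), str(crit), str(minimum), str(maximum)]
    # Trailing empty fields may be omitted.
    while len(fields) > 1 and fields[-1] == "":
        fields.pop()
    return ";".join(fields)


def status_line(text, *metrics):
    """Join the human-readable text and perfdata items with the '|' separator."""
    return f"{text} | {' '.join(metrics)}" if metrics else text
```

For instance, `status_line("OK - ping fine", perfdata("ping", 12, "ms"))` yields a line whose part after the `|` is what gets graphed.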




          Conclusions / Questions
• This is nice - Take a look at the probes and give me or
  Alex some feedback
• Or try it out yourself. Not tied to any release
http://etics-repository.cern.ch:8080/repository/pm/volatile/repomd/name/lcgdm_head_sl5_x86_64_gcc412/index.html
• Do we want to add performance info into this?
       – Like what was in GridPPDPMMonitor
       – Summer student Martin (see DPM Stressing talk) could
         _maybe_ do some of that

				