									           DPM Monitoring

              Wahid Bhimji
         University of Edinburgh,

• New DPM developer Alejandro Álvarez Ayllón is
working on new Nagios-based DPM monitoring
List of Probes:
Bridge to examples running at CERN:
• He’s happy to add more probes (very responsive). He also
   wants feedback on sensible WARN / FAIL values
• We can also contribute our own probes

                                    LCGDM plugins
•    Check validity of host certificates.
      –    check_hostcert
      –    Warning and critical configurable: Days until the certificate expires
•    DB password lifetime
      –    check_oracle_expiration
      –    Warning and critical configurable: Days until the password expires
      –    Connection string, user and password can be specified
•    Disk partitions activity (bytes/s in and out)
      –    check_partition_activity
      –    No warning or critical criteria.
      –    Individual disks can be selected.
•    CPU utilization (System/Idle/IOwait/IRQ)
      –    check_cpu
      –    Warning and critical configurable: Upper limit of CPU percentage per category
•    Network activity: bytes/s in and out (and error percentage)
      –    check_network
      –    No warning or critical criteria.
      –    Individual interfaces can be selected
•    Pool free space plus filesystem status
       –   check_dpm_pool
       –   Warning and critical configurable: Free space per subsystem or per pool. Specified as bytes (with suffixes K,M,G,T,P).
       –   Individual pools can be selected, but no filesystems.
Apr-10
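
As an illustration of the free-space thresholds described for check_dpm_pool, here is a minimal sketch of how sizes with K/M/G/T/P suffixes might be parsed and compared. The function names are hypothetical and the use of 1024-based multiples is an assumption; the real plugin may differ on both points.

```python
# Hypothetical helpers, not taken from the LCGDM plugin source.
# Assumption: suffixes are binary multiples (1K = 1024 bytes).
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def parse_size(text):
    """Convert a string such as '500G' or '2048' into a number of bytes."""
    text = text.strip().upper()
    if text and text[-1] in SUFFIXES:
        return int(float(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


def free_space_status(free_bytes, warn, crit):
    """Return a Nagios status for a pool; less free space is worse."""
    if free_bytes < parse_size(crit):
        return CRITICAL
    if free_bytes < parse_size(warn):
        return WARNING
    return OK


# Usage: a pool with 600 GiB free, warning below 1T, critical below 500G.
status = free_space_status(600 * 1024**3, warn="1T", crit="500G")
```

Checking the critical limit first means a pool below both limits is reported as critical rather than warning.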
                           LCGDM probes (cont.)
•   Collecting information about disk server activity (network, disk I/O, memory, number of connections)
    splitting the information between sequential I/O (gridFTP and rfcp) and random I/O (rfio and xroot)
         –   check_process can be used for that, except for disk I/O and network usage (apparently a kernel patch is needed for
         –   Warning and critical configurable: Number of instances, % of CPU, % of memory, number of threads, number of
             connections, number of file descriptors.
         –   Individual processes can be selected.
•   DPNS ping
         –   check_dpns
         –   Warning and critical configurable: ping time in milliseconds.
         –   Can be used remotely.
•   GridFTP
         –   check_gridftp
         –   No warning criteria. Critical if a file cannot be uploaded, downloaded, or the comparison is not successful.
         –   Can be used remotely.
•   Published information
         –   check_dpm_infosys
         –   No warning criteria. Critical if any of the requested information is not being published.
         –   Can be used remotely.
•   RFIO
         –   check_rfio
         –   Everything that applies to the GridFTP probe. Can NOT be executed locally.
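
The upload / download / compare logic described for check_gridftp (and shared by check_rfio) could be sketched roughly as follows. This is not the probe's actual code: `upload` and `download` are hypothetical callables standing in for whatever GridFTP or RFIO client calls the real probe makes, and the 1 KiB test file size is an arbitrary choice.

```python
import filecmp
import os
import shutil
import tempfile

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes; no WARNING state here


def round_trip_check(upload, download):
    """Upload a small test file, fetch it back, and compare the two copies.

    `upload(local_path)` and `download(local_path)` are hypothetical callables
    wrapping the real transfer client.
    """
    workdir = tempfile.mkdtemp()
    try:
        sent = os.path.join(workdir, "sent")
        received = os.path.join(workdir, "received")
        with open(sent, "wb") as handle:
            handle.write(os.urandom(1024))  # random payload for the test file
        try:
            upload(sent)
            download(received)
            same = filecmp.cmp(sent, received, shallow=False)
        except Exception as error:
            return CRITICAL, "CRITICAL - transfer failed: %s" % error
        if not same:
            return CRITICAL, "CRITICAL - downloaded file differs from upload"
        return OK, "OK - upload, download and comparison succeeded"
    finally:
        shutil.rmtree(workdir, ignore_errors=True)  # always clean up temp files
```

Passing the transfer steps in as callables keeps the sketch independent of any particular client, so the same check could be pointed at either protocol.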
                     From NAGIOS itself
• DB activity and size
         – NAGIOS: check_oracle, check_mysql
• Number of processes and threads in use
         – NAGIOS: check_procs (not threads, though)
• Check if filesystem correctly mounted
         – NAGIOS: check_disk already does this
• Disk partitions: used and free
         – NAGIOS: check_disk
• Memory: swap, free and used
         – NAGIOS: check_swap
• Load average
         – NAGIOS: check_load
                   From grid-monitoring
• Check validity of CRLs
         – crls from org.sam.sec
• Check validity of CAs
         – check_ca_dist
• Number of sockets used for RFIO and number of sockets used
  for gridFTP
         – from Nagios Exchange can be used for that.
• Socket count
         – does that and much more.
• Directory size
         – may be useful.
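
For the RFIO / gridFTP socket counts, one possible approach on Linux is to parse `/proc/net/tcp`. The sketch below is not one of the plugins listed above. It assumes the usual default ports (2811 for the gridFTP control channel, 5001 for rfiod), counts only established IPv4 sockets by local port, and so would miss gridFTP data channels on other ports.

```python
ESTABLISHED = "01"  # TCP state code used in /proc/net/tcp

# Assumed default ports; adjust to the local configuration.
GRIDFTP_PORT = 2811
RFIO_PORT = 5001


def count_sockets_by_port(proc_net_tcp_text, ports):
    """Count established TCP sockets whose local port is one of `ports`."""
    counts = {port: 0 for port in ports}
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like '0100007F:0AFB' (hex address : hex port)
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if fields[3] == ESTABLISHED and local_port in counts:
            counts[local_port] += 1
    return counts


def count_local_sockets(ports=(GRIDFTP_PORT, RFIO_PORT)):
    """Read /proc/net/tcp on the local host (Linux only) and count sockets."""
    with open("/proc/net/tcp") as handle:
        return count_sockets_by_port(handle.read(), ports)
```

Separating the parsing from the file read makes it possible to test the counting logic on captured text from a disk server.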

         Can plot stuff with pnp4nagios
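
pnp4nagios graphs the performance data that a plugin prints after the `|` in its output line. As a rough sketch, a probe could format that line as below; the labels and values are made up for illustration and are not the ones the LCGDM probes emit.

```python
def format_plugin_output(status, message, perfdata):
    """Build one plugin output line: 'STATUS - message | label=value[unit];warn;crit ...'.

    `perfdata` is a list of (label, value, unit, warn, crit) tuples; `warn` and
    `crit` may be None when the probe defines no threshold.
    """
    parts = []
    for label, value, unit, warn, crit in perfdata:
        warn_text = "" if warn is None else str(warn)
        crit_text = "" if crit is None else str(crit)
        parts.append("%s=%s%s;%s;%s" % (label, value, unit, warn_text, crit_text))
    return "%s - %s | %s" % (status, message, " ".join(parts))


# Usage with hypothetical labels: a ping time with thresholds and a
# byte counter without any.
line = format_plugin_output(
    "OK",
    "ping 12 ms",
    [("time", 12, "ms", 100, 500), ("bytes_in", 2048, "B", None, None)],
)
print(line)
```

Probes with no warning or critical criteria (such as the network and partition activity ones) can still emit performance data this way, leaving the threshold fields empty.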

          Conclusions / Questions
• This is nice - Take a look at the probes and give me or
  Alex some feedback
• Or try it out yourself. Not tied to any release
• Do we want to add performance info into this?
       – Like what was in GridPPDPMMonitor
       – Summer student Martin (see DPM Stressing talk) could
         _maybe_ do some of that – Files access

