Nagios and RRD

           Grid Monitoring using
           Nagios and RRDtool


                   Ian Stokes-Rees
                Oxford Particle Physics

HEPSYSMAN Conference                29 April 2003 – Ian Stokes-Rees
            In a perfect world …

• Individual node status
    o   Is it up?
    o   What is its load?
    o   What is the memory and swap usage?
    o   NFS and network load?
    o   Are the partitions full?
    o   Are applications and services running properly?
• Amalgamated node status
    o Same info, but across groups of nodes

            In a perfect world …

• Historical information
    o Trends
• Notification of service states
    o   e.g. Storage down to 100 megs free = Warning
    o   Storage down to 10 megs free = Critical
    o   sshd no longer running = Failure
    o   notify by email, pager, mobile
• Easy access to monitoring information
    o web, email, digest, mobile

            In a perfect world …
   • Avoidance of “Too many red flashing lights”
       o “Just the facts, ma’am” – only want root cause failures to be
         reported, not a cascade of every downstream failure.
       o also includes avoiding unnecessary checks
       o e.g. HTTP responding, therefore no need to ping
       o e.g. power outage, doesn’t ping, so don’t bother trying
         anything else
   • Other wish list requirements?




Aspects of Current Grid Monitoring

1. LDAP (Lightweight Directory Access Protocol) is the current
   foundation for MDS. Designed for frequent reads and infrequent writes.
2. MDS (Monitoring and Discovery Service) uses LDAP for maintaining
   static and dynamic system details.
3. R-GMA (Relational Grid Monitoring Architecture) meant to address
   shortcomings of LDAP based MDS system by using hierarchy of
   relational databases. Now being deployed.
4. GRIS (Grid Resource Information Service) stores details about the
   state of “the grid” (at least from the local node)
5. GIIS (Grid Index Information Service) ties together several GRISes
6. HBM (Heart Beat Monitor) monitors Globus services – seems to have
   died a quiet death

Existing Grid Monitoring Lacks…

• Historical information for trends
• Simple interface for accessing information
• Automated response to changes in system
  state

     Here is where RRDtool and Nagios can
                    contribute

                       RRDtool

                              www.rrdtool.com

•     Round Robin Database for time series data storage
•     Command line based
•     From the author of MRTG
•     Made to be faster and more flexible than MRTG
•     Includes CGI and Graphing tools, plus APIs
•     Solves the Historical Trends and Simple Interface
      problems

  Define Data Sources (Inputs)
• DS:speed:COUNTER:600:U:U
• DS:fuel:GAUGE:600:U:U
    o DS = Data Source
    o speed, fuel = “variable” names
    o COUNTER, GAUGE = variable type
    o 600 = heart beat – UNKNOWN returned for interval if
      nothing received after this amount of time
    o U:U = limits on minimum and maximum variable
      values (U means unknown and any value is
      permitted)
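
The difference between the two variable types can be illustrated with a minimal Python sketch. This is not RRDtool's implementation: the function names `counter_rate` and `gauge_value` are hypothetical, `None` stands in for UNKNOWN, and counter wrap-around and the alignment of samples to step boundaries are ignored.

```python
from typing import Optional

HEARTBEAT = 600  # seconds, as in DS:speed:COUNTER:600:U:U


def counter_rate(prev_t: int, prev_v: float, t: int, v: float,
                 heartbeat: int = HEARTBEAT) -> Optional[float]:
    """Per-second rate of change between two COUNTER readings.

    Returns None (standing in for UNKNOWN) when the gap between the
    two readings is longer than the heartbeat.
    """
    gap = t - prev_t
    if gap <= 0 or gap > heartbeat:
        return None
    return (v - prev_v) / gap


def gauge_value(prev_t: int, t: int, v: float,
                heartbeat: int = HEARTBEAT) -> Optional[float]:
    """A GAUGE reading is kept as-is, unless it arrives too late."""
    gap = t - prev_t
    if gap <= 0 or gap > heartbeat:
        return None
    return v
```

With an odometer reading of 12345 km followed by 12357 km five minutes later, `counter_rate` gives 12 km / 300 s = 0.04 km/s, while a fuel reading of 5.8 L stays 5.8 under `gauge_value`.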

      Define Archives (Outputs)
• RRA:AVERAGE:0.5:1:24
• RRA:AVERAGE:0.5:6:10
    o RRA = Round Robin Archive
    o AVERAGE = consolidation function
    o 0.5 = up to 50% of the primary samples in a consolidation
      interval may be UNKNOWN before the consolidated value itself
      becomes UNKNOWN

    o 1:24 = this RRA keeps each sample (average over one 5 minute
      primary sample), 24 times (which is 2 hours’ worth)

    o 6:10 = this RRA keeps an average over every six 5 minute
      primary samples (30 minutes), 10 times (which is 5 hours’ worth)
• Clear as mud!
    o all depends on original step size which defaults to 5 minutes
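
The arithmetic behind "2 hours" and "5 hours" is step size × samples per row × rows. A small Python sketch, with hypothetical helper names `parse_rra` and `rra_coverage_seconds`, and assuming the default 300 second step:

```python
from typing import Tuple

DEFAULT_STEP = 300  # seconds; the RRDtool default step size


def parse_rra(spec: str) -> Tuple[str, float, int, int]:
    """Split 'RRA:AVERAGE:0.5:6:10' into (function, xff, steps, rows)."""
    tag, function, xff, steps, rows = spec.split(":")
    if tag != "RRA":
        raise ValueError(f"not an RRA definition: {spec!r}")
    return function, float(xff), int(steps), int(rows)


def rra_coverage_seconds(spec: str, step: int = DEFAULT_STEP) -> int:
    """Time span covered by one archive, in seconds."""
    _function, _xff, steps, rows = parse_rra(spec)
    return step * steps * rows


for spec in ("RRA:AVERAGE:0.5:1:24", "RRA:AVERAGE:0.5:6:10"):
    hours = rra_coverage_seconds(spec) / 3600
    print(f"{spec} covers {hours:g} hours")
```

Changing `step` changes every archive's coverage proportionally, which is why the step size has to be fixed before the RRAs are chosen.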


        RRDtool Database Format
 [Diagram: one RRD file, fed with --step 300 (5 minute input step
  size), holding three round robin archives]

 • RRA 1:24: recent data stored once every 5 minutes for the past
   2 hours
 • RRA 6:10: medium length data averaged to one entry per half hour
   for the last 5 hours
 • RRA 288:365: old data averaged to one entry per day for the last
   365 days
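
How several primary samples collapse into one archive entry can be sketched in Python. This is a simplified model under stated assumptions, not RRDtool's code: `None` stands in for UNKNOWN, the function name `consolidate_average` is hypothetical, and the rule used is that the result is UNKNOWN once more than the xff fraction of the samples is UNKNOWN.

```python
from typing import List, Optional


def consolidate_average(samples: List[Optional[float]],
                        xff: float = 0.5) -> Optional[float]:
    """Average one consolidation interval of primary samples.

    Unknown samples (None) are skipped. If their share of the interval
    exceeds xff, the consolidated value is itself unknown.
    """
    if not samples:
        return None
    unknown = sum(1 for s in samples if s is None)
    if unknown / len(samples) > xff:
        return None
    known = [s for s in samples if s is not None]
    if not known:
        return None
    return sum(known) / len(known)


# Six 5 minute samples become one 30 minute entry in the 6:10 archive.
half_hour = consolidate_average([1.0, 2.0, 3.0, None, None, None])
print(half_hour)
```

With three of six samples missing the entry is still produced, but a fourth missing sample would turn the whole half hour into UNKNOWN.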
                 RRDtool Example
• Monitoring a car – fuel in the tank plus odometer
                  Time    Odometer (km)   Fuel (L)   Event
                  12:05   12345           7.0
                  12:10   12357           5.8
                  12:15   12363           5.2        STOP
                  12:20   12363           5.2
                  12:25   12363           5.2        RESTART
                  12:30   12373           4.2
                  12:35   12383           3.2
                  12:40   12393           2.2
                  12:45   12399           1.6
                  12:50   12405           9.0        REFUEL
                  12:55   12411           8.4
                  13:00   12415           8.0
                  13:05   12420           7.5
                  13:10   12422           7.3
                  13:15   12423           7.2


                 RRDtool Example
• Create an RRD to store distance and fuel
rrdtool create car.rrd \
   --start 920804400 \
   DS:speed:COUNTER:600:U:U \
   DS:fuel:GAUGE:600:U:U \
   RRA:AVERAGE:0.5:1:24 \
   RRA:AVERAGE:0.5:6:10
• --start defines the earliest time the RRD accepts; every update
  must carry a later timestamp

                    RRDtool Example

• Input data:

rrdtool update car.rrd 920804700:12345:7.0 920805000:12357:5.8
rrdtool update car.rrd 920805300:12363:5.2 920805600:12363:5.2
rrdtool update car.rrd 920805900:12363:5.2 920806200:12373:4.2
rrdtool update car.rrd 920806500:12383:3.2 920806800:12393:2.2
rrdtool update car.rrd 920807100:12399:1.6 920807400:12405:9.0
rrdtool update car.rrd 920807700:12411:8.4 920808000:12415:8.0
rrdtool update car.rrd 920808300:12420:7.5 920808600:12422:7.3
rrdtool update car.rrd 920808900:12423:7.2
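
Each update argument has the form timestamp:speed:fuel, with values in the order the data sources were defined. As a sketch of where the numbers come from, the following Python builds the same arguments from the car readings, assuming the first reading (12:05) falls one step after the `--start` time. The helper name `update_args` is hypothetical, and nothing here calls rrdtool.

```python
from typing import List, Tuple

START = 920804400  # --start value given to rrdtool create
STEP = 300         # 5 minutes between readings

# (odometer in km, fuel in litres), one pair per 5 minute reading
READINGS: List[Tuple[int, float]] = [
    (12345, 7.0), (12357, 5.8), (12363, 5.2), (12363, 5.2), (12363, 5.2),
    (12373, 4.2), (12383, 3.2), (12393, 2.2), (12399, 1.6), (12405, 9.0),
    (12411, 8.4), (12415, 8.0), (12420, 7.5), (12422, 7.3), (12423, 7.2),
]


def update_args(readings: List[Tuple[int, float]],
                first_ts: int = START + STEP,
                step: int = STEP) -> List[str]:
    """Format readings as 'timestamp:speed:fuel' update arguments."""
    return [f"{first_ts + i * step}:{odometer}:{fuel}"
            for i, (odometer, fuel) in enumerate(readings)]


for arg in update_args(READINGS):
    print("rrdtool update car.rrd", arg)
```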




                RRDtool Graphing
• Now with data in the RRD, RRDtool can generate
  graphs:
rrdtool graph speed.gif \
        --start 920804400 --end 920808000 \
        --vertical-label m/s \
        DEF:myspeed=car.rrd:speed:AVERAGE \
        DEF:myfuel=car.rrd:fuel:AVERAGE \
        CDEF:realspeed=myspeed,1000,\* \
        LINE2:realspeed#FF0000 \
        LINE2:myfuel#00FF00
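
The CDEF expression is written in reverse Polish notation: operands first, operator last, separated by commas. A minimal Python sketch of how such an expression evaluates, covering only the four arithmetic operators (RRDtool supports many more) and using the hypothetical name `eval_rpn`:

```python
import operator
from typing import Dict, List

_OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(expression: str, variables: Dict[str, float]) -> float:
    """Evaluate a comma-separated RPN expression such as 'myspeed,1000,*'."""
    stack: List[float] = []
    for token in expression.split(","):
        if token in _OPERATORS:
            right = stack.pop()
            left = stack.pop()
            stack.append(_OPERATORS[token](left, right))
        elif token in variables:
            stack.append(variables[token])
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError(f"malformed expression: {expression!r}")
    return stack[0]


# 0.04 km/s stored in the RRD becomes 40 m/s on the graph.
print(eval_rpn("myspeed,1000,*", {"myspeed": 0.04}))
```

This is why the vertical label reads m/s: the odometer counts kilometres, so the stored rate is in km/s and the CDEF multiplies it by 1000.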


      RRDtool Graphing Output




•   Much more interesting graphs possible
•   Multiple RRDs may be used as sources for variables
•   Auto-interpolation of points
•   Functions and calculations can be applied to variables
•   Legends, labels, and text can be inserted

                       Nagios

                            www.nagios.org

•     Instantaneous service level monitoring
•     Web based interface
•     Somewhat complicated set of configuration
      files to manually edit
•     Automated notification of change in service
      level (email, phone, etc.)
•     Defines OK, WARNING, CRITICAL, and UNKNOWN service states
What Do We Want to Monitor?
Static                 Dynamic               Services
CPU (SPECint)          Load                  Live
RAM (swap)             Mem/swap usage        Accessible
HD capacity            Storage available     Globus
Network b/w            Network utilisation   SSH
OS                     Users                 Etc.
Applications           Processes
Location, Admin        Queues (PBS)


         Nagios Host Definitions
• Define details about each node and their hierarchy in the network:
define host{
    host_name                       tbce01
    alias                           Testbed CE
    address                         163.1.243.105
    parents                         edg-testbed
    notifications_enabled           1
    process_perf_data               1
    check_command                   check-host-alive
    notification_interval           120
    notification_period             24x7
    notification_options            d,u,r
}
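
The `parents` entry is what lets Nagios report a root cause instead of a cascade: a host whose check fails is treated as DOWN if a parent is still up, and as UNREACHABLE if every parent has failed too. The Python sketch below is a simplified model of that idea, not Nagios's own logic. The function name `classify_hosts` and the host `tbwn01` are hypothetical, and the parent graph is assumed to be free of cycles.

```python
from typing import Dict, List, Set


def classify_hosts(parents: Dict[str, List[str]],
                   failing: Set[str]) -> Dict[str, str]:
    """Label each host UP, DOWN, or UNREACHABLE.

    parents maps every host to its list of parent hosts (empty for a
    top-level host); failing holds the hosts whose check failed.
    """
    states: Dict[str, str] = {}

    def state_of(host: str) -> str:
        if host not in states:
            if host not in failing:
                states[host] = "UP"
            else:
                above = parents.get(host, [])
                # Blocked only if there are parents and none of them is up.
                blocked = bool(above) and all(
                    state_of(p) != "UP" for p in above)
                states[host] = "UNREACHABLE" if blocked else "DOWN"
        return states[host]

    for host in parents:
        state_of(host)
    return states


network = {
    "edg-testbed": [],
    "tbce01": ["edg-testbed"],
    "tbwn01": ["tbce01"],  # hypothetical worker node behind the CE
}
print(classify_hosts(network, {"edg-testbed", "tbce01", "tbwn01"}))
```

In this example only `edg-testbed` is DOWN; the two hosts behind it are UNREACHABLE, so a single failure is the one that needs attention.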



     Nagios Service Definitions
• Define details about each service:
define service{
    name                             ping
    check_command                    check_ping!100.0,20%!500.0,60%
    contact_groups                   linux-admins
    check_period                     24x7
    max_check_attempts               3
    normal_check_interval            5
    notification_interval            120
    notification_period              24x7
    notification_options             c,r
}
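
In the `check_command` value, `!` separates the command name from its arguments. For `check_ping` the first argument is the WARNING threshold and the second the CRITICAL threshold, each given as round trip time in milliseconds and packet loss in percent. A Python sketch of how such a value could be interpreted follows. The function names are hypothetical, and treating a value exactly on a threshold as exceeding it is an assumption.

```python
from typing import List, Tuple

Threshold = Tuple[float, float]  # (round trip time in ms, packet loss in %)


def parse_check_command(value: str) -> Tuple[str, List[str]]:
    """Split 'check_ping!100.0,20%!500.0,60%' into name and arguments."""
    name, *args = value.split("!")
    return name, args


def parse_threshold(arg: str) -> Threshold:
    """Turn '100.0,20%' into (100.0, 20.0)."""
    rta, loss = arg.split(",")
    return float(rta), float(loss.rstrip("%"))


def ping_state(rta_ms: float, loss_pct: float,
               warn: Threshold, crit: Threshold) -> str:
    """Classify one ping measurement against both thresholds."""
    if rta_ms >= crit[0] or loss_pct >= crit[1]:
        return "CRITICAL"
    if rta_ms >= warn[0] or loss_pct >= warn[1]:
        return "WARNING"
    return "OK"


name, args = parse_check_command("check_ping!100.0,20%!500.0,60%")
warn, crit = parse_threshold(args[0]), parse_threshold(args[1])
print(name, ping_state(rta_ms=150.0, loss_pct=0.0, warn=warn, crit=crit))
```

A 150 ms round trip with no loss is a WARNING under this definition, and with `notification_options c,r` it would not trigger a notification, since only critical and recovery events are listed.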



Nagios Service and Host Polling
• Pull model, where Nagios server executes
  command to fetch host or service status
• Requires remote hosts and services to cooperate
    o NRPE installed on clients allows server to execute “plugins” to
      poll for information
    o Alternatively use existing client reporting mechanisms (ping,
      wget, http)
• Server responsible for configuration of polling
  intervals and details to be polled
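
A plugin in this pull model is just a program that prints one line of status text and reports its result through the exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. The Python sketch below checks free space on a partition. The 100 MB and 10 MB thresholds are borrowed from the storage example earlier in the talk, and the function names and output wording are hypothetical rather than taken from any shipped plugin.

```python
import shutil
from typing import Tuple

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
_LABELS = {OK: "OK", WARNING: "WARNING",
           CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_free(free_mb: float, warn_mb: float = 100.0,
                  crit_mb: float = 10.0) -> int:
    """Map free space in megabytes to a plugin return code."""
    if free_mb <= crit_mb:
        return CRITICAL
    if free_mb <= warn_mb:
        return WARNING
    return OK


def check_partition(path: str = "/") -> Tuple[int, str]:
    """Return (return code, status line) for the partition holding path."""
    try:
        free_mb = shutil.disk_usage(path).free / (1024 * 1024)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    code = classify_free(free_mb)
    return code, f"DISK {_LABELS[code]} - {free_mb:.0f} MB free on {path}"


code, line = check_partition("/")
print(line)
```

A real plugin would end by calling `sys.exit(code)` so that the server, or NRPE on its behalf, can read the state from the process exit status.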



Nagios Service and Host Reporting

• Push model, where services and hosts
  decide when to report status to Nagios
  server
    o   push data when available/relevant
    o   generally full access to node-local data
    o   requires configuring every node independently
    o   authentication of nodes at server
    o   nodes need to know who to send data to
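
The slides do not say which push mechanism is meant. One that Nagios provides is the passive check: a result is submitted as a `PROCESS_SERVICE_CHECK_RESULT` line written to the server's external command file, either locally or relayed from the node by an add-on such as NSCA. The Python sketch below only formats such a line. The function name is hypothetical, and the location of the command file is site-specific and left out.

```python
import time
from typing import Optional


def passive_result_line(host: str, service: str, return_code: int,
                        output: str, now: Optional[int] = None) -> str:
    """Format one passive service check result as an external command.

    Layout: [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
    """
    timestamp = int(time.time()) if now is None else now
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")


# A node reporting that its ping service is fine (return code 0 = OK).
print(passive_result_line("tbce01", "ping", 0, "PING OK"), end="")
```

The host and service names in the line must match definitions that already exist on the server, which is one reason the push model needs configuration on both sides.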



         Host and Service Status




   Finally, some other monitors
• NWS (Network Weather Service) attempts to
  predict network utilisation from historical
  information
• Ganglia, a cluster monitoring system, provides
  aggregate graphs of cluster performance –
  Globus/EDG tie-ins underway
• Map Center, an EDG project to monitor Grid status
  and services
• ActiveMap, GridPortal, and InfoPortal*
  appear to be inactive projects