Windows Monitoring - Nagios

					5 years of vaporware
   These slides represent the work and opinions
    of the author and do not constitute official
    positions of any organization sponsoring the
    author’s work.
   This material has not been peer reviewed and
    is presented here as-is with the permission of
    the author.
   The author assumes no liability for any
    content or opinion expressed in this
    presentation and/or use of content herein.
   It is not my fault! It is their fault! It is your fault!
   Developer (not manager)
    ◦ Not working with Nagios
   Accidentally ended up in our NOC
    ◦ Hated BB so we migrated to Nagios
   2003: The birth of NSClient++
    ◦ NSClient sucked (Broke Exchange)
    ◦ NRPE_NT was too much work
   2004: The open source of NSClient++
    ◦ “just for fun”
   2007: The rebirth of NSClient++
    ◦ Got a lot of emails and hits on the webpage
   2011: The Present
    ◦ 0.3.9 out last May
    ◦ 0.4.0 out as alpha
   Windows Monitoring and NSClient++
    ◦ Quick Introduction
   What’s new in 0.3.9
    ◦   Disk/File/*
    ◦   Scheduled Tasks
    ◦   Aliases
    ◦   Crash Handling
   What’s new in 0.4.0
    ◦   New core
    ◦   Unix support
    ◦   New settings subsystem
    ◦   New protocol
    ◦   Python Scripting
   The end of NSClient++!
   Q/A
Quick Introduction
   What is NSClient?
    ◦ A (pretty old) program
      pNSClient
      A (pretty limited) protocol
      check_nt
    ◦ A (pretty incorrect) concept
      ”Windows monitoring”

   What is it not?
    ◦ NSClient++!
      NSClient++ was written as a replacement for pNSClient
      But it has evolved much since then
   NSClient++
    ◦ Freedom!
         Custom scripts
         Decentralized or centralized
         Active or Passive
         Can monitor “anything” (including your application)
         Can perform “tasks” (fix your problems)
   Other options:
    ◦ SNMP
       Generally complex to use and limited on “standard” hardware
    ◦ pNSClient/NRPE_NT/OpMonAgent/*
       Old, outdated and usually limited functionality
    ◦ “Agentless” WMI
       Limited functionality
       Enforces centralized and active monitoring
   But...
    ◦ I am biased, so you might not want to take my word for it...
Protocol   Method    Encryption   Auth   Payload   M. args.   M. cmds   HTTP
NSClient   Active    No           Yes    No        Yes        No        No
NRPE       Active    No           No     1024      Yes        No        No
NSCA       Passive   Yes          Yes    512       Yes        Yes       No
NRDP       Passive   Yes          Yes    ∞         Yes        Yes       Yes
NSCP       Active    Yes          Yes    ∞         Yes        Yes       Yes
DNSCP      MQ        No           Yes    ∞         Yes        Yes       No
check_mk   Active    ?            No     ∞         No         Yes       No
   Internals:
    ◦   C++
    ◦   Around 75,000 lines of code
    ◦   Actively developed (unfortunately only by me)
    ◦   Modularized design (use what you need)
   Runs on:
    ◦ Windows: NT4, w2k, XP, w2k3, Vista, w2k8, X64, X86 …
    ◦ Unix: Linux/Debian (probably many/most others as well)
   Current Version:
    ◦ 0.3.9 with 0.4.0 in beta
   Most features require NRPE or NSCA (or NSCP)
   Documentation online (WIKI)
   Not supported by a commercial entity
    ◦ Donations welcome
    ◦ Sponsoring available (contact me for details)
   Used by a lot of people (I think)
    ◦ Impossible to estimate any figures
   Please, Help out!
    ◦ Add documentation
    ◦ Report problems
    ◦ Contribute ideas, thoughts, etc…
Using NSClient++
   NSClient++ is a command line program!
    ◦ nsclient++ -start (net start nsclientpp)
    ◦ nsclient++ -stop (net stop nsclientpp)
    ◦ nsclient++ -test (is your friend!)
    ◦ notepad nsc.ini
   Testing:
    1. Local (nsclient++ -test)
    2. From CLI (check_nrpe ...)
    3. From Nagios (add command)
       Works with “anything”
    ◦    Including many non Nagios based systems
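Steps 2 and 3 of the testing order can be scripted. The following is a minimal sketch, assuming check_nrpe is on the PATH of the monitoring host; the function names are made up for illustration. It builds the check_nrpe command line and translates the plugin exit code into a Nagios state.

```python
import subprocess

# Standard Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_check_nrpe(host, command, args=(), plugin="check_nrpe"):
    """Build the argument list for a check_nrpe call (testing step 2)."""
    argv = [plugin, "-H", host, "-c", command]
    if args:
        # -a passes the remaining values as arguments to the remote command.
        argv += ["-a", *args]
    return argv


def state_name(returncode):
    """Translate a plugin exit code into its Nagios state name."""
    return STATES.get(returncode, "UNKNOWN")


def run_check(host, command, args=()):
    """Run check_nrpe and return (state, first line of plugin output)."""
    result = subprocess.run(build_check_nrpe(host, command, args),
                            capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return state_name(result.returncode), lines[0] if lines else ""
```

If run_check reports the same state as nsclient++ -test does locally, the remaining step is only the Nagios command definition.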
   New command line syntax!
    ◦ nscp --service --start
    ◦ nscp --service --stop
    ◦ nscp --help
   Testing
    ◦ nscp --test (is your friend!)
    ◦   nscp --settings-help
    ◦   nscp --settings --migrate-to ini
    ◦   nscp --settings --set …
    ◦   …
   Run scripts:
    ◦ nscp --client --module PythonScript --command
      execute-and-load-python --script --install
   Major simplification to the disk/file checker
    ◦ CheckFile (removed)
    ◦ CheckFile2 (deprecated)
    ◦ CheckFiles (replaces above)
   Volume support (for real this time)
   Aliases
   NSCA/NRPE enhancements
   Scheduled task checks
   Crash Handling
   A bunch of new commands
   Bug fixes and many more things…
   We have recruited a new member to the team!
   A girl actually…
   …Still a bit wet behind the ears…
   The good:
    ◦ Powerful interface!
    ◦ Simple to use!
    ◦ out-of-the-box solution!
      (on which you can expand)
   The bad:
    ◦ Nothing! Really, I mean it!
   …and then… yesterday…
    ◦ …in the bar…
    ◦ …all hopes shattered…
    ◦ …apparently it is still too complicated…
   Same as was introduced for eventlog last year
   Based on SQL WHERE clauses
    ◦   generated > -2d AND severity = 'error'
    ◦   size > 5k
    ◦   size > 5k OR size < 1k
    ◦   size > 5k AND written > -2d
    ◦   (size > 5k OR size < 1k ) AND written > -2d
    ◦   …
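To make the filter values concrete, here is a rough sketch of how literals such as 5k and -2d could be interpreted and how one of the filters above reads as plain Python. The unit tables (1k = 1024 bytes, s/m/h/d/w for time) and all function names are assumptions for illustration, not the NSClient++ implementation.

```python
import time

# Assumed unit suffixes; the real parser may accept more or different ones.
SIZE_UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
TIME_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_size(text):
    """Turn '5k' into a byte count (assumes 1k = 1024 bytes)."""
    text = text.strip().lower()
    if text and text[-1] in SIZE_UNITS:
        return int(text[:-1]) * SIZE_UNITS[text[-1]]
    return int(text)


def parse_relative_time(text, now=None):
    """Turn '-2d' into an absolute timestamp two days before 'now'."""
    now = time.time() if now is None else now
    text = text.strip().lower()
    return now + int(text[:-1]) * TIME_UNITS[text[-1]]


def matches(file_info, now=None):
    """(size > 5k OR size < 1k) AND written > -2d, spelled out in Python."""
    size, written = file_info["size"], file_info["written"]
    return ((size > parse_size("5k") or size < parse_size("1k"))
            and written > parse_relative_time("-2d", now))
```

Read this way, written > -2d means "written more recently than two days ago".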
Type         Description
filename     Name of the file
path         Path of the file
size         Size of the file
accessed     When the file was last accessed
written      When the file was last written
creation     When the file was created
version      The exe file version (slow)
line_count   Number of lines in the file (slow)
Operator   Safe   Meaning
=          eq     Equality
!=         ne     Not equal
>          gt     Greater than
<          lt     Less than
=>         ge     Greater than or equal
=<         le     Less than or equal
like              String similarity (substring matching)
not like          Opposite of like
regexp            Regular expression matching
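The operator table maps naturally onto a lookup table. A small sketch, assuming "like" is a plain substring test as the table says; the names OPERATORS and compare are made up for illustration. The "safe" spellings are presumably there because characters like < and > are awkward to pass through shells, which is an assumption on my part.

```python
import operator
import re

# Symbolic operators and their "safe" spellings share one implementation.
OPERATORS = {
    "=": operator.eq, "eq": operator.eq,
    "!=": operator.ne, "ne": operator.ne,
    ">": operator.gt, "gt": operator.gt,
    "<": operator.lt, "lt": operator.lt,
    "=>": operator.ge, "ge": operator.ge,
    "=<": operator.le, "le": operator.le,
    "like": lambda value, pattern: pattern in value,
    "not like": lambda value, pattern: pattern not in value,
    "regexp": lambda value, pattern: re.search(pattern, value) is not None,
}


def compare(value, op, expected):
    """Apply one filter operator to a value; raises KeyError if unknown."""
    return OPERATORS[op](value, expected)
```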
Option         Description
path           The root path to use
pattern        The file pattern to use
filter         Define the filter (there can only be one)
warn           How many hits constitute a warning state,
               e.g. warn=>5, warn==5, warn=!=5
crit           How many hits constitute a critical state.
truncate       Length of returned data.
               Since NRPE/NSCA has a limited capacity this is
               important. (Will be deprecated in 0.4.0)
syntax         How to format the return data
master-syntax  How to format the “message string”
debug=true     Displays a lot more information in the logfile/console
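The warn and crit options compare the number of hits against a bound such as >5, =5 or !=5. A minimal sketch of that evaluation, assuming critical takes precedence over warning; the function names are hypothetical and only the three operator forms shown in the table plus < are handled.

```python
import re

COMPARE = {
    ">": lambda hits, n: hits > n,
    "<": lambda hits, n: hits < n,
    "=": lambda hits, n: hits == n,
    "!=": lambda hits, n: hits != n,
}


def parse_bound(text):
    """Split a bound such as '>5' or '!=5' into (operator, number)."""
    match = re.fullmatch(r"(!=|>|<|=)(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unsupported bound: {text!r}")
    return match.group(1), int(match.group(2))


def status_for(hits, warn=None, crit=None):
    """Return the state name for a hit count; critical wins over warning."""
    for name, bound in (("CRITICAL", crit), ("WARNING", warn)):
        if bound is not None:
            op, number = parse_bound(bound)
            if COMPARE[op](hits, number):
                return name
    return "OK"
```

So warn=>5 is simply the option warn with the value >5.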
   CheckDriveSize … CheckAll=volumes …
   Other new features
    ◦ Added a new option to ignore drives which are not
      readable (like the Office 2010 Q: drive)
      ignore-unreadable
    ◦ Added magic modifiers (from check_mk)
      magic=0.7
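The magic modifier is borrowed from check_mk, where a factor between 0 and 1 relaxes percentage thresholds on large drives and tightens them on small ones. A minimal sketch of that scaling, following the check_mk approach as I understand it; the 20 GB reference size and the function name are assumptions, and the slides do not say whether NSClient++ applies exactly this formula.

```python
def scale_threshold(percent, size_gb, magic, normal_size_gb=20.0):
    """Scale a 'percent used' threshold by drive size.

    A drive of exactly normal_size_gb keeps its threshold; larger drives
    get a higher threshold, smaller drives a lower one. magic=1.0 turns
    the scaling off.
    """
    relative_size = size_gb / normal_size_gb   # size in units of the reference
    felt_size = relative_size ** magic         # dampened size
    scale = felt_size / relative_size
    return 100.0 - (100.0 - percent) * scale
```

With magic=0.7, an 80% warning level stays at 80% on a 20 GB drive and moves towards 95% on a 2 TB drive, so the free space that triggers the warning does not grow linearly with the drive.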
Scheduled Tasks
   Works the ”same” as CheckEventLog
    ◦ ”filter=exit_code ne 0”
   Two modules:
    ◦ CheckTaskSched.dll
      Works on Windows NT4 and beyond
      But cannot check ”new” tasks (from Vista and beyond)
    ◦ CheckTaskSched2.dll
      Works on Windows Vista and beyond
      Has fewer filter keywords
Type                   Description
title                  Task name
application            The application
comment                Retrieves the comment for the work item.
parameters             Retrieves the command-line parameters of a task.
working_directory      Retrieves the working directory of the task.
exit_code              Retrieves the last exit code returned by the executable
                       associated with the work item on its last run.
max_run_time           Retrieves the maximum length of time the task can run.
status                 Retrieves the status of the work item. Possible values include:
                       ready, running, not_scheduled, has_not_run, disabled,
                       has_more_runs, no_valid_triggers
most_recent_run_time   Retrieves the most recent time the work item began running.
     "filter=exit_code ne 0"
     "syntax=%title%: %exit_code%"
WARNING:test.job (1)

    "filter=status = 'running' AND most_recent_run_time < -30m"

WARNING:test.job (2011-02-10 23:14:35)
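The combination of filter and syntax can be pictured as "select the matching tasks, then render each one through a template". A rough sketch under that assumption; the task dictionaries and function names are invented for the example and the output format is not meant to match NSClient++ exactly.

```python
import re


def render(syntax, task):
    """Replace each %key% in the syntax string with the task's value."""
    return re.sub(r"%(\w+)%", lambda m: str(task.get(m.group(1), "")), syntax)


def check_tasks(tasks, syntax="%title%: %exit_code%"):
    """Apply the filter 'exit_code ne 0' and format the hits."""
    failed = [task for task in tasks if task["exit_code"] != 0]
    if not failed:
        return "OK", ""
    return "WARNING", ", ".join(render(syntax, task) for task in failed)
```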
   System
    ◦ alias_cpu
      CPU Load past 5 minutes, 80/90% bounds
    ◦ alias_cpu_ex
      CPU Load past 5 minutes, custom bounds
    ◦ alias_mem
      Memory utilization (all) 80/90% bounds.
    ◦ alias_mem_ex
      Memory utilization (all), custom bounds
    ◦ alias_up
      System uptime
   Disk/Drive
    ◦ alias_disk
       All fixed drives
    ◦ alias_disk_loose
       All fixed drives, ignore any problematic drives
    ◦ alias_volumes
       All volumes
    ◦ alias_volumes_loose
       All volumes, ignore any problematic drives
    ◦ alias_file_size
       Check the size of a given file (filename, size)
    ◦ alias_file_age
       Check the age of a given file
   Eventlog
    ◦ alias_event_log
      Check for errors in the event log
   Scheduled Tasks
    ◦ alias_sched_all
      No scheduled jobs have failed
    ◦ alias_sched_long
      No task has been running for longer than a given time.
    ◦ alias_sched_task
      Check if a given task succeeded
   Misc
    ◦ alias_updates
      Check that updates are applied
   Processes
    ◦ alias_service
      All services in “sensible state”
    ◦ alias_service_ex
      All services in “sensible state” (exclude various services)
    ◦ alias_process
      A process must be running
    ◦ alias_process_stopped
      A process must not be running
    ◦ alias_process_count
      A process must not have more than X instances
    ◦ alias_process_hung
      A process must not be hung
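An alias is essentially a named command line, optionally with placeholders for the "custom bounds" variants. A minimal sketch of that expansion, using Nagios-style $ARG1$ placeholders; the two definitions below are illustrative only and are not copied from the shipped configuration.

```python
# Hypothetical alias definitions, for illustration only.
ALIASES = {
    "alias_cpu": "CheckCPU warn=80 crit=90 time=5m",
    "alias_cpu_ex": "CheckCPU warn=$ARG1$ crit=$ARG2$ time=5m",
}


def expand_alias(name, args=()):
    """Return the command line behind an alias with $ARGn$ filled in."""
    command = ALIASES[name]
    for index, value in enumerate(args, start=1):
        command = command.replace(f"$ARG{index}$", str(value))
    return command
```

The point of the aliases is that the Nagios side only needs to know alias_cpu, while the bounds live on the monitored machine.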
Crash Handling
   Using Google Breakpad
    ◦ same as Google Chrome, Mozilla Firefox, etc
   Three options (not mutually exclusive)
    1. Send crash dumps to
        Server can be changed
             if you want to have an internal server or proxy server.
    2. Store crash dumps for analysis
        Will also be checked with check_nscp
    3. Restart service


Miscellaneous Fixes
   NSCA
    ◦ Fixed problems with sending ”many” results back
   NRPE
    ◦ Added support for large payloads
   Checks
    ◦ Added ”check_nscp” to check health of NSClient++
    ◦ Added new check for running other checks ”with a timeout”
    ◦ Added new negate check (to negate the result of another check)
   All filters (read CheckEventLog et al)
    ◦ Many fixes and additions (regular expressions)
   Process checks
    ◦ Added support for checking if processes have ”hung”
   Performance data
    ◦ Added it to many places where it was intermittently missing
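The negate and timeout checks both wrap another check. A sketch of the idea in Python, assuming a check is a callable returning (status, message); the wrapper names and the default of swapping OK and CRITICAL are assumptions, not the NSClient++ behaviour.

```python
import concurrent.futures
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def negate(check, swap=((OK, CRITICAL),)):
    """Wrap a check so that each pair of states in 'swap' is exchanged."""
    mapping = {}
    for first, second in swap:
        mapping[first], mapping[second] = second, first

    def wrapped(*args, **kwargs):
        status, message = check(*args, **kwargs)
        return mapping.get(status, status), message
    return wrapped


def with_timeout(check, seconds):
    """Wrap a check so that it reports UNKNOWN if it takes too long."""
    def wrapped(*args, **kwargs):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(check, *args, **kwargs)
        try:
            return future.result(timeout=seconds)
        except concurrent.futures.TimeoutError:
            return UNKNOWN, f"check timed out after {seconds}s"
        finally:
            # The worker thread is abandoned, not killed.
            pool.shutdown(wait=False)
    return wrapped


def slow_check():
    """Stand-in for a check that hangs for a while."""
    time.sleep(0.5)
    return OK, "finished late"
```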
What's to come?
   0.3.9
    ◦ Last 0.3.x
   0.4.0
    ◦ Core switch
    ◦ Linux support
    ◦ Distributed
   0.4.1
    ◦ Bugfixes
   0.4.2
    ◦ Monitoring Kits
    ◦ New windows subsystem
    ◦ True passive checks
    ◦ Distributed Monitoring (v2)
    ◦ Bugfixes
   Brand new core based upon libraries
    ◦ Things should *work* not just “work”
    ◦ More modular and extensible
   Unix support
    ◦ Both as a client and server
   New settings subsystem
    ◦ Registry, improved ini support, http, etc
   New protocol
    ◦ NSCP (HTTP(s), MQ, Native)
   Distributed monitoring
    ◦ Many new things in this area (including MQ)
   Python scripting
    ◦ Primary goal (for me) is to create “unit tests”
   Updated installer
    ◦ Wix 3.5, more customizable
   “Monitoring Kits”
    ◦ Monitoring solutions for “standard things”
   New windows check subsystem
    ◦ More modern and less arcane (no NT4 support)
    ◦ Remote checking
   .Net plugin support
    ◦ Possibly internal VBA scripting support
   Metrics cache and aggregation
    ◦ Lightweight version of CEP
    ◦ “crit=cpu > 80% AND transactions_per_sec < 10”
   Filter-like API (in addition to options)
    ◦ “warn=any drive > 90% OR c: > 80%”
   Remote updates/upgrades
    ◦ Allow NSCP to upgrade itself
   “port” of the “standard plugins”?
    ◦ Run your favorite check_xxx from inside NSClient++
   Unix plugins?
    ◦ Run CheckCPU on unix machines?
   Client/web Interface?
    ◦ A nice little program (systray)

   Let me know what you would like to see!
Brand new core
   This is why it was so long in the making
    ◦ Merging each new version took forever!
   New internal protocol
    ◦   Removed all internal “limits” (think buffer sizes)
    ◦   Allows many new features
    ◦   Allows much more advanced internal scripts
    ◦   Allows for “non NRPE based checks”
   A lot of new bugs?
    ◦ This is the scary part (for me)
         but my testing has shown it to be very stable
Unix support
   Good question…
    ◦ Since no one seems to like to program on Windows
      I brought NSClient++ to “unix” 
    ◦ Because I can
      With the new core comes portability
      So, perhaps the better question was:
        Why not?
   Will NOT be supported for some time though
    ◦ Unless someone wants to help out
New Settings
   Hierarchical settings subsystem
    ◦ [/settings/NRPE/server]
    ◦ allow arguments=false
   Instead of
    ◦ [NRPE Server]
    ◦ allow_arguments=false
   Why did I do this?
    ◦ Because it was fun 
    ◦ Number of options has started to explode
    ◦ Simpler to use the registry (as well as xml?)
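Because the new layout is still INI syntax with a path as the section name, it can be read with ordinary INI tooling. A small sketch using Python's configparser, which is a generic reader and not the NSClient++ settings code; the helper names are made up.

```python
import configparser

SAMPLE = """
[/settings/NRPE/server]
allow arguments=false
"""


def load_settings(text):
    """Parse INI text whose section names are hierarchical paths."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def get_bool(parser, path, key, default=False):
    """Look up a boolean by settings path and key, with a fallback."""
    return parser.getboolean(path, key, fallback=default)
```

The path as section name is what makes the same key addressable in an ini file, the registry or over http.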
   Settings have “URLs”:
    ◦   old://${exe-path}/nsc.ini
    ◦   ini://${base-path}/nsclient.ini
    ◦   registry://HKEY_LOCAL_MACHINE/software/NSClient++
    ◦   http://my.central.server/config/${hostname}.ini
   Allows extensions (not via plugins though)
    ◦ Maybe in the future:
         lua://${base-path}/config.lua
         python://${base-path}/
   You can mix and match:
    ◦ ini://${base-path}/nsclient.ini
           Can “include”:
           registry://HKEY_LOCAL_MACHINE/software/NSClient++
           Which in turn includes
           http://conf.server/${hostname}.conf
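A settings URL consists of a store type and a location, with ${...} variables expanded first. A minimal sketch of that split, assuming variables are simple name lookups; the function names and the example path C:/nscp are hypothetical.

```python
import re


def expand(url, variables):
    """Replace each ${name} with its value; unknown names are left as-is."""
    return re.sub(r"\$\{([^}]+)\}",
                  lambda m: variables.get(m.group(1), m.group(0)), url)


def split_settings_url(url):
    """Split 'ini://path' into the store type and the location."""
    scheme, separator, location = url.partition("://")
    if not separator:
        raise ValueError(f"not a settings URL: {url!r}")
    return scheme, location


if __name__ == "__main__":
    url = expand("ini://${base-path}/nsclient.ini", {"base-path": "C:/nscp"})
    print(split_settings_url(url))
```

A loader would pick the backend (old, ini, registry, http) from the first element and follow any includes it finds there.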
   Ability to load the same plugin twice.
   Normal (default alias is python)
    ◦   [/modules]
    ◦   PythonScript=
    ◦   [/settings/python/scripts]
   Multiple modules (define two aliases foo and bar)
    ◦   [/modules]
    ◦   foo=PythonScript
    ◦   bar=PythonScript
    ◦   [/settings/foo/scripts]
    ◦   [/settings/bar/scripts]
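The two [/modules] forms can be told apart by whether the value is empty. A rough sketch of that reading, again with Python's generic configparser rather than the NSClient++ loader; the function name is invented, and a missing alias is returned as None because the default alias is chosen by the module itself.

```python
import configparser


def module_instances(ini_text):
    """Return (module, alias) pairs from a [/modules] section."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the case of module names
    parser.read_string(ini_text)
    instances = []
    for key, value in parser["/modules"].items():
        if value:
            # "alias=Module": load Module under the given alias.
            instances.append((value, key))
        else:
            # "Module=": load Module under its default alias.
            instances.append((key, None))
    return instances
```

Each alias then gets its own settings section, for example /settings/foo/scripts and /settings/bar/scripts.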
   It depends…
    ◦ If you are “still” using check_nt:
      Probably not
    ◦ If you are using NSCA:
      Maybe not
    ◦ If you want to use all new features
      Yes
   How do I change?
    ◦ It is pretty simple…
      nscp --settings --migrate-to ini
    ◦ (or)
      nscp --settings --migrate-to registry
New protocol

   [Figure: Windows Computer (CPU, Disk, Mem, ...) and Nagios Server, with a separate Fork for every check on both sides]
   [Figure: Windows Computer (CPU) and Nagios Server (check_nscp)]

   Allows more than one command to be sent
   Used internally for plugins
   Supports both passive and active checks
   Supports configuration, management, etc…
   Extensible

   But will also support:
    ◦ Multiple locales (based on utf)
    ◦ Unlimited payloads (soft configurable)
    ◦ Real performance data (not strings)
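To illustrate what performance data that is not a string could look like, here is a sketch that parses the classic Nagios plugin perfdata text (label=value[unit];warn;crit) into typed records. The Metric type and function names are made up, quoted labels containing spaces are not handled, and the slides do not describe how NSCP itself represents the data.

```python
import re
from typing import NamedTuple, Optional


class Metric(NamedTuple):
    label: str
    value: float
    unit: str
    warn: Optional[float]
    crit: Optional[float]


_ITEM = re.compile(
    r"^(?P<label>[^=]+)=(?P<value>-?\d+(?:\.\d+)?)(?P<unit>[^;]*)"
    r"(?:;(?P<warn>[^;]*))?(?:;(?P<crit>[^;]*))?"
)


def _number(text):
    """Convert a threshold to float; ranges and blanks become None."""
    try:
        return float(text) if text else None
    except ValueError:
        return None


def parse_perfdata(text):
    """Parse space separated perfdata items into Metric records."""
    metrics = []
    for item in text.split():
        match = _ITEM.match(item)
        if match:
            metrics.append(Metric(match["label"].strip("'"),
                                  float(match["value"]), match["unit"],
                                  _number(match["warn"]),
                                  _number(match["crit"])))
    return metrics
```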
Distributed monitoring
   [Figure: distributed monitoring. Inputs (NSCA, Real time, NRPE Server, Check) pass through brokers to commands (CheckCPU) and to agents (XXX Agent, NSCA Agent, SYSLOG Agent), which forward to their servers (XXX Server, NSCA Server, SysLog Server)]
   An extension of the passive checks
    ◦   ”Something” can send notification events
    ◦   ”Something” can receive notification events
    ◦   Agents can forward notification events
    ◦   Replaces NSCAListener module
   Supports routing
   Not a one-to-one mapping.
    ◦ Multiple consumers
    ◦ Multiple producers
   Allows
    ◦ Passive plugins (other than the built-in NSCA)
    ◦ Script and rule based routing
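The notification model above (anything can send, anything can receive, agents can forward, no one-to-one mapping) can be sketched as a tiny in-process broker. This is only a toy illustration of the concept; the class, the channel names and the forward helper are all hypothetical and unrelated to the actual NSClient++ code.

```python
from collections import defaultdict


class Broker:
    """Minimal broker: many producers, many consumers per channel."""

    def __init__(self):
        self._consumers = defaultdict(list)

    def subscribe(self, channel, consumer):
        """Register a callable that receives every event sent to 'channel'."""
        self._consumers[channel].append(consumer)

    def publish(self, channel, event):
        """Deliver an event to all consumers; return how many received it."""
        consumers = list(self._consumers.get(channel, ()))
        for consumer in consumers:
            consumer(event)
        return len(consumers)


def forward(broker, source, target, rule=lambda event: True):
    """Rule based routing: re-publish matching events from source to target."""
    def agent(event):
        if rule(event):
            broker.publish(target, event)
    broker.subscribe(source, agent)
```

For example, every check result could be forwarded to an "nsca" channel while only failures are also routed to a "syslog" channel.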
Python scripting
   Built-in python scripting
   Has full API support
    ◦ Can build ”modules” in python
    ◦ Can access settings
    ◦ Can do “anything”
   Primarily used by me for unit-testing
   Requires a working python install
Le Roi est mort, vive le Roi!
   0.4.x (ish) will be the last ”Windows”
    monitoring agent
   The idea is to make it more:
    ◦ A platform/client/server for distributed monitoring
      Regardless of os/system
      Regardless of Monitoring solutions
   Don’t worry…
    ◦ It will still work just fine as a ”Windows Monitoring” agent
    ◦ But in addition to this you will be able to do more.

