Best Practices on Campus Network Monitoring


    Ljubljana, October 20 2011
    Vidar Faltinsen, UNINETT
"We are convinced that most campuses do not
  take the task of measuring and understanding
  their traffic flows sufficiently seriously"

                   - SERENATE, December 2003
Our context

   The network is complex
       A lot of equipment
       Heaps of traffic around the clock

   No system is perfect
       Errors will occur – incidents will hit us
   Motto: be proactive and ahead
       The user should not call you – you should be the
        first to know!
   Keep in mind: If information is good…
     (posted at the right time, kept up to date)…
    …the user is (more) patient!
[Figure: system overview "…at a glance". Labels recoverable from the diagram: machine and user trackers, Report Generator, Device History, IP Device Info, L2 traceroute, Network Explorer, detain machines, Traffic Maps (geo and topo), NAVdb, statistics, switch ports, Cricket, SNMP, RRD, the network, alert profiles, Status Monitor, Module Monitor, Service Monitor, Threshold Monitor, external SNMP trap or email, Event Engine, Alert Engine, SMS.]

Do the most important first

For all your equipment:
  1.   Ping

  2.   If down
         => send sms
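
The two steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the tooling used in the talk: it assumes a Linux-style `ping` binary, and `send_sms` is a hypothetical placeholder for whatever SMS gateway you have.

```python
import subprocess
from typing import Callable, Iterable, List


def ping(host: str, timeout_s: int = 2) -> bool:
    """Return True if one ICMP echo request to `host` is answered.

    Assumes a Linux-style `ping` binary (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def send_sms(message: str) -> None:
    """Hypothetical placeholder: hand the message to your SMS gateway here."""
    print(f"SMS: {message}")


def check_all(
    hosts: Iterable[str],
    is_up: Callable[[str], bool] = ping,
    alert: Callable[[str], None] = send_sms,
) -> List[str]:
    """Check every host, alert once per unreachable host, return the down list."""
    down = [host for host in hosts if not is_up(host)]
    for host in down:
        alert(f"{host} is down")
    return down
```

Run from cron every minute or so, `check_all([...])` with your equipment list already covers the most important case. The `is_up` and `alert` parameters exist only so the logic can be tried without touching the network.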
Without numbers you are nothing

     When an incident occurs – do you have enough data to
      investigate – and actually pinpoint the cause?
     Disk is cheap
     Collect heaps of statistical data
     Have a scheme for compressing data as time goes by
             (the RRD method)
     Focus on good search tools, reports and visualisation
      methods to make traffic/statistical anomalies easy to detect
         Isolation and classification of an error tends to
          consume most of the recovery time
     Autodetection of thresholds and more complex anomaly
      detection is even better
         Remember to moderate the total flow of alarms
          (classify alarms)
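
Two of the ideas above, sketched in Python. `consolidate` imitates the averaging idea behind round-robin archives (older data kept at lower resolution); `auto_threshold` derives an alarm level from history. The "mean plus k standard deviations" rule is an assumption chosen for illustration, not a method named in this talk, and real RRD tooling offers more consolidation functions than the plain average shown here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def consolidate(samples: Sequence[float], factor: int) -> List[float]:
    """Average each run of `factor` consecutive samples into one value.

    A trailing partial run is averaged over the samples it actually contains.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [
        mean(samples[i:i + factor])
        for i in range(0, len(samples), factor)
    ]


def auto_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Derive an alarm threshold as mean + k standard deviations of history."""
    return mean(history) + k * pstdev(history)
```

For example, twelve 5-minute samples consolidated with `factor=12` become one hourly average, so a year of data costs a fraction of the disk that full resolution would.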
Logs are gold, scripts as well

   Log, log, log
        Syslog is also a management system 
   Small (shell) scripts can be gold
        A good idea can be only a few code lines
        A culture that motivates creativity and allows
         continuous implementation of new
         scripts/add-ons will step by step improve
         the overall management process!
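
As an illustration of "only a few code lines", here is a sketch that counts syslog messages per host for a given keyword. The line layout in `LINE_RE` (traditional "Mon DD HH:MM:SS host message") is an assumption; adjust it to whatever your syslog server actually writes.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed line layout: "Oct 20 10:15:01 hostname program[pid]: message"
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+(?P<msg>.*)$")


def count_by_host(lines: Iterable[str], keyword: str) -> Counter:
    """Count log lines per host whose message contains `keyword`."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        # Lines that do not fit the assumed layout are silently skipped.
        if match and keyword.lower() in match.group("msg").lower():
            counts[match.group("host")] += 1
    return counts
```

Fed with an open log file, `count_by_host(f, "link down").most_common(5)` gives a quick top-five list of the devices reporting the most link failures.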
Avoid a monolithic NMS

   Not an absolute rule, but be a sceptic

   If the system is too massive it tends to set the agenda.
       You should shape the system, not the other way around
       If too many resources must be invested in
         understanding the system…
       …then even more resources must be put into
         adapting the system to your needs

   The NMS has no intrinsic value…
        …it should be a useful tool for you

   But remember that nothing is free – in any case you must
    invest in understanding what your tools actually do
FCAPS – so many tasks to cover

      Fault management
      Configuration management

      Accounting management
      Performance management
      Security management
Not one tool - a set of tools
   Special-purpose tools with limited scope are good
        Example of tool categories:
              inventory systems
              trouble ticket systems
              status monitors
              measurements (and threshold monitors)
              server/services focused
              netflow analysis
              security-focused
              configuration tools
              simulation
   Tools should (ideally) not overlap
   Have a well-defined single authority as the source for your data sets, i.e.:
        the set of equipment (with attributes) we manage is defined in
         one place
        similarly for our locations (with attributes), etc, etc
   Autodetection is good
        But in a controlled environment (be aware of weak SNMPv2 security)
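
The "single authority" idea can be sketched as one inventory definition from which every tool derives its own view. All names, attributes and addresses below are hypothetical (the addresses are from the documentation range); in practice the single source would typically be a database rather than a Python list.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Device:
    name: str       # device name, following your naming standard
    address: str    # management IP address
    category: str   # hypothetical attribute, e.g. "switch" or "router"
    location: str   # hypothetical attribute, key into a locations table


# The one place where managed equipment is defined (hypothetical entries).
INVENTORY: List[Device] = [
    Device("mtfs-272-sw", "192.0.2.10", "switch", "mtfs"),
    Device("core-gw", "192.0.2.1", "router", "core"),
]


def ping_targets(inventory: List[Device]) -> List[str]:
    """View for the status monitor: every managed address."""
    return [d.address for d in inventory]


def by_location(inventory: List[Device]) -> Dict[str, List[str]]:
    """View for reports: device names grouped by location."""
    groups: Dict[str, List[str]] = {}
    for d in inventory:
        groups.setdefault(d.location, []).append(d.name)
    return groups
```

The point of the design is that adding a device means editing one place only; the status monitor and the reports pick it up without further changes.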

GigaCampus tool boxes
Managing 30 campus networks around Norway

    The tool boxes are servers containing a number
     of management tools:
       NAV: Proactive network management
       nfsen: Netflow traffic analysis     

       Stager: Netflow and Qflow
       Hobbit: Service monitoring
       TFTP server, syslog server, RADIUS server

    The tool boxes are placed on campus and used
     by the local IT staff.

    Management, tool enhancements, software
     upgrades, etc. are done by UNINETT.

    Free training in tool usage is given.  

NAV – Network Administration Visualized
   Network management system
    developed by UNINETT and NTNU
    since 1999.

Key features
  Inventory information with topology
         topology autodetected
         L3, L2, per VLAN
   Status monitor with alarm system
         sms and email alarms
   Client machine tracking
         based on ARP and bridge table data
   Client machine detention
   Statistics and graphing

  Free software – GPLv2
  Debian package ++
  Virtual appliance available

The service monitor Hobbit


    Agent on servers that reports on the ”local” status
    Monitors CPU load, disk usage, memory, processes
     running and whatever you script 
    Servers are organized in groups. Alarms are shown
     on a per-group basis.
    Drill down to details of when an alarm occurred and
     the reported reason
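
A sketch of the kind of local check an agent performs, here for disk usage. It only computes a green/yellow/red status; how the result is reported to the Hobbit server is deliberately left out, and the 90 % and 95 % limits are assumptions for illustration.

```python
import shutil
from typing import Tuple


def classify(used_pct: float, warn: float = 90.0, crit: float = 95.0) -> str:
    """Map a usage percentage to a green/yellow/red status colour."""
    if used_pct >= crit:
        return "red"
    if used_pct >= warn:
        return "yellow"
    return "green"


def disk_status(path: str = "/") -> Tuple[str, float]:
    """Return (colour, percent used) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    return classify(used_pct), used_pct
```

The same `classify` helper works for any percentage-style measurement, which is what makes "whatever you script" cheap to add.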
Use a single event/alarm system

Place your monitor strategically

   A monitor placed in the periphery of your
    network is more likely to be cut off
       place in a central (network-wise) location

       redundant network access (VRRP,
   Redundant power, incl. a redundant power
    source (UPS, ideally a standby generator)
   Monitor the monitor!
   Use SMS for alarms in addition to email
       Place the SMS sending device physically
        connected to the NMS
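
"Monitor the monitor" can be as simple as a heartbeat check run from a second, independent machine. The sketch below assumes the NMS records a Unix timestamp somewhere each time it completes a polling round; where that timestamp is stored and the 300-second limit are both assumptions.

```python
import time
from typing import Optional


def heartbeat_is_fresh(
    last_beat: float,
    max_age_s: float = 300.0,
    now: Optional[float] = None,
) -> bool:
    """Return True if the monitor's last heartbeat is recent enough.

    `last_beat` and `now` are Unix timestamps; `now` defaults to the
    current time and is only a parameter to make the check testable.
    """
    if now is None:
        now = time.time()
    return (now - last_beat) <= max_age_s
```

If the check returns False, the watching machine should raise the alarm through a channel that does not depend on the NMS itself.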
Adopt good naming standards

   Do not underestimate the value of sound
    names for your equipment, rooms and
    locations

   The name of the device should in itself give
    an idea of what the device is (does) and
    where it is placed
       Example: mtfs-272-sw
        (a switch in area ”mtfs”, wiring closet
   Also use a thought-through naming standard
    for router interfaces and switch ports
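
A naming standard pays off most when tools can check and parse it. The sketch below assumes names of the form area, location identifier, type suffix, as in "mtfs-272-sw". The meaning given to the middle field and every suffix other than "sw" are assumptions for illustration.

```python
import re
from typing import Dict, Optional

# Assumed pattern: <area>-<location id>-<type>, e.g. "mtfs-272-sw".
NAME_RE = re.compile(
    r"^(?P<area>[a-z0-9]+)-(?P<location>[a-z0-9]+)-(?P<kind>[a-z]+)$"
)

# Hypothetical type suffixes; only "sw" appears in the example above.
KINDS = {"sw": "switch", "gw": "router"}


def parse_name(name: str) -> Optional[Dict[str, str]]:
    """Split a device name into its parts, or return None if it does not conform."""
    match = NAME_RE.match(name)
    if match is None or match.group("kind") not in KINDS:
        return None
    parts = match.groupdict()
    parts["kind"] = KINDS[parts["kind"]]
    return parts
```

Running such a check over the whole inventory quickly reveals which devices were named outside the standard.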
NMS Security

   Restrict access to NMS to authorized crew
       both network access and physical access

   Isolate the management IP addresses of switches
    and base stations on dedicated subnets
   Firmly restrict SNMP access to the network
    equipment – only from the NMS(es).
       remember that SNMPv2 security is weak
   Be even more restrictive if you allow/use
    SNMP Write
       consider SNMPv3 or NETCONF
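
The real enforcement of "only from the NMS(es)" belongs in the access lists on the equipment itself. For auditing logs or planned rules, the decision can be sketched with Python's standard `ipaddress` module; the addresses used in the note below are documentation-range examples, not real NMS addresses.

```python
import ipaddress
from typing import Iterable


def snmp_allowed(source: str, nms_addresses: Iterable[str]) -> bool:
    """Return True only if `source` is one of the listed NMS hosts or networks."""
    addr = ipaddress.ip_address(source)
    # A bare host address is treated as a single-address network.
    return any(
        addr in ipaddress.ip_network(entry, strict=False)
        for entry in nms_addresses
    )
```

With `nms = ["192.0.2.5", "2001:db8::/64"]`, a request from 192.0.2.5 is accepted and one from 192.0.2.6 is refused. Anything not explicitly listed is denied, which is the posture the slide asks for.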
MIB requirements for your network equipment
   Your network equipment should support:

        RFC 3418: SNMPv2-MIB (system)
        RFC 2863: IF-MIB (interfaces, incl. 64-bit counters)
        RFC 4293: IP-MIB (IP interfaces and ARP; IPv4 and IPv6)
        RFC 4133: ENTITY-MIB (modules, optics, software, serial numbers)
              Not supported by Juniper
        RFC 4188: BRIDGE-MIB (bridge table)
        RFC 4363: Q-BRIDGE-MIB (bridge table per VLAN, VLAN config)
              Not supported by Cisco
        RFC 3635: EtherLike-MIB (duplex)
        RFC 2668: MAU-MIB (medium)
              equipment support seems scarce (HP has support)

   Your NMS should whenever possible use standard/IETF MIBs rather
    than vendor-proprietary MIBs
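
Why the 64-bit counters in IF-MIB matter: a 32-bit octet counter wraps after 2^32 octets, which on a fast link happens within a single polling interval. A small worked calculation, assuming the link runs at full line rate:

```python
def wrap_seconds(link_bps: float, counter_bits: int = 32) -> float:
    """Seconds until an octet counter of `counter_bits` bits wraps at line rate."""
    max_octets = 2 ** counter_bits
    return max_octets * 8 / link_bps
```

At 1 Gbit/s a 32-bit counter wraps in about 34 seconds, and at 10 Gbit/s in about 3.4 seconds, so polling every few minutes cannot give reliable traffic figures. A 64-bit counter at the same rates takes centuries to wrap.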
Key points – in summary

   Be proactive
   Detect important alarms early
   Inform the users
   Log, log, log (snmp collect)
   Use a number of tools
   Adopt good naming standards
   Value the engineer – small scripts are gold
   Educate your crew!
    (in both NMS operations and procedures)
