NTP monitoring

					NTPMON



1. Introduction

All passive and active monitoring systems require an accurate local clock. Knowing the exact
system time is essential mainly for timestamping captured packets and for time-sensitive
applications such as one-way delay measurement. The expected absolute accuracy (the difference
between local time and UTC) ranges from 10⁻³ s to 10⁻⁵ s.

The most common clock synchronization method in networking environments is NTP
[1], optionally with a GPS receiver as an external time source. The NTP process should be
monitored; otherwise we have no evidence that any particular time-dependent measurement is
correct.

Universal tools for monitoring network services exist (e.g., Nagios [2]); however, they
test only the availability of the NTP service and cannot examine its details. Another tool is NTP
Time Server Monitor [3], but it is designed mainly for local NTP monitoring. We were looking for a
centralized system that can monitor many external NTP sites, and we decided to develop
such a system and integrate it into our network monitoring infrastructure.

This document describes NTPMON, a centralized NTP monitoring system that checks
parameters of NTP processes running on remote workstations, collects the data into a database,
plots graphs and generates events when a parameter rises above or falls below a specified threshold.
NTPMON can even monitor sites administered by another authority, as it does not
need any nonstandard cooperation from the monitored site.

2. Data collection

There is no universal method for obtaining all important parameters of the NTP process.
Some data are logged by the NTP process; others are available via system functions (e.g.,
adjtimex(), ntp_adjtime()) and tools (e.g., ntpq). As our goal was to implement a centralized
system without any new software running on the monitored site, we decided to omit log
mining and all locally running tools.

We designed and programmed agents that collect the parameters. Each monitoring agent runs
and saves its data independently.

        NTP process polling

The ntpq agent periodically queries the status of the remote NTP process with the command
“ntpq -c rl”. The returned parameters are parsed and inserted into the database.

The status of each NTP process is described by a set of qualitative and quantitative
parameters. NTPMON displays a selected subset of these parameters:
              o stratum - the “synchronization distance” to the primary NTP server. A primary
                NTP server (i.e., an NTP server with an external clock) has stratum 1,

              o time offset - the difference between the NTP server time and the local time,

              o frequency offset - the correction factor of the local clock frequency. It is expressed
                as a relative, unitless value in ppm (parts per million). The frequency offset
                value itself is less important than its variation, which reflects changes of the
                oscillator frequency,

              o root dispersion - the maximal difference between the local time and the root
                (primary) NTP server time. Its calculation assumes the
                worst possible oscillator (in)stability.
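
The parsing step of the ntpq agent can be illustrated with a short sketch. The real agent is written in C; the Python below only shows the idea. It assumes that the reply is a comma-separated list of name=value pairs, and the sample text is made up for illustration rather than captured from a real host. The names SAMPLE_REPLY, parse_variables and query_host are hypothetical.

```python
import re
import subprocess

# Made-up reply in the "name=value, name=value" style assumed here;
# the values are illustrative only.
SAMPLE_REPLY = '''associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.8p15", processor="x86_64", system="Linux", leap=00,
stratum=2, precision=-24, rootdelay=1.234, rootdisp=12.345,
refid=192.0.2.1, offset=0.123, frequency=-12.345'''

# A value is either a quoted string or a run of characters up to the next
# comma or whitespace.
PAIR = re.compile(r'(\w+)=("[^"]*"|[^,\s]+)')


def parse_variables(text):
    """Return the name=value pairs found in text as a dict of strings."""
    return {name: value.strip('"') for name, value in PAIR.findall(text)}


def query_host(host):
    """Run "ntpq -c rl" against a remote host and parse the reply."""
    reply = subprocess.run(["ntpq", "-c", "rl", host],
                           capture_output=True, text=True, check=True)
    return parse_variables(reply.stdout)


if __name__ == "__main__":
    variables = parse_variables(SAMPLE_REPLY)
    for name in ("stratum", "offset", "frequency", "rootdisp"):
        print(name, variables[name])
```

Keeping the values as strings at this stage leaves the choice of units and integer conversion to the code that writes to the database.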


        NTP client

In principle, the clie agent compares its local time with the time of the observed system; it therefore
has to run on a computer with a very accurate and stable clock. We treat this clock as the
reference clock and call it REF.

The clie agent behaves like an NTP client: it sends an NTP query to the monitored (remote)
NTP process. From the response, the agent calculates and stores in the database:

              o measured time offset (θC) – the time difference between REF time and the
                monitored computer's time,

              o measured delay (δC) – the round-trip propagation delay of the NTP query and response.

As a side effect, the agent also checks that the remote NTP service is available.

Let us assume that the REF clock uncertainty is negligible. Then the real time offset θ of the remote
NTP clock is bounded by:

                θC − δC/2 ≤ θ ≤ θC + δC/2
where θC is the time offset calculated by clie and δC is the round-trip delay between REF and the remote
clock.
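
As a worked illustration, the sketch below computes the measured offset and delay from the four timestamps of one NTP exchange, using the standard NTP formulas [1], and then the interval θC − δC/2 to θC + δC/2 that contains the real offset. It is not the clie agent's code (the agent is written in C). The function names and the sign convention (a positive offset means the remote clock is ahead of REF) are assumptions made for this example.

```python
def offset_and_delay(t1, t2, t3, t4):
    """Offset and round-trip delay of one NTP exchange.

    t1: query sent (REF clock)       t2: query received (remote clock)
    t3: response sent (remote clock) t4: response received (REF clock)
    All four values must use the same unit, e.g. microseconds.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def offset_bounds(offset, delay):
    """Interval that contains the real offset if REF is exact."""
    return offset - delay / 2, offset + delay / 2


if __name__ == "__main__":
    # Made-up timestamps in microseconds.
    theta_c, delta_c = offset_and_delay(0, 12000, 13000, 21000)
    low, high = offset_bounds(theta_c, delta_c)
    print("offset", theta_c, "delay", delta_c, "bounds", low, high)
```

The interval is wide when the network path is long or asymmetric, which is why a short delay between REF and the monitored site matters for the quality of the measured offset.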


        SNMP client

NTP version 4 is going to support SNMP; unfortunately, this support has been neither standardized by
the IETF nor implemented yet. In the future, when NTP v4 is widely deployed, we plan
to program an snmp agent, which will probably replace the ntpq agent.


3. Database
NTPMON uses two databases: MySQL and RRD (round-robin database).

      SQL database

Agents store all collected data in the MySQL database. They also check specified
parameters and compare them with either a threshold or the previous value. Whenever a
limit is exceeded, the agent generates an event and stores it in the database.

We decided to avoid floating-point types and therefore restricted field types to CHAR
(fixed-length text) and INT (integer value). We chose the parameter units accordingly:

            o timestamps - all timestamps have a resolution of 1 s and are expressed as an
              unsigned integer, the number of seconds since 0:00:00 UTC on 1.1.1970,

            o time offset, delay, dispersion - expressed in microseconds as a signed
              integer,

            o frequency offset - expressed in ppb – parts per billion (i.e., 10⁻⁹ or ns/s) as a
              signed integer.
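
A minimal sketch of the conversion to integer units follows. It assumes that the values arrive as decimal strings, with time values in milliseconds and the frequency offset in ppm; both the input units and the function names are assumptions of this example. Decimal arithmetic is used so that the conversion itself introduces no binary floating-point rounding.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def _scale_to_int(text, factor):
    """Multiply a decimal string by factor and round to an integer."""
    scaled = Decimal(text) * factor
    return int(scaled.to_integral_value(rounding=ROUND_HALF_EVEN))


def ms_to_us(text):
    """Milliseconds (decimal string) to integer microseconds."""
    return _scale_to_int(text, 1000)


def ppm_to_ppb(text):
    """Parts per million (decimal string) to integer parts per billion."""
    return _scale_to_int(text, 1000)


if __name__ == "__main__":
    print(ms_to_us("0.123"))      # time offset
    print(ms_to_us("12.345"))     # root dispersion
    print(ppm_to_ppb("-12.345"))  # frequency offset
```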

      RRD database

NTPMON displays several types of parameters in graphs. All such parameters are stored in
the RRD, as it implements two useful features: graph plotting and aggregation of old data that
corresponds to the intervals displayed by the daily, weekly and monthly graphs. The database contains
individual values and the average, minimum and maximum for every 10 minutes, 1 hour and 6
hours.
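
The aggregation of old data can be pictured with the following sketch, which groups samples into fixed-length intervals and keeps the average, minimum and maximum of each interval. It only illustrates the idea; in NTPMON the aggregation is done by the RRD itself, and the function name and sample values below are made up.

```python
from collections import defaultdict


def consolidate(samples, step):
    """Aggregate (timestamp, value) samples into intervals of step seconds.

    Returns a dict that maps the start of each interval to a tuple
    (average, minimum, maximum) of the values that fall into it.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp - timestamp % step].append(value)
    return {start: (sum(values) / len(values), min(values), max(values))
            for start, values in sorted(buckets.items())}


if __name__ == "__main__":
    # Made-up offsets in microseconds, one sample every 5 minutes.
    samples = [(0, 10), (300, 20), (600, 40), (900, 60)]
    for step in (600, 3600, 21600):  # 10 minutes, 1 hour, 6 hours
        print(step, consolidate(samples, step))
```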

Each monitored site has its own RRD database, which is split into two parts so that the
agents do not interfere with each other: time offset, frequency offset and dispersion are collected and stored
by the ntpq agent, while measured time offset and measured delay are collected and stored by the clie
agent.


4. Events

Agents check the values of collected parameters in real time and generate static events (i.e., the
value exceeds a threshold) or dynamic events (i.e., the value changes too rapidly). The set of
events and thresholds has been selected according to our long-term experience with NTP,
and we expect to keep updating the heuristic algorithm that generates events.
Currently, we recognize the following 11 types of events, which belong to 3 groups:

      availability
          o no system response – the observed system did not reply within one minute,
          o no NTP service – the observed system did not answer with a valid NTP
               message,
          o system restart,
      qualitative parameters change
          o OS version - the OS has changed recently,
          o NTP version - the NTP version has changed recently,
            o stratum – the stratum level has changed,
            o REFID – the ID of the reference NTP server has changed,
            o PPS signal.

       threshold exceeded
            o offset - the measured offset exceeded 50 μs (Stratum-1 server) or 1 ms
               (Stratum-2 and higher),
            o delay – the measured delay (round-trip time) between the monitored site and
               the reference site exceeded 20 ms,
            o frequency stability.
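
The threshold checks listed above can be sketched as follows. The limits of 50 μs, 1 ms and 20 ms come from the list; everything else is an assumption of this example: the function names, the comparison of the absolute value of the offset, and the max_step parameter of the dynamic check, whose real value is not given in this document.

```python
OFFSET_LIMIT_STRATUM1_US = 50   # 50 microseconds for Stratum-1 servers
OFFSET_LIMIT_OTHER_US = 1000    # 1 ms for Stratum-2 and higher
DELAY_LIMIT_US = 20000          # 20 ms round-trip time


def static_events(stratum, offset_us, delay_us):
    """Return the names of the thresholds exceeded by one measurement."""
    limit = OFFSET_LIMIT_STRATUM1_US if stratum == 1 else OFFSET_LIMIT_OTHER_US
    events = []
    if abs(offset_us) > limit:  # assumption: the sign of the offset is ignored
        events.append("offset")
    if delay_us > DELAY_LIMIT_US:
        events.append("delay")
    return events


def changed_too_rapidly(previous, current, max_step):
    """Dynamic check: max_step is a hypothetical tuning parameter."""
    return abs(current - previous) > max_step


if __name__ == "__main__":
    print(static_events(1, -72, 850))     # offset event on a Stratum-1 server
    print(static_events(2, -72, 25000))   # delay event on a Stratum-2 server
    print(changed_too_rapidly(-12000, -11950, 20))
```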

NTPMON aggregates events in order to reduce the number of older, less
important events. Aggregation is done in two steps, weekly and monthly. It
includes deleting warnings and assigning coarser time intervals to events.


5. Graphs

NTPMON generates graphs of the following parameters for intervals of 6 hours, one day, one
week and one month:

       time offset – time offset reported by the NTP process.
        Predefined range is (-50 μs : +50 μs).

       frequency offset - correction factor of local clock frequency.
        Predefined range is (AVR – 1 ppm : AVR + 1 ppm), where AVR is the average
        frequency offset.

       root dispersion - maximal difference between local clock and the root (primary) NTP
        server.
        Predefined range is (0 ms : +5 ms).

       measured time offset – time offset measured by the clie agent.
        Predefined range is (-50 μs : +50 μs).

       measured delay – round-trip time spent by NTP protocol packets between monitored
        and reference sites.
        Predefined range is (0 ms : +5 ms).

All graphs can be plotted with either of two Y-axis ranges: the predefined one and the
dynamically adjusted one. The predefined range is suitable for quick comparison of several graphs, but
it does not show values that exceed its limits. The dynamic range shows all values in the observed
time interval.
When the user clicks any graph, a detailed graph twice as large is displayed, with a dynamically
adjusted Y-axis range.
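
The difference between the two ranges can be shown on the frequency offset graph, whose predefined range is the average ± 1 ppm. The sketch below works in ppb, the unit stored in the database; the function names and the sample values are made up for illustration, and the plotting itself is left out.

```python
def predefined_frequency_range(values_ppb):
    """Average frequency offset plus and minus 1 ppm (1000 ppb)."""
    average = sum(values_ppb) / len(values_ppb)
    return average - 1000, average + 1000


def dynamic_range(values):
    """Range adjusted so that every value in the interval is visible."""
    return min(values), max(values)


def outside(values, value_range):
    """Values that a graph with the given Y-axis range cannot show."""
    low, high = value_range
    return [value for value in values if not low <= value <= high]


if __name__ == "__main__":
    frequency_ppb = [100, 200, 300, 2600]
    fixed = predefined_frequency_range(frequency_ppb)
    print("predefined", fixed, "hidden", outside(frequency_ppb, fixed))
    print("dynamic", dynamic_range(frequency_ppb))
```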


6. Implementation
The NTPMON front-end has been programmed in PHP v.4 and both agents have been written in
C. The application also includes several PHP and bash scripts.

NTPMON is split across two computers. The clie agent runs on the “reference NTP system”, a
dedicated NTP server with a stable and accurate system clock. The computer is equipped
with an oven-controlled oscillator, and its system clock is synchronized by the 1 PPS signal
from a rubidium clock. All other parts of NTPMON, including the front-end and the database,
are installed and operated on a standard Linux server.

Using NTPMON is simple and intuitive. The user selects program parameters in several
sections:
     list of sites,
     type of interval for displayed graphs and/or events: last 6 hours, last 24 hours, last 7 days,
       last 30 days, selected day, selected week or selected month,
     beginning or end of the time interval – valid only when selected day / week / month is
       chosen,
     displayed objects: status, graphs, events.

The user finishes the selection by clicking the “Go” button, and all graphs and tables are
immediately displayed. When the user clicks any graph, a more detailed graph, twice as large,
is plotted.


7. Conclusion

NTPMON currently monitors 12 sites running NTP: our NTP servers, all
PerfMON sites (PerfMON is the CESNET network monitoring system) and several test computers. We
plan to add several new features in the next version, for instance sending alarms by e-mail or
SMS when an event occurs, profiles specifying a subset of monitored sites, and access to archived
graphs. NTPMON is available at http://ntpmon.cesnet.cz/ntpmon.



Appendix

A. Screen snapshots

Figure 1 - Input screen

Figure 2 - Detailed graph

Figure 3 - Status table

Figure 4 - Plotted graphs

B. SQL database structure

The database consists of four main tables:

        host - description of monitored sites. Most fields are filled in by the system
         administrator; only the operating system version and the NTP process description are updated
         by the agent,

        sample - stores data collected by the ntpq agent,

        meas - stores data collected by the clie agent,

        event - all agents check specified parameters and compare them with either a
         threshold or the previous value. Whenever a limit is exceeded, the agent generates an
         event and stores it in this table.


The following lists of fields are not complete; they show and explain only selected items:

host

       id               unique host system identification
       name             unique short host name (human readable)
       url              network address
       descr            long host name
       os               operating system type and version
       ver              NTP version


sample

       id_host          link to the host table
       time             sample timestamp
       stratum          NTP stratum
       refid            source of synchronization (NTP server, external clock)
       offset           time offset (declared by the system)
       freq             relative frequency offset
       disper           time dispersion (traced to stratum-1 server)
       reftime          last reference time
       stabil           frequency stability
       status           clock status

meas

       id_host          link to the host table
       time             sample timestamp
       stratum          NTP stratum
        refid             source of synchronization (NTP server, external clock)
        mea_offset        time offset (measured by the reference system)
        mea_delay         time delay (between local and reference clock)

event

        id_host           link to the host table
        time              timestamp of the event
        id_ev             type of event
        id_var            variable (parameter) associated with the event
        par               value of variable
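
For illustration, two of the tables can be sketched with the fields listed above. The sketch uses SQLite from the Python standard library as a stand-in for MySQL so that it runs without a server. The column sizes, the key on id, and the names open_db and store_measurement are assumptions; only the table names, field names and the restriction to CHAR and INT types come from this document.

```python
import sqlite3

# Field names follow the lists above; the CHAR sizes are guesses.
SCHEMA = """
CREATE TABLE host (
    id    INTEGER PRIMARY KEY,   -- unique host system identification
    name  CHAR(16) UNIQUE,       -- unique short host name
    url   CHAR(64),              -- network address
    descr CHAR(64),              -- long host name
    os    CHAR(32),              -- operating system type and version
    ver   CHAR(32)               -- NTP version
);
CREATE TABLE meas (
    id_host    INT,              -- link to the host table
    time       INT,              -- sample timestamp, seconds since 1.1.1970
    stratum    INT,              -- NTP stratum
    refid      CHAR(16),         -- source of synchronization
    mea_offset INT,              -- measured time offset in microseconds
    mea_delay  INT               -- measured delay in microseconds
);
"""


def open_db():
    """Create an in-memory database with the two example tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def store_measurement(conn, id_host, time, stratum, refid, offset_us, delay_us):
    """Insert one measurement of the clie agent into the meas table."""
    conn.execute("INSERT INTO meas VALUES (?, ?, ?, ?, ?, ?)",
                 (id_host, time, stratum, refid, offset_us, delay_us))


if __name__ == "__main__":
    conn = open_db()
    conn.execute("INSERT INTO host (id, name, url) VALUES (1, 'ref1', 'ntp.example.org')")
    store_measurement(conn, 1, 1268300000, 1, "GPS", -12, 850)
    print(conn.execute("SELECT name, mea_offset, mea_delay FROM host "
                       "JOIN meas ON meas.id_host = host.id").fetchall())
```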



References:
[1] Mills, D.L., “Network Time Protocol (Version 3) Specification, Implementation and Analysis”, RFC 1305, March 1992.


[2] Nagios, http://www.nagios.org/.


[3] NTP Time Server Monitor, http://www.meinberg.de/english/sw/time-server-monitor.htm.

				