NTPMON

1. Introduction

All passive and active monitoring systems require an accurate local clock. Knowledge of the exact system time is essential mainly for timestamping captured packets and for time-sensitive applications such as one-way delay measurement. The expected absolute accuracy (the difference between local time and UTC) varies from 10^-3 s to 10^-5 s. The most common clock synchronization method in the networking environment is NTP, optionally with a GPS receiver as an external time source.

The NTP process should be monitored; otherwise we have no evidence that any particular time-dependent measurement is correct. Universal tools for monitoring network services exist (e.g., Nagios), but they test only the availability of the NTP service and cannot deal with details. Another tool is NTP Time Server Monitor, but it is designed mainly for local NTP monitoring. We looked for a centralized system that can monitor many external NTP sites, and we decided to develop such a system and integrate it into our network monitoring infrastructure.

This document describes NTPMON, a centralized NTP monitoring system that checks parameters of NTP processes running on remote workstations, collects data into a database, plots graphs and generates events if any parameter is above or below the specified threshold. NTPMON can monitor even sites that are administered by another authority, as it does not need any nonstandard cooperation with the monitored site.

2. Data collection

There is no universal method for obtaining all important parameters of the NTP process. Some data are logged by the NTP process; others are available via system functions (e.g., adjtimex(), ntp_adjtime()) and tools (e.g., ntpq). As our goal was to implement a centralized system without any new software running on the monitored site, we decided to omit log mining and all locally running tools. We designed and programmed agents for parameter collection.
Each monitoring agent runs and saves data independently.

NTP process polling

The ntpq agent periodically queries the status of the remote NTP process with the command "ntpq -c rl". The parameters are parsed and inserted into the database. The status of each NTP process is described by a set of qualitative and quantitative parameters. NTPMON displays a selected subset of these parameters:

o stratum - the "synchronization distance" to the primary NTP server. A primary NTP server (i.e., an NTP server with an external clock) has stratum 1,
o time offset - the difference between the NTP server time and the local time,
o frequency offset - the correction factor of the local clock frequency, expressed as a relative unitless value in ppm (parts per million). The frequency offset value itself is less important than its variation, which reflects changes of the oscillator frequency,
o root dispersion - the maximal difference between local time and the root (primary) NTP server time. Its calculation is based on the assumption of the worst possible oscillator (in)stability.

NTP client

In principle, the clie agent compares local time with the time of the observed system, so it has to run on a computer with a very accurate and stable clock. We treat this clock as the reference clock and call it REF. The clie agent behaves like an NTP client that sends an NTP query to the monitored (remote) NTP process. From the response, the clie agent calculates and stores in the database:

o measured time offset (θC) - the time difference between REF time and the monitored computer time,
o measured delay (δC) - the propagation delay of the NTP query and response.

As a side effect, the agent also checks that the remote NTP service is available.

Let us assume that the REF clock uncertainty is negligible. Then the real time offset θ of the remote NTP clock is bounded by:

θC - δC/2 ≤ θ ≤ θC + δC/2

where θC is the time offset calculated by clie and δC is the delay between REF and the remote clock.
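
The offset and delay calculation of an NTP client, together with the resulting bounds on the real offset, can be sketched as follows. This is a minimal illustration and not the clie agent's actual code: it assumes the standard four-timestamp NTP exchange, with the offset defined as remote time minus REF time. The class name, the function names and the sample timestamps are hypothetical. Times are integers in microseconds.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NtpExchange:
    """Four timestamps of one NTP query/response, in microseconds."""
    t1: int  # query sent, read on the REF clock
    t2: int  # query received, read on the remote clock
    t3: int  # response sent, read on the remote clock
    t4: int  # response received, read on the REF clock


def measured_offset(x: NtpExchange) -> float:
    """theta_C: remote time minus REF time, assuming symmetric paths."""
    return ((x.t2 - x.t1) + (x.t3 - x.t4)) / 2


def measured_delay(x: NtpExchange) -> int:
    """delta_C: round-trip time without the remote processing time."""
    return (x.t4 - x.t1) - (x.t3 - x.t2)


def offset_bounds(theta_c: float, delta_c: float) -> Tuple[float, float]:
    """Interval that contains the real offset if REF itself is exact."""
    half = delta_c / 2
    return theta_c - half, theta_c + half


# Hypothetical exchange: remote clock 30 us ahead of REF, 4 ms outbound path,
# 1 ms processing on the remote host, 2 ms return path.
example = NtpExchange(t1=1_000_000, t2=1_004_030, t3=1_005_030, t4=1_007_000)
theta_c = measured_offset(example)
delta_c = measured_delay(example)
low, high = offset_bounds(theta_c, delta_c)
print(theta_c, delta_c, low, high)
```

In this example the asymmetric path pushes the measured offset to 1030 μs although the real offset is only 30 μs. The real value still lies inside the interval from -1970 μs to 4030 μs, which is exactly what the bound guarantees.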

SNMP client

NTP version 4 is going to support SNMP; unfortunately, this support is neither standardized by the IETF nor implemented yet. In the future, when NTP v.4 is widely deployed, we expect to program an snmp agent, which will probably replace the ntpq agent.

3. Database

NTPMON uses two databases: MySQL and RRD (round-robin database).

SQL database

Agents store all collected data in the MySQL database. They also check specified parameters and compare them with either a threshold or the previous value. Whenever a limit is exceeded, the agent generates an event and stores it in the database. We decided to avoid floating-point types, so we restricted field types to CHAR (fixed-length text) and INT (integer value). We chose appropriate parameter units:

o timestamps - all timestamps have a resolution of 1 s and are expressed as an unsigned integer: the number of UTC seconds since 0:00:00 on 1.1.1970,
o time offset, delay, dispersion - expressed in microseconds as a signed integer,
o frequency offset - expressed in ppb (parts per billion, i.e., 10^-9 or ns/s) as a signed integer.

RRD database

NTPMON displays several types of parameters in graphs. All such parameters are stored in the RRD, because it implements two useful features: graph plotting and aggregation of old data that corresponds to the intervals displayed by daily, weekly and monthly graphs. The database contains individual values and the average, minimum and maximum for every 10 minutes, 1 hour and 6 hours. Each monitored site has its own RRD database, which is split into two parts so that the agents do not interfere with each other:

o time offset, frequency offset and dispersion are collected and stored by the ntpq agent,
o measured time offset and measured delay are collected and stored by the clie agent.

4. Events

Agents check the values of collected parameters in real time and generate static events (i.e., the value exceeds a threshold) or dynamic events (i.e., the value changes too rapidly).
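
The difference between static and dynamic events can be sketched as follows. This is only an illustration under stated assumptions, not the agents' actual heuristic: the function names are hypothetical, and the step limit used for the dynamic check is an invented value. The offset and delay thresholds are the ones given for the threshold events in this section, and all values are integers in microseconds, as in the SQL database.

```python
from typing import Optional

DELAY_LIMIT_US = 20_000  # 20 ms delay threshold from the event list


def offset_limit_us(stratum: int) -> int:
    """Offset threshold: 50 us for stratum 1, 1 ms for stratum 2 and higher."""
    return 50 if stratum == 1 else 1_000


def static_event(value_us: int, limit_us: int) -> bool:
    """Static event: the magnitude of the current value exceeds a threshold."""
    return abs(value_us) > limit_us


def dynamic_event(previous_us: Optional[int], current_us: int,
                  max_step_us: int) -> bool:
    """Dynamic event: the value changed too much since the previous sample."""
    if previous_us is None:  # first sample, nothing to compare with
        return False
    return abs(current_us - previous_us) > max_step_us


MAX_OFFSET_STEP_US = 100  # assumed value, chosen only for this example

samples_us = [12, 15, 140, 138]  # hypothetical offsets of a stratum-2 host
previous = None
events = []
for value in samples_us:
    if static_event(value, offset_limit_us(2)):
        events.append(("static", value))
    if dynamic_event(previous, value, MAX_OFFSET_STEP_US):
        events.append(("dynamic", value))
    previous = value
print(events)
```

Here the jump from 15 μs to 140 μs raises a dynamic event although every sample stays well below the 1 ms static threshold for a stratum-2 host. Keeping the two checks separate lets each parameter have its own threshold and its own step limit.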

The set of events and thresholds was selected according to our long-term experience with NTP, and we expect to keep updating the heuristic algorithm that generates events. Currently, we recognize the following 11 types of events, which belong to 3 groups:

availability
o no system response - the observed system did not reply within one minute,
o no NTP service - the observed system did not answer with a valid NTP message,
o system restart,

qualitative parameter change
o OS version - the OS has been changed recently,
o NTP version - the NTP version has been changed recently,
o stratum - the stratum level has changed,
o REFID - the ID of the reference NTP server has changed,
o PPS signal,

threshold exceeded
o offset - the measured offset exceeded 50 μs (Stratum-1 server) or 1 ms (Stratum-2 and higher),
o delay - the measured delay (round-trip time) between the monitored site and the reference site exceeded 20 ms,
o frequency stability.

NTPMON aggregates events in order to reduce the number of past, less important events. Aggregation is done in two steps, weekly and monthly. It includes the deletion of warnings and the assignment of coarser time intervals to events.

5. Graphs

NTPMON generates graphs of the following parameters for intervals of 6 hours, one day, one week and one month:

o time offset - the time offset reported by the NTP process. The predefined range is (-50 μs : +50 μs).
o frequency offset - the correction factor of the local clock frequency. The predefined range is (AVR - 1 ppm : AVR + 1 ppm), where AVR is the average frequency offset.
o root dispersion - the maximal difference between the local clock and the root (primary) NTP server. The predefined range is (0 ms : +5 ms).
o measured time offset - the time offset measured by the clie agent. The predefined range is (-50 μs : +50 μs).
o measured delay - the round-trip time spent by NTP protocol packets between the monitored and reference sites. The predefined range is (0 ms : +5 ms).

All graphs can be plotted with one of two Y-axis ranges: predefined or dynamically adjusted. The predefined range is suitable for a quick comparison of several graphs, but it does not show values that exceed the range limits. The dynamic range shows all values in the observed time interval. When the user clicks any graph, a detailed graph, twice as large and with a dynamically adjusted Y-axis range, is displayed.

6. Implementation

The NTPMON front-end has been programmed in PHP v.4 and both agents have been written in C. The application also includes several PHP and bash scripts.

NTPMON is split across two computers. The clie agent runs on the "reference NTP system", a dedicated NTP server with a stable and accurate system clock. The computer is equipped with an oven-controlled oscillator, and the system clock is synchronized by the 1pps signal from a rubidium clock. All other parts of NTPMON, including the front-end and the database, are installed and operated on a standard Linux server.

Using NTPMON is simple and intuitive. The user selects program parameters in several sections:

o list of sites,
o interval type of displayed graphs and/or events: last 6 hours, last 24 hours, last 7 days, last 30 days, selected day, selected week or selected month,
o beginning or end of the time interval - valid only when a selected day, week or month is chosen,
o displayed objects: status, graphs, events.

The user finishes the selection by clicking the "Go" button, and all graphs and tables are displayed immediately. When the user clicks any graph, a more detailed graph, twice as large, is plotted.

7. Conclusion

NTPMON currently monitors 12 sites running NTP: our NTP servers, all PerfMON sites (PerfMON is the CESNET network monitoring system) and several test computers. We plan to add several new features in the next version, for instance sending alarms by e-mail or SMS when an event occurs, profiles specifying a subset of investigated sites, and access to archived graphs.
NTPMON is available at http://ntpmon.cesnet.cz/ntpmon.

Appendix

A. Screen snapshots

Figure 1 - Input screen
Figure 2 - Detailed graph
Figure 3 - Status table
Figure 4 - Plotted graphs

B. SQL database structure

The database consists of four main tables:

o host - description of monitored sites. Most fields are filled in by the system administrator; only the operating system version and the NTP process description are updated by the agent,
o sample - stores data collected by the ntpq agent,
o meas - stores data collected by the clie agent,
o event - all agents check specified parameters and compare them with either the threshold or the previous value. Whenever a limit is exceeded, the agent generates an event and stores it in this table.

The following list of fields is not complete; it shows and explains only selected items:

host
  id          unique host system identification
  name        unique short host name (human readable)
  url         network address
  descr       long host name
  os          operating system type and version
  ver         NTP version

sample
  id_host     link to the host table
  time        sample timestamp
  stratum     NTP stratum
  refid       source of synchronization (NTP server, external clock)
  offset      time offset (declared by the system)
  freq        relative frequency offset
  disper      time dispersion (traced to stratum-1 server)
  reftime     last reference time
  stabil      frequency stability
  status      clock status

meas
  id_host     link to the host table
  time        sample timestamp
  stratum     NTP stratum
  refid       source of synchronization (NTP server, external clock)
  mea_offset  time offset (measured by the reference system)
  mea_delay   time delay (between local and reference clock)

event
  id_host     link to the host table
  time        timestamp of the event
  id_ev       type of event
  id_var      variable (parameter) associated with the event
  par         value of variable

References:

o Mills, D.L., "Network Time Protocol Specification, Implementation and Analysis", RFC 1305, March 1992.
o Nagios, http://www.nagios.org/.
o NTP Time Server Monitor, http://www.meinberg.de/english/sw/time-server-monitor.htm.