LCG Monitoring and Fault Tolerance at CERN
Helge Meinhard / CERN-IT
07 November 2003
At CERN, a prototype of the LCG Tier 0 / Tier 1 computing farm is being built. The size
and complexity of this installation are such that, given past experience in HEP and in
industry, advanced tools are required for managing the farm. As part of these tools,
monitoring and fault tolerance systems have been developed and deployed. These are
now referred to as LEMON (LHC Era Monitoring). In this paper we describe the
architecture of the system, explain the components and their status as of October 2003,
and give an outlook.
To a large extent, monitoring and fault tolerance at CERN use components
developed within Work Package 4 (Fabric Management) of the European DataGrid
(EDG) project.
In the interest of node independence and scalability, an approach has been chosen in
which the local node controls itself when metrics are to be sampled on the node, and
subsequently submits them to a central repository for monitoring information. A local
spool also keeps data from the local node, in order to preserve data even if they cannot be
sent to the central repository, and for local consumers.
One of these consumers is a framework for local recovery actions. If a local
metric, or any combination of local metrics, is abnormal, the framework runs a program
intended to correct the malfunction, and logs the result back to the monitoring system.
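The node-local flow described above (sample on the node, keep a local spool, forward to the central repository) can be illustrated with a minimal Python sketch. It is not the MSA implementation: the class name `LocalAgent`, the JSON record layout and the `send` callable are assumptions made purely for illustration.

```python
import json
import time
from collections import deque


class LocalAgent:
    """Toy stand-in for a node-local agent: sample, spool, then forward."""

    def __init__(self, node, send, spool_size=1000):
        self.node = node
        self.send = send                       # hypothetical callable(bytes); raises OSError on failure
        self.spool = deque(maxlen=spool_size)  # bounded local history for local consumers
        self.pending = deque()                 # samples not yet accepted by the repository

    def sample(self, metric, read_value, now=None):
        """Take one sample; the node itself decides when this happens."""
        record = {
            "node": self.node,
            "metric": metric,
            "time": time.time() if now is None else now,
            "value": read_value(),
        }
        self.spool.append(record)    # always kept locally
        self.pending.append(record)  # queued for the central repository
        self.flush()
        return record

    def flush(self):
        """Forward pending samples in order; stop at the first failure."""
        while self.pending:
            payload = json.dumps(self.pending[0]).encode("utf-8")
            try:
                self.send(payload)
            except OSError:
                return False         # repository unreachable: data stays queued
            self.pending.popleft()
        return True
```

In this sketch a failed send leaves the sample both in the spool and in the pending queue, so a later `flush()` delivers it once the repository is reachable again. The pending queue is unbounded here; a real agent would presumably limit it.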
[Figure: architecture diagram; surviving labels: Sensor, Sensor, Agent]
The monitoring server receives the messages from the monitoring sensor agent (MSA), processes them, and stores
the metrics as appropriate. It also provides access to the metrics for global consumers. A
very important example of a global consumer is the alarm screen for the computer centre
3. Status of components
3.1 Monitoring sensor agent (MSA)
The basic MSA functionality has been stable for more than two years; the MSA (as well
as a set of sensors for Linux machines) has been deployed on some 1500 nodes. Recent
releases have added the possibility to send (potentially different subsets of) data to
multiple repositories, data smoothing (don’t send if value is within a certain band around
last sent value), wall-clock sampling, and sampling on demand. Not all these features are
currently used in production.
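The data smoothing feature mentioned above (do not send a value that lies within a certain band around the last sent value) can be sketched as follows. The class name `DeadbandFilter` and the use of a symmetric absolute band are assumptions for illustration; how the MSA actually configures the band is not described here.

```python
class DeadbandFilter:
    """Suppress values that stay within +/- band of the last value sent."""

    def __init__(self, band):
        self.band = band
        self.last_sent = None  # nothing sent yet

    def should_send(self, value):
        # The first value is always sent; later ones only if they leave the band.
        if self.last_sent is None or abs(value - self.last_sent) > self.band:
            self.last_sent = value
            return True
        return False


def smooth(values, band):
    """Return the subset of values that would actually be sent."""
    flt = DeadbandFilter(band)
    return [v for v in values if flt.should_send(v)]
```

For example, `smooth([10.0, 10.2, 10.4, 11.1, 11.0, 9.9], band=0.5)` keeps only `10.0`, `11.1` and `9.9`. Note that the band is centred on the last value sent, not on the last value sampled, so a slow drift is still reported once it exceeds the band.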
3.2 Sensors
Currently two production-quality sensors are in use. The first one is implemented as a
C++ program using a sensor framework in C++, and is delivering universal metrics
offering a performance-oriented view on Linux systems. The second one, a Perl script
and library, is more oriented towards exception monitoring. Numerous specialisations
exist in order to take into account the different functionalities of nodes at CERN. More
metrics are added constantly as need arises. Nonetheless, the core of this sensor has been
very stable over the past 18 months. Together, these two sensors report some
80 to 120 metrics per machine, depending on its functionality, with sampling rates ranging
from once per minute to once at start-up.
Recent additions of functionality have covered the machine hardware (processor,
mainboard, memory, disks …), and the BIOS settings.
Work is in progress to add metrics specifically targeted at disk servers and tape
servers. Some metrics are being developed from scratch, while a significant number of
other metrics exist already and only need to be integrated with the monitoring
system.
Developments have started for sensors that can collect metrics from devices on which no
MSA can be run (e.g. network equipment such as switches). These sensors will
run on one or more dedicated machines with MSAs, and will use suitable network
protocols to connect to the devices (SNMP in the case of network equipment) in order to
report the metrics on the devices' behalf.
3.3 Local spool
The local spooling functionality has been stable since the beginning of LEMON
3.4 Local consumers – local recovery actions
Local recovery actions have not been configured at large scale yet. Some candidate
actuators for simple malfunctions of the numerous batch systems have been identified
(e.g. those acting on /tmp full and /pool full conditions). The framework for Fault
Tolerance delivered by WP4 has been evaluated and has shown some weaknesses that have
reportedly been addressed. We are currently discussing whether to use this complex
framework for the simple local use cases at hand, in particular as its support after the
end of the EDG project appears somewhat unclear.
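To make the idea of a simple local recovery action concrete, the following Python sketch evaluates conditions on local metrics, runs an actuator when a condition holds, and reports the outcome. It is not the WP4 Fault Tolerance framework: the function names, the rule tuple layout and the 95% threshold in the usage note are hypothetical.

```python
import shutil


def fraction_used(path):
    """Fraction of the file system holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def run_recovery(metrics, rules, report):
    """Run the actuator of every rule whose condition holds.

    metrics: dict of current local metric values
    rules:   list of (name, condition, actuator); the condition receives the
             whole metrics dict, so it may combine several metrics
    report:  callable receiving one result record per triggered rule
    """
    outcomes = []
    for name, condition, actuator in rules:
        if not condition(metrics):
            continue
        try:
            result, status = actuator(), "ok"
        except Exception as exc:  # the actuator itself may fail
            result, status = str(exc), "failed"
        report({"rule": name, "status": status, "result": result})
        outcomes.append((name, status))
    return outcomes
```

A rule for the /tmp full case could then look like `("tmp_full", lambda m: m["tmp_used"] > 0.95, clean_tmp)`, where `clean_tmp` is a site-specific actuator that is deliberately left out here, and `report` would hand the record back to the monitoring system.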
3.4 Protocol between MSA and Monitoring Repository
The LEMON framework supports two proprietary protocols, one based on
TCP, the other on UDP. In the production system, the UDP-based protocol is
used. In order to address potential concerns about data loss, we have verified that the
message loss is at the 10E-5 level or lower.
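One way to estimate message loss on a UDP-based transport is to number the datagrams and count gaps on the receiving side. The sketch below shows that idea only. The wire format (a 4-byte sequence number followed by UTF-8 text) is an assumption for illustration and is not the LEMON protocol, and the text does not state how the loss measurement quoted above was actually made.

```python
import struct

# Hypothetical header: 4-byte unsigned sequence number, network byte order.
HEADER = struct.Struct("!I")


def encode(seq, text):
    """Build one datagram payload: sequence number followed by UTF-8 text."""
    return HEADER.pack(seq) + text.encode("utf-8")


def decode(datagram):
    """Split a datagram payload back into (sequence number, text)."""
    (seq,) = HEADER.unpack_from(datagram)
    return seq, datagram[HEADER.size:].decode("utf-8")


def loss_fraction(received_seqs):
    """Fraction of messages missing between the lowest and highest sequence seen.

    Duplicates are ignored. Messages lost before the first or after the last
    received datagram cannot be detected this way, and sequence wrap-around
    is not handled in this sketch.
    """
    seen = set(received_seqs)
    if not seen:
        return 0.0
    expected = max(seen) - min(seen) + 1
    return (expected - len(seen)) / expected
```

A sender would pass `encode(seq, ...)` to `socket.sendto()` on a `SOCK_DGRAM` socket, and the receiver would collect the sequence numbers returned by `decode()`. Confirming a loss rate at the level quoted above requires a sample of well over 100,000 messages.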
The TCP-based implementation will become interesting once we want the traffic on the
network to be encrypted, which seems easily possible by adding an SSL layer to the MSA
and the Monitoring repository in the case of the TCP-based transport. In order to address
the scalability issue caused by multiple open connections, a proxy scheme has been
implemented as a middle layer between the monitored nodes and the repository, bundling
many connections from the nodes into a single connection to the repository.
Already in 2002, some research was done on industry-standard protocols,
notably SNMP. The result was that a solution based on the net-snmp
daemon on the local host could indeed replace the functionality of the sensors, the MSA
and the network link to the repository. However, as the UDP-based proprietary protocol
and the MSA-sensor combination were already used in production, this possibility has
not been followed up any further for the time being.
3.5 Monitoring repository
Within the WP4 development activities, a monitoring repository server has been
developed that has been implemented with various back-ends. To date, versions with a
flat file archive (called FMON), and with an Oracle RDBMS (called OraMonServer)
exist; a version that interfaces with ODBC is almost finished, with a view to using free
RDBMSs such as MySQL. (We note, however, that the ODBC implementation
could be used with Oracle as well, leading to an alternative Oracle implementation.) At
CERN, we are particularly interested in an Oracle-based implementation for long-term
storage of data because of the powerful query and reporting tools that are available. For
about four months, OraMonServer has been used in production at CERN, with about
1500 clients reporting metrics to it (see above). A number of problems have been
observed (partly with the way the Oracle database back-end had been set up), and have
been resolved in very efficient cycles with the developer. To date, and with the current
load, we consider this server stable.
As a result of another project initially independent of WP4, there is an alternative
implementation available using the SCADA system PVSS II (the control framework
chosen by CERN for the LHC experiments). Although the internals are entirely different
from OraMonServer, the interfaces for sending data to it, and the API for extracting data,
are identical. The PVSS system has been recording and archiving data for more than 12
months in a stable fashion, partly in parallel with the OraMonServer (due to the MSA
feature of being able to send data to multiple destinations), but has recently shown
serious stability problems that have so far not been fully understood and solved.
3.6 Alarm display
As part of the project around PVSS II, in close collaboration with the operators in the
CERN computer centre, a powerful alarm display has been developed, and has been used
by the operators for a number of months in test mode next to the old Sure alarm display.
The PVSS-based display takes into account the special needs of the large installation at
CERN; by using the PVSS-provided interface builders, it also respects international
standards for control interfaces.
Within the WP4 project, an alarm display is being developed using Java/Swing
technology. Its main motivation is to honour CERN’s commitments towards WP4; it has
not been targeted at large-scale installations.
3.7 Access API
Early during the WP4 development, a repository access API was defined in the form of a
C .h file. This API has been implemented for the WP4 repository server class (via a
SOAP server), and for the PVSS system (by linking with PVSS API libraries). Over time,
it has become clear that this API is neither easy nor intuitive for programmers to use, in
particular in view of implementing scripting-language interfaces on top of the C interface.
A simplified API for C has hence been defined and implemented, as have interfaces for
Perl. Apart from potentially supporting more scripting languages, no major development
is expected to be required.
However, we note that the API, while shielding the implementation details of the
repository from the consumer, only allows for retrieval of simple time series of metrics.
More complex queries cannot use the API, and will hence likely be based on whatever
RDBMS is chosen. Care should be taken to decouple as much as possible the client
applications from the repository internals. Using views and stored procedures in Oracle
would be a way in this direction.
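The decoupling suggested above can be illustrated with a view that hides the physical table layout from client queries. The sketch uses Python's built-in `sqlite3` module as a stand-in for Oracle, and all table, column and view names are hypothetical; SQLite has no stored procedures, so only the view half of the suggestion is shown.

```python
import sqlite3


def build_repository():
    """Create a toy repository: internal tables plus a stable client-facing view."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- Internal layout; free to change without affecting clients.
        CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE metric_defs (metric_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE raw_samples (node_id INTEGER, metric_id INTEGER,
                                  ts INTEGER, val REAL);

        -- Client-facing contract: clients query this view only.
        CREATE VIEW metric_series AS
            SELECT n.name AS node, m.name AS metric, s.ts AS time, s.val AS value
            FROM raw_samples s
            JOIN nodes n ON n.node_id = s.node_id
            JOIN metric_defs m ON m.metric_id = s.metric_id;
    """)
    return db


def average_by_node(db, metric):
    """A 'complex' client query written against the view, not the tables."""
    cursor = db.execute(
        "SELECT node, AVG(value) FROM metric_series "
        "WHERE metric = ? GROUP BY node ORDER BY node",
        (metric,),
    )
    return cursor.fetchall()
```

If the internal tables are later reorganised, only the view definition needs to be adapted; queries such as `average_by_node` keep working unchanged.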
3.8 Combined metrics
Work has begun to calculate metrics by combining information stored in the monitoring
repository (typically coming from multiple nodes), and possibly gathered from elsewhere.
It is expected that the resulting metrics will be fed into the monitoring repository as well.
Their sampling may happen regularly, or be triggered by changes of metrics stored in the
repository. The first objective is to provide a proof of concept that the system is capable
of delivering the monitoring information that the experiment users (some of whom are
running their own processes for monitoring on machines in the computer centre now)
need. We are hence starting with combined metrics looking at various aspects of the
lxbatch cluster nodes (load, swap occupation and activity, /tmp and /pool occupation) as a
function of which users' or groups' jobs are running on the node.
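As an illustration of such a combined metric, the sketch below joins a per-node metric with information about which user's jobs run on which node, and derives one value per user. The input shapes and the choice of a plain mean over distinct nodes are assumptions; the actual definitions being developed are not given here.

```python
from collections import defaultdict


def load_by_user(node_load, jobs):
    """Mean load of the nodes on which each user currently has jobs.

    node_load: dict mapping node name to its latest load value
    jobs:      iterable of (node, user) pairs, e.g. taken from the batch system
    Nodes without a load value are skipped, and each node counts once per
    user regardless of how many of that user's jobs it runs.
    """
    nodes_of = defaultdict(set)
    for node, user in jobs:
        if node in node_load:
            nodes_of[user].add(node)
    return {
        user: sum(node_load[n] for n in nodes) / len(nodes)
        for user, nodes in nodes_of.items()
    }
```

The resulting per-user values could be written back to the repository like any other metric, and the same pattern applies to swap, /tmp or /pool occupation, or to groups instead of users.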
We are convinced that a lot of very useful work has been done. The quality and stability
of the services provided to the users have increased significantly over the last few years,
which we believe is to some extent due to this work. We are also convinced that what has
been done so far forms a solid basis for further development and deployment. However,
much remains to be done. In particular, the situation of the repository and the alarm
displays needs to be clarified. (However, because of the standard interfaces,
developments in other areas are not hindered by the repository/display situation.) To date,
offering a functional alarm display for the CERN computer centre, and storing data in
Oracle, requires that two different repository systems be used. This is expected to change:
For the next version of PVSS, support for native data storage in Oracle has been
announced that should be evaluated against our criteria for Oracle storage. If this does not
meet our requirements, we may wish to consider developing further the WP4-repository-
based Java alarm display in order to cope with CERN’s requirements, or develop
something new based on a careful review of requirements and technologies. If the PVSS
display proves superior, we may wish to consider stripping the PVSS system down to the
minimum required for the alarm display, and feeding it via the WP4 repository rather
than by the nodes directly.
Developing and maintaining the sensors is a permanent task. However, first steps need to
be made towards monitoring equipment not covered by the system so far. A start has
been made with the network switches. Tape hardware and services, disk servers, software
systems (e.g. Castor, Oracle…) as well as machines not covered yet (Windows, AFS
servers…) will need to be included.
So far, the system is configured with ad-hoc format configuration files both on the local
node and on the repository. Work must be done to integrate the configuration with the
tools provided by WP4, most notably CDB.
Other areas of work have been mentioned above, such as deploying a system for local
recovery actions, adding encryption to the TCP transport, proceeding with the
Apart from the repository access API, user interface activities have so far focused on the
alarm display functionality for operators. We will need to consider how to also provide
interfaces that are directly useful for service managers, line managers, and service users.