
Lecture 9 - Grid Monitoring


Grid Monitoring Architecture (GMA)
       The diversity of components and their large number of
        users render them vulnerable to faults, failure and
        excessive loads
       Grid monitoring is a critical facet for providing a
        robust, reliable and efficient environment
       The goal is to measure and publish the state of all
        resources – software, hardware, and networks – at a
        particular point in time
       Monitoring data can be used in
           Fault detection
           Recovery
           Performance forecast

    2                                                Grid Computing
Why do we need monitoring?
       Debugging purposes
       Resource Utilization
       Performance Evaluation
       Security
       Management Decisions
       Accounting

The Challenges of Grid Monitoring
 No single point of observation
 No central point of monitoring information
 Diverse Hardware and Software Systems
 Different policies and decision making mechanisms
 Network monitoring is very important
 Larger monitoring data sets
 Security

Characteristics for Grid Monitoring
 Scalable
 Dynamic
 Robust
 Flexible
 Should be integrated with other Grid Technologies
     and middleware (security infrastructure, resource
     brokers, schedulers, ...)

Grid Monitoring Architecture (GMA)
       The GMA consists of three types of components
           Directory Service: supports information publication and discovery
           Producer: makes performance data available (performance event source)
           Consumer: receives performance data (performance event sink)

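       [Figure: GMA components. The Producer stores its location in the
        Directory Service, the Consumer looks it up there, and event
        subscription and data transfer take place directly between
        Consumer and Producer]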
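
The three components can be sketched in a few lines of Python. This is a minimal illustration rather than a GMA implementation: the class names, the `cpu_load` event type and the `node-a` host are assumptions made for the example. The point it shows is that the directory only records who offers which event type, while the event data itself travels directly from producer to consumer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    """A single monitoring event (hypothetical structure)."""
    event_type: str
    source: str
    value: float


class Producer:
    """Event source: answers a query with its current measurement."""

    def __init__(self, name: str, readings: Dict[str, float]) -> None:
        self.name = name
        self._readings = readings

    def query(self, event_type: str) -> Event:
        return Event(event_type, self.name, self._readings[event_type])


class DirectoryService:
    """Records which producer offers which event type; stores no event data."""

    def __init__(self) -> None:
        self._producers: Dict[str, List[Producer]] = {}

    def register(self, event_type: str, producer: Producer) -> None:
        self._producers.setdefault(event_type, []).append(producer)

    def lookup(self, event_type: str) -> List[Producer]:
        return list(self._producers.get(event_type, []))


class Consumer:
    """Event sink: finds producers via the directory, then queries them directly."""

    def __init__(self, directory: DirectoryService) -> None:
        self._directory = directory

    def collect(self, event_type: str) -> List[Event]:
        return [p.query(event_type) for p in self._directory.lookup(event_type)]


directory = DirectoryService()
directory.register("cpu_load", Producer("node-a", {"cpu_load": 0.42}))
events = Consumer(directory).collect("cpu_load")
print(events)
```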

       Consumer is a program that receives monitoring data
        (events) from one or more producers
       Different types of consumer
           Archiving consumer: aggregates and stores monitoring
            data for later retrieval and analysis.
           Real-time consumer: collects monitoring data in real time.
           Overview consumer: collects events from several sources
            and uses the combined information for decision making.
           Job monitoring consumer: triggers an action based on an
            event from a job.

       Consumer steps
        1.     Locate events: Consumers search a schema repository for a new event type. The
               schema repository can be part of the GMA Directory Service.
        2.     Locate producers: Consumers search the Directory Service to find a suitable
               producer.
        3.     Initiate a query: Consumers request event(s) from a producer, which are delivered
               as part of the reply.
        4.     Initiate a subscription: Consumers subscribe to a producer for the kinds of
               events they are interested in. The requested event(s) are delivered in a stream.
        5.     Initiate an unsubscribe: Consumers terminate a subscription to a producer.
        6.     Register: Consumers can add/remove/update one or more entries in the Directory
               Service that describe events that the consumer will accept from producers.
        7.     Accept query: Consumers can also accept a query request from a producer. The
               “query” itself carries the event(s), so it also serves as the response.
        8.     Accept subscribe: Consumers accept a subscribe request from a producer. The
               producer will be notified automatically once there are requests from the consumers.
        9.     Accept unsubscribe: Consumers accept an unsubscribe request from a producer. If
               this succeeds, no more events will be accepted for this subscription.
            Consumers that initiate the flow of events should support steps 2-5
            Consumers that allow a producer to initiate the flow of events should support
             steps 6-9
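
The consumer-initiated path (steps 2-5) can be sketched as follows. This is an illustration under stated assumptions: `StubProducer` and its `measure` method are hypothetical, a plain dict stands in for the Directory Service, and the stub ignores the event type when matching subscriptions.

```python
from typing import Callable, Dict, List

Callback = Callable[[str, float], None]


class StubProducer:
    """Hypothetical producer exposing query, subscribe and unsubscribe."""

    def __init__(self) -> None:
        self._value = 0.0
        self._subscribers: Dict[int, Callback] = {}
        self._next_id = 0

    def query(self, event_type: str) -> float:
        return self._value

    def subscribe(self, event_type: str, callback: Callback) -> int:
        self._next_id += 1
        self._subscribers[self._next_id] = callback
        return self._next_id

    def unsubscribe(self, subscription_id: int) -> None:
        self._subscribers.pop(subscription_id, None)

    def measure(self, event_type: str, value: float) -> None:
        """Record a new measurement and push it to current subscribers."""
        self._value = value
        for callback in list(self._subscribers.values()):
            callback(event_type, value)


# Step 2: locate a producer (a dict stands in for the Directory Service).
directory = {"cpu_load": StubProducer()}
producer = directory["cpu_load"]
received: List[float] = []

# Step 3: one-off query; the event comes back as part of the reply.
producer.measure("cpu_load", 0.10)
received.append(producer.query("cpu_load"))

# Step 4: subscribe, so that later events arrive as a stream.
sub_id = producer.subscribe("cpu_load", lambda _type, value: received.append(value))
producer.measure("cpu_load", 0.55)

# Step 5: unsubscribe; later measurements are no longer delivered.
producer.unsubscribe(sub_id)
producer.measure("cpu_load", 0.90)

print(received)
```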

Directory Service
       Directory Service provides information about
        producers or consumers.
       When producers and consumers publish their
        existence, they must provide event types they produce
        or consume.
       The publication information allows producers and
        consumers to discover the types of available events,
        the characteristics of that data, and sources or sinks of
        that data.
       Directory Service is not responsible for the storage of
        event data; only information about which event
        instances can be provided.
Directory Service
    Functions supported by the Directory Service
        Authorize a search: Establish the identity of a consumer
         that wants to undertake a search.
        Authorize a modification: Establish the identity of a
         consumer that wishes to modify entries.
        Add: Add a record to the directory.
        Update: Change the state of a record in the directory
        Remove: Remove a record from the directory.
        Search: Perform a search for a producer or consumer of a
         particular type, possibly with fixed values for some of the
         event elements.
    There can be more than one Directory Service.
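
The functions listed above map naturally onto a small class. The sketch below is an assumption-laden illustration: identities are plain strings checked against reader and writer sets, and records are dicts with hypothetical `role`, `event_type` and `host` fields. A real Directory Service would authenticate callers through the Grid security infrastructure.

```python
from typing import Dict, List, Set


class DirectoryService:
    """Toy directory: holds records about producers/consumers, never event data."""

    def __init__(self, readers: Set[str], writers: Set[str]) -> None:
        self._readers = readers
        self._writers = writers
        self._records: Dict[str, dict] = {}

    def _authorize(self, identity: str, allowed: Set[str], what: str) -> None:
        if identity not in allowed:
            raise PermissionError(f"{identity} may not {what} the directory")

    def add(self, identity: str, key: str, record: dict) -> None:
        self._authorize(identity, self._writers, "modify")
        self._records[key] = dict(record)

    def update(self, identity: str, key: str, **changes: str) -> None:
        self._authorize(identity, self._writers, "modify")
        self._records[key].update(changes)

    def remove(self, identity: str, key: str) -> None:
        self._authorize(identity, self._writers, "modify")
        del self._records[key]

    def search(self, identity: str, role: str, event_type: str, **fixed: str) -> List[str]:
        """Match on role and event type, plus any fixed event element values."""
        self._authorize(identity, self._readers, "search")
        wanted = {"role": role, "event_type": event_type, **fixed}
        return sorted(
            key for key, record in self._records.items()
            if all(record.get(name) == value for name, value in wanted.items())
        )


directory = DirectoryService(readers={"alice", "bob"}, writers={"alice"})
directory.add("alice", "p1", {"role": "producer", "event_type": "cpu_load", "host": "node-a"})
directory.add("alice", "p2", {"role": "producer", "event_type": "cpu_load", "host": "node-b"})
directory.update("alice", "p2", host="node-c")

all_cpu = directory.search("bob", "producer", "cpu_load")
on_node_c = directory.search("bob", "producer", "cpu_load", host="node-c")
directory.remove("alice", "p1")
after_removal = directory.search("bob", "producer", "cpu_load")
```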

Grid Monitoring Architecture (GMA)
    Extended Grid Monitoring Architecture with multiple Directory Services

              Consumer                Consumer               Consumer

                        Event Directory Service Gateway

              Event Directory        Event Directory       Event Directory
                 Service                Service               Service

          Producer       Producer      Producer        Producer      Producer

                                    Grid Resources

    A producer is a software component that sends
     monitoring data (events) to a consumer
    Producers can deliver events in a stream or as a single
     response per request.
    Producers are also used to provide access control to
     the event depending on policies, varying frequencies of
     measurement and ranges of performance detail.

    Producer steps
     1.   Locate event: Search the Event Directory Service for the description of an event.
     2.   Locate consumer: Search the Event Directory Service for a consumer.
     3.   Register: Add/remove/update one or more entries in the Event Directory Service describing
          events that the producer will make available to consumers.
     4.   Accept query: Accept a query request from a consumer. One or more event(s) are returned in the
          reply.
     5.   Accept subscribe: Accept a subscribe request from a consumer. Further details about the event
          stream are returned in the reply.
     6.   Accept unsubscribe: Accept an unsubscribe request from the consumer. If this succeeds, no
          more events will be sent for this subscription.
     7.   Initiate query: Send a single set of event(s) to a consumer as part of a query “request”.
     8.   Initiate subscribe: Request to send events to consumers, which are delivered in a stream.
          Further details about the event stream are returned in the reply.
     9.   Initiate unsubscribe: Terminate a subscription to a consumer. If this succeeds, no more data will
          be sent for this subscription.
    Producers that wish to handle new event types dynamically should
     support step 1
    Producers that allow consumers to initiate the flow of events should
     support steps 3-6
    Producers that initiate the flow of events should support steps 2 and 7-9
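
The producer-initiated path (initiate subscribe, stream events, initiate unsubscribe) can be sketched from the receiving side. `AcceptingConsumer`, its method names and the `disk_free` event type are hypothetical; in a real system the producer would first locate the consumer through the Event Directory Service.

```python
from typing import Dict, List, Tuple


class AcceptingConsumer:
    """Hypothetical consumer that lets producers initiate the flow of events."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, float]] = []
        self._open: Dict[int, str] = {}  # subscription id -> event type
        self._next_id = 0

    def accept_subscribe(self, event_type: str) -> int:
        self._next_id += 1
        self._open[self._next_id] = event_type
        return self._next_id

    def accept_event(self, subscription_id: int, value: float) -> bool:
        """Store the event only if the subscription is still open."""
        event_type = self._open.get(subscription_id)
        if event_type is None:
            return False
        self.events.append((event_type, value))
        return True

    def accept_unsubscribe(self, subscription_id: int) -> bool:
        return self._open.pop(subscription_id, None) is not None


consumer = AcceptingConsumer()

# Producer side: initiate subscribe, stream two events, then initiate unsubscribe.
sub_id = consumer.accept_subscribe("disk_free")
for reading in (120.0, 118.5):
    consumer.accept_event(sub_id, reading)
closed = consumer.accept_unsubscribe(sub_id)

# After a successful unsubscribe no more data is accepted for this subscription.
late = consumer.accept_event(sub_id, 117.0)
```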

    Optional producer tasks
        Event caching allows consumers to request historical data
         from a particular sensor, e.g. as input to prediction algorithms
        Event filtering can be applied so that data is sent only if
         its value crosses a certain threshold. Examples:
            CPU utilization is > 50%
            1, 10, 60-minute average CPU usage.
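
A minimal sketch of both optional tasks, assuming a hypothetical `CachingFilteringProducer` class with samples given as (timestamp in seconds, CPU utilization in percent). The filter here forwards every sample above the threshold; a variant that reports only the moment of crossing would also need to remember the previous sample.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class CachingFilteringProducer:
    def __init__(self, threshold: float, cache_size: int) -> None:
        self._threshold = threshold
        # Bounded cache of (timestamp, value) pairs; oldest samples drop out first.
        self._cache: Deque[Tuple[float, float]] = deque(maxlen=cache_size)

    def record(self, timestamp: float, value: float) -> Optional[float]:
        """Cache every sample; return it only if it exceeds the threshold."""
        self._cache.append((timestamp, value))
        return value if value > self._threshold else None

    def average(self, now: float, window_seconds: float) -> Optional[float]:
        """Mean of the cached samples taken within the last window_seconds."""
        recent = [v for t, v in self._cache if 0 <= now - t <= window_seconds]
        return sum(recent) / len(recent) if recent else None


producer = CachingFilteringProducer(threshold=50.0, cache_size=3600)
samples = [(0, 20.0), (300, 40.0), (540, 60.0), (600, 80.0)]

# Event filtering: only values above 50% are forwarded to consumers.
sent = [value for t, value in samples if producer.record(t, value) is not None]

# Event caching: historical samples support 1-minute and 10-minute averages.
one_minute = producer.average(now=600, window_seconds=60)
ten_minute = producer.average(now=600, window_seconds=600)
print(sent, one_minute, ten_minute)
```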

    The compound producer/consumer is a single component that
     implements both producer and consumer interfaces
    It can forward, broadcast, filter, or cache the performance events
    It lessens the load on producers of event data that is of interest
     to many consumers
    [Figure: Monitoring Service X exposes a Producer Interface and a
     Consumer Interface; the Consumer Interface connects to two Producers]
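
The load-reduction argument can be made concrete with a small sketch. `MonitoringService` and its method names are assumptions for illustration: the upstream producer delivers each event once, and the intermediate service caches it and fans it out to every subscriber.

```python
from typing import Callable, Dict, List

Callback = Callable[[str, float], None]


class MonitoringService:
    """Compound component: consumer towards producers, producer towards consumers."""

    def __init__(self) -> None:
        self._latest: Dict[str, float] = {}
        self._subscribers: List[Callback] = []
        self.upstream_deliveries = 0

    # Consumer interface: called by upstream producers.
    def accept_event(self, source: str, value: float) -> None:
        self.upstream_deliveries += 1
        self._latest[source] = value          # cache the most recent value
        for callback in self._subscribers:    # broadcast to all subscribers
            callback(source, value)

    # Producer interface: used by downstream consumers.
    def subscribe(self, callback: Callback) -> None:
        self._subscribers.append(callback)

    def query(self, source: str) -> float:
        return self._latest[source]


service = MonitoringService()
inboxes: List[List[float]] = [[], [], []]
for inbox in inboxes:
    # Bind each inbox at definition time to avoid the late-binding closure pitfall.
    service.subscribe(lambda _source, value, inbox=inbox: inbox.append(value))

# The producer sends one event; three consumers receive it.
service.accept_event("node-a", 0.75)
print(service.upstream_deliveries, inboxes, service.query("node-a"))
```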

Monitoring data
    Time-related data
        Time-stamped dynamic data – may be provided by a
         counter related to the sampling rate. Data includes
         performance event and status monitoring.
        Time-stamped asynchronous data – indicate when an event
         happens (alerts and checkpoints)
        Non-time-related data – includes static information such as
         OS type and version, hardware characteristics or the update
         time of monitoring information

Monitoring data
    Information flow data
            Direct producer-consumer flow does not need a central
             component. Three interactions are described by GMA
             1.   Publish/subscribe
             2.   Query/response
             3.   Notification
            Indirect data distribution via a centralized repository. This
             is useful for static information.
            Following a workflow’s path. The data is tagged so that it
             can be associated with a particular part of the workflow.
             Monitoring information is produced and stored locally.

Monitoring data
    Monitoring categories
        Static monitoring
            system configuration and descriptions
        Dynamic monitoring
            network and system performance
        Workflow monitoring
            A variable amount of data is produced as the processing of a
             job/task takes place.
            Processing status information, error reporting, job tracking

Criteria for Grid monitoring Tools
    Scalable and can tolerate faults
    Cross-API monitoring: can deal with data collection from
     legacy and specialized software.
    Homogeneous data presentation: Data are clear and
     presented in standard ways for clients
    Information searching can be done in a timely manner
    Run-time extensibility: can support rapid transitions when
     resources join and leave during runtime
    Filtering/fusing of data that comes from multiple streams
    Open and standard protocols
    Support standard security features
    Tools can be installed on demand, independent of other
An overview of grid monitoring systems: Autopilot
    Autopilot’s infrastructure is based on the GMA and uses the Globus
     Toolkit to perform wide-area communication between its components
    Sensor = GMA producer
    Actuator = GMA producer + mechanisms for steering remote
     applications and controlling sensors
    The AM (Autopilot Manager) performs the GMA registry function
    An Autopilot client corresponds to a GMA consumer, which locates
     sensors and actuators by searching the AM for registered keywords
    APD (Autopilot Performance Daemon) retrieves and records system
     performance information from remote
    [Figure: Autopilot components: Application, Sensor, Classification,
     Decision Procedure, Resource Policy, Actuator]

An overview of grid monitoring systems :
CODE (Control and Observation in Distributed Environment)

    Sensors are installed on monitored hosts and gather monitoring data
    The SM (Sensor Manager) receives query requests and
     subscriptions from the Observer
    The Observer encapsulates the SM and sensor mechanisms on a
     monitored host and provides a Producer Interface (PI)
    The PI supports both query-response and subscription-based requests
    The Controller resides on a monitored host and provides
     mechanisms that allow consumers to execute actions on that host
    The Manager (consumer) connects to an Observer to query for
     data, to subscribe, and to modify the subscriptions
    The Registry stores the locations of Observers and Controllers

An overview of grid monitoring systems: GridRM
    GridRM has a hierarchical architecture that provides a homogeneous view of
     heterogeneous resources
    A Naming Schema (NS) defines the semantics by which resources are defined
    A Driver is a modular plug-in that is used to retrieve selected information from
     native monitoring agents
    The Local Layer provides access to real-time and historical information from local resources
    The Global Layer provides inter-grid site or VO interaction between GridRM gateways
    Requests are received in an SQL form and passed to the Local Layer for
    Consumers interact with gateways at the Global Layer
    The Local Layer can perform caching
    [Figure: a GridRM Client and a GridRM Gateway at the local site, a GridRM
     Gateway at the remote site, and the GMA Directory]
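
The idea of drivers feeding a common naming schema that is then queried in SQL form can be sketched with Python's built-in `sqlite3` module. Everything specific here is an assumption for illustration: the two native agent formats, the `host_status` table and its columns are invented, and this is not GridRM's actual schema or API.

```python
import sqlite3
from typing import Dict, Union

Row = Dict[str, Union[str, float]]


def driver_agent_a(raw: dict) -> Row:
    """Hypothetical driver: agent A reports CPU load as a percentage."""
    return {"host": raw["sysName"], "cpu_load": raw["cpuPercent"] / 100.0}


def driver_agent_b(raw: dict) -> Row:
    """Hypothetical driver: agent B reports CPU load as a fraction in a string."""
    return {"host": raw["hostname"], "cpu_load": float(raw["load"])}


# Each driver maps its agent's native fields onto the same (host, cpu_load) schema.
rows = [
    driver_agent_a({"sysName": "node-a", "cpuPercent": 25}),
    driver_agent_b({"hostname": "node-b", "load": "0.75"}),
]

# An in-memory table stands in for the Local Layer's homogeneous view.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE host_status (host TEXT, cpu_load REAL)")
db.executemany("INSERT INTO host_status VALUES (:host, :cpu_load)", rows)

# A client request arrives in SQL form, independent of the native agents.
busy = [r[0] for r in db.execute(
    "SELECT host FROM host_status WHERE cpu_load > 0.5 ORDER BY host")]
print(busy)
```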

An overview of grid monitoring systems: MDS4
    Information service for the Globus Toolkit 4, based on
    Scalable, uniform and efficient access to distributed
     information sources to support the discovery, selection
     and optimization of resources in a Globus environment
    Components of MDS4 are represented as information
     services; each instance has associated Service Data (SD)
     that reveals resource information
    Resource heterogeneity can be masked through
     standardized reporting of static and dynamic resources

    MDS4 has a decentralized structure
    MDS4 can handle both static and dynamic data
    GSI is used to restrict access
    The Resource Layer consists of one or more service instances that produce SD
    The Collective Layer aggregates information from multiple Resource Layer
     services. The Index Service is an example of a Collective Layer service
    Clients, e.g. user applications, interact with the Index Service or with
     Resource Layer services directly, using subscription and query requests

                                             Resource C
                           Client                 SDE              SDE

                  Client                           MDS4 Index Service

                                    Resource A                  Resource B
                                       SDE                              SDE

    Index – a resource/service registry that aggregates information from
     multiple ‘resource layer’ services. Index service :
        Supports accessing, aggregating, generating, and querying SD from
         remote services.
        Provides service lookup mechanisms
        Provides caching
    Trigger – event-driven data filter. Trigger can perform action on
        Ex. can send email when queue length on a compute resource goes over a
         threshold value
    WebMDS – creates a specialized and homogeneous view of Index data
    MDS4 supports query/response and subscription/notification
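
The Trigger idea, an event-driven filter that runs an action when a condition holds, can be sketched generically. This is not the MDS4 Trigger Service API: the `Trigger` class, the `queue_length` field and the threshold of 100 are assumptions, and the "email" is simply appended to a list so the example has no side effects.

```python
from typing import Callable, List


class Trigger:
    """Runs an action whenever incoming data satisfies a condition."""

    def __init__(self, condition: Callable[[dict], bool],
                 action: Callable[[dict], None]) -> None:
        self._condition = condition
        self._action = action

    def on_data(self, data: dict) -> bool:
        """Return True if the action fired for this data item."""
        if self._condition(data):
            self._action(data)
            return True
        return False


outbox: List[str] = []  # stands in for sending email

trigger = Trigger(
    condition=lambda d: d["queue_length"] > 100,
    action=lambda d: outbox.append(
        f"Queue on {d['resource']} has {d['queue_length']} jobs"),
)

for length in (40, 120, 95):
    trigger.on_data({"resource": "cluster-a", "queue_length": length})

print(outbox)
```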

Information Provider
    GT4 information providers collect information from
     other systems and make it accessible to a typical grid
     monitoring system.
    Examples of information providers
        Ganglia http://ganglia.sourceforge.net
        Nagios http://www.nagios.org
        Netlogger http://www-didc.lbl.gov/NetLogger

    Ganglia is a distributed monitoring system for high-
     performance computing systems such as clusters and
     the Grid
    Based on a hierarchical design and a multicast-based
     listen/announce protocol
    Ganglia uses
        XML for data representation
        XDR for data transport
        RRDtool for data storage and visualization
    PHP Web User Interface provides a view of the
     gathered information via real-time dynamic Web pages

    The Ganglia Monitoring Daemon (gmond) is a multi-threaded
     daemon running on each cluster node to be monitored
        Monitor changes in host state
        Multicast relevant changes
        Listen to the state of all other Ganglia nodes via a multicast channel
        Answer requests for an XML description of the cluster state
    The Ganglia Meta Daemons (gmetad) are used to provide a
     federated view by polling a collection of child data sources.
    Data sources of gmetad may be either gmond or gmetad

                    Client               gmetad

                              gmetad               gmetad

                   gmond         gmond        gmond         gmond
                    Node          Node         Node          Node
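
Since gmond answers requests with an XML description of the cluster state, a client or gmetad-like poller essentially parses that XML. The sketch below works on an inline sample rather than a live daemon. The element and attribute names (`CLUSTER`, `HOST`, `METRIC`, `NAME`, `VAL`) and the `load_one` metric are written from memory of Ganglia's format and should be treated as assumptions to check against real gmond output.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Abbreviated sample shaped like a gmond cluster report (assumed format).
SAMPLE = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
  <CLUSTER NAME="demo">
    <HOST NAME="node1"><METRIC NAME="load_one" VAL="0.25"/></HOST>
    <HOST NAME="node2"><METRIC NAME="load_one" VAL="1.50"/></HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def cluster_metric(xml_text: str, metric: str) -> Dict[str, float]:
    """Return {host name: metric value} for one metric across all hosts."""
    root = ET.fromstring(xml_text)
    values: Dict[str, float] = {}
    for host in root.iter("HOST"):
        for element in host.iter("METRIC"):
            if element.get("NAME") == metric:
                values[host.get("NAME")] = float(element.get("VAL"))
    return values


loads = cluster_metric(SAMPLE, "load_one")
print(loads)
```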

    Monitoring is critical for providing a robust, high-
     performance Grid environment
    A basic monitoring system has the following components
        Producers(sensors) that generate monitoring data (events)
        Consumers that consume events
        One or more directory services for registration and discovery of
    A monitoring system should have
        GMA compliance
        Caching capability
        Scalability
        Coverage of network resources, host resources and jobs
        Resource performance forecasting
        Resource performance analysis
        Various presentation views for resource monitoring
        A directory service for event subscription and notification
