					 Fabric monitoring for LCG-1
in the CERN Computer Center



               Jan van Eldik
             CERN-IT/FIO/SM
     7th GridPP Collaboration meeting
                July 1, 2003

                          GridPP7 – June 30 – July 2, 2003 – Fabric monitoring– n° 1
                  Outline
•   Fabric monitoring developments at CERN
•   Architectural overview
•   Deployment: status & plans for LCG-1
•   Outlook




           Fabric Monitoring at CERN
• Improved fabric management is a key part of the LCG programme
• EDG WP4 develops tools for automated installation, configuration,
  fabric monitoring, fault tolerance
• IT/FIO Supervision & Monitoring section: develop and deploy a
  monitoring solution for the LHC era
• A lot of expertise: EDG WP4 monitoring developments,
  PVSS SCADA studies, SNMP studies, operator alarm displays, …
• Architecture based on functional requirements gathered
  by the PEM project
• Important objective: fabric monitoring for LCG-1 at CERN



       Requirements and architecture

 [Diagram: on each monitored node, Sensors feed the Monitoring Sensor Agent,
  which fills a local Cache read by Local Consumers; the agent also sends data
  to the Measurement Repository (Database), which is read by Global Consumers]




 • Both for performance and exception monitoring
 • Local and global consumers
 • Scalable, extensible, robust
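
The data flow on this slide can be illustrated with a small sketch. It is not the WP4 code: the class names, the sensor callables and the metric names below are hypothetical, and the "repository" is just an in-memory list. It only shows how sensors, an agent with a local cache, a local consumer and a global consumer relate to each other.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

Sample = Tuple[str, float]  # (metric name, value); hypothetical record shape


class Agent:
    """Toy stand-in for the agent running on one monitored node."""

    def __init__(self, sensors: Dict[str, Callable[[], float]], cache_size: int = 100):
        self.sensors = sensors
        self.cache = deque(maxlen=cache_size)  # local cache, newest sample last

    def sample(self) -> List[Sample]:
        """Read every sensor once and keep the results in the local cache."""
        batch = [(name, read()) for name, read in self.sensors.items()]
        self.cache.extend(batch)
        return batch


class Repository:
    """Toy stand-in for the central measurement repository."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, str, float]] = []  # (node, metric, value)

    def store(self, node: str, batch: List[Sample]) -> None:
        self.rows.extend((node, name, value) for name, value in batch)


def local_consumer(agent: Agent, metric: str) -> Optional[float]:
    """Latest cached value of a metric on this node, or None if never sampled."""
    for name, value in reversed(agent.cache):
        if name == metric:
            return value
    return None


def global_consumer(repo: Repository, metric: str) -> Dict[str, float]:
    """Latest stored value of a metric for every node in the repository."""
    latest: Dict[str, float] = {}
    for node, name, value in repo.rows:
        if name == metric:
            latest[node] = value
    return latest


repo = Repository()
for node, load in [("node01", 0.5), ("node02", 7.25)]:
    agent = Agent({"load": lambda load=load: load, "disk_used_pct": lambda: 42.0})
    repo.store(node, agent.sample())
    print(node, local_consumer(agent, "load"))
print(global_consumer(repo, "load"))
```

The local consumer only ever touches the node's own cache, while the global consumer sees all nodes through the repository; that split is the point of the sketch.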
                      EDG WP4 implementation
Monitoring Sensor Agent (MSA)
• Calls plug-in sensors to sample configured metrics
• Stores all collected data in a local disk buffer
• Sends the collected data to the global repository

Plug-in sensors
• Programs/scripts that implement a simple sensor-agent ASCII text protocol
• A C++ interface class is provided on top of the text protocol to
  facilitate implementation of new sensors

Transport
• Transport is pluggable
• Two protocols, over UDP and TCP, are currently supported; only the
  latter can guarantee delivery

Measurement Repository (MR)
• The data is stored in a database
• A memory cache guarantees fast access to the most recent data, which
  is normally what is used for fault tolerance correlations

Database
• Proprietary flat-file database
• Oracle
• Open source interface to be developed

Repository API
• SOAP RPC
• Query history data
• Subscription to new data

The local cache
• Assures data is collected also when the node cannot connect to the network
• Allows for node autonomy for local repairs

[Diagram: architecture as on the previous slide, with the agent labelled MSA
 and the repository labelled MR]
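
The role of the local disk buffer and the pluggable transport can be sketched as follows. This is a minimal illustration under stated assumptions, not the MSA implementation: `BufferedSender`, the JSON-lines buffer format and the `demo_transport` function are all hypothetical, and the transport is modelled as any callable that raises `ConnectionError` when the repository is unreachable.

```python
import json
import os
import tempfile
from typing import Callable, List


class BufferedSender:
    """Write every sample to a local file first, then try to forward it."""

    def __init__(self, buffer_path: str, transport: Callable[[dict], None]):
        self.buffer_path = buffer_path
        self.transport = transport  # pluggable: any callable taking one sample

    def record(self, sample: dict) -> None:
        # Samples always land on local disk, whether or not the network is up.
        with open(self.buffer_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(sample) + "\n")

    def flush(self) -> int:
        """Send buffered samples in order; keep unsent ones. Returns count sent."""
        if not os.path.exists(self.buffer_path):
            return 0
        with open(self.buffer_path, encoding="utf-8") as fh:
            pending = [json.loads(line) for line in fh if line.strip()]
        sent = 0
        try:
            for sample in pending:
                self.transport(sample)
                sent += 1
        except ConnectionError:
            pass  # repository unreachable: the remaining samples stay buffered
        # Assumption of this sketch: delivered samples are dropped from the buffer.
        with open(self.buffer_path, "w", encoding="utf-8") as fh:
            for sample in pending[sent:]:
                fh.write(json.dumps(sample) + "\n")
        return sent


received: List[dict] = []
network = {"up": False}  # toggled below to simulate an outage


def demo_transport(sample: dict) -> None:
    """Hypothetical transport that fails while the simulated network is down."""
    if not network["up"]:
        raise ConnectionError("repository unreachable")
    received.append(sample)


buffer_file = os.path.join(tempfile.mkdtemp(), "msa_buffer.jsonl")
sender = BufferedSender(buffer_file, demo_transport)
sender.record({"metric": "load", "value": 0.5})
sender.record({"metric": "load", "value": 0.7})
sent_while_down = sender.flush()
network["up"] = True
sent_after_recovery = sender.flush()
print(sent_while_down, sent_after_recovery, len(received))
```

Because the transport is just a callable, a UDP-style sender (fire and forget) or a TCP-style sender (raises on failure) could be swapped in without changing the buffering logic; only the latter lets the buffer know that a sample was not delivered.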

         Deployment status in the CERN CC
• MSA with sensors for performance and exception monitoring,
  measuring 100-150 quantities per box
• Deployed on ~1500 RedHat Linux nodes
• 30 clusters, with specific configuration files


           Node type           Nodes
           Batch                1000
           Interactive            70
           Disk server           200
           Tape server            80
           WWW, DB, MISC         200
         Status of exception monitoring
• ~50 possible alarms per monitored node
      HighLoad, DaemonDead, FileSysFull, install / config problems

• Operator alarm displays
  – PVSS-based, developed as part of PVSS-tests
  – WP4 alarm display under active development
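
As an illustration of exception monitoring, the sketch below turns one node's sampled metrics into a list of alarms. Only the alarm names HighLoad, DaemonDead and FileSysFull come from the slide; the thresholds, the metric layout and the `evaluate_alarms` function are assumptions made for the example.

```python
from typing import Any, Dict, List

# Illustrative thresholds; the real values are not given in the slides.
LOAD_LIMIT = 10.0
FS_FULL_PCT = 95.0


def evaluate_alarms(metrics: Dict[str, Any]) -> List[str]:
    """Return the alarms raised by one node's current metrics."""
    alarms: List[str] = []
    if metrics.get("load", 0.0) > LOAD_LIMIT:
        alarms.append("HighLoad")
    # One DaemonDead alarm per daemon that is expected but not running.
    for daemon, running in sorted(metrics.get("daemons", {}).items()):
        if not running:
            alarms.append(f"DaemonDead:{daemon}")
    # One FileSysFull alarm per file system at or above the limit.
    for mount, used_pct in sorted(metrics.get("fs_used_pct", {}).items()):
        if used_pct >= FS_FULL_PCT:
            alarms.append(f"FileSysFull:{mount}")
    return alarms


sample = {
    "load": 12.5,
    "daemons": {"sshd": True, "crond": False},
    "fs_used_pct": {"/": 40.0, "/tmp": 99.0},
}
alarms = evaluate_alarms(sample)
print(alarms)  # ['HighLoad', 'DaemonDead:crond', 'FileSysFull:/tmp']
```

An operator display would then only need to show nodes for which this list is non-empty.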




PVSS operator alarm display




WP4 operator alarm display




            Performance monitoring
• WP4 Measurement Repository with Oracle backend
  is currently being deployed in the CERN CC for LCG-1
• Data access
   – C-API to the repository is available,
     Perl and Java implementations to be done
   – Simple CLI is being delivered
   – GUI is being delivered
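
The kind of data access a consumer needs (most recent value, history queries, subscription to new data) can be sketched with a toy in-memory repository. This is not the WP4 repository API: `ToyRepository` and all of its method names are hypothetical, and a real deployment would sit on a database rather than Python dictionaries.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Key = Tuple[str, str]          # (node, metric)
Point = Tuple[float, float]    # (timestamp, value)
Callback = Callable[[str, float, float], None]  # (node, timestamp, value)


class ToyRepository:
    """In-memory stand-in: history store, latest-value cache, subscriptions."""

    def __init__(self) -> None:
        self._history: Dict[Key, List[Point]] = defaultdict(list)
        self._latest: Dict[Key, Point] = {}
        self._subscribers: Dict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, metric: str, callback: Callback) -> None:
        """Call `callback` for every new sample of `metric`, on any node."""
        self._subscribers[metric].append(callback)

    def store(self, node: str, metric: str, timestamp: float, value: float) -> None:
        key = (node, metric)
        self._history[key].append((timestamp, value))
        # Assumes samples arrive in time order, so the newest one is the latest.
        self._latest[key] = (timestamp, value)
        for callback in self._subscribers[metric]:
            callback(node, timestamp, value)

    def latest(self, node: str, metric: str) -> Optional[Point]:
        """Most recent sample, served from the cache without scanning history."""
        return self._latest.get((node, metric))

    def history(self, node: str, metric: str, start: float, end: float) -> List[Point]:
        """All samples with start <= timestamp <= end."""
        points = self._history.get((node, metric), [])
        return [(t, v) for t, v in points if start <= t <= end]


repo = ToyRepository()
seen: List[Tuple[str, float]] = []
repo.subscribe("load", lambda node, t, v: seen.append((node, v)))
for t, v in [(100.0, 0.5), (160.0, 0.9), (220.0, 1.4)]:
    repo.store("node01", "load", t, v)
print(repo.latest("node01", "load"))
print(repo.history("node01", "load", 100.0, 160.0))
```

A CLI or GUI would typically be built on calls like `latest` and `history`, while an alarm display is a natural user of `subscribe`.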




Anamon




                      Open issues
•   Current solution is still very node-centric
•   Not much experience with consumers
•   No correlation engines, no corrective actions yet…
•   Integration with configuration system to be done




             Summary and Outlook
• Fabric monitoring infrastructure for LCG-1 at CERN
  is being deployed
• Monitoring Sensor Agent has been operating very well
• Measurement Repository will now be challenged
• Consumers can start consuming…
• An interesting six-month period awaits us!





				