           Principal Investigators:
                David Bernholdt (ORNL), Christian Engelmann (ORNL),
                Chokchai (Box) Leangsuksun (LaTech), Frank Mueller (NCSU),
                P. (Saday) Sadayappan (OSU), Stephen L. Scott (ORNL),
                Jeff Vetter (ORNL)

           Collaborators:
                Vassil Alexandrov (Reading), Patrick G. Bridges (UNM),
                Barney Maccabe (UNM), Don Mason (Cray), Cindy Nuss (Cray)

                                           FastOS PI Meeting
                                            June 9-10, 2005
                                             Rockville, MD

                  MOLAR: Modular Linux and Adaptive Runtime Support for HEC OS/R research
                  FastOS: Forum to Address Scalable Technology for runtime and Operating Systems
MOLAR is a multi-institution research effort that concentrates
on adaptive, reliable, and efficient operating and runtime
system solutions for ultra-scale high-end scientific computing
on the next generation of supercomputers.

   Create a modular and configurable Linux system that allows customized
   changes based on the requirements of the applications, runtime systems,
   and cluster management software.

   Build runtime systems that leverage the OS modularity and configurability to
   improve efficiency, reliability, scalability, ease-of-use, and provide support to
   legacy and promising programming models.

   Advance computer reliability, availability and serviceability (RAS)
   management systems to work cooperatively with the OS/R to identify and
   preemptively resolve system issues.

   Explore the use of advanced monitoring and adaptation to improve
   application performance and predictability of system interruptions.

MOLAR: HEC OS/R Research Map

                  Reliability, Availability, and Serviceability

Availability of HEC Systems

     Today’s supercomputers typically need to reboot to
     recover from a single failure.
     Entire systems go down, both for regular maintenance
     and for unscheduled repairs.
     Compute nodes sit idle while their head node or one
     of their service nodes is down.
     Availability will get worse in the future as the MTBI
     decreases with growing system size.
     No productive computation is done during the
     checkpoint/restart process.
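The cost of checkpoint/restart can be made concrete. Young's first-order approximation picks the checkpoint interval that balances checkpoint overhead against expected rework after a failure; a minimal sketch (the numbers are illustrative, not measurements from any system discussed here):

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation: T_opt = sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

def wasted_fraction(checkpoint_cost_s, interval_s, mtbf_s):
    """Rough fraction of machine time lost to checkpoints plus rework."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)

# Illustrative numbers: 5-minute checkpoints, 12-hour system MTBF.
t_opt = optimal_checkpoint_interval(300, 12 * 3600)   # ~85 minutes
```

As the slide notes, shrinking MTBF on larger systems pushes the optimal interval down and the wasted fraction up.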

Availability Measured by the Nines

9’s Availability Downtime/Year         Examples
1       90.0%      36 days, 12 hours   Personal Computers
2       99.0%      87 hours, 36 min    Entry Level Business
3       99.9%      8 hours, 45.6 min   ISPs, Mainstream Business
4       99.99%     52 min, 33.6 sec    Data Centers
5       99.999%    5 min, 15.4 sec     Banking, Medical
6       99.9999%   31.5 seconds        Military Defense

     Enterprise-class hardware + Stable Linux kernel          = 5+
     Substandard hardware + Good high availability package    = 2-3
     Today’s supercomputers                                   = 1-2
     My desktop                                               = 1-2
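The downtime figures in the table follow directly from the availability percentage and a 365-day year:

```python
def downtime_per_year(availability):
    """Minutes of downtime per (365-day) year for a given availability."""
    return (1.0 - availability) * 365 * 24 * 60

for nines, a in enumerate([0.9, 0.99, 0.999, 0.9999, 0.99999, 0.999999],
                          start=1):
    print(f"{nines} nine(s): {downtime_per_year(a):10.2f} min/year")
```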

High Availability Methods

Active/Hot-Standby:
   Single active node / task.
   Backup to shared storage.
   Simple checkpoint/restart; rollback to backup.
   Idle standby node(s).
   Service interruption for the time of the fail-over.
   Service interruption for the time of restore-over.

Active/Active:
   Multiple active nodes / task.
   Work load distribution.
   Symmetric replication between participating nodes.
   Continuous service; always up-to-date.
   No restore-over necessary.
   Virtual synchrony model; complex algorithms.

High Availability Technology
Active/Hot-Standby:
   HA-OSCAR with active/hot-standby head node.
   Similar projects: HA Linux …
   Cluster system software.
   No support for multiple active/active head nodes.
   No middleware support.
   No support for compute nodes.

Active/Active:
   HARNESS with symmetric distributed virtual machine.
   Similar projects: Cactus …
   Heterogeneous adaptable distributed middleware.
   No system-level support.
   Solutions not flexible enough.

      System-level data replication and distributed control service needed
      for active/active solution.
      Reconfigurable framework similar to HARNESS needed to adapt to
      system properties and application needs.

HA-OSCAR
                  Production-quality open source
                  Linux-cluster project.

                  HA and HPC clustering
                  techniques to enable critical
                  HPC infrastructure.

                  Self-configuring multi-head
                  Beowulf system.

                  HA-enabled HPC services:
                  active/hot-standby.

                  Self-healing with 3-5 sec
                  automatic failover time.

                  The first known field-grade
                  open source HA Beowulf
                  cluster release.
Modular HA Framework on Active/
Active Head Nodes
   Highly available head nodes (service nodes too), connected
   to the outside world and to the compute nodes.

   Layered framework on each head node:
             Reliable Services:   Scheduler, MPI Runtime, ...
            Virtual Synchrony:    Distributed Control Service
        Symmetric Replication:    Data Replication Service
       Reliable Server Groups:    Group Communication Service
        Communication Methods:    TCP/IP, Shared Memory, etc.
Modular HA Framework on Active/
Active Head Nodes: Scheduler Example
   [Figure: three active head nodes, connected to the outside world
   and to the compute nodes, cooperatively schedule jobs A-E and
   launch jobs A-C. When one head node fails, the others continue
   scheduling and launching: no single point of failure, no single
   point of control.]

Many HA Framework Use Cases

 Active/Active and Active/Hot-Standby process state
 replication for multiple head or service nodes.
       Reliable system services, such as scheduler, MPI runtime and
       system configuration/management/monitoring.
 Memory page replication for SSI and DSM.
 Meta-data replication for parallel/distributed file systems.
 Super-scalable peer-to-peer diskless checkpointing.
 Super-scalable localized FT-MPI recovery.
 Replicated coherent caching: checkpointing, file systems.
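The peer-to-peer diskless checkpointing idea can be illustrated with simple XOR parity: each node keeps its own checkpoint in memory, the group keeps an XOR of all of them, and any single lost checkpoint is rebuilt from the survivors. A toy sketch, not the project's actual scheme:

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Each node holds a checkpoint; the group also keeps their XOR parity.
checkpoints = [b"state-A1", b"state-B2", b"state-C3", b"state-D4"]
parity = xor_blocks(checkpoints)

# If node 2 fails, its checkpoint is rebuilt from the survivors + parity,
# with no disk involved.
survivors = checkpoints[:2] + checkpoints[3:]
recovered = xor_blocks(survivors + [parity])
```

A real scheme would stripe parity groups across neighbors so recovery stays localized as the machine scales.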

HA Framework Architecture

     Research in process groups has been around for a
     while (e.g. Lamport, Tanenbaum, Birman).
     Research in unified models for process group
     communication and behavior is about 10 years old.
     Recent (last 4 years) research defined a unified
     model for process group communication.
     Clear mathematical definitions of process group
     communication service properties exist today.
     The HA framework architecture is based on the
     unified model for process group communication.
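A core guarantee of the process group communication model is totally ordered delivery. A minimal fixed-sequencer sketch (hypothetical classes, not the framework's design) demonstrates the property: replicas that receive the same messages in different orders still deliver them identically:

```python
class Sequencer:
    """Assigns a global sequence number to every broadcast message."""
    def __init__(self):
        self.next_seq = 0

    def order(self, msg):
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        return seq, msg

class Replica:
    """Delivers messages strictly in sequence order, buffering gaps."""
    def __init__(self):
        self.expected = 0
        self.buffer = {}
        self.delivered = []

    def receive(self, seq, msg):
        self.buffer[seq] = msg
        while self.expected in self.buffer:
            self.delivered.append(self.buffer.pop(self.expected))
            self.expected += 1

seq = Sequencer()
ordered = [seq.order(m) for m in ["a", "b", "c", "d"]]

r1, r2 = Replica(), Replica()
for pkt in ordered:            # in-order arrival
    r1.receive(*pkt)
for pkt in reversed(ordered):  # out-of-order arrival
    r2.receive(*pkt)
```

Production protocols must additionally handle sequencer failure and membership changes, which is where the real complexity lives.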

HA Framework Design

    Interface specs in UML.
    C/C++ headers with interfaces.
    Core library with base classes.
    Component libraries/headers.
    Static and dynamic libraries.
    Pluggable framework library.
    Reuse of recently developed
    Open MPI Comm. drivers?
    Failure detection via monitoring?
    OS interface (daemon, /sys, ..)?
    [OLS Cluster Membership BoF]

Why Pluggable Component Framework?

     A pluggable component framework allows different
     implementations for the same problem.
     Adaptation to different system properties such as:
          Network technology and topology (IP, Elan4, Myrinet).
          Average MTBF of involved components (strong vs. weak).
     Adaptation to different application needs such as:
          Programming model (active vs. passive replication).
          Programming interface (memory vs. state machine).
          Scalability for large number of compute nodes (SSI).
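Component selection driven by system properties can be sketched as a registry of candidate implementations, each guarded by a predicate over the properties above; all names here are hypothetical:

```python
class ComponentRegistry:
    """Maps a service name to candidate implementations, each guarded
    by a predicate over system/application properties (sketch)."""
    def __init__(self):
        self.candidates = {}

    def register(self, service, name, predicate):
        self.candidates.setdefault(service, []).append((name, predicate))

    def select(self, service, props):
        # First registered component whose predicate matches wins.
        for name, predicate in self.candidates[service]:
            if predicate(props):
                return name
        raise LookupError(f"no {service} component matches {props}")

reg = ComponentRegistry()
reg.register("group_comm", "elan4_multicast",
             lambda p: p["network"] == "elan4")
reg.register("group_comm", "tcp_ring", lambda p: True)  # generic fallback

chosen = reg.select("group_comm", {"network": "elan4", "mtbf": "weak"})
```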

Why Pluggable Component Framework?

     A pluggable component framework also enables
     competition for efficient implementations.
     Furthers collaboration by allowing other researchers
     to contribute their partial solutions.
     It avoids “reinventing the wheel”, which seems to be
     a common problem in this research area:
          Harness, FT-MPI, Horus, Coyote, Transis, …, all
          implement group communication at different levels.
          These implementations have their advantages and
          disadvantages in different areas.
          They are not interchangeable.

Which Pluggable Framework Properties?

     System properties, such as failure rate and network
     load, may change at runtime.
     Application needs may change at runtime for
     collaborative software environments.
     Runtime plug-n-play for adaptation to runtime
     changes of system properties or application needs.
     On demand runtime loading of components
     (HARNESS plug-in technology).
     Runtime exchange (transition) of active components
     is a hard problem to be solved another day.
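On-demand runtime loading in the spirit of the HARNESS plug-in technology maps naturally onto dynamic module import. A sketch using Python's importlib, with a standard-library module standing in for a plug-in:

```python
import importlib

def load_component(module_name, symbol):
    """Load a component implementation by dotted module path at runtime."""
    module = importlib.import_module(module_name)
    return getattr(module, symbol)

# Stand-in "plug-in": any importable module is loaded the same way,
# so components never need to be linked in at build time.
sqrt = load_component("math", "sqrt")
```

The hard part flagged on the slide, swapping an *active* component while it holds state, is exactly what this sketch does not attempt.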

What’s Next

     Framework specification - publication.
          Clearly define individual services and their interfaces.
          Solve implementation issues, like use cases,
          multi-threading, mutual exclusion, and daemons.

     Framework implementation.
          Implement basic services that others depend on first.
          Implement one protocol to allow application testing.
          Implement multiple protocols with different features.

     Further exploration of applications and use cases.

What’s Next – more…
     Further investigate IPMI (Intel- and AMD-based hardware) and the
     Cray XD1 Active Manager interface to identify managed elements/objects.
     Research the extensibility, scalability, and suitability of existing
     monitoring systems.
     Work on 2+1 active/active service heads for HEC systems and
     serviceability features; a reference implementation will be based on a
     Linux cluster.
     RAS event classification and modeling, and fault prediction.
     Investigate reliability-aware and federated-system meta-scheduling
     techniques for the n+1 active/active service heads for HEC systems.
     This work will collaborate with the ORNL data replication effort.
     Look at continuous reliability modeling and self-healing coverage for
     compute nodes.

MOLAR: HEC OS/R Research Map

Monitoring Activities

     Apply automated techniques to RAS data sets; root
     cause analysis
     Develop toolkit for adaptively customizing OS
     Participate in activities to expose important
     performance information in the kernel/runtime
     Applications, metrics, and benchmarks

Automate Root Cause Analysis

     GOAL: simplify the management and administration
     of large computing systems

     Use multivariate statistics and machine learning to
     digest and gain insight from RAS logs
          both post-mortem and near-real-time
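As a flavor of the near-real-time side, even a simple robust statistic over per-node RAS event counts can flag a suspect node. A toy sketch using the modified z-score (median absolute deviation), with invented counts:

```python
from statistics import median

def flag_outlier_nodes(event_counts, threshold=3.5):
    """Flag nodes whose RAS-event count deviates strongly from the
    group median (robust to the very outliers we are hunting)."""
    counts = list(event_counts.values())
    med = median(counts)
    mad = median(abs(c - med) for c in counts)
    if mad == 0:
        return []
    return [node for node, c in event_counts.items()
            if 0.6745 * abs(c - med) / mad > threshold]

# Invented event counts per node over some window.
counts = {"node01": 3, "node02": 5, "node03": 4, "node04": 87, "node05": 2}
suspects = flag_outlier_nodes(counts)
```

The actual work applies multivariate statistics and machine learning across many event types; this only shows the univariate seed of the idea.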

Earlier Motivation: Multivariate Statistical
Analysis of Hardware Counter Data

     Hardware counters produce huge
     amounts of data on large systems
     Multivariate statistical techniques
     help distill important features
     Clustering, Factor analysis, PCA

     [Figure: analysis pipeline from task-local counter data through
     factor analysis, F-ratio, and clustering to visualization, a
     rule-based expert system, and decision trees, illustrated with a
     256-task mapping.]

                  D.H. Ahn and J.S. Vetter, “Scalable Analysis Techniques for Microprocessor
                  Performance Counter Metrics,” Proc. SC 2002, 2002.
Current status

     Gathering system log information for existing systems
          preferably both normal and abnormal conditions

     Investigating toolkits for statistical analysis and
     machine learning (prototypes)
          R, Matlab

     Investigating where/how we can gather real-time
     system logs
          Ganglia, clumon, etc.

Toolkit for Adaptively Customizing OS
    GOAL: optimize application performance by adapting OS policies in
    response to workload
         Use OS, runtime, and app information in decision process

    Earlier work w/ Autopilot was limited to a user-level parallel file
    system

    Extend these ideas to use kernel and new runtime performance
    information

Expose Performance Information

     GOAL: expose performance information that
          allows better performance analysis
          supports adaptation

     Leverage recent work
          PERUSE, POMP
          KTAU (Oregon)

Metrics and Benchmarks

     Use DoE applications to motivate our goals
     and measure progress
          Current application benchmarks
                  POP, GYRO, AMBER, BLAST, etc.
          OS benchmarks
                  Lmbench, etc.
          IO benchmarks

MOLAR: HEC OS/R Research Map

Communications and I/O:
                  Motivation & Goals
     PAPI has been extremely successful in providing a unified
     approach to accessing CPU/memory system hardware counters
       Routinely used by applications developers
       Of increasing interest for system-level monitoring & management

     Need similar capabilities for communications and I/O systems
       PMPI and Peruse are models, but need to support broader
       spectrum of programming models, incl. GAS

     Understand what may need to be done at application, runtime,
     and OS/driver level to support such data collection

Background: PAPI

     Performance Application Programming Interface
     Development led by Jack Dongarra’s group at the
     University of Tennessee, Knoxville
     Provides a software abstraction of architecture-
     dependent native events (i.e. CPU counters) into a
     collection of preset events
     Also provides timers
     Used by higher-level tools, like TAU, to instrument
     code for hardware/memory performance

Our Targets: Communications and I/O

     One of the key elements of PAPI is to provide a more unified
     view of performance across architectures

     To date, some work has been done on communications within
     a given “architecture” (i.e. MPI), but nothing across multiple
     architectures; some NICs/drivers have hardware and software
     counters
          Comms are “phase 1” of project
          Emphasis on including Global Address Space approaches (i.e.
          Global Arrays, possibly Co-Array Fortran), one-sided approaches

     I/O seems to have less prior work re. performance monitoring
          Phase 2 of project
          Working on it – thinking about it…

Benefits of Uniform Performance
    Insight into performance characteristics of different
    programming models and their implementations
          e.g., UPC is more efficient than MPI for App. A because 70% of
          communication for both involve small messages (less than 16
          bytes), and UPC’s average latency is only 80% of MPI’s; but on
          App. B, the MPI version is faster because it uses fewer long
          messages, with better achieved bandwidth as follows ...
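The kind of comparison in the example can be computed from a simple latency + size/bandwidth cost model applied to each implementation's message-size distribution; every number below is invented for illustration:

```python
def comm_cost(messages, latency_s, bandwidth_Bps):
    """Total communication time for a message mix, where `messages`
    is a list of (size_in_bytes, count) pairs."""
    return sum(count * (latency_s + size / bandwidth_Bps)
               for size, count in messages)

# Hypothetical "App A" mix: dominated by tiny (<16 B) messages.
small_mix = [(8, 1_000_000), (64, 1_000)]

mpi = comm_cost(small_mix, latency_s=5e-6, bandwidth_Bps=1e9)
upc = comm_cost(small_mix, latency_s=4e-6, bandwidth_Bps=0.8e9)  # 80% of MPI latency
```

For the small-message mix the lower-latency model wins; rerun with a few large messages and the higher-bandwidth model wins, matching the App B story.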

    Uniform basis for performance characterization of multi-
    paradigm codes, e.g. OpenMP+MPI, MPI +CAF, MPI+GA

Current Status: Understanding Existing Tools
     General instrumentation and tracing tools
          TAU, SvPablo, Paradyn, Intel Trace Analyzer, Paraver,
          Dimemas, DynInst

     Comms instrumentation/performance tools
          PMPI, MPI Peruse, POMP

     Networking & communications
          RDMA, Quadrics, Myrinet, InfiniBand

HW Counters in Interconnects?

         Similar to the hardware performance counters
         in all of today’s processors, are counters
         available in network NICs?
         Gathered information about three high-
         performance interconnect technologies:

Predictive Performance Modeling
      Goal: Run an application on a few “configurations” and use
      collected performance data to predict performance for other
      configurations
           Run on 1, 2, 4 procs; predict for 8, 16, … procs.
           Run for problem size 100, 500, 1000; predict for 2000
           Run on ASCI Red and Red Storm; predict for the next Sandia
           mesh machine

      Attempt to do “automatically” what Hoisie et al. achieve by
      manual model development for specific applications

      Some work by Lebarta’s group using linear regression on a few
      machine parameters; our goal is to use more information, e.g.
      message distributions and latencies
           Build on Jeff Vetter’s Photon system
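A toy version of such a regression: fit t(p) = a + b/p (serial plus perfectly parallel work) to a few measured runs and extrapolate. Real models would fold in message distributions, latencies, and machine parameters; the run times here are invented:

```python
def fit_runtime_model(procs, runtimes):
    """Least-squares fit of t(p) = a + b/p; returns a predictor."""
    xs = [1.0 / p for p in procs]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(runtimes) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, runtimes))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda p: a + b / p

# Measure on 1, 2, 4 processors; predict 8 and 16.
model = fit_runtime_model([1, 2, 4], [110.0, 60.0, 35.0])
```

With the invented data the fit recovers a = 10 (serial part) and b = 100 (parallel part), so model(8) extrapolates to 22.5.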

Next Steps

     We will initially focus on…
          Developing an abstract model for communications
          performance, which considers two-sided messaging, one-
          sided messaging, and GAS approaches
          Identifying useful user-level performance-related info not
          currently directly accessible with current tools
                  Idle time due to load imbalance
                  Message latencies
                  Computation/communication overlap
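Two of the listed quantities are easy to define precisely once per-rank timings are available; a sketch with invented timings:

```python
def idle_time(compute_times):
    """Per-rank idle time at a barrier caused by load imbalance."""
    slowest = max(compute_times)
    return [slowest - t for t in compute_times]

def overlap_fraction(compute_s, comm_s, wall_s):
    """Fraction of communication hidden behind computation: if comm
    were fully exposed, wall time would be compute + comm."""
    hidden = compute_s + comm_s - wall_s
    return max(0.0, min(1.0, hidden / comm_s))

idle = idle_time([10.0, 12.0, 9.0, 11.0])  # rank 1 is the straggler
ovl = overlap_fraction(compute_s=10.0, comm_s=4.0, wall_s=12.0)
```

The point of the abstract model is that today's tools do not expose these inputs directly across programming models.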


     Collaborating with the PAPI team
          MOLAR team – what should be collected
          PAPI team – generalization

     PAPI interested in generalizing to support a broader
     range of events, timers, and communication substrates

     Expand the circle of input from others…
          What communication information would be helpful?
          Talking with “local” tuning experts
          Others in FastOS community – expect contact…

HA-OSCAR/Molar for HEC
    C. Leangsuksun, A. Tikotekar, S. Scott, M. Pourzandi and I. Haddad. "Towards Cluster Survivability". Proceedings of 6th LCI International
    Conference on Linux Clusters: The HPC Revolution 2005, Chapel Hill, NC, USA, April 2005.
    C. Leangsuksun and H. Song. "A Light-Weight Model of Solution for Markov Processes". Proceedings of 43rd annual ACM Southeast
    Conference (ACMSE), Kennesaw, GA, USA, March 2005.
    K. Limaye, C. B. Leangsuksun, V. K. Munganuru and Z. Greenwood. "HA-OSCAR: Grid-enabled High Availability Framework".
    Proceedings of 13th Annual Mardi Gras Conference "Frontiers of Grid Applications and Technologies", Baton Rouge, LA, USA, February
    2005.
    K. Limaye, C. B. Leangsuksun, V. K. Munganuru, Z. Greenwood, S. Scott and K. Chanchio. "Grid-enabled HA-OSCAR". Proceedings of
    OSCAR Symposium (OSCAR), Ontario, Canada, May 2005.
    H. Song and C. B. Leangsuksun. "Availability Specification and Evaluation of HA-OSCAR Cluster Servers - An Object-Oriented
    Approach". Proceedings of 3rd International Conference on Computing, Communications and Control Technologies (CCCT), Austin, TX,
    USA, July 2005.
    H. Song, C. B. Leangsuksun and S. Scott. "UML-Based Beowulf Cluster Availability Modeling". Proceedings of International Conference
    on Software Engineering Research and Practice (SERP), Las Vegas, NV, USA, June 2005.
    H. Song, C. Leangsuksun and R. Nassar. "OOMSE -- An Object Oriented Markov Chain Specification and Evaluation Framework".
    Proceedings of 17th International Conference on Software Engineering and Knowledge Engineering (SEKE), Taipei, Taiwan, July 2005.
    K. Limaye, C. Leangsuksun, Z. Greenwood, S. L. Scott, C. Engelmann, R. Libby and K. Chanchio. "Job-Site Level Fault Tolerance for
    Cluster and Grid environments". Submitted to IEEE Cluster Computing (Cluster), Boston, MA, USA, September 2005.
    J. Wu and C. Leangsuksun. "Customizable Fine-Grained Access Control Framework for Grid Computing - Gridshib Alternative".
    Submitted to International Conference for High Performance Computing, Networking, and Storage (SC), Seattle, WA, USA, November
    2005.
    Y. Liu and C. B. Leangsuksun. "Reliability-aware Checkpoint/Restart Scheme: A Performability Trade-off". Submitted to IEEE Cluster
    Computing (Cluster), Boston, MA, USA, September 2005.
    J. Wu and C. Leangsuksun. "Policy-based Access Control Framework for Grid Computing". Submitted to IEEE Cluster Computing
    (Cluster), Boston, MA, USA, September 2005.

OS-level Data Replication and Distributed Control
     C. Engelmann, S. L. Scott and G. A. Geist. "High Availability through Distributed Control". Proceedings of High Availability and
     Performance Computing Workshop (HAPCW), Santa Fe, NM, USA, October 2004.
     C. Engelmann and S. L. Scott. "High Availability for Ultra-Scale High-End Scientific Computing". Proceedings of 2nd International
     Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters
     (COSET-2), Cambridge, MA, USA, June 2005.

Project web site



