thesis.ppt - Shivkumar Kalyanaraman _Shiv IBM RPI Carnatic Music_

					Meta-Simulation Design and
 Analysis for Large Scale

           David W. Bauer Jr.

     Department of Computer Science
     Rensselaer Polytechnic Institute
 Motivation
 Contributions
 Meta-simulation
   ROSS.Net
   BGP4-OSPFv2 Investigation
 Simulation
   Kernel Processes
   Seven O’clock Algorithm
 Conclusion
 High-Level Motivation: to gain varying degrees of
                        qualitative and quantitative
                        understanding of the behavior of
                        the system-under-test

“…objective as a quest for general invariant
 relationships between network stability and
 protocol dynamics…”
Meta-Simulation: capabilities to extract and interpret
meaningful performance data from the results of
multiple simulations
         • Individual experiment cost is high
         • Developing useful interpretations
         • Protocol performance modeling

Experiment Design Goal: identify a minimum-cardinality
set of meta-metrics that maximally models the system
 Motivation
 Contributions
 Meta-simulation
   ROSS.Net
   BGP4-OSPFv2 Investigation
 Simulation
   Kernel Processes
   Seven O’clock Algorithm
 Conclusion
Contributions: Meta-Simulation: OSPF
Problem: which meta-metrics are most important in
determining OSPF convergence?

[Figure: Step 1 – search the complete model space;
 Step 2 – negligible metrics identified and isolated; Step 3]

Our approach within 7% of Full Factorial using 2
orders of magnitude fewer experiments

  Optimization-based ED: 750 experiments
  Full-Factorial ED (FFED): 16384 experiments
Contributions: Meta-Simulation: OSPF/BGP
Ability: model BGP and OSPF control plane
Problem: which meta-metrics are most important in
minimizing control plane dynamics (i.e., updates)?
 All updates belong to one of four categories:
     – OO: OSPF-caused OSPF update
     – OB: OSPF-caused BGP update
     – BO: BGP-caused OSPF update
     – BB: BGP-caused BGP update
Meta-Simulation Perspective: complete view of all domains

[Figure: OB: ~50% of total updates; BO: ~0.1% of total updates;
 minimizing total BO+OB is 15-25% better than other metrics;
 global perspective 20-25% better than local perspectives]
Contributions: Simulation: Kernel Process
                    Parallel Discrete Event Simulation

Conservative Simulation                    Optimistic Simulation
Wait until it is safe to process next     Allow violations of time-stamp
     event, so that events are                 order to occur, but detect
     processed in time-stamp order             them and recover

   Benefits of Optimistic Simulation:
        i. Not dependent on the network topology simulated
        ii. As-fast-as-possible forward execution of events
Contributions: Simulation: Kernel Process
Problem: parallelizing simulation requires 1.5 to 2 times more
memory than sequential, and the additional memory requirement
affects performance and scalability

[Figure: 4 processors used; scalability decreases as model size
 increases, due to the increased memory required to support the model]

   Solution: Kernel Processes (KPs)
      a new data structure that supports parallelism and increases
      scalability
Contributions: Simulation: Seven O’clock
Problem: distributing simulation requires efficient global synchronization

 Inefficient solution: barrier synchronization between all nodes while
 performing computation

 Efficient solution: pass messages between nodes, and synchronize in the
 background to the main simulation

 Seven O’clock Algorithm: eliminate message passing  reduce cost
 from O(n) or O(log n) to O(1)
 Motivation
 Contributions
 Meta-simulation
   ROSS.Net
   BGP4-OSPFv2 Investigation
 Simulation
   Kernel Processes
   Seven O’clock Algorithm
 Conclusion
ROSS.Net: Big Picture
 Goal: an integrated simulation and experiment
  design environment

[Figure: the modeling loop – protocol models (OSPFv2, BGP4,
 TCP Reno, IPv4, etc.), simulation & meta-simulation, and
 measured topology data, traffic and router stats, etc.]

ROSS.Net: Big Picture

[Figure: input parameters feed the Design of Experiments Tool (DOT);
 DOT drives parallel discrete event network simulation, which
 produces the output metric(s)]

Meta-Simulation (DOT):
 • Experiment design
 • Statistical analysis
 • Optimization heuristic
    – Recursive Random Search
 • Sparse empirical modeling

Simulation:
 • Optimistic parallel – ROSS
 • Memory efficient network protocol models
ROSS.Net: Meta-Simulation Components

Classical Design of Experiments Tool (DOT):
  Parameter vector  Experiment Design (Full/Fractional Factorial)
   Statistical or Regression Analysis (R, STRESS)  Metric(s)
  and empirical model
    • Small-scale systems
    • Linear parameter interactions
    • Small # of params

ROSS.Net Design of Experiments Tool (DOT):
  Parameter vector  Optimization Search
   Statistical or Regression Analysis (R, STRESS)  Metric(s)
  and sparse empirical model
    • Large-scale systems
    • Non-linear parameter interactions
    • Large # of params – curse of dimensionality
Meta-Simulation: OSPF/BGP Interactions

• Router topology from
  Rocketfuel trace data
   – took each ISP map as a
     single OSPF area
   – created BGP domain
     between ISP maps
   – hierarchical mapping of routers

[Figure: AT&T’s US Router Network Topology]

• 8 levels of routers:
    – Levels 0 and 1: 155 Mb/s, 4 ms delay
    – Levels 2 and 3: 45 Mb/s, 4 ms delay
    – Levels 4 and 5: 1.5 Mb/s, 10 ms delay
    – Levels 6 and 7: 0.5 Mb/s, 10 ms delay
Meta-Simulation: OSPF/BGP Interactions
• Open Shortest Path First (OSPF)
   – Intra-domain, link-state routing
   – Path costs matter
• Border Gateway Protocol (BGP)
   – Inter-domain, distance-vector, policy routing
   – Reachability matters
• BGP decision-making steps:
   – Highest LOCAL_PREF
   – Lowest AS path length
   – Lowest origin type
       (0 – IGP, 1 – EGP, 2 – Incomplete)
   – Lowest MED
   – Lowest IGP cost
   – Lowest router ID

[Figure: an OSPF domain, with eBGP and iBGP connectivity]
Meta-Simulation: OSPF/BGP Interactions
• Intra-domain routing decisions can
  affect inter-domain behavior, and vice
  versa.

• All updates belong to one of four categories:
   –   OSPF-caused OSPF (OO) update
   –   OSPF-caused BGP (OB) update – interaction
   –   BGP-caused OSPF (BO) update – interaction
   –   BGP-caused BGP (BB) update

[Figure: an OB update triggered by a link failure or cost
 increase (e.g. maintenance)]
Meta-Simulation: OSPF/BGP Interactions
Intra-domain routing decisions can affect
   inter-domain behavior, and vice versa.

Identified four categories of updates:
   –   OO:   OSPF-caused OSPF update
   –   BB:   BGP-caused BGP update
   –   OB:   OSPF-caused BGP update – interaction
   –   BO:   BGP-caused OSPF update – interaction

[Figure: a BO update triggered when eBGP connectivity
 becomes available]

  These interactions cause route changes to thousands of
             IP prefixes, i.e. huge traffic shifts!!
Meta-Simulation: OSPF/BGP Interactions
• Three classes of protocol parameters:
    – OSPF timers, BGP timers,
      BGP decision process
• Maximum search space size
• RRS was allowed 200 trials
  to optimize (minimize) each
  response surface:
    – OO, OB, BO, BB,
      OB+BO, ALL updates
• Applied multiple linear
  regression analysis on the results
Meta-Simulation: OSPF/BGP Interactions

  •   Optimized with respect to the OB+BO response surface:
      ~15% improvement when BGP is included in the search.
  •   BGP timers play the major role, i.e. ~15% improvement in the optimal.
       – The BGP KeepAlive timer seems to be the dominant parameter – in contrast to
         the expectation of MRAI!
  •   OSPF timers have little effect, i.e. at most 5%.
       – Low time-scale OSPF updates do not affect BGP.
Meta-Simulation: OSPF/BGP Interactions

  •   Varied response surfaces – each equivalent to a particular management approach.
  •   Importance of parameters differs for each metric.
  •   For minimal total updates:
       – Local perspectives are 20-25% worse than the global.
  •   For minimal total interactions:
       – 15-25% worse can happen with other metrics.
  •   OB updates are more important than BO updates (i.e. ~50% vs. ~0.1% of
      total updates).

[Figure: minimizing total BO+OB is 15-25% better than other metrics;
 OB: ~50% of total updates; BO: ~0.1% of total updates; the global
 perspective is 20-25% better than local perspectives]
  – Number of experiments was reduced by an
    order of magnitude in comparison to Full Factorial

  – Experiment design and statistical analysis
    enabled rapid elimination of insignificant parameters

  – Several qualitative statements and system
    characterizations could be obtained with few experiments
 Problem Statement
 Contributions
 Meta-simulation
   ROSS.Net
   BGP4-OSPFv2 Investigation
 Simulation
   Kernel Processes
   Seven O’clock Algorithm
 Conclusion
Simulation: Overview
       Parallel Discrete Event Simulation
           – a Logical Process (LP) for each relatively parallelizable simulation
             model, e.g. a router, a TCP host

           Local Causality Constraint: events within each LP must be processed
                                       in time-stamp order

           Observation: adherence to the LCC is sufficient to ensure that parallel
                        simulation will produce the same result as sequential simulation

       Conservative Simulation                      Optimistic Simulation
 -      Avoid violating the local causality   -     Allow violations of local causality to
        constraint (wait until it’s safe)           occur, but detect them and recover
                                                    using a rollback mechanism

 I.     Null Message (deadlock avoidance)     I.    Time Warp protocol
             (Chandy/Misra/Bryant)                      (Jefferson, 1985)

 II.    Time-stamp of next event              II.   Limiting amount of optimistic execution
ROSS: Rensselaer’s Optimistic Simulation System

ROSS data structures:
  tw_event: message (receive_ts, src/dest_lp, user data)
  tw_lp: pe, lp number, type, processed-event queue head/tail
  tw_pe: event queue, cancel queue, free event list head/tail

GTW data structures:
  PEState GState[NPE]
  PEState: message, event queue, cancel queue, lplist[MAX_LP],
           free event list[ ][ ]
  LPState: message, process ptr, init proc ptr, rev proc ptr,
           final proc ptr, ...
  Event: message, lp number

Example Accesses
  GTW: top-down hierarchy
      lp_ptr = GState[LP[i].Map].lplist[LPNum[i]]
  ROSS: bottom-up hierarchy
      lp_ptr = event->src_lp;
         or
      pe_ptr = event->src_lp->pe;

Key advantages of the bottom-up approach:
  • reduces access overheads
  • improves locality and processor cache performance

Memory usage is only 1% more than sequential and independent of LP count.
                 “On the Fly” Fossil Collection
OTFFC works by only allocating events from the free list whose timestamp is less
than GVT. As events are processed they are immediately placed at the end of the
free list.

[Figure: snapshot of PE 0’s internal state (LPs A, B, C on Processor 0).
 Initially FreeList[0] is sorted in virtual time:
 5.0 5.0 5.0 10.0 10.0 10.0 15.0 15.0 15.0; after a rollback it is
 unsorted: 5.0 5.0 10.0 10.0 15.0 15.0 5.0 10.0 15.0]

Key Observation: rollbacks cause the free list to become UNSORTED in virtual time.
Result: event buffers that could be allocated are not 
        the user must over-allocate the free list
Contributions: Simulation: Kernel Process
               Fossil Collection / Rollback

[Figure: a Processing Element (one per CPU utilized) owns Kernel
 Processes (KPs); each KP aggregates the processed-event lists of
 several Logical Processes (LPs)]
ROSS: Kernel Processes

  Advantages:
  i. significantly lowers fossil collection overheads
  ii. lowers memory usage by aggregating LP statistics into the KP
  iii. retains the ability to process events on an LP-by-LP basis in the forward
       execution

  Disadvantages:
  i. potential for “false rollbacks”
  ii. care must be taken when deciding how to map LPs to KPs
ROSS: KP Efficiency

[Figure: event efficiency vs. number of KPs – a small trade-off of
 longer rollbacks vs. faster fossil collection; at one extreme there
 is not enough work in the system]
ROSS: KP Performance Impact

[Figure: the # of KPs does not negatively impact performance]

ROSS: Performance vs GTW

[Figure: ROSS outperforms GTW 2:1 at best in parallel and
 2:1 in sequential execution]
Simulation: Seven O’clock GVT
Optimistic approach
   – Relies on a global virtual time (GVT) algorithm to perform fossil collection at
     regular intervals
   – Events with timestamp less than GVT:
        • Will not be rolled back
        • Can be freed

GVT calculation
   – Synchronous algorithms: LPs stop event processing during GVT calculation
        • Cost of synchronization may be higher than the positive work done per interval
        • Processes waste time waiting
   – Asynchronous algorithms: LPs continue processing events while the GVT
     calculation continues in the background

 Goal: create a consistent cut among LPs that divides the events
  into past and future in wall-clock time

    Two problems: (i) Transient Message Problem, (ii) Simultaneous Reporting Problem
Simulation: Mattern’s GVT
Construct cut via message-passing

    Cost: O(log n) if
    tree, O(n) if ring
    ! If a large number of
    processors, then the free
    pool is exhausted waiting
    for GVT to complete
Simulation: Fujimoto’s GVT
Construct cut using shared
  memory flag

     Cost: O(1)

     Sequentially consistent
     memory model ensures
     proper causal order

     ! Limited to shared
     memory architecture
Simulation: Memory Model
Sequentially consistent
  does not mean instantaneous
Memory events are only
  guaranteed to be
  causally ordered

    Is there a method to achieve
    sequentially consistent
    shared memory in a loosely
    coordinated, distributed
    system?
Simulation: Seven O’clock GVT
Key observations:
   – An operation can occur atomically within a network of processors if
     all processors observe that the event occurred at the same time.
   – CPU clock time scale (ns) is significantly smaller than network time-
     scale (ms).
Network Atomic Operations (NAOs):
   – an agreed-upon frequency in wall-clock time at which some event is
     logically observed to have happened across a distributed system.
   – a subset of the possible operations provided by a complete sequentially
     consistent memory model.
[Figure: in wall-clock time, all processors alternate “Update Tables”
 and “Compute GVT” phases at the agreed NAO frequency. Example with
 processors A–E: local minima LVT: min(5, 9) and LVT: 7 combine to
 GVT: min(5, 7)]
Simulation: Seven O’clock GVT

[Figure: near-linear performance on the Itanium-2 Cluster]

•   r-PHOLD
•   1,000,000 LPs
•   10% remote events
•   16 start events
•   4 machines
    – 1-4 CPUs
    – 1.3 GHz
• Round-robin LP-to-PE mapping
Simulation: Seven O’clock GVT
• Netfinity Cluster
• 1,000,000 LPs
• 10, 25% remote events
• 16 start events
• 4 machines
    – 2 CPUs, 36 nodes
    – 800 MHz
Simulation: Seven O’clock GVT: TCP
• Itanium-2 Cluster

[Figure: near-linear performance]

• 1,000,000 LPs
   – each modeling a
     TCP host (i.e. one
     end of a TCP connection)
• 2 or 4 machines
   – 1-4 CPUs on each
   – 1.3 GHz
• Poorly mapped
Simulation: Seven O’clock GVT: TCP
• Netfinity Cluster
• 1,000,000 LPs
   – each modeling a
     TCP host (i.e. one
     end of a TCP connection)
• 4-36 machines
   – 1-2 CPUs on each
   – Pentium III
   – 800 MHz
Simulation: Seven O’clock GVT: TCP
• Sith Itanium-2 Cluster
• 1,000,000 LPs
   – each modeling a
     TCP host (i.e. one
     end of a TCP connection)
• 4-36 machines
   – 1-2 CPUs on each
   – 900 MHz
Simulation: Seven O’clock GVT
   – Seven O’Clock Algorithm
       • Clock-based algorithm for distributed processors
             – creates a sequentially consistent view of distributed memory
       • Zero-Cost Consistent Cut
             – Highly scalable and independent of event memory limits

                        Fujimoto’s     Seven O’Clock     Mattern’s          Samadi’s
    Cut Calculation     O(1)           O(1)              O(n) or O(log n)   O(n) or O(log n)
    Parallel /          P              P & D             P & D              P & D
    Distributed
    Global Invariant    Shared         Clock             Message Passing    Message Passing
                        Memory Flag    Synchronization   Interface          Interface
    Independent of      N              Y                 N                  N
    Event Memory
Summary: Contributions
    ROSS.Net: platform for large-scale network simulation,
   experiment design and analysis
       OSPFv2 protocol performance analysis
       BGP4/OSPFv2 protocol interactions

    Kernel Processes
       memory efficient, large-scale simulation

    Seven O’clock GVT Algorithm
       zero-cost consistent cut
       high performance distributed execution
Summary: Future Work
    ROSS.Net: platform for large-scale network simulation
       incorporate more realistic measurement data, protocol models
          CAIDA, multi-cast, UDP, other TCP variants
       more complex experiment designs  better qualitative
          characterizations

    Seven O’clock GVT Algorithm
       compute FFT and analyze “power” of different models
          attempt to eliminate the GVT algorithm by determining the max rollback