Reintroducing Consistency in Cloud Settings

					Ken Birman, Cornell University
Cornell Dept of Computer Science Colloquium, Sept 24, 2009

The “realtime web”

     Simple ways to create and share collaboration and
      social-network applications
                                  [Try it! http://liveobjects.cs.cornell.edu]


    Examples: Live Objects, Google Wave, JavaScript/AJAX,
     Silverlight, JavaFX, Adobe Flex and AIR, etc.

   Cloud computing entails building
    massive distributed systems
     They use replicated data, sharded relational
      databases, parallelism
     Brewer's "CAP theorem": must sacrifice
      Consistency for Availability & Performance
   Cloud providers believe this theorem

   My view: long ago, we knew how to build reliable,
    consistent distributed systems. We gave up on
    consistency too easily, partly out of superstition…
    albeit superstition backed by some painful experiences
Don’t believe me? Just ask the people who really know…
    As described by Randy Shoup at LADIS 2008

Thou shalt…
      1. Partition Everything
      2. Use Asynchrony Everywhere
      3. Automate Everything
      4. Remember: Everything Fails
      5. Embrace Inconsistency

    Werner Vogels is CTO at Amazon.com…
    His first act? He banned reliable multicast*!
       Amazon was troubled by platform instability
       Vogels decreed: all communication via SOAP/TCP


    This was slower… but
       Stability matters more than speed

* Amazon was (and remains) a heavy pub-sub user

    Key to scalability is decoupling and the
     loosest possible synchronization
    Any synchronized mechanism is a risk
       His approach: create a committee
       Anyone who wants to deploy a highly consistent
          mechanism needs committee approval


                     …. They don’t meet very often

   Applications structured as stateless tasks
     Azure decides when and how much to replicate
      them, can pull the plug as often as it likes
     Any consistent state lives in backend servers
      running SQL Server… but application design tools
      encourage developers to run locally if possible
                          "Consistency technologies
                              just don't scale!"




    This is the common thread

    All three guys (and Microsoft too)
       Really do build massive data centers that work
       And are opposed to “consistency mechanisms”




 A consistent distributed system will often have
       many components, but users observe
     behavior indistinguishable from that of a
       single-component reference system



                [Figure: a single-component reference model next to a
                 multi-component implementation that behaves identically]

   They reason this way:
     Systems that make guarantees put those guarantees
      first and struggle to achieve them
     For example, any reliability property forces a system to
      retransmit lost messages, use acks, etc
     But modern computers often become unreliable as a
      symptom of overload… so these consistency
      mechanisms will make things worse, by increasing the
      load just when we want to ease off!
   So consistency (of any kind) is a “root cause” for
    meltdowns, oscillations, thrashing
    The "consistency mechanisms" they avoid include:

     Transactions that update replicated data

     Atomic broadcast or other forms of reliable
      multicast protocols

     Distributed 2-phase locking mechanisms



   Our systems become “eventually” consistent
    but can lag far behind reality

   Thus application developers are urged not to
    assume consistency and to avoid anything
    that will break if inconsistency occurs
   [Figure: event timelines for processes p, q, r, s, t applying the updates
    A=3, B=7, B=B-A, A=A+1 over time 0-70, comparing a non-replicated
    reference execution, a synchronous execution, and a virtually
    synchronous execution]
         Synchronous runs: indistinguishable from a non-replicated
          object that saw the same updates (like Paxos)
         Virtually synchronous runs are indistinguishable from
          synchronous runs
   During the 1990s, Isis was a big success
     French Air Traffic Control System, New York Stock
      Exchange, US Navy AEGIS are some blue-chip
      examples that used (or still use!) Isis
     But there were hundreds of less high-profile users


   However, it was not a huge commercial success
     Focus was on server replication, and in those days
      few companies had big server pools
   What remained was a collection of weaker products that,
    nonetheless, were sometimes highly toxic
    For example, publish-subscribe message bus systems that use
     IPMC are notorious for massive disruption of data centers!
     [Chart: throughput (messages/s, 0 to 12,000) vs. time (s, 250 to 850),
      showing the oscillatory collapse such systems can cause]
   Among systems with strong consistency
    models, only Paxos is widely used in cloud
    systems (but its role is strictly for locking)
    Inconsistency causes bugs
       Clients would never be able to trust servers… a free-for-all
       [Image: a bounced rent check, Tommy Tenant to Jason Fane
        Properties, $1150.00, Sept 2009, captioned "My rent check
        bounced? That can't be right!"]

    Weak or "best effort" consistency?
       Strong security guarantees demand consistency
       Would you trust a medical electronic-health-records system or a
        bank that used "weak consistency" for better scalability?
   To reintroduce consistency we need
     A scalable model
      ▪ Should this be the Paxos model? The old Isis one?
     A high-performance implementation
      ▪ Can handle massive replication for individual objects
      ▪ Massive numbers of objects
      ▪ Won’t melt down under stress
      ▪ Not prone to oscillatory instabilities or resource
        exhaustion problems
   I’m reincarnating group communication!
     Basic idea: Imagine the distributed system as a
      world of “live objects” somewhat like files
     They float in the network and hold data when idle
     Programs “import” them as needed at runtime
      ▪ The data is replicated but every local copy is accurate
      ▪ Updates, locking via distributed multicast; reads are
        purely local; failure detection is automatic & trustworthy
   A library… highly asynchronous…
    // Join (or create) a replicated group named "/amazon/something"
    Group g = new Group("/amazon/something");
    // Register a handler for UPDATE events
    g.register(UPDATE, myUpdtHandler);
    // Multicast an update to every member of the group
    g.cast(UPDATE, "John Smith", new_salary);

    // Runs in each member when the UPDATE arrives
    public void myUpdtHandler(string empName, double salary)
    { …. }
   Just ask all the members to do "their share" of the work:

    Replies = g.query(LOOKUP, "Name=*Smith");
    g.callback(myReplyHndlr, Replies, typeof(double));

    // Runs in every member; each searches its own share of the data
    public void lookup(string who) {
       // divide the work into viewSize() chunks;
       // this replica will search chunk # getMyRank()
       reply(myAnswer);
    }

    // Called once the replies have been collected
    public void myReplyHndlr(double[] whatTheyFound) { … }
 On each group member:

    Group g = new Group("/amazon/something");
    g.register(LOOKUP, myLookup);

    public void myLookup(string who) {
       // divide the work into viewSize() chunks;
       // this replica will search chunk # getMyRank()
       …
       reply(myAnswer);
    }

 On the caller's side:

    Replies = g.query(LOOKUP, "Name=*Smith");
    g.callback(myReplyHndlr, Replies, typeof(double));

    public void myReplyHndlr(double[] fnd) {
       foreach(double d in fnd)
           avg += d;
       …
    }
   The group is just an object.
   User doesn’t experience sockets…
    marshalling… preprocessors… protocols…
     As much as possible, they just provide arguments
      as if this was a kind of RPC, but no preprocessor
     Sometimes they provide a list of types and Isis
      does a callback

   Groups have replicas… handlers… a “current
    view” in which each member has a “rank”
        Can’t we just use Paxos?
          In recent work (collaboration with MSR SV) we’ve merged the models.
             Our model “subsumes” both…

        This new model is more flexible:
          Paxos is really used only for locking.
          Isis can be used for locking, but can also replicate data at very high
           speeds, with dynamic membership, and support other functionality.
          Isis2 will be much faster than Paxos for most group replication
           purposes (1000x or more)



[Building a Dynamic Reliable Service. Ken Birman, Dahlia Malkhi and Robbert van Renesse. Available as a 2009
technical report, in submission to SOCC 10 and ACM Computing Surveys...]
   Other things the model supports:

    Unbreakable TCP connections that terminate in groups
      [Burgess '10] describes Robert Burgess' new r-TCP solution
      Groups use some form of state-machine replication scheme

    State transfer and persistence

    Locking and other coordination paradigms (see the sketch below)

    2PC and transactional 1-copy serializability

    Publish-subscribe with topic or content filtering (or both)
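
   One of the items above, locking, shows how consistency can ride on ordering.
   Here is a minimal, self-contained sketch (hypothetical types of my own, not
   the Isis2 API): because every replica receives lock requests and releases in
   the same agreed order, each one can compute the current lock holder locally,
   with no extra messages.

    // Illustrative sketch, not Isis2 code: each replica applies the SAME
    // totally ordered stream of lock events and therefore reaches the same
    // conclusion about who holds the lock.
    using System;
    using System.Collections.Generic;

    enum Op { Acquire, Release }

    class LockEvent {
        public Op Op; public string Requester;
        public LockEvent(Op op, string who) { Op = op; Requester = who; }
    }

    class LockReplica {
        readonly Queue<string> waiters = new Queue<string>();
        public string Holder => waiters.Count > 0 ? waiters.Peek() : null;

        // Called once per event, in the group's agreed delivery order.
        public void Deliver(LockEvent e) {
            if (e.Op == Op.Acquire) waiters.Enqueue(e.Requester);
            else waiters.Dequeue();          // the current holder released
        }
    }

    class LockDemo {
        static void Main() {
            // In a real system the ordered stream would come from a group
            // multicast; here two replicas are fed the same sequence by hand.
            var events = new[] {
                new LockEvent(Op.Acquire, "p"),
                new LockEvent(Op.Acquire, "q"),
                new LockEvent(Op.Release, "p")
            };
            var r1 = new LockReplica();
            var r2 = new LockReplica();
            foreach (var e in events) { r1.Deliver(e); r2.Deliver(e); }
            Console.WriteLine($"{r1.Holder} == {r2.Holder}");   // prints "q == q"
        }
    }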
   Isis2 has a lot in common with an operating
    system and is internally very complex
     Distributed communication layer manages
      multicast, flow control, reliability, failure sensing
     Agreement protocols track group membership,
      maintain group views, implement virtual synchrony
     Infrastructure services build messages, handle
      callbacks, keep groups healthy
   To scale really well we need to take full
    advantage of the hardware: IP multicast (IPMC)

   But IPMC was the root cause of the oscillation
    shown in the earlier chart
   Traditional IPMC systems can overload the router
    and melt down (meltdown can occur at ~100 groups)
   Issue is that routers have a small "space" for
    active IPMC addresses
   In [Vigfusson et al. '09] we show how to use
    optimization to manage the IPMC space
   In effect, it merges similar groups while
    respecting limits on the routers and switches
    Algorithm by Vigfusson and Tock [HotNets 09,
     LADIS 2008, submission to Eurosys 10]
     Uses a k-means clustering algorithm
        The generalized problem is NP-complete
        But the heuristic works well in practice
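
   To make the clustering idea concrete, here is a toy sketch (my own
   illustration, not the Dr. Multicast code): each group is described by its
   subscription bit-vector in the "user-interest" space, and a few
   k-means-style passes assign similar groups to a bounded number of
   physical IPMC addresses.

    // Toy illustration: cluster groups, described by subscription bit-vectors
    // over processes, into at most K physical IPMC addresses.
    using System;
    using System.Linq;

    class GroupClustering {
        static double Distance(double[] center, int[] group) =>
            center.Zip(group, (c, g) => (c - g) * (c - g)).Sum();

        static void Main() {
            // Rows: groups; columns: which of 6 processes subscribe (1) or not (0).
            int[][] groups = {
                new[] {1,1,1,0,0,0},   // one family of similar groups...
                new[] {1,1,0,0,0,0},
                new[] {0,0,1,1,1,1},   // ...and a second family
                new[] {0,0,0,1,1,1}
            };
            int K = 2;                               // physical IPMC address budget
            int n = groups[0].Length;
            var centers = new double[K][];
            for (int k = 0; k < K; k++)              // seed centers from first K groups
                centers[k] = groups[k].Select(x => (double)x).ToArray();

            int[] assign = new int[groups.Length];
            for (int pass = 0; pass < 10; pass++) {
                // Assignment step: each group joins the closest "address" center.
                for (int i = 0; i < groups.Length; i++)
                    assign[i] = Enumerable.Range(0, K)
                        .OrderBy(k => Distance(centers[k], groups[i])).First();

                // Update step: each center becomes the mean of its groups' vectors.
                for (int k = 0; k < K; k++) {
                    var members = Enumerable.Range(0, groups.Length)
                                            .Where(i => assign[i] == k).ToList();
                    if (members.Count == 0) continue;
                    for (int j = 0; j < n; j++)
                        centers[k][j] = members.Average(i => (double)groups[i][j]);
                }
            }

            for (int i = 0; i < groups.Length; i++)
                Console.WriteLine($"group {i} -> IPMC address #{assign[i]}");
            // Groups 0,1 share one address and groups 2,3 the other, so each
            // receiver filters out only a small amount of unwanted traffic.
        }
    }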




  o Assign IPMC and unicast addresses so as to:
       (1) Minimize network traffic
       (2) Keep per-receiver filtering cost below a hard limit
       (3) Use at most M IPMC addresses (hard limit)

  • Prefers sender load over receiver load
  • Intuitive control knobs as part of the policy
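
  One plausible way to write the optimization sketched above (my reading of the
  bullets, not necessarily the exact formulation used by Dr. Multicast); here
  x_{gm} = 1 means group g is mapped onto IPMC address m, and unmapped groups
  fall back to unicast:

    \begin{aligned}
    \min_{x}\;\; & T(x) && \text{(total network traffic)}\\
    \text{s.t.}\;\; & F_r(x) \le F_{\max} \;\;\forall r && \text{(hard per-receiver filtering limit)}\\
    & \sum_{m} \mathbf{1}\Big[\sum_{g} x_{gm} > 0\Big] \le M && \text{(at most $M$ IPMC addresses)}
    \end{aligned}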



 [Figure sequence: topics ("FGIF BEER GROUP", "FREE FOOD") plotted in a
  `user-interest' space and encoded as subscription bit-vectors such as
  (1,1,1,1,1,0,1,0,1,0,1,1); similar topics are merged onto physical IPMC
  addresses (224.1.2.3, 224.1.2.4, 224.1.2.5) while outliers fall back to
  unicast, trading off sending cost against receiver filtering cost; the
  heuristic maps each process's logical IPMC addresses onto this layout]

• Processes use “logical” IPMC addresses
• Dr. Multicast transparently maps these to
  true IPMC addresses or 1:1 UDP sends
    We looked at various group scenarios
    Most of the traffic is
     carried by <20% of groups
     For IBM WebSphere,
     Dr. Multicast achieves
     18x reduction in
     physical IPMC addresses
    [Dr. Multicast: Rx for Data Center Communication Scalability. Ymir
     Vigfusson, Hussam Abu-Libdeh, Mahesh Balakrishnan, Ken Birman, and Yoav
     Tock. LADIS 2008. November 2008. Full paper submitted to Eurosys 10.]

   For small groups, reliable multicast protocols
    directly ack/nack the sender
   For large ones, use the QSM technique: tokens
    circulate within a tree of rings
     Acks travel around the rings and aggregate over the
      members they visit (an efficient token encoding carries the data)
     This scales well even with many groups
     Isis2 uses this mode for groups with more than 25 members,
      with each ring containing ~25 nodes
   [Quicksilver Scalable Multicast (QSM). Krzys Ostrowski, Ken Birman, and
    Danny Dolev. Network Computing and Applications (NCA’08), July 08. Boston.]
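
   A toy sketch of the aggregation idea (illustrative only; QSM's real token
   format and tree of rings are more elaborate): a token circulates around one
   ring, and each member folds its missing sequence numbers into it, so the
   sender learns what to retransmit from a single token per rotation instead of
   one ack/nack per receiver.

    // Illustrative sketch, not QSM's actual protocol: a token circulates a
    // ring of receivers, aggregating NAK information into one compact summary.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    class RingToken {
        public int MaxSeqSeen;                                      // highest seq any member saw
        public SortedSet<int> Missing = new SortedSet<int>();       // union of reported gaps
    }

    class Receiver {
        readonly HashSet<int> delivered = new HashSet<int>();
        public void Deliver(int seq) => delivered.Add(seq);

        // Each member updates the token as it passes by.
        public void UpdateToken(RingToken t) {
            int max = delivered.Count == 0 ? 0 : delivered.Max();
            t.MaxSeqSeen = Math.Max(t.MaxSeqSeen, max);
            for (int s = 1; s <= t.MaxSeqSeen; s++)
                if (!delivered.Contains(s)) t.Missing.Add(s);
        }
    }

    class RingDemo {
        static void Main() {
            var ring = new[] { new Receiver(), new Receiver(), new Receiver() };
            // Sender multicast seq 1..5; member 1 lost #3, member 2 lost #5.
            foreach (var s in new[] {1,2,3,4,5}) ring[0].Deliver(s);
            foreach (var s in new[] {1,2,4,5})   ring[1].Deliver(s);
            foreach (var s in new[] {1,2,3,4})   ring[2].Deliver(s);

            var token = new RingToken();
            foreach (var member in ring) member.UpdateToken(token);   // one rotation

            // Sender reads the token: only {3, 5} must be retransmitted.
            Console.WriteLine("retransmit: " + string.Join(",", token.Missing));
        }
    }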
    Flow control is needed to prevent bursts of multicast
     from overrunning receivers
    The AJIL protocol imposes limits on the IPMC rate
       AJIL monitors the aggregated multicast rate
       Uses optimization to apportion bandwidth
       If the limit is exceeded, the user perceives a "slower"
        multicast channel
    [Ajil: Distributed Rate-limiting for Multicast Networks. Hussam Abu-
     Libdeh, Ymir Vigfusson, Ken Birman, and Mahesh Balakrishnan (Microsoft
     Research, Silicon Valley). Cornell University TR. Dec 08.]
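
    To make the "slower channel" behavior concrete, here is a toy token-bucket
    limiter (my illustration; AJIL's actual mechanism apportions an aggregate
    budget across groups via optimization, which is not shown): when the budget
    is exhausted, the send call simply waits, so applications see a slower
    multicast channel rather than dropped messages.

    // Illustrative token-bucket limiter, not the AJIL implementation: callers
    // that exceed the assigned rate simply wait instead of overrunning receivers.
    using System;
    using System.Diagnostics;
    using System.Threading;

    class MulticastRateLimiter {
        readonly double ratePerSec;    // budget assigned to this group/sender
        readonly double burst;         // maximum bucket depth
        double tokens;
        readonly Stopwatch clock = Stopwatch.StartNew();
        double lastRefill;

        public MulticastRateLimiter(double ratePerSec, double burst) {
            this.ratePerSec = ratePerSec;
            this.burst = burst;
            this.tokens = burst;
        }

        // Block until one message's worth of budget is available.
        public void AcquireSendPermit() {
            lock (this) {
                while (true) {
                    double now = clock.Elapsed.TotalSeconds;
                    tokens = Math.Min(burst, tokens + (now - lastRefill) * ratePerSec);
                    lastRefill = now;
                    if (tokens >= 1.0) { tokens -= 1.0; return; }
                    // Not enough budget: wait until roughly one token accrues.
                    Monitor.Wait(this, TimeSpan.FromSeconds((1.0 - tokens) / ratePerSec));
                }
            }
        }
    }

    // Usage: call limiter.AcquireSendPermit() before each g.cast(...); senders
    // that stay under the assigned rate never block.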


       AJIL reacts rapidly to load surges, stays close
        to targets (and we’re improving it steadily)
       Makes it possible to eliminate almost all
        IPMC message loss within the datacenter!
               Challenges and Solutions

 Challenge: Distributed computing is hard and our target developers have
  limited skills.
 Solution: Make group communication look as natural to the developer as
  building a .NET GUI.

 Challenge: Raw performance is critical to success.
 Solution: Consistency at the "speed of light" by using lossless IPMC to
  send updates.

 Challenge: IPMC can trigger resource exhaustion and loss by entering
  "promiscuous" mode, overrunning receivers.
 Solution: Optimization-based management of IPMC addresses reduces the
  number of IPMC groups 100:1; the AJIL flow-control scheme prevents overload.

 Challenge: Users will generate massive numbers of groups, not just high
  rates of events.
 Solution: Aggregation, aggregation, aggregation… all automated and
  transparent to users.

 Challenge: Reliable protocols in massive groups result in ack implosions.
 Solution: For big groups, deploy hierarchical ack/nack rings (idea from
  Quicksilver).

 Challenge: Many existing group communication systems are insecure.
 Solution: Use replicated group keys to secure membership and sensitive data.

 Challenge: What about C++ and Python on Linux?
 Solution: Port the platform to Linux with Mono, then offer C++/Python
  support via remoting.
   Isis2 is coming soon… initially on .NET
   Developers will think of distributed groups very
    much as they think of objects in C#.
     A friendly, easy to understand model
     And under the surface, theoretically rigorous
     Yet fast and secure too

   All the complexities of distributed computing
    are swept into this library… users have a very
    insulated and easy experience
   .NET supports ~40 languages, all of which can
    call Isis2 directly
   On Linux, we’ll do a Mono port and then build
    an outboard server that offers a remoted
    library interface
   C++ and other Linux languages/applications
    will simply run off this server, unless they are
    comfortable running under Mono of course
   Code extensively leverages
     Reflection capabilities of C#, even when called
      from one of the other .NET languages
     Component architecture of .NET means that users
      will already have the right “mind set”
     Powerful prebuilt data types such as HashSets

   All of this makes Isis2 simpler and more
    robust; roughly a 3x improvement compared
    to the older C/C++ version of Isis!
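
   As a small illustration of the reflection point above (a hypothetical
   helper, not the Isis2 internals): given any registered delegate, the
   marshalled argument values can be coerced to the declared parameter types
   and the handler invoked dynamically, which is what lets handlers be written
   in any .NET language without a preprocessor.

    // Hypothetical sketch of reflection-based handler dispatch: the library
    // inspects a registered delegate's parameter types, converts the incoming
    // marshalled values, and invokes the handler dynamically.
    using System;
    using System.Linq;
    using System.Reflection;

    class Dispatcher {
        public static void Invoke(Delegate handler, object[] marshalledArgs) {
            MethodInfo m = handler.Method;
            ParameterInfo[] ps = m.GetParameters();
            if (ps.Length != marshalledArgs.Length)
                throw new ArgumentException("wrong number of arguments for handler");

            // Coerce each wire value to the declared parameter type.
            object[] args = ps.Select((p, i) =>
                Convert.ChangeType(marshalledArgs[i], p.ParameterType)).ToArray();

            m.Invoke(handler.Target, args);
        }
    }

    class ReflectionDemo {
        static void MyUpdtHandler(string empName, double salary) =>
            Console.WriteLine($"{empName} -> {salary}");

        static void Main() {
            // Values as they might arrive off the wire; types recovered via reflection.
            Dispatcher.Invoke((Action<string, double>)MyUpdtHandler,
                              new object[] { "John Smith", 100000.0 });
        }
    }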
   Building this system (myself) as a sabbatical
    project… code is mostly written

   Goal is to run this system on 500 to 500,000
    node systems, with millions of object groups

   Initial byte-code-only version will be released
    under a free BSD license.

				