Caching and Data Consistency in P2P

Dai Bing Tian
Zeng Yiming
Caching and Data Consistency
 Why caching?
   Caching helps use bandwidth more efficiently
 The data consistency in this topic is different from consistency in
  distributed databases
 It refers to the consistency between a cached copy and the data on the
  origin server
Introduction
 Caching is built on top of existing P2P architectures such as CAN,
  BestPeer, Pastry, etc.
 The caching layer sits between the application layer and the P2P layer
 Every peer has its own cache control unit and local cache, and publishes
  its cache contents
Presentation Order
 We will present four papers:
   Squirrel
   PeerOLAP
   Caching for Range Queries
      with CAN
      with DAG
Overview
   Paper          Based on        Caching   Consistency
   Squirrel       Pastry          Yes       Yes
   PeerOLAP       BestPeer        Yes       No
   RQ with CAN    CAN             Yes       Yes
   RQ with DAG    Not specified   Yes       Yes
Squirrel
 Enables web browsers on desktop machines to share their local caches
 Uses a self-organizing peer-to-peer network, Pastry, as its object
  location service
 Pastry is fault resilient, and so is Squirrel
Web Caching
 Web browsers generate HTTP GET requests
 If the object is in the local cache, return it if it is "fresh" enough
 "Freshness" can be checked by submitting a conditional GET (cGET) request
 If there is no such object, issue a GET request to the origin server
 For simplicity, we assume all objects are cacheable
 (A minimal sketch of this decision flow follows below.)
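The following is a minimal sketch of the browser-side decision flow described above. The cache API and the http_get / http_cget callables are assumptions for illustration, not the Squirrel implementation.

# Minimal sketch of the caching decision above. The cache API, freshness test,
# and cGET helper are assumptions for illustration, not Squirrel's code.
import time

class LocalCache:
    def __init__(self):
        self.entries = {}   # url -> (object_bytes, expiry_timestamp)

    def get(self, url):
        return self.entries.get(url)

    def put(self, url, obj, ttl):
        self.entries[url] = (obj, time.time() + ttl)

def fetch(url, cache, http_get, http_cget):
    """Return the object for `url`, using the local cache when possible."""
    entry = cache.get(url)
    if entry is not None:
        obj, expiry = entry
        if time.time() < expiry:
            return obj                        # fresh enough: serve from cache
        fresh, new_obj, ttl = http_cget(url)  # conditional GET: revalidate
        if fresh:
            return obj                        # server says our copy is still valid
        cache.put(url, new_obj, ttl)
        return new_obj
    obj, ttl = http_get(url)                  # cache miss: plain GET to origin
    cache.put(url, obj, ttl)
    return obj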
Home Node
 As in Pastry, every peer (node) has a nodeID
 objectID = SHA-1(object URL)
 An object is assigned to the node whose nodeID is numerically closest to
  the objectID
 The node that owns an object is called the home node of that object
 (See the sketch below for how a home node is chosen.)
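A minimal sketch of home-node assignment: hash the URL with SHA-1 and pick the peer whose nodeID is numerically nearest. Pastry's prefix routing is simplified away; the peer list and circular-distance helper are assumptions for illustration.

# Sketch of home-node assignment, not the real Pastry lookup.
import hashlib

ID_SPACE = 2 ** 160          # SHA-1 produces 160-bit identifiers

def object_id(url: str) -> int:
    return int(hashlib.sha1(url.encode()).hexdigest(), 16)

def circular_distance(a: int, b: int) -> int:
    d = abs(a - b)
    return min(d, ID_SPACE - d)

def home_node(url: str, node_ids: list) -> int:
    oid = object_id(url)
    return min(node_ids, key=lambda nid: circular_distance(nid, oid))

# Example: three hypothetical peers; the one closest to SHA-1(url) is the home node.
peers = [object_id("peer-%d" % i) for i in range(3)]
print(home_node("http://example.com/index.html", peers))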
Two approaches
 Squirrel has two approaches:
   Home-store
   Directory
 Home-store stores the object directly in the cache of the home node
 Directory stores, at the home node, pointers to the nodes that hold the
  object in their caches; these nodes are called delegates
Home-store
 [Figure: the request for object A is routed through Pastry from the requester to
  the home node. If the home node's copy of A is fresh ("Is my copy of A fresh?" /
  "Yes, it is fresh"), it sends A over the LAN; otherwise it revalidates or fetches
  A from the origin server over the WAN and then sends A to the requester.]

Directory
 [Figure: the request for A is routed through Pastry to the home node, which keeps
  only a directory of delegates. If the directory exists, the home node replies
  "Get it from D"; the requester fetches A from that delegate, which may revalidate
  its copy with the origin server. If there is no directory, the home node replies
  "Get it from the server" and records the requester as a delegate ("Requester and
  I are your delegates", "I'm your delegate"), updating its meta-information.]
Conclusion
 The home-store approach is less complicated, but it involves no collaboration
  between peers
 The directory approach is more collaborative: it can store more objects on peers
  with larger cache capacity by setting pointers to those peers in the directory
PeerOLAP
 An OnLine Analytical Processing (OLAP) query typically involves large amounts of
  data
 Each peer has a cache containing some results
 An OLAP query can be answered by combining partial results from many peers
 PeerOLAP acts as a large distributed cache
Data Warehouse & Chunk
 "A data warehouse is based on a multidimensional data model which views data in
  the form of a data cube." – Han & Kamber, http://www.cs.sfu.ca/~han/dmbook
 [Figure: a data cube with dimensions Product (TV, PC, VCR), Date (1Qtr-4Qtr), and
  Country (U.S.A, Canada, Mexico), with sum aggregates along each dimension.]
PeerOLAP network
 LIGLO servers provide global name lookup and maintain a list of active peers
 Except for the LIGLO servers, the network is fully distributed, without any
  centralized administration point
 [Figure: peers connected to one another, to a data warehouse, and to a LIGLO
  server.]
Query Processing
 Assumption 1: Only chunks at the same aggregation level as the query are
  considered
 Assumption 2: The selection predicates are a subset of the group-by predicates
Cost Model
 Every chunk is associated with a cost value, indicating how long it takes to
  obtain the chunk (a small sketch of this computation follows below)

     N(c, Q→P) = |Cn(P→Q)| / k  +  size(c) / Tr(Q→P)

     T(c, Q→P) = S(c, Q) + N(c, Q→P)
Eager Query Processing (EQP)
 Peer P sends requests for the missing chunks to all its neighbors Q1, Q2, ..., Qk
 Each Qi provides as many of the desired chunks as it can, returning them to P with
  a cost associated with each chunk
 Each Qi then propagates the request to all of its neighbors, recursively
 To avoid flooding, hmax limits the depth of the search
EQP (Contd.)
 P collects (chunk, cost) pairs from all its neighbors
 P randomly selects one chunk ci and finds the peer Qi that can provide it at the
  lowest cost
 For each subsequent chunk, P takes the minimum of two cases: the lowest-cost peer
  that is not connected yet, or an already-selected peer that can also provide the
  chunk
 P asks these peers for the chunks and fetches the remaining missing chunks from
  the warehouse (a greedy sketch of this selection follows below)
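A greedy sketch of EQP's source selection under the assumption that contacting a peer that has not yet been selected carries a fixed extra cost; the connect_cost parameter and dictionary layout are assumptions for illustration.

# Greedy sketch of EQP's source selection; `connect_cost` is an assumed penalty for
# contacting a peer that is not yet selected.
import random

def choose_sources(offers, connect_cost=1.0):
    """offers: {chunk: {peer: cost}}. Returns {chunk: peer or 'warehouse'}."""
    plan, selected_peers = {}, set()
    chunks = list(offers)
    random.shuffle(chunks)                      # start from a randomly chosen chunk
    for chunk in chunks:
        peer_costs = offers.get(chunk, {})
        if not peer_costs:
            plan[chunk] = "warehouse"           # nobody offers it: go to the warehouse
            continue
        def effective(peer):
            extra = 0.0 if peer in selected_peers else connect_cost
            return peer_costs[peer] + extra     # cheaper if the peer is already used
        best = min(peer_costs, key=effective)
        plan[chunk] = best
        selected_peers.add(best)
    return plan

print(choose_sources({"c1": {"Q1": 3.0, "Q2": 2.5}, "c2": {"Q1": 1.0}, "c3": {}}))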
Lazy Query Processing (LQP)
 Instead of propagating the request from each Qi to all its neighbors, each Qi
  selects only its most beneficial neighbor and forwards the request there
 If each peer has k neighbors on average, EQP visits O(k^hmax) nodes, while LQP
  visits only O(k * hmax)
Chunk Replacement
 Least Benefit First (LBF)

     B(c, P) = T(c, Q→P) · a^H(P→Q) / size(c)

 Similar to LRU, every chunk has a weight
 When a chunk is used by P, its weight is reset to its original benefit value
 Every time a new chunk comes in, the weights of the old chunks are reduced
 (A sketch of LBF-style eviction follows below.)
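A hedged sketch of a benefit-weighted cache in the spirit of LBF: evict the chunk with the least weight, decay old weights when a new chunk arrives, and reset a chunk's weight to its original benefit when it is used. The decay factor and data layout are assumptions, not the exact LBF rule.

# Hedged sketch of a Least-Benefit-First style cache; the decay-on-insert aging and
# the bookkeeping layout are assumptions standing in for the exact LBF rule.
class LBFCache:
    def __init__(self, capacity, decay=0.9):
        self.capacity = capacity          # total size budget
        self.decay = decay                # aging factor applied to old chunks
        self.chunks = {}                  # chunk_id -> (size, original_benefit, weight)

    def _evict_until_fits(self, needed):
        used = sum(size for size, _, _ in self.chunks.values())
        while self.chunks and used + needed > self.capacity:
            victim = min(self.chunks, key=lambda c: self.chunks[c][2])  # least weight
            used -= self.chunks.pop(victim)[0]

    def insert(self, chunk_id, size, benefit):
        # Age every existing chunk, then make room and store the newcomer.
        for cid, (sz, orig, w) in self.chunks.items():
            self.chunks[cid] = (sz, orig, w * self.decay)
        self._evict_until_fits(size)
        self.chunks[chunk_id] = (size, benefit, benefit)

    def hit(self, chunk_id):
        # On use, reset the chunk's weight to its original benefit value.
        sz, orig, _ = self.chunks[chunk_id]
        self.chunks[chunk_id] = (sz, orig, orig)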
Collaboration
 LBF gives the local chunk-replacement algorithm
 Three variations of global behavior:
   Isolated Caching Policy: non-collaborative
   Hit-Aware Caching Policy: collaborative
   Voluntary Caching: highly collaborative
Network Reorganization
 The network can be optimized by creating virtual neighborhoods of peers with
  similar query patterns
 This gives P a high probability of getting missing chunks directly from its
  neighbors
 Each connection is assigned a benefit value, and the most beneficial connections
  are selected to be the peer's neighbors
Conclusion
 PeerOLAP is a distributed caching system
  for OLAP results
 By sharing the contents of individual
  caches, PeerOLAP constructs a large
  virtual cache which can benefit all peers
 PeerOLAP is fully distributed and highly
  scalable
Caching For Range Queries
 Range Query:
    E.g.
       SELECT Student.name
       FROM Student
       WHERE 20 < Student.age AND Student.age < 30
 Why cache?
    The data source is too far away from the requesting node
    The data source is overloaded with queries
    The data source is a single point of failure
 What to cache?
    All tuples falling in the range
 Who caches?
    The peers responsible for the range
Problem Definition
 Given a relation R and a range attribute A, we assume that the results of prior
  range-selection queries of the form R.A ∈ (low, high) are stored at the peers.
  When a query issued at a peer requires the retrieval of tuples from R in the
  range R.A ∈ (low, high), we want to locate a peer in the system that already
  stores tuples that can be accessed to compute the answer.
A P2P Framework for Caching
Range Queries
 Based on CAN.
 Data is mapped into a 2d-dimensional virtual space, where d is the number of
  dimensions (attributes) of the relation.
 Every dimension/attribute with domain [a, b] is mapped to a square virtual hash
  space whose corner coordinates are (a,a), (b,a), (b,b) and (a,b).
 The virtual hash space is further partitioned into rectangular areas, each of
  which is called a zone.
Example
 [Figure] Virtual hash space for an attribute whose domain is [10,70]:
   zone-1: <(10,56),(15,70)>
   zone-5: <(10,48),(25,56)>
   zone-8: <(47,10),(70,54)>
Terminology
 Each zone is assigned to a peer.
 Active Peer
    Owns a zone
 Passive Peer
    Does not participate in the partitioning; registers itself with an active peer
 Target Point
    A range [low, high] is hashed to the point with coordinates (low, high)
 Target Zone
    The zone where the target point resides
 Target Node
    The peer that owns the target zone
    "Stores" the tuples falling into the ranges that are mapped to its zone:
       caches the tuples in its local cache, OR
       stores a pointer to the peer that caches the tuples
 (A sketch of the range-to-point hashing appears below.)
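A minimal sketch of the range-to-point hashing and the zone containment test described above; the Zone tuple layout is an assumption for illustration.

# Sketch of hashing a range to its target point and finding the target zone.
from collections import namedtuple

# A zone is a rectangle <(x1, y1), (x2, y2)> in the attribute's virtual hash space.
Zone = namedtuple("Zone", ["x1", "y1", "x2", "y2"])

def target_point(low, high):
    """A range [low, high] hashes to the point (low, high)."""
    return (low, high)

def contains(zone, point):
    x, y = point
    return zone.x1 <= x <= zone.x2 and zone.y1 <= y <= zone.y2

def target_zone(zones, low, high):
    p = target_point(low, high)
    return next((z for z in zones if contains(z, p)), None)

# Example with the zones from the slide above (domain [10, 70]).
zones = [Zone(10, 56, 15, 70), Zone(10, 48, 25, 56), Zone(47, 10, 70, 54)]
print(target_zone(zones, 11, 60))   # falls in zone-1: <(10,56),(15,70)>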
Zone Maintenance
 Initially, the data source is the only active node and the entire virtual hash
  space is its zone
 A zone split happens under two conditions:
     Heavy answering load
     Heavy routing load
Example of Zone Splits
 If a zone has too many queries to answer:
    It finds the x-median and y-median of the stored results, and determines
    whether a split at the x-median or at the y-median gives a more even
    distribution of the stored answers and of the space.
 If a zone is overloaded because of routing queries:
    It splits the zone at the midpoint of its longer side.
 (A small sketch of the split decision follows below.)
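A hedged sketch of the two split rules above; the evenness score used to choose between the x-median and y-median splits is an assumption, not the paper's metric.

# Hedged sketch of the zone-split rules; zone = (x1, y1, x2, y2).
from statistics import median

def split_for_answering_load(zone, points):
    """points: stored target points [(x, y), ...] in this zone."""
    x1, y1, x2, y2 = zone
    xm = median(p[0] for p in points)
    ym = median(p[1] for p in points)
    # Score each candidate split by how evenly it divides the stored answers.
    left = sum(1 for p in points if p[0] <= xm)
    below = sum(1 for p in points if p[1] <= ym)
    x_balance = abs(len(points) - 2 * left)
    y_balance = abs(len(points) - 2 * below)
    if x_balance <= y_balance:
        return (x1, y1, xm, y2), (xm, y1, x2, y2)     # vertical split at x-median
    return (x1, y1, x2, ym), (x1, ym, x2, y2)         # horizontal split at y-median

def split_for_routing_load(zone):
    """Split at the midpoint of the longer side."""
    x1, y1, x2, y2 = zone
    if (x2 - x1) >= (y2 - y1):
        xm = (x1 + x2) / 2
        return (x1, y1, xm, y2), (xm, y1, x2, y2)
    ym = (y1 + y2) / 2
    return (x1, y1, x2, ym), (x1, ym, x2, y2)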
Answering A Range Query
 If an active node poses the query, the query is initiated from its zone; if a
  passive node poses the query, it contacts any active node, from where the query
  starts routing.
 Two steps are involved:
     Query Routing
     Query Forwarding
Query Routing
 [Figure: the query for target point (26,30) is routed zone by zone toward the
  target.]
 If the target point falls in this zone:
    Return this zone
 Else:
    Route the query to the neighbor that is closest to the target point
 (A sketch of this greedy routing step follows below.)
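A sketch of the greedy routing step: deliver if the target point is inside the current zone, otherwise hand the query to the neighbor closest to the target. The rectangular zone representation and the point-to-rectangle distance are assumptions for illustration.

# Greedy routing sketch; zones are rectangles (x1, y1, x2, y2).
import math

def distance_to_zone(zone, point):
    """Euclidean distance from a point to a rectangular zone."""
    x1, y1, x2, y2 = zone
    px, py = point
    dx = max(x1 - px, 0, px - x2)
    dy = max(y1 - py, 0, py - y2)
    return math.hypot(dx, dy)

def route(current_zone, neighbors, point):
    """Return 'deliver' if the point is here, else the next-hop neighbor zone."""
    if distance_to_zone(current_zone, point) == 0:
        return "deliver"
    return min(neighbors, key=lambda z: distance_to_zone(z, point))

# Example: target point (26, 30) is not in the current zone; pick the closer neighbor.
print(route((10, 56, 15, 70), [(10, 48, 25, 56), (47, 10, 70, 54)], (26, 30)))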
Forwarding
 If the results are stored in the target node, they are sent back to the querying
  node
 Otherwise, zones lying in the upper-left area of the target point may still store
  the results, so the query needs to be forwarded to these zones too
Example
 [Figure] If no results are found in zone-7, the shaded region may still contain
  the results.
 Reason: any prior range query q whose range subsumes (x,y) must be hashed into the
  shaded region.
Forwarding (Cont.)
 How far should forwarding go?
    For a range (low, high), we restrict results to those falling in
    (low - offset, high + offset), where offset = AcceptableFit x |domain| and
    AcceptableFit ∈ [0,1].
    The shaded square defined by the target point and the offset is called the
    Acceptable Region.
Forwarding (Cont.)
 Flood Forwarding
    A naïve approach: forward to the left and top neighbors if they fall in the
    acceptable region
 Directed Forwarding
    Forward to the neighbor that maximally overlaps with the acceptable region
    The number of forwards can be bounded by specifying a limit d, which is
    decremented on every forward
 (A sketch of directed forwarding follows below.)
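A sketch of directed forwarding: pick the neighbor with maximal overlap with the acceptable region and decrement a hop limit d on each forward. The rectangle layout and the overlap measure are assumptions for illustration.

# Directed forwarding sketch; zones and the acceptable region are (x1, y1, x2, y2).
def overlap_area(zone, region):
    """Area of intersection between two rectangles."""
    ax1, ay1, ax2, ay2 = zone
    bx1, by1, bx2, by2 = region
    w = min(ax2, bx2) - max(ax1, bx1)
    h = min(ay2, by2) - max(ay1, by1)
    return max(w, 0) * max(h, 0)

def directed_forward(neighbors, acceptable_region, d):
    """Return the next zone to forward to, or None if the budget is exhausted."""
    if d <= 0:
        return None, 0
    candidates = [(overlap_area(z, acceptable_region), z) for z in neighbors]
    best_area, best_zone = max(candidates, key=lambda t: t[0])
    if best_area == 0:
        return None, d          # no neighbor overlaps the acceptable region
    return best_zone, d - 1     # forward and decrement the remaining budget

# Example: two neighbor zones; the first overlaps the acceptable region more.
print(directed_forward([(10, 30, 26, 50), (0, 0, 10, 20)], (16, 30, 26, 40), d=3))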
Discussion
 Improvements
   Lookup during routing
   Warm-up queries
 Peer soft-departure & failure events
 Update (cache consistency)
   Say a tuple t with range attribute a = k is updated in the data source; then the
   target zone of the point (k,k) and all zones lying in its upper-left region have
   to update their caches.
Range Addressable Network: A P2P
Cache Architecture for Data Ranges
   Assumption:
     Tuples stored in the system are labeled
      1,2,…,N according to the range attribute
     A range [a,b] is a contiguous subset of
      {1,2,…,N}, where 1<=a<=b<=N
   Objective:
     Given a query range [a,b], peers
      cooperatively find results falling in the shortest
      superset of [a,b], if they are cached
      somewhere.
Overview
 Based on Range Addressable DAG
  (Directed Acyclic Graph)
 Map every active node in the P2P system
  to a group of nodes in the DAG
 A node is responsible for storing results
  and answering queries falling into a
  specific range
Range Addressable DAG
 The entire universe [1, N] is mapped to the root.
 Each node is recursively divided into 3 overlapping intervals of equal length.
 (A construction sketch follows below.)
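A hedged sketch of building such a DAG: each interval is split into three overlapping sub-intervals of half the length (left, middle, right). The exact split arithmetic and the stopping condition are assumptions for illustration.

# Sketch of the range-addressable DAG construction.
from functools import lru_cache

@lru_cache(maxsize=None)        # memoization lets parents share children (a DAG)
def children(lo, hi):
    length = hi - lo + 1
    if length <= 2:             # stop at small intervals (assumed leaf size)
        return ()
    half = length // 2
    quarter = length // 4
    left = (lo, lo + half - 1)
    mid = (lo + quarter, lo + quarter + half - 1)
    right = (hi - half + 1, hi)
    return (left, mid, right)

def build(lo, hi, depth=0):
    """Print the DAG reachable from [lo, hi] (shared nodes may print more than once)."""
    print("  " * depth + "[%d,%d]" % (lo, hi))
    for c_lo, c_hi in children(lo, hi):
        build(c_lo, c_hi, depth + 1)

build(1, 16)    # root [1,16] -> [1,8], [5,12], [9,16] -> ...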
Range Lookup

Example in the figure: query q = [7,10]; ranges [5,12] and [7,13] both contain it.

Input:  a query range q = [a,b], a node v in the DAG
Output: the shortest range in the DAG that contains q

boolean down = true;
search (q, v)
{
   if q ⊄ i(v)
          search (q, parent(v));
   if q ⊆ i(child(v)) && down
          search (q, child(v));
   else
          if some range stored at v is a superset of q
                     return the shortest range containing q that is stored at v
                                or at parent(v); (*)
          else
                     down = false;
                     search (q, parent(v));
}

(A Python sketch of this lookup follows below.)
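Here is a hedged Python sketch of the lookup: descend to the lowest DAG node whose interval contains q, then walk up through ancestors looking for the shortest cached range that is a superset of q. This simplifies the pseudocode's traversal; the DagNode structure is an assumption for illustration.

# Simplified range-lookup sketch over a range-addressable DAG.
class DagNode:
    def __init__(self, interval, parent=None):
        self.interval = interval          # (lo, hi) covered by this node
        self.parent = parent
        self.children = []                # child DagNodes (overlapping sub-intervals)
        self.stored = []                  # cached ranges [(lo, hi), ...]

def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def lookup(q, v):
    # Go up until v's interval contains q (mirrors "if q not in i(v): search parent").
    while v.parent is not None and not contains(v.interval, q):
        v = v.parent
    # Go down while some child's interval still contains q.
    descending = True
    while descending:
        descending = False
        for child in v.children:
            if contains(child.interval, q):
                v, descending = child, True
                break
    # Walk up through ancestors, returning the shortest stored superset of q.
    while v is not None:
        candidates = [r for r in v.stored if contains(r, q)]
        if candidates:
            return min(candidates, key=lambda r: r[1] - r[0])
        v = v.parent
    return None     # not cached anywhere along this path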
Peer Protocol
 Maps the logical DAG structure to physical
  peers
 Two components
     Peer   Management
         Handles peer joining, leaving, failure
     Range    Management
         Deals with query routing and updates
Peer Management
 It ensures that, at any time,
   every node in the DAG is assigned to some peer
   the nodes belonging to one peer, called a zone, form a connected component of
    the DAG
 This is done by handling Join Requests, Leave Requests, and Failure Events
  properly.
Join Request
                  The first peer joining the
                   system takes over the
                   entire DAG
                  A new peer joining the
                   system contacts one of
                   the peers in the system to
                   take over one of its child
                   zones. Default strategy:
                   left child, then mid child,
                   then right child.
Leave Request
 When a peer wants to leave (soft departure), it hands over its zone to the
  smallest neighboring zone.
 Neighboring zones: two zones are neighbors if there is a parent-child relationship
  between some node in one zone and some node in the other.
Failure Event
 A zone maintains info on all its ancestors, so when it finds out that one of its
  ancestors has failed, it contacts the nearest alive ancestor for zone takeover.
Range Management
 Range Lookup
 Range Update
     When  a tuple is updated in the data source,
     we locate the peer with the shortest range
     containing that tuple, then update this peer
     and all its ancestors.
Improvement
                 Cross Pointers
                   For  a node v, if it’s the
                    left child of its parent,
                    then it keeps cross
                    pointers to all the left
                    children of nodes that
                    are in its parent’s level.
                   Similarly for mid child.
Improvement (Cont.)
 Load Balancing by Peer Sampling
   Collapsed DAG: collapse each peer's zone to a single node.
   The system is balanced if the collapsed DAG is balanced.
   Lookup time is O(h), where h is the height of the collapsed DAG; hence a
    balanced system leads to optimal performance.
   When a new peer joins, it polls k peers at random and sends its join request to
    the one whose zone is rooted nearest to the root.
 [Figure: zones P1, P2, P3 and the corresponding collapsed DAG with one node per
  zone.]
 (A sketch of the peer-sampling join follows below.)
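A sketch of load balancing by peer sampling: a joining peer polls k random peers and joins the one whose zone is rooted closest to the root. The Peer record and the depth bookkeeping are assumptions for illustration.

# Peer-sampling join sketch.
import random
from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    zone_root_depth: int      # depth of the peer's zone root in the collapsed DAG

def choose_join_target(active_peers, k):
    """Poll k random peers and pick the one whose zone is rooted nearest the root."""
    sample = random.sample(active_peers, min(k, len(active_peers)))
    return min(sample, key=lambda p: p.zone_root_depth)

# Example: with k = 2, the joining peer sends its join request to the sampled peer
# whose zone root is shallowest.
peers = [Peer("P1", 0), Peer("P2", 2), Peer("P3", 3)]
print(choose_join_target(peers, k=2).name)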
Conclusion
   Caching Range Queries based on CAN
     Maps  every attribute into a 2D space
     The space is divided into zones
     Peers manage their respective zones
     A range [low,high] is mapped to a point
      (low,high) in the 2D space
     Query Routing & Query Forwarding
Conclusion (Cont.)
 Range Addressable Network
   Models ranges as a DAG
   Every peer takes responsibility for a group of nodes in the DAG
   Querying involves a traversal of the DAG

				