GhostShare An Invisible Content Addressable Network

          GhostShare: You Name It, You Get It!
          An Invisible Content Addressable Network
                            Ghost Group
                   University of Bologna CSD
                             UCLA CSD
                         Internal Use Only
The ideas expressed in this document are for research purposes
ONLY, aiming to support freedom of speech as well as a disaster-proof
     distributed file system across the public Internet
 ■ CAN/P2P → Overview, Issues and Concerns
 ■ P2P Routing → Pastry/CHORD
 ■ P2P Searching Mike
 ■ Anonymity and Performance
 ■ Where we are and where are we going?
                  CAN Overview
   ▪ Content Addressable Network:
       ➢ We address the network by content rather than by node ID:
         the “destination” of a request is a specific content, i.e.
         users do not care who has the Kennedy top-secret file,
         they just want the file.
 ■ Large-scale network: #Nodes ~ 10E6
 ■ Nodes are identified by a unique node ID that
   is not related to the real node IP and keeps the
   node’s identity anonymous to the others.
 ■ Contents are uniquely identified by a hash
   code based on the contents. (i.e. the full word
       CAN Overview (cont’d)
 ■ Nodes: identified by a unique node ID
   that is not related to the real node IP and keeps
   the node’s identity anonymous to the others.
 ■ Contents: uniquely identified by a hash
   code based on the actual file information and
   content. In a given CAN, multiple copies of the
   same content might be present.
 ■ Objects: each represents a specific copy of a content
   shared by a specific node. They are identified by
   the GUID, a Global Unique Identifier defined as
           CAN Overview (cont’d)
   ▪ Sample Scenarios:
       ➢ We have a set of files belonging to different AIDS
         patients, distributed across individual doctors’ desktops.
          • Doctors share the information for the sake of medical
            research and statistics, but they want to preserve both
            patient anonymity and doctor anonymity in order to
            protect the patient and the doctor.
       ➢ A number of citizens want to take full advantage of
         the First Amendment and freely express their opinions
         without incurring political prosecution or lawsuits.
          • They want to share documents about political issues or public
            figures anonymously with other users interested in the
            same topic; for example, sharing the secrets and dirty love
            stories of Bruce Springsteen.
        CAN Overview (cont’d)
 ■ Problems (large scale, 10E6 nodes):
     ➢ The user knows the search
       keys: what he/she is looking for.
     ➢ The user knows neither the Content Hash nor
       the node address.
 ■ Goal:
     ➢ A user wants to type “African American AIDS”
       and efficiently get the list of files containing medical
       information on all African Americans suffering
       from AIDS.
           CAN Overview (cont’d)
   ▪ Problem:
       ➢ I just got the list of content IDs and corresponding
         node IDs for a set of search keys. How do I get the
         content, given that ANONYMITY is required?
          • The source must not know the IP address of the destination
          • The destination must not know the IP address of the source
   ▪ GOAL:
       ➢ Find an efficient way to get the content from the
         source without exposing the destination’s IP to the source. A path
         of intermediate nodes is used so that “source” and
         “destination” never exchange information directly
         (several solutions are already out there: Pastry, Chord,
         Freenet, etc.)
              CAN Overview (cont’d)
   ▪ Issues:
       ➢ ROUTING: Given a node identifier:
          • An efficient route is needed from the destination to the source. Overlay
            network routing: Pastry, Chord, Freenet, etc. We base the following
            discussion on Pastry. (Searching: easy to extend to CHORD?)
       ➢ SEARCHING: Given a set of keys (strings expressing what the user
         wants to get):
          • How to get the list of [nodeID, ContentID] pairs corresponding to the files
            containing the information requested by the user.
       ➢ ANONYMITY and RELIABILITY: Given an untrusted and unreliable
         network environment (node and link failures, presence of pirates), we need:
          • A fault-tolerant mechanism for content retrieval (in a P2P network, nodes
            appear and disappear on an almost random basis).
          • An efficient content transfer that balances the network load and the side
            effects of having intermediate nodes between the source and the
            destination.
          • A reduced probability that a third party gets the exchanged content or
            learns the source IP and the destination IP.
                           P2P Routing
   ▪ P2P Network features:
       ➢ Nodes in a P2P network are uniquely identified by a hash value.
       ➢ The address space ranges over [0, max hash key].
       ➢ The size of the address space is determined by the number of bits used for the hash.
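
As a rough illustration of the addressing scheme above, the following Python sketch derives node and content identifiers with a hash function and writes them as base-2^b digit strings. Everything named here is an assumption made for the example: the deck does not say which hash GhostShare uses, so SHA-256 truncated to 128 bits stands in, the node ID is drawn from random bytes so that it is unrelated to the IP, and ID_BITS, hash_to_id, node_id, content_id and to_digits are hypothetical names.

```python
import hashlib
import os

ID_BITS = 128  # assumed width of the ID space: [0, 2**128 - 1]


def hash_to_id(data: bytes) -> int:
    """Map arbitrary bytes to an integer in [0, 2**ID_BITS - 1]."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[: ID_BITS // 8], "big")


def node_id() -> int:
    """Node ID drawn from random bytes, so it is unrelated to the node's IP."""
    return hash_to_id(os.urandom(32))


def content_id(content: bytes) -> int:
    """Content ID computed from the content itself."""
    return hash_to_id(content)


def to_digits(value: int, b: int = 2) -> str:
    """Write an ID as base-2**b digits, most significant digit first."""
    if b not in (1, 2, 4):
        raise ValueError("this sketch supports b = 1, 2 or 4 only")
    mask = (1 << b) - 1
    n_digits = ID_BITS // b
    digits = [(value >> (b * i)) & mask for i in reversed(range(n_digits))]
    return "".join("0123456789abcdef"[d] for d in digits)
```

With b = 2 an ID becomes a string of 64 base-4 digits, which is the notation used for the node IDs in the Pastry slides (there shortened to 8 digits for readability).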

   [Figure: nodes placed in the ID space, labelled ID: 11-11-11-01, ID: 11-11-00-11, ID: 11-11-01-00, ID: 11-11-01-11 and ID: 11-11-01-01]
               P2P Routing
 ■ Several solutions to the P2P routing problem
   have been proposed. Among them,
   PASTRY and CHORD are the most promising,
   with path lengths on the order of O(log N).
 ■ Pastry does not implement searching, while
   Chord supports searching but is not suitable
   for anonymity. [MIKE FIND WHY AND HOW WE
 ■ We base the further discussion on Pastry, although
   some comparisons with Chord are given.
         P2P Routing: Pastry
 ■ Each node in a Pastry network is identified
   by a numerical node ID.
 ■ A message from node I to node J is
   efficiently routed by Pastry to the node
   whose node ID is closest to J. The
   average path length is O(log N),
   where N is the number of nodes in the network.
     P2P Routing: Pastry (cont’d)
   ▪ Each Pastry node maintains a routing data structure that
     is divided into 3 parts:
       ➢ Routing Table (size log_(2^b)(N) * (2^b - 1)):
         - Each entry contains the IP address of one of potentially
         many nodes whose nodeID has the appropriate prefix.
         - log_(2^b)(N) rows and 2^b - 1 columns.
         - The 2^b - 1 entries at row n each refer to a node whose nodeID
         shares the first n digits with the local node’s nodeID, but whose
         (n+1)th digit has one of the 2^b - 1 possible values other than
         the (n+1)th digit of the local node’s nodeID.
       ➢ Leaf Set (size L): The node IDs and IP addresses of the nodes
         that are numerically closest to the local node ID, from above and below;
         e.g. L=4, Node ID=20: LeafSet={18,19,21,22}.
       ➢ Neighborhood Set (size M): The node IDs and IP addresses of
         the nodes that are closest (according to the proximity metric) to
         the local node.
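
To make the sizes above concrete, here is a small Python sketch that computes the routing table shape and a leaf set. It is only an illustration under stated assumptions: routing_table_shape and leaf_set are hypothetical helper names, the row count is rounded up to a whole number, and the leaf set ignores wrap-around at the ends of the ID space.

```python
def routing_table_shape(n_nodes: int, b: int) -> tuple[int, int]:
    """Rows and columns of a routing table for n_nodes nodes and b-bit digits."""
    rows = 0
    capacity = 1
    # Smallest number of rows such that (2**b)**rows >= n_nodes,
    # i.e. log base 2**b of N rounded up, computed with integers only.
    while capacity < n_nodes:
        capacity *= 2 ** b
        rows += 1
    return rows, 2 ** b - 1


def leaf_set(node_id: int, all_ids: list[int], size: int) -> list[int]:
    """The size/2 numerically closest IDs below and above node_id."""
    half = size // 2
    if half == 0:
        return []
    smaller = sorted(i for i in all_ids if i < node_id)[-half:]
    larger = sorted(i for i in all_ids if i > node_id)[:half]
    return smaller + larger


print(routing_table_shape(10 ** 6, 4))       # (5, 15)
print(leaf_set(20, list(range(10, 31)), 4))  # [18, 19, 21, 22]
```

Reading the deck's "10E6" as one million nodes and taking b = 4 (an assumed value), the table has 5 rows of 15 entries, so each node stores at most 75 routing entries.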
     P2P Routing: Pastry (cont’d)
   ▪ A message is routed as follows:
       ➢ The node checks whether the key falls within the range of nodeIDs
         covered by its Leaf Set. If it does, the message is forwarded
         directly to the destination node, namely the node in the Leaf Set
         whose nodeID is closest to the key (possibly the present node).
       ➢ If the key is not covered by the Leaf Set, the Routing Table is used and the
         message is forwarded to a node that shares a common prefix
         with the key that is at least one digit longer.
       ➢ If the matching Routing Table entry is empty or the node cannot be reached,
         the message is forwarded to a node that shares a prefix with the
         key at least as long as the local node does and is numerically closer to
         the key than the present node’s nodeID.
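
The three routing rules above can be sketched in Python as a single next-hop decision. This is a simplified illustration, not the Pastry implementation: IDs are equal-length digit strings, the routing table is assumed to be a dictionary keyed by (row, digit), the neighborhood set is ignored, and the names next_hop and shared_prefix_len are hypothetical. The sample entries are taken from the sample Pastry node shown on the next slide.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(node_id, key, leaf_set, routing_table, base=4):
    """Return the node to forward to, or node_id if we are the destination.

    routing_table maps (row, digit) -> nodeID, where row is the length of
    the prefix that entry shares with the local node.
    """
    def dist(n):
        return abs(int(n, base) - int(key, base))

    if key == node_id:
        return node_id

    # Rule 1: the key lies within the range covered by the leaf set.
    if leaf_set:
        values = [int(n, base) for n in leaf_set]
        if min(values) <= int(key, base) <= max(values):
            return min(list(leaf_set) + [node_id], key=dist)

    # Rule 2: routing table entry sharing one more digit with the key.
    row = shared_prefix_len(node_id, key)
    entry = routing_table.get((row, key[row]))
    if entry is not None:
        return entry

    # Rule 3: any known node with an equally long prefix that is numerically closer.
    known = list(leaf_set) + list(routing_table.values())
    closer = [n for n in known
              if shared_prefix_len(n, key) >= row and dist(n) < dist(node_id)]
    return min(closer, key=dist) if closer else node_id


leaves = ["10233033", "10233001", "10233120", "10233230"]
table = {(0, "0"): "02212102", (0, "2"): "22301203", (0, "3"): "31203203",
         (1, "1"): "11301233", (1, "2"): "12230203", (1, "3"): "13021022"}
print(next_hop("10233102", "20231222", leaves, table))  # 22301203
```

For key 20231222 the local node 10233102 shares no prefix with the key, so row 0 of the routing table is consulted and the message goes to 22301203, a node that shares the first digit with the key.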
          P2P Routing: Pastry
   Sample Pastry Node.
                                     NodeId 10233102
               Leaf Set            Smaller        Larger
                 10233033         10233021      10233120      10233122
                 10233001         10233000      10233230      10233232
               Routing Table
                -0-2212102            1         -2-2301203    -3-1203203
                     0            1-1-301233    1-2-230203    1-3-021022
                10-0-31203        10-1-32102        2         10-3-23302
                102-0-0230        102-1-1302    102-2-2302        3
                1023-0-322        1023-1-000    1023-2-121        3
                10233-0-01            1         10233-2-32
                     0                          102331-2-0
               Neighborhood Set
               13021022        10200230        11301233      31301233
               02212102        22301203        31203203      33213321
P2P Routing: Pastry Example
   Routing from NodeID = 10233102, with key = 20231222, b = 2.
   [Figure: the ID space drawn as a ring from ID: 0 to ID: 2^128-1, with the
   routing state (Routing Table, Leaf Set) of the intermediate nodes 2-2301203
   and 202-11203 shown alongside. Nodes marked on the ring: 10233102,
   2-2301203, 202-11203, 2023-2230, 202312-33 and 20231222; the hyphen
   marks the prefix each node shares with the key.]
P2P Routing Chord
                   P2P: Searching
   ▪ Different Spaces:
       ➢ Search KEYS:
          • Name, age, symptoms, level of disease, etc. Basically
            anything is usable, depending on the CONTENT, which is defined
            by what the users are willing to share.
       ➢ CONTENTS:
          • The information shared by the users: the secret love story of
            Bruce Springsteen, the description of a specific disease
            related to a specific AIDS case of a real person.
   ▪ Problems:
       ➢ Different files with different contents but the same keys.
       ➢ Multiple copies of the same file under different names.
                      P2P: Searching
   ▪ Scenario
       ➢ The user wants to get “Streets of Philadelphia” by Bruce Springsteen.
       ➢ The network contains several items about Streets of
         Philadelphia, not necessarily all songs, for example the following:
          •   Streets-Of-Philadelphia.mp3
          •   The Boss – Streets of Piladelphia.mp3
          •   American-history-Streets-of-Philadelphia.pdf
          •   Bruce Springsteen – Street of Pilladelphia.mp3
          •   Bruce Springsteen Live – Streets of Philadelphia.mp3
          •   Streets of Philadelphia.jpg
       ➢ Goal: The user types the request in a human-like style and
         the network returns the file details and the information needed to
         retrieve a specific file.
              P2P: Searching
 ■ Searching is carried out in four phases:
     ➢ Phase 0 - Token Set Creation: carried out by
       the sharing node.
     ➢ Phase 1 - Distributed Indexing: carried out by
       the sharing node and a number of peers.
     ➢ Phase 2 - Searching: carried out by any node.
     ➢ Phase 3 - Index Maintenance, Cleaning
       and Trust: carried out by any node.
                  P2P: Searching
   ▪ Phase 0 - Token Set Creation:
       ➢ A certain node wants to share “Streets of Philadelphia”
         by Bruce Springsteen. The file name is “Bruce
         Springsteen – Streets of Philadelphia.mp3”.
       ➢ To proceed, the sharing node creates the
         token set according to a regular expression (e.g. all
         words longer than 2 characters, no punctuation).
         For the given example the token set is T={Bruce,
         Springsteen, Streets, Philadelphia}. Other
         information might also be part of the token set, such as
         sampling rate, file type, copyright status,
         version, date, etc. (in this example we consider just
         the file name, but the system is open).
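
A minimal Python sketch of Phase 0, assuming the token rule is exactly "words longer than 2 characters, punctuation dropped" and that tokens are lower-cased (the deck normalizes case at indexing time). The function name token_set and the choice to ignore the file extension are assumptions of this example.

```python
import re
from pathlib import PurePath


def token_set(file_name: str, min_len: int = 3) -> set[str]:
    """Tokens of a file name: words of at least min_len characters, lower-cased."""
    stem = PurePath(file_name).stem          # drop the extension, e.g. ".mp3"
    words = re.findall(r"[^\W_]+", stem)     # runs of letters/digits, no punctuation
    return {w.lower() for w in words if len(w) >= min_len}


print(sorted(token_set("Bruce Springsteen – Streets of Philadelphia.mp3")))
# ['bruce', 'philadelphia', 'springsteen', 'streets']
```

The word "of" is dropped because it has only two characters. Note that a misspelled file name such as "The Boss – Streets of Piladelphia.mp3" produces the token "piladelphia", which an exact-match index would not find under "philadelphia".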
                    P2P: Searching
   ▪ Phase 1 - Building the Distributed Index:
       ➢ Token Set → Node Space
          • For each token, the sharing node applies the hashing function (the
            same one used for generating the node IDs) and gets a KEY that falls in
            the node ID address space.
          • The KEY and its record, containing at least the tuple [KEY,
            Node ID, Content Hash], are added to an index table kept in the
            node with NODE_ID=KEY or in its closest neighbor in the
            address space. More information may be
            added to improve searching (date, author, etc.), as well as
            a trust ranking (see the anonymity section).
       ➢ By changing the first p bits of the KEY it is possible to generate
         p-1 more nodes that can host the index table for a given key,
         improving reliability. [Possible problems: if a node comes later,
         do we move the table? How do we know?]
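
The indexing step can be sketched as follows. This is an illustration under several assumptions: SHA-256 truncated to 128 bits stands in for the unspecified hash function, closest_node does a global lookup where the real system would route the record with Pastry, and the deck does not say how the first p bits are changed, so replica_keys simply flips one of the first p bits at a time. All function names are hypothetical.

```python
import hashlib

ID_BITS = 128  # assumed size of the node ID space


def token_key(token: str) -> int:
    """Hash a normalized token into the node ID space."""
    digest = hashlib.sha256(token.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[: ID_BITS // 8], "big")


def closest_node(key: int, node_ids: list[int]) -> int:
    """Stand-in for Pastry routing: the node numerically closest to key."""
    return min(node_ids, key=lambda n: abs(n - key))


def replica_keys(key: int, p: int) -> list[int]:
    """Alternative keys obtained by flipping one of the first p bits at a time."""
    return [key ^ (1 << (ID_BITS - 1 - i)) for i in range(p)]


def publish(tokens, node_id, content_hash, node_ids, index, p=0):
    """Add the record [KEY, Node ID, Content Hash] to each host's index table."""
    for token in tokens:
        key = token_key(token)
        hosts = {closest_node(k, node_ids) for k in [key] + replica_keys(key, p)}
        for host in hosts:
            index.setdefault(host, []).append((key, node_id, content_hash))


# Four evenly spaced nodes; `index` maps host node ID -> its index table.
nodes = [0, 2 ** 126, 2 ** 127, 3 * 2 ** 126]
index = {}
publish(["Streets", "Philadelphia"], node_id=7, content_hash="c1",
        node_ids=nodes, index=index, p=2)
```

Flipping a high-order bit moves the key to a distant region of the ID space, so the replicas tend to land on different hosts; whether this matches the deck's "p-1 more nodes" is left open here.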
                                 P2P: Searching
   ▪ Building Index: Example
       ➢ Node # 10233102 wants to share the song “Streets of Philadelphia” by
         Bruce Springsteen, resulting in the Token Set T={Bruce,
         Springsteen, Streets, Philadelphia}. The tokens are normalized (all lower or
         upper case). Each token is then hashed with the same function used for the
         Node ID:
          • Bruce → 2231222; Springsteen → 2023-2230; Streets → 202-11203;
       ➢ A record containing [token, NodeID, ContentID, Title, …] is sent to
         each node resulting from the hash, or to its immediate neighbor
         according to Pastry.
       ➢ Multiple copies of the same record can be sent to other nodes
         whose ID is a “scramble” of the token hash; this improves
         reliability and achieves a better load balance.
   [Figure: the ID ring from ID: 0 to ID: 2^128-1 with nodes 10233102, 20231222,
   202312-33, 2023-2230, 2-2301203 and 202-11203]
             P2P: Searching
 ■ Phase 2: Searching the Indexes
    • At any given point in time, some nodes in the
      network hold an index table for each
      token present in the network.
    • The tokens are looked up using Pastry routing,
      and the outcome is a set of tables, each one
      containing the records related to one search token
      given by the user.
    • The searching node then selects the
      nodes present in all the tables by performing a join
      on the NODE ID.
                   P2P: Searching
   ▪ Phase 2: Searching the Index - Example
       • The user is looking for Streets of Philadelphia by
         Springsteen and types the tokens Streets and Philadelphia.
       • Two searches are issued, one for Streets and one for
         Philadelphia. The result is 2 tables:
              one containing all the records for Streets
              one containing all the records for Philadelphia

       • The searching node performs a join (equi-join?) on the NODE
         ID part of the GUID and gets the list of all the nodes holding
         files with both the Streets and Philadelphia tokens (basically all
         records listed in both tables). These are the nodes to
         connect to in order to get the files.
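
The join in this example amounts to a set intersection over the per-token tables. The sketch below is an illustration only; the Record layout and the function name join are assumptions, loosely following the [token, NodeID, ContentID, Title, …] record described earlier.

```python
from collections import namedtuple

Record = namedtuple("Record", "token node_id content_id title")


def join(tables, key=lambda r: r.node_id):
    """Values of `key` that occur in every table (an equi-join on that key)."""
    if not tables:
        return set()
    return set.intersection(*[{key(r) for r in table} for table in tables])


streets = [Record("streets", "A", "c1", "Streets of Philadelphia.mp3"),
           Record("streets", "B", "c2", "Streets of Fire.mp3")]
philadelphia = [Record("philadelphia", "A", "c1", "Streets of Philadelphia.mp3"),
                Record("philadelphia", "B", "c9", "Philadelphia Freedom.mp3"),
                Record("philadelphia", "C", "c3", "Philadelphia.jpg")]

print(sorted(join([streets, philadelphia])))  # ['A', 'B']
print(join([streets, philadelphia], key=lambda r: (r.node_id, r.content_id)))
# {('A', 'c1')}
```

Joining on the node ID alone also returns node B, which holds one file matching "streets" and a different file matching "philadelphia". Joining on the (node ID, content ID) pair is a stricter variant that keeps only files carrying both tokens; the sample records are made up for the illustration.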
                            P2P: Searching
   [Figure: the ID ring from ID: 0 to ID: 2^128-1 with nodes 10233102, 20231222,
   202312-33, 2023-2230, 2-2301203 and 202-11203]
 ■ What do we mean by Anonymity
 ■ Basic Principle for achieving Anonymity
 ■ Enhancing Anonymity
     ➢ Multiple Sources
     ➢ Multiple Paths
 ■ Issues with Anonymity
     ➢ Leaf Set problem
     ➢ Encryption: good or bad?
What do we mean by Anonymity
   ▪ Given a person who shares some information
     (Source) and a person who wants it
     (Destination), an anonymous transfer is
     one where the Source does not know the Destination’s
     IP and the Destination does not know the Source’s IP
     (it is assumed that data is not inserted into the network,
     due to the user’s unwillingness to do so).
       ➢ This protects the Source from disclosing what
         information it shares on the network.
       ➢ The Destination’s identity remains hidden.
      Basic Principle for achieving Anonymity
 ■ Anonymity is achieved by having the data
   travel from the Source through node-to-node
   links until it reaches the Destination.
     ➢ This is accomplished by keeping state for each
       content request at each node it passes
       through when routed by Pastry to the Source.
     ➢ Data is not stored at each node; it is
       just forwarded.
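
One way to picture the per-request state is the toy simulation below. It is a sketch under assumptions, not the GhostShare protocol: the route list stands in for Pastry routing, request IDs are plain strings, and the Node class with its on_request and on_reply methods is hypothetical. Each node remembers only the previous hop of a request, so the reply can retrace the path without any node learning both endpoints.

```python
class Node:
    def __init__(self, name, content=None):
        self.name = name
        self.content = content or {}  # content_id -> bytes, non-empty only at a Source
        self.pending = {}             # request_id -> previous hop (None at the Destination)

    def on_request(self, request_id, content_id, prev_hop, route):
        """Remember where the request came from, then pass it on."""
        self.pending[request_id] = prev_hop
        if content_id in self.content:
            return self.on_reply(request_id, self.content[content_id])
        next_hop = route[route.index(self) + 1]
        return next_hop.on_request(request_id, content_id, self, route)

    def on_reply(self, request_id, data):
        """Forward the reply to the remembered previous hop without storing it."""
        prev_hop = self.pending.pop(request_id)
        if prev_hop is None:
            return data               # we are the Destination
        return prev_hop.on_reply(request_id, data)


destination, hop1, hop2 = Node("D"), Node("n1"), Node("n2")
source = Node("S", content={"file-1": b"payload"})
route = [destination, hop1, hop2, source]
print(destination.on_request("req-1", "file-1", None, route))  # b'payload'
```

In this toy run the Source only ever talks to hop2 and the Destination only to hop1, and the forwarding state is discarded as soon as the reply has passed.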
         Data Transfer Example
   [Figure: a Data Request and the matching Data Reply travelling hop by hop
   through intermediate nodes to and from the Destination]
          Enhancing Anonymity
   ▪ CASE: Let’s assume that there is a BAD guy on
     the network who has positioned himself next to
     Node A (he can do this easily if he knows A’s
     NodeId), and that Node A has some valuable
     information that only the RIGHT people know
     how to retrieve. Let’s also assume that the BAD
     guy copies all the data packets from A
     that pass through him, and thus has a copy of
     what A sends to somebody. Let’s also assume
     that the BAD guy is an expert in decryption and
     can break 128-bit encryption (legal max. in the
     USA). What can we do to avoid this situation?
    Enhancing Anonymity (cont.)
   ▪ The answer is: Multiple Disjoint Paths (M.D.P.)
       ➢ By establishing M.D.P. from Destination to Source,
         we ensure that no single node on the network (except
         the Destination) is able to fully reassemble the
         requested content.
       ➢ In addition, the Source sends the data in a
         round-robin fashion over the paths, which ensures that the BAD guy
         cannot even copy consecutive parts
         of the content from a single path.
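
The round-robin idea can be sketched in a few lines of Python. The chunk size, the sequence numbers attached to each chunk and the function names split_round_robin and reassemble are assumptions of this example; the deck does not describe the packet format.

```python
def split_round_robin(data: bytes, n_paths: int, chunk_size: int):
    """Source side: deal numbered chunks out over the paths in turn."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    paths = [[] for _ in range(n_paths)]
    for seq, chunk in enumerate(chunks):
        paths[seq % n_paths].append((seq, chunk))
    return paths


def reassemble(paths) -> bytes:
    """Destination side: merge the chunks from all paths by sequence number."""
    received = sorted((item for path in paths for item in path),
                      key=lambda item: item[0])
    return b"".join(chunk for _, chunk in received)


data = bytes(range(100))
paths = split_round_robin(data, n_paths=3, chunk_size=10)
assert reassemble(paths) == data
print([seq for seq, _ in paths[0]])  # [0, 3, 6, 9]
```

With three paths, an observer on the first path sees only chunks 0, 3, 6 and 9, so it never holds two consecutive parts of the content.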
    Enhancing Anonymity (cont.)
   ▪ Creating Multiple Disjoint Paths.
       ➢ When the Destination requests a path establishment for a given
         file, it sends the request to some node from its Routing Table (RT) or Leaf Set (LS).
       ➢ This node adds its nodeId to the message and forwards it
         deeper into the network.
       ➢ When the Source receives the request message, it acknowledges
         it with a new message and adds to the latter the nodeIds carried
         in the request message (let’s call this NODE_STRING).
       ➢ This process is repeated for each path that the Destination
         establishes with the Source.
       ➢ After the 1st path has been established, each node involved in
         creating a subsequent path checks whether the node to which the message
         would be forwarded next is in the NODE_STRING. If it isn’t, the node
         forwards the message; if it is, the node tries a different next hop.
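
A toy version of this procedure, assuming a known link map instead of real Pastry state: each new path is searched for while skipping every intermediate node recorded in the NODE_STRING of earlier paths. The breadth-first search is a stand-in for hop-by-hop forwarding, and find_path, disjoint_paths and the small example topology are hypothetical.

```python
from collections import deque


def find_path(links, start, goal, forbidden):
    """Breadth-first search from start to goal that skips forbidden nodes."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in links.get(path[-1], ()):
            if nxt not in seen and nxt not in forbidden:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def disjoint_paths(links, destination, source, wanted):
    """Establish up to `wanted` paths that share no intermediate node."""
    node_string = set()  # intermediate nodeIds used by earlier paths
    paths = []
    for _ in range(wanted):
        path = find_path(links, destination, source, node_string)
        if path is None or path in paths:
            break
        node_string.update(path[1:-1])
        paths.append(path)
    return paths


# "D" is the Destination (requester), "S" the Source.
links = {"D": ["b", "f"], "b": ["c"], "c": ["S"], "f": ["c", "e"], "e": ["S"]}
print(disjoint_paths(links, "D", "S", wanted=3))
# [['D', 'b', 'c', 'S'], ['D', 'f', 'e', 'S']]
```

Once b and c are in the NODE_STRING, node f can no longer forward to c and uses e instead; a third disjoint path does not exist in this topology, so only two are returned.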
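            Disjoint Paths - Example
   [Figure: path establishment through nodes b, c, f and e. Each hop is
   annotated with the NODE_STRING S accumulated so far: S = {b}, S = {b, c},
   S = {b, c, f}, S = {b, c, f, e}.
   Legend: path establishment request; path acknowledgement; a forwarding step
   that does not happen since S at node f contains c.]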
    Enhancing Anonymity (cont.)
   ▪ CASE (cont.): Back to the BAD guy. What if our
     bad guy, knowing A’s nodeId, connects to the
     network using multiple nodes with nodeIds
     closest to A’s nodeId? Then he will be in the LS
     of A. Thus he will be able, with high probability,
     to copy a big chunk of the data that A sends into
     the network. This means that for a given file
     download with multiple paths, the bad guy will
     be able to intercept a significant portion of those
     paths and thus copy a significant chunk of the
     file. What shall we do now?
 Enhancing Anonymity (cont.)
 ■ A possible solution: Multiple Sources.
     ➢ If the Destination retrieves the file from
       multiple sources, it is less likely that the
       BAD guy will be able to copy a significant part
       of the file. This is because, as the number of
       sources increases, the number of nodes that
       the BAD guy needs also increases.
Possible Issues with Anonymity
   ▪ Disjoint Path Creation & Leaf Set Problem.
       ➢ Let’s assume the following scenario. Node A is in the
         LS of Node B (the BAD guy), i.e. B knows A’s IP.
         If Node A requests a file that B hosts, A may
         establish a direct connection with Node B and send
         it a path establishment message with an empty
         NODE_STRING. B can then infer that Node A itself is
         requesting the file (otherwise |NODE_STRING| >= 1),
         and since B knows A’s IP, A’s anonymity has been compromised.
   ▪ Possible Solution.
       ➢ When A requests a file, it must avoid using the LS for
         establishing a path.
Possible Attacks on the Network
 Many   ………………….
      Performance & Scalability
 ■ Performance.
     ➢ Major Issues
     ➢ Solutions / Alternatives
 ■ Load Balancing.
 ■ Network Fault Tolerance.
 ■ Major Issues:
     ➢ Since we use a chain of nodes to retrieve
       files, the probability that we have a slow
       connection in the chain increases significantly.
     ➢ If one of the nodes in the chain fails, the
       whole path fails.
     ➢ A portion of each node’s bandwidth will be
       occupied with forwarding data.
               Performance (cont.)
   ▪ Solutions / Alternatives:
       ➢ Using Multiple Paths:
          • Gives a higher chance of avoiding network congestion on
            some of the paths, i.e. it is less likely that the network will be
            congested on all the paths.
          • The length of a path is O(log N), where N is the number of users on the
            network, because we use Pastry’s routing
            to establish the multiple paths. The length of a path increases
            linearly with the number of paths (see figure on next slide).

       ➢ Using Multiple Sources allows the user to retrieve
         data in parallel, so the data
         transfer rate is not limited to a single Source’s bandwidth.
Path Length = f (number of paths)
                    Load Balancing
   ▪ What is the problem?
       ➢ When receiving data through multiple paths, one path can be
         slower than the others. If we send an equal amount of data
         on each path, we waste the bandwidth available on the faster paths.
   ▪ Possible Solution.
       ➢ At the receiver, calculate for each path the number of packets
         received per unit time. We can use a weighted average to do so
         (TCP’s way of calculating RTT can be adopted for this purpose).
         Send this average on a regular basis to the sender so that it can
         adjust its sending rates among the different paths.
       ➢ This method ensures that faster paths carry more data than
         slower ones and thus achieves a reasonable load balance in the network.
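
A minimal sketch of this feedback loop, assuming an exponentially weighted moving average with gain 1/8 (the gain TCP uses when smoothing its RTT estimate) and a sender that splits traffic in proportion to the reported rates. The class PathRate and the function sending_shares are hypothetical names; the deck does not fix the exact formula.

```python
class PathRate:
    """Receiver side: smoothed packets-per-interval estimate for one path."""

    def __init__(self, alpha: float = 0.125):
        self.alpha = alpha   # assumed gain, same value TCP uses for smoothed RTT
        self.rate = None

    def update(self, packets_in_interval: float) -> float:
        if self.rate is None:
            self.rate = float(packets_in_interval)
        else:
            self.rate = (1 - self.alpha) * self.rate + self.alpha * packets_in_interval
        return self.rate


def sending_shares(rates):
    """Sender side: fraction of the traffic to put on each path."""
    total = sum(rates)
    if total == 0:
        return [1 / len(rates)] * len(rates)
    return [r / total for r in rates]


fast, slow = PathRate(), PathRate()
fast.update(30)
slow.update(10)
print(sending_shares([fast.rate, slow.rate]))  # [0.75, 0.25]
```

The smoothing keeps a single bad interval from swinging the allocation: after a first sample of 100 packets, a second sample of 20 only lowers the estimate to 90.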
         Network Fault Tolerance
   ▪ Multiple Paths:
       ➢ Protect data retrieval against node failures. In order to
         break a retrieval, all paths must die. Moreover, we
         can dynamically create new paths and thus ensure
         that the data retrieval is rarely stopped.
   ▪ Multiple Sources:
       ➢ Protect data retrieval by allowing the Destination
         to request the remaining parts of a file from a different
         source when a source is bombed or has just
         peacefully exited.
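
The benefit of multiple paths can be estimated with a back-of-the-envelope model. The sketch below assumes that every node on a path fails independently with the same probability and that the paths are fully disjoint; these assumptions, the parameter values and the function names are illustrative and not taken from the deck.

```python
def path_survival(p_fail: float, hops: int) -> float:
    """Probability that all nodes of one path stay up."""
    return (1 - p_fail) ** hops


def retrieval_survival(p_fail: float, hops: int, n_paths: int) -> float:
    """Probability that at least one of n_paths disjoint paths stays up."""
    return 1 - (1 - path_survival(p_fail, hops)) ** n_paths


for n_paths in (1, 2, 4):
    print(n_paths, round(retrieval_survival(0.05, 5, n_paths), 4))
```

Under this model a single path becomes less reliable as it gets longer, since one failing node breaks the whole chain, while each added disjoint path multiplies the chance that the whole retrieval breaks by the failure probability of one path.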
