Peer-to-Peer (P2P) networks
and applications
1
What is P2P?
“the sharing of computer resources and
services by direct exchange of
information”
2
What is P2P?
“P2P is a class of applications that take
advantage of resources – storage, cycles,
content, human presence – available at the
edges of the Internet. Because accessing
these decentralized resources means
operating in an environment of unstable and
unpredictable IP addresses P2P nodes must
operate outside the DNS system and have
significant, or total autonomy from central
servers”
3
What is P2P?
“A distributed network architecture may
be called a P2P network if the participants
share a part of their own resources. These
shared resources are necessary to provide
the service offered by the network. The
participants of such a network are both
resource providers and resource
consumers”
4
What is P2P?
Various definitions seem to agree on
sharing of resources
direct communication between equals (peers)
no centralized control
5
What is a peer?
“…an entity with capabilities similar
to other entities in the system.”
6
Client/Server Architecture
Well known, powerful, reliable server is a data source
Clients request data from server
Very successful model
WWW (HTTP), FTP, Web services, etc.
[Figure: four clients connected through the Internet to one server]
* Figure from http://project-iris.net/talks/dht-toronto-03.ppt
7
Client/Server Limitations
Scalability is hard to achieve
Presents a single point of failure
Requires administration
Unused resources at the network edge
P2P systems try to address these
limitations
8
P2P Architecture
All nodes are both clients and servers
Provide and consume data
Any node can initiate a connection
No centralized data source
“The ultimate form of democracy on the Internet”
“The ultimate threat to copyright protection on the Internet”
[Figure: nodes connected to one another through the Internet]
* Content from http://project-iris.net/talks/dht-toronto-03.ppt
9
P2P Network Characteristics
Clients are also servers and routers
Nodes contribute content, storage, memory, CPU
Nodes are autonomous (no administrative
authority)
Network is dynamic: nodes enter and leave the
network “frequently”
Nodes collaborate directly with each other (not
through well-known servers)
Nodes have widely varying capabilities
10
P2P Goals and Benefits
Efficient use of resources
Unused bandwidth, storage, processing power at the “edge of the network”
Scalability
No central information, communication and computation bottleneck
Aggregate resources grow naturally with utilization
Reliability
Replicas
Geographic distribution
No single point of failure
Ease of administration
Nodes self-organize
Built-in fault tolerance, replication, and load balancing
Increased autonomy
Anonymity – Privacy
not easy in a centralized system
Dynamism
highly dynamic environment
ad-hoc communication and collaboration
11
What is Peer-to-Peer (P2P)?
12
P2P Applications
File sharing (Napster, Gnutella, Kazaa)
Multiplayer games (Unreal Tournament, DOOM)
Collaborative applications (ICQ, shared whiteboard)
Distributed computation (Seti@home)
Ad-hoc networks
13
P2P System Taxonomy
Historic
Data-centric
Computation-centric
User-centric
Network-centric
Platforms
14
P2P Goals/Benefits
Cost sharing
Resource aggregation
Improved scalability/reliability
Increased autonomy
Anonymity/privacy
Dynamism
Ad-hoc communication
15
P2P Challenges
Decentralization
Scalability and Performance
Anonymity
Fairness
Dynamism
Security
Transparency
Fault Resilience and Robustness
16
Peer-to-Peer Content Sharing
17
Popular file sharing P2P Systems
Napster, Gnutella, Kazaa, Freenet
Large scale sharing of files.
User A makes files (music, video, etc.) on their
computer available to others
User B connects to the network, searches for
files and downloads files directly from user A
Issues of copyright infringement
18
Research Areas
Peer discovery and group management
Data placement and searching
Reliable and efficient file exchange
Security/privacy/anonymity/trust
19
Design Concerns
Group Management
Per-node state
Load balancing
Fault tolerance/resiliency
Search
Bandwidth usage
Time to locate item
Success rate
Fault tolerance/resiliency
20
Approaches
Centralized
Unstructured
Structured (Distributed Hash Tables)
21
Centralized index
original “Napster” design
1) when peer connects, it informs central server:
IP address
content
2) Alice queries for “Hey Jude”
3) Alice requests file from Bob
[Figure: peers, including Alice and Bob, around a centralized directory server; arrows labeled with steps 1 to 3]
22
Centralized model
file transfer is decentralized, but locating content is highly centralized
[Figure: peers Bob, Alice, Judy, Jane]
23
Centralized
Benefits:
Low per-node state
Limited bandwidth usage
Short location time
High success rate
Fault tolerant
Drawbacks:
Single point of failure
Limited scale
Possibly unbalanced load
Copyright infringement
[Figure: peers Bob, Alice, Judy, Jane]
24
Napster
program for sharing files over the Internet
a “disruptive” application/technology?
history:
5/99: Shawn Fanning (freshman, Northeastern U.) founds Napster Online music service
12/99: first lawsuit
3/00: 25% UWisc traffic Napster
2000: est. 60M users
2/01: US Circuit Court of Appeals: Napster knew users violating copyright laws
7/01: # simultaneous online users: Napster 160K, Gnutella: 40K, Morpheus: 300K
25
Napster: how does it work
Application-level, client-server protocol over point-to-point TCP
Four steps:
Connect to Napster server
Upload your list of files (push) to server.
Give server keywords to search the full list with.
Select “best” of correct answers. (pings)
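The four steps above can be sketched as a toy in-memory directory. This is only an illustration of the centralized-index idea, not Napster's actual protocol: the class name `CentralIndex`, the helper `pick_best`, the addresses, and the round-trip times are all hypothetical.

```python
from collections import defaultdict


class CentralIndex:
    """Toy stand-in for a central directory server (hypothetical names)."""

    def __init__(self):
        # filename -> set of (ip, port) peers that announced it
        self._holders = defaultdict(set)

    def register(self, peer, filenames):
        """Steps 1-2: a peer connects and pushes its file list."""
        for name in filenames:
            self._holders[name.lower()].add(peer)

    def unregister(self, peer):
        """Drop a departing peer from every entry."""
        for holders in self._holders.values():
            holders.discard(peer)

    def search(self, keyword):
        """Step 3: return {filename: sorted peers} for names containing the keyword."""
        keyword = keyword.lower()
        return {
            name: sorted(holders)
            for name, holders in self._holders.items()
            if keyword in name and holders
        }


def pick_best(candidates, rtt_ms):
    """Step 4: choose the candidate with the lowest measured round-trip time."""
    return min(candidates, key=lambda peer: rtt_ms[peer])


index = CentralIndex()
bob = ("10.0.0.2", 6699)
judy = ("10.0.0.3", 6699)
index.register(bob, ["Hey Jude.mp3", "Let It Be.mp3"])
index.register(judy, ["Hey Jude.mp3"])

hits = index.search("hey jude")
best = pick_best(hits["hey jude.mp3"], {bob: 80.0, judy: 25.0})
```

Only the lookup goes through the index; the file itself would then be fetched directly from `best`.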
26
Napster
1. File list is uploaded
[Figure: users upload their file lists to napster.com]
27
Napster
2. User requests search at server.
[Figure: request and results exchanged between user and napster.com]
28
Napster
3. User pings hosts that apparently have data. Looks for best transfer rate.
[Figure: user pings candidate hosts]
29
Napster
4. User retrieves file
[Figure: user retrieves file from another user]
30
Napster
Central Napster server
Can ensure correct results
Fast search
Bottleneck for scalability
Single point of failure
Susceptible to denial of service
• Malicious users
• Lawsuits, legislation
Hybrid P2P system – “all peers are equal but some are more equal
than others”
Search is centralized
File transfer is direct (peer-to-peer)
31
Unstructured
fully distributed
no central server
used by Gnutella
Each peer indexes the files it makes available for sharing (and no other files)
overlay network: graph
edge between peer X and Y if there’s a TCP connection
all active peers and edges form overlay net
edge: virtual (not physical) link
given peer typically connected with < 10 overlay neighbors
32
Gnutella: Query flooding
Query message sent over existing TCP connections
peers forward Query message
QueryHit sent over reverse path
File transfer: HTTP
[Figure: Query messages spreading across the overlay, QueryHit messages returning]
33
Gnutella: Peer joining
1. joining peer Alice must find another peer in
Gnutella network: use list of candidate peers
2. Alice sequentially attempts TCP connections with
candidate peers until connection setup with Bob
3. Flooding: Alice sends Ping message to Bob; Bob
forwards Ping message to his overlay neighbors
(who then forward to their neighbors….)
peers receiving Ping message respond to Alice
with Pong message
4. Alice receives many Pong messages, and can then
set up additional TCP connections
34
Unstructured
Scalability: limited scope flooding
[Figure: overlay of peers Carl, Jane, Bob, Alice, Judy]
35
Unstructured
Gnutella model
Benefits:
Limited per-node state
Fault tolerant
Drawbacks:
High bandwidth usage
Long time to locate item
No guarantee on success rate
Possibly unbalanced load
[Figure: overlay of peers Carl, Jane, Bob, Alice, Judy]
36
Gnutella
Searching by flooding:
If you don’t have the file
you want, query 7 of your
neighbors.
If they don’t have it, they
contact 7 of their
neighbors, for a maximum
hop count of 10.
Requests are flooded, but
there is no tree structure.
No looping but packets may
be received twice.
Reverse path forwarding
* Figure from http://computer.howstuffworks.com/file-sharing.htm 37
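A minimal sketch of TTL-limited flooding, assuming a static overlay held in a dictionary. It is not the Gnutella wire protocol: the function name `flood_query`, the graph, and the file names are illustrative, and as a simplification a peer forwards the query to all of its neighbors, including the one it received the query from. Duplicate suppression is modeled with a `seen` set, which is why queries do not loop even though some peers receive them more than once.

```python
from collections import deque


def flood_query(graph, files, origin, filename, ttl):
    """Breadth-first flood with a TTL; returns (hits, messages_sent).

    graph: peer -> list of overlay neighbors
    files: peer -> set of filenames the peer shares
    """
    seen = {origin}            # peers that have already handled this query
    hits = []
    messages = 0
    queue = deque([(origin, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if peer != origin and filename in files.get(peer, set()):
            hits.append(peer)  # a QueryHit would travel back along the reverse path
        if remaining == 0:
            continue           # TTL exhausted: do not forward further
        for neighbor in graph.get(peer, []):
            messages += 1      # the Query is sent even if the neighbor has seen it
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, remaining - 1))
    return hits, messages


graph = {
    "Alice": ["Bob", "Judy"],
    "Bob": ["Alice", "Carl"],
    "Judy": ["Alice", "Carl"],
    "Carl": ["Bob", "Judy", "Jane"],
    "Jane": ["Carl"],
}
files = {"Carl": {"fool.her"}, "Jane": {"fool.me"}}

near = flood_query(graph, files, "Alice", "fool.her", ttl=2)
missed = flood_query(graph, files, "Alice", "fool.me", ttl=2)
far = flood_query(graph, files, "Alice", "fool.me", ttl=3)
```

With `ttl=2` the query never reaches Jane, so `fool.me` is not found even though it exists in the network; raising the TTL finds it at the cost of more messages.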
Gnutella
[Figure: a peer issues the query “fool.*” with TTL = 2]
38
Gnutella
[Figure: the query is forwarded with TTL = 1; peer X holds fool.her and replies IPX:fool.her]
39
Gnutella
[Figure: peer Y holds fool.me and fool.you and replies IPY:fool.me, fool.you]
40
Gnutella
[Figure: the reply IPY:fool.me, fool.you]
41
Gnutella: strengths and weaknesses
pros:
flexibility in query processing
complete decentralization
simplicity
fault tolerance/self-organization
cons:
severe scalability problems
susceptible to attacks
Pure P2P system
42
Gnutella: initial problems and fixes
2000: avg size of reachable network only 400-800
hosts. Why so small?
modem users: not enough bandwidth to provide search
routing capabilities: routing black holes
Fix: create peer hierarchy based on capabilities
previously: all peers identical, most modem black holes
preferential connection:
• favors routing to well-connected peers
• favors reply to clients that themselves serve large number
of files: prevent freeloading
43
Structured
FreeNet, Chord, CAN, Tapestry, Pastry model
[Figure: nodes 001, 012, 212, 305, 332; a query for 212 is forwarded between nodes]
44
Structured
FreeNet, Chord, CAN, Tapestry, Pastry model
Benefits:
Manageable per-node state
Manageable bandwidth usage and time to locate item
Guaranteed success
Drawbacks:
Possibly unbalanced load
Harder to support fault tolerance
[Figure: nodes 001, 012, 212, 305, 332; a query for 212 is forwarded between nodes]
45
Unstructured vs Structured
P2P
The systems we described do not offer any
guarantees about their performance (or even
correctness)
Structured P2P
Scalable guarantees on numbers of hops to answer
a query
Maintain all other P2P properties (load balance,
self-organization, dynamic nature)
Approach: Distributed Hash Tables (DHT)
46
Distributed Hash Tables (DHT)
Distributed version of a hash table data structure
Stores (key, value) pairs
The key is like a filename
The value can be file contents, or pointer to location
Goal: Efficiently insert/lookup/delete (key, value) pairs
Each peer stores a subset of (key, value) pairs in the system
Core operation: Find node responsible for a key
Map key to node
Efficiently route insert/lookup/delete request to this node
Allow for frequent node arrivals/departures
47
DHT Desirable Properties
Keys should be mapped evenly to all nodes
in the network (load balance)
Each node should maintain information
about only a few other nodes (scalability,
low update cost)
Messages should be routed to a node
efficiently (small number of hops)
Node arrival/departures should only affect
a few nodes
48
DHT Routing Protocols
DHT is a generic interface
There are several implementations of this interface
Chord [MIT]
Pastry [Microsoft Research UK, Rice University]
Tapestry [UC Berkeley]
Content Addressable Network (CAN) [UC Berkeley]
SkipNet [Microsoft Research US, Univ. of Washington]
Kademlia [New York University]
Viceroy [Israel, UC Berkeley]
P-Grid [EPFL Switzerland]
Freenet [Ian Clarke]
49
Basic Approach
In all approaches:
keys are associated with globally unique IDs
integers of size m (for large m)
key ID space (search space) is uniformly populated
- mapping of keys to IDs using (consistent) hashing
a node is responsible for indexing all the keys in a
certain subspace (zone) of the ID space
nodes have only partial knowledge of other node’s
responsibilities
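As a rough illustration of mapping keys to IDs and IDs to a responsible node, the sketch below hashes names onto a ring and assigns each key to the first node at or after its ID. The successor rule is the Chord-style choice; other DHTs partition the ID space differently. The names `ident` and `Ring`, the 32-bit ID size, and the use of SHA-1 are assumptions made for the example. Unlike a real DHT, this toy keeps the full node list in one place, whereas real nodes have only partial knowledge and must route to the responsible node.

```python
import bisect
import hashlib

M = 32  # IDs are integers in [0, 2**M); real systems use a much larger m


def ident(name: str) -> int:
    """Hash a key or node name to an m-bit integer ID."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** M)


class Ring:
    """Global view of the ID space, for illustration only."""

    def __init__(self, node_names):
        self._by_id = {ident(n): n for n in node_names}
        self._ids = sorted(self._by_id)

    def responsible_node(self, key: str) -> str:
        """Return the first node whose ID is >= the key's ID, wrapping around."""
        pos = bisect.bisect_left(self._ids, ident(key))
        return self._by_id[self._ids[pos % len(self._ids)]]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = Ring(nodes)
owner = ring.responsible_node("Hey Jude.mp3")
```

A useful property of this scheme is that when a node leaves, only the keys it was responsible for move to another node; all other assignments stay put.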
50
Improvements: SuperPeers
KaZaA model
Hybrid centralized and unstructured
Advantages and disadvantages?
51
Hierarchical Overlay
between centralized index, query flooding approaches
each peer is either a super node or assigned to a super node
TCP connection between peer and its super node.
TCP connections between some pairs of super nodes.
Super node tracks content in its children
[Figure legend: ordinary peer; group-leader peer; neighboring relationships in overlay network]
52
Kazaa (Fasttrack network)
Hybrid of centralized Napster and decentralized Gnutella
hybrid P2P system
Super-peers act as local search hubs
Each super-peer is similar to a Napster server for a small
portion of the network
Super-peers are automatically chosen by the system based on
their capacities (storage, bandwidth, etc.) and availability
(connection time)
Users upload their list of files to a super-peer
Super-peers periodically exchange file lists
You send queries to a super-peer for files of interest
53
File Distribution: Server-Client vs P2P
Question: How much time to distribute file from one server to N peers?
File, size F
$u_s$: server upload bandwidth
$u_i$: peer i upload bandwidth
$d_i$: peer i download bandwidth
[Figure: server and N peers attached to a network (with abundant bandwidth), links labeled with upload and download bandwidths]
54
File distribution time: server-client
server sequentially sends N copies: $N F / u_s$ time
client i takes $F / d_i$ time to download
Time to distribute F to N clients using client/server approach:
$d_{cs} = \max\{ N F / u_s,\ F / \min_i d_i \}$
increases linearly in N (for large N)
[Figure: server and N peers attached to a network (with abundant bandwidth)]
55
File distribution time: P2P
server must send one copy: $F / u_s$ time
client i takes $F / d_i$ time to download
NF bits must be downloaded (aggregate)
fastest possible upload rate: $u_s + \sum_i u_i$
$d_{P2P} = \max\{ F / u_s,\ F / \min_i d_i,\ N F / (u_s + \sum_i u_i) \}$
[Figure: server and N peers attached to a network (with abundant bandwidth)]
56
Server-client vs. P2P: example
Client upload rate = u, F/u = 1 hour, us = 10u, dmin ≥ us
[Plot: Minimum Distribution Time (axis 0 to 3.5) versus N (axis 0 to 35), one curve for P2P and one for Client-Server]
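The two lower bounds can be evaluated directly for the example parameters. In the sketch below the function names are illustrative, bandwidths are expressed in units of u and the file size in units where F/u = 1 hour, and every peer is assumed to have download rate exactly equal to the server upload rate (one choice that satisfies d_min ≥ u_s).

```python
def client_server_time(file_size, server_up, peer_down):
    """d_cs = max(N*F/u_s, F/min(d_i))."""
    n = len(peer_down)
    return max(n * file_size / server_up, file_size / min(peer_down))


def p2p_time(file_size, server_up, peer_up, peer_down):
    """d_P2P = max(F/u_s, F/min(d_i), N*F/(u_s + sum(u_i)))."""
    n = len(peer_down)
    return max(
        file_size / server_up,
        file_size / min(peer_down),
        n * file_size / (server_up + sum(peer_up)),
    )


# Example parameters: F/u = 1 hour, u_s = 10u, d_i = u_s for all peers.
u = 1.0
F = 1.0
u_s = 10 * u
N = 30
d_cs = client_server_time(F, u_s, [u_s] * N)
d_p2p = p2p_time(F, u_s, [u] * N, [u_s] * N)
```

For N = 30 this gives 3 hours for client-server and 0.75 hours for P2P: the client-server bound keeps growing linearly in N, while the P2P bound approaches F/u = 1 hour from below because each new peer also adds upload capacity.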
57
File distribution: BitTorrent
Efficient content distribution system using file
swarming. Usually does not perform all the
functions of a typical P2P system, like searching.
CacheLogic estimated (around 2004) that
BitTorrent traffic accounted for roughly 35% of all
traffic on the Internet.
Author: Bram Cohen
58
File distribution: BitTorrent
P2P file distribution
tracker: tracks peers participating in torrent
torrent: group of peers exchanging chunks of a file
[Figure: a peer obtains a list of peers from the tracker, then trades chunks with peers in the torrent]
59
BitTorrent
file divided into 256KB chunks.
peer joining torrent:
has no chunks, but will accumulate them over time
registers with tracker to get list of peers,
connects to subset of peers (“neighbors”)
while downloading, peer uploads chunks to other
peers.
peers may come and go
once peer has entire file, it may (selfishly) leave or
(altruistically) remain
60
BT: File sharing
To share a file or group of files, a peer first creates
a .torrent file, a small file that contains:
metadata about the files to be shared, and
information about the tracker, the computer that
coordinates the file distribution.
Peers first obtain a .torrent file, and then connect to
the specified tracker, which tells them from which
other peers to download the pieces of the file.
61
Overall Architecture
[Figure: a web server, a tracker, the downloader (“US”), and peers A, B, C, each labeled as a seed or a leech]
62
BT: The .torrent file
The URL of the tracker
Pieces <hash1, hash 2,…, hash n>
Piece length
Name of the file
Length of the file
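The fields above can be illustrated by building the metadata for an in-memory file. This is a sketch under assumptions: a real .torrent file is bencoded and stores the SHA-1 piece hashes as one concatenated byte string, whereas here a plain dictionary and a list are used for readability. The helper names and the tracker URL are hypothetical.

```python
import hashlib


def piece_hashes(data: bytes, piece_length: int):
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def build_metainfo(name, data, tracker_url, piece_length=256 * 1024):
    """Collect the .torrent fields in a plain dict (not bencoded)."""
    return {
        "announce": tracker_url,           # the URL of the tracker
        "info": {
            "name": name,                  # name of the file
            "length": len(data),           # length of the file in bytes
            "piece length": piece_length,  # size of every piece but the last
            "pieces": piece_hashes(data, piece_length),
        },
    }


def verify_piece(metainfo, index, piece: bytes) -> bool:
    """Check a downloaded piece against the hash published in the metadata."""
    return hashlib.sha1(piece).digest() == metainfo["info"]["pieces"][index]


data = b"a" * 1000
meta = build_metainfo("example.bin", data, "http://tracker.example/announce",
                      piece_length=256)
```

Because every piece has its own hash, a downloader can verify each piece as soon as it arrives and discard corrupted data without waiting for the whole file.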
68
BT: The Tracker
IP address, port, peer id
State information (Completed or Downloading)
Returns a random list of peers
69
BT: Pulling Chunks
at any given time, different peers have different
subsets of file chunks
periodically, a peer (Alice) asks each neighbor for
list of chunks that they have.
Alice sends requests for her missing chunks
70
BT: Pieces and Sub-Pieces
A piece is broken into sub-pieces ...
typically 16KB in size
Until a piece is assembled, download only
the sub-pieces of that piece
This policy lets pieces assemble quickly
71
BT: Pipelining
When transferring data over TCP, always have several
requests pending at once, to avoid a delay between
pieces being sent. At any point in time, some number,
typically 5, are requested simultaneously.
Every time a piece or a sub-piece arrives, a new
request is sent out.
72
BT: Piece Selection
The order in which pieces are selected by
different peers is critical for good performance
If an inefficient policy is used, then peers may
end up in a situation where each has an identical
set of easily available pieces, and none of the
missing ones.
If the original seed is prematurely taken down,
then the file cannot be completely downloaded!
What are “good policies?”
73
BT: Chunk Selection
Strict Priority: first priority
Rarest First: general rule
Random First Piece: special case, at the beginning
Endgame Mode: special case
74
Random First Piece
Initially, a peer has nothing to trade
Important to get a complete piece ASAP
Select a random piece of the file and
download it
75
Rarest Piece First
Determine the pieces that are most rare
among your peers, and download those first.
This ensures that the most commonly
available pieces are left till the end to
download.
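A minimal sketch of rarest-first selection, assuming each neighbor's advertised pieces are available as a set of piece indices. The function name, the data layout, and the random tie-breaking are illustrative choices, not taken from a particular client.

```python
import random
from collections import Counter


def rarest_first(have, neighbor_pieces, rng=random):
    """Pick a missing piece held by the fewest neighbors.

    have: set of piece indices this peer already holds
    neighbor_pieces: neighbor -> set of piece indices that neighbor holds
    Returns None when no neighbor offers a piece we still need.
    """
    counts = Counter()
    for pieces in neighbor_pieces.values():
        counts.update(pieces - have)   # count only pieces we are missing
    if not counts:
        return None
    rarest = min(counts.values())
    # Break ties randomly so that peers do not all chase the same piece.
    return rng.choice(sorted(p for p, c in counts.items() if c == rarest))


neighbors = {"A": {0, 1, 2}, "B": {1, 2}, "C": {1, 3}}
choice = rarest_first({0}, neighbors)
```

Here piece 3 is held by a single neighbor, so it is requested before pieces 1 and 2, which are more widely replicated.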
76
Endgame Mode
Near the end, missing pieces are requested from
every peer containing them. When the piece
arrives, the pending requests for that piece are
cancelled.
This ensures that a download is not prevented
from completion due to a single peer with a slow
transfer rate.
Some bandwidth is wasted, but in practice, this is
not too much.
77
BT: Sending Chunks
tit-for-tat
Alice sends chunks to four neighbors currently
sending her chunks at the highest rate
re-evaluate top 4 every 10 secs
every 30 secs: randomly select another peer, starts
sending chunks
newly chosen peer may join top 4
“optimistically unchoke”
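The unchoking rule above can be sketched as two small functions: one that is re-run every 10 seconds to pick the top four uploaders, and one that is run every 30 seconds to pick the optimistic unchoke. The function names, peer labels, and rates are hypothetical, and timers and networking are left out.

```python
import random


def choose_unchoked(download_rate, interested, optimistic=None, slots=4):
    """Return the peers to upload to: top `slots` by rate plus the optimistic peer.

    download_rate: peer -> rate at which that peer is currently sending us chunks
    interested: peers that want chunks from us
    """
    ranked = sorted(interested, key=lambda p: download_rate.get(p, 0.0),
                    reverse=True)
    unchoked = set(ranked[:slots])
    if optimistic is not None and optimistic in interested:
        unchoked.add(optimistic)
    return unchoked


def rotate_optimistic(interested, unchoked, rng=random):
    """Pick a random interested peer that is currently choked, or None."""
    candidates = sorted(set(interested) - set(unchoked))
    return rng.choice(candidates) if candidates else None


rates = {"a": 50.0, "b": 40.0, "c": 30.0, "d": 20.0, "e": 10.0, "f": 0.0}
peers = list(rates)
top_four = choose_unchoked(rates, peers)
lucky = rotate_optimistic(peers, top_four)
unchoked = choose_unchoked(rates, peers, optimistic=lucky)
```

If the optimistically unchoked peer starts sending chunks at a high enough rate, it displaces one of the current top four at the next re-evaluation.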
78
BT: Choking
Choking is a temporary refusal to upload. It is one
of BitTorrent’s most powerful ideas to deal with
free riders (those who only download but never
upload).
Tit-for-tat strategy is based on game-theoretic
concepts.
79
BitTorrent: Tit-for-tat
(1) Alice “optimistically unchokes” Bob
(2) Alice becomes one of Bob’s top-four providers; Bob reciprocates
(3) Bob becomes one of Alice’s top-four providers
With higher upload rate,
can find better trading
partners & get file faster!
80
Optimistic unchoking
A BitTorrent peer has a single “optimistic unchoke”
to which it uploads regardless of the current
download rate from it. This peer rotates every
30s
Reasons:
To discover whether currently unused connections are better
than the ones being used
To provide minimal service to new peers
81
Upload-Only mode
Once download is complete, a peer has no
download rates to use for comparison nor
has any need to use them. The question is,
which nodes to upload to?
Policy: Upload to those with the best
upload rate. This ensures that pieces get
replicated faster, and new seeders are
created fast
82
Questions about BT
Which features contribute to the efficiency of BitTorrent?
What is the effect of bandwidth constraints?
Is the Rarest First policy really necessary?
Must nodes perform seeding after downloading is complete?
How serious is the Last Piece Problem?
Does the incentive mechanism affect the performance much?
83
Trackerless torrents
BitTorrent also supports "trackerless" torrents,
featuring a DHT implementation that allows the
client to download torrents that have been
created without using a BitTorrent tracker.
84
P2P: searching for information
Index in P2P system: maps information to peer location
(location = IP address & port number)
File sharing (e.g. eMule)
Index dynamically tracks the locations of files that peers share.
Peers need to tell index what they have.
Peers search index to determine where files can be found
Instant messaging
Index maps user names to locations.
When user starts IM application, it needs to inform index of its location
Peers search index to determine IP address of user.
85
P2P Case study: Skype
inherently P2P: pairs of users communicate.
proprietary application-layer protocol (inferred via reverse engineering)
hierarchical overlay with SNs
Index maps usernames to IP addresses; distributed over SNs
[Figure: Skype clients (SC), supernodes (SN), and the Skype login server]
86
Peers as relays
Problem when both
Alice and Bob are
behind “NATs”.
NAT prevents an outside
peer from initiating a call
to insider peer
Solution:
Using Alice’s and Bob’s
SNs, Relay is chosen
Each peer initiates
session with relay.
Peers can now
communicate through
NATs via relay
87
Anonymity
Napster, Gnutella, Kazaa don’t provide
anonymity
Users know who they are downloading from
Others know who sent a query
Freenet
Designed to provide anonymity among other
features
88
P2P Review
Two key functions of P2P systems
Sharing content
Finding content
Sharing content
Direct transfer between peers
• All systems do this
Structured vs. unstructured placement of data
Automatic replication of data
Finding content
Centralized (Napster)
Decentralized (Gnutella)
Probabilistic guarantees (DHTs)
89
Issues with P2P
Free Riding (Free Loading)
Two types of free riding
• Downloading but not sharing any data
• Not sharing any interesting data
On Gnutella
• 15% of users contribute 94% of content
• 63% of users never responded to a query
– Didn’t have “interesting” data
No ranking: what is a trusted source?
90