```
Using Structure Indices for Efficient Approximation of
Network Properties
Matthew J. Rattigan, Marc Maier, and David Jensen
University of Massachusetts Amherst

Data Mining
November 27, 2006
Deborah Stoffer
The Problem
   Recent research works with very large networks
   Millions of nodes
   Calculating network statistics on very large
networks can be difficult
   Shortest paths
   Betweenness centrality
   The proportion of all shortest paths in the network that run
through a given node
   Closeness centrality
   The average distance from the given node to every other node
in the network
The Problem
   The most efficient known algorithms for
calculating betweenness centrality and closeness
centrality are O(ne + n^2 log n)
   n – number of nodes
   e – number of edges
   Calculations for path finding can have even
higher complexity
The Problem
   Example - Rexa citation graph
   Papers in computer science and related fields
   Largest connected component contains 165,000
nodes (papers) and 321,000 edges (citations)
   Finding a path of length 15 requires the exploration of
65,000 nodes
Network Structure Index (NSI)
   Similar to the type of index commonly used to speed
queries in modern database systems
   Can be constructed once for a given graph and then used
to speed the calculations of many measures on the graph
   Two components of an NSI
   Set of annotations on every node in the network that provide
information about relative or absolute location
   For G(V,E) the annotations define A: V → S, where S is an
arbitrarily complex “annotation space”
   A distance function that uses the annotations to define graph
distance between pairs of nodes by mapping pairs of node
annotations to a positive real number
   D: S × S → R
Types of Network Structure Indices
 All Pairs Shortest Path (APSP)
 Degree
 Landmark
 Global Network Positioning (GNP)
 Zone
 Distance to Zone (DTZ)
All Pairs Shortest Path NSI
   Node annotations
   Consist of an n x n matrix (n = |V|) containing the
optimal path distances between all pairs of nodes
   Distance function
   A simple lookup in the matrix
Degree NSI
   Node annotations
   Annotate each node with its undirected degree within
the graph
   Distance function between source node s and
target node t
   D_Degree(s, t) = 2n - degree(s) - degree(t)
Landmark NSI
 Randomly designate a small number of nodes in
the network to serve as navigational beacons
 Node annotations
   Annotate nodes in the graph by flooding out from
each landmark and recording the graph distance to
each node in the network
   Gives a vector of graph distances for each node
   Distance function

Global Network Positioning NSI
   Node annotation
   Annotation uses a nonlinear optimization algorithm
to create a multidimensional coordinate system that
encodes the location of each node within the network
   Distance function is the Manhattan distance
between node pairs

Zone NSI
   Node annotations
   Each node is annotated with a d-dimensional vector of
zone labels
   Distance function


Zone NSI Algorithm
   For d dimensions
   Randomly select k seed nodes, assign them zone
labels 1 through k, and place them in the labeled set
   Place all other nodes in the unlabeled set
   While the unlabeled set is not empty
   Randomly select a node l from the labeled set
   Randomly select a node u from the unlabeled set that is a
neighbor to l
   Assign u to the same zone as l and move it to the labeled set
Distance to Zone (DTZ) NSI
 Hybrid between Landmark and Zone NSIs
 Node annotations
   Divide the graph into zones and for each node u and
zone Z calculate the distance from u to the closest
node in Z
   Distance function

Complexity of Different NSIs
Search Performance
   Optimality of the lengths of paths found
   Path ratio



   pf is the length of the found paths
   po is the length of the optimal paths
   r is the number of randomly selected pairs of nodes in
the graph
   P = 1.0 indicates an NSI that finds optimal paths
   P >> 1.0 indicates a poorly performing NSI
Search Performance
   Performance gain
   Exploration ratio



   ef is the number of nodes explored by best-first search
   eb is the number of nodes explored using a bidirectional
breadth-first search
   r is the number of randomly selected pairs of nodes in
the graph
   E values close to zero indicate good search performance
   E values greater than 1.0 indicate poor search
performance
Search Performance
   NSIs evaluated on synthetic graphs
   Random
   Rewired lattices
   Forest Fire
Constant Time Distance Estimation
 Can sometimes use an NSI to directly estimate the
graph distance between any two nodes
 Can use the DTZ annotation distance to estimate
actual graph distances
   Annotate the graph as described for the DTZ NSI
   Randomly sample p pairs of nodes in the graph and
perform breadth-first search to obtain their exact graph
distance
   Use linear regression to obtain an equation for
estimated distance
Constant Time Distance Estimation
   Simple distance can be used to produce a wide
variety of attributes on nodes, which can be used
by data mining algorithms that analyze graphs
   Label nodes with their distance to a particular node in a
graph
   How close is each actor to Kevin Bacon?
   Label nodes with the minimum or maximum distance
to one of a set of designated nodes
   How close is each actor to an Academy Award winner?
Closeness Centrality
   Measures the proximity of a given node in a
network to every other node


 Important to social network dynamics
 Accurate estimates of closeness centrality often
impossible to calculate for large data sets
 Using an NSI for path finding can estimate
closeness centrality efficiently
Closeness Centrality
   A measure of centrality can be used to produce
attributes on nodes that may be useful to
knowledge discovery algorithms
   Determine the closeness of every node to a collection
of key nodes
   Closeness to all winners of Academy Awards for best actor in
the past 10 years
   Constrain closeness calculations for members of
clusters
   Closeness rank of an actor within their movie industry
   Weight closeness based on the attributes of the
outlying nodes
   Closeness to winners of Academy Awards, weighted by how
recent the award is
Betweenness Centrality
   Measures the number of short paths on which a
given node lies


 Important to social network dynamics
 Accurate estimates of betweenness centrality
often impossible to calculate for large data sets
Betweenness Centrality
 Can estimate betweenness using the paths found by NSI-guided search
 Randomly sample pairs of nodes and discover the
shortest path between them
 Count the number of times each node in the graph
appears on one of these paths to obtain a
betweenness ranking
Betweenness Centrality
   A high betweenness score can indicate a bridge
between two communities
   An actor that has played in movies belonging to
different movie industries
   Betweenness centrality can be used to create
features on nodes that are useful for data mining
   Calculate betweenness centrality for particular groups
of nodes
   Actors that sit between winners of Academy Awards for best
picture and the IMDb’s “Bottom 100”, the worst 100 movies as
voted by users of the Internet Movie Database
Conclusions
 The NSIs Zone and DTZ allow efficient and
accurate estimation of path lengths between
arbitrary nodes in a network
 Efficient calculation of network statistics allows a
broader range of potential approaches to knowledge
discovery
 Not all potential NSIs have been exhaustively
researched
 NSIs could have other applications
   Finding connection subgraphs
   Approximating neighborhood functions
Questions?

```
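The labeling loop on the "Zone NSI Algorithm" slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The names `zone_labels`, `zone_annotations`, and `grid_graph` are hypothetical, and the graph is assumed to be undirected, connected, and stored as an adjacency dictionary. The sketch draws the labeled node `l` from a list of labeled nodes that may still have unlabeled neighbors, which gives the same choice as drawing from the whole labeled set and retrying whenever `l` has no unlabeled neighbor.

```python
import random


def zone_labels(adj, k, rng):
    """Grow k zones over a connected undirected graph (one dimension)."""
    seeds = rng.sample(sorted(adj), k)
    zone = {s: i + 1 for i, s in enumerate(seeds)}  # zone labels 1..k
    # Labeled nodes that may still have unlabeled neighbors.
    frontier = list(seeds)
    while len(zone) < len(adj):
        if not frontier:
            raise ValueError("graph is not connected")
        i = rng.randrange(len(frontier))
        l = frontier[i]
        candidates = [u for u in adj[l] if u not in zone]
        if not candidates:
            # l can never label another node, so drop it (swap-remove).
            frontier[i] = frontier[-1]
            frontier.pop()
            continue
        u = rng.choice(candidates)
        zone[u] = zone[l]  # u joins the same zone as l
        frontier.append(u)
    return zone


def zone_annotations(adj, k, d, seed=0):
    """Annotate each node with a d-dimensional vector of zone labels."""
    rng = random.Random(seed)
    dims = [zone_labels(adj, k, rng) for _ in range(d)]
    return {v: tuple(dim[v] for dim in dims) for v in adj}


def grid_graph(rows, cols):
    """Small example graph: a rows x cols lattice (illustration only)."""
    adj = {(r, c): [] for r in range(rows) for c in range(cols)}
    for r, c in adj:
        for dr, dc in ((1, 0), (0, 1)):
            nb = (r + dr, c + dc)
            if nb in adj:
                adj[(r, c)].append(nb)
                adj[nb].append((r, c))
    return adj


annotations = zone_annotations(grid_graph(4, 4), k=3, d=2, seed=1)
```

Because a node is only ever labeled through an already labeled neighbor, every zone in a given dimension is a connected region of the graph. Repeating the process `d` times with fresh random seeds gives each node its `d`-dimensional vector of zone labels.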
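The sampling procedure on the "Betweenness Centrality" slides (sample node pairs, find a short path between each pair, count how often each node appears) can be sketched as below. This is an illustration under stated assumptions. Plain breadth-first search stands in for the NSI-guided search used in the talk, and the function names `bfs_path` and `sampled_betweenness_ranking` are hypothetical. Counting only interior nodes of each path, rather than the endpoints as well, is also a choice made here.

```python
import random
from collections import Counter, deque


def bfs_path(adj, s, t):
    """Return one shortest path from s to t as a list, or None."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        if v == t:
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)
    return None


def sampled_betweenness_ranking(adj, num_pairs, seed=0):
    """Rank nodes by how often they lie inside sampled shortest paths."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    counts = Counter({v: 0 for v in nodes})
    for _ in range(num_pairs):
        s, t = rng.sample(nodes, 2)
        path = bfs_path(adj, s, t)
        if path:
            counts.update(path[1:-1])  # interior nodes only (assumption)
    return [v for v, _ in counts.most_common()]


# Example: a star graph, where the hub (node 0) bridges every leaf pair.
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
ranking = sampled_betweenness_ranking(star, num_pairs=50)
```

The result is a ranking rather than an exact betweenness score, which matches the slide's goal of obtaining a betweenness ranking from a sample of paths.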