					 Using Structure Indices for
 Efficient Approximation of
     Network Properties
Matthew J. Rattigan, Marc Maier, and David Jensen
      University of Massachusetts Amherst

                  Data Mining
               November 27, 2006
                Deborah Stoffer
The Problem
   Recent research works with very large networks
       Millions of nodes
   Calculating network statistics on very large
    networks can be difficult
       Shortest paths
       Betweenness centrality
            The proportion of all shortest paths in the network that run
             through a given node
       Closeness centrality
            The average distance from the given node to every other node
             in the network
The Problem
   The most efficient known algorithms for
    calculating betweenness centrality and closeness
    centrality are O(ne + n^2 log n)
       n – number of nodes
       e – number of edges
   Calculations for path finding can have even
    higher complexity
       Require bidirectional breadth-first search
The Problem
   Example - Rexa citation graph
       Papers in computer science and related fields
       Largest connected component contains 165,000
        nodes (papers) and 321,000 edges (citations)
       Finding a path of length 15 requires the exploration of
        65,000 nodes
Network Structure Index (NSI)
   Similar to the type of index commonly used to speed
    queries in modern database systems
   Can be constructed once for a given graph and then used
    to speed the calculations of many measures on the graph
   Two components of an NSI
       Set of annotations on every node in the network that provide
        information about relative or absolute location
            For G(V,E) the annotations define A: V → S, where S is an
             arbitrarily complex “annotation space”
       A distance function that uses the annotations to define graph
        distance between pairs of nodes by mapping pairs of node
        annotations to a positive real number
            D: S × S → R
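As a rough illustration of how the two components fit together, the sketch below uses an annotation map and a distance function to guide a greedy best-first search. The function name `best_first_path`, the dict-of-neighbours graph representation, and the tie-breaking counter are assumptions made for this example, not details taken from the slides.

```python
import heapq

def best_first_path(graph, annotate, dist, source, target):
    """Greedy best-first search guided by an NSI (illustrative sketch).

    graph:    dict mapping node -> iterable of neighbours
    annotate: dict mapping node -> annotation (the map A: V -> S)
    dist:     function (annotation, annotation) -> number (the map D)
    Returns (path, explored); explored counts the nodes that were expanded.
    """
    target_ann = annotate[target]
    frontier = [(dist(annotate[source], target_ann), 0, source)]
    parent = {source: None}
    explored = 0
    tie = 1  # tie-breaker so the heap never has to compare node objects
    while frontier:
        _, _, node = heapq.heappop(frontier)
        explored += 1
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], explored
        for nb in graph[node]:
            if nb not in parent:
                parent[nb] = node
                heapq.heappush(frontier, (dist(annotate[nb], target_ann), tie, nb))
                tie += 1
    return None, explored  # target not reachable from source
```

The search always expands the frontier node whose annotation is closest to the target's annotation, so the quality of the path depends entirely on how well D tracks true graph distance.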
Types of Network Structure Indices
 All Pairs Shortest Path (APSP)
 Degree
 Landmark
 Global Network Positioning (GNP)
 Zone
 Distance to Zone (DTZ)
All Pairs Shortest Path NSI
   Node annotations
       Consist of an n x n matrix (n = |V|) containing the
        optimal path distances between all pairs of nodes
   Distance function
       A simple lookup in the matrix
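A minimal sketch of this index for an unweighted graph, assuming the graph is stored as a dict of neighbour lists and that one breadth-first search per node is an acceptable way to fill the matrix. The helper names are hypothetical, and the n x n table is stored as nested dicts.

```python
from collections import deque

def bfs_distances(graph, source):
    """Exact hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def build_apsp(graph):
    """Annotation step: one BFS per node fills the distance table."""
    return {node: bfs_distances(graph, node) for node in graph}

def apsp_distance(apsp, s, t):
    """Distance function: a simple lookup (infinity if t is unreachable)."""
    return apsp[s].get(t, float("inf"))
```

The table needs space quadratic in the number of nodes, which is why this index serves mainly as an optimal but expensive baseline.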
Degree NSI
   Node annotations
       Annotate each node with its undirected degree within
        the graph
   Distance function between source node s and
    target node t
       D_Degree(s, t) = 2n - degree(s) - degree(t)
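The Degree NSI is simple enough to write down directly. The sketch below assumes an undirected graph stored as a dict of neighbour lists; the function names are hypothetical.

```python
def degree_annotations(graph):
    """Annotate each node with its undirected degree."""
    return {node: len(set(neighbours)) for node, neighbours in graph.items()}

def degree_distance(deg_s, deg_t, n):
    """D_Degree(s, t) = 2n - degree(s) - degree(t), with n = |V|."""
    return 2 * n - deg_s - deg_t
```

Because the value shrinks as either degree grows, a search guided by this function is pulled toward high-degree nodes.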
Landmark NSI
 Randomly designate a small number of nodes in
  the network to serve as navigational beacons
 Node annotations
       Annotate nodes in the graph by flooding out from
        each landmark and recording the graph distance to
        each node in the network
       Gives a vector of graph distances for each node
   Distance function
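The annotation step for landmarks can be sketched as one breadth-first flood per landmark. This example covers only the annotations; it assumes an unweighted graph stored as a dict of neighbour lists, and the function names and the `seed` parameter are hypothetical.

```python
import random
from collections import deque

def bfs_distances(graph, source):
    """Exact hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def landmark_annotations(graph, num_landmarks, seed=0):
    """Pick random landmarks and give every node a vector of distances to them."""
    rng = random.Random(seed)
    landmarks = rng.sample(list(graph), num_landmarks)
    floods = [bfs_distances(graph, lm) for lm in landmarks]
    inf = float("inf")  # used when a node cannot reach a landmark
    vectors = {node: [flood.get(node, inf) for flood in floods] for node in graph}
    return landmarks, vectors
```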
Global Network Positioning NSI
   Node annotation
       Annotation uses a nonlinear optimization algorithm
        to create a multidimensional coordinate system that
        encodes the location of each node within the network
   Distance function is the Manhattan distance
    between node pairs
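For reference, the Manhattan distance between two coordinate vectors is the sum of the absolute differences in each dimension. The sketch below assumes the coordinates have already been produced by the optimization step, which is not shown; the function name is hypothetical.

```python
def manhattan_distance(coords_s, coords_t):
    """Sum of absolute coordinate differences between two nodes."""
    if len(coords_s) != len(coords_t):
        raise ValueError("coordinate vectors must have the same dimension")
    return sum(abs(a - b) for a, b in zip(coords_s, coords_t))
```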
Zone NSI
   Node annotations
       Each node is annotated with a d-dimensional vector of
        zone labels
   Distance function

Zone NSI Algorithm
   For d dimensions
       Randomly select k seed nodes, assign them zone
        labels 1 through k, and place them in the labeled set
       Place all other nodes in the unlabeled set
       While the unlabeled set is not empty
            Randomly select a node l from the labeled set
            Randomly select a node u from the unlabeled set that is a
             neighbor to l
            Assign u to the same zone as l and move it to the labeled set
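The steps above can be sketched as follows. To avoid repeatedly drawing labeled nodes that have no unlabeled neighbours left, this version keeps an `active` list and drops exhausted nodes from it; that bookkeeping, the function name, and the `seed` parameter are assumptions of the sketch rather than part of the algorithm as stated.

```python
import random

def zone_annotations(graph, k, d, seed=0):
    """Return dict node -> list of d zone labels (each in 1..k).

    graph: dict mapping node -> iterable of neighbours (undirected).
    Nodes unreachable from every seed receive None in that dimension.
    """
    rng = random.Random(seed)
    nodes = list(graph)
    labels = {node: [] for node in nodes}
    for _ in range(d):
        # Randomly select k seeds and label them 1 through k.
        zone = {s: z for z, s in enumerate(rng.sample(nodes, k), start=1)}
        active = list(zone)  # labeled nodes that may still have unlabeled neighbours
        while active:
            i = rng.randrange(len(active))
            l = active[i]
            candidates = [u for u in graph[l] if u not in zone]
            if not candidates:
                # l cannot grow its zone any further; remove it in O(1).
                active[i] = active[-1]
                active.pop()
                continue
            u = rng.choice(candidates)
            zone[u] = zone[l]  # u joins the zone of l
            active.append(u)
        for node in nodes:
            labels[node].append(zone.get(node))
    return labels
```

Each dimension uses a fresh random set of seeds, so two nodes that land in the same zone in many dimensions are likely to be close in the graph.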
Distance to Zone (DTZ) NSI
 Hybrid between Landmark and Zone NSIs
 Node annotations
       Divide the graph into zones and for each node u and
        zone Z calculate the distance from u to the closest
        node in Z
   Distance function
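One way to compute the annotations is a multi-source breadth-first search per zone, started from all members of the zone at once. The sketch covers a single zone assignment (one dimension) and only the annotation step; the function name and the dict-based representation are assumptions.

```python
from collections import deque

def distance_to_zones(graph, zone):
    """Return dict node -> {zone label: distance to the closest node in that zone}.

    graph: dict mapping node -> iterable of neighbours (undirected, unweighted)
    zone:  dict mapping node -> zone label
    """
    members = {}
    for node, label in zone.items():
        members.setdefault(label, []).append(node)
    annotations = {node: {} for node in graph}
    for label, sources in members.items():
        # Every member of the zone starts at distance 0.
        dist = {s: 0 for s in sources}
        queue = deque(sources)
        while queue:
            node = queue.popleft()
            for nb in graph[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        for node, value in dist.items():
            annotations[node][label] = value
    return annotations
```

With k zones this costs k breadth-first searches per dimension, regardless of how many nodes each zone contains.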
Complexity of Different NSIs
Search Performance
   Optimality of the lengths of paths found
       Path ratio


       p_f is the length of the found paths
       p_o is the length of the optimal paths
       r is the number of randomly selected pairs of nodes in
        the graph
       P = 1.0 indicates an NSI that finds optimal paths
       P >> 1.0 indicates a poor performing NSI
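A small sketch of how the path ratio could be computed from sampled results. It assumes P is the total length of the found paths divided by the total length of the optimal paths over the r sampled pairs; averaging the per-pair ratios is another plausible reading of the definitions above. The function name is hypothetical.

```python
def path_ratio(found_lengths, optimal_lengths):
    """Assumed form: P = sum(p_f) / sum(p_o) over the sampled node pairs."""
    if len(found_lengths) != len(optimal_lengths):
        raise ValueError("need one found and one optimal length per pair")
    return sum(found_lengths) / sum(optimal_lengths)
```

Under either reading, P equals 1.0 exactly when every found path is optimal.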
Search Performance
   Performance gain
       Exploration ratio


       e_f is the number of nodes explored by best-first search
       e_b is the number of nodes explored by a
        bidirectional breadth-first search
       r is the number of randomly selected pairs of nodes in
        the graph
       E values close to zero indicate good search performance
       E values greater than 1.0 indicate poor search
        performance
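To make the baseline concrete, here is a sketch of a bidirectional breadth-first search that also reports how many nodes it expanded, followed by the ratio itself. It assumes an undirected, unweighted graph stored as a dict of neighbour lists, and it assumes E is the total number of nodes explored by best-first search divided by the total explored by the baseline. The function names and the rule for choosing which side to expand are illustrative choices.

```python
def bidirectional_bfs(graph, source, target):
    """Return (distance, explored); distance is None if no path exists."""
    if source == target:
        return 0, 1
    dist_s, dist_t = {source: 0}, {target: 0}
    frontier_s, frontier_t = [source], [target]
    explored = 0
    while frontier_s and frontier_t:
        # Expand one full level from the side that has visited fewer nodes.
        expand_source = len(dist_s) <= len(dist_t)
        if expand_source:
            frontier, dist, other = frontier_s, dist_s, dist_t
        else:
            frontier, dist, other = frontier_t, dist_t, dist_s
        next_frontier, best = [], None
        for node in frontier:
            explored += 1
            for nb in graph[node]:
                if nb in other:
                    # The two searches meet across the edge (node, nb).
                    candidate = dist[node] + 1 + other[nb]
                    best = candidate if best is None else min(best, candidate)
                elif nb not in dist:
                    dist[nb] = dist[node] + 1
                    next_frontier.append(nb)
        if best is not None:
            return best, explored
        if expand_source:
            frontier_s = next_frontier
        else:
            frontier_t = next_frontier
    return None, explored

def exploration_ratio(explored_best_first, explored_baseline):
    """Assumed form: E = sum(e_f) / sum(e_b) over the sampled node pairs."""
    return sum(explored_best_first) / sum(explored_baseline)
```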
Search Performance
   NSIs evaluated on synthetic graphs
       Random
       Rewired lattices
       Forest Fire
Constant Time Distance Estimation
 Can sometimes use an NSI to directly estimate the
  graph distance between any two nodes
 Can use the DTZ annotation distance to estimate
  actual graph distances
       Annotate the graph as described for the DTZ NSI
       Randomly sample p pairs of nodes in the graph and
        perform breadth-first search to obtain their exact graph
        distance
       Use linear regression to obtain an equation for
        estimated distance
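The procedure above can be sketched with an ordinary least-squares fit of exact distance against annotation distance. The annotation distance is passed in as a callable `nsi_distance(s, t)`, which is a placeholder for this example, as are the function names and the `seed` parameter. The graph is assumed to be unweighted and stored as a dict of neighbour lists.

```python
import random
from collections import deque

def bfs_distance(graph, source, target):
    """Exact hop distance, or None if target is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return None

def fit_distance_estimator(graph, nsi_distance, p, seed=0):
    """Sample p node pairs, regress exact distance on annotation distance,
    and return a function estimating distance without any search."""
    rng = random.Random(seed)
    nodes = list(graph)
    xs, ys = [], []
    for _ in range(p):
        s, t = rng.sample(nodes, 2)
        exact = bfs_distance(graph, s, t)
        if exact is not None:  # skip disconnected pairs
            xs.append(nsi_distance(s, t))
            ys.append(exact)
    if not xs:
        raise ValueError("no connected pairs were sampled")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda s, t: intercept + slope * nsi_distance(s, t)
```

Once fitted, each estimate costs a single evaluation of the annotation distance, independent of graph size.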
Constant Time Distance Estimation
   Simple distance can be used to produce a wide
    variety of attributes on nodes, which can be used
    by data mining algorithms that analyze graphs
       Label nodes with their distance to a particular node in a
        graph
            How close is each actor to Kevin Bacon?
       Label nodes with the minimum or maximum distance
        to one of a set of designated nodes
            How close is each actor to an Academy Award winner?
Closeness Centrality
   Measures the proximity of a given node in a
    network to every other node

 Important to social network dynamics
 Accurate estimates of closeness centrality often
  impossible to calculate for large data sets
 Using an NSI for path finding can estimate
  closeness centrality efficiently
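A hedged sketch of the idea: average an estimated distance from the given node to a random sample of other nodes. Sampling the targets, the function name, and the `estimate_distance(s, t)` callable (for example, the length of a path found by NSI-guided search) are assumptions of this example.

```python
import random

def estimate_closeness(node, nodes, estimate_distance, sample_size, seed=0):
    """Average estimated distance from node to a sample of other nodes."""
    rng = random.Random(seed)
    others = [v for v in nodes if v != node]
    if not others:
        return 0.0  # a single-node graph has no distances to average
    sample = rng.sample(others, min(sample_size, len(others)))
    return sum(estimate_distance(node, v) for v in sample) / len(sample)
```

Lower values indicate a more central node, since closeness is defined here as an average distance.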
Closeness Centrality
   A measure of centrality can be used to produce
    attributes on nodes that may be useful to
    knowledge discovery algorithms
       Determine the closeness of every node to a collection
        of key nodes
            Closeness to all winners of Academy Awards for best actor in
             the past 10 years
       Constrain closeness calculations for members of
        a particular group
            Closeness rank of an actor within their movie industry
       Weight closeness based on the attributes of the
        outlying nodes
            Closeness to winners of Academy Awards weighted by how
             recently the award was won
Betweenness Centrality
   Measures the number of shortest paths on which a
    given node lies

 Important to social network dynamics
 Accurate estimates of betweenness centrality
  often impossible to calculate for large data sets
Betweenness Centrality
 Can estimate betweenness using the paths
  identified through NSI navigation
 Randomly sample pairs of nodes and discover the
  shortest path between them
 Count the number of times each node in the graph
  appears on one of these paths to obtain a
  betweenness ranking
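The sampling procedure above can be sketched as follows. Plain breadth-first search stands in for NSI navigation through the `find_path` parameter, and only interior nodes of each path are counted; both are choices made for this example, as are the function names.

```python
import random
from collections import Counter, deque

def bfs_path(graph, source, target):
    """Stand-in path finder: a shortest path as a list of nodes, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nb in graph[node]:
            if nb not in parent:
                parent[nb] = node
                queue.append(nb)
    return None

def estimate_betweenness_ranking(graph, num_pairs, find_path=bfs_path, seed=0):
    """Rank nodes by how often they appear inside paths between sampled pairs."""
    rng = random.Random(seed)
    nodes = list(graph)
    counts = Counter({node: 0 for node in nodes})
    for _ in range(num_pairs):
        s, t = rng.sample(nodes, 2)
        path = find_path(graph, s, t)
        if path:
            counts.update(path[1:-1])  # endpoints are not counted
    return [node for node, _ in counts.most_common()]
```

Swapping in an NSI-guided search for `find_path` trades some path optimality for far fewer explored nodes per pair.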
Betweenness Centrality
   A high betweenness score can indicate a bridge
    between two communities
       An actor that has played in movies belonging to
        different movie industries
   Betweenness centrality can be used to create
    features on nodes that are useful for data mining
       Calculate betweenness centrality for particular groups
        of nodes
            Actors that sit between winners of Academy Awards for best
             picture and the IMDb’s “Bottom 100”, the worst 100 movies as
             voted by users of the Internet Movie Database
 The NSIs Zone and DTZ allow efficient and
  accurate estimation of path lengths between
  arbitrary nodes in a network
 Efficient calculations of network statistics allow a
  broader range of potential approaches to knowledge
  discovery
 Not all potential NSIs have been exhaustively explored
 NSIs could have other applications
       Finding connection subgraphs
       Approximating neighborhood functions
