lecture2

					    School of Information
   University of Michigan




                 SI 614
Basic network concepts and intro to Pajek



                             Lecture 2
                     Instructor: Lada Adamic
                          Outline

 Basic network metrics
 Bipartite graphs
 Graph theory in math
 Pajek
                 Network elements: edges

 Directed (also called arcs)
    A -> B
        A likes B, A gave a gift to B, A is B's child
 Undirected
    A <-> B or A – B
       A and B like each other
       A and B are siblings
       A and B are co-authors


 Edge attributes
      weight (e.g. frequency of communication)
      ranking (best friend, second best friend…)
      type (friend, relative, co-worker)
      properties depending on the structure of the rest of the graph:
       e.g. betweenness
                                  Directed networks
    girls' school dormitory dining-table partners (Moreno, The sociometry reader, 1960)
    first and second choices shown

    [Figure: directed network of 26 girls: Ada, Adele, Alice, Anna, Betty, Cora, Edna,
     Ella, Ellen, Eva, Frances, Hazel, Helen, Hilda, Irene, Jane, Jean, Laura, Lena,
     Louise, Marion, Martha, Mary, Maxine, Robin, Ruth]
Edge weights can have positive or negative values

  One gene activates/inhibits another
  One person trusting/distrusting another
  Research challenge: How does one 'propagate' negative feelings in a social
   network? Is my enemy's enemy my friend?

  [Figure: transcription regulatory network in baker's yeast]
                            Adjacency matrices

 Representing edges (who is adjacent to whom) as a matrix

    Aij = 1 if node i has an edge to node j
        = 0 if node i does not have an edge to node j

    Aii = 0 unless the network has self-loops

    Aij = Aji if the network is undirected,
        or if i and j share a reciprocated edge
Example (directed network on nodes 1-5):

           0 0 0 0 0
           0 0 1 1 0
     A =   0 1 0 1 0
           0 0 0 0 1
           1 1 0 0 0

   [Figure: the corresponding directed graph on nodes 1-5]
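
A minimal Python sketch of how such a matrix could be built. The edge list is read off the non-zero entries of the example matrix; the function names `adjacency_matrix` and `is_symmetric` are ours and not part of any library.

```python
# Directed edges (i, j) read off the example matrix; nodes are numbered 1..5.
edges = [(2, 3), (2, 4), (3, 2), (3, 4), (4, 5), (5, 2), (5, 1)]
n = 5


def adjacency_matrix(edges, n):
    """Return an n x n list of lists with A[i][j] = 1 if i -> j is an edge."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i - 1][j - 1] = 1  # shift the 1-based node labels to 0-based indices
    return A


def is_symmetric(A):
    """True if Aij == Aji for all i, j (as in an undirected network)."""
    size = len(A)
    return all(A[i][j] == A[j][i] for i in range(size) for j in range(size))


A = adjacency_matrix(edges, n)
for row in A:
    print(" ".join(map(str, row)))
print("symmetric:", is_symmetric(A))
```

Because the example network is directed and not every edge is reciprocated, the symmetry check fails for it.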
                       Adjacency lists

 Edge list
      2 3
      2 4
      3 2
      3 4
      4 5
      5 2
      5 1

   [Figure: the directed graph on nodes 1-5]
 Adjacency list
    is easier to work with if network is
        large
        sparse
    quickly retrieve all neighbors for a node
           1:
           2: 3 4
           3: 2 4
           4: 5
           5: 1 2
                                  Nodes

 Node network properties
    from immediate connections
        indegree
          how many directed edges (arcs) are incident on a node (in the figure: indegree=3)

        outdegree
          how many directed edges (arcs) originate at a node (in the figure: outdegree=2)

        degree (in and out combined)
          number of edges incident on a node (in the figure: degree=5)

    from the entire graph
        centrality (betweenness, closeness)
                             Node degree from matrix values

   [Figure: the directed graph on nodes 1-5]

           0 0 0 0 0
           0 0 1 1 0
     A =   0 1 0 1 0
           0 0 0 0 1
           1 1 0 0 0

     Outdegree of node i = $\sum_{j=1}^{n} A_{ij}$

        example: outdegree for node 3 is 2, which we obtain by summing the
        number of non-zero entries in the 3rd row: $\sum_{j=1}^{n} A_{3j} = 2$

     Indegree of node j = $\sum_{i=1}^{n} A_{ij}$

        example: the indegree for node 3 is 1, which we obtain by summing the
        number of non-zero entries in the 3rd column: $\sum_{i=1}^{n} A_{i3} = 1$
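
The row and column sums can be checked with a few lines of Python. This is a sketch using plain lists; the function names `outdegree` and `indegree` are ours, and nodes are numbered from 1 as on the slide.

```python
# Adjacency matrix of the 5-node example network (row i, column j: edge i -> j).
A = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
]


def outdegree(A, i):
    """Sum of row i (1-based): number of edges leaving node i."""
    return sum(A[i - 1])


def indegree(A, j):
    """Sum of column j (1-based): number of edges arriving at node j."""
    return sum(row[j - 1] for row in A)


print("outdegree of node 3:", outdegree(A, 3))
print("indegree of node 3:", indegree(A, 3))
```

For a 0/1 matrix, summing a row or column is the same as counting its non-zero entries.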
                  Other node attributes

 take your pick…
    geographical location
    function
    musical tastes…




 Homophily: tendency of like individuals to associate with one
   another
       Network metrics: degree sequence and degree
                        distribution

 Degree sequence: An ordered list of the (in,out) degree of each node

     In-degree sequence:
         [2, 2, 2, 1, 1, 1, 1, 0]
     Out-degree sequence:
         [2, 2, 2, 2, 1, 1, 1, 0]
     (undirected) degree sequence:
         [3, 3, 3, 2, 2, 1, 1, 1]



 Degree distribution: A frequency count of the occurrence of each degree

         In-degree distribution:
             [(2,3) (1,4) (0,1)]
         Out-degree distribution:
             [(2,4) (1,3) (0,1)]
         (undirected) distribution:
             [(3,3) (2,2) (1,3)]

   [Figure: histogram of the in-degree distribution; x-axis: indegree, y-axis: frequency]
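
Going from a degree sequence to a degree distribution is a frequency count. A minimal sketch with the standard-library `Counter` follows; the function name `degree_distribution` is ours.

```python
from collections import Counter

# Degree sequences as given on the slide.
in_degrees = [2, 2, 2, 1, 1, 1, 1, 0]
out_degrees = [2, 2, 2, 2, 1, 1, 1, 0]
undirected_degrees = [3, 3, 3, 2, 2, 1, 1, 1]


def degree_distribution(sequence):
    """Return (degree, frequency) pairs, highest degree first."""
    counts = Counter(sequence)
    return sorted(counts.items(), reverse=True)


print(degree_distribution(in_degrees))
print(degree_distribution(out_degrees))
print(degree_distribution(undirected_degrees))
```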
     Network metrics: connected components
 Strongly connected components
    Each node within the component can be reached from every other node
      in the component by following directed links

      Strongly connected components
          B C D E
          A
          G H
          F

 Weakly connected components: every node can be reached from every
   other node by following links in either direction

      Weakly connected components
          A B C D E
          G H F

   [Figure: directed graph on nodes A-H]

 In undirected networks one talks simply about
   'connected components'
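
A small Python sketch of both definitions. The slide's figure is not recoverable here, so the edge set below is hypothetical, chosen only so that it reproduces the components listed above. The approach (intersecting forward and backward reachability) is a simple illustration, not an efficient algorithm for large graphs.

```python
from collections import defaultdict

# Hypothetical directed edges, consistent with the components listed above.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "B"),
         ("G", "H"), ("H", "G"), ("F", "G")]
nodes = sorted({v for edge in edges for v in edge})


def reachable(start, adj):
    """Set of nodes reachable from start by following adj (iterative DFS)."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def components(nodes, edges, strong):
    """Strongly (strong=True) or weakly (strong=False) connected components."""
    forward, backward = defaultdict(set), defaultdict(set)
    for u, v in edges:
        forward[u].add(v)
        backward[v].add(u)
    # Undirected view: links may be followed in either direction.
    either = {u: forward[u] | backward[u] for u in nodes}
    result, assigned = [], set()
    for u in nodes:
        if u in assigned:
            continue
        if strong:
            # u's component: nodes that u can reach AND that can reach u.
            comp = reachable(u, forward) & reachable(u, backward)
        else:
            comp = reachable(u, either)
        assigned |= comp
        result.append(sorted(comp))
    return result


print("strong:", components(nodes, edges, strong=True))
print("weak:  ", components(nodes, edges, strong=False))
```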
                Network metrics: shortest paths
 Shortest path (also called a geodesic path)
    The shortest sequence of links connecting two nodes
    Not always unique

       A and C are connected by 2 shortest paths
           A–E–B–C
           A–E–D–C

 Diameter: the largest geodesic distance in the graph

       The distance between A and C is the
         maximum for the graph: 3

   [Figure: example graph on nodes A-E]

 Caution: some people use the term 'diameter' to mean the average shortest
   path distance; in this class we will use it only to refer to the maximal distance
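
Geodesic distances, the number of shortest paths, and the diameter can all be obtained with breadth-first search. In the sketch below the edge list is an assumption: it is the smallest undirected graph consistent with the two shortest paths listed above, not necessarily the exact graph in the figure. The function names are ours.

```python
from collections import deque

# Assumed undirected edges, consistent with the paths A-E-B-C and A-E-D-C.
edges = [("A", "E"), ("E", "B"), ("E", "D"), ("B", "C"), ("D", "C")]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)


def bfs(source):
    """Return (dist, count): geodesic distance from source to each node and
    the number of distinct shortest paths that realize it."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:  # first time v is reached
                dist[v] = dist[u] + 1
                count[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u lies on a shortest path to v
                count[v] += count[u]
    return dist, count


def diameter():
    """Largest geodesic distance over all pairs (graph assumed connected)."""
    return max(max(bfs(s)[0].values()) for s in adj)


dist, count = bfs("A")
print("distance A-C:", dist["C"], "shortest paths:", count["C"])
print("diameter:", diameter())
```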
          Giant components and the web graph
 if the largest component encompasses a significant fraction of the graph,
   it is called the giant component
               The bowtie model of the web


 The Web is a directed graph:
    webpages link to other
     webpages
 The connected components
    tell us what set of pages can
    be reached from any other just
    by surfing (no 'jumping' around
    by typing in a URL or using a
    search engine)
   Broder et al. 1999 – crawl of
    over 200 million pages and 1.5
    billion links.
   SCC – 27.5%
   IN and OUT – 21.5% each
   Tendrils and tubes – 21.5%
   Disconnected – 8%
                                             image: Mark Levene
            bipartite (two-mode) networks


 edges occur only between two groups of nodes, not
  within those groups
 for example, we may have individuals and events
    directors and boards of directors
    customers and the items they purchase
    metabolites and the reactions they participate in
  going from a bipartite to a one-mode graph

 Two-mode network
    [Figure: two-mode network with group 1 nodes on top and group 2 nodes below]

 One-mode projection
    two nodes from the first group are connected if they link to the same
     node in the second group
    some loss of information
    naturally high occurrence of cliques
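
The projection rule can be sketched directly with sets. The data below is hypothetical (five group-1 nodes `a`-`e` linking to group-2 nodes 1-4), and `project` is our own helper name.

```python
from itertools import combinations

# Hypothetical two-mode data: each group-1 node maps to the set of
# group-2 nodes it links to.
links = {"a": {1}, "b": {1}, "c": {1, 2}, "d": {1, 2, 3, 4}, "e": {4}}


def project(links):
    """One-mode projection onto group 1: connect two nodes if they share at
    least one group-2 neighbor; the value is the number of shared neighbors."""
    projected = {}
    for u, v in combinations(sorted(links), 2):
        shared = len(links[u] & links[v])
        if shared:
            projected[(u, v)] = shared
    return projected


for pair, weight in project(links).items():
    print(pair, weight)
```

Every group-1 node linked to group-2 node 1 (`a`, `b`, `c`, `d`) ends up connected to every other, which illustrates why cliques are so common in projections. The projection also no longer records which group-2 node created each link, which is the loss of information mentioned above.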
                  Now in matrix notation

 Bij
    = 1 if node i from the first group
         links to node j from the second group
    = 0 otherwise

 B is usually not a square matrix!
    for example: we have n customers and m products

           1 0 0 0
           1 0 0 0
     B =   1 1 0 0
           1 1 1 1
           0 0 0 1
         Collapsing to a one-mode network
 i and k are linked if they both link to j
 $A_{ik} = \sum_j B_{ij} B_{kj}$

 $A = B B^T$

   [Figure: nodes i and k both linking to j=1 and j=2]

 the transpose of a matrix swaps Bxy and Byx
 if B is an n x m matrix, $B^T$ is an m x n matrix

           1 0 0 0
           1 0 0 0                1 1 1 1 0
     B =   1 1 0 0       B^T =    0 0 1 1 0
           1 1 1 1                0 0 0 1 0
           0 0 0 1                0 0 0 1 1
                       Matrix multiplication

   general formula for matrix multiplication: $Z_{ij} = \sum_k X_{ik} Y_{kj}$
   let Z = A, X = B, Y = B^T

           1 0 0 0                            1 1 1 1 0
           1 0 0 0       1 1 1 1 0            1 1 1 1 0
     A =   1 1 0 0   x   0 0 1 1 0     =      1 1 2 2 0
           1 1 1 1       0 0 0 1 0            1 1 2 4 1
           0 0 0 1       0 0 0 1 1            0 0 0 1 1

   example: entry (4,3) is row 4 of B times column 3 of B^T:
     (1 1 1 1) . (1 1 0 0) = 1*1 + 1*1 + 1*0 + 1*0 = 2

   [Figure: the resulting one-mode network with edge weights]
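
The product can be verified with a short pure-Python sketch. The helper names `transpose` and `matmul` are ours; a numerical library would normally be used for larger matrices.

```python
# Two-mode matrix B from the example: 5 group-1 nodes (rows), 4 group-2 nodes (columns).
B = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]


def transpose(M):
    """Swap rows and columns: entry (x, y) becomes entry (y, x)."""
    return [list(column) for column in zip(*M)]


def matmul(X, Y):
    """Z[i][j] = sum over k of X[i][k] * Y[k][j]."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(X))]


A = matmul(B, transpose(B))
for row in A:
    print(" ".join(map(str, row)))
```

`A[3][2]` (row 4, column 3 in the slide's 1-based numbering) is the entry worked out by hand above.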
     Collapsing a two-mode network to a one-mode network

  Assume the nodes in group 1 are people and the nodes
   in group 2 are movies
  The diagonal entries of A give the number of movies
   each person has seen
  The off-diagonal elements of A give the number of
   movies that both people have seen
  A is symmetric


           1 1 1 1 0
           1 1 1 1 0
     A =   1 1 2 2 0
           1 1 2 4 1
           0 0 0 1 1

   [Figure: the one-mode network with edge weights]
Networks of actors
                   History: Graph theory

 Euler's Seven Bridges of Königsberg – one of the first problems in
  graph theory
 Is there a route that crosses each bridge only once and returns to
  the starting point?
                            Eulerian paths

 If starting point and end point are the same:
     only possible if no nodes have an odd degree
         each path must visit and leave each shore
 If you don't need to return to the starting point
     can have 0 or 2 nodes with an odd degree




  Eulerian path: traverse each edge exactly once
  Hamiltonian path: visit each vertex exactly once
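
The odd-degree rule is easy to check in code. The sketch below encodes the seven bridges as a multigraph; the node labels (N and S for the two river banks, K and E for the two islands) and the function names are ours, and the rule as written assumes the graph is connected.

```python
from collections import Counter

# The seven bridges of Königsberg as a multigraph.
# N, S: north and south banks; K, E: the two islands (labels are ours).
bridges = [("N", "K"), ("N", "K"), ("S", "K"), ("S", "K"),
           ("N", "E"), ("S", "E"), ("K", "E")]


def odd_degree_nodes(edges):
    """Nodes touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(v for v, d in degree.items() if d % 2 == 1)


def euler_status(edges):
    """Classify a connected multigraph by its number of odd-degree nodes."""
    odd = len(odd_degree_nodes(edges))
    if odd == 0:
        return "circuit"  # can start and end at the same node
    if odd == 2:
        return "path"     # must start and end at the two odd-degree nodes
    return "none"


print(odd_degree_nodes(bridges))
print(euler_status(bridges))
```

All four land masses have odd degree, so no route crosses each bridge exactly once, with or without returning to the start.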
  Bi-cliques (cliques in bipartite graphs)

 Km,n is the complete bipartite graph with m and n vertices of the
  two different types
 K3,3 maps to the utility graph
     Is there a way to connect three utilities, e.g. gas, water, electricity to
      three houses without having any of the pipes cross?




   [Figures: the utility graph; K3,3]
                     Planar graphs

 A graph is planar if it can be drawn on a plane without
  any edges crossing
            When graphs are not planar

 Two graphs are homeomorphic if you can make one
  into the other by repeatedly inserting a vertex of degree 2 into an edge
  (or removing such a vertex and merging its two edges)
             Cliques and complete graphs

 Kn is the complete graph (clique) with n vertices
    each vertex is connected to every other vertex
    there are n*(n-1)/2 undirected edges




     K3                       K5                      K8
                   Petersen graph

 Example of using edge contractions to show a graph is
  not planar
                   Edge contractions defined




 A finite graph G is planar if and only if it has no subgraph that is
   homeomorphic or edge-contractible to the complete graph on five vertices
   (K5) or the complete bipartite graph K3,3. (Kuratowski's Theorem)
                            graph density
 Of the connections that may exist between n nodes
    directed graph
      emax = n*(n-1)
        each of the n nodes can connect to (n-1) other nodes
    undirected graph
      emax = n*(n-1)/2
      since edges are undirected, count each one only once

 What fraction are present?
   density = e/ emax

     For example, out of 12
       possible connections, this graph
       has 7, giving it a density of
       7/12 = 0.583

 But it is more difficult for a larger network
   to achieve the same density

 so density is not a useful measure for comparing networks of different sizes
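
A minimal sketch of the density calculation. The function names are ours; the first call assumes the example graph is directed with n = 4 nodes, which is what 12 possible connections implies.

```python
def max_edges(n, directed=True):
    """Largest possible number of edges among n nodes (no self-loops)."""
    return n * (n - 1) if directed else n * (n - 1) // 2


def density(e, n, directed=True):
    """Fraction of the possible edges that are actually present."""
    return e / max_edges(n, directed)


# The slide's example: 7 of 12 possible directed connections.
print(round(density(7, 4), 3))

# With a fixed 3 outgoing edges per node, density shrinks as n grows.
for n in (4, 10, 100):
    print(n, round(density(3 * n, n), 4))
```

The loop illustrates why a larger network struggles to reach the same density: holding the number of edges per node fixed, the number of possible edges grows roughly with the square of n.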
#s of planar graphs of different sizes


 1:1

 2:2

 3:4

 4:11




              Every planar graph
              has a straight line
              embedding
              (homework exercise)
                         Trees

 Trees are connected undirected graphs that contain no cycles
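
One way to test for cycles in an undirected graph is a union-find pass over the edges. This is a sketch with our own function names. It also uses the standard fact that an acyclic graph on n nodes is connected exactly when it has n-1 edges.

```python
def has_cycle(nodes, edges):
    """True if the undirected graph contains a cycle (union-find)."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent pointers to the representative, halving the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return True  # u and v were already connected: this edge closes a cycle
        parent[root_u] = root_v
    return False


def is_tree(nodes, edges):
    """Acyclic and connected (acyclic with exactly n-1 edges)."""
    return not has_cycle(nodes, edges) and len(edges) == len(nodes) - 1


print(is_tree(["a", "b", "c"], [("a", "b"), ("b", "c")]))
print(is_tree(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")]))
```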
                      examples of trees

 In nature
    trees
    river networks
    arteries (or veins, but not both)
 Man made
    sewer system
 Computer science
    binary search trees
    decision trees (AI)
 Network analysis
    minimum spanning trees
       from one node – how to reach all other nodes most quickly
       may not be unique, because shortest paths are not always unique
       depends on weight of edges
   Using Pajek for exploratory social network analysis

 Pajek (pronounced 'Pah-yek' in Slovenian) means 'spider'

 website: vlado.fmf.uni-lj.si/pub/networks/pajek/
    download application (free)
    tutorials
    lectures
    data sets

 Windows only (works on Linux via Wine)

 can be installed via NAL in the student lab (DIAD)

 helpful book: 'Exploratory Social Network Analysis with Pajek' by
   Wouter de Nooy, Andrej Mrvar and Vladimir Batagelj
     first 2 chapters are required reading and on cTools
                         Pajek interface
   things we'll use right away:

Drop-down list of networks opened or created with Pajek. The active one is displayed.

Drop-down list of network partitions by discrete variables, e.g. degree, mode, label

Drop-down list of continuous node attributes, e.g. centrality, clustering coefficients

   things we'll use later for clustering
                             opening a network file
click on folder icon
to open a file




Save changes to your network, network partitions, etc., if you'd like to keep them
        Working with network files in Pajek

 The active network, partition, etc. is shown at the top of the
  drop-down list




                                   Draw the network
                              Pajek data format

   [Figure: excerpt of the network drawing showing Ada, Cora and Louise]

   *Vertices 26                                 (number of vertices)
      1 "Ada"       0.1646   0.2144   0.5000    (vertex x,y,z coordinates are optional)
      2 "Cora"      0.0481   0.3869   0.5000
      3 "Louise"    0.3472   0.1913   0.5000
      ..
   *Arcs                                        (directed edges)
      1 3 2 c Black                             (from Ada(1) to Louise(3) as choice "2" and color Black)
      ..
   *Edges                                       (undirected edges)
      1 2 1 c Black                             (between Ada(1) and Cora(2) as choice "1" and color Black)
      ..
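
To make the layout concrete, here is a rough Python sketch that reads a file in this shape. It covers only the constructs shown on the slide (a `*Vertices` section with quoted labels, then `*Arcs` and `*Edges` lines of the form "from to value ..."). The function name `parse_pajek` is ours, and the sample is shortened to the three vertices shown, so its vertex count is 3 rather than 26.

```python
import shlex

# Shortened sample containing only the lines visible on the slide.
sample = """*Vertices 3
1 "Ada" 0.1646 0.2144 0.5000
2 "Cora" 0.0481 0.3869 0.5000
3 "Louise" 0.3472 0.1913 0.5000
*Arcs
1 3 2 c Black
*Edges
1 2 1 c Black
"""


def parse_pajek(text):
    """Return (labels, arcs, edges) from Pajek-style text.

    labels: {vertex number: label}
    arcs, edges: lists of (from, to, value); extra fields such as the
    color are ignored in this sketch.
    """
    labels, arcs, edges = {}, [], []
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("*"):
            section = line.split()[0].lower()  # "*vertices", "*arcs" or "*edges"
            continue
        fields = shlex.split(line)  # keeps a quoted label as one field
        if section == "*vertices":
            labels[int(fields[0])] = fields[1]
        elif section in ("*arcs", "*edges"):
            u, v = int(fields[0]), int(fields[1])
            value = float(fields[2]) if len(fields) > 2 else 1.0
            (arcs if section == "*arcs" else edges).append((u, v, value))
    return labels, arcs, edges


labels, arcs, edges = parse_pajek(sample)
for u, v, value in arcs:
    print(f"{labels[u]} -> {labels[v]} (choice {value:g})")
for u, v, value in edges:
    print(f"{labels[u]} -- {labels[v]} (choice {value:g})")
```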
                  Live demo of Pajek


 Opening a network
 Visualization
 Essential measurements
                  Final project guidelines

 Work individually or in groups (up to 4 people)
 Important dates
    Feb. 13th Project proposals due (5%)
       1 page abstract & 5 minute class presentation
    March 20th Project status report due (5%)
       3-6 pages of
            result summaries (including figures and tables)
            plan of remaining work
    April 17th in class student presentations of results (5%)
    April 24th final project reports due (25%)
       6-12 pages of
            related work
            main results
            'future' work/extensions
                               Final Project
 Option 1: Analyze a network
    What it should be
        More than just a measurement of the average shortest path, clustering
         coefficient, and degree distribution
             An interpretation of measurement results
             If applicable:
                   discovery of community or other structure
                   assortativity
                   motifs
                   weights, thresholds
                   longitudinal data (how the network changes over time)

        Visualizations of all or part of the network that point out a particular feature
        Qualitative comparison with other networks

    What it should not be
       a literature review

    The data can be artificially generated or a real-world dataset
    If you intend to work on data concerning human subjects, you may need
      to start an IRB application ASAP
                              Final Project

 Option 2: New network model
    What it should be
       Method for generating a network
            e.g. preferential attachment
            optimization wrt. different criteria
        Analysis of resulting network
            comparison with random graphs
            how do attributes change depending on model parameters


    What it should not be
       an already thoroughly explored model
                            Final Project

 Option 3: Novel algorithm
    What it should be
       An algorithm to analyze the network
            e.g. clustering or community detection algorithm
            webpage ranking algorithm
        OR a process that is influenced by the network
           gossip spreading
           games such as the prisoner's dilemma


        Analysis of algorithm on several different networks


    What it should not be
       an exact replica of an existing algorithm applied to a network where
        it has already been studied

				