Peta-Graph Mining
					 CMU SCS




                 Large Graph Algorithms
                       Christos Faloutsos
                             CMU
                Akoglu, Leman           McGlohon, Mary
                Chau, Polo              Prakash, Aditya
                Kang, U                 Tong, Hanghang
                                        Tsourakakis, Babis
OpenCirrus'10                   C. Faloutsos (CMU)           #1




     Graphs - why should we care?



• Internet Map [lumeta.com]
• Food Web [Martinez ’91]
• Friendship Network [Moody ’01]
• Protein Interactions [genomebiology.com]
ICDM-LDMTA 2009                 C. Faloutsos                          2




     Graphs - why should we care?
• IR: bipartite graphs (doc-terms), linking
  documents D1 ... DN to terms T1 ... TM
• Web: hypertext graph
• Social networking sites (Facebook, Twitter)
•   Users posing and answering questions
•   Click-streams (user – page bipartite graph)
•   ... and more – any M:N db relationship




                Our goal:
One-stop solution for mining huge graphs:

PEGASUS project (PEta GrAph mining System)
• www.cs.cmu.edu/~pegasus
• Open-source code and papers







          Outline – Algorithms & results
                    Centralized      Hadoop/PEGASUS
   Degree Distr.        old               old
   PageRank             old               old
   Diameter/ANF         old              DONE
   Conn. Comp           old              DONE
   Triangles           DONE
   Visualization      STARTED





      HADI for diameter estimation
• Radius Plots for Mining Tera-byte Scale Graphs.
  U Kang, Charalampos Tsourakakis,
  Ana Paula Appel, Christos Faloutsos, Jure
  Leskovec, SDM’10
• Naively: diameter needs O(N**2) space and
  up to O(N**3) time – prohibitive (N~1B)
• Our HADI: linear in E (~10B)
     – Near-linear scalability wrt # machines
     – Several optimizations -> 5x faster
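
A minimal single-machine sketch of the idea behind this kind of diameter estimation: grow every node's reachable set one hop per pass over the edges, and record how many (source, target) pairs are within h hops. This is not the HADI code. Exact Python sets stand in for whatever compact approximate counters a large-scale system would use, and the names `neighborhood_function` and `effective_diameter` and the 90% threshold default are assumptions made for illustration.

```python
def neighborhood_function(edges, num_nodes):
    """Return counts where counts[h] = number of (u, v) pairs with dist(u, v) <= h.

    The graph is treated as undirected; each node counts itself at h = 0.
    """
    neighbors = [set() for _ in range(num_nodes)]
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # reach[v] = set of nodes within h hops of v (initially just v).
    reach = [{v} for v in range(num_nodes)]
    counts = [sum(len(r) for r in reach)]
    while True:
        # One pass: every node absorbs the reach sets of its neighbors.
        new_reach = [set(r) for r in reach]
        for v in range(num_nodes):
            for w in neighbors[v]:
                new_reach[v] |= reach[w]
        total = sum(len(r) for r in new_reach)
        if total == counts[-1]:  # no set grew, so all distances are covered
            return counts
        counts.append(total)
        reach = new_reach


def effective_diameter(counts, fraction=0.9):
    """Smallest hop count h that covers `fraction` of all reachable pairs."""
    target = fraction * counts[-1]
    for h, covered in enumerate(counts):
        if covered >= target:
            return h
```

Each pass touches every edge once, so the cost per pass is linear in E; the number of passes is bounded by the diameter. Calling `effective_diameter(counts, fraction=1.0)` gives the exact diameter of the largest connected component.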




[Figure: radius plot; x-axis: Radius, y-axis: Count; annotations: "??", "????", "19+? [Barabasi+]"]
 YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges)
 • Largest publicly available graph ever studied.




YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges)
• Effective diameter: surprisingly small.
• Multi-modality: probably a mixture of cores.




      Radius Plot of GCC of YahooWeb.





    Running time - Kronecker and Erdős-Rényi
    graphs with billions of edges.



       Generalized Iterated Matrix
      Vector Multiplication (GIMV)


PEGASUS: A Peta-Scale Graph Mining
System - Implementation and Observations.
U Kang, Charalampos E. Tsourakakis,
and Christos Faloutsos.
ICDM 2009, Miami, Florida, USA.
Best Application Paper (runner-up).




       Generalized Iterated Matrix
      Vector Multiplication (GIMV)


Matrix-vector multiplication (iterated) covers:
• PageRank
• proximity (RWR)
• Diameter
• Connected components
• (eigenvectors,
• Belief Prop.
• …)
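
A minimal sketch of how one generalized matrix-vector step can express both PageRank and connected components. The split into three user-supplied operations (combine two values, combine all partial results for a row, assign the new value) follows the decomposition described in the PEGASUS paper cited above. The Python names, the list-of-triples matrix layout, and the single-machine loop are assumptions for illustration, not the Hadoop implementation. Dangling nodes are not handled in the PageRank variant.

```python
from collections import Counter


def gimv(entries, v, combine2, combine_all, assign):
    """One generalized matrix-vector product: returns the updated vector.

    entries: list of (i, j, m_ij) triples, the nonzero cells of matrix M.
    """
    partial = [[] for _ in v]
    for i, j, m_ij in entries:
        partial[i].append(combine2(m_ij, v[j]))
    return [assign(v[i], combine_all(partial[i])) for i in range(len(v))]


def iterate(entries, v, ops, max_iters=100, tol=1e-12):
    """Repeat gimv until the vector stops changing (or max_iters is hit)."""
    for _ in range(max_iters):
        new_v = gimv(entries, v, *ops)
        if max(abs(a - b) for a, b in zip(new_v, v)) <= tol:
            return new_v
        v = new_v
    return v


def pagerank(n, directed_edges, c=0.85):
    out_degree = Counter(src for src, _ in directed_edges)
    # Cell (dst, src) holds 1/out_degree(src): a column-normalized matrix.
    entries = [(dst, src, 1.0 / out_degree[src]) for src, dst in directed_edges]
    ops = (
        lambda m_ij, v_j: c * m_ij * v_j,      # combine2: damped contribution
        lambda xs: (1.0 - c) / n + sum(xs),    # combine_all: teleport + sum
        lambda v_old, v_new: v_new,            # assign: overwrite
    )
    return iterate(entries, [1.0 / n] * n, ops)


def connected_components(n, undirected_edges):
    """Label every node with the smallest node id in its component."""
    entries = [(u, v, 1) for u, v in undirected_edges]
    entries += [(v, u, 1) for u, v in undirected_edges]
    ops = (
        lambda m_ij, v_j: v_j,                        # combine2: pass the label
        lambda xs: min(xs, default=float("inf")),     # combine_all: smallest label
        lambda v_old, v_new: min(v_old, v_new),       # assign: keep the minimum
    )
    return iterate(entries, list(range(n)), ops)


def component_size_histogram(labels):
    """Map component size -> number of components of that size."""
    return Counter(Counter(labels).values())
```

Only the three small functions change between the two algorithms; the loop over matrix cells stays the same, which is what makes a single distributed primitive reusable. `component_size_histogram` produces the count-versus-size data that a connected-components plot would show.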




          Example: GIM-V At Work
• Connected Components

[Figure: count vs. size of connected components]
• 300-size components x 500. Why?
• 1100-size components x 65. Why?
• Suspicious financial-advice sites
  (not existing now)




                 Triangles


• Real social networks have a lot of triangles




ASONAM 2009         C. Faloutsos             19




                     Triangles


• Real social networks have a lot of triangles
     – Friends of friends are friends
• Q1: how to compute quickly?
• Q2: Any patterns?












           Triangles : Computations
                [Tsourakakis ICDM 2008]


But: triangles are expensive to compute
     (3-way join; several approx. algos)
Q: Can we do that quickly?
A: Yes!
     $\text{\#triangles} = \frac{1}{6} \sum_i \lambda_i^3$
   (where $\lambda_i$ are the eigenvalues of the adjacency matrix;
    because of skewness, we only need the top few eigenvalues!)
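
A minimal sketch of the eigenvalue formula on a small dense adjacency matrix, checked against a brute-force count. Plain power iteration with deflation is swapped in here for brevity; it is an assumption of this example, not the method of the cited paper, and a real implementation would more likely use a Lanczos-type solver. The function names are hypothetical. Power iteration can fail to converge when the two largest eigenvalues have equal magnitude and opposite sign (for example on bipartite graphs).

```python
import math
import random


def count_triangles_exact(adj):
    """Brute-force count over all node triples u < v < w."""
    n = len(adj)
    total = 0
    for u in range(n):
        for v in range(u + 1, n):
            if not adj[u][v]:
                continue
            for w in range(v + 1, n):
                if adj[u][w] and adj[v][w]:
                    total += 1
    return total


def top_eigenvalues(adj, k, iters=300, seed=0):
    """Approximate the k largest-magnitude eigenvalues of a symmetric matrix."""
    rng = random.Random(seed)
    n = len(adj)
    mat = [[float(x) for x in row] for row in adj]
    eigenvalues = []
    for _ in range(min(k, n)):
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(xi * xi for xi in x))
        x = [xi / norm for xi in x]
        for _ in range(iters):
            y = [sum(m * xi for m, xi in zip(row, x)) for row in mat]
            norm = math.sqrt(sum(yi * yi for yi in y))
            if norm < 1e-12:  # remaining spectrum is numerically zero
                break
            x = [yi / norm for yi in y]
        # Rayleigh quotient of the unit vector x gives the signed eigenvalue.
        y = [sum(m * xi for m, xi in zip(row, x)) for row in mat]
        lam = sum(xi * yi for xi, yi in zip(x, y))
        eigenvalues.append(lam)
        # Deflate: remove the component just found from the matrix.
        for i in range(n):
            for j in range(n):
                mat[i][j] -= lam * x[i] * x[j]
    return eigenvalues


def triangles_from_eigenvalues(adj, k):
    """Estimate the triangle count from the top-k eigenvalues."""
    return sum(lam ** 3 for lam in top_eigenvalues(adj, k)) / 6.0
```

On the complete graph with 4 nodes the eigenvalues are 3, -1, -1, -1, so all four give (27 - 3) / 6 = 4 triangles, while the top eigenvalue alone gives 27 / 6 = 4.5. The factor 1/6 appears because each triangle is counted once per starting node and direction in the closed walks of length 3.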




          Triangles : Computations
              [Tsourakakis ICDM 2008]




              1000x+ speed-up, high accuracy




                Triangles
• Easy to implement on Hadoop: it only needs
  eigenvalues (working on it, using Lanczos)







                   Triangle Law: #1
                  [Tsourakakis ICDM 2008]


[Figure panels: HEP-TH, ASN, Epinions]
• X-axis: # of triangles a node participates in
• Y-axis: count of such nodes
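
A minimal sketch of how the data behind such a plot could be computed on a small graph: count the triangles each node participates in, then count how many nodes share each value. The function names and the edge-list input are assumptions for illustration; this is an exact in-memory count, not the method used to produce the figures.

```python
from collections import Counter


def triangles_per_node(edges):
    """Map each node to the number of triangles it participates in."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    counts = {}
    for u, nbrs in neighbors.items():
        # Each triangle {u, v, w} is found twice from u: via v and via w.
        links = sum(len(nbrs & neighbors[v]) for v in nbrs)
        counts[u] = links // 2
    return counts


def participation_histogram(edges):
    """Map t -> number of nodes that participate in exactly t triangles."""
    return Counter(triangles_per_node(edges).values())
```

Plotting the histogram on log-log axes would show whether the counts follow a straight line, which is the pattern the slide describes.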



                   Triangle Law: #2
                  [Tsourakakis ICDM 2008]


[Figure panels: Reuters, SN, Epinions]
• X-axis: degree
• Y-axis: mean # triangles
• Notice: slope ~ degree exponent (insets)
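
A minimal sketch of the quantity on these axes: group nodes by degree and average the number of triangles each participates in. The function name and the edge-list input are assumptions for illustration; fitting the slope on log-log axes is left out.

```python
from collections import defaultdict


def mean_triangles_by_degree(edges):
    """Map degree d -> mean number of triangles over nodes of degree d."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    sums = defaultdict(int)    # degree -> total triangles over its nodes
    nodes = defaultdict(int)   # degree -> number of nodes with that degree
    for u, nbrs in list(neighbors.items()):
        # Each triangle through u is found twice, once per other corner.
        triangles = sum(len(nbrs & neighbors[v]) for v in nbrs) // 2
        sums[len(nbrs)] += triangles
        nodes[len(nbrs)] += 1
    return {d: sums[d] / nodes[d] for d in nodes}
```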




                Visualization: ShiftR
• Supporting Ad Hoc Sensemaking:
  Integrating Cognitive, HCI, and Data
  Mining Approaches
  Aniket Kittur, Duen Horng (‘Polo’) Chau,
  Christos Faloutsos, Jason I. Hong
  Sensemaking Workshop at CHI 2009, April
  4-5. Boston, MA, USA.







                                 Conclusions
    One-stop shopping for large graph mining:
    • www.cs.cmu.edu/~pegasus



    Akoglu, Leman; Chau, Polo; Kang, U; McGlohon, Mary; Tsourakakis, Babis



   THANKS: NSF, Yahoo (M45), LLNL