
CS345 Data Mining

A Generalized Co-HITS Algorithm and Its Application to Bipartite Graphs


   Hongbo Deng, Michael R. Lyu and Irwin King

      Department of Computer Science and Engineering
           The Chinese University of Hong Kong

                      July 1st, 2009



Introduction
Many kinds of data can be modeled as bipartite graphs
  IR Models for Content → Relevance
    - VSM
    - Language Model
    - etc.
  Link Analysis for Graph → Semantic relations
    - HITS
    - PageRank
    - etc.
  Incorporate Content with Graph
    - Personalized PageRank (PPR)
    - Linear Combination
    - etc.
An Illustration
Query suggestion for query “map”:

  Query-URL bipartite graph
    Queries: q1 mapquest, q2 map, q3 Google map, ..., qi google
    URLs:    d1 www.mapquest.com, d2 www.maps.com, d3 maps.google.com, ..., dj www.google.com

  Problems: noisy link data; lack of relevance constraints

  Top suggestions by method
    HITS          PPR                  More reasonable
    google        mapquest             mapquest
    mapquest      map quest            united states map
    google.com    google               map of florida
    map quest     united states map    us map
    weather       mapquest.com         world map

Outline
 Introduction
 Generalized Co-HITS
   Preliminaries
   Iterative Framework
   Regularization Framework
 Experiments
 Conclusion




Preliminaries

  [Figure: bipartite graph with vertex sets X and Y]
  Content
  Graph
    Explicit links:
    Hidden links:
Generalized Co-HITS
 Basic idea
   Incorporate the bipartite graph with the content
    information from both sides
   Initialize the vertices with the relevance scores x0, y0
   Propagate the scores (mutual reinforcement)




   [Figure: initial scores, then score propagation]
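
The basic idea above (initialize both sides with relevance scores, then let the scores reinforce each other across the bipartite graph) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the update rule, the mixing weights `lam_x` and `lam_y`, the matrix names `w_xy` and `w_yx`, and the fixed iteration count are all assumptions.

```python
def co_hits(w_xy, w_yx, x0, y0, lam_x=0.5, lam_y=0.5, iters=50):
    """Propagate scores between the two sides of a bipartite graph.

    w_xy[i][j]: transition weight from X-vertex i to Y-vertex j (rows sum to 1).
    w_yx[j][i]: transition weight from Y-vertex j to X-vertex i (rows sum to 1).
    x0, y0: initial relevance scores for the two sides.
    lam_x, lam_y: hypothetical weights in [0, 1]; 0 keeps the initial
    scores, larger values trust the propagated scores more.
    """
    x, y = list(x0), list(y0)
    for _ in range(iters):
        # Each X score mixes its initial score with scores flowing in from Y.
        new_x = [
            (1 - lam_x) * x0[i]
            + lam_x * sum(w_yx[j][i] * y[j] for j in range(len(y)))
            for i in range(len(x))
        ]
        # Each Y score mixes its initial score with scores flowing in from X.
        new_y = [
            (1 - lam_y) * y0[j]
            + lam_y * sum(w_xy[i][j] * new_x[i] for i in range(len(new_x)))
            for j in range(len(y))
        ]
        x, y = new_x, new_y
    return x, y


# Toy graph (illustrative values): 2 vertices on the X side, 2 on the Y side.
w_xy = [[1.0, 0.0], [0.5, 0.5]]
w_yx = [[0.5, 0.5], [0.0, 1.0]]
x, y = co_hits(w_xy, w_yx, x0=[1.0, 0.0], y0=[0.5, 0.5])
```

In this sketch, setting both weights to 0 returns the initial scores unchanged, and when the rows of both matrices sum to 1 the total score on each side is preserved across iterations.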

Generalized Co-HITS
 Iterative framework




Iterative → Regularization Framework
 Consider the vertices on one side
   [Figure: vertex set U with weight matrix Wuu]
   [Cost function terms: smoothness; fit to the initial scores]




Generalized Co-HITS
 Regularization Framework

   [Figure: cost terms R1, R3 and R2, with weight matrices Wuu and Wvv]
 Intuition: highly connected vertices are most likely to have
 similar relevance scores.
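
The intuition above can be made concrete with a small cost function that penalizes score differences between strongly connected vertices while keeping scores close to their initial values. This is a sketch of one common graph-regularization form, not necessarily the cost function used on the slides; the unnormalized smoothness term, the trade-off weight `mu`, and all names are assumptions.

```python
def regularization_cost(w, f, f0, mu=1.0):
    """Smoothness plus fit-to-initial-scores cost on one side of the graph.

    w[i][j]: symmetric similarity weight between vertices i and j.
    f: candidate relevance scores; f0: initial relevance scores.
    mu: hypothetical trade-off between smoothness and fit.
    """
    n = len(f)
    # Smoothness: strongly connected vertices should have similar scores.
    smooth = 0.5 * sum(
        w[i][j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n)
    )
    # Fit: the final scores should stay close to the initial ones.
    fit = mu * sum((f[i] - f0[i]) ** 2 for i in range(n))
    return smooth + fit


# Vertices 0 and 1 are strongly connected; vertex 2 is weakly connected.
w = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.1], [0.1, 0.1, 0.0]]
f0 = [1.0, 0.0, 0.0]
cost_initial = regularization_cost(w, f0, f0)
cost_smoothed = regularization_cost(w, [0.6, 0.4, 0.0], f0)
```

With these illustrative numbers, moving the scores of the two strongly connected vertices toward each other lowers the total cost even though the fit term grows, which is the trade-off the two terms are meant to express.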

Generalized Co-HITS
 Regularization Framework
   The cost function:
   Optimization problem:
   Solution:




Application to Query-URL Bipartite Graphs
 Bipartite graph construction
    Edge weighted by the click frequency
    Normalize to obtain the transition matrix
 Overall Algorithm




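
The graph-construction step above can be sketched as follows: click frequencies become edge weights, and each vertex's outgoing weights are normalized so that they sum to 1. This is a minimal sketch under assumptions: the `(query, url, clicks)` record layout, the function name, and the choice to normalize per vertex on both sides are illustrative rather than taken from the slides.

```python
from collections import defaultdict


def build_transitions(click_log):
    """Turn (query, url, clicks) records into two transition tables.

    Returns (q_to_u, u_to_q), where q_to_u[q][u] is the share of q's clicks
    that went to u, and u_to_q[u][q] is the share of u's clicks that came
    from q. Assumes every click count is positive.
    """
    weights = defaultdict(float)
    for query, url, clicks in click_log:
        weights[(query, url)] += clicks  # edge weight = click frequency

    q_total = defaultdict(float)
    u_total = defaultdict(float)
    for (query, url), w in weights.items():
        q_total[query] += w
        u_total[url] += w

    q_to_u = defaultdict(dict)
    u_to_q = defaultdict(dict)
    for (query, url), w in weights.items():
        q_to_u[query][url] = w / q_total[query]
        u_to_q[url][query] = w / u_total[url]
    return dict(q_to_u), dict(u_to_q)


# Hypothetical click counts, for illustration only.
log = [
    ("map", "www.maps.com", 3),
    ("map", "maps.google.com", 1),
    ("google map", "maps.google.com", 4),
]
q_to_u, u_to_q = build_transitions(log)
```

Nested dictionaries are used here instead of dense matrices because a query log of this kind is very sparse: most query-URL pairs have no edge at all.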
Outline
 Introduction
 Preliminaries
 Generalized Co-HITS
   Iterative Framework
   Regularization Framework
 Experiments
 Conclusion




Experimental Evaluation
 Data collection
    AOL query log data




 Cleaning the data
        Removing the queries that appear fewer than 2 times
        Combining the near-duplicate queries
        883,913 queries and 967,174 URLs
        4,900,387 edges
        250,127 unique terms
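
The cleaning steps above can be sketched as a small filter over raw query strings. This is an illustration only: the slides do not say how near-duplicates were detected, so the normalization used here (lowercasing, dropping punctuation, collapsing whitespace), the choice to merge before filtering, and the function names are all assumptions.

```python
import re
from collections import Counter


def normalize(query):
    """Hypothetical near-duplicate key: lowercase, no punctuation, single spaces."""
    query = re.sub(r"[^\w\s]", " ", query.lower())
    return " ".join(query.split())


def clean_queries(raw_queries, min_count=2):
    """Merge near-duplicate queries, then drop those seen fewer than min_count times."""
    counts = Counter(normalize(q) for q in raw_queries)
    return {q: c for q, c in counts.items() if q and c >= min_count}


raw = ["MapQuest", "mapquest", "map quest", "map  quest!", "weather"]
cleaned = clean_queries(raw)
```

In this toy input, "weather" appears only once and is dropped, while the case and punctuation variants are merged into two surviving queries.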

Evaluation: ODP Similarity
 A simple measure of similarity among queries
  using ODP categories (query → category)
    Definition:

    Example: similarity 3/5
          Q1: “United States” → “Regional > North America >
           United States”
          Q2: “National Parks” → “Regional > North America >
           United States > Travel and Tourism > National Parks and
           Monuments”

 Precision at rank n (P@n):
 300 distinct queries
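
The ODP similarity example above is consistent with measuring similarity as the length of the longest common prefix of two category paths divided by the length of the longer path. The sketch below assumes that definition, which is inferred from the 3/5 example rather than copied from the slide; the function name and the separator handling are illustrative.

```python
from fractions import Fraction


def odp_similarity(category_a, category_b, sep=">"):
    """Longest common prefix length over the longer path length (assumed definition)."""
    path_a = [part.strip() for part in category_a.split(sep)]
    path_b = [part.strip() for part in category_b.split(sep)]
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return Fraction(common, max(len(path_a), len(path_b)))


q1 = "Regional > North America > United States"
q2 = ("Regional > North America > United States > "
      "Travel and Tourism > National Parks and Monuments")
print(odp_similarity(q1, q2))  # 3/5
```

The two paths share their first three levels and the longer path has five, which reproduces the 3/5 from the example.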
Experimental Results
 Comparison of Iterative Framework
   [Figure panels: personalized PageRank (PPR), one-step propagation (OSP), general Co-HITS (CoIter)]

 Result 1:
   OSP and CoIter improve on the baseline (the dashed line) more than PPR does.
   The initial relevance scores from both sides provide valuable information.
Experimental Results
 Comparison of Regularization Framework
   [Figure panels: single-sided regularization (SiRegu), double-sided regularization (CoRegu)]

 Result 2:
   SiRegu improves performance over the baseline. CoRegu performs better
   than SiRegu, owing to the newly developed cost function R3.
   Moreover, CoRegu is relatively robust.
Experimental Results
 Detailed Results




 Result 3:
 CoRegu-0.5 achieves the best performance. Considering the double-sided
 regularization framework is essential and promising for the bipartite graph.
Conclusions
 We propose the Co-HITS algorithm to incorporate the
  bipartite graph with the content information from
  both sides.
 The Co-HITS algorithm is more general: it includes
  HITS and personalized PageRank as special cases.
 CoRegu is more robust thanks to the newly developed
  cost function, and it achieves the best performance
  with consistent improvements.


Q&A




                                          Thanks!





				