Data Mining on the Web

Collaborative Filtering and PageRank in a Network

          Qiang Yang
            HKUST
      Thanks: Sonny Chee


Motivation
•   Question:
     •   A user has already bought some products.
     •   What other products should we recommend to that user?
•   Collaborative Filtering (CF)
     •   Automates the “circle of advisors”.
Collaborative Filtering

“..people collaborate to help one another
   perform filtering by recording their
   reactions...” (Tapestry)
•   Finds users whose taste is similar to yours and
    uses them to make recommendations.
•   Complementary to information retrieval and information filtering (IR/IF).
    •   IR/IF finds similar documents; CF finds similar
        users.
Example
•   Which movie would Sammy watch next?
       •   Ratings are on a scale of 1 to 5
      • If we simply average the ratings of the other users who voted on these movies, we get
            • Matrix = 3; Titanic = 14/4 = 3.5
            • Recommend Titanic!
      • But is this reasonable?
    Types of Collaborative Filtering Algorithms

•   Collaborative Filters
•   Open Problems
     •   Sparsity, First Rater, Scalability
Statistical Collaborative Filters
•   Users annotate items with numeric
    ratings.
•   Users who rate items “similarly” become
    mutual advisors.
•   Recommendation computed by taking a
    weighted aggregate of advisor ratings.
Basic Idea
•   Nearest Neighbor Algorithm
•   Given a user a and an item i
    •   First, find the users most similar to a
         •   Let these be Y
    •   Second, find how these users (Y) rated i
    •   Then, calculate a predicted rating of a on i
        based on some average over all these users Y
         •   How to calculate the similarity and the average?
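
The three steps above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the function name predict_knn, the nested-dictionary layout of ratings, and the choice to consider only users who have actually rated the item are all assumptions. The similarity measure is passed in as a function, and the "average" is a plain unweighted mean.

```python
def predict_knn(active, item, ratings, similarity, k=2):
    """Predict the active user's rating of `item`.

    ratings:    {user: {item: rating}}  (hypothetical layout)
    similarity: function(user_a, user_b) -> float, supplied by the caller
    """
    # Step 1: the k users most similar to the active user.
    # Assumption: only users who have rated the item are candidates.
    candidates = [u for u in ratings if u != active and item in ratings[u]]
    neighbours = sorted(candidates,
                        key=lambda u: similarity(active, u),
                        reverse=True)[:k]
    if not neighbours:
        return None  # nobody to ask
    # Steps 2 and 3: look up the neighbours' ratings and average them.
    return sum(ratings[u][item] for u in neighbours) / len(neighbours)
```

Passing the similarity in as a parameter keeps the sketch independent of the open question on the slide: any measure (for example a correlation between rating vectors) can be plugged in.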


Statistical Filters

•   GroupLens [Resnick et al 94, MIT]
    •   Filters UseNet News postings
    •   Similarity: Pearson correlation
    •   Prediction: Weighted deviation from mean
Pearson Correlation

•   Weight between users a and u
    •   Compute similarity matrix between users
         •   Use Pearson correlation (ranges from -1 to 1; 0 means no correlation)
         •   Let items be all items that both users rated
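
A minimal sketch of the weight computation, assuming each user's ratings are stored as a dictionary from item to rating (a hypothetical layout) and that the means are taken over the co-rated items only. Some formulations use each user's mean over all of their ratings instead, so treat that choice as an assumption.

```python
from math import sqrt

def pearson(ratings_a, ratings_u):
    """Pearson correlation between two users over the items both have rated."""
    common = [i for i in ratings_a if i in ratings_u]
    if len(common) < 2:
        return 0.0  # not enough overlap to correlate
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    # Numerator: co-variation of the two users' deviations from their means.
    num = sum((ratings_a[i] - mean_a) * (ratings_u[i] - mean_u) for i in common)
    # Denominator: product of the two users' spreads.
    den = sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common)
               * sum((ratings_u[i] - mean_u) ** 2 for i in common))
    return num / den if den else 0.0  # constant ratings give no signal
```

Two users whose ratings rise and fall together get a weight near 1, users with opposite tastes get a weight near -1.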




Prediction Generation
•   Predicts how much user a likes an item i
    (a stands for active user)
    •   Make predictions using weighted deviation
        from the mean

                                     (1)

•   Normalizing factor: sum of all weights
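
A weighted-deviation-from-mean prediction can be sketched as below: the active user's mean rating, plus the weighted average of how far each advisor's rating of the item sits from that advisor's own mean. The function name and the tuple layout are hypothetical, and normalizing by the sum of absolute weights (so that negative correlations do not cancel the denominator) is an assumption about how "sum of all weights" is meant.

```python
def predict(mean_a, advisors):
    """Weighted deviation from the mean.

    mean_a:   mean rating of the active user a
    advisors: list of (weight, rating_of_item, advisor_mean) tuples
    """
    # Normalizing factor: sum of the absolute weights (assumption).
    norm = sum(abs(w) for w, _, _ in advisors)
    if norm == 0:
        return mean_a  # no usable advisors: fall back to the user's own mean
    deviation = sum(w * (r - m) for w, r, m in advisors)
    return mean_a + deviation / norm
```

For example, predict(3.0, [(1.0, 5, 4.0), (-0.5, 2, 3.0)]) gives 4.0: the first advisor rated the item one point above their mean, and the negatively correlated advisor rated it one point below theirs, so both push the prediction up.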


Error Estimation

•   Mean Absolute Error (MAE) for user a

•   Standard Deviation of the errors
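
Both measures can be sketched in a few lines, assuming the predictions and the user's actual ratings are given as two equal-length lists (hypothetical inputs). The standard deviation here is the population form, taken over the absolute errors; whether the original definition divides by the count or the count minus one is not stated, so that is an assumption.

```python
from math import sqrt

def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    errors = [abs(p - r) for p, r in zip(predicted, actual)]
    return sum(errors) / len(errors)

def error_std(predicted, actual):
    """Standard deviation of the absolute errors (population form)."""
    errors = [abs(p - r) for p, r in zip(predicted, actual)]
    mean = sum(errors) / len(errors)
    return sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))
```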




Example
Correlation between users

           Sammy     Dylan     Mathew
Sammy        1         1        -0.87
Dylan        1         1         0.21
Mathew     -0.87      0.21        1

                                                    =0.83
Open Problems in CF

•   “Sparsity Problem”
    •   CFs have poor accuracy and coverage in
        comparison to population averages at low
        rating density [GSK+99].
•   “First Rater Problem” (cold-start problem)
    •   The first person to rate an item receives no
        benefit. CF depends upon altruism. [AZ97]
Open Problems in CF

•   “Scalability Problem”
    •   CF is computationally expensive. The fastest
        published algorithms (nearest-neighbor)
        are O(n^2).
         •   Can any indexing method speed this up?
    •   Has received relatively little attention.
The PageRank Algorithm
•   Fundamental question to ask
    •   What is the importance level of a page P?
•   Information Retrieval
    •   Cosine + TF-IDF → does not give related
        hyperlinks
•   Link based
    •   Important pages (nodes) have many other pages
        pointing to them
    •   Important pages also point to other important
        pages
     The Google Crawler Algorithm
•   “Efficient Crawling Through URL Ordering”,
    •   Junghoo Cho, Hector Garcia-Molina, Lawrence Page,
        Stanford
    •   http://www.www8.org
    •   http://www-db.stanford.edu/~cho/crawler-paper/
•   “Modern Information Retrieval”, BY-RN
    •   Pages 380-382
•   Lawrence Page, Sergey Brin. The Anatomy of a Search Engine.
    The Seventh International WWW Conference (WWW 98).
    Brisbane, Australia, April 14-18, 1998.
    •   http://www.www7.org
        Page Rank Metric
• Let 1-d be the probability that a user randomly jumps to
page P;
• “d” is the damping factor. (1-d) is the likelihood of
arriving at P by random jumping
• Let N be the in-degree of P
• Let Ci be the number of out-links (out-degree) from
each Ti

[Figure: pages T1, T2, ..., TN each link to Web Page P; example values C=2, d=0.9]
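
Taken together, these definitions correspond to the usual PageRank recurrence, written here with the IR(P) notation used in the example slides. It is a reconstruction from the definitions above under that assumption, not a quotation:

```latex
IR(P) = (1 - d) + d \sum_{i=1}^{N} \frac{IR(T_i)}{C_i}
```

Each page Ti that links to P passes on its own rank divided evenly among its Ci out-links; the (1 - d) term accounts for arriving at P by a random jump.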


How to compute page rank?

•   For a given network of web pages,
    •   Initialize page rank for all pages (to one)
    •   Set parameter (d=0.90)
    •   Iterate through the network, L times
Example: iteration k=1

IR(P)=1/3 for all nodes, d=0.9

[Figure: network of three nodes A, B and C]

node    IR
A       1/3
B       1/3
C       1/3
Example: k=2

l is the in-degree of P

[Figure: network of three nodes A, B and C]

node    IR
A       0.4
B       0.1
C       0.55

Note: A, B, C’s IR values are
updated in order of A, then B, then C.
Use the new value of A when calculating B, etc.
Example: k=2 (normalize)

[Figure: network of three nodes A, B and C]

node   IR
A      0.38
B      0.095
C      0.52
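
The worked example can be replayed with a short sketch. The link structure A → C, B → C, C → A is an assumption inferred from the numbers in the tables (it is the three-node graph that yields 0.4, 0.1 and 0.55 with d=0.9), and the function names are hypothetical. Values are updated in place, in the order A, B, C, so each page sees the new values of the pages before it.

```python
def pagerank_sweep(in_links, out_degree, rank, d=0.9):
    """One in-place pass of IR(P) = (1 - d) + d * sum(IR(T) / C(T))."""
    for page in sorted(rank):  # A, then B, then C
        incoming = sum(rank[t] / out_degree[t] for t in in_links[page])
        rank[page] = (1 - d) + d * incoming
    return rank

def normalize(rank):
    """Rescale the ranks so that they sum to 1."""
    total = sum(rank.values())
    return {page: value / total for page, value in rank.items()}

# Assumed graph (inferred, see above): A -> C, B -> C, C -> A.
in_links = {"A": ["C"], "B": [], "C": ["A", "B"]}
out_degree = {"A": 1, "B": 1, "C": 1}

rank = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}           # iteration k=1
rank = pagerank_sweep(in_links, out_degree, rank)      # A=0.4, B=0.1, C=0.55
normalized = normalize(rank)                           # about 0.38, 0.095, 0.52
```

Because B has no incoming links, it keeps only the random-jump share (1 - d) = 0.1, while C collects rank from both A and B.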


Crawler Control

•   All crawlers maintain several queues of URLs
    to pursue next
    •   Google initially maintains 500 queues
    •   Each queue corresponds to a web site being crawled
•   Important considerations:
    •   Limited buffer space
    •   Limited time
    •   Avoid overloading target sites
    •   Avoid overloading network traffic
    Crawler Control

•   Thus, it is important to visit important
    pages first
•   Let G be a lower-bound threshold on IR(P)
•   Crawl and Stop
    •   Select only pages with IR > G to crawl
    •   Stop after K pages have been crawled
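
A minimal sketch of the Crawl and Stop idea, assuming the importance estimates are already available as a dictionary from URL to IR value (a simplification: a real crawler has to estimate importance while it crawls). The function name and the inputs are hypothetical, and no pages are actually fetched.

```python
import heapq

def crawl_and_stop(importance, G, K):
    """Visit pages with IR > G in decreasing order of IR; stop after K pages."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    queue = [(-score, url) for url, score in importance.items() if score > G]
    heapq.heapify(queue)
    visited = []
    while queue and len(visited) < K:
        _, url = heapq.heappop(queue)
        visited.append(url)  # a real crawler would fetch and parse the page here
    return visited
```

With importance {"a": 0.5, "b": 0.2, "c": 0.9, "d": 0.05}, G = 0.1 and K = 2, the sketch visits "c" and then "a".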



Test Result: 179,000 pages
                                    




                                                                         


                            
[Figure: percentage of Stanford Web crawled vs. PST,
   the percentage of hot pages visited so far]
Google Algorithm            (very simplified)

•   First, compute the page rank of each
    page on WWW
    •   Query independent
•   Then, in response to a query q, return
    pages that contain q and have highest
    page ranks
•   A problem/feature of Google: favors big
    commercial sites
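
The two-stage scheme can be sketched as follows, assuming the page ranks have been precomputed and that "contains q" means a case-insensitive substring match. Both are simplifications for illustration; the function name and the data layout are hypothetical.

```python
def search(query, pages, page_rank, top_n=10):
    """Return URLs of pages containing the query, highest page rank first.

    pages:     {url: page text}
    page_rank: {url: precomputed, query-independent rank}
    """
    hits = [url for url, text in pages.items() if query.lower() in text.lower()]
    return sorted(hits, key=lambda url: page_rank[url], reverse=True)[:top_n]
```

Note that the ranking step never looks at the query: a page with a high rank wins over a more specialized page as long as both contain q, which is one way to read the remark about big commercial sites.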


				