# Data Mining on the Web


## Collaborative Filtering and PageRank in a Network

Qiang Yang, HKUST
(Thanks: Sonny Chee)
## Motivation

- Question: a user has already bought some products; what other products should we recommend to that user?
- Answer: Collaborative Filtering (CF)
## Collaborative Filtering

"...people collaborate to help one another perform filtering by recording their reactions..." (Tapestry)

- Finds users whose taste is similar to yours and uses them to make recommendations.
- Complementary to IR/IF: IR/IF finds similar documents, while CF finds similar users.
## Example

- Which movie would Sammy watch next? (Ratings are 1-5.)
- If we just use the average rating of the other users who voted on these movies, we get Matrix = 3 and Titanic = 14/4 = 3.5.
- Recommend Titanic! But is this reasonable?
## Types of Collaborative Filtering Algorithms

- Collaborative Filters
- Open Problems: Sparsity, First Rater, Scalability
## Statistical Collaborative Filters

- Users annotate items with numeric ratings.
- Users who rate items "similarly" become predictors for one another.
- A recommendation is computed by taking a weighted average of the neighbors' ratings.
## Basic Idea

- Nearest Neighbor Algorithm. Given a user a and an item i:
  - First, find the users most similar to a; call this set Y.
  - Second, find how these users (Y) ranked i.
  - Then, calculate a predicted rating of a on i based on some average over all the users in Y.
- How do we calculate the similarity and the average?
## Statistical Filters

- GroupLens [Resnick et al. 94, MIT]
  - Filters UseNet News postings
  - Similarity: Pearson correlation
  - Prediction: weighted deviation from the mean
## Pearson Correlation

- Weight w(a,u) between users a and u.
- Compute a similarity matrix between users using the Pearson correlation, which ranges over [-1, 1].
- Let the items i range over all items that both users have rated:

  w(a,u) = Σ_i (r(a,i) − r̄_a)(r(u,i) − r̄_u) / sqrt( Σ_i (r(a,i) − r̄_a)² · Σ_i (r(u,i) − r̄_u)² )
## Prediction Generation

- Predict how much user a likes an item i (a stands for the active user).
- Make predictions using the weighted deviation from the mean:

  P(a,i) = r̄_a + Σ_u w(a,u) · (r(u,i) − r̄_u) / Σ_u |w(a,u)|    (1)

- The denominator Σ_u |w(a,u)| is the sum of (the absolute values of) all the weights.
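The two steps above (Pearson weights, then the weighted deviation from the mean) can be sketched in Python. Note that the users, movie names, and ratings below are made-up illustrations, not data from the slides:

```python
import math

def pearson(ra, ru):
    """Pearson correlation between two users over their co-rated items."""
    common = [i for i in ra if i in ru]
    if len(common) < 2:
        return 0.0
    ma = sum(ra[i] for i in common) / len(common)
    mu = sum(ru[i] for i in common) / len(common)
    num = sum((ra[i] - ma) * (ru[i] - mu) for i in common)
    den = math.sqrt(sum((ra[i] - ma) ** 2 for i in common)
                    * sum((ru[i] - mu) ** 2 for i in common))
    return num / den if den else 0.0

def predict(active, item, ratings):
    """Equation (1): active user's mean plus the weighted deviation
    from each other user's mean, normalized by the sum of |w(a,u)|."""
    ra = ratings[active]
    mean_a = sum(ra.values()) / len(ra)
    num = den = 0.0
    for u, ru in ratings.items():
        if u == active or item not in ru:
            continue
        w = pearson(ra, ru)
        mean_u = sum(ru.values()) / len(ru)
        num += w * (ru[item] - mean_u)
        den += abs(w)
    return mean_a + num / den if den else mean_a

ratings = {  # hypothetical 1-5 ratings
    "Sammy":  {"Matrix": 5, "StarWars": 4},
    "Dylan":  {"Matrix": 4, "StarWars": 3, "Titanic": 2},
    "Mathew": {"Matrix": 2, "StarWars": 5, "Titanic": 5},
}
print(predict("Sammy", "Titanic", ratings))  # -> 3.5
```

Unlike the plain average from the earlier example, the prediction is anchored at Sammy's own mean rating, and dissimilar users pull it down rather than up.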
## Error Estimation

- Mean Absolute Error (MAE) for user a, over the m_a items whose ratings we predict:

  MAE_a = (1/m_a) Σ_i |P(a,i) − r(a,i)|

- Standard deviation of the errors across users:

  σ = sqrt( Σ_a (MAE_a − avg MAE)² / N )
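A small sketch of the two error measures above; the predicted and actual ratings are made-up numbers, not data from the slides:

```python
import statistics

def mae(predicted, actual):
    """Mean Absolute Error between predicted and actual ratings."""
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / len(actual)

# One MAE per user, then the standard deviation of those per-user errors.
per_user_mae = [mae([3.5, 4.0], [3, 4]), mae([2.0, 5.0], [2, 4])]
sigma = statistics.pstdev(per_user_mae)  # population std dev of the errors
```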
## Example: Correlation

| Users  | Sammy | Dylan | Mathew |
|--------|-------|-------|--------|
| Sammy  | 1     | 1     | -0.87  |
| Dylan  | 1     | 1     | 0.21   |
| Mathew | -0.87 | 0.21  | 1      |

= 0.83
## Open Problems in CF

- "Sparsity Problem": CFs have poor accuracy and coverage in comparison to population averages at low rating density [GSK+99].
- "First Rater Problem" (cold-start problem): the first person to rate an item receives no benefit; CF depends upon altruism [AZ97].
## Open Problems in CF (cont.)

- "Scalability Problem": CF is computationally expensive; the fastest published (nearest-neighbor) algorithms are O(n²).
- Is there any indexing method for speeding this up? The question has received relatively little attention.
## The PageRank Algorithm

- What is the importance level of a page P?
- Classic Information Retrieval (cosine similarity + TF-IDF) does not capture the relative importance of pages.
- Important pages (nodes) have many other pages linking to them.
- Important pages also point to other important pages.
## References

- "Efficient Crawling Through URL Ordering", Junghoo Cho, Hector Garcia-Molina, Lawrence Page, Stanford.
  - http://www.www8.org
  - http://www-db.stanford.edu/~cho/crawler-paper/
- "Modern Information Retrieval", BY-RN, pages 380-382.
- Sergey Brin, Lawrence Page. "The Anatomy of a Large-Scale Hypertextual Web Search Engine." The Seventh International WWW Conference (WWW 98), Brisbane, Australia, April 14-18, 1998.
  - http://www.www7.org
## PageRank Metric

[Figure: pages T_1, ..., T_N each link to page P; e.g., C = 2, d = 0.9.]

  IR(P) = (1 − d) + d · [ IR(T_1)/C_1 + ... + IR(T_N)/C_N ]

- Let N be the in-degree of P: T_1, ..., T_N are the pages linking to P.
- Let C_i be the number of outgoing links of each T_i.
- d is the damping factor; (1 − d) is the likelihood of arriving at P by random jumping.
## How to Compute PageRank

- For a given network of web pages:
  - Initialize the page rank of all pages (e.g., to one).
  - Set the damping parameter (d = 0.90).
  - Iterate through the network L times, updating each page's rank.
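The procedure above can be sketched in Python. Ranks are updated in place and in order, so later pages in an iteration already see earlier pages' new values, as in the worked example. The three-node graph (A→C, B→C, C→A) is an assumption chosen to reproduce the numbers on the example slides:

```python
# IR(P) = (1 - d) + d * sum(IR(T_i) / C_i) over pages T_i linking to P.

def pagerank(links, d=0.9, iters=1, init=None):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    out_degree = {p: len(links[p]) for p in pages}
    ir = dict(init) if init else {p: 1.0 for p in pages}
    for _ in range(iters):
        for p in pages:  # update in order, using already-updated values
            incoming = [q for q in pages if p in links[q]]
            ir[p] = (1 - d) + d * sum(ir[q] / out_degree[q] for q in incoming)
    total = sum(ir.values())
    return {p: r / total for p, r in ir.items()}  # normalize to sum to 1

links = {"A": ["C"], "B": ["C"], "C": ["A"]}  # assumed example graph
ranks = pagerank(links, d=0.9, iters=1, init={"A": 1/3, "B": 1/3, "C": 1/3})
# ranks come out to roughly A = 0.38, B = 0.095, C = 0.52
```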
## Example: Iteration k=1

Initialize IR(P) = 1/3 for all nodes; d = 0.9. Three nodes: A, B, C.

| node | IR  |
|------|-----|
| A    | 1/3 |
| B    | 1/3 |
| C    | 1/3 |
## Example: k=2

| node | IR   |
|------|------|
| A    | 0.4  |
| B    | 0.1  |
| C    | 0.55 |

Note: A, B, and C's IR values are updated in the order A, then B, then C; the new value of A is used when calculating B, and so on.
## Example: k=2 (normalized)

| node | IR    |
|------|-------|
| A    | 0.38  |
| B    | 0.095 |
| C    | 0.52  |
## Crawler Control

- All crawlers maintain several queues of URLs to pursue next.
- Google initially maintains 500 queues.
- Each queue corresponds to one web site being pursued.
- Important considerations: limited buffer space and limited time.
## Crawler Control (cont.)

- Thus, it is important to visit important pages first.
- Let G be a lower-bound threshold on IR(P).
- Crawl and Stop: select only pages with IR > G to crawl; stop after crawling K pages.
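The crawl-and-stop policy above can be sketched as follows; the rank values are made-up illustrations (they echo the earlier three-node example), not measurements from the Stanford crawl:

```python
# "Crawl and stop": visit pages in descending rank order, keep only those
# above the hot-page threshold G, and stop after K pages.
def crawl_and_stop(ranks, G, K):
    hot = [p for p, r in sorted(ranks.items(), key=lambda x: -x[1]) if r > G]
    return hot[:K]

ranks = {"A": 0.38, "B": 0.095, "C": 0.52}  # hypothetical PageRank values
print(crawl_and_stop(ranks, G=0.1, K=2))  # -> ['C', 'A']
```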
## Test Result: 179,000 Pages

[Figure: percentage of the Stanford Web crawled vs. PST, the percentage of hot pages visited so far.]

## PageRank in Google

- First, compute the page rank of each page on the WWW (query independent).
- Then, in response to a query q, return pages that contain q and have the highest page ranks.
- A problem/feature of Google: it favors big commercial sites.
