Collaborative Filtering and PageRank in a Network
Qiang Yang, HKUST
Thanks: Sonny Chee

1. Motivation
- Question: a user has already bought some products; what other products should we recommend?
- Collaborative Filtering (CF) automates the "circle of advisors".

2. Collaborative Filtering
- "...people collaborate to help one another perform filtering by recording their reactions..." (Tapestry)
- Finds users whose taste is similar to yours and uses them to make recommendations.
- Complementary to IR/IF: IR/IF finds similar documents, while CF finds similar users.

3. Example
- Which movie should Sammy watch next? Ratings are on a 1-5 scale.
- If we just use the average of the other users who voted on these movies, we get Matrix = 3 and Titanic = 14/4 = 3.5, so we recommend Titanic.
- But is this reasonable? The plain average weights every voter equally, however different their taste is from Sammy's.

4. Types of Collaborative Filtering Algorithms
- Collaborative filters
- Open problems: sparsity, first rater, scalability

5. Statistical Collaborative Filters
- Users annotate items with numeric ratings.
- Users who rate items "similarly" become mutual advisors.
- A recommendation is computed by taking a weighted aggregate of advisor ratings.

6. Basic Idea: Nearest-Neighbor Algorithm
- Given a user a and an item i:
  1. Find the users most similar to a; call this set Y.
  2. Find how the users in Y ranked i.
  3. Calculate a predicted rating of a on i from some average over Y.
- How should we calculate the similarity and the average?
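The naive baseline from the example slide can be written out directly. This is a toy sketch: the per-user ratings below are invented so that the averages come out to the slide's 3 and 14/4 = 3.5.

```python
# Naive recommendation: average the ratings of the other users who
# rated each candidate movie, ignoring how similar they are to Sammy.
# The individual ratings are made up for this sketch; only the totals
# (12/4 and 14/4) are taken from the slide.
ratings = {
    "Matrix":  [4, 3, 2, 3],   # sums to 12, so the average is 12/4 = 3.0
    "Titanic": [5, 4, 2, 3],   # sums to 14, so the average is 14/4 = 3.5
}

def naive_score(movie):
    votes = ratings[movie]
    return sum(votes) / len(votes)

for movie in ratings:
    print(movie, naive_score(movie))
```

The naive rule recommends Titanic, but it weights a user with completely opposite taste the same as a near-twin of Sammy; that is the gap the nearest-neighbor algorithm below closes.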
7. Statistical Filters
- GroupLens [Resnick et al. 94, MIT]: filters UseNet news postings.
- Similarity: Pearson correlation.
- Prediction: weighted deviation from the mean.

8-9. Pearson Correlation
- The weight between users a and u is their Pearson correlation, computed over the items both users rated; it ranges over [-1, 1]:

    w(a,u) = sum_i (r(a,i) - rbar_a)(r(u,i) - rbar_u)
             / sqrt( sum_i (r(a,i) - rbar_a)^2 * sum_i (r(u,i) - rbar_u)^2 )

- Compute this similarity matrix between all pairs of users.

10. Prediction Generation
- Predict how much user a likes item i (a stands for the active user) using the weighted deviation from the mean, equation (1):

    P(a,i) = rbar_a + sum_u w(a,u)(r(u,i) - rbar_u) / sum_u |w(a,u)|    (1)

  where the denominator is the sum of the absolute weights.

11. Error Estimation
- Mean Absolute Error (MAE) for user a over the m items with known ratings:

    MAE_a = (1/m) * sum_i |P(a,i) - r(a,i)|

- Also report the standard deviation of the errors.

12. Example
- Correlation matrix between users:

            Sammy   Dylan   Mathew
    Sammy    1       1      -0.87
    Dylan    1       1       0.21
    Mathew  -0.87    0.21    1

- The worked prediction from equation (1) comes out to 0.83.

13. Open Problems in CF
- "Sparsity problem": CFs have poor accuracy and coverage in comparison to population averages at low rating density [GSK+99].
- "First-rater problem" (cold-start problem): the first person to rate an item receives no benefit; CF depends upon altruism [AZ97].

14. Open Problems in CF (continued)
- "Scalability problem": CF is computationally expensive; the fastest published (nearest-neighbor) algorithms are O(n^2).
- Is there an indexing method for speeding this up? The question has received relatively little attention.

15. The PageRank Algorithm
- Fundamental question: what is the importance level of a page P?
- Information retrieval (cosine similarity + TF-IDF) does not take hyperlinks into account.
- Link-based view: important pages (nodes) have many other pages linking to them, and important pages also point to other important pages.

16. The Google Crawler Algorithm
- "Efficient Crawling Through URL Ordering", Junghoo Cho, Hector Garcia-Molina, Lawrence Page, Stanford.
  http://www.www8.org
  http://www-db.stanford.edu/~cho/crawler-paper/
- "Modern Information Retrieval", BY-RN, pages 380-382.
- Lawrence Page, Sergey Brin. The Anatomy of a Search Engine. The Seventh International WWW Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.
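The Pearson-weight and weighted-deviation machinery of slides 8-11 can be sketched as follows. This is a minimal illustration, not the GroupLens implementation: the ratings dictionary and the item names m1..m4 are invented, and each user's mean is taken over all of that user's ratings for simplicity.

```python
import math

# Toy ratings data, invented for this sketch: user -> {item: rating}.
ratings = {
    "Sammy":  {"m1": 5, "m2": 4, "m3": 1},
    "Dylan":  {"m1": 4, "m2": 5, "m3": 2, "m4": 5},
    "Mathew": {"m1": 1, "m2": 2, "m3": 5, "m4": 2},
}

def mean_rating(user):
    """User's mean rating (here over all their ratings, a simplification)."""
    r = ratings[user]
    return sum(r.values()) / len(r)

def pearson(a, u):
    """The weight w(a,u): Pearson correlation over items both users rated."""
    common = set(ratings[a]) & set(ratings[u])
    if not common:
        return 0.0
    ma, mu = mean_rating(a), mean_rating(u)
    num = sum((ratings[a][i] - ma) * (ratings[u][i] - mu) for i in common)
    den = math.sqrt(sum((ratings[a][i] - ma) ** 2 for i in common)
                    * sum((ratings[u][i] - mu) ** 2 for i in common))
    return num / den if den else 0.0

def predict(a, item):
    """Equation (1): the active user's mean plus the weighted deviation
    of the advisors' ratings from their own means."""
    advisors = [u for u in ratings if u != a and item in ratings[u]]
    num = sum(pearson(a, u) * (ratings[u][item] - mean_rating(u))
              for u in advisors)
    den = sum(abs(pearson(a, u)) for u in advisors)
    return mean_rating(a) + (num / den if den else 0.0)

print(predict("Sammy", "m4"))
```

Note how an advisor with negative correlation pulls the prediction in the opposite direction of their deviation, which a plain average cannot do.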
  http://www.www7.org

17. PageRank Metric
- Let 1-d be the probability that the user jumps to page P at random; d is the damping factor (here d = 0.9).
- Let T1, ..., TN be the pages linking to P (so N is the in-degree of P), and let Ci be the number of out-links (the out-degree) of each Ti; the slide's figure shows a page with C = 2.

    IR(P) = (1 - d) + d * ( IR(T1)/C1 + ... + IR(TN)/CN )

18. How to Compute PageRank
- For a given network of web pages:
  1. Initialize the page rank of all pages (to one).
  2. Set the parameter (d = 0.90).
  3. Iterate through the network L times.

19. Example: Iteration k=1
- Three nodes A, B, C; IR(P) = 1/3 for all nodes, d = 0.9.

    node  IR
    A     1/3
    B     1/3
    C     1/3

20. Example: k=2

    node  IR
    A     0.4
    B     0.1
    C     0.55

- Note: A, B, C's IR values are updated in order (A, then B, then C), so the new value of A is already used when calculating B, and so on.

21. Example: k=2 (normalized)

    node  IR
    A     0.38
    B     0.095
    C     0.52

22. Crawler Control
- All crawlers maintain several queues of URLs to pursue next; Google initially maintains 500 queues, each corresponding to one web site being pursued.
- Important considerations: limited buffer space, limited time, avoiding overloading target sites, and avoiding overloading network traffic.

23. Crawler Control (continued)
- It is therefore important to visit important pages first.
- Crawl-and-stop: let G be a lower-bound threshold on IR(P); select only pages with IR > G to crawl, and stop after crawling K pages.

24. Test Result: 179,000 pages
- [Plot: percentage of the Stanford Web crawled vs. PST, the percentage of hot pages visited so far.]

25. Google Algorithm (very simplified)
- First, compute the page rank of each page on the WWW (query independent).
- Then, in response to a query q, return pages that contain q and have the highest page ranks.
- A problem/feature of Google: it favors big commercial sites.
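The iteration of slides 17-21 can be sketched as below. The three-node link structure is an assumption for illustration (the slides do not give the example graph's edges), so the resulting numbers will not reproduce the slide's 0.4 / 0.1 / 0.55 exactly.

```python
# Iterative PageRank following slides 17-21: initialize every rank to 1/N,
# then repeatedly apply
#     IR(P) = (1 - d) + d * sum(IR(Ti) / Ci  for each page Ti linking to P)
# with in-place updates (A, then B, then C) and a normalization pass
# after each sweep, as in the slide example.
links = {            # page -> pages it links out to (assumed graph)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, d=0.9, iterations=20):
    pages = list(links)
    ir = {p: 1.0 / len(pages) for p in pages}
    # incoming[p] lists the pages Ti that link to p
    incoming = {p: [q for q in pages if p in links[q]] for p in pages}
    for _ in range(iterations):
        for p in pages:  # in-place update: later pages see earlier new values
            ir[p] = (1 - d) + d * sum(ir[t] / len(links[t])
                                      for t in incoming[p])
        total = sum(ir.values())
        ir = {p: v / total for p, v in ir.items()}  # normalize (slide 21)
    return ir

print(pagerank(links))
```

With this graph, C collects rank from both A and B while B only receives half of A's out-flow, illustrating why pages with many important in-links end up ranked highest.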
