Learn to Rank

PengBo
Dec 3, 2007

Outline
- Scoring and Ranking
- RankSVM Learning
- RankBoost Learning

Ranking in Web Search

Ranking Is The Key
- Ideal ranking function: Relevance + Quality
- Relevance (query dependent)
  - TF, IDF
  - Title, Body, Anchor, URL
  - Proximity
  - …
- Quality
  - PageRank
  - PageQuality, Spam
  - …

SEO - Search Engine Optimization
- SEO - Inverse Hacking
- How to design a website? => PageRank
- How to write a web page? => Relevance

Website Structure
- PageRank explanation
[Figure: four link structures over pages A, B and C with their PageRank values: (A = 0.15, B = 0.2775, C = 0.15), (A = 1, B = 1, C = 1), (A = 1.4594, B = 0.7702, C = 0.7702), (A = 1.425, B = 1, C = 0.575)]

Webpage - Relevance
- The position of a query term is important
  - First is best, second is second best
- In the "Title and Description" tag
  - Do not rely on this much
- In the "Keyword" tag
  - If a query term appears in this tag, it must appear somewhere in the body, otherwise penalize
  - A query term should not appear more than twice in this tag, otherwise penalize
- Query term density
  - Something like tf-idf
  - 5%~20% for all keywords and 1~3% for an individual keyword, otherwise penalize
- File size
  - < 40K is preferred, a very large file is penalized

Manually Tuning?
- The ranking function is very complex
  - Features: hundreds, continuous, discrete, BM25
- It is painful and hopeless to manually tune the ranking function

Scoring and Ranking

Scoring
- We wish to return, in order, the documents most likely to be useful to the searcher
- How can we rank order the docs in the corpus with respect to a query?
- Assign a score, say in [0,1], for each doc d on each query q
- Begin with a perfect world: no spammers
  - Nobody stuffing keywords into a doc to make it match queries
  - More on "adversarial IR" under web search

Linear zone combinations
- First generation of scoring methods: use a linear combination of Booleans
- E.g., $\text{Score} = 0.6\,s_1 + 0.3\,s_2 + 0.05\,s_3 + 0.05\,s_4$, where each $s_k$ is a Boolean expression testing the query against one zone of the document
- Each expression such as $s_1$ takes on a value in {0,1}.
- Then the overall score is in [0,1].
- For this example the scores can only take on a finite set of values. What are they?

Where do these weights come from?
- Machine learned scoring
- Given
  - A test corpus
  - A suite of test queries
  - A set of relevance judgments
- Learn a set of weights such that the relevance judgments are matched

Simple example
- Each doc has two zones, Title and Body
- For a chosen $w \in [0,1]$, the score for doc $d$ on query $q$ is
  $$\mathrm{score}(d, q) = w \cdot s_T(d, q) + (1 - w) \cdot s_B(d, q)$$
  where $s_T(d, q) \in \{0,1\}$ is a Boolean denoting whether $q$ matches the Title and $s_B(d, q) \in \{0,1\}$ is a Boolean denoting whether $q$ matches the Body

Learning w from training examples
- We are given training examples, each of which is a triple: DocID $d$, Query $q$ and Judgment Relevant/Non.
- From these, we will learn the best value of $w$.

How?
- For each example $t$ we can compute the score based on $\mathrm{score}(d_t, q_t) = w \cdot s_T(d_t, q_t) + (1 - w) \cdot s_B(d_t, q_t)$
- We quantify Relevant as 1 and Non-relevant as 0
- We would like the choice of $w$ to be such that the computed scores are as close to these 1/0 judgments as possible
  - Denote by $r(d_t, q_t)$ the judgment for $t$
- Then minimize the total squared error $\sum_t \big(r(d_t, q_t) - \mathrm{score}(d_t, q_t)\big)^2$

Optimizing w
- There are 4 kinds of training examples, one per combination of $s_T$ and $s_B$
- Thus only four possible values for the score
- And only 8 possible values for the error
- Let $n_{01r}$ be the number of training examples for which $s_T(d, q) = 0$, $s_B(d, q) = 1$, judgment = Relevant.
- Similarly define $n_{00r}, n_{10r}, n_{11r}, n_{00i}, n_{01i}, n_{10i}, n_{11i}$
- For an example with $s_T = 0$, $s_B = 1$ (score $1 - w$):
  - Judgment = 1 gives Error = $w$
  - Judgment = 0 gives Error = $1 - w$
- Error: $[1 - (1 - w)]^2\, n_{01r} + [0 - (1 - w)]^2\, n_{01i}$

Total error, then calculus
- Add up contributions from the various cases to get the total error:
  $$(n_{01r} + n_{10i})\, w^2 + (n_{10r} + n_{01i})\,(1 - w)^2 + n_{00r} + n_{11i}$$
- Now differentiate with respect to $w$ to get the optimal value of $w$ as:
  $$w = \frac{n_{10r} + n_{01i}}{n_{10r} + n_{10i} + n_{01r} + n_{01i}}$$

Learning-based Web Search
- Given a set of features $e_1, e_2, \ldots, e_N$, learn a ranking function $f(e_1, e_2, \ldots, e_N)$ that minimizes the loss function $L$:
  $$f^* = \arg\min_{f \in F} L\big(f(e_1, e_2, \ldots, e_N), \text{GroundTruth}\big)$$
- Some related issues
  - The functional space $F$: linear/non-linear? continuous? derivative?
  - The search strategy
  - The loss function

SVM and Ranking

Ranking vs. Classification
- Classification
  - Well studied over 30 years
  - Bayesian, Neural network, Decision tree, SVM, Boosting, …
  - Training data: points
    - Pos: x1, x2, x3; Neg: x4, x5
- Ranking
  - Less studied: only a few works published in recent years
  - Training data: pairs (partial order)
    - (x1, x2), (x1, x3), (x1, x4), (x1, x5)
    - (x2, x3), (x2, x4) …
    - …
[Figure: the points x5, x4, 0, x3, x2, x1 placed along a line]

Problem Formulation
- Input:
  - $X = \{x_i \mid i = 1, \ldots, n\}$: domain or instance space
  - $\Phi : X \times X \to \mathbb{R}$: feedback function (ground truth)
    - $\Phi(x_0, x_1) > 0$: $x_1$ should be ranked above $x_0$
    - Vice versa
- Output:
  - $F(x)$: ranking function, consistent with $\Phi(x_0, x_1)$

Linear classifiers: Which Hyperplane?
- Lots of possible solutions for a, b, c.
- Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
  - E.g., perceptron
- Support Vector Machine (SVM) finds an optimal solution.
  - Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
  - One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions
[Figure: a line representing the decision boundary ax + by - c = 0]

Another intuition
- If you have to place a fat separator between classes, you have fewer choices, and so the capacity of the model has been decreased

Support Vector Machine (SVM)
- SVMs maximize the margin around the separating hyperplane.
- The decision function is fully specified by a subset of training samples, the support vectors.
- Quadratic programming problem
- Seen by many as the most successful current text classification method
[Figure: separating hyperplane with labels "Support vectors" and "Maximize margin"]

Maximum Margin: Formalization
- $x_i$: data point $i$
- hyperplane: $w^T x + b = 0$
  - $w$: weight vector, perpendicular to the hyperplane
  - $b$: intercept term
- $y_i$: class of data point $i$ (+1 or -1). NB: not 1/0
- Classifier is: $f(x_i) = \mathrm{sign}(w^T x_i + b)$
- Functional margin of $x_i$ is: $y_i (w^T x_i + b)$
  - But note that we can increase this margin simply by scaling $w$, $b$
- Functional margin of the dataset is the minimum functional margin for any point

Geometric Margin
- Distance from an example to the separator is $r = y\, \frac{w^T x + b}{\|w\|}$
- Examples closest to the hyperplane are support vectors.
- Margin $\rho$ of the separator is the width of separation between support vectors of the classes.
[Figure: point x at distance r from the separator, its projection x′, and margin ρ]

Linear SVM Mathematically
- Assume that every data point has functional margin at least 1; then the following two constraints follow for a training set $\{(x_i, y_i)\}$:
  - $w^T x_i + b \ge 1$ if $y_i = 1$
  - $w^T x_i + b \le -1$ if $y_i = -1$
- For support vectors, the inequality becomes an equality
- Then, since each example's distance from the hyperplane is $r = y\, \frac{w^T x + b}{\|w\|}$, the margin is $\rho = \frac{2}{\|w\|}$

Linear Support Vector Machine (SVM)
[Figure: hyperplane $w^T x + b = 0$ with margin boundaries $w^T x_a + b = 1$ and $w^T x_b + b = -1$, separated by ρ]
- Extra scale constraint: $\min_{i = 1, \ldots, n} |w^T x_i + b| = 1$
- This implies: $w^T (x_a - x_b) = 2$ and $\rho = \|x_a - x_b\|_2 = 2 / \|w\|_2$

Linear SVMs Mathematically (cont.)
- Then we can formulate the quadratic optimization problem:
  Find $w$ and $b$ such that $\rho = \frac{2}{\|w\|}$ is maximized; and for all $\{(x_i, y_i)\}$: $w^T x_i + b \ge 1$ if $y_i = 1$; $w^T x_i + b \le -1$ if $y_i = -1$
- A better formulation (min $\|w\|$ = max $1/\|w\|$):
  Find $w$ and $b$ such that $\Phi(w) = \frac{1}{2} w^T w$ is minimized; and for all $\{(x_i, y_i)\}$: $y_i (w^T x_i + b) \ge 1$

Solving the Optimization Problem
- Find $w$ and $b$ such that $\Phi(w) = \frac{1}{2} w^T w$ is minimized; and for all $\{(x_i, y_i)\}$: $y_i (w^T x_i + b) \ge 1$
- This is now optimizing a quadratic function subject to linear constraints
- Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them
- The solution involves constructing a dual problem where a Lagrange multiplier $\alpha_i$ is associated with every constraint in the primal problem:
  Find $\alpha_1 \ldots \alpha_N$ such that $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$ is maximized and
  (1) $\sum_i \alpha_i y_i = 0$
  (2) $\alpha_i \ge 0$ for all $\alpha_i$

The Optimization Problem Solution
- The solution has the form:
  $w = \sum_i \alpha_i y_i x_i$
  $b = y_k - w^T x_k$ for any $x_k$ such that $\alpha_k \ne 0$
- Each non-zero $\alpha_i$ indicates that the corresponding $x_i$ is a support vector.
- Then the classifying function will have the form: $f(x) = \sum_i \alpha_i y_i x_i^T x + b$
- Notice that it relies on an inner product between the test point $x$ and the support vectors $x_i$. We will return to this later.
- Also keep in mind that solving the optimization problem involved computing the inner products $x_i^T x_j$ between all pairs of training points.

Soft Margin Classification
- If the training set is not linearly separable, slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples.
- Allow some errors
  - Let some points be moved to where they belong, at a cost
- Still, try to minimize training set errors, and to place the hyperplane "far" from each class (large margin)
[Figure: two points on the wrong side of their margins, with slacks $\xi_i$ and $\xi_j$]

Soft Margin Classification Mathematically
- The old formulation:
  Find $w$ and $b$ such that $\Phi(w) = \frac{1}{2} w^T w$ is minimized and for all $\{(x_i, y_i)\}$: $y_i (w^T x_i + b) \ge 1$
- The new formulation incorporating slack variables:
  Find $w$ and $b$ such that $\Phi(w) = \frac{1}{2} w^T w + C \sum_i \xi_i$ is minimized and for all $\{(x_i, y_i)\}$: $y_i (w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$
- Parameter $C$ can be viewed as a way to control overfitting: a regularization term

Soft Margin Classification: Solution
- The dual problem for soft margin classification:
  Find $\alpha_1 \ldots \alpha_N$ such that $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$ is maximized and
  (1) $\sum_i \alpha_i y_i = 0$
  (2) $0 \le \alpha_i \le C$ for all $\alpha_i$
- Neither the slack variables $\xi_i$ nor their Lagrange multipliers appear in the dual problem!
- Again, $x_i$ with non-zero $\alpha_i$ will be support vectors.
- Solution to the dual problem is:
  $w = \sum_i \alpha_i y_i x_i$
  $b = y_k (1 - \xi_k) - w^T x_k$ where $k = \arg\max_{k'} \alpha_{k'}$
- But $w$ is not needed explicitly for classification: $f(x) = \sum_i \alpha_i y_i x_i^T x + b$

Classification with SVMs
- Given a new point $(x_1, x_2)$, we can score its projection onto the hyperplane normal:
  - In 2 dims: score = $w_1 x_1 + w_2 x_2 + b$
  - I.e., compute score: $w^T x + b = \sum_i \alpha_i y_i x_i^T x + b$
- Set confidence threshold t.
  - Score > t: yes
  - Score < -t: no
  - Else: don't know

Linear SVMs: Summary
- The classifier is a separating hyperplane.
- The most "important" training points are support vectors; they define the hyperplane.
- Quadratic optimization algorithms can identify which training points $x_i$ are support vectors with non-zero Lagrangian multipliers $\alpha_i$.
- Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
  Find $\alpha_1 \ldots \alpha_N$ such that $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$ is maximized and
  (1) $\sum_i \alpha_i y_i = 0$
  (2) $0 \le \alpha_i \le C$ for all $\alpha_i$
  $f(x) = \sum_i \alpha_i y_i x_i^T x + b$

Non-linear SVMs
- Datasets that are linearly separable (with some noise) work out great:
  [Figure: separable one-dimensional data on the x axis]
- But what are we going to do if the dataset is just too hard?
  [Figure: one-dimensional data on the x axis that no single threshold separates]
- How about … mapping data to a higher-dimensional space:
  [Figure: the same data plotted as x against x²]

Non-linear SVMs: Feature spaces
- General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
  $\Phi: x \to \varphi(x)$

The "Kernel Trick"
- The linear classifier relies on an inner product between vectors, $x_i^T x_j$
- If every datapoint is mapped into a high-dimensional space via some transformation $\Phi: x \to \varphi(x)$, the inner product becomes $\varphi(x_i)^T \varphi(x_j)$
- Trick: can this dot product (which is just a real number) be computed simply and efficiently in terms of the original data points?

Mercer's theorem
- Any continuous, symmetric, positive semi-definite kernel function $K(x, y)$ can be expressed as a dot product in a high-dimensional space.
- There exists a function $\varphi(x)$ whose range is in an inner product space of possibly high dimension, such that $K(x, y) = \varphi(x)^T \varphi(y)$
- The kernel trick transforms any algorithm that solely depends on the dot product between two vectors.
- Wherever a dot product is used, it is replaced with the kernel function.
- Thus, a linear algorithm can easily be transformed into a non-linear algorithm.
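The last point, that swapping the dot product for a kernel turns a linear algorithm into a non-linear one, can be illustrated with a small sketch. It uses the perceptron in its dual form (not an SVM solver), the quadratic kernel $K(x, z) = (1 + x^T z)^2$, and a hypothetical four-point XOR-style dataset; all function and variable names are assumptions made for the illustration.

```python
def quadratic_kernel(x, z):
    """K(x, z) = (1 + x.z)^2, evaluated without building the mapped vectors."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2


def linear_kernel(x, z):
    """Plain dot product: recovers the ordinary linear perceptron."""
    return sum(a * b for a, b in zip(x, z))


def decision_score(points, labels, alpha, kernel, x):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x); no bias term in this sketch."""
    return sum(a * y * kernel(p, x) for a, y, p in zip(alpha, labels, points))


def train_kernel_perceptron(points, labels, kernel, max_epochs=100):
    """Dual perceptron: alpha[i] counts the mistakes made on point i."""
    alpha = [0] * len(points)
    for _ in range(max_epochs):
        mistakes = 0
        for j, (x, y) in enumerate(zip(points, labels)):
            if y * decision_score(points, labels, alpha, kernel, x) <= 0:
                alpha[j] += 1  # the data enters only through kernel values
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def predict(points, labels, alpha, kernel, x):
    return 1 if decision_score(points, labels, alpha, kernel, x) > 0 else -1


# Hypothetical XOR-style data: no line in the original 2-D space separates it.
POINTS = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
LABELS = [1, 1, -1, -1]

alpha_quad = train_kernel_perceptron(POINTS, LABELS, quadratic_kernel)
alpha_lin = train_kernel_perceptron(POINTS, LABELS, linear_kernel)
quad_preds = [predict(POINTS, LABELS, alpha_quad, quadratic_kernel, x) for x in POINTS]
lin_preds = [predict(POINTS, LABELS, alpha_lin, linear_kernel, x) for x in POINTS]
```

The training loop is identical in both runs; only the `kernel` argument changes. With the quadratic kernel the four points are classified correctly, while the linear kernel cannot fit them however many epochs are allowed.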
Kernel Example
- 2-dimensional vectors $x = [x_1\ x_2]$
- Let $K(x_i, x_j) = (1 + x_i^T x_j)^2$. Is $K(\cdot)$ a kernel?
- Need to show that there exists some $\varphi$ such that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$:
  $K(x_i, x_j) = (1 + x_i^T x_j)^2$
  $= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$
  $= [1,\ x_{i1}^2,\ \sqrt{2} x_{i1} x_{i2},\ x_{i2}^2,\ \sqrt{2} x_{i1},\ \sqrt{2} x_{i2}]^T\, [1,\ x_{j1}^2,\ \sqrt{2} x_{j1} x_{j2},\ x_{j2}^2,\ \sqrt{2} x_{j1},\ \sqrt{2} x_{j2}]$
  $= \varphi(x_i)^T \varphi(x_j)$
  where $\varphi(x) = [1,\ x_1^2,\ \sqrt{2} x_1 x_2,\ x_2^2,\ \sqrt{2} x_1,\ \sqrt{2} x_2]$

Kernels
- Why use kernels?
  - Make a non-separable problem separable.
  - Map data into a better representational space
- Common kernels
  - Linear
  - Polynomial: $K(x, z) = (1 + x^T z)^d$
  - Radial basis function (infinite dimensional space)

Resources of SVM
- SVMlight: a popular implementation of the SVM algorithm by Thorsten Joachims; it can be used to solve classification, regression and ranking problems.
- LIBSVM: A Library for Support Vector Machines, Chih-Chung Chang and Chih-Jen Lin
- Weka: a machine learning toolkit that includes an implementation of an SVM classifier; Weka can be used both interactively through a graphical interface or as a software library. (The SVM implementation is called "SMO". It can be found in the Weka Explorer GUI, under the "functions" category.)

Let's come back to Ranking…
- Ranking problem as classification
- Y: relevant or irrelevant (+1, -1)
- X: features on (q, d), such as
  - COS Similarity
  - Proximity
  - PageRank
  - Document Length
  - Document age
  - Zone contribution (e.g. TITLE, H1, H2, ...)
  - …

A simple example
- We could train an SVM over binary relevance judgments
- Order documents based on their signed distance from the decision boundary.

Discussion
- Drawback of the simple classification view:
  - the ranking at the very top of the results list is exceedingly important
  - whereas decisions of relevance of a document to a query may be much less important.

Ranking SVM
- Training Set
  - for each query q, we have a ranked list of documents totally ordered by a person for relevance to the query.
- Features
  - a vector of features $\psi(d, q)$ for each document/query pair
  - feature differences for two documents $d_i$ and $d_j$: $\Phi(d_i, d_j, q) = \psi(d_i, q) - \psi(d_j, q)$
- Classification
  - if $d_i$ is judged more relevant than $d_j$, denoted $d_i \prec d_j$, then assign the vector $\Phi(d_i, d_j, q)$ the class $y_{ijq} = +1$; otherwise $-1$.

RankSVM for Web Search Optimization (Joachims, SIGKDD 2002)
- Optimizing search engines using clickthrough data
  - Absolute constraints cannot be extracted from clickthrough data; only partial orders are available.
  - A click does not mean absolutely relevant, and many relevant pages are never clicked because of their low ranks in search engines.

Boosting and Ranking

Horse Race Prediction

How to Make $$$ In Horse Races?
- Ask a professional.
- Suppose:
  - The professional cannot give a single highly accurate rule
  - But presented with a set of races, can always generate better-than-random rules
- Can you get rich?

Idea
- Ask the expert for a rule-of-thumb
- Assemble the set of cases where the rule-of-thumb fails (hard cases)
- Ask the expert again for the selected set of hard cases
- And so on…
- Combine all rules-of-thumb
- The expert could be a "weak" learning algorithm

Questions
- How to choose races on each round?
  - concentrate on the "hardest" races (those most often misclassified by previous rules of thumb)
- How to combine rules of thumb into a single prediction rule?
  - take a (weighted) majority vote of the rules of thumb
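The two answers on the Questions slide (put more weight on the hardest cases, then take a weighted vote) can be sketched as a short boosting loop. This is a minimal illustration, not the lecture's own code: the one-dimensional toy data, the threshold "stump" used as the rule of thumb, and every function name are hypothetical, and the re-weighting rule $\exp(-\alpha\, y\, h(x))$ with $\alpha = \frac{1}{2} \ln((1 - \epsilon)/\epsilon)$ is the common AdaBoost choice, assumed here.

```python
import math


def stump_predict(threshold, sign, x):
    """Rule of thumb: predict `sign` when x > threshold, otherwise -sign."""
    return sign if x > threshold else -sign


def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, sign) of the best threshold rule."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    best = None
    for threshold in thresholds:
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(threshold, sign, x) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best


def boost(xs, ys, rounds):
    """Collect `rounds` weighted rules of thumb as (alpha, threshold, sign)."""
    weights = [1.0 / len(xs)] * len(xs)  # start with every case equally hard
    ensemble = []
    for _ in range(rounds):
        err, threshold, sign = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, sign))
        # Misclassified ("hard") cases gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, sign, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def vote(ensemble, x):
    """Weighted majority vote of all rules of thumb."""
    score = sum(alpha * stump_predict(t, s, x) for alpha, t, s in ensemble)
    return 1 if score > 0 else -1


# Hypothetical toy data: no single threshold rule classifies all ten points.
XS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
YS = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

single_rule = boost(XS, YS, rounds=1)
ensemble = boost(XS, YS, rounds=3)
single_preds = [vote(single_rule, x) for x in XS]
ensemble_preds = [vote(ensemble, x) for x in XS]
```

On this data one rule of thumb leaves some points misclassified, while the weighted vote of three rules fits all ten. Each later rule is chosen under weights that emphasize the mistakes of the earlier ones.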
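AdaBoost.M1
- Given training set $X = \{(x_1, y_1), \ldots, (x_m, y_m)\}$
- $y_i \in \{-1, +1\}$ is the correct label of instance $x_i \in X$
- Initialize distribution $D_1(i) = 1/m$ (how hard the case is)
- for $t = 1, \ldots, T$:
  - Find weak hypothesis ("rule of thumb") $h_t : X \to \{-1, +1\}$ with small error $\epsilon_t$ on $D_t$: $\epsilon_t = \sum_{i:\, h_t(x_i) \ne y_i} D_t(i)$
  - Update distribution $D_t$ on $\{1, \ldots, m\}$: find the hardest races according to the last-round hypothesis, where $y_i h_t(x_i) > 0$ if correct and $y_i h_t(x_i) < 0$ if wrong
  - $\alpha_t = \log\big((1 - \epsilon_t)/\epsilon_t\big)$
- Output final hypothesis $H(x) = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t h_t(x)\big)$: weighted voting for the final decision-making

AdaBoost: Training Error Analysis
- Suppose $f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$, so that $H(x) = \mathrm{sign}(f(x))$, and the update is $D_{t+1}(i) = D_t(i) \exp(-\alpha_t y_i h_t(x_i)) / Z_t$ with normalizer $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i))$
- Equivalent (unrolling the update from $D_1(i) = 1/m$): $D_{T+1}(i) = \frac{1}{m} \exp(-y_i f(x_i)) \big/ \prod_{t=1}^{T} Z_t$
- Therefore, the training error is: $\frac{1}{m} |\{i : H(x_i) \ne y_i\}|$
  - $\{i : H(x_i) \ne y_i\}$ is a vector whose $i$-th element is $[H(x_i) \ne y_i]$
  - $|\{i : H(x_i) \ne y_i\}|$ is the sum of all the elements in the vector
- As: $[H(x_i) \ne y_i] \le \exp(-y_i f(x_i))$, because $H(x_i) \ne y_i$ implies $y_i f(x_i) \le 0$
- Considering $\sum_i D_{T+1}(i) = 1$: $\frac{1}{m} \sum_i \exp(-y_i f(x_i)) = \prod_{t=1}^{T} Z_t$
- Finally: $\frac{1}{m} |\{i : H(x_i) \ne y_i\}| \le \prod_{t=1}^{T} Z_t$

AdaBoost: How to choose $\alpha_t$
- According to $\frac{1}{m} |\{i : H(x_i) \ne y_i\}| \le \prod_{t=1}^{T} Z_t$, minimizing the error bound can be done greedily by minimizing $Z_t$ in each round:
  $\alpha_t = \arg\min_{\alpha} Z_t = \arg\min_{\alpha} \sum_i D_t(i)\, e^{-\alpha y_i h_t(x_i)}$
- Actually, AdaBoost just minimizes this bound on the training error.
- Let $u_i = y_i h_t(x_i)$; then $Z_t = \sum_i D_t(i)\, e^{-\alpha u_i} = \sum_i D_t(i) \left( \frac{1 + u_i}{2} e^{-\alpha} + \frac{1 - u_i}{2} e^{\alpha} \right)$
  - This equation is obvious if we treat $u_i$ as a binary-valued variable.
- Let $r = \sum_i D_t(i)\, u_i$. Considering $\sum_i D_t(i) = 1$, we have $Z_t = \frac{1 + r}{2} e^{-\alpha} + \frac{1 - r}{2} e^{\alpha}$
- By setting $dZ_t / d\alpha = 0$ we can easily get the solution: the right term is minimized when $\alpha = \frac{1}{2} \ln \frac{1 + r}{1 - r}$
- Let $\epsilon_t = \sum_{h_t(x_i) \ne y_i} D_t(i)$; we have $r = \sum_{h_t(x_i) = y_i} D_t(i) - \sum_{h_t(x_i) \ne y_i} D_t(i) = 1 - 2\epsilon_t$
- Therefore, we choose $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$, which matches the AdaBoost.M1 weight up to the constant factor 1/2; a common factor on all $\alpha_t$ does not change the sign of the weighted vote

Toy Example
[Figures: Round 1, Round 2, Round 3, Final Hypothesis]

The Nature of Boosting
- Given training set $X = \{(x_1, y_1), \ldots, (x_m, y_m)\}$
- $y_i \in \{-1, +1\}$ is the correct label of instance $x_i \in X$
- Initialize distribution $D_1(i) = 1/m$ (how hard the case is)
- for $t = 1, \ldots, T$:
  - Find weak hypothesis ("rule of thumb") $h_t : X \to \{-1, +1\}$ with small error $\epsilon_t$ on $D_t$
  - Update distribution $D_t$ on $\{1, \ldots, m\}$:
    $D_{t+1}(i) = D_t(i)\, \beta_t / Z_t$ if $y_i \ne h_t(x_i)$
    $D_{t+1}(i) = D_t(i) / Z_t$ if $y_i = h_t(x_i)$
  - Choose $\alpha_i^t$, $i = 1, \ldots, t$
  - Get hypothesis $H_t(x) = \mathrm{sign}\big(\sum_{i=1}^{t} \alpha_i^t h_i(x)\big)$
- Note: the distribution update and the weighted combination are independent here

Boosting: Updating Weight
- As the distribution will be normalized, only one kind of example (classified correctly or wrongly) needs to be updated.
- The update strategy should avoid choosing the current weak learner again.
- We expect $h_t$ to perform worst (no better than random guessing) on $D_{t+1}$, so we have:
  $\epsilon_{t+1}(h_t) = \sum_{h_t(x_i) \ne y_i} D_{t+1}(i) = 0.5$
- Considering
  $\epsilon_{t+1}(h_t) = \dfrac{\beta_t \sum_{h_t(x_i) \ne y_i} D_t(i)}{\beta_t \sum_{h_t(x_i) \ne y_i} D_t(i) + \sum_{h_t(x_i) = y_i} D_t(i)}$ and $\sum_{h_t(x_i) \ne y_i} D_t(i) = \epsilon_t$
- We get: $\beta_t = (1 - \epsilon_t) / \epsilon_t$
- This result is just the same as in AdaBoost

Boosting: Meta Search
[Figure: Google, MSNSearch, Yahoo, Baidu and Sogou combined into a "Super Man" hypothesis $H_t(x)$]

Rank-Boost
- RankBoost is another boosting technique based on AdaBoost.
  - Suitable for ordering something by a certain criterion (ranking).
  - Applied to document ranking in Information Retrieval (IR).

Usage of RankBoost

Why RankBoost?
- Absolute relevance scores are not important.
- Relative order is important:
  - Correct order: A-B-C-D
  - OK: {A: -100, B: -10, C: 0, D: 1000}
  - OK: {A: 0.1, B: 0.3, C: 0.4, D: 0.41}

What is "To Learn Ranking" by Boosting?
- Suppose we want to use boosting:
  - What is a sample point?
  - What is the training weight of each sample?
  - What should a hypothesis return?
  - How can we define a loss function?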
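The "Why RankBoost?" point, that only relative order matters, can be checked directly on the two score assignments given there. The sketch below is illustrative only: the helper names and the third `swapped` assignment are hypothetical, and counting item pairs that two assignments order differently is used as one simple way to compare orderings.

```python
from itertools import combinations


def ranking(scores):
    """Items sorted by ascending score, which gives A-B-C-D for both examples."""
    return sorted(scores, key=scores.get)


def pairwise_disagreements(scores_a, scores_b):
    """Count item pairs that the two score assignments order differently."""
    count = 0
    for u, v in combinations(sorted(scores_a), 2):
        if (scores_a[u] - scores_a[v]) * (scores_b[u] - scores_b[v]) < 0:
            count += 1
    return count


first = {"A": -100, "B": -10, "C": 0, "D": 1000}
second = {"A": 0.1, "B": 0.3, "C": 0.4, "D": 0.41}
swapped = {"A": 0.1, "B": 0.4, "C": 0.3, "D": 0.41}  # hypothetical: B and C exchanged

same_order = ranking(first) == ranking(second)
clean = pairwise_disagreements(first, second)
broken = pairwise_disagreements(first, swapped)
```

The two assignments from the slide differ wildly in scale yet produce the same ordering and no disagreeing pairs, while exchanging B and C produces exactly one disagreeing pair.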
Key Idea
- Training data is a graph:
  - Sample points are edges.
  - Training weights are the weights of edges.

Key Idea
- Hypothesis H returns a real value.

Key Idea
- Loss function is the # of wrong edges.
  - # of pairwise disagreements: $\sum_{(x_0, x_1)} D(x_0, x_1)\, [H(x_1) \le H(x_0)]$, summing the edge weights $D$ over pairs in which $x_1$ should be ranked above $x_0$

Summary of Key Ideas
- What is a sample point?
  - Edge
- What is the training weight of each sample?
  - Weight of edge
- What should a hypothesis return?
  - Real value
- How can we define a loss function?
  - # of wrong edges

In the IR Scenario

Extension to Multi-grade
- This is not classification!
- You can easily extend this to a multigraded criterion (e.g. movie rating).
- Problem: the difference between two grades may not be consistent:
  - {awful, bad, so-so, good, superb}
  - "awful" and "bad".
  - "so-so" and "good".

RankBoost Algorithm

Comparison: AdaBoost vs. RankBoost

                  AdaBoost                              RankBoost
Training data     Data: x_i; Label: y_i ∈ {+1, -1}      Data pair: (x_i, x_j); Partial order: Φ(x_i, x_j)
Regression goal
Function
Loss function     # of wrong labels (weighted)          # of wrong pairs (weighted)
Training          Additive and greedy: search for the best weak classifiers/rankers iteratively (both)

Advantages
- Ranking function
  - Nonlinear and non-continuous ranking function
  - Computationally efficient
  - Feature combination vs. kernel tricks (quadratic)
- Training
  - Easy to parallelize: distributed training
  - Possible to leverage large-scale training data
  - The only limit is the number of computers

More on Learning Ranking Functions
- RankNet
  - Ranking function learning by neural network
  - Chris Burges et al., "Learning to Rank using Gradient Descent", ICML 2005
- Logistic Regression/MaxEntropy Framework
  - Ordinal regression is more suitable for the ranking problem than classification
  - R. Nallapati, "Discriminative models for information retrieval", SIGIR 2004
- It is only very recently that sufficient machine learning knowledge, training document collections, and computational power have come together to make this method practical and exciting.

Summary
- How to learn?
  - Training Set
  - Search Strategy
  - Loss Function
- RankSVM
  - Linear Classifier
  - Kernel Function
  - Partial Order Classification
- RankBoost
  - AdaBoost

Readings
- IIR Ch. 15.1, 15.2, 15.4
- [3] R. Nallapati, "Discriminative models for information retrieval," SIGIR 2004.

Resources
- LETOR: benchmark datasets for learning to rank, released by Microsoft Research Asia
- SoGouLab: open datasets for web search research, including a query log, a web test collection, etc., released by SoGou Search.
- SVMlight: a popular implementation of the SVM algorithm by Thorsten Joachims; it can be used to solve classification, regression and ranking problems.
- LIBSVM: A Library for Support Vector Machines, Chih-Chung Chang and Chih-Jen Lin

References
- [1] T. Joachims, "Optimizing search engines using clickthrough data," in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. Edmonton, Alberta, Canada: ACM, 2002.
- [2] R. E. Schapire, "A Brief Introduction to Boosting," in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence: Morgan Kaufmann Publishers Inc., 1999.
- [3] R. Nallapati, "Discriminative models for information retrieval," in Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. Sheffield, United Kingdom: ACM, 2004.

Thank You!
Q&A
