Learn to Rank
PengBo Dec 3, 2007
Outline
• Scoring and Ranking
• RankSVM Learning
• RankBoost Learning
Ranking in Web Search
Ranking Is The Key
Ideal ranking function: Relevance + Quality
Relevance (query dependent)
• TF, IDF
• Title, Body, Anchor, URL
• Proximity
• …
Quality
• PageRank
• PageQuality, Spam
• …
SEO – Search Engine Optimization
SEO - Inverse Hacking
• How to design a website? => PageRank
• How to write a web page? => Relevance
Website Structure
PageRank explanation
Page A = 0.15 Page B = 0.2775 Page C = 0.15
Page A = 1 Page B = 1 Page C = 1
Page A = 1.4594 Page B = 0.7702 Page C = 0.7702
Page A = 1.425 Page B = 1 Page C = 0.575
Webpage - Relevance
The position of a query term is important
• First is best, second is second best
In the "Title and Description" tag
• Do not rely on this too much
In the "Keyword" tag
• If a query term appears in this tag, it must also appear somewhere in the body, otherwise penalize
• A query term should not appear more than twice in this tag, otherwise penalize
Query term density
• Something like tf-idf
• 5%~20% for all keywords and 1~3% for an individual keyword, otherwise penalize
File size: < 40K is preferred; a very large file is penalized
Manually Tuning?
Ranking function is very complex
Features: hundreds, continuous, discrete, BM25
It is painful and hopeless to manually tune the ranking function
Scoring and Ranking
Scoring
We wish to return, in order, the documents most likely to be useful to the searcher
How can we rank-order the docs in the corpus with respect to a query?
Assign a score, say in [0,1], for each doc d on each query q
Begin with a perfect world: no spammers
• Nobody stuffing keywords into a doc to make it match queries
• More on "adversarial IR" under web search
Linear zone combinations
First generation of scoring methods: use a linear combination of Booleans:
E.g., Score = 0.6*z1 + 0.3*z2 + 0.05*z3 + 0.05*z4, where each zi is a Boolean expression testing whether the query matches one zone of the doc.
Each such expression takes on a value in {0,1}. Then the overall score is in [0,1].
For this example the scores can only take on a finite set of values – what are they?
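The question above can be answered by brute force. The following is a minimal sketch that enumerates every combination of the four Boolean terms with the weights 0.6, 0.3, 0.05 and 0.05 from the example; the function name `possible_scores` is illustrative, and exact fractions are used only to avoid floating-point noise.

```python
from fractions import Fraction
from itertools import product

# Weights from the example; Fractions keep sums such as 0.05 + 0.05 exact.
WEIGHTS = [Fraction(60, 100), Fraction(30, 100), Fraction(5, 100), Fraction(5, 100)]


def possible_scores(weights):
    """Return the sorted set of scores reachable with Boolean (0/1) terms."""
    scores = set()
    for bits in product((0, 1), repeat=len(weights)):
        scores.add(sum(w * b for w, b in zip(weights, bits)))
    return sorted(scores)


scores = possible_scores(WEIGHTS)
print(len(scores), [float(s) for s in scores])
```

Because two of the weights are equal, several of the 16 Boolean combinations collapse onto the same score.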
Where do these weights come from?
Machine learned scoring
Given
• A test corpus
• A suite of test queries
• A set of relevance judgments
Learn a set of weights such that the relevance judgments are matched
Simple example
Each doc has two zones, Title and Body
For a chosen w ∈ [0,1], the score for doc d on query q is
$$\mathrm{score}(d, q) = w \cdot s_T(d, q) + (1 - w) \cdot s_B(d, q)$$
where:
• sT(d, q) ∈ {0,1} is a Boolean denoting whether q matches the Title
• sB(d, q) ∈ {0,1} is a Boolean denoting whether q matches the Body
Learning w from training examples
We are given training examples, each of which is a triple: DocID d, Query q and Judgment Relevant/Non. From these, we will learn the best value of w.
How?
For each example t we can compute the score score(dt, qt) based on w, sT(dt, qt) and sB(dt, qt)
We quantify Relevant as 1 and Non-relevant as 0
Would like the choice of w to be such that the computed scores are as close to these 1/0 judgments as possible
Denote by r(dt, qt) the judgment for t
Then minimize total squared error
$$\sum_t \big( r(d_t, q_t) - \mathrm{score}(d_t, q_t) \big)^2$$
Optimizing w
There are 4 kinds of training examples, one per combination of sT(d, q) and sB(d, q)
Thus only four possible values for score
And only 8 possible values for error
Let n01r be the number of training examples for which sT(d, q)=0, sB(d, q)=1, judgment = Relevant.
Similarly define n00r, n10r, n11r, n00i, n01i, n10i, n11i
For sT(d, q)=0, sB(d, q)=1 the score is 1 − w, so:
• Judgment=1: Error = w
• Judgment=0: Error = 1 − w
Error: $[1 - (1 - w)]^2 \, n_{01r} + [0 - (1 - w)]^2 \, n_{01i}$
Total error – then calculus
Add up contributions from various cases to get total error
Now differentiate with respect to w to get optimal value of w as:
$$w = \frac{n_{10r} + n_{01i}}{n_{10r} + n_{10i} + n_{01r} + n_{01i}}$$
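As a sanity check on the calculus step, here is a small sketch that computes the total squared error for the two-zone score w·sT + (1−w)·sB and compares the closed-form minimizer against a grid search. The counts in `counts` are made up for illustration, and the function names are hypothetical.

```python
def total_error(w, n):
    """Total squared error as a function of w; n maps names like 'n01r' to counts."""
    return (
        n["n00r"]                                    # score 0, judged relevant: error 1
        + n["n11i"]                                  # score 1, judged non-relevant: error 1
        + (n["n01r"] + n["n10i"]) * w ** 2           # cases whose error is w
        + (n["n10r"] + n["n01i"]) * (1 - w) ** 2     # cases whose error is 1 - w
    )


def optimal_w(n):
    """Closed-form minimizer obtained by setting the derivative to zero."""
    denom = n["n10r"] + n["n10i"] + n["n01r"] + n["n01i"]
    if denom == 0:
        raise ValueError("no example has exactly one matching zone; w is undetermined")
    return (n["n10r"] + n["n01i"]) / denom


# Hypothetical counts, for illustration only.
counts = {"n00r": 2, "n00i": 30, "n01r": 10, "n01i": 20,
          "n10r": 15, "n10i": 5, "n11r": 40, "n11i": 3}
w_star = optimal_w(counts)
grid_best = min((k / 1000 for k in range(1001)), key=lambda w: total_error(w, counts))
print(w_star, grid_best)
```

Examples where both zones match or neither matches contribute a constant to the error, which is why they do not appear in the formula for w.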
Learning-based Web Search
Given a set of features e1,e2,…,eN, learn a ranking function f(e1,e2,…,eN) that minimizes the loss function L.
$$f^* = \arg\min_{f \in F} L\big(f(e_1, e_2, \ldots, e_N), \mathrm{GroundTruth}\big)$$
Some related issues
The functional space F
• linear/non-linear? continuous? Derivative?
The search strategy
The loss function
SVM and Ranking
Ranking vs. Classification
Classification
• Well studied over 30 years
• Bayesian, Neural network, Decision tree, SVM, Boosting, …
• Training data: points
  Pos: x1, x2, x3, Neg: x4, x5
Ranking
• Less studied: only a few works published in recent years
• Training data: pairs (partial order)
  (x1, x2), (x1, x3), (x1, x4), (x1, x5)
  (x2, x3), (x2, x4) …
  …
[Figure: the points x5, x4, 0, x3, x2, x1 laid out along an axis]
Problem Formulation
Input:
X = {xi | i=1,…,n}: domain or instance space.
Φ: X×X→R: feedback function (ground truth)
• Φ(x0, x1) > 0: x1 should be ranked above x0
• Vice versa
Output
F(x): ranking function, consistent with Φ(x0, x1)
Linear classifiers: Which Hyperplane?
Lots of possible solutions for a, b, c.
Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
• E.g., perceptron
This line represents the decision boundary: ax + by - c = 0
Support Vector Machine (SVM) finds an optimal solution.
Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions
Another intuition
If you have to place a fat separator between classes, you have fewer choices, and so the capacity of the model has been decreased
Support Vector Machine (SVM)
SVMs maximize the margin around the separating hyperplane.
The decision function is fully specified by a subset of training samples, the support vectors.
Quadratic programming problem
Seen by many as most successful current text classification method
[Figure: support vectors; maximize margin]
Maximum Margin: Formalization
xi: data point i hyperplane: wTxi + b = 0
w: weight vector, perpendicular to hyperplane b: intercept term
yi: class of data point i (+1 or -1) (NB: not 1/0)
Classifier is: f(xi) = sign(wTxi + b)
Functional margin of xi is: yi (wTxi + b)
• But note that we can increase this margin simply by scaling w, b….
Functional margin of dataset is minimum functional margin for any point
Geometric Margin
Distance from example to the separator is
$$r = y \, \frac{w^T x + b}{\|w\|}$$
Examples closest to the hyperplane are support vectors.
Margin ρ of the separator is the width of separation between support vectors of classes.
[Figure: a point x at distance r from the point x′ on the separator; margin ρ]
Linear SVM Mathematically
Assume that all data is at least distance 1 from the hyperplane, then the following two constraints follow for a training set {(xi, yi)}
wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ -1 if yi = -1
For support vectors, the inequality becomes an equality
Then, since each example's distance from the hyperplane is
$$r = y \, \frac{w^T x + b}{\|w\|}$$
The margin is:
$$\rho = \frac{2}{\|w\|}$$
Linear Support Vector Machine (SVM)
[Figure: hyperplane wT x + b = 0, with margin ρ between the parallel planes wTxa + b = 1 and wTxb + b = -1]
Extra scale constraint: $\min_{i=1,\ldots,n} |w^T x_i + b| = 1$
This implies: $w^T(x_a - x_b) = 2$, so $\rho = \|x_a - x_b\|_2 = 2 / \|w\|_2$
Linear SVMs Mathematically (cont.)
Then we can formulate the quadratic optimization problem:
Find w and b such that
$\rho = \frac{2}{\|w\|}$ is maximized; and for all {(xi, yi)}
wTxi + b ≥ 1 if yi=1; wTxi + b ≤ -1 if yi = -1
A better formulation (min ||w|| = max 1/||w||):
Find w and b such that
Φ(w) =½ wTw is minimized;
and for all {(xi ,yi)}: yi (wTxi + b) ≥ 1
Solving the Optimization Problem
Find w and b such that Φ(w) =½ wTw is minimized; and for all {(xi ,yi)}: yi (wTxi + b) ≥ 1
This is now optimizing a quadratic function subject to linear constraints
Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them
The solution involves constructing a dual problem where a Lagrange multiplier αi is associated with every constraint in the primal problem:
Find α1…αN such that Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and (1) Σαiyi = 0 (2) αi ≥ 0 for all αi
The Optimization Problem Solution
The solution has the form:
w = Σαiyixi
b = yk - wTxk for any xk such that αk ≠ 0
Each non-zero αi indicates that corresponding xi is a support vector.
Then the classifying function will have the form:
f(x) = ΣαiyixiTx + b
Notice that it relies on an inner product between the test point x and the support vectors xi – we will return to this later. Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all pairs of training points.
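To make the solution form concrete, here is a minimal sketch that recovers w and b from given multipliers and evaluates f(x) using inner products only. The two-point training set and its multipliers α = (0.5, 0.5) are a toy case worked out by hand (with α1 = α2 = α the dual objective reduces to 2α − 2α², maximized at α = 0.5); the helper names are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def recover_w_b(alphas, ys, xs):
    """w = sum_i alpha_i y_i x_i;  b = y_k - w.x_k for some k with alpha_k != 0."""
    dim = len(xs[0])
    w = [sum(a * y * x[d] for a, y, x in zip(alphas, ys, xs)) for d in range(dim)]
    k = next(i for i, a in enumerate(alphas) if a > 0)
    b = ys[k] - dot(w, xs[k])
    return w, b


def decision(x, alphas, ys, xs, b):
    """f(x) = sum_i alpha_i y_i (x_i . x) + b, touching data only via inner products."""
    return sum(a * y * dot(xi, x) for a, y, xi in zip(alphas, ys, xs)) + b


# Toy problem: one positive and one negative point on the horizontal axis.
xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]
alphas = [0.5, 0.5]

w, b = recover_w_b(alphas, ys, xs)
value = decision((2.0, 3.0), alphas, ys, xs, b)
print(w, b, value)
```

Both points are support vectors here, and the resulting margin 2/||w|| equals the distance between them.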
Soft Margin Classification
If the training set is not linearly separable, slack variables ξi can be added to allow misclassification of difficult or noisy examples.
Allow some errors
• Let some points be moved to where they belong, at a cost
Still, try to minimize training set errors, and to place hyperplane "far" from each class (large margin)
[Figure: slack variables ξi and ξj marked on two points]
Soft Margin Classification Mathematically
The old formulation:
Find w and b such that Φ(w) =½ wTw is minimized and for all {(xi ,yi)} yi (wTxi + b) ≥ 1
The new formulation incorporating slack variables:
Find w and b such that Φ(w) =½ wTw + CΣξi is minimized and for all {(xi ,yi)} yi (wTxi + b) ≥ 1- ξi and ξi ≥ 0 for all i
Parameter C can be viewed as a way to control overfitting – a regularization term
Soft Margin Classification – Solution
The dual problem for soft margin classification:
Find α1…αN such that Q(α) = Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
Neither slack variables ξi nor their Lagrange multipliers appear in the dual problem!
Again, xi with non-zero αi will be support vectors.
Solution to the dual problem is:
w = Σαiyixi
b = yk(1 - ξk) - wTxk where k = argmax_k αk
But w not needed explicitly for classification!
f(x) = ΣαiyixiTx + b
Classification with SVMs
Given a new point (x1,x2), we can score its projection onto the hyperplane normal:
• In 2 dims: score = w1x1 + w2x2 + b.
• I.e., compute score: wx + b = ΣαiyixiTx + b
Set confidence threshold t.
• Score > t: yes
• Score < -t: no
• Else: don't know
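The thresholded decision rule is short enough to write out. This is a minimal sketch, assuming w and b are already known; the function name and the example weights are illustrative.

```python
def classify_with_threshold(w, b, x, t):
    """Return 'yes', 'no' or "don't know" from the score w.x + b and threshold t."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"


# Hypothetical 2-D classifier, for illustration only.
w, b, t = (1.0, -1.0), 0.0, 0.5
print(classify_with_threshold(w, b, (2.0, 0.0), t))
print(classify_with_threshold(w, b, (0.0, 2.0), t))
print(classify_with_threshold(w, b, (1.0, 0.8), t))
```

Raising t widens the band of points for which the classifier abstains.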
Linear SVMs: Summary
The classifier is a separating hyperplane. Most “important” training points are support vectors; they define the hyperplane. Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrangian multipliers αi. Both in the dual formulation of the problem and in the solution training points appear only inside inner products:
Find α1…αN such that Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and (1) Σαiyi = 0 (2) 0 ≤ αi ≤ C for all αi
f(x) = ΣαiyixiTx + b
Non-linear SVMs
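Datasets that are linearly separable (with some noise) work out great:
[Figure: data points along the x axis]
But what are we going to do if the dataset is just too hard?
[Figure: data points along the x axis]
How about … mapping data to a higher-dimensional space:
[Figure: the data plotted with x on one axis and x² on the other]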
Non-linear SVMs: Feature spaces
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
The “Kernel Trick”
The linear classifier relies on an inner product between vectors: xiTxj
If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes: φ(xi)Tφ(xj)
Trick: can this dot product (which is just a real number) be computed simply and efficiently in terms of the original data points?
Mercer's theorem
Any continuous, symmetric, positive semi-definite kernel function K(x, y) can be expressed as a dot product in a high-dimensional space.
• There exists a function φ(x) whose range is in an inner product space of possibly high dimension, such that K(x, y) = φ(x)Tφ(y)
The kernel trick transforms any algorithm that solely depends on the dot product between two vectors.
• Wherever a dot product is used, it is replaced with the kernel function.
• Thus, a linear algorithm can easily be transformed into a non-linear algorithm.
Kernel Example
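2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiTxj)². Is K() a kernel?
Need to show that there exists some φ such that K(xi, xj) = φ(xi)Tφ(xj):
$$
\begin{aligned}
K(x_i, x_j) &= (1 + x_i^T x_j)^2 \\
&= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2} \\
&= [1,\ x_{i1}^2,\ \sqrt{2}\, x_{i1} x_{i2},\ x_{i2}^2,\ \sqrt{2}\, x_{i1},\ \sqrt{2}\, x_{i2}]^T \, [1,\ x_{j1}^2,\ \sqrt{2}\, x_{j1} x_{j2},\ x_{j2}^2,\ \sqrt{2}\, x_{j1},\ \sqrt{2}\, x_{j2}] \\
&= \varphi(x_i)^T \varphi(x_j)
\end{aligned}
$$
where $\varphi(x) = [1,\ x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2]$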
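The identity in this example is easy to check numerically. The sketch below evaluates the kernel directly in the original 2-dimensional space and again as an explicit dot product in the 6-dimensional feature space; the test vectors are arbitrary and the function names are illustrative.

```python
import math


def poly_kernel(x, z):
    """K(x, z) = (1 + x.z)^2, computed in the original 2-D space."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


a, b = (0.5, -1.5), (2.0, 3.0)       # arbitrary test vectors
lhs = poly_kernel(a, b)              # one dot product in 2-D, then a square
rhs = dot(phi(a), phi(b))            # explicit dot product in 6-D
print(lhs, rhs)
```

The kernel side never builds the 6-dimensional vectors, which is the saving the kernel trick relies on.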
Kernels
Why use kernels?
• Make non-separable problem separable.
• Map data into better representational space
Common kernels
• Linear
• Polynomial K(x,z) = (1+xTz)^d
• Radial basis function (infinite dimensional space)
Resources of SVM
• SVMlight -- a popular implementation of the SVM algorithm by Thorsten Joachims; it can be used to solve classification, regression and ranking problems.
• LIBSVM -- A Library for Support Vector Machines, Chih-Chung Chang and Chih-Jen Lin
• Weka -- a machine learning toolkit that includes an implementation of an SVM classifier; Weka can be used both interactively through a graphical interface and as a software library. (The SVM implementation is called "SMO". It can be found in the Weka Explorer GUI, under the "functions" category.)
Let’s come back to Ranking…
Ranking problem as classification
Y: relevant or irrelevant (+1, -1)
X: features on (q, d), such as
• COS Similarity
• Proximity
• PageRank
• Document Length
• Document age
• Zone contribution (e.g. TITLE, H1, H2, ...)
• …
A simple example
• We could train an SVM over binary relevance judgments
• Order documents based on their signed distance from the decision boundary.
Discussion
Drawback of simple classification view:
the ranking at the very top of the results list is exceedingly important whereas decisions of relevance of a document to a query may be much less important.
Ranking SVM
Training Set
• For each query q, we have a ranked list of documents, totally ordered by a person for relevance to the query.
Features
• Vector of features ψ(d, q) for each document/query pair
• Feature differences for two documents di and dj: Φ(di, dj, q) = ψ(di, q) − ψ(dj, q)
Classification
• If di is judged more relevant than dj, denoted di ≺ dj, then assign the vector Φ(di, dj, q) the class yijq = +1; otherwise −1.
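The pairwise construction can be sketched as follows. The function takes the feature vectors of one query's documents, ordered from most to least relevant, and emits difference vectors with ±1 labels that an ordinary binary classifier could be trained on. Emitting the mirrored (negated) copy of each pair is an assumption made here so that both classes are present; the feature values are made up.

```python
def pairwise_examples(ranked_features):
    """Turn one query's ranked documents into (difference vector, label) examples.

    ranked_features: list of per-document feature lists, most relevant first.
    """
    examples = []
    n = len(ranked_features)
    for i in range(n):
        for j in range(i + 1, n):
            # Document i is ranked above document j, so (i - j) gets label +1.
            diff = [a - b for a, b in zip(ranked_features[i], ranked_features[j])]
            examples.append((diff, +1))
            # Mirrored pair (j - i) gets label -1 (assumption: keep classes balanced).
            examples.append(([-v for v in diff], -1))
    return examples


# Hypothetical features for three documents of one query, best first.
docs = [[0.9, 3.0], [0.4, 1.0], [0.1, 2.0]]
pairs = pairwise_examples(docs)
for diff, label in pairs:
    print(label, diff)
```

Pairs are only formed within a single query; documents retrieved for different queries are not comparable.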
RankSVM for Web Search Optimization (Joachims, SIGKDD 2002)
Optimizing search engines using clickthrough data
• Absolute constraints cannot be extracted from clickthrough data; only partial orders are available.
• A click is not an absolute relevance judgment: many relevant pages are never clicked because they are ranked low by the search engine.
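One way to turn clicks into partial orders, following the strategy described in the cited Joachims paper, is to prefer each clicked result over every unclicked result that was ranked above it. The sketch below implements that rule; the function name and document ids are illustrative.

```python
def click_preferences(ranking, clicked):
    """Return (preferred, over) pairs: a clicked doc beats unclicked docs ranked above it.

    ranking: list of document ids in the order they were shown.
    clicked: collection of document ids the user clicked.
    """
    clicked = set(clicked)
    prefs = []
    for pos, doc in enumerate(ranking):
        if doc not in clicked:
            continue
        for above in ranking[:pos]:
            if above not in clicked:
                prefs.append((doc, above))
    return prefs


# Hypothetical result list and clicks, for illustration only.
shown = ["d1", "d2", "d3", "d4", "d5"]
prefs = click_preferences(shown, clicked={"d1", "d3", "d5"})
print(prefs)
```

Nothing is inferred about documents below the last click, and a click on the top result yields no constraint at all.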
Boosting and Ranking
Horse Race Prediction
How to Make $$$ In Horse Races?
Ask a professional. Suppose:
• Professional cannot give a single highly accurate rule
• But presented with a set of races, can always generate better-than-random rules
Can you get rich?
Idea
• Ask expert for rule-of-thumb
• Assemble set of cases where rule-of-thumb fails (hard cases)
• Ask expert again for selected set of hard cases
• And so on…
Combine all rules-of-thumb
Expert could be "weak" learning algorithm
Questions
How to choose races on each round?
• Concentrate on "hardest" races (those most often misclassified by previous rules of thumb)
How to combine rules of thumb into single prediction rule?
• Take (weighted) majority vote of rules of thumb
AdaBoost.M1
Given training set X={(x1,y1),…,(xm,ym)}
yi ∈ {−1,+1}: correct label of instance xi ∈ X
Initialize distribution D1(i) = 1/m (how hard the case is)
for t = 1,…,T:
• Find weak hypothesis ("rule of thumb") ht: X → {−1,+1} with small error εt on Dt
• Update distribution Dt on {1,…,m}, with αt = log((1 − εt)/εt)
  ht is correct on xi if yi * ht(xi) > 0, wrong if yi * ht(xi) < 0
  Find the hardest race according to the last round hypothesis
output final hypothesis (weighted voting for the final decision-making)
$$H(x) = \mathrm{sign}\Big(\sum_{t=1}^{T} \alpha_t h_t(x)\Big)$$
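The loop above can be sketched in a few lines. Several choices here are assumptions rather than part of the slides: the weak learner is a one-dimensional threshold stump, the distribution update multiplies the weight of each misclassified example by (1 − εt)/εt and then normalizes, and the ten-point data set is a toy example that no single stump can classify correctly.

```python
import math


def stump(threshold, polarity):
    """Weak hypothesis on 1-D input: polarity if x > threshold, else -polarity."""
    return lambda x: polarity if x > threshold else -polarity


def best_stump(xs, ys, dist):
    """Return (weighted error, hypothesis) of the stump with smallest error on dist."""
    pts = sorted(set(xs))
    thresholds = [pts[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
    best = None
    for th in thresholds:
        for pol in (+1, -1):
            h = stump(th, pol)
            err = sum(d for x, y, d in zip(xs, ys, dist) if h(x) != y)
            if best is None or err < best[0]:
                best = (err, h)
    return best


def adaboost(xs, ys, rounds):
    """Return a list of (alpha_t, h_t) pairs."""
    m = len(xs)
    dist = [1.0 / m] * m                      # D_1(i) = 1/m
    ensemble = []
    for _ in range(rounds):
        err, h = best_stump(xs, ys, dist)
        if err == 0.0:                        # a perfect weak hypothesis needs no vote
            return [(1.0, h)]
        if err >= 0.5:                        # no better than random: stop
            break
        beta = (1.0 - err) / err
        ensemble.append((math.log(beta), h))  # alpha_t = log((1 - eps_t) / eps_t)
        # Assumed update: boost misclassified examples by beta, then normalize.
        dist = [d * beta if h(x) != y else d for x, y, d in zip(xs, ys, dist)]
        z = sum(dist)
        dist = [d / z for d in dist]
    return ensemble


def predict(ensemble, x):
    """Weighted vote: sign of sum_t alpha_t h_t(x)."""
    return 1 if sum(alpha * h(x) for alpha, h in ensemble) > 0 else -1


# Toy data: the labels change sign three times along the line.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this data each individual stump misclassifies at least three points, while the weighted vote of three stumps fits all ten.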
AdaBoost: Training Error Analysis
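Suppose $f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$, so that $H(x) = \mathrm{sign}(f(x))$, and the update is $D_{t+1}(i) = D_t(i) \exp(-\alpha_t y_i h_t(x_i)) / Z_t$.
Equivalent (unrolling the update from D1(i) = 1/m):
$$D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_{t} Z_t}$$
Therefore, training error is:
$$\frac{1}{m} \big| \{ i : H(x_i) \ne y_i \} \big| \le \prod_{t=1}^{T} Z_t$$
As: $[H(x_i) \ne y_i] \le \exp(-y_i f(x_i))$, because $H(x_i) \ne y_i$ implies $y_i f(x_i) \le 0$
Considering: $\sum_i D_{T+1}(i) = 1$
Finally:
$$\frac{1}{m} \big| \{ i : H(x_i) \ne y_i \} \big| \le \frac{1}{m} \sum_i \exp(-y_i f(x_i)) = \sum_i D_{T+1}(i) \prod_{t} Z_t = \prod_{t=1}^{T} Z_t$$
{i: H(xi)≠yi} is treated as a vector whose i-th element is [H(xi) ≠ yi]; |{i: H(xi)≠yi}| is the sum of all the elements in the vector.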
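AdaBoost: How to choose αt
Minimizing the error bound can be done greedily, by minimizing Zt in each round; in this way AdaBoost minimizes the bound on the training error:
$$\alpha_t = \arg\min_{\alpha} Z_t = \arg\min_{\alpha} \sum_i D_t(i) \, e^{-\alpha \, y_i h_t(x_i)}$$
Let $u_i = y_i h_t(x_i)$. Then
$$Z_t = \sum_i D_t(i) \, e^{-\alpha u_i} = \sum_i D_t(i) \Big( \frac{1 + u_i}{2} e^{-\alpha} + \frac{1 - u_i}{2} e^{\alpha} \Big)$$
This equation is obvious if we treat ui as a binary-valued variable (ui ∈ {−1,+1}).
Let $r = \sum_i D_t(i) \, u_i = \sum_{h_t(x_i) = y_i} D_t(i) - \sum_{h_t(x_i) \ne y_i} D_t(i)$; considering $\sum_i D_t(i) = 1$, we have
$$Z_t = \frac{1 + r}{2} e^{-\alpha} + \frac{1 - r}{2} e^{\alpha}$$
The right term is minimized when dZ/dα = 0. Therefore, we choose
$$\alpha_t = \frac{1}{2} \ln \frac{1 + r}{1 - r}$$
With r = 1 − 2εt this is αt = ½ ln((1 − εt)/εt). A common positive factor on all αt does not change H(x).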
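A quick numeric check of this choice: the sketch below evaluates Zt for a small hand-made distribution and confirms that α = ½ ln((1 + r)/(1 − r)), with r = Σ Dt(i)·yi·ht(xi), is where Zt is smallest. The distribution and the ±1 agreement values are made up for illustration.

```python
import math


def z_value(alpha, dist, agreements):
    """Z_t(alpha) = sum_i D_t(i) * exp(-alpha * u_i), with u_i = y_i * h_t(x_i)."""
    return sum(d * math.exp(-alpha * u) for d, u in zip(dist, agreements))


def closed_form_alpha(dist, agreements):
    """alpha = 0.5 * ln((1 + r) / (1 - r)), r = sum_i D_t(i) * u_i (requires |r| < 1)."""
    r = sum(d * u for d, u in zip(dist, agreements))
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


# Hypothetical round: four examples, the second one is misclassified (u = -1).
dist = [0.1, 0.2, 0.3, 0.4]
agreements = [+1, -1, +1, +1]

alpha = closed_form_alpha(dist, agreements)
z_min = z_value(alpha, dist, agreements)
print(alpha, z_min)
```

Here the weighted error is 0.2, and the minimum value of Zt matches 2·sqrt(ε(1 − ε)).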
Toy Example
Round 1
Round 2
Round 3
Final Hypothesis
The Nature of Boosting
Given training set X={(x1,y1),…,(xm,ym)}
yi ∈ {−1,+1}: correct label of instance xi ∈ X
Initialize distribution D1(i) = 1/m (how hard the case is)
for t = 1,…,T:
• Find weak hypothesis ("rule of thumb") ht: X → {−1,+1} with small error εt on Dt
• Update distribution Dt on {1,…,m}: choose βt, then
$$D_{t+1}(i) = \begin{cases} D_t(i) \, \beta_t / Z_t, & y_i \ne h_t(x_i) \\ D_t(i) / Z_t, & y_i = h_t(x_i) \end{cases}$$
• Get hypothesis: choose weights $\alpha_i^t$, then
$$H_t(x) = \mathrm{sign}\Big(\sum_{i=1}^{t} \alpha_i^t \, h_i(x)\Big)$$
Note the distribution update and weighted combination are independent here
Boosting: Updating Weight
As the distribution will be normalized, only one kind of example (classified correctly or wrongly) needs to be updated.
The update strategy should avoid choosing the current weak learner again.
We expect ht to perform worst on Dt+1; we have:
$$\varepsilon_{t+1}(h_t) = \sum_{h_t(x_i) \ne y_i} D_{t+1}(i) = 0.5$$
Considering
$$\varepsilon_{t+1}(h_t) = \frac{\beta_t \sum_{h_t(x_i) \ne y_i} D_t(i)}{\beta_t \sum_{h_t(x_i) \ne y_i} D_t(i) + \sum_{h_t(x_i) = y_i} D_t(i)}, \qquad \sum_{h_t(x_i) \ne y_i} D_t(i) = \varepsilon_t$$
We got:
$$\beta_t = \frac{1 - \varepsilon_t}{\varepsilon_t}$$
This result is just the same as in AdaBoost
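The effect of this update can be checked directly: after multiplying the weights of the misclassified examples by (1 − εt)/εt and normalizing, the same weak hypothesis has weighted error exactly 0.5 on the new distribution. The distribution and the set of misclassified examples below are made up for illustration.

```python
def reweight(dist, wrong):
    """Scale misclassified examples by beta_t = (1 - eps_t) / eps_t, then normalize.

    dist:  current distribution D_t (sums to 1).
    wrong: booleans, True where h_t misclassifies the example.
    Assumes 0 < eps_t < 1.
    """
    eps = sum(d for d, w in zip(dist, wrong) if w)
    beta = (1.0 - eps) / eps
    scaled = [d * beta if w else d for d, w in zip(dist, wrong)]
    z = sum(scaled)
    return [s / z for s in scaled], beta


# Hypothetical round: h_t gets only the second example wrong.
dist = [0.1, 0.2, 0.3, 0.4]
wrong = [False, True, False, False]

new_dist, beta = reweight(dist, wrong)
new_error = sum(d for d, w in zip(new_dist, wrong) if w)
print(beta, new_dist, new_error)
```

Since ht is no better than random guessing on the new distribution, the next round is pushed toward a different weak hypothesis.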
Boosting : Meta Search
[Figure: Google, MSNSearch, Yahoo, Baidu and Sogou combined into a "Super Man" meta-search ranker Ht(x)]
Rank-Boost
RankBoost is another boosting technique based on AdaBoost.
• Suitable for ordering items according to a given criterion (ranking).
• Applied to document ranking in Information Retrieval (IR).
Usage of RankBoost
Why RankBoost?
Absolute relevance scores are not important.
Relative order is important:
• Correct order: A-B-C-D
• OK: {A: -100, B: -10, C: 0, D: 1000}
• OK: {A: 0.1, B: 0.3, C: 0.4, D: 0.41}
What is “To Learn Ranking” by Boosting?
Suppose we want to use boosting:
What is a sample point? What is the training weight of each sample? What should a hypothesis return? How can we define a loss function?
Key Idea
Training data is a graph:
Sample points are edges. Training weights are the weights of edges.
Key Idea
Hypothesis H returns a real value.
Key Idea
Loss function is # of wrong edges.
# of pairwise disagreements:
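Counting wrong edges is straightforward once the training graph is stored as a list of pairs. In this sketch each pair (lower, higher) says that the second item should receive the larger score; treating ties as wrong is an assumption, and the score tables reuse the A to D values from the earlier "Why RankBoost?" example plus one deliberately bad table.

```python
from itertools import combinations


def pairwise_disagreements(scores, pairs):
    """Count pairs (lower, higher) whose scores are not strictly increasing.

    Ties are counted as disagreements (assumption).
    """
    return sum(1 for lower, higher in pairs if scores[higher] <= scores[lower])


# Every pair implied by the order A-B-C-D, read as increasing score.
pairs = list(combinations("ABCD", 2))

ok_1 = {"A": -100, "B": -10, "C": 0, "D": 1000}
ok_2 = {"A": 0.1, "B": 0.3, "C": 0.4, "D": 0.41}
bad = {"A": 0.5, "B": 0.3, "C": 0.4, "D": 0.41}     # A is scored too high

print(pairwise_disagreements(ok_1, pairs),
      pairwise_disagreements(ok_2, pairs),
      pairwise_disagreements(bad, pairs))
```

Both "OK" tables give zero disagreements even though their absolute values differ widely, which is the point of a pairwise loss.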
Summary of Key Ideas
What is a sample point?
• Edge
What is the training weight of each sample?
• Weight of edge
What should a hypothesis return?
• Real value
How can we define a loss function?
• # of wrong edges
In the IR Scenario
Extension to Multi-grade
This is not classification!
• You can easily extend this to a multigraded criterion (e.g. movie rating).
Problem: the difference between two grades may not be consistent:
• {awful, bad, so-so, good, superb}
• The gap between "awful" and "bad" need not equal the gap between "so-so" and "good".
RankBoost Algorithm
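What follows is a compact sketch of a RankBoost-style training loop, written under several assumptions that are not taken from the slides: weak rankers are 0/1 threshold tests on a single feature, the weak ranker with the largest |r| is chosen each round (r being the weighted sum of h(x1) − h(x0) over pairs), α = ½ ln((1 + r)/(1 − r)), and pair weights are updated with exp(α(h(x0) − h(x1))). The items, features and function names are made up for illustration.

```python
import math


def rankboost(features, pairs, rounds):
    """Return a list of (alpha, feature_index, threshold) weak rankers.

    features: dict mapping item -> list of feature values.
    pairs:    list of (x0, x1), meaning x1 should be ranked above x0.
    """
    dist = {p: 1.0 / len(pairs) for p in pairs}          # weight of each edge
    n_feat = len(next(iter(features.values())))
    candidates = [(k, th) for k in range(n_feat)
                  for th in sorted({f[k] for f in features.values()})]

    def h(x, k, th):
        # Weak ranker: 1 if feature k exceeds the threshold, else 0.
        return 1.0 if features[x][k] > th else 0.0

    model = []
    for _ in range(rounds):
        best_r, best = 0.0, None
        for k, th in candidates:
            r = sum(d * (h(x1, k, th) - h(x0, k, th)) for (x0, x1), d in dist.items())
            if abs(r) > abs(best_r):
                best_r, best = r, (k, th)
        if best is None:
            break                                        # no weak ranker splits any pair
        r = max(min(best_r, 1.0 - 1e-9), -1.0 + 1e-9)    # keep alpha finite
        alpha = 0.5 * math.log((1.0 + r) / (1.0 - r))
        k, th = best
        model.append((alpha, k, th))
        # Pairs that alpha * h orders correctly lose weight; the others gain weight.
        dist = {(x0, x1): d * math.exp(alpha * (h(x0, k, th) - h(x1, k, th)))
                for (x0, x1), d in dist.items()}
        z = sum(dist.values())
        dist = {p: d / z for p, d in dist.items()}
    return model


def score(model, feats):
    """Final ranking function H(x) = sum_t alpha_t h_t(x)."""
    return sum(a * (1.0 if feats[k] > th else 0.0) for a, k, th in model)


# Hypothetical items with two features; the desired order is a < b < c < d.
features = {"a": [0.1, 3.0], "b": [0.4, 1.0], "c": [0.5, 2.0], "d": [0.9, 0.0]}
pairs = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]

model = rankboost(features, pairs, rounds=2)
ranked = sorted(features, key=lambda x: score(model, features[x]))
print(ranked)
```

A single threshold can only split the items into two groups, so at least two weak rankers are needed to order four items; a negative α lets a feature that runs against the desired order still contribute.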
Comparison
AdaBoost vs. RankBoost
Compared on: training data, regression goal, function, loss function, training
Training data
• AdaBoost: Data: xi; Label: yi ∈ {+1,−1}
• RankBoost: Data pair: (xi, xj); Partial order on (xi, xj)
Loss function
• AdaBoost: # of wrong labels (weighted)
• RankBoost: # of wrong pairs (weighted)
Training
• Both: additive and greedy; search for the best weak classifiers/rankers iteratively
Advantages
Ranking function
• Nonlinear and non-continuous ranking function
• Computationally efficient
• Feature combination vs. kernel tricks (quadratic)
Training
• Easy to parallelize: distributed training
• Possible to leverage large-scale training data
• The only limit is the number of computers
More on Learning Ranking Functions
RankNet
• Ranking function learning by neural network
• Chris Burges, et al., "Learning to Rank using Gradient Descent", ICML 2005
• Ordinal regression is more suitable for the ranking problem than classification
• R. Nallapati, "Discriminative models for information retrieval", SIGIR 2004
Logistic Regression/MaxEntropy Framework
It is only very recently that sufficient machine learning knowledge, training document collections, and computational power have come together to make this method practical and exciting.
Summary
How to learn?
[Diagram: Training Set, Search Strategy, Loss Function, Linear Classifier, Kernel Function, Partial Order Classification, AdaBoost, RankSVM, RankBoost]
Readings
[1] IIR Ch. 15.1, 15.2, 15.4
[2] R. Nallapati, "Discriminative models for information retrieval," in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Sheffield, United Kingdom: ACM, 2004.
Resources
• LETOR -- benchmark datasets for learning to rank (test collection), released by Microsoft Research Asia
• SoGouLab -- open datasets for web search research, including query log, web test collection, etc., released by SoGou Search
• SVMlight -- a popular implementation of the SVM algorithm by Thorsten Joachims; it can be used to solve classification, regression and ranking problems
• LIBSVM -- A Library for Support Vector Machines, Chih-Chung Chang and Chih-Jen Lin
References
[1] T. Joachims, "Optimizing search engines using clickthrough data," in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada: ACM, 2002.
[2] R. E. Schapire, "A Brief Introduction to Boosting," in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence: Morgan Kaufmann Publishers Inc., 1999.
[3] R. Nallapati, "Discriminative models for information retrieval," in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Sheffield, United Kingdom: ACM, 2004.
Thank You!
Q&A