Introduction to Gaussian Process Regression
Hanna M. Wallach
hmw26@cam.ac.uk
January 25, 2005
Introduction
Outline
Regression: weight-space view
Regression: function-space view (Gaussian processes)
Weight-space and function-space correspondence
Making predictions
Model selection: hyperparameters
Supervised Learning: Regression (1)
[Figure: underlying function and noisy training data; output f(x) against input x.]
Assume an underlying process which generates “clean” data.
Goal: recover underlying process from noisy observed data.
Supervised Learning: Regression (2)
Training data are $D = \{x^{(i)}, y^{(i)} \mid i = 1, \ldots, n\}$.
Each input is a vector $x$ of dimension $d$.
Each target is a real-valued scalar $y = f(x) + \text{noise}$.
Collect inputs in a $d \times n$ matrix, $X$, and targets in a vector, $y$:
$D = \{X, y\}$.
Wish to infer $f_*$ for an unseen input $x_*$, using $P(f_* \mid x_*, D)$.
Gaussian Process Models: Inference in Function Space
[Figure: two panels, "samples from the prior" and "samples from the posterior"; output f(x) against input x.]
A Gaussian process defines a distribution over functions.
Inference takes place directly in function space.
Part I
Regression: The Weight-Space View
Bayesian Linear Regression (1)
[Figure: training data; output f(x) against input x.]
Assuming noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, the linear regression model is:
$f(x \mid w) = x^\top w, \quad y = f + \varepsilon.$
Bayesian Linear Regression (2)
Likelihood of parameters is:
$P(y \mid X, w) = \mathcal{N}(X^\top w, \sigma^2 I).$
Assume a Gaussian prior over parameters:
$P(w) = \mathcal{N}(0, \Sigma_p).$
Apply Bayes’ theorem to obtain posterior:
$P(w \mid y, X) \propto P(y \mid X, w)\, P(w).$
Bayesian Linear Regression (3)
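Posterior distribution over $w$ is:
$P(w \mid y, X) = \mathcal{N}\left(\frac{1}{\sigma^2} A^{-1} X y,\; A^{-1}\right)$ where $A = \Sigma_p^{-1} + \frac{1}{\sigma^2} X X^\top$.
Predictive distribution is:
$P(f_* \mid x_*, X, y) = \int P(f_* \mid x_*, w)\, P(w \mid X, y)\, dw$
$= \mathcal{N}\left(\frac{1}{\sigma^2} x_*^\top A^{-1} X y,\; x_*^\top A^{-1} x_*\right).$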
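The posterior and predictive formulas above can be sketched in NumPy as follows. The synthetic data, the "true" weights `w_true`, the identity prior covariance, and the test input `x_star` are all assumptions made for illustration; only the formulas come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): d = 2 inputs, n = 50 points.
d, n = 2, 50
sigma = 0.1                                  # noise standard deviation
w_true = np.array([0.5, -1.0])               # hypothetical "true" weights
X = rng.uniform(-1.0, 1.0, size=(d, n))      # d x n, as in the slides
y = X.T @ w_true + sigma * rng.normal(size=n)

Sigma_p = np.eye(d)                          # assumed prior covariance over w

# A = Sigma_p^{-1} + sigma^{-2} X X^T
A = np.linalg.inv(Sigma_p) + (X @ X.T) / sigma**2

# Posterior over w: N(sigma^{-2} A^{-1} X y, A^{-1})
w_mean = np.linalg.solve(A, X @ y) / sigma**2
w_cov = np.linalg.inv(A)

# Predictive distribution for f_* at a test input x_*
x_star = np.array([0.3, -0.2])
f_mean = x_star @ w_mean                     # sigma^{-2} x_*^T A^{-1} X y
f_var = x_star @ w_cov @ x_star              # x_*^T A^{-1} x_*
```

With this much data and little noise, `w_mean` should land close to `w_true`, and `f_var` shrinks as more training points are added.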
Increasing Expressiveness
Use a set of basis functions $\Phi(x)$ to project a $d$-dimensional
input $x$ into an $m$-dimensional feature space:
e.g. $\Phi(x) = (1, x, x^2, \ldots)$
$P(f_* \mid x_*, X, y)$ can be expressed in terms of inner products in
feature space.
Can now use the kernel trick.
How many basis functions should we use?
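To make the inner-product claim concrete, the sketch below computes the predictive mean and variance twice: once in weight space, and once using only the quantities $\Phi^\top \Sigma_p \Phi$, $\Phi^\top \Sigma_p \Phi(x_*)$ and $\Phi(x_*)^\top \Sigma_p \Phi(x_*)$. The polynomial basis `phi`, the sine target, and the identity prior are assumptions chosen for illustration.

```python
import numpy as np

def phi(x):
    """Polynomial basis (1, x, x^2) for scalar inputs; returns an m x n matrix."""
    x = np.atleast_1d(x)
    return np.vstack([np.ones_like(x), x, x**2])

rng = np.random.default_rng(1)
n, sigma = 20, 0.1
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(3.0 * x) + sigma * rng.normal(size=n)   # made-up target function

Phi = phi(x)                    # m x n feature matrix
m = Phi.shape[0]
Sigma_p = np.eye(m)             # assumed prior covariance over weights
phi_star = phi(0.25)[:, 0]      # features of one test input

# Weight-space form: work with the m x m matrix A.
A = np.linalg.inv(Sigma_p) + (Phi @ Phi.T) / sigma**2
mean_w = phi_star @ np.linalg.solve(A, Phi @ y) / sigma**2
var_w = phi_star @ np.linalg.solve(A, phi_star)

# Inner-product form: features only appear as phi^T Sigma_p phi'.
K = Phi.T @ Sigma_p @ Phi                   # n x n
k_star = Phi.T @ Sigma_p @ phi_star         # n
K_y = K + sigma**2 * np.eye(n)
mean_k = k_star @ np.linalg.solve(K_y, y)
var_k = phi_star @ Sigma_p @ phi_star - k_star @ np.linalg.solve(K_y, k_star)
```

The two forms agree up to rounding error. The second never needs the features themselves, only their inner products, which is what allows a kernel function to replace them.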
Part II
Regression: The Function-Space View
Gaussian Processes: Definition
A Gaussian process is a collection of random variables, any
finite number of which have a joint Gaussian distribution.
Consistency:
If the GP specifies $(y^{(1)}, y^{(2)}) \sim \mathcal{N}(\mu, \Sigma)$, then it must also
specify $y^{(1)} \sim \mathcal{N}(\mu_1, \Sigma_{11})$.
A GP is completely specified by a mean function and a
positive definite covariance function.
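Written out with the mean and covariance partitioned into blocks, the consistency requirement is the usual marginalization property of a joint Gaussian. The block symbols $\mu_2$, $\Sigma_{12}$, $\Sigma_{21}$ and $\Sigma_{22}$ are notation introduced here for the partition:

```latex
\begin{pmatrix} y^{(1)} \\ y^{(2)} \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}
\right)
\;\Longrightarrow\;
y^{(1)} \sim \mathcal{N}(\mu_1, \Sigma_{11})
```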
Gaussian Processes: A Distribution over Functions
e.g. Choose mean function zero, and covariance function:
$K_{p,q} = \mathrm{Cov}(f(x^{(p)}), f(x^{(q)})) = K(x^{(p)}, x^{(q)})$
For any set of inputs $x^{(1)}, \ldots, x^{(n)}$ we may compute $K$, which
defines a joint distribution over function values:
$f(x^{(1)}), \ldots, f(x^{(n)}) \sim \mathcal{N}(0, K).$
Therefore a GP specifies a distribution over functions.
Gaussian Processes: Simple Example
Can obtain a GP from the Bayesian linear regression model:
$f(x) = x^\top w$ with $w \sim \mathcal{N}(0, \Sigma_p)$.
Mean function is given by:
$E[f(x)] = x^\top E[w] = 0.$
Covariance function is given by:
$E[f(x) f(x')] = x^\top E[w w^\top] x' = x^\top \Sigma_p x'.$
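A quick Monte Carlo check of these two expressions, as a sketch: draw many weight vectors from the prior, evaluate $f$ at two inputs, and compare the sample moments with the analytic ones. The particular `Sigma_p`, `x1` and `x2` are arbitrary values assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

Sigma_p = np.array([[1.0, 0.3],
                    [0.3, 0.5]])     # assumed prior covariance over w
x1 = np.array([1.0, 0.5])            # plays the role of x
x2 = np.array([-0.3, 2.0])           # plays the role of x'

# Draw many weight vectors w ~ N(0, Sigma_p) and evaluate f(x) = x^T w.
W = rng.multivariate_normal(np.zeros(2), Sigma_p, size=200_000)
f1 = W @ x1
f2 = W @ x2

empirical_mean = f1.mean()           # should be near 0
empirical_cov = np.mean(f1 * f2)     # mean is zero, so E[f f'] is the covariance
analytic_cov = x1 @ Sigma_p @ x2     # x^T Sigma_p x'
```

The sample estimates approach the analytic values as the number of draws grows.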
Weight-Space and Function-Space Correspondence
For any set of $m$ basis functions, $\Phi(x)$, the corresponding
covariance function is:
$K(x^{(p)}, x^{(q)}) = \Phi(x^{(p)})^\top \Sigma_p \Phi(x^{(q)}).$
Conversely, for every covariance function $K$, there is a possibly
infinite expansion in terms of basis functions:
$K(x^{(p)}, x^{(q)}) = \sum_{i=1}^{\infty} \lambda_i \Phi_i(x^{(p)}) \Phi_i(x^{(q)}).$
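The first direction can be illustrated on a finite set of inputs: build the covariance matrix from a basis, then observe that it has rank at most $m$. The eigendecomposition at the end is only a finite-sample analogue of the expansion above (eigenvectors of a matrix rather than eigenfunctions of the kernel). The basis `phi`, the inputs, and `Sigma_p` are assumptions for illustration.

```python
import numpy as np

def phi(x):
    """Hypothetical basis (1, x, x^2) for scalar inputs; returns m x n."""
    x = np.atleast_1d(x)
    return np.vstack([np.ones_like(x), x, x**2])

x = np.linspace(-1.0, 1.0, 8)
Phi = phi(x)                          # m x n with m = 3
Sigma_p = np.diag([1.0, 0.5, 0.25])   # assumed prior covariance

# K[p, q] = phi(x_p)^T Sigma_p phi(x_q)
K = Phi.T @ Sigma_p @ Phi

# K is symmetric positive semi-definite with rank at most m.
rank = np.linalg.matrix_rank(K)

# Finite-sample analogue of the expansion: K = sum_i lambda_i u_i u_i^T.
lam, U = np.linalg.eigh(K)
K_rebuilt = (U * lam) @ U.T
```

A covariance function such as the squared exponential generally yields a full-rank matrix on distinct inputs, which is consistent with it needing infinitely many basis functions.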
The Covariance Function
Specifies the covariance between pairs of random variables.
e.g. Squared exponential covariance function:
$\mathrm{Cov}(f(x^{(p)}), f(x^{(q)})) = K(x^{(p)}, x^{(q)}) = \exp\left(-\frac{1}{2} |x^{(p)} - x^{(q)}|^2\right).$
[Figure: $K(x^{(p)} = 5, x^{(q)})$ as a function of $x^{(q)}$, for $x^{(q)}$ from 0 to 10.]
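A minimal NumPy sketch of the squared exponential covariance, evaluated on the slice shown in the figure ($x^{(p)} = 5$, $x^{(q)}$ from 0 to 10). The function name `sq_exp_kernel` and the d x n input layout are choices made here, following the slides' convention for $X$.

```python
import numpy as np

def sq_exp_kernel(Xp, Xq):
    """K[p, q] = exp(-0.5 * |x_p - x_q|^2); inputs are d x n and d x m arrays."""
    sq_dist = (
        np.sum(Xp**2, axis=0)[:, None]
        + np.sum(Xq**2, axis=0)[None, :]
        - 2.0 * Xp.T @ Xq
    )
    # Clip tiny negative values caused by rounding before exponentiating.
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0))

# The plotted slice: x_p fixed at 5, x_q swept over [0, 10].
x_p = np.array([[5.0]])
x_q = np.linspace(0.0, 10.0, 101)[None, :]
k_slice = sq_exp_kernel(x_p, x_q)[0]
```

The covariance is 1 when the two inputs coincide and decays smoothly towards 0 as they move apart.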
Gaussian Process Prior
Given a set of inputs $x^{(1)}, \ldots, x^{(n)}$ we may draw samples
$f(x^{(1)}), \ldots, f(x^{(n)})$ from the GP prior:
$f(x^{(1)}), \ldots, f(x^{(n)}) \sim \mathcal{N}(0, K).$
Four samples:
[Figure: samples from the prior; output f(x) against input x.]
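Drawing such samples can be sketched as follows: compute $K$ on a grid of inputs, take a Cholesky factor $L$ with $L L^\top = K$, and multiply it by standard normal draws. The grid, the seed, and the small diagonal "jitter" are assumptions; the jitter is a common numerical device and not part of the model in the slides.

```python
import numpy as np

def sq_exp_kernel(a, b):
    """Squared exponential covariance for 1-D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 50)
K = sq_exp_kernel(x, x)

# A small jitter on the diagonal keeps the Cholesky factorization stable,
# because K is numerically close to singular for closely spaced inputs.
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))

# Four draws from N(0, K): each column of `samples` is one sampled function.
samples = L @ rng.normal(size=(len(x), 4))
```

Plotting each column of `samples` against `x` gives curves of the kind shown in the figure.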
Posterior: Noise-Free Observations (1)
Given noise-free training data:
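$D = \{x^{(i)}, f^{(i)} \mid i = 1, \ldots, n\} = \{X, f\}.$
Want to make predictions $f_*$ at test points $X_*$.
According to GP prior, joint distribution of $f$ and $f_*$ is:
$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right).$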
Posterior: Noise-Free Observations (2)
Condition $\{X_*, f_*\}$ on $D = \{X, f\}$ to obtain the posterior.
Restrict prior to contain only functions which agree with $D$.
The posterior, $P(f_* \mid X_*, X, f)$, is Gaussian with:
$\mu = K(X_*, X) K(X, X)^{-1} f$, and
$\Sigma = K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*).$
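The posterior mean and covariance above translate directly into NumPy. In this sketch the four training inputs, the sine function generating the noise-free values, and the test grid are assumptions for illustration.

```python
import numpy as np

def sq_exp_kernel(a, b):
    """Squared exponential covariance for 1-D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

# Hypothetical noise-free observations of f(x) = sin(2 pi x).
X = np.array([0.1, 0.4, 0.7, 0.9])
f = np.sin(2.0 * np.pi * X)
X_star = np.linspace(0.0, 1.0, 21)    # test grid; it contains the training inputs

K = sq_exp_kernel(X, X)               # K(X, X)
K_s = sq_exp_kernel(X_star, X)        # K(X_*, X)
K_ss = sq_exp_kernel(X_star, X_star)  # K(X_*, X_*)

# mu = K(X_*, X) K(X, X)^{-1} f
mu = K_s @ np.linalg.solve(K, f)
# Sigma = K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*)
Sigma = K_ss - K_s @ np.linalg.solve(K, K_s.T)
```

At the training inputs the posterior mean reproduces `f` and the posterior variance collapses to zero (up to rounding), while it stays positive away from the data.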
Posterior: Noise-Free Observations (3)
[Figure: samples from the posterior; output f(x) against input x.]
Samples all agree with the observations D = {X , f}.
Greatest variance is in regions with few training points.
Prediction: Noisy Observations
Typically we have noisy observations:
$D = \{X, y\}$, where $y = f + \varepsilon$.
Assume additive noise $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$.
Conditioning on $D = \{X, y\}$ gives a Gaussian with:
$\mu = K(X_*, X) [K(X, X) + \sigma^2 I]^{-1} y$, and
$\Sigma = K(X_*, X_*) - K(X_*, X) [K(X, X) + \sigma^2 I]^{-1} K(X, X_*).$
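A sketch of the noisy-observation predictor. Rather than forming $[K(X, X) + \sigma^2 I]^{-1}$ explicitly, it solves against a Cholesky factor, which is a common implementation choice and not something the slides prescribe. The data, noise level, and test grid are assumed for illustration.

```python
import numpy as np

def sq_exp_kernel(a, b):
    """Squared exponential covariance for 1-D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(4)
sigma = 0.1
X = rng.uniform(0.0, 1.0, size=15)
y = np.sin(2.0 * np.pi * X) + sigma * rng.normal(size=X.size)  # assumed data
X_star = np.linspace(0.0, 1.0, 25)

K_y = sq_exp_kernel(X, X) + sigma**2 * np.eye(X.size)   # K(X, X) + sigma^2 I
K_s = sq_exp_kernel(X_star, X)                          # K(X_*, X)
K_ss = sq_exp_kernel(X_star, X_star)                    # K(X_*, X_*)

# Solve with a Cholesky factor instead of forming the inverse explicitly.
L = np.linalg.cholesky(K_y)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # [K + sigma^2 I]^{-1} y
V = np.linalg.solve(L, K_s.T)                           # L^{-1} K(X, X_*)

mu = K_s @ alpha
Sigma = K_ss - V.T @ V
```

Here `Sigma` is the covariance of the latent function values $f_*$; predicting a new noisy target would add $\sigma^2$ to its diagonal.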
Model Selection: Hyperparameters
e.g. the ARD covariance function:
$k(x^{(p)}, x^{(q)}) = \exp\left(-\frac{1}{2\theta^2} (x^{(p)} - x^{(q)})^2\right).$
How best to choose θ?
[Figure: three panels showing samples from the posterior for θ = 0.1, θ = 0.3, and θ = 0.5; output f(x) against input x.]
Model Selection: Optimizing Marginal Likelihood (1)
In the absence of a strong prior $P(\theta)$, the posterior for
hyperparameter $\theta$ is proportional to the marginal likelihood:
$P(\theta \mid X, y) \propto P(y \mid X, \theta)$
Choose $\theta$ to optimize the marginal log-likelihood:
$\log P(y \mid X, \theta) = -\frac{1}{2} \log |K(X, X) + \sigma^2 I| - \frac{1}{2} y^\top (K(X, X) + \sigma^2 I)^{-1} y - \frac{n}{2} \log 2\pi.$
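The marginal log-likelihood can be evaluated term by term as in the sketch below, here with a coarse grid search over $\theta$ standing in for a proper optimizer. The data, the fixed noise level `sigma`, and the grid are assumptions for illustration, so the resulting `theta_ml` is not expected to match the value reported in the slides.

```python
import numpy as np

def kernel(a, b, theta):
    """k(x, x') = exp(-(x - x')^2 / (2 theta^2)) for 1-D input arrays."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / theta) ** 2)

def log_marginal_likelihood(X, y, theta, sigma):
    """log P(y | X, theta) for a zero-mean GP with Gaussian noise."""
    n = X.size
    K_y = kernel(X, X, theta) + sigma**2 * np.eye(n)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K_y^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))            # log |K_y|
    return -0.5 * log_det - 0.5 * y @ alpha - 0.5 * n * np.log(2.0 * np.pi)

rng = np.random.default_rng(5)
sigma = 0.1
X = rng.uniform(0.0, 1.0, size=30)
y = np.sin(2.0 * np.pi * X) + sigma * rng.normal(size=X.size)  # assumed data

# Coarse grid search; a gradient-based optimizer would be the usual choice.
thetas = np.linspace(0.05, 1.0, 96)
lml = np.array([log_marginal_likelihood(X, y, t, sigma) for t in thetas])
theta_ml = thetas[np.argmax(lml)]
```

The first term penalizes model complexity, the second measures data fit, and the third is a normalizing constant, so maximizing over $\theta$ trades the first two off against each other.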
Model Selection: Optimizing Marginal Likelihood (2)
$\theta_{ML} = 0.3255$:
[Figure: samples from the posterior, θ = 0.3255; output f(x) against input x.]
Using $\theta_{ML}$ is an approximation to the true Bayesian method of
integrating over all $\theta$ values weighted by their posterior.