# Introduction to Gaussian Process Regression

Hanna M. Wallach
hmw26@cam.ac.uk

January 25, 2005

## Outline

- Regression: weight-space view
- Regression: function-space view (Gaussian processes)
- Weight-space and function-space correspondence
- Making predictions
- Model selection: hyperparameters

## Supervised Learning: Regression (1)

[Figure: underlying function and noisy training data; output f(x) vs. input x]

- Assume an underlying process which generates "clean" data.
- Goal: recover the underlying process from noisy observed data.

## Supervised Learning: Regression (2)

- Training data are $\mathcal{D} = \{\mathbf{x}^{(i)}, y^{(i)} \mid i = 1, \dots, n\}$.
- Each input is a vector $\mathbf{x}$ of dimension $d$.
- Each target is a real-valued scalar $y = f(\mathbf{x}) + \text{noise}$.
- Collect the inputs in a $d \times n$ matrix $X$ and the targets in a vector $\mathbf{y}$, so that $\mathcal{D} = \{X, \mathbf{y}\}$.
- We wish to infer $f^*$ for an unseen input $\mathbf{x}^*$, using $P(f^* \mid \mathbf{x}^*, \mathcal{D})$.

## Gaussian Process Models: Inference in Function Space

[Figure: samples from the prior (left) and samples from the posterior (right); output f(x) vs. input x]

- A Gaussian process defines a distribution over functions.
- Inference takes place directly in function space.

## Part I: Regression, the Weight-Space View

## Bayesian Linear Regression (1)

[Figure: noisy training data; output f(x) vs. input x]

Assuming noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$, the linear regression model is:

$$f(\mathbf{x} \mid \mathbf{w}) = \mathbf{x}^\top \mathbf{w}, \qquad y = f + \epsilon.$$

## Bayesian Linear Regression (2)

The likelihood of the parameters is:

$$P(\mathbf{y} \mid X, \mathbf{w}) = \mathcal{N}(X^\top \mathbf{w}, \sigma^2 I).$$

Assume a Gaussian prior over the parameters:

$$P(\mathbf{w}) = \mathcal{N}(0, \Sigma_p).$$

Apply Bayes' theorem to obtain the posterior:

$$P(\mathbf{w} \mid \mathbf{y}, X) \propto P(\mathbf{y} \mid X, \mathbf{w})\, P(\mathbf{w}).$$

## Bayesian Linear Regression (3)

The posterior distribution over $\mathbf{w}$ is:

$$P(\mathbf{w} \mid \mathbf{y}, X) = \mathcal{N}\!\left(\frac{1}{\sigma^2} A^{-1} X \mathbf{y},\; A^{-1}\right), \quad \text{where } A = \Sigma_p^{-1} + \frac{1}{\sigma^2} X X^\top.$$

The predictive distribution is:

$$P(f^* \mid \mathbf{x}^*, X, \mathbf{y}) = \int f(\mathbf{x}^* \mid \mathbf{w})\, P(\mathbf{w} \mid X, \mathbf{y})\, d\mathbf{w} = \mathcal{N}\!\left(\frac{1}{\sigma^2} \mathbf{x}^{*\top} A^{-1} X \mathbf{y},\; \mathbf{x}^{*\top} A^{-1} \mathbf{x}^*\right).$$
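
As a concrete check of these formulas, here is a minimal NumPy sketch (not from the slides; the function name and toy data are illustrative) that computes the posterior over $\mathbf{w}$ and the predictive mean at a test input:

```python
import numpy as np

def blr_posterior(X, y, sigma2, Sigma_p):
    """Posterior N(w_mean, A_inv) over weights for f(x) = x^T w."""
    # A = Sigma_p^{-1} + (1/sigma^2) X X^T, with X a d x n matrix of inputs
    A = np.linalg.inv(Sigma_p) + (X @ X.T) / sigma2
    A_inv = np.linalg.inv(A)
    w_mean = (A_inv @ X @ y) / sigma2   # (1/sigma^2) A^{-1} X y
    return w_mean, A_inv

# toy data: d = 1, true weight 2.0, small observation noise
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(1, 50))          # d x n input matrix
y = 2.0 * X[0] + 0.1 * rng.standard_normal(50)    # y = x^T w + noise

w_mean, w_cov = blr_posterior(X, y, sigma2=0.1**2, Sigma_p=np.eye(1))

# predictive mean at x* = 0.5 is x*^T times the posterior mean
f_star_mean = np.array([0.5]) @ w_mean
```

With 50 observations the posterior mean should land close to the true weight, and the posterior covariance shrinks as data accumulate.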

## Increasing Expressiveness

- Use a set of basis functions $\Phi(\mathbf{x})$ to project a $d$-dimensional input $\mathbf{x}$ into an $m$-dimensional feature space, e.g. $\Phi(x) = (1, x, x^2, \dots)$.
- $P(f^* \mid \mathbf{x}^*, X, \mathbf{y})$ can be expressed in terms of inner products in feature space, so we can now use the kernel trick.
- How many basis functions should we use?

## Part II: Regression, the Function-Space View

## Gaussian Processes: Definition

- A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
- Consistency: if the GP specifies $y^{(1)}, y^{(2)} \sim \mathcal{N}(\mu, \Sigma)$, then it must also specify $y^{(1)} \sim \mathcal{N}(\mu_1, \Sigma_{11})$.
- A GP is completely specified by a mean function and a positive definite covariance function.
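
The consistency property can be sanity-checked numerically: sampling from a joint Gaussian and keeping only the first coordinate should reproduce the marginal $\mathcal{N}(\mu_1, \Sigma_{11})$. A small sketch (the particular mean and covariance values are chosen arbitrarily for illustration):

```python
import numpy as np

# joint distribution over (y1, y2)
mu = np.array([0.5, -1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=200_000)

# empirical marginal of y1 should match N(mu_1, Sigma_11) = N(0.5, 1.0)
m1, v1 = samples[:, 0].mean(), samples[:, 0].var()
```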

## Gaussian Processes: A Distribution over Functions

e.g. choose the mean function to be zero, and the covariance function:

$$K_{p,q} = \operatorname{Cov}(f(\mathbf{x}^{(p)}), f(\mathbf{x}^{(q)})) = K(\mathbf{x}^{(p)}, \mathbf{x}^{(q)}).$$

For any set of inputs $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}$ we may compute $K$, which defines a joint distribution over function values:

$$f(\mathbf{x}^{(1)}), \dots, f(\mathbf{x}^{(n)}) \sim \mathcal{N}(0, K).$$

Therefore a GP specifies a distribution over functions.

## Gaussian Processes: Simple Example

We can obtain a GP from the Bayesian linear regression model:

$$f(\mathbf{x}) = \mathbf{x}^\top \mathbf{w} \quad \text{with} \quad \mathbf{w} \sim \mathcal{N}(0, \Sigma_p).$$

The mean function is given by:

$$\mathbb{E}[f(\mathbf{x})] = \mathbf{x}^\top \mathbb{E}[\mathbf{w}] = 0.$$

The covariance function is given by:

$$\mathbb{E}[f(\mathbf{x}) f(\mathbf{x}')] = \mathbf{x}^\top \mathbb{E}[\mathbf{w} \mathbf{w}^\top]\, \mathbf{x}' = \mathbf{x}^\top \Sigma_p\, \mathbf{x}'.$$

## Weight-Space and Function-Space Correspondence

For any set of $m$ basis functions $\Phi(\mathbf{x})$, the corresponding covariance function is:

$$K(\mathbf{x}^{(p)}, \mathbf{x}^{(q)}) = \Phi(\mathbf{x}^{(p)})^\top \Sigma_p\, \Phi(\mathbf{x}^{(q)}).$$

Conversely, for every covariance function $k$, there is a possibly infinite expansion in terms of basis functions:

$$K(\mathbf{x}^{(p)}, \mathbf{x}^{(q)}) = \sum_{i=1}^{\infty} \lambda_i\, \Phi_i(\mathbf{x}^{(p)})\, \Phi_i(\mathbf{x}^{(q)}).$$

## The Covariance Function

Specifies the covariance between pairs of random variables, e.g. the squared exponential covariance function:

$$\operatorname{Cov}(f(\mathbf{x}^{(p)}), f(\mathbf{x}^{(q)})) = K(\mathbf{x}^{(p)}, \mathbf{x}^{(q)}) = \exp\!\left(-\tfrac{1}{2}\, |\mathbf{x}^{(p)} - \mathbf{x}^{(q)}|^2\right).$$
[Figure: $K(x^{(p)} = 5, x^{(q)})$ as a function of $x^{(q)}$: equal to 1 at $x^{(q)} = 5$, decaying toward 0 away from it.]
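
A direct NumPy implementation of this covariance function (a hypothetical helper, not code from the slides) builds $K$ from pairwise squared distances:

```python
import numpy as np

def sq_exp_kernel(A, B):
    """K[p, q] = exp(-0.5 * |a_p - b_q|^2) for rows a_p of A and b_q of B."""
    # pairwise squared Euclidean distances, shape (len(A), len(B))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2)

# reproduce the figure's setup: K(x^(p) = 5, x^(q)) over a grid of x^(q)
x_p = np.array([[5.0]])
grid = np.linspace(0.0, 10.0, 11)[:, None]
k_row = sq_exp_kernel(x_p, grid)[0]   # peaks at 1 where x^(q) = 5
```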

## Gaussian Process Prior

Given a set of inputs $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}$ we may draw samples $f(\mathbf{x}^{(1)}), \dots, f(\mathbf{x}^{(n)})$ from the GP prior:

$$f(\mathbf{x}^{(1)}), \dots, f(\mathbf{x}^{(n)}) \sim \mathcal{N}(0, K).$$

Four samples:
[Figure: four samples from the GP prior; output f(x) vs. input x]
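
Samples like these can be drawn via a Cholesky factor of $K$: if $K = L L^\top$ and $\mathbf{z} \sim \mathcal{N}(0, I)$, then $L\mathbf{z} \sim \mathcal{N}(0, K)$. A minimal sketch (the squared exponential kernel and the jitter term are illustrative assumptions, not slide code):

```python
import numpy as np

def sq_exp_kernel(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2)

xs = np.linspace(0.0, 1.0, 50)[:, None]
# small jitter on the diagonal keeps the Cholesky factorization stable
K = sq_exp_kernel(xs, xs) + 1e-6 * np.eye(len(xs))

L = np.linalg.cholesky(K)
rng = np.random.default_rng(1)
samples = L @ rng.standard_normal((len(xs), 4))   # four prior samples, one per column
```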

## Posterior: Noise-Free Observations (1)

Given noise-free training data:

$$\mathcal{D} = \{\mathbf{x}^{(i)}, f^{(i)} \mid i = 1, \dots, n\} = \{X, \mathbf{f}\}.$$

We want to make predictions $\mathbf{f}^*$ at test points $X^*$. According to the GP prior, the joint distribution of $\mathbf{f}$ and $\mathbf{f}^*$ is:

$$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}^* \end{bmatrix} \sim \mathcal{N}\!\left(0,\; \begin{bmatrix} K(X, X) & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{bmatrix}\right).$$

## Posterior: Noise-Free Observations (2)

Conditioning $\{X^*, \mathbf{f}^*\}$ on $\mathcal{D} = \{X, \mathbf{f}\}$ gives the posterior; this restricts the prior to contain only functions which agree with $\mathcal{D}$. The posterior, $P(\mathbf{f}^* \mid X^*, X, \mathbf{f})$, is Gaussian with:

$$\mu = K(X^*, X)\, K(X, X)^{-1}\, \mathbf{f}, \quad \text{and}$$

$$\Sigma = K(X^*, X^*) - K(X^*, X)\, K(X, X)^{-1}\, K(X, X^*).$$
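
Using the squared exponential covariance from earlier, these two formulas are a few lines of NumPy (a sketch; a small jitter term is added for numerical stability, and the names are illustrative):

```python
import numpy as np

def sq_exp_kernel(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2)

def gp_posterior_noise_free(X, f, X_star, jitter=1e-10):
    K = sq_exp_kernel(X, X) + jitter * np.eye(len(X))
    K_s = sq_exp_kernel(X_star, X)                    # K(X*, X)
    mu = K_s @ np.linalg.solve(K, f)                  # K(X*,X) K(X,X)^{-1} f
    Sigma = sq_exp_kernel(X_star, X_star) - K_s @ np.linalg.solve(K, K_s.T)
    return mu, Sigma

X = np.array([[-0.5], [0.0], [0.7]])   # three noise-free observations
f = np.array([0.2, 1.0, -0.3])
X_star = np.linspace(-1.0, 1.0, 5)[:, None]
mu, Sigma = gp_posterior_noise_free(X, f, X_star)
```

Evaluated at the training inputs themselves, the posterior mean reproduces the observations and the posterior variance collapses to (nearly) zero, matching the "functions must agree with the data" picture.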

## Posterior: Noise-Free Observations (3)

[Figure: samples from the posterior; output f(x) vs. input x]

- Samples all agree with the observations $\mathcal{D} = \{X, \mathbf{f}\}$.
- The greatest variance is in regions with few training points.

## Prediction: Noisy Observations

Typically we have noisy observations:

$$\mathcal{D} = \{X, \mathbf{y}\}, \quad \text{where } \mathbf{y} = \mathbf{f} + \boldsymbol{\epsilon}.$$

Assume additive noise $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 I)$. Conditioning on $\mathcal{D} = \{X, \mathbf{y}\}$ gives a Gaussian with:

$$\mu = K(X^*, X)\, [K(X, X) + \sigma^2 I]^{-1}\, \mathbf{y}, \quad \text{and}$$

$$\Sigma = K(X^*, X^*) - K(X^*, X)\, [K(X, X) + \sigma^2 I]^{-1}\, K(X, X^*).$$
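
A compact NumPy version of the noisy-observation predictor (again a sketch, using the squared exponential covariance and synthetic data for illustration):

```python
import numpy as np

def sq_exp_kernel(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2)

def gp_predict(X, y, X_star, sigma2):
    K_y = sq_exp_kernel(X, X) + sigma2 * np.eye(len(X))   # K(X,X) + sigma^2 I
    K_s = sq_exp_kernel(X_star, X)
    mu = K_s @ np.linalg.solve(K_y, y)
    Sigma = sq_exp_kernel(X_star, X_star) - K_s @ np.linalg.solve(K_y, K_s.T)
    return mu, Sigma

# synthetic data: y = sin(2x) + noise
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(30, 1))
y = np.sin(2.0 * X[:, 0]) + 0.05 * rng.standard_normal(30)
mu, Sigma = gp_predict(X, y, np.array([[0.3]]), sigma2=0.05**2)
```

Note the only change from the noise-free case: $\sigma^2 I$ is added to $K(X, X)$, which also removes the need for an explicit jitter term.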

## Model Selection: Hyperparameters

e.g. the ARD covariance function:

$$k(\mathbf{x}^{(p)}, \mathbf{x}^{(q)}) = \exp\!\left(-\frac{1}{2\theta^2}\, (\mathbf{x}^{(p)} - \mathbf{x}^{(q)})^2\right).$$

How best to choose $\theta$?
[Figure: samples from the posterior for θ = 0.1, θ = 0.3, and θ = 0.5; output f(x) vs. input x]

## Model Selection: Optimizing Marginal Likelihood (1)

In the absence of a strong prior $P(\theta)$, the posterior for the hyperparameter $\theta$ is proportional to the marginal likelihood:

$$P(\theta \mid X, \mathbf{y}) \propto P(\mathbf{y} \mid X, \theta).$$

Choose $\theta$ to optimize the log marginal likelihood:

$$\log P(\mathbf{y} \mid X, \theta) = -\frac{1}{2} \log \left| K(X, X) + \sigma^2 I \right| - \frac{1}{2}\, \mathbf{y}^\top \left( K(X, X) + \sigma^2 I \right)^{-1} \mathbf{y} - \frac{n}{2} \log 2\pi.$$
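
This quantity is straightforward to evaluate with a Cholesky factorization, which gives both the log determinant and the quadratic term cheaply. A sketch using the ARD covariance above (the helper name, data, and noise level are made up for illustration):

```python
import numpy as np

def log_marginal_likelihood(x, y, theta, sigma2):
    """log P(y | X, theta) for k(x_p, x_q) = exp(-(x_p - x_q)^2 / (2 theta^2))."""
    n = len(x)
    d2 = (x[:, None] - x[None, :]) ** 2
    K_y = np.exp(-d2 / (2.0 * theta ** 2)) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(K_y)                        # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K_y)^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))         # log |K_y|
    return -0.5 * log_det - 0.5 * y @ alpha - 0.5 * n * np.log(2.0 * np.pi)

# choose theta on a grid by maximizing the log marginal likelihood
x = np.linspace(0.0, 1.0, 20)
y = np.sin(3.0 * x)
thetas = np.linspace(0.05, 1.0, 50)
best_theta = max(thetas, key=lambda t: log_marginal_likelihood(x, y, t, sigma2=0.01))
```

In practice one would maximize with a gradient-based optimizer rather than a grid, but the grid makes the trade-off between data fit and model complexity easy to inspect.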

## Model Selection: Optimizing Marginal Likelihood (2)

$\theta_{\mathrm{ML}} = 0.3255$:

[Figure: samples from the posterior with θ = 0.3255; output f(x) vs. input x]

Using $\theta_{\mathrm{ML}}$ is an approximation to the true Bayesian method of integrating over all values of $\theta$, weighted by their posterior.

## References

1. Carl Edward Rasmussen. Gaussian Processes in Machine Learning. Machine Learning Summer School, Tübingen, 2003. http://www.kyb.tuebingen.mpg.de/~carl/mlss03/
2. Carl Edward Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. Forthcoming.
3. Carl Edward Rasmussen. The Gaussian Process Website. http://www.gatsby.ucl.ac.uk/~edward/gp/
