Introduction to Gaussian Process Regression

Hanna M. Wallach
hmw26@cam.ac.uk

January 25, 2005




Outline




- Regression: weight-space view
- Regression: function-space view (Gaussian processes)
- Weight-space and function-space correspondence
- Making predictions
- Model selection: hyperparameters




Supervised Learning: Regression (1)


[Figure: underlying function and noisy training data; output f(x) vs. input x.]




- Assume an underlying process which generates "clean" data.
- Goal: recover the underlying process from noisy observed data.



Supervised Learning: Regression (2)



- Training data are D = {(x^(i), y^(i)) | i = 1, …, n}.
- Each input is a vector x of dimension d.
- Each target is a real-valued scalar y = f(x) + noise.
- Collect the inputs in a d × n matrix X and the targets in a vector y, so that D = {X, y}.
- We wish to infer f_* for an unseen input x_*, using P(f_* | x_*, D).
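
To make this setup concrete, here is a minimal sketch of generating such data in Python with NumPy (the sin-based underlying process and the noise level are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 1, 20                          # input dimension, number of training points
X = rng.uniform(-1, 1, size=(d, n))   # inputs collected as columns of a d x n matrix

def f_true(x):
    # Hypothetical "clean" underlying process (illustrative choice).
    return np.sin(3 * x).sum(axis=0)

sigma = 0.1                                  # noise standard deviation
y = f_true(X) + sigma * rng.normal(size=n)   # noisy targets, y = f(x) + noise
```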




Gaussian Process Models: Inference in Function Space


[Figure: samples from the prior (left) and samples from the posterior (right); output f(x) vs. input x.]




- A Gaussian process defines a distribution over functions.
- Inference takes place directly in function space.



Part I

Regression: The Weight-Space View




Bayesian Linear Regression (1)


[Figure: noisy training data; output f(x) vs. input x.]




Assuming noise ε ∼ N(0, σ²), the linear regression model is:

    f(x | w) = xᵀw,    y = f + ε.


Bayesian Linear Regression (2)


- The likelihood of the parameters is:

    P(y | X, w) = N(Xᵀw, σ²I).

- Assume a Gaussian prior over the parameters:

    P(w) = N(0, Σp).

- Apply Bayes' theorem to obtain the posterior:

    P(w | y, X) ∝ P(y | X, w) P(w).



Bayesian Linear Regression (3)


- The posterior distribution over w is:

    P(w | y, X) = N((1/σ²) A⁻¹X y, A⁻¹),  where A = Σp⁻¹ + (1/σ²) XXᵀ.

- The predictive distribution is:

    P(f_* | x_*, X, y) = ∫ f(x_* | w) P(w | X, y) dw
                       = N((1/σ²) x_*ᵀA⁻¹X y, x_*ᵀA⁻¹x_*).
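
A minimal NumPy sketch of both formulas, keeping the slides' convention that X is d × n (the data-generating weights, prior covariance, and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 50
X = rng.normal(size=(d, n))                 # inputs as columns
w_true = np.array([0.5, -1.0])              # illustrative generating weights
sigma = 0.1
y = X.T @ w_true + sigma * rng.normal(size=n)

Sigma_p = np.eye(d)                         # prior covariance over weights

# Posterior over w: N((1/sigma^2) A^{-1} X y, A^{-1})
A = np.linalg.inv(Sigma_p) + X @ X.T / sigma**2
A_inv = np.linalg.inv(A)
w_mean = A_inv @ X @ y / sigma**2

# Predictive distribution at a test input x_*
x_star = np.array([1.0, 2.0])
f_mean = x_star @ w_mean                    # = (1/sigma^2) x_*^T A^{-1} X y
f_var = x_star @ A_inv @ x_star             # = x_*^T A^{-1} x_*
print(f_mean, f_var)
```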



Increasing Expressiveness



- Use a set of m basis functions Φ(x) to project a d-dimensional input x into an m-dimensional feature space, e.g. Φ(x) = (1, x, x², …)ᵀ.
- P(f_* | x_*, X, y) can then be expressed in terms of inner products in feature space, so we can use the kernel trick (see the sketch below).
- How many basis functions should we use?
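
A sketch of this idea with a polynomial feature map (the cubic map and identity prior covariance are illustrative assumptions): the induced inner product Φ(x^(p))ᵀΣp Φ(x^(q)) is exactly the kind of covariance function that reappears in Part II.

```python
import numpy as np

def phi(x, m=4):
    # Polynomial feature map Phi(x) = (1, x, x^2, ..., x^(m-1)) for scalar x.
    return np.array([x**i for i in range(m)])

Sigma_p = np.eye(4)  # prior covariance over the m weights

def k(xp, xq):
    # Inner product in feature space, weighted by the prior covariance.
    return phi(xp) @ Sigma_p @ phi(xq)

print(k(0.5, -0.2))
```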




Part II

Regression: The Function-Space View




Gaussian Processes: Definition



- A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
- Consistency: if the GP specifies (y^(1), y^(2)) ∼ N(µ, Σ), then it must also specify y^(1) ∼ N(µ₁, Σ₁₁).
- A GP is completely specified by a mean function and a positive definite covariance function.




Gaussian Processes: A Distribution over Functions


- e.g. choose the zero mean function and the covariance function:

    K_pq = Cov(f(x^(p)), f(x^(q))) = K(x^(p), x^(q)).

- For any set of inputs x^(1), …, x^(n) we may compute K, which defines a joint distribution over function values:

    f(x^(1)), …, f(x^(n)) ∼ N(0, K).

- Therefore a GP specifies a distribution over functions.



Gaussian Processes: Simple Example


- We can obtain a GP from the Bayesian linear regression model:

    f(x) = xᵀw with w ∼ N(0, Σp).

- The mean function is given by:

    E[f(x)] = xᵀE[w] = 0.

- The covariance function is given by:

    E[f(x)f(x′)] = xᵀE[wwᵀ]x′ = xᵀΣp x′.
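
As a sanity check, this covariance can be verified numerically by sampling weights from the prior (a minimal sketch; the particular inputs and Σp are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 2
Sigma_p = np.array([[1.0, 0.3], [0.3, 2.0]])

x1 = np.array([0.5, -1.0])
x2 = np.array([1.5, 0.2])

# Sample many weight vectors w ~ N(0, Sigma_p) and evaluate f(x) = x^T w.
W = rng.multivariate_normal(np.zeros(d), Sigma_p, size=100_000)
f1, f2 = W @ x1, W @ x2

print(np.mean(f1 * f2))     # empirical E[f(x1) f(x2)]
print(x1 @ Sigma_p @ x2)    # analytic  x1^T Sigma_p x2
```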



Weight-Space and Function-Space Correspondence


- For any set of m basis functions Φ(x), the corresponding covariance function is:

    K(x^(p), x^(q)) = Φ(x^(p))ᵀ Σp Φ(x^(q)).

- Conversely, for every covariance function K, there is a (possibly infinite) expansion in terms of basis functions:

    K(x^(p), x^(q)) = Σ_{i=1}^∞ λᵢ Φᵢ(x^(p)) Φᵢ(x^(q)).




The Covariance Function

- Specifies the covariance between pairs of random variables.
- e.g. the squared exponential covariance function:

    Cov(f(x^(p)), f(x^(q))) = K(x^(p), x^(q)) = exp(−½ |x^(p) − x^(q)|²).
[Figure: K(x^(p) = 5, x^(q)) as a function of x^(q).]
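
A direct transcription of this covariance function (a minimal sketch; the evaluation grid mirrors the figure above):

```python
import numpy as np

def k_se(xp, xq):
    # Squared exponential covariance: exp(-0.5 * |xp - xq|^2).
    return np.exp(-0.5 * np.sum((np.asarray(xp) - np.asarray(xq))**2))

# Reproduce the figure: K(x_p = 5, x_q) as a function of x_q on [0, 10].
xq = np.linspace(0, 10, 101)
values = np.exp(-0.5 * (5.0 - xq)**2)
print(values.max())  # peaks at 1 when x_q = 5
```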




Gaussian Process Prior

- Given a set of inputs x^(1), …, x^(n) we may draw samples f(x^(1)), …, f(x^(n)) from the GP prior:

    f(x^(1)), …, f(x^(n)) ∼ N(0, K).

- Four samples:
[Figure: four samples from the GP prior; output f(x) vs. input x.]
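
One way to draw such samples is via a Cholesky factor of K (a sketch assuming the squared exponential covariance from the previous slide; the jitter term is a standard numerical safeguard, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 100)

# Squared exponential covariance matrix over all pairs of inputs.
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))  # jitter for stability

# Four samples f ~ N(0, K), as in the figure above.
samples = L @ rng.normal(size=(len(x), 4))
```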




Posterior: Noise-Free Observations (1)


- Given noise-free training data:

    D = {(x^(i), f^(i)) | i = 1, …, n} = {X, f}.

- We want to make predictions f_* at test points X_*.
- According to the GP prior, the joint distribution of f and f_* is:

    [f, f_*]ᵀ ∼ N(0, [[K(X, X), K(X, X_*)], [K(X_*, X), K(X_*, X_*)]]).




Posterior: Noise-Free Observations (2)



- Condition {X_*, f_*} on D = {X, f} to obtain the posterior.
- This restricts the prior to contain only functions which agree with D.
- The posterior, P(f_* | X_*, X, f), is Gaussian with:

    µ = K(X_*, X) K(X, X)⁻¹ f, and
    Σ = K(X_*, X_*) − K(X_*, X) K(X, X)⁻¹ K(X, X_*).
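
A minimal sketch of this conditioning (the training values come from an illustrative sin function, not from the slides; the small jitter keeps the inversion of K(X, X) numerically stable):

```python
import numpy as np

def K(A, B):
    # Squared exponential covariance between two sets of scalar inputs.
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2)

X = np.array([0.1, 0.4, 0.7, 0.9])    # noise-free training inputs
f = np.sin(2 * np.pi * X)             # observed function values (illustrative)
X_star = np.linspace(0, 1, 50)        # test inputs

K_inv = np.linalg.inv(K(X, X) + 1e-10 * np.eye(len(X)))  # jitter for stability
mu = K(X_star, X) @ K_inv @ f
Sigma = K(X_star, X_star) - K(X_star, X) @ K_inv @ K(X, X_star)
```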




Posterior: Noise-Free Observations (3)


[Figure: samples from the posterior; output f(x) vs. input x.]




- The samples all agree with the observations D = {X, f}.
- The greatest variance is in regions with few training points.



Prediction: Noisy Observations


- Typically we have noisy observations:

    D = {X, y}, where y = f + ε.

- Assume additive noise ε ∼ N(0, σ²I).
- Conditioning on D = {X, y} gives a Gaussian with:

    µ = K(X_*, X)[K(X, X) + σ²I]⁻¹ y, and
    Σ = K(X_*, X_*) − K(X_*, X)[K(X, X) + σ²I]⁻¹ K(X, X_*).
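
The same sketch with noise, where only the matrix being inverted changes (illustrative data and noise level):

```python
import numpy as np

def K(A, B):
    # Squared exponential covariance between two sets of scalar inputs.
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2)

rng = np.random.default_rng(4)
sigma = 0.1
X = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * X) + sigma * rng.normal(size=len(X))  # noisy targets
X_star = np.linspace(0, 1, 50)

C_inv = np.linalg.inv(K(X, X) + sigma**2 * np.eye(len(X)))
mu = K(X_star, X) @ C_inv @ y
Sigma = K(X_star, X_star) - K(X_star, X) @ C_inv @ K(X, X_star)
```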




Model Selection: Hyperparameters


- e.g. the ARD covariance function with length-scale hyperparameter θ:

    K(x^(p), x^(q)) = exp(−(x^(p) − x^(q))² / (2θ²)).

- How best to choose θ?
[Figure: samples from the posterior for θ = 0.1, θ = 0.3, and θ = 0.5; output f(x) vs. input x.]
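
To see why θ matters, a minimal sketch of the kernel with an explicit length-scale (the helper name k_theta is hypothetical): smaller θ makes the covariance between nearby inputs decay faster, giving the wigglier samples in the left panel.

```python
import numpy as np

def k_theta(xp, xq, theta):
    # Squared exponential covariance with length-scale theta.
    return np.exp(-0.5 * (xp - xq)**2 / theta**2)

# Smaller theta -> covariance decays faster -> wigglier sample functions.
for theta in (0.1, 0.3, 0.5):
    print(theta, k_theta(0.0, 0.2, theta))
```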




Model Selection: Optimizing Marginal Likelihood (1)


- In the absence of a strong prior P(θ), the posterior for the hyperparameter θ is proportional to the marginal likelihood:

    P(θ | X, y) ∝ P(y | X, θ).

- Choose θ to optimize the marginal log-likelihood:

    log P(y | X, θ) = −½ log |K(X, X) + σ²I| − ½ yᵀ(K(X, X) + σ²I)⁻¹y − (n/2) log 2π.
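
A sketch that evaluates this expression and picks θ by a simple grid search (the data, noise level, and grid are illustrative assumptions; in practice one would typically use gradient-based optimization):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 0.1
X = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * X) + sigma * rng.normal(size=len(X))
n = len(X)

def log_marginal_likelihood(theta):
    # ARD/squared exponential covariance with length-scale theta.
    K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / theta**2)
    C = K + sigma**2 * np.eye(n)
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * logdet - 0.5 * y @ np.linalg.solve(C, y) - 0.5 * n * np.log(2 * np.pi)

thetas = np.linspace(0.05, 1.0, 200)
theta_ml = thetas[np.argmax([log_marginal_likelihood(t) for t in thetas])]
print(theta_ml)
```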



Model Selection: Optimizing Marginal Likelihood (2)


- θ_ML = 0.3255:
[Figure: samples from the posterior with θ = 0.3255; output f(x) vs. input x.]


- Using θ_ML is an approximation to the true Bayesian method of integrating over all θ values weighted by their posterior.



References



1. Carl Edward Rasmussen. Gaussian Processes in Machine Learning. Machine Learning Summer School, Tübingen, 2003. http://www.kyb.tuebingen.mpg.de/~carl/mlss03/
2. Carl Edward Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. Forthcoming.
3. Carl Edward Rasmussen. The Gaussian Process Website. http://www.gatsby.ucl.ac.uk/~edward/gp/



