Overview of Kernel Methods

Steve Vincent

Adapted from John Shawe-Taylor and Nello Cristianini,
Kernel Methods for Pattern Analysis
    Coordinate Transformation

      Planetary position in a two-dimensional orthogonal coordinate system

       X        Y
       0.8415   0.5403
       0.9093  -0.4161
       0.1411  -0.9900
      -0.7568  -0.6536
      -0.9589   0.2837
      -0.2794   0.9602
       0.6570   0.7539
       0.9894  -0.1455
       0.4121  -0.9111
      -0.5440  -0.8391

      [Figure: Plot of x vs y]

                                                                                             2
    Coordinate Transformation

      Planetary position in a two-dimensional orthogonal coordinate system

       X        Y        X^2      Y^2
       0.8415   0.5403   0.7081   0.2919
       0.9093  -0.4161   0.8268   0.1731
       0.1411  -0.9900   0.0199   0.9801
      -0.7568  -0.6536   0.5727   0.4272
      -0.9589   0.2837   0.9195   0.0805
      -0.2794   0.9602   0.0781   0.9220
       0.6570   0.7539   0.4316   0.5684
       0.9894  -0.1455   0.9789   0.0212
       0.4121  -0.9111   0.1698   0.8301
      -0.5440  -0.8391   0.2959   0.7041

      [Figure: Plot of x vs y]
      [Figure: Plot of x^2 vs y^2]

                                                                                             3
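To make the transformation above concrete, here is a minimal NumPy sketch (not part of the original slides) that regenerates the table from x = sin(t), y = cos(t) with the assumed choice t = 1..10, and checks that the squared coordinates satisfy the linear relation x² + y² = 1, which is why the second plot is a straight line.

```python
import numpy as np

# Points on the unit circle: x = sin(t), y = cos(t), matching the table above.
t = np.arange(1, 11)
x, y = np.sin(t), np.cos(t)

# No linear relation holds between x and y, but the squared coordinates
# satisfy the linear equation x^2 + y^2 = 1.
print(np.round(np.c_[x, y, x**2, y**2], 4))
print(np.allclose(x**2 + y**2, 1.0))   # True: a line in the (x^2, y^2) plane
```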
Non-linear Kernel Classification

[Figure: the mapping φ(x) from input space to feature space]

If the data is not separable by a hyperplane…

                 … transform it to a feature space where it is!

                                                             4
Pattern Analysis Algorithm
   Computational efficiency
       All algorithms should be computationally efficient: the degree of
        any polynomial involved should keep the algorithm practical for
        large data sets.
   Robustness
       Able to handle noisy data and identify approximate patterns.
   Statistical stability
       The output should not be sensitive to the particular dataset, only
        to the underlying source of the data.
                                                            5
    Kernel Method
   Mapping into embedding or feature space
    defined by kernel
   Learning algorithm for discovering linear
    patterns in that space
   Learning algorithm must work in dual space
       Primal solution: computes the weight vector
        explicitly
       Dual solution: gives the solution as a linear
        combination of the training examples

                                                        6
  Kernel Methods: the mapping

  [Figure: the mapping φ from the Original Space to the Feature (Vector) Space]

                                                      7
Linear Regression
Given training data:
     $S = \{ (x_1, y_1), (x_2, y_2), \dots, (x_i, y_i), \dots, (x_\ell, y_\ell) \}$
     with points $x_i \in R^n$ and labels $y_i \in R$.

   Construct a linear function:
          $g(x) = \langle w, x \rangle = w'x = \sum_{i=1}^{n} w_i x_i$

   This creates the pattern function:
          $f(x, y) = |y - g(x)| = |y - \langle w, x \rangle| \approx 0$

                                                                             8
    1-d Regression

    [Figure: one-dimensional regression; the line $\langle w, x \rangle$ fit to data points in the $(x, y)$ plane]

                             9
Least Squares Approximation
   Want $g(x) \approx y$
   Define the error $f(x, y) = y - g(x) = \xi$
   Minimize the squared loss:
          $L(g, S) = L(w, S) = \sum_{i=1}^{\ell} (y_i - g(x_i))^2
                             = \sum_{i=1}^{\ell} \xi_i^2
                             = \sum_{i=1}^{\ell} l((x_i, y_i), g)$
                                                                10
 Optimal Solution
    Want: $y \approx Xw$
    Mathematical model:
          $\min_w L(w, S) = \| y - Xw \|^2 = (y - Xw)'(y - Xw)$

    Optimality condition:
          $\frac{\partial L(w, S)}{\partial w} = -2X'y + 2X'Xw = 0$

    The solution satisfies: $X'Xw = X'y$
            Solving this $n \times n$ system is $O(n^3)$.
                                                           11
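A minimal sketch of solving the system $X'Xw = X'y$ numerically; the synthetic X, y, and w_true below are illustrative assumptions, not data from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 training points in R^5 (illustrative data)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=100)

# Solve X'X w = X'y (cost O(n^3) in the dimension n).
w = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(w, 3))                    # close to w_true
```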
Ridge Regression
   The inverse typically does not exist.
   Use the least-norm solution for fixed $\lambda > 0$.
   Regularized problem:
          $\min_w L_\lambda(w, S) = \lambda \| w \|^2 + \| y - Xw \|^2$

   Optimality condition:
          $\frac{\partial L_\lambda(w, S)}{\partial w} = 2\lambda w - 2X'y + 2X'Xw = 0$

          $(X'X + \lambda I_n)\, w = X'y$      Requires $O(n^3)$ operations
                                                            12
Ridge Regression (cont)
    The inverse always exists for any $\lambda > 0$:
          $w = (X'X + \lambda I)^{-1} X'y$

    Alternative representation:
          $(X'X + \lambda I)\, w = X'y \;\Rightarrow\; w = \lambda^{-1} (X'y - X'Xw)$
          $w = \lambda^{-1} X'(y - Xw) = X'\alpha$
          $\alpha = \lambda^{-1} (y - Xw)$
          $\lambda \alpha = y - Xw = y - XX'\alpha$
          $(XX' + \lambda I)\, \alpha = y$      Solving this $\ell \times \ell$ system is $O(\ell^3)$
          $\alpha = (G + \lambda I)^{-1} y$     where $G = XX'$
                                                                     13
Dual Ridge Regression
   To predict a new point:
          $g(x) = \langle w, x \rangle = \sum_{i=1}^{\ell} \alpha_i \langle x_i, x \rangle = y'(G + \lambda I)^{-1} z$
          where $z_i = \langle x_i, x \rangle$

   Note we need only compute $G$, the Gram matrix:
          $G = XX'$, $\; G_{ij} = \langle x_i, x_j \rangle$

    Ridge regression requires only inner products between data points.
                                                           14
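To make the primal/dual equivalence concrete, here is a small NumPy sketch (synthetic data and an illustrative regularizer, both assumptions) checking that the primal weight vector and the dual coefficients give the same prediction on a new point.

```python
import numpy as np

rng = np.random.default_rng(1)
n, l, lam = 5, 50, 0.1                   # dimension n, sample size l, regularizer lambda
X = rng.normal(size=(l, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=l)

# Primal solution: w = (X'X + lam*I_n)^{-1} X'y, an n x n system.
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Dual solution: alpha = (G + lam*I_l)^{-1} y with G = XX', an l x l system.
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(l), y)

# Both give the same prediction g(x) on a new point.
x_new = rng.normal(size=n)
g_primal = w @ x_new
g_dual = alpha @ (X @ x_new)             # sum_i alpha_i <x_i, x>
print(np.isclose(g_primal, g_dual))      # True
```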
Efficiency
   To compute:
      $w$ in primal ridge regression: $O(n^3)$
      $\alpha$ in dual ridge regression: $O(\ell^3)$
   To predict a new point $x$:
      primal: $g(x) = \langle w, x \rangle = \sum_{i=1}^{n} w_i x_i$                                                         $O(n)$
      dual:   $g(x) = \langle w, x \rangle = \sum_{i=1}^{\ell} \alpha_i \langle x_i, x \rangle = \sum_{i=1}^{\ell} \alpha_i \left( \sum_{j=1}^{n} x_{ij} x_j \right)$   $O(n\ell)$

    The dual is better if $n \gg \ell$.
                                                                                  15
Notes on Ridge Regression
   Regularization is key to addressing statistical stability.
   Regularization lets the method work when $n \gg \ell$.
   The dual is more efficient when $n \gg \ell$.
   The dual only requires inner products of the data.

                                            16
Linear Regression in Feature Space
Key Idea:
 Map the data to a higher dimensional space
 (feature space) and perform linear
 regression in the embedded space.

Embedding map:
          $\phi : x \in R^n \mapsto \phi(x) \in F \subseteq R^N, \quad N \gg n$

Alternative form of the weight vector:
          $w = \sum_i \alpha_i \phi(x_i)$

                                              17
     Nonlinear Regression in Feature Space
 In the primal representation:

   $x = (a, b)$
   $\langle x, w \rangle = w_1 a + w_2 b$
   $\phi(x) = (a^2, b^2, \sqrt{2}\,ab)$
   $g(x) = \langle \phi(x), w \rangle_F = w_1 a^2 + w_2 b^2 + w_3 \sqrt{2}\,ab$

                                         18
    Nonlinear Regression in Feature Space
In the dual representation:

          $g(x) = \langle \phi(x), w \rangle_F
                = \sum_{i=1}^{\ell} \alpha_i \langle \phi(x), \phi(x_i) \rangle
                = \sum_{i=1}^{\ell} \alpha_i K(x, x_i)$

                                                     19
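A hedged sketch of the dual form above: kernel ridge regression with the degree-2 polynomial kernel K(u,v) = ⟨u,v⟩² from the derivation that follows. The data, regularizer, and target function are illustrative assumptions.

```python
import numpy as np

def poly2_kernel(A, B):
    """K(u, v) = <u, v>^2, the degree-2 polynomial kernel."""
    return (A @ B.T) ** 2

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(60, 2))
y = X[:, 0] ** 2 + 2 * X[:, 1] ** 2       # a target that is linear in the feature space

lam = 1e-3
K = poly2_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual ridge regression

X_test = rng.uniform(-1, 1, size=(5, 2))
g = poly2_kernel(X_test, X) @ alpha       # g(x) = sum_i alpha_i K(x, x_i)
print(np.round(np.c_[g, X_test[:, 0]**2 + 2*X_test[:, 1]**2], 3))   # predictions vs targets
```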
Kernel Methods : intuitive idea
   Find a mapping f such that, in the new
    space, problem solving is easier (e.g. linear)
   The kernel represents the similarity between
    two objects (documents, terms, …), defined
    as the dot-product in this new vector space
   But the mapping is left implicit
   Easy generalization of a lot of dot-product (or
    distance) based pattern recognition
    algorithms

                                                20
 Derivation of Kernel
   $\langle \phi(u), \phi(v) \rangle$
      $= \langle (u_1^2, u_2^2, \sqrt{2}\,u_1 u_2), (v_1^2, v_2^2, \sqrt{2}\,v_1 v_2) \rangle$
      $= u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2$
      $= (u_1 v_1 + u_2 v_2)^2$
      $= \langle u, v \rangle^2$

   Thus:   $K(u, v) = \langle u, v \rangle^2$

                                                    21
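A tiny numerical check of the derivation above: the explicit feature map φ(x) = (x₁², x₂², √2 x₁x₂) and the kernel ⟨u,v⟩² agree. The test vectors are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit feature map for 2-d inputs: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

u = np.array([0.3, -1.2])
v = np.array([2.0, 0.7])

lhs = phi(u) @ phi(v)        # <phi(u), phi(v)> computed in feature space
rhs = (u @ v) ** 2           # K(u, v) = <u, v>^2 computed in input space
print(np.isclose(lhs, rhs))  # True: the kernel avoids the explicit mapping
```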
Kernel Function
   A kernel is a function K such that
      $K(x, u) = \langle \phi(x), \phi(u) \rangle_F$
      where $\phi$ is a mapping from the input space
      to the feature space $F$.

   There are many possible kernels.
      The simplest is the linear kernel:
          $K(x, u) = \langle x, u \rangle$

                                              22
Kernel : more formal definition
   A kernel k(x,y)
      is a similarity measure
      defined by an implicit mapping f,

      from the original space to a vector space (feature
       space)
      such that: k(x,y) = f(x)•f(y)

   This similarity measure and the mapping include:
       Invariance or other a priori knowledge
       Simpler structure (linear representation of the data)
       The class of functions the solution is taken from
       Possibly infinite dimension (hypothesis space for learning)
       … but still computational efficiency when computing k(x,y)
                                                                 23
     Kernelization
    Replace the inner product $\langle x, y \rangle$ by $K(x, y)$, where $K : X \times X \to R$ is such that
    (i) $K(x, y) = K(y, x)$ (symmetry).
    (ii) For any square-integrable ($L^2(X)$) function $f$,
         $\int\!\!\int K(x, y)\, f(x) f(y)\, dx\, dy \geq 0$ (positive-definiteness).

   Such a K is called a Mercer kernel.
   Kernels were introduced in mathematics to solve
    integral equations.
   Kernels measure similarity of inputs.

                                                                      24
    Brief Comments on Hilbert Spaces
   A Hilbert space is a generalization of finite
    dimensional vector spaces with inner product to
    a possibly infinite dimension.
       Most interesting infinite-dimensional vector spaces
        are function spaces.
       Hilbert spaces are the simplest among such spaces.
       Prime example: $L^2$ (the square-integrable functions).
       Any continuous linear functional on a Hilbert space is
        given by an inner product with a vector (Riesz
        Representation Theorem).
       A representation of a vector w.r.t. a fixed basis is
        called a Fourier expansion.
                                                            25
Making Kernels
    The kernel function must be symmetric,
          $K(x, z) = \langle \phi(x), \phi(z) \rangle = \langle \phi(z), \phi(x) \rangle = K(z, x)$
    and satisfy the inequality that follows from the
     Cauchy-Schwarz inequality:
          $K(x, z)^2 = \langle \phi(x), \phi(z) \rangle^2 \leq \| \phi(x) \|^2 \| \phi(z) \|^2$
                     $= \langle \phi(x), \phi(x) \rangle \langle \phi(z), \phi(z) \rangle = K(x, x)\, K(z, z)$
                                                         26
The Kernel Gram Matrix
   With KM-based learning, the sole information
    used from the training data set is the Kernel
    Gram Matrix:

    $K_{\text{training}} = \begin{pmatrix}
      k(x_1, x_1) & k(x_1, x_2) & \dots  & k(x_1, x_m) \\
      k(x_2, x_1) & k(x_2, x_2) & \dots  & k(x_2, x_m) \\
      \vdots      & \vdots      & \ddots & \vdots      \\
      k(x_m, x_1) & k(x_m, x_2) & \dots  & k(x_m, x_m)
    \end{pmatrix}$

   If the kernel is valid, K is symmetric positive
    semi-definite (all eigenvalues are non-negative).
                                                          27
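A short sketch verifying the property stated above on a random sample: the Gram matrix of a valid kernel (here, a degree-2 polynomial kernel) is symmetric with non-negative eigenvalues. The data and kernel choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))

# Gram matrix of the degree-2 polynomial kernel: K_ij = (<x_i, x_j> + 1)^2.
K = (X @ X.T + 1.0) ** 2

# A valid kernel yields a symmetric positive semi-definite Gram matrix:
print(np.allclose(K, K.T))                        # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-8))     # eigenvalues non-negative (up to round-off)
```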
         Mercer's Theorem
   Suppose X is compact. (Always true for finite examples.)
   Suppose K is a Mercer kernel.
   Then it can be expanded, using eigenvalues and
    eigenfunctions of K, as
          $K(x, y) = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(y)$

   Now, using the eigenfunctions and their span, find
    a Hilbert space $H$ and a map $\Phi : X \to H$
    such that the inner product in $H$ is given by K,
    that is, $\langle \Phi(x), \Phi(y) \rangle = K(x, y)$.
       H is called a Reproducing Kernel Hilbert Space (RKHS).

                                                                 28
Characterization of Kernels
     Prove: (kernel function)
      K symmetric  $\Rightarrow$  $K = V \Lambda V'$
      where V is an orthogonal matrix.

    $\Lambda = \begin{pmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ & & \ddots & \\ 0 & 0 & \dots & \lambda_n \end{pmatrix}$,
    $V = (v_1\; v_2\; v_3 \dots v_n)$,  $\; K v_t = \lambda_t v_t$,  $\; v_t = (v_{ti})_{i=1}^{n}$

      Let
    $\phi : x_i \mapsto \left( \sqrt{\lambda_t}\, v_{ti} \right)_{t=1}^{n} \in R^n, \quad i = 1, \dots, n$

                                                   29
Characterization of Kernels
   Then for any $x_i$, $x_j$:
      $\langle \phi(x_i), \phi(x_j) \rangle = \sum_{t=1}^{n} \lambda_t v_{ti} v_{tj} = (V \Lambda V')_{ij} = K_{ij} = K(x_i, x_j)$

   (positive semi-definiteness) Suppose there exists $\lambda_s < 0$ with eigenvector $v_s$.
      Consider the point $z = \sum_{i=1}^{n} v_{si}\, \phi(x_i) = \sqrt{\Lambda}\, V' v_s$.
      Then
      $\| z \|^2 = z \cdot z = v_s' V \sqrt{\Lambda} \sqrt{\Lambda} V' v_s = v_s' V \Lambda V' v_s = v_s' K v_s = \lambda_s < 0$,
      a contradiction.

                                                                                 30
Reproducing Kernel Hilbert Spaces
   Reproducing Kernel Hilbert Spaces (RKHS)
   [1] The Hilbert space $L^2$ is too "big" for our
    purpose, containing too many non-smooth
    functions. One approach to obtaining restricted,
    smooth spaces is the RKHS approach.
   An RKHS is "smaller" than a general Hilbert space.
   Define the reproducing kernel map
     (to each x we associate a function $k(\cdot, x)$):
          $\Phi : x \mapsto k(\cdot, x)$

                                                         31
Characterization of Kernels
   We now define an inner product.
   Construct a vector space containing all linear
    combinations of the functions $k(\cdot, x)$:
          $f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i)$        (This will be our RKHS.)

   Let $g(\cdot) = \sum_{j=1}^{m'} \beta_j k(\cdot, x'_j)$ and define
          $\langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j)$

   We prove that this is an inner product on the RKHS.

                                                         32
Characterization of Kernels
   Symmetry is obvious, and linearity is easy to show.
   Prove that $\langle f, f \rangle = 0 \Rightarrow f = 0$.
      $\langle k(\cdot, x), f \rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x) = f(x)$
     We say k is the representer of evaluation. [2]

   By the above (the reproducing property),
          $f(x)^2 = \langle k(\cdot, x), f \rangle^2 \leq \| k(\cdot, x) \|^2 \| f \|^2
                  = \langle k(\cdot, x), k(\cdot, x) \rangle \langle f, f \rangle = k(x, x) \langle f, f \rangle$
    (from Cauchy-Schwarz).
   So $\langle f, f \rangle = 0 \Rightarrow f = 0$, and this is our RKHS.

                                                                   33
Characterization of Kernels
   Formal definition:
    For a compact subset X of $R^d$ and a Hilbert
    space H of functions $f : X \to R$, we say that H
    is a reproducing kernel Hilbert space if there exists
    $k : X^2 \to R$ such that:
       1. k has the reproducing property: $\langle k(\cdot, x), f \rangle = f(x)$.
       2. k spans H: $\mathrm{span}\{ k(\cdot, x) : x \in X \} = H$.

                                                        34
    Popular Kernels based on vectors
By Hilbert-Schmidt Kernels (Courant and Hilbert 1953):

          $\langle \Phi(u), \Phi(v) \rangle = K(u, v)$
for certain $\Phi$ and K, e.g.

          $\Phi(u)$                          $K(u, v)$
   Degree d polynomial                       $(\langle u, v \rangle + 1)^d$
   Radial Basis Function Machine             $\exp\!\left( -\| u - v \|^2 / (2\sigma^2) \right)$
   Two-Layer Neural Network                  $\mathrm{sigmoid}(\langle u, v \rangle + c)$

                                                          35
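Minimal NumPy versions of the three kernels in the table, as one plausible reading of the formulas: the RBF width σ, the polynomial degree d, the sigmoid slope η, and offset c are assumed parameter names, the sigmoid is implemented as tanh, and that kernel is not positive semi-definite for all parameter choices.

```python
import numpy as np

def polynomial_kernel(u, v, d=3):
    """Degree-d polynomial kernel: (<u, v> + 1)^d."""
    return (u @ v + 1.0) ** d

def rbf_kernel(u, v, sigma=1.0):
    """Radial basis function kernel: exp(-||u - v||^2 / (2*sigma^2))."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(u, v, eta=1.0, c=-1.0):
    """Two-layer neural network ('sigmoid') kernel: tanh(eta*<u, v> + c)."""
    return np.tanh(eta * (u @ v) + c)

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```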
Examples of Kernels

[Figure: the mapping φ illustrated for a polynomial kernel (n=2) and an RBF kernel (n=2)]

                          36
     How to build new kernels
   Kernel combinations, preserving validity:
       $K(x, y) = \lambda K_1(x, y) + (1 - \lambda) K_2(x, y), \quad 0 \leq \lambda \leq 1$
       $K(x, y) = a \cdot K_1(x, y), \quad a \geq 0$
       $K(x, y) = K_1(x, y) \cdot K_2(x, y)$
       $K(x, y) = f(x) \cdot f(y)$, with f a real-valued function
       $K(x, y) = K_3(\phi(x), \phi(y))$
       $K(x, y) = x' P y$, with P symmetric positive definite
       $K(x, y) = \dfrac{K_1(x, y)}{\sqrt{K_1(x, x)\, K_1(y, y)}}$
                                                                  37
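A sketch that builds several of the combinations above from two base Gram matrices and checks that each result is still positive semi-definite; the base kernels, sample, and parameter values mu and a are illustrative assumptions.

```python
import numpy as np

# Two base Gram matrices on the same sample (here: linear and squared-linear kernels).
rng = np.random.default_rng(4)
X = rng.normal(size=(15, 3))
K1 = X @ X.T
K2 = K1 ** 2

mu, a = 0.3, 2.5
combos = {
    "convex combination": mu * K1 + (1 - mu) * K2,
    "positive scaling":   a * K1,
    "product":            K1 * K2,     # element-wise (Schur) product of Gram matrices
    "normalisation":      K1 / np.sqrt(np.outer(np.diag(K1), np.diag(K1))),
}

# Each combination is again a valid kernel: its Gram matrix stays positive semi-definite.
for name, K in combos.items():
    print(name, np.all(np.linalg.eigvalsh(K) >= -1e-8))
```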
Important Points
   Kernel method =
    linear method + embedding in feature space.
   Kernel functions are used to do the embedding efficiently.
   The feature space is a higher dimensional space,
    so we must regularize.
   Choose a kernel appropriate to the domain.

                                        38
     Principal Component Analysis
     (PCA)
   Subtract the mean (centers the
    data).
   Compute the covariance matrix, S.
   Compute the eigenvectors of S,
    sort them according to their
    eigenvalues and keep the M first
    vectors.
   Project the data points on those
    vectors.

   Also called the Karhunen-Loeve
    transformation.




                                        39
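The four PCA steps above in a minimal NumPy sketch; the function name pca and the random test data are assumptions for illustration.

```python
import numpy as np

def pca(X, M):
    """Project data X (rows are points) onto its first M principal components."""
    Xc = X - X.mean(axis=0)                    # 1. subtract the mean
    S = np.cov(Xc, rowvar=False)               # 2. covariance matrix
    vals, vecs = np.linalg.eigh(S)             # 3. eigenvectors of S ...
    order = np.argsort(vals)[::-1][:M]         #    ... sorted by eigenvalue, keep the first M
    return Xc @ vecs[:, order]                 # 4. project the data points

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
print(pca(X, 2).shape)                         # (100, 2)
```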
    Kernel PCA
   Principal Component Analysis (PCA) is one of the
    fundamental techniques in a wide range of areas.
   Simply stated, PCA diagonalizes (or finds the singular value
    decomposition (SVD) of) the covariance matrix.
       Equivalently, we may find the SVD of the data matrix.
   Instead of PCA in the original input space, we may
    perform PCA in the feature space. This is called Kernel
    PCA.
       Find eigenvalues and eigenvectors of the Gram matrix.
       The Gram matrix is an $n \times n$ matrix $K$ whose ij-th entry is $K(x_i, x_j)$.
       For many applications, we need to find online algorithms, i.e.,
        algorithms that do not need to store the Gram matrix.

                                                                                   40
    PCA in dot-product form
 Assume we have centered observations (column
  vectors) $x_i$; centered means $\sum_{i=1}^{\ell} x_i = 0$.

PCA finds the principal axes by diagonalizing the
  covariance matrix C:
          $\lambda v = C v$                                 (1)
    (eigenvalue $\lambda$, eigenvector $v$)

          $C = \frac{1}{\ell} \sum_{j=1}^{\ell} x_j x_j^T$      (2)
    (covariance matrix)

                                                  41
       PCA in dot-product form
     Substituting equation (2) into (1), we get
          $\lambda v = C v = \frac{1}{\ell} \sum_{j=1}^{\ell} x_j x_j^T v$           (3)
 Thus,
          $\lambda v = \frac{1}{\ell} \sum_{j=1}^{\ell} (x_j \cdot v)\, x_j$         (4)
    (where $x_j \cdot v$ is a scalar)

All solutions v with $\lambda \neq 0$ lie in the span of $x_1, x_2, \dots, x_\ell$, i.e.
          $v = \sum_{i=1}^{\ell} \alpha_i x_i$                                       (5)

                                                                42
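A small sketch illustrating equation (5): the principal axis obtained from the covariance matrix coincides with the axis reconstructed from the Gram matrix of dot products as v = Σᵢ αᵢ xᵢ. The sample is synthetic and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 4))
X -= X.mean(axis=0)                        # centred observations, sum_i x_i = 0
l = len(X)

# Route 1: diagonalise the covariance matrix C = (1/l) * sum_j x_j x_j'.
C = X.T @ X / l
_, V = np.linalg.eigh(C)
proj_cov = np.abs(X @ V[:, -1])            # |projection| onto the top principal axis

# Route 2: work only with dot products, i.e. the Gram matrix G_ij = <x_i, x_j>.
G = X @ X.T
_, A = np.linalg.eigh(G)
alpha = A[:, -1]                           # top eigenvector of G
v = X.T @ alpha                            # v = sum_i alpha_i x_i  (equation 5)
proj_gram = np.abs(X @ (v / np.linalg.norm(v)))

print(np.allclose(proj_cov, proj_gram))    # True: the two routes agree
```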
        Kernel PCA algorithm
   If we do PCA in feature space, the covariance matrix is
          $\bar{C} = \frac{1}{\ell} \sum_{j=1}^{\ell} \phi(x_j) \phi(x_j)^T$                        (6)
which can be diagonalized with nonnegative eigenvalues satisfying
          $\lambda V = \bar{C} V$                                                                   (7)
We have shown that V lies in the span of the $\phi(x_i)$, so with $V = \sum_i \alpha_i \phi(x_i)$ we have
          $\lambda \sum_i \alpha_i \phi(x_i) = \frac{1}{\ell} \sum_j \phi(x_j) \phi(x_j)^T \sum_i \alpha_i \phi(x_i)
            = \frac{1}{\ell} \sum_{i, j} \alpha_i\, \phi(x_j) \langle \phi(x_j), \phi(x_i) \rangle$   (8)

                                                                                                 43
Kernel PCA
   Applying the kernel trick, we have $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$:

          $\lambda \sum_i \alpha_i \phi(x_i) = \frac{1}{\ell} \sum_i \sum_j \alpha_i\, \phi(x_j)\, K(x_j, x_i)$   (9)

and we can finally write the expression as the eigenvalue problem

          $K \alpha = \ell \lambda \alpha$                                                          (10)

                                                              44
Kernel PCA algorithm outline
1.   Given a set of m-dimensional data $\{x_k\}$, calculate K,
     for example the Gaussian $K(x_i, x_j) = \exp(-\| x_i - x_j \|^2 / d)$.
2.   Carry out centering in feature space.
3.   Solve the eigenvalue problem $K \alpha = \ell \lambda \alpha$.
4.   For a test pattern x, extract a nonlinear component via
          $\langle V^k, \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k K(x_i, x)$        (11)

                                                      45
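A compact kernel PCA sketch following the outline above. It assumes the standard feature-space centering K ← K − 1K − K1 + 1K1 (step 2 is only named, not spelled out, in the slide), normalises α so the feature-space eigenvectors have unit length, and uses delta for the Gaussian width d; the data are illustrative.

```python
import numpy as np

def kernel_pca(X, n_components=2, delta=1.0):
    """Minimal kernel PCA with a Gaussian kernel K(xi, xj) = exp(-||xi - xj||^2 / delta)."""
    # 1. Gram matrix.
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / delta)
    # 2. Centre in feature space: K <- K - 1K - K1 + 1K1, with 1 = (1/l) * ones.
    l = len(X)
    one = np.full((l, l), 1.0 / l)
    Kc = K - one @ K - K @ one + one @ K @ one
    # 3. Solve the eigenvalue problem K*alpha = l*lambda*alpha.
    vals, vecs = np.linalg.eigh(Kc)
    vals, vecs = vals[::-1], vecs[:, ::-1]                     # sort descending
    # Normalise alpha so the feature-space eigenvectors V_k have unit norm.
    alphas = vecs[:, :n_components] / np.sqrt(np.maximum(vals[:n_components], 1e-12))
    # 4. Nonlinear components of the training points: <V_k, phi(x_i)> = sum_j alpha_jk K(x_i, x_j).
    return Kc @ alphas

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))
print(kernel_pca(X).shape)    # (50, 2)
```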
         Stability of Kernel Algorithms
Our objective for learning is to improve generalization performance:
cross-validation, Bayesian methods, generalization bounds, ...
  Call $E_S[f(x)] \approx 0$ a pattern in a sample S.
  Is this pattern also likely to be present in new data: $E_P[f(x)] \approx 0$?
We can use concentration inequalities (McDiarmid's theorem)
to prove that:
Theorem: Let $S = \{x_1, \dots, x_\ell\}$ be an IID sample from P and define
the sample mean of f(x) as $\bar{f} = \frac{1}{\ell} \sum_{i=1}^{\ell} f(x_i)$; then it follows that

          $P\!\left( \| \bar{f} - E_P[f] \| \leq \frac{R}{\sqrt{\ell}} \left( 2 + \sqrt{2 \ln \tfrac{1}{\delta}} \right) \right) \geq 1 - \delta, \qquad R = \sup_x \| f(x) \|$

(The probability that the sample mean and the population mean differ by less than this amount is more than $1 - \delta$, independent of P!)
                                                                                         46
         Rademacher Complexity
 Problem: we only checked the generalization performance for a
        single fixed pattern f(x).
        What if we want to search over a function class F?

 Intuition: we need to incorporate the complexity of this function class.
   Rademacher complexity captures the ability of the function class to
   fit random noise. ($\sigma_i = \pm 1$, uniformly distributed)

(empirical RC)
          $\hat{R}_\ell(F) = E_\sigma \!\left[ \sup_{f \in F} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \;\middle|\; x_1, \dots, x_\ell \right]$

          $R_\ell(F) = E_S E_\sigma \!\left[ \sup_{f \in F} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \right]$

[Figure: functions f1, f2 fitting random $\pm 1$ labels on the sample points $x_i$]

                                                                        47
           Generalization Bound
Theorem: Let f be a function in F which maps to [0,1] (e.g. loss functions).
Then, with probability at least $1 - \delta$ over random draws of size $\ell$,
every f satisfies:

          $E_P[f(x)] \leq \hat{E}_{\text{data}}[f(x)] + R_\ell(F) + \sqrt{\frac{\ln(2/\delta)}{2\ell}}$

                     $\leq \hat{E}_{\text{data}}[f(x)] + \hat{R}_\ell(F) + 3 \sqrt{\frac{\ln(2/\delta)}{2\ell}}$

  Relevance: the expected pattern $E[f] \approx 0$ will also be present in a new
             data set if the last two terms are small:
             - the complexity of the function class F is small
             - the number of training data is large
                                                                        48
       Linear Functions                      (in feature space)

  Consider the function class:
          $F_B = \{ f : x \mapsto \langle w, \Phi(x) \rangle, \; \| w \| \leq B \}$
          with $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$
  and a sample: $S = \{x_1, \dots, x_\ell\}$

  Then the empirical RC of $F_B$ is bounded by:
          $\hat{R}_\ell(F_B) \leq \frac{2B}{\ell} \sqrt{\mathrm{tr}(K)}$

Relevance: Since $\{ x \mapsto \sum_{i=1}^{\ell} \alpha_i k(x_i, x) : \alpha^T K \alpha \leq B^2 \} \subseteq F_B$, it follows that
if we control the norm $\alpha^T K \alpha = \| w \|^2$ in kernel algorithms, we control
the complexity of the function class (regularization).
                                                                       49
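A rough numerical illustration of the bound above for the linear kernel: a Monte Carlo estimate of the empirical Rademacher complexity of F_B stays below (2B/ℓ)√tr(K). The sample, the norm bound B, and the number of σ draws are assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
l = 200
X = rng.normal(size=(l, 3))
K = X @ X.T                               # linear-kernel Gram matrix on the sample
B = 1.0                                   # norm bound ||w|| <= B

# Upper bound on the empirical Rademacher complexity of F_B: (2B/l) * sqrt(tr(K)).
rc_bound = 2 * B / l * np.sqrt(np.trace(K))

# Monte Carlo estimate of (2/l) * E_sigma[ sup_{||w||<=B} | sum_i sigma_i <w, x_i> | ]
# = (2B/l) * E_sigma || sum_i sigma_i x_i ||, for comparison with the bound.
sigma = rng.choice([-1.0, 1.0], size=(2000, l))
estimate = 2 * B / l * np.linalg.norm(sigma @ X, axis=1).mean()

print(round(rc_bound, 3), round(estimate, 3))   # the estimate typically stays below the bound
```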
         Margin Bound (classification)
    Theorem: Choose c > 0 (the margin).
             F: f(x, y) = -y g(x), y = +1, -1
             S: $\{(x_1, y_1), \dots, (x_\ell, y_\ell)\}$ IID sample
             $\delta \in (0, 1)$: probability of violating the bound.

          $P_P[y \neq \mathrm{sign}(g(x))] \leq \frac{1}{\ell c} \sum_{i=1}^{\ell} \xi_i + \frac{4}{\ell c} \sqrt{\mathrm{tr}(K)} + 3 \sqrt{\frac{\ln(2/\delta)}{2\ell}}$

      (probability of misclassification)
          $\xi_i = (c - y_i g(x_i))_+$  (slack variable)
          $(f)_+ = f$ if $f \geq 0$, and 0 otherwise

Relevance: We can bound our classification error on new samples. Moreover, we have a
strategy to improve generalization: choose the margin c as large as possible such
that all samples are correctly classified: $\xi_i = 0$ (e.g. support vector machines).
                                                                                       50
Next Part
   Constructing Kernels
       Kernels for Text
            Vector space kernels
       Kernels for Structured Data
            Subsequences kernels
            Trie-based kernels
       Kernels from Generative Models
            P-kernels
            Fisher kernels

                                         51

				