Pattern Analysis

Document Sample

Overview of Kernel Methods

Steve Vincent

Adapted from John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis
Coordinate Transformation

- Planetary position in a two-dimensional orthogonal coordinate system

      X         Y
  0.8415    0.5403
  0.9093   -0.4161
  0.1411   -0.9900
 -0.7568   -0.6536
 -0.9589    0.2837
 -0.2794    0.9602
  0.6570    0.7539
  0.9894   -0.1455
  0.4121   -0.9111
 -0.5440   -0.8391

[Figure: plot of x vs y]
2
Coordinate Transformation

- Planetary position in a two-dimensional orthogonal coordinate system

      X         Y       X^2       Y^2
  0.8415    0.5403    0.7081    0.2919
  0.9093   -0.4161    0.8268    0.1731
  0.1411   -0.9900    0.0199    0.9801
 -0.7568   -0.6536    0.5727    0.4272
 -0.9589    0.2837    0.9195    0.0805
 -0.2794    0.9602    0.0781    0.9220
  0.6570    0.7539    0.4316    0.5684
  0.9894   -0.1455    0.9789    0.0212
  0.4121   -0.9111    0.1698    0.8301
 -0.5440   -0.8391    0.2959    0.7041

[Figure: plot of x vs y]
[Figure: plot of x^2 vs y^2]
3
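The following sketch is not part of the original slides; it simply reproduces the example numerically. The tabulated positions are $(\sin t, \cos t)$ for $t = 1, \ldots, 10$, so in the original coordinates the points lie on the unit circle, while after squaring each coordinate they satisfy the linear relation $X^2 + Y^2 = 1$.

```python
# Illustrative sketch (not from the slides): the table above is (sin t, cos t) for
# t = 1..10.  In the original coordinates the points lie on the unit circle; after
# squaring each coordinate they lie on a straight line.
import numpy as np

t = np.arange(1, 11)
x, y = np.sin(t), np.cos(t)                 # original coordinates
x2, y2 = x**2, y**2                         # transformed coordinates

print(np.column_stack([x, y, x2, y2]).round(4))
print("x^2 + y^2 =", (x2 + y2).round(4))    # all ones: a linear pattern in the new space
```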
Non-linear Kernel Classification

[Figure: the map $\phi(x)$ taking the data into a feature space]

If the data is not separable by a hyperplane…

… transform it to a feature space where it is!
4
Pattern Analysis Algorithm

- Computational efficiency: all algorithms need to be computationally efficient, and the degree of any polynomial involved should render the algorithm practical for large data sets.
- Robustness: able to handle noisy data and identify approximate patterns.
- Statistical stability: the output should not be sensitive to a particular dataset, only to the underlying source of the data.
5
Kernel Method

- Mapping into an embedding or feature space defined by the kernel.
- A learning algorithm for discovering linear patterns in that space.
- The learning algorithm must work in dual form:
  - Primal solution: computes the weight vector explicitly.
  - Dual solution: gives the solution as a linear combination of the training examples.
6
Kernel Methods: the mapping

[Figure: the map $\phi$ from the original space to the feature (vector) space]
7
Linear Regression

Given training data
$$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots, (x_\ell, y_\ell)\},$$
with points $x_i \in R^n$ and labels $y_i \in R$.

- Construct a linear function:
$$g(x) = \langle w, x \rangle = w'x = \sum_{i=1}^{n} w_i x_i$$
- This creates the pattern function:
$$f(x, y) = |y - g(x)| = |y - \langle w, x \rangle| \approx 0$$
8
1-d Regression

[Figure: one-dimensional regression, fitting $y$ against $x$ with the linear function $\langle w, x \rangle$]

9
Least Squares Approximation

- Want $g(x) \approx y$.
- Define the error $f(x, y) = |y - g(x)| = |\xi|$.
- Minimize the loss
$$L(g, S) = L(w, S) = \sum_{i=1}^{\ell} (y_i - g(x_i))^2 = \sum_{i=1}^{\ell} \xi_i^2 = \sum_{i=1}^{\ell} l((x_i, y_i), g)$$
10
Optimal Solution

- Want: $y \approx Xw$.
- Mathematical model:
$$\min_{w} L(w, S) = \|y - Xw\|^2 = (y - Xw)'(y - Xw)$$
- Optimality condition:
$$\frac{\partial L(w, S)}{\partial w} = -2X'y + 2X'Xw = 0$$
- The solution satisfies the normal equations $X'Xw = X'y$. Solving this $n \times n$ system is $O(n^3)$.
11
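As a concrete illustration, here is a minimal sketch (synthetic data, illustrative variable names) that solves the normal equations $X'Xw = X'y$ directly; the $n \times n$ solve is the $O(n^3)$ step mentioned above.

```python
# Minimal sketch: least squares via the normal equations X'Xw = X'y (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # l = 100 examples, n = 5 features
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)    # noisy labels

w = np.linalg.solve(X.T @ X, X.T @ y)          # O(n^3) solve of the n x n system
print(w.round(3), w_true.round(3))             # recovered weights vs. true weights
```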
Ridge Regression

- The inverse typically does not exist.
- Use the least-norm solution for fixed $\lambda > 0$.
- Regularized problem:
$$\min_{w} L_\lambda(w, S) = \lambda \|w\|^2 + \|y - Xw\|^2$$
- Optimality condition:
$$\frac{\partial L_\lambda(w, S)}{\partial w} = 2\lambda w - 2X'y + 2X'Xw = 0$$
$$(X'X + \lambda I_n)\, w = X'y \qquad \text{requires } O(n^3) \text{ operations}$$
12
Ridge Regression (cont)

- The inverse always exists for any $\lambda > 0$:
$$w = (X'X + \lambda I)^{-1} X'y$$
- Alternative representation:
$$(X'X + \lambda I)\, w = X'y \;\Rightarrow\; w = \lambda^{-1}(X'y - X'Xw) = \lambda^{-1} X'(y - Xw) = X'\alpha$$
$$\alpha = \lambda^{-1}(y - Xw) \;\Rightarrow\; \lambda\alpha = y - XX'\alpha \;\Rightarrow\; (XX' + \lambda I)\,\alpha = y$$
$$\alpha = (G + \lambda I)^{-1} y \qquad \text{where } G = XX'$$
Solving this $\ell \times \ell$ system is $O(\ell^3)$.

13
Dual Ridge Regression

- To predict a new point:
$$g(x) = \langle w, x \rangle = \sum_{i=1}^{\ell} \alpha_i \langle x_i, x \rangle = y'(G + \lambda I)^{-1} z,$$
where $z_i = \langle x_i, x \rangle$.
- Note that we need only compute $G$, the Gram matrix: $G = XX'$, $G_{ij} = \langle x_i, x_j \rangle$.

Ridge regression requires only inner products between data points.
14
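A minimal numerical sketch (synthetic data, illustrative names) comparing the primal solution $(X'X + \lambda I_n)w = X'y$ with the dual solution $\alpha = (G + \lambda I_\ell)^{-1} y$; both yield the same prediction $g(x) = \langle w, x \rangle = \sum_i \alpha_i \langle x_i, x \rangle$.

```python
# Sketch: primal vs. dual ridge regression give identical predictions (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                               # l = 50, n = 3
y = rng.normal(size=50)
lam = 0.5

w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)    # primal: (X'X + lam I_n) w = X'y
G = X @ X.T                                                # Gram matrix G = XX'
alpha = np.linalg.solve(G + lam * np.eye(50), y)           # dual: (G + lam I_l) alpha = y

x_new = rng.normal(size=3)
print(w @ x_new)                                           # primal prediction <w, x>
print(alpha @ (X @ x_new))                                 # dual prediction sum_i alpha_i <x_i, x>
```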
Efficiency

- Computing $w$ in primal ridge regression is $O(n^3)$; computing $\alpha$ in dual ridge regression is $O(\ell^3)$.
- To predict a new point $x$:
  - primal: $g(x) = \langle w, x \rangle = \sum_{i=1}^{n} w_i x_i$, which is $O(n)$
  - dual: $g(x) = \langle w, x \rangle = \sum_{i=1}^{\ell} \alpha_i \langle x_i, x \rangle = \sum_{i=1}^{\ell} \alpha_i \sum_{j=1}^{n} x_{ij} x_j$, which is $O(n\ell)$
- The dual is better if $n \gg \ell$.
15
Notes on Ridge Regression

- "Regularization" is key to addressing stability and generalization.
- Regularization lets the method work when $n \gg \ell$.
- The dual is more efficient when $n \gg \ell$.
- The dual requires only inner products of the data.
16
Linear Regression in Feature Space

Key idea: map the data to a higher-dimensional space (feature space) and perform linear regression in the embedded space.

Alternative form: $w = \sum_i \alpha_i x_i$

Embedding map:
$$\phi : x \in R^n \mapsto \phi(x) \in F \subseteq R^N, \qquad N \gg n$$
17
Nonlinear Regression in Feature Space

In the primal representation:
$$x = (a, b), \qquad \langle x, w \rangle = w_1 a + w_2 b$$
$$\phi(x) = (a^2, b^2, \sqrt{2}\,ab)$$
$$g(x) = \langle \phi(x), w \rangle_F = w_1 a^2 + w_2 b^2 + w_3 \sqrt{2}\,ab$$
18
Nonlinear Regression in Feature Space

In the dual representation:
$$g(x) = \langle \phi(x), w \rangle_F = \sum_{i=1}^{\ell} \alpha_i \langle \phi(x), \phi(x_i) \rangle = \sum_{i=1}^{\ell} \alpha_i K(x, x_i)$$
19
Kernel Methods: intuitive idea

- Find a mapping $\phi$ such that, in the new space, problem solving is easier (e.g. linear).
- The kernel represents the similarity between two objects (documents, terms, …), defined as the dot product in this new vector space.
- But the mapping is left implicit.
- This gives an easy generalization of many dot-product (or distance) based pattern recognition algorithms.
20
Derivation of Kernel

$$\langle \phi(u), \phi(v) \rangle = \langle (u_1^2, u_2^2, \sqrt{2}\,u_1 u_2), (v_1^2, v_2^2, \sqrt{2}\,v_1 v_2) \rangle$$
$$= u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = (u_1 v_1 + u_2 v_2)^2 = \langle u, v \rangle^2$$

Thus:
$$K(u, v) = \langle u, v \rangle^2$$
21
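A quick numerical check of this derivation (a sketch with arbitrary test vectors): the explicit map $\phi(u) = (u_1^2, u_2^2, \sqrt{2}\,u_1 u_2)$ realises the kernel $K(u, v) = \langle u, v \rangle^2$.

```python
# Sketch: verifying <phi(u), phi(v)> = <u, v>^2 for phi(u) = (u1^2, u2^2, sqrt(2) u1 u2).
import numpy as np

def phi(u):
    return np.array([u[0]**2, u[1]**2, np.sqrt(2) * u[0] * u[1]])

u, v = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(phi(u) @ phi(v))     # inner product in feature space
print((u @ v) ** 2)        # kernel computed directly in input space: same value
```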
Kernel Function

- A kernel is a function $K$ such that
$$K(x, u) = \langle \phi(x), \phi(u) \rangle_F,$$
where $\phi$ is a mapping from the input space to a feature space $F$.
- There are many possible kernels. The simplest is the linear kernel:
$$K(x, u) = \langle x, u \rangle$$
22
Kernel: more formal definition

- A kernel $k(x, y)$
  - is a similarity measure
  - defined by an implicit mapping $\phi$
  - from the original space to a vector space (the feature space)
  - such that $k(x, y) = \langle \phi(x), \phi(y) \rangle$.
- This similarity measure and the mapping include:
  - invariance or other a priori knowledge
  - simpler structure (linear representation of the data)
  - the class of functions the solution is taken from
  - possibly infinite dimension (hypothesis space for learning)
  - … but still computational efficiency when computing $k(x, y)$.
23
Kernelization

Replace $x \cdot y$ by $K(x, y)$, where $K : X \times X \to R$ is such that:

(i) $K(x, y) = K(y, x)$ (symmetry);
(ii) for any square-integrable ($L_2(X)$) function $f$,
$$\iint K(x, y)\, f(x)\, f(y)\, dx\, dy \ge 0 \qquad \text{(positive definiteness)}.$$

- Such a $K$ is called a Mercer kernel.
- Kernels were introduced in mathematics to solve integral equations.
- Kernels measure the similarity of inputs.
24
Spaces

- A Hilbert space is a generalization of finite-dimensional vector spaces with an inner product to possibly infinite dimension.
  - Most interesting infinite-dimensional vector spaces are function spaces.
  - Hilbert spaces are the simplest among such spaces.
  - Prime example: $L^2$ (the square-integrable functions).
- Any continuous linear functional on a Hilbert space is given by an inner product with a vector (Riesz representation theorem).
- A representation of a vector with respect to a fixed basis is called a Fourier expansion.
25
Making Kernels

- The kernel function must be symmetric:
$$K(x, z) = \langle \phi(x), \phi(z) \rangle = \langle \phi(z), \phi(x) \rangle = K(z, x)$$
- And it must satisfy the inequalities that follow from the Cauchy-Schwarz inequality:
$$K(x, z)^2 = \langle \phi(x), \phi(z) \rangle^2 \le \|\phi(x)\|^2 \|\phi(z)\|^2 = \langle \phi(x), \phi(x) \rangle \langle \phi(z), \phi(z) \rangle = K(x, x)\, K(z, z)$$

26
The Kernel Gram Matrix

- With kernel-method-based learning, the sole information used from the training data set is the kernel Gram matrix:
$$K_{\text{training}} = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_m) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_m) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_m, x_1) & k(x_m, x_2) & \cdots & k(x_m, x_m) \end{pmatrix}$$
- If the kernel is valid, $K$ is symmetric positive semi-definite (all eigenvalues are non-negative).
27
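As a sketch (the Gaussian kernel and random data are chosen purely for illustration), the Gram matrix can be built explicitly and checked for symmetry and non-negative eigenvalues:

```python
# Sketch: building a Gram matrix for a Gaussian kernel and checking that it is
# symmetric and positive semi-definite (eigenvalues >= 0 up to numerical error).
import numpy as np

def k(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z)**2 / (2 * sigma**2))

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))                           # m = 20 training points
K = np.array([[k(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))                             # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)           # no (significantly) negative eigenvalue
```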
Mercer’s Theorem

- Suppose $X$ is compact (always true for finite examples).
- Suppose $K$ is a Mercer kernel.
- Then it can be expanded, using the eigenvalues and eigenfunctions of $K$, as
$$K(x, y) = \sum_{i=1}^{\infty} \lambda_i \psi_i(x)\, \psi_i(y).$$
- Now, using the eigenfunctions and their span, we can find a Hilbert space $H$ and a map $\Phi : X \to H$ such that the inner product in $H$ is given by $K$, that is,
$$\langle \Phi(x), \Phi(y) \rangle = K(x, y).$$
- $H$ is called a Reproducing Kernel Hilbert Space (RKHS).
28
Characterization of Kernels

- Prove that $K$ is a kernel function: since $K$ is symmetric, $K = V \Lambda V'$ where $V$ is an orthogonal matrix,
$$\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}, \qquad V = (v_1\, v_2\, v_3 \cdots v_n), \qquad K v_t = \lambda_t v_t, \qquad v_t = (v_{ti})_{i=1}^{n}.$$
- Let
$$\phi : x_i \mapsto \left( \sqrt{\lambda_t}\, v_{ti} \right)_{t=1}^{n} \in R^n, \qquad i = 1, \ldots, n.$$
29
Characterization of Kernels

- Then for any $x_i, x_j$:
$$\phi(x_i) \cdot \phi(x_j) = \sum_{t=1}^{n} \lambda_t v_{ti} v_{tj} = (V \Lambda V')_{ij} = K_{ij} = K(x_i, x_j)$$
- (Positive semi-definiteness.) Suppose there exists $\lambda_s < 0$ with eigenvector $v_s$, and consider the point
$$z = \sum_{i=1}^{n} v_{si}\, \phi(x_i) = \sqrt{\Lambda}\, V' v_s.$$
Then
$$z \cdot z = v_s' V \sqrt{\Lambda}\, \sqrt{\Lambda}\, V' v_s = v_s' V \Lambda V' v_s = v_s' K v_s = \lambda_s < 0,$$
which contradicts $z \cdot z \ge 0$, so all eigenvalues must be non-negative.
30
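A small numerical sketch of this construction (the kernel and data are chosen for illustration): take the eigen-decomposition $K = V \Lambda V'$, set $\phi(x_i) = (\sqrt{\lambda_t}\, v_{ti})_t$, and check that the dot products reproduce $K$.

```python
# Sketch: the finite-dimensional feature map phi(x_i) = (sqrt(lambda_t) v_ti)_t built from
# the eigen-decomposition of a kernel matrix reproduces that matrix exactly.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 3))
K = (X @ X.T + 1.0) ** 2                    # a valid (polynomial) kernel matrix

lam, V = np.linalg.eigh(K)                  # K = V diag(lam) V'
Phi = V * np.sqrt(np.clip(lam, 0, None))    # row i of Phi is phi(x_i)

print(np.allclose(Phi @ Phi.T, K))          # phi(x_i) . phi(x_j) == K_ij
```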
Reproducing Kernel Hilbert Spaces

- [1] The Hilbert space $L_2$ is too "big" for our purpose, containing too many non-smooth functions. One approach to obtaining restricted, smooth spaces is the RKHS approach.
- An RKHS is "smaller" than a general Hilbert space.
- Define the reproducing kernel map $x \mapsto k(\cdot, x)$ (to each $x$ we associate the function $k(\cdot, x)$).
31
Characterization of Kernels

- We now define an inner product.
- Construct a vector space containing all linear combinations of the functions $k(\cdot, x)$; this will be our RKHS.
- Let $f(\cdot) = \sum_{i} \alpha_i k(\cdot, x_i)$ and $g(\cdot) = \sum_{j} \beta_j k(\cdot, x'_j)$, and define
$$\langle f, g \rangle = \sum_{i} \sum_{j} \alpha_i \beta_j\, k(x_i, x'_j).$$
- We prove that this is an inner product on the RKHS.
32
Characterization of Kernels

- Symmetry is obvious, and linearity is easy to show.
- It remains to prove that $\langle f, f \rangle = 0 \Rightarrow f = 0$.
- By the definition of the inner product,
$$\langle k(\cdot, x), f \rangle = \sum_{i} \alpha_i k(x_i, x) = f(x),$$
so we say $k$ is the representer of evaluation [2] (the reproducing property).
- By the above and Cauchy-Schwarz,
$$f(x)^2 = \langle k(\cdot, x), f \rangle^2 \le \|k(\cdot, x)\|^2\, \|f\|^2 = \langle k(\cdot, x), k(\cdot, x) \rangle\, \langle f, f \rangle = k(x, x)\, \langle f, f \rangle.$$
- Hence $\langle f, f \rangle = 0 \Rightarrow f = 0$, so $\langle \cdot, \cdot \rangle$ is an inner product, and this is our RKHS.
33
Characterization of Kernels

- Formal definition: for a compact subset $X$ of $R^d$ and a Hilbert space $H$ of functions $f : X \to R$, we say that $H$ is a reproducing kernel Hilbert space if there exists $k : X^2 \to R$ such that:
  1. $k$ has the reproducing property: $\langle k(\cdot, x), f \rangle = f(x)$.
  2. $k$ spans $H$: $\mathrm{span}\{k(\cdot, x) : x \in X\} = H$.
34
Popular Kernels based on vectors

By Hilbert-Schmidt kernels (Courant and Hilbert 1953):
$$\langle \phi(u), \phi(v) \rangle = K(u, v)$$
for certain $\phi$ and $K$, e.g.:

Kernel                              $K(u, v)$
Degree $d$ polynomial               $(\langle u, v \rangle + 1)^d$
Radial basis function machine       $\exp(-\|u - v\|^2 / (2\sigma^2))$
Two-layer neural network            $\mathrm{sigmoid}(\langle u, v \rangle + c)$
35
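The table can be written directly as code. The sketch below is illustrative only: the parameter values ($d$, $\sigma$, $c$) and the choice of $\tanh$ as the sigmoid are assumptions, not part of the slides.

```python
# Sketch: the three kernels from the table as plain functions (parameters are illustrative).
import numpy as np

def polynomial(u, v, d=3):
    return (u @ v + 1.0) ** d                                  # (<u,v> + 1)^d

def rbf(u, v, sigma=1.0):
    return np.exp(-np.linalg.norm(u - v)**2 / (2 * sigma**2))  # exp(-||u-v||^2 / 2 sigma^2)

def sigmoid_kernel(u, v, c=-1.0):
    return np.tanh(u @ v + c)                                  # sigmoid(<u,v> + c), tanh variant

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial(u, v), rbf(u, v), sigmoid_kernel(u, v))
```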
Examples of Kernels

[Figure: polynomial kernel (n = 2) and RBF kernel (n = 2)]
36
How to build new kernels

- Kernel combinations preserving validity:
$$K(x, y) = \lambda K_1(x, y) + (1 - \lambda) K_2(x, y), \qquad 0 \le \lambda \le 1$$
$$K(x, y) = a\, K_1(x, y), \qquad a > 0$$
$$K(x, y) = K_1(x, y)\, K_2(x, y)$$
$$K(x, y) = f(x)\, f(y), \qquad f \text{ a real-valued function}$$
$$K(x, y) = K_3(\phi(x), \phi(y))$$
$$K(x, y) = x' P y, \qquad P \text{ symmetric positive definite}$$
$$K(x, y) = \frac{K_1(x, y)}{\sqrt{K_1(x, x)\, K_1(y, y)}}$$
37
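A numerical sketch of a few of these closure rules; the base kernels, sample, and the eigenvalue check are illustrative only (a check on one random sample is not a proof of validity).

```python
# Sketch: combining two base kernels with some of the rules above and checking, on a
# random sample, that the resulting Gram matrices stay positive semi-definite.
import numpy as np

def k1(x, y): return (x @ y + 1.0) ** 2
def k2(x, y): return np.exp(-np.linalg.norm(x - y)**2)

def convex_mix(x, y, lam=0.3): return lam * k1(x, y) + (1 - lam) * k2(x, y)
def product(x, y):             return k1(x, y) * k2(x, y)
def normalised(x, y):          return k1(x, y) / np.sqrt(k1(x, x) * k1(y, y))

rng = np.random.default_rng(4)
X = rng.normal(size=(15, 3))
for k in (convex_mix, product, normalised):
    K = np.array([[k(a, b) for b in X] for a in X])
    print(k.__name__, np.linalg.eigvalsh(K).min() >= -1e-9)   # smallest eigenvalue ~ >= 0
```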
Important Points

- Kernel method = linear method + embedding in feature space.
- Kernel functions are used to do the embedding efficiently.
- The feature space is a higher-dimensional space, so we must regularize.
- Choose a kernel appropriate to the domain.
38
Principal Component Analysis (PCA)

- Subtract the mean (centers the data).
- Compute the covariance matrix, $S$.
- Compute the eigenvectors of $S$, sort them according to their eigenvalues, and keep the first $M$ vectors.
- Project the data points onto those vectors.

- Also called the Karhunen-Loeve transformation.
39
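A minimal numpy sketch of the steps listed above (the data and the choice $M = 2$ are illustrative):

```python
# Sketch of the PCA steps above: centre, covariance, eigenvectors, project.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # correlated synthetic data

Xc = X - X.mean(axis=0)                  # 1. subtract the mean
S = np.cov(Xc, rowvar=False)             # 2. covariance matrix
lam, U = np.linalg.eigh(S)               # 3. eigenvectors of S (eigenvalues ascending)
order = np.argsort(lam)[::-1]            #    sort by decreasing eigenvalue
M = 2
W = U[:, order[:M]]                      #    keep the M leading eigenvectors
Z = Xc @ W                               # 4. project the data onto them

print(Z.shape)                           # (200, 2)
```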
Kernel PCA

- Principal Component Analysis (PCA) is one of the fundamental techniques in a wide range of areas.
- Simply stated, PCA diagonalizes (or finds the singular value decomposition (SVD) of) the covariance matrix.
- Equivalently, we may find the SVD of the data matrix.
- Instead of PCA in the original input space, we may perform PCA in the feature space. This is called Kernel PCA.
- Find the eigenvalues and eigenvectors of the Gram matrix.
  - The Gram matrix is the $n \times n$ matrix $K$ whose $ij$-th entry is $K(x_i, x_j)$.
- For many applications, we need online algorithms, i.e., algorithms that do not need to store the Gram matrix.
40
PCA in dot-product form

- Assume we have centered observations (column vectors) $x_i$; centered means $\sum_{i=1}^{\ell} x_i = 0$.
- PCA finds the principal axes by diagonalizing the covariance matrix $C$ with the singular value decomposition:
$$\lambda v = C v \qquad (1)$$
where $\lambda$ is an eigenvalue, $v$ the corresponding eigenvector, and the covariance matrix is
$$C = \frac{1}{\ell} \sum_{j=1}^{\ell} x_j x_j^T. \qquad (2)$$
41
PCA in dot-product form

Substituting equation (2) into (1), we get
$$\lambda v = C v = \frac{1}{\ell} \sum_{j=1}^{\ell} x_j x_j^T v. \qquad (3)$$
Thus, since $x_j^T v$ is a scalar,
$$\lambda v = \frac{1}{\ell} \sum_{j=1}^{\ell} x_j (x_j^T v) = \frac{1}{\ell} \sum_{j=1}^{\ell} (x_j \cdot v)\, x_j. \qquad (4)$$
All solutions $v$ with $\lambda \ne 0$ lie in the span of $x_1, x_2, \ldots, x_\ell$, i.e.
$$v = \sum_{i} \alpha_i x_i. \qquad (5)$$
42
Kernel PCA algorithm

- If we do PCA in feature space, the covariance matrix is
$$C = \frac{1}{\ell} \sum_{j=1}^{\ell} \phi(x_j)\, \phi(x_j)^T, \qquad (6)$$
which can be diagonalized with non-negative eigenvalues satisfying
$$\lambda V = C V. \qquad (7)$$
We have shown that $V$ lies in the span of the $\phi(x_i)$, so writing $V = \sum_i \alpha_i \phi(x_i)$ we have
$$\lambda \sum_{i} \alpha_i \phi(x_i) = \frac{1}{\ell} \sum_{j} \phi(x_j)\, \phi(x_j)^T \sum_{i} \alpha_i \phi(x_i) = \frac{1}{\ell} \sum_{i,j} \alpha_i\, \phi(x_j)\, \langle \phi(x_j), \phi(x_i) \rangle. \qquad (8)$$
43
Kernel PCA

- Applying the kernel trick, $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, equation (8) becomes
$$\lambda \sum_{i} \alpha_i \phi(x_i) = \frac{1}{\ell} \sum_{i,j} \alpha_i\, \phi(x_j)\, K(x_j, x_i). \qquad (9)$$
Taking inner products with $\phi(x_k)$ on both sides, we can finally write the expression as the eigenvalue problem
$$\ell \lambda\, \alpha = K \alpha. \qquad (10)$$
44
Kernel PCA algorithm outline

1. Given a set of m-dimensional data $\{x_k\}$, calculate $K$, for example the Gaussian $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / d)$.
2. Carry out centering in feature space.
3. Solve the eigenvalue problem $K\alpha = \ell\lambda\alpha$.
4. For a test pattern $x$, extract a nonlinear component via
$$\langle V^k, \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k\, K(x_i, x). \qquad (11)$$
45
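A compact sketch of this outline on synthetic data. The feature-space centering step uses the standard recipe $K \leftarrow K - \mathbf{1}K - K\mathbf{1} + \mathbf{1}K\mathbf{1}$ (with $\mathbf{1}$ the matrix of entries $1/\ell$), which the slides only mention as step 2; the kernel width and number of components are illustrative.

```python
# Sketch of the kernel PCA outline above with a Gaussian kernel (synthetic data).
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 2))
l, d = X.shape

sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)     # pairwise squared distances
K = np.exp(-sq / d)                                   # 1. Gaussian Gram matrix

one = np.full((l, l), 1.0 / l)
Kc = K - one @ K - K @ one + one @ K @ one            # 2. centring in feature space

lam, alpha = np.linalg.eigh(Kc)                       # 3. eigenvalue problem for K
order = np.argsort(lam)[::-1]
lam, alpha = lam[order], alpha[:, order]
alpha = alpha / np.sqrt(np.clip(lam, 1e-12, None))    #    normalise so that <V^k, V^k> = 1

proj = Kc @ alpha[:, :2]                              # 4. first two non-linear components
print(proj.shape)                                     # (60, 2)
```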
Stability of Kernel Algorithms

Our objective for learning is to improve generalization performance: cross-validation, Bayesian methods, generalization bounds, ...

Call $\hat{E}_S[f(x)] = 0$ a pattern in a sample $S$. Is this pattern also likely to be present in new data, $E_P[f(x)] \approx 0$?

We can use concentration inequalities (McDiarmid's theorem) to prove that:

Theorem: Let $S = \{x_1, \ldots, x_\ell\}$ be an IID sample from $P$ and define the sample mean of $f(x)$ as $\bar{f} = \frac{1}{\ell}\sum_{i=1}^{\ell} f(x_i)$. Then it follows that
$$P\left( \|\bar{f} - E_P[f]\| \le \frac{R}{\sqrt{\ell}}\left(2 + \sqrt{2\ln\tfrac{1}{\delta}}\right) \right) \ge 1 - \delta, \qquad \text{where } R = \sup_x \|f(x)\|.$$

(The probability that the sample mean and the population mean differ by less than this bound is more than $1 - \delta$, independent of $P$!)
46
Problem: we have only checked the generalization performance for a single fixed pattern $f(x)$. What if we want to search over a function class $F$?

Intuition: we need to incorporate the complexity of this function class. Rademacher complexity captures the ability of the function class to fit random noise ($\sigma_i = \pm 1$, uniformly distributed).

Empirical Rademacher complexity:
$$\hat{R}_\ell(F) = E_\sigma\left[ \sup_{f \in F} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \;\middle|\; x_1, \ldots, x_\ell \right]$$
Rademacher complexity:
$$R_\ell(F) = E_S E_\sigma\left[ \sup_{f \in F} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \right]$$

[Figure: two functions $f_1$, $f_2$ fitting random $\pm 1$ noise at the sample points $x_i$]

47
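The empirical Rademacher complexity above can be estimated by Monte Carlo for a small, fixed function class. The sketch below uses a toy class of twenty linear functions on a fixed sample; everything in it is illustrative and not part of the slides.

```python
# Sketch: Monte-Carlo estimate of R_hat(F) = E_sigma[ sup_{f in F} |(2/l) sum_i sigma_i f(x_i)| ]
# for a toy class F of 20 fixed linear functions on a fixed sample.
import numpy as np

rng = np.random.default_rng(7)
l = 100
X = rng.normal(size=(l, 3))                        # the fixed sample x_1, ..., x_l
W = rng.normal(size=(20, 3))                       # 20 candidate functions f(x) = <w, x>
F = X @ W.T                                        # F[i, j] = f_j(x_i)

trials = 2000
sigma = rng.choice([-1.0, 1.0], size=(trials, l))  # Rademacher variables
vals = np.abs(sigma @ F) * (2.0 / l)               # (2/l)|sum_i sigma_i f_j(x_i)| per trial and f_j
print(vals.max(axis=1).mean())                     # average over trials of the supremum over F
```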
Generalization Bound

Theorem: Let $f$ be a function in $F$ which maps to $[0, 1]$ (e.g. loss functions). Then, with probability at least $1 - \delta$ over random draws of size $\ell$, every $f$ satisfies
$$E_P[f(x)] \le \hat{E}_{\text{data}}[f(x)] + R_\ell(F) + \sqrt{\frac{\ln(2/\delta)}{2\ell}} \le \hat{E}_{\text{data}}[f(x)] + \hat{R}_\ell(F) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}.$$

Relevance: the expected pattern $E[f] = 0$ will also be present in a new data set if the last two terms are small:
- the complexity of the function class $F$ is small
- the number of training data is large
48
Linear Functions (in feature space)

Consider the function class
$$F_B = \{ f : x \mapsto \langle w, \phi(x) \rangle \;:\; \|w\| \le B \} \quad \text{with} \quad k(x, y) = \langle \phi(x), \phi(y) \rangle$$
and a sample $S = \{x_1, \ldots, x_\ell\}$.

Then the empirical Rademacher complexity of $F_B$ is bounded by
$$\hat{R}_\ell(F_B) \le \frac{2B}{\ell} \sqrt{\mathrm{tr}(K)}.$$

Relevance: since $\left\{ x \mapsto \sum_{i=1}^{\ell} \alpha_i k(x_i, x) \;:\; \alpha^T K \alpha \le B^2 \right\} \subseteq F_B$, it follows that if we control the norm $\alpha^T K \alpha = \|w\|^2$ in kernel algorithms, we control the complexity of the function class (regularization).
49
Margin Bound (classification)

Theorem: Choose $c > 0$ (the margin). Let $F : f(x, y) = -y\, g(x)$ with $y = \pm 1$, let $S = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$ be an IID sample, and let $\delta \in (0, 1)$ be the probability of violating the bound. Then the probability of misclassification satisfies
$$P_P[y \ne \mathrm{sign}(g(x))] \le \frac{1}{c\,\ell} \sum_{i=1}^{\ell} \xi_i + \frac{4}{c\,\ell}\sqrt{\mathrm{tr}(K)} + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}},$$
where
$$\xi_i = (c - y_i g(x_i))_+ \quad \text{(slack variable)}, \qquad (f)_+ = f \text{ if } f \ge 0 \text{ and } 0 \text{ otherwise.}$$

Relevance: we can bound our classification error on new samples. Moreover, we have a strategy to improve generalization: choose the margin $c$ as large as possible such that all samples are correctly classified, $\xi_i = 0$ (e.g. support vector machines).
50
Next Part

- Constructing Kernels
- Kernels for Text
  - Vector space kernels
- Kernels for Structured Data
  - Subsequence kernels
  - Trie-based kernels
- Kernels from Generative Models
  - P-kernels
  - Fisher kernels
51
