Overview of Kernel Methods
Steve Vincent
Adapted from John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis

Coordinate Transformation
Planetary positions in a two-dimensional orthogonal coordinate system. In the plot of x vs y the points lie on a circle; in the plot of x^2 vs y^2 the same points lie on a straight line:

  X        Y        X^2      Y^2
  0.8415   0.5403   0.7081   0.2919
  0.9093  -0.4161   0.8268   0.1731
  0.1411  -0.9900   0.0199   0.9801
 -0.7568  -0.6536   0.5727   0.4272
 -0.9589   0.2837   0.9195   0.0805
 -0.2794   0.9602   0.0781   0.9220
  0.6570   0.7539   0.4316   0.5684
  0.9894  -0.1455   0.9789   0.0212
  0.4121  -0.9111   0.1698   0.8301
 -0.5440  -0.8391   0.2959   0.7041

Non-linear Kernel Classification
If the data is not separable by a hyperplane ... transform it via $\phi(x)$ to a feature space where it is!

Pattern Analysis Algorithm
- Computational efficiency: all algorithms must be computationally efficient, and the degree of any polynomial involved should keep the algorithm practical for large data sets.
- Robustness: able to handle noisy data and identify approximate patterns.
- Statistical stability: the output should not be sensitive to a particular dataset, only to the underlying source of the data.

Kernel Method
- Mapping into an embedding or feature space defined by the kernel.
- Learning algorithm for discovering linear patterns in that space.
- The learning algorithm must work in dual space.
- Primal solution: computes the weight vector explicitly.
- Dual solution: gives the solution as a linear combination of the training examples.

Kernel Methods: the mapping
(Figure: the map $\phi$ takes the Original Space to the Feature (Vector) Space.)

Linear Regression
Given training data
$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots, (x_\ell, y_\ell)\}$
with points $x_i \in R^n$ and labels $y_i \in R$, construct the linear function
$g(x) = \langle w, x \rangle = w'x = \sum_{i=1}^{n} w_i x_i$.
This creates the pattern function
$f(x, y) = y - g(x) = y - \langle w, x \rangle \approx 0$.

1-d Regression
(Figure: one-dimensional data $(x, y)$ with the fitted line $g(x) = \langle w, x \rangle$.)

Least Squares Approximation
Want $g(x) \approx y$. Define the error $\xi = f(x, y) = y - g(x)$. Minimize the loss
$L(g, S) = L(w, S) = \sum_{i=1}^{\ell} (y_i - g(x_i))^2 = \sum_{i=1}^{\ell} \xi_i^2 = \sum_{i=1}^{\ell} l((x_i, y_i), g)$.

Optimal Solution
Want $y \approx Xw$. Mathematical model:
$\min_w L(w, S) = \|y - Xw\|^2 = (y - Xw)'(y - Xw)$.
Optimality condition:
$\frac{\partial L(w, S)}{\partial w} = -2X'y + 2X'Xw = 0$.
The solution satisfies $X'Xw = X'y$. Solving this $n \times n$ system is $O(n^3)$.

Ridge Regression
The inverse typically does not exist. Use the least-norm solution for a fixed $\lambda > 0$. Regularized problem:
$\min_w L_\lambda(w, S) = \lambda \|w\|^2 + \|y - Xw\|^2$.
Optimality condition:
$\frac{\partial L_\lambda(w, S)}{\partial w} = 2\lambda w - 2X'y + 2X'Xw = 0$,
so $(X'X + \lambda I_n)\, w = X'y$. This requires $O(n^3)$ operations.
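To make the primal ridge regression solution concrete, the following NumPy sketch solves $(X'X + \lambda I_n) w = X'y$ on a small synthetic data set. The data, the value of the regularization parameter, and the function name are illustrative assumptions, not part of the original slides; the cost of the solve is the $O(n^3)$ mentioned above.

```python
import numpy as np

def primal_ridge(X, y, lam):
    """Solve (X'X + lam*I) w = X'y for the weight vector w (primal ridge regression)."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Toy example (assumption): l = 50 points in n = 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

w = primal_ridge(X, y, lam=0.1)
print(w)             # close to w_true
print((X @ w)[:5])   # predictions g(x_i) = <w, x_i>
```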
Ridge Regression (cont)
The inverse always exists for any $\lambda > 0$:
$w = (X'X + \lambda I_n)^{-1} X'y$.
Alternative (dual) representation:
$(X'X + \lambda I_n)\, w = X'y \;\Rightarrow\; w = \lambda^{-1}(X'y - X'Xw) = \lambda^{-1} X'(y - Xw) = X'\alpha$, with $\alpha = \lambda^{-1}(y - Xw)$.
Then $\lambda\alpha = y - Xw = y - XX'\alpha$, so $(XX' + \lambda I_\ell)\alpha = y$ and
$\alpha = (G + \lambda I_\ell)^{-1} y$, where $G = XX'$.
Solving this $\ell \times \ell$ system is $O(\ell^3)$.

Dual Ridge Regression
To predict a new point $x$:
$g(x) = \langle w, x \rangle = \sum_{i=1}^{\ell} \alpha_i \langle x_i, x \rangle = y'(G + \lambda I_\ell)^{-1} z$, where $z_i = \langle x_i, x \rangle$.
Note that we need only compute $G$, the Gram matrix: $G = XX'$, $G_{ij} = \langle x_i, x_j \rangle$.
Ridge regression requires only inner products between data points.

Efficiency
To compute $w$: primal ridge regression is $O(n^3)$; dual ridge regression is $O(\ell^3)$.
To predict a new point $x$:
- primal: $g(x) = \langle w, x \rangle = \sum_{i=1}^{n} w_i x_i$, which is $O(n)$;
- dual: $g(x) = \sum_{i=1}^{\ell} \alpha_i \langle x_i, x \rangle = \sum_{i=1}^{\ell} \alpha_i \sum_{j=1}^{n} x_{ij} x_j$, which is $O(n\ell)$.
The dual is better if $n \gg \ell$.

Notes on Ridge Regression
- Regularization is key to addressing statistical stability.
- Regularization lets the method work when $n \gg \ell$.
- The dual is more efficient when $n \gg \ell$.
- The dual only requires inner products of the data.

Linear Regression in Feature Space
Key idea: map the data to a higher-dimensional space (feature space) and perform linear regression in the embedded space.
Alternative form: $w = \sum_i \alpha_i \phi(x_i)$.
Embedding map: $\phi: x \in R^n \mapsto \phi(x) \in F \subseteq R^N$, with $N \gg n$.

Nonlinear Regression in Feature Space
In the primal representation: for $x = (a, b)$, $\langle x, w \rangle = w_1 a + w_2 b$.
With $\phi(x) = (a^2, b^2, \sqrt{2}\,ab)$,
$g(x) = \langle \phi(x), w \rangle_F = w_1 a^2 + w_2 b^2 + w_3 \sqrt{2}\,ab$.

Nonlinear Regression in Feature Space
In the dual representation:
$g(x) = \langle \phi(x), w \rangle_F = \sum_{i=1}^{\ell} \alpha_i \langle \phi(x), \phi(x_i) \rangle = \sum_{i=1}^{\ell} \alpha_i K(x, x_i)$.

Kernel Methods: intuitive idea
- Find a mapping $\phi$ such that, in the new space, problem solving is easier (e.g. linear).
- The kernel represents the similarity between two objects (documents, terms, ...), defined as the dot product in this new vector space.
- But the mapping is left implicit.
- Easy generalization of many dot-product (or distance) based pattern recognition algorithms.

Derivation of Kernel
$\langle \phi(u), \phi(v) \rangle = \langle (u_1^2, u_2^2, \sqrt{2}\,u_1 u_2), (v_1^2, v_2^2, \sqrt{2}\,v_1 v_2) \rangle = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = (u_1 v_1 + u_2 v_2)^2 = \langle u, v \rangle^2$.
Thus: $K(u, v) = \langle u, v \rangle^2$.

Kernel Function
A kernel is a function $K$ such that $K(x, u) = \langle \phi(x), \phi(u) \rangle_F$, where $\phi$ is a mapping from the input space to a feature space $F$. There are many possible kernels. The simplest is the linear kernel: $K(x, u) = \langle x, u \rangle$.

Kernel: more formal definition
A kernel $k(x, y)$ is a similarity measure defined by an implicit mapping $\phi$ from the original space to a vector space (feature space) such that $k(x, y) = \phi(x) \cdot \phi(y)$.
This similarity measure and the mapping can include:
- invariance or other a priori knowledge;
- simpler structure (a linear representation of the data);
- the class of functions the solution is taken from;
- possibly infinite dimension (hypothesis space for learning);
- ... but still computational efficiency when computing $k(x, y)$.

Kernelization
Replace $\langle x, y \rangle$ by $K(x, y)$, where $K: X \times X \to R$ is such that
(i) $K(x, y) = K(y, x)$ (symmetry);
(ii) for any square-integrable ($L^2(X)$) function $f$, $\iint K(x, y) f(x) f(y)\, dx\, dy \ge 0$ (positive definiteness).
Such a $K$ is called a Mercer kernel. Kernels were introduced in mathematics to solve integral equations. Kernels measure the similarity of inputs.

Brief Comments on Hilbert Spaces
- A Hilbert space is a generalization of finite-dimensional vector spaces with an inner product to possibly infinite dimension.
- Most interesting infinite-dimensional vector spaces are function spaces; Hilbert spaces are the simplest among such spaces. Prime example: $L^2$ (the square-integrable functions).
- Any continuous linear functional on a Hilbert space is given by an inner product with a vector (Riesz Representation Theorem).
- A representation of a vector with respect to a fixed basis is called a Fourier expansion.
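Tying together the dual ridge regression slides and the polynomial-kernel derivation above, here is a minimal sketch of kernel ridge regression: it forms the Gram matrix $G_{ij} = K(x_i, x_j)$, solves $(G + \lambda I_\ell)\alpha = y$, and predicts with $g(x) = \sum_i \alpha_i K(x_i, x)$. It uses the degree-2 polynomial kernel $K(u, v) = \langle u, v \rangle^2$ from the derivation; the data, parameter values, and function names are assumptions for illustration.

```python
import numpy as np

def poly2_kernel(A, B):
    """Degree-2 polynomial kernel K(u, v) = <u, v>**2, evaluated for all pairs of rows."""
    return (A @ B.T) ** 2

def fit_dual_ridge(X, y, lam, kernel):
    """Solve (G + lam*I) alpha = y, where G is the Gram matrix of the training data."""
    G = kernel(X, X)
    l = X.shape[0]
    return np.linalg.solve(G + lam * np.eye(l), y)

def predict_dual(X_train, alpha, X_new, kernel):
    """g(x) = sum_i alpha_i K(x_i, x) for each new point x."""
    return kernel(X_new, X_train) @ alpha

# Toy data (assumption): y depends quadratically on x, so a linear fit fails
# but the degree-2 polynomial kernel captures it exactly.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X[:, 0] ** 2 + 2.0 * X[:, 0] * X[:, 1]

alpha = fit_dual_ridge(X, y, lam=1e-3, kernel=poly2_kernel)
print(predict_dual(X, alpha, X[:5], kernel=poly2_kernel))  # approx. y[:5]
```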
Making Kernels
The kernel function must be symmetric,
$K(x, z) = \langle \phi(x), \phi(z) \rangle = \langle \phi(z), \phi(x) \rangle = K(z, x)$,
and satisfy the inequalities that follow from the Cauchy-Schwarz inequality:
$K(x, z)^2 = \langle \phi(x), \phi(z) \rangle^2 \le \|\phi(x)\|^2 \|\phi(z)\|^2 = \langle \phi(x), \phi(x) \rangle \langle \phi(z), \phi(z) \rangle = K(x, x)\, K(z, z)$.

The Kernel Gram Matrix
With kernel-method-based learning, the sole information used from the training data set is the kernel Gram matrix
$K_{\text{training}} = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_m) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_m) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_m, x_1) & k(x_m, x_2) & \cdots & k(x_m, x_m) \end{pmatrix}$.
If the kernel is valid, $K$ is symmetric positive semi-definite (all eigenvalues are non-negative).

Mercer's Theorem
Suppose $X$ is compact (always true for finite examples) and $K$ is a Mercer kernel. Then $K$ can be expanded, using the eigenvalues and eigenfunctions of $K$, as
$K(x, y) = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(y)$.
Now, using the eigenfunctions and their span, find a Hilbert space $H$ and a map $\Phi: X \to H$ such that the inner product in $H$ is given by $K$, that is, $\langle \Phi(x), \Phi(y) \rangle = K(x, y)$.
$H$ is called a Reproducing Kernel Hilbert Space (RKHS).

Characterization of Kernels
Prove: a (kernel function) matrix $K$ is symmetric, so $K = V \Lambda V'$ with $V$ an orthogonal matrix,
$V = (v_1\, v_2\, \cdots\, v_n)$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, $K v_t = \lambda_t v_t$, $v_t = (v_{ti})_{i=1}^{n}$.
Let $\phi: x_i \mapsto (\sqrt{\lambda_t}\, v_{ti})_{t=1}^{n}$, $i = 1, \ldots, n$.

Characterization of Kernels
Then for any $x_i, x_j$,
$\phi(x_i) \cdot \phi(x_j) = \sum_{t=1}^{n} \lambda_t v_{ti} v_{tj} = (V \Lambda V')_{ij} = K_{ij} = K(x_i, x_j)$.
(Positive semi-definiteness.) Suppose there exists $\lambda_s < 0$ with eigenvector $v_s$. Consider the point
$z = \sum_{i=1}^{n} v_{si}\, \phi(x_i) = \sqrt{\Lambda}\, V' v_s$.
Then $z \cdot z = v_s' V \sqrt{\Lambda} \sqrt{\Lambda} V' v_s = v_s' V \Lambda V' v_s = v_s' K v_s = \lambda_s < 0$, which is impossible for a squared norm, so all eigenvalues must be non-negative.

Reproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces (RKHS) [1]: the Hilbert space $L^2$ is too "big" for our purpose, containing too many non-smooth functions. One approach to obtaining restricted, smooth spaces is the RKHS approach. An RKHS is "smaller" than a general Hilbert space. Define the reproducing kernel map: to each $x$ we associate the function $k(\cdot, x)$.

Characterization of Kernels
We now define an inner product. Construct a vector space containing all linear combinations of the functions $k(\cdot, x)$:
$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i)$.
(This will be our RKHS.) Let $g(\cdot) = \sum_{j=1}^{m'} \beta_j k(\cdot, x_j')$ and define
$\langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x_j')$.
Prove that this is an inner product on the RKHS.

Characterization of Kernels
Symmetry is obvious, and linearity is easy to show. It remains to prove that $\langle f, f \rangle = 0 \Rightarrow f = 0$.
$\langle k(\cdot, x), f \rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x) = f(x)$; we say $k$ is the representer of evaluation [2].
By the above (the reproducing property) and Cauchy-Schwarz,
$f(x)^2 = \langle k(\cdot, x), f \rangle^2 \le \|k(\cdot, x)\|^2 \|f\|^2 = \langle k(\cdot, x), k(\cdot, x) \rangle \langle f, f \rangle = k(x, x) \langle f, f \rangle$.
So if $\langle f, f \rangle = 0$ then $f = 0$. This is our RKHS.

Characterization of Kernels
Formal definition: for a compact subset $X$ of $R^d$ and a Hilbert space $H$ of functions $f: X \to R$, we say that $H$ is a reproducing kernel Hilbert space if there exists $k: X^2 \to R$ such that:
1. $k$ has the reproducing property: $\langle k(\cdot, x), f \rangle = f(x)$;
2. $k$ spans $H$: $\mathrm{span}\{k(\cdot, x) : x \in X\} = H$.

Popular Kernels based on vectors
By Hilbert-Schmidt kernels (Courant and Hilbert, 1953), $\langle \phi(u), \phi(v) \rangle = K(u, v)$ for certain $\phi$ and $K$, e.g.:
- degree-$d$ polynomial: $K(u, v) = (\langle u, v \rangle + 1)^d$;
- radial basis function machine: $K(u, v) = \exp(-\|u - v\|^2 / (2\sigma^2))$;
- two-layer neural network: $K(u, v) = \mathrm{sigmoid}(\langle u, v \rangle + c)$.

Examples of Kernels
(Figures: decision surfaces produced by a polynomial kernel and an RBF kernel in two dimensions, n = 2.)
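A valid kernel must yield a symmetric positive semi-definite Gram matrix, as the Gram matrix and Mercer slides above emphasize. The sketch below builds Gram matrices for the polynomial and Gaussian (RBF) kernels from the "Popular Kernels" list and checks that their eigenvalues are non-negative; the kernel parameters and data are illustrative assumptions.

```python
import numpy as np

def polynomial_kernel(A, B, d=3):
    """K(u, v) = (<u, v> + 1)**d."""
    return (A @ B.T + 1.0) ** d

def rbf_kernel(A, B, sigma=1.0):
    """K(u, v) = exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))

for name, kernel in [("polynomial", polynomial_kernel), ("rbf", rbf_kernel)]:
    K = kernel(X, X)
    eigvals = np.linalg.eigvalsh(K)          # K is symmetric, so eigvalsh applies
    print(name, "symmetric:", np.allclose(K, K.T),
          "min eigenvalue:", eigvals.min())  # >= 0 up to numerical error
```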
How to build new kernels
Kernel combinations that preserve validity:
- $K(x, y) = \alpha K_1(x, y) + (1 - \alpha) K_2(x, y)$, with $0 \le \alpha \le 1$
- $K(x, y) = a \cdot K_1(x, y)$, with $a \ge 0$
- $K(x, y) = K_1(x, y) \cdot K_2(x, y)$
- $K(x, y) = f(x) \cdot f(y)$, where $f$ is a real-valued function
- $K(x, y) = K_3(\varphi(x), \varphi(y))$
- $K(x, y) = x' P y$, with $P$ symmetric positive definite
- $K(x, y) = \dfrac{K_1(x, y)}{\sqrt{K_1(x, x)\, K_1(y, y)}}$

Important Points
- Kernel method = linear method + embedding in feature space.
- Kernel functions are used to do the embedding efficiently.
- The feature space is a higher-dimensional space, so we must regularize.
- Choose a kernel appropriate to the domain.

Principal Component Analysis (PCA)
- Subtract the mean (centers the data).
- Compute the covariance matrix S.
- Compute the eigenvectors of S, sort them according to their eigenvalues, and keep the first M vectors.
- Project the data points onto those vectors.
Also called the Karhunen-Loeve transformation.

Kernel PCA
Principal Component Analysis (PCA) is one of the fundamental techniques in a wide range of areas. Simply stated, PCA diagonalizes (or finds the singular value decomposition (SVD) of) the covariance matrix; equivalently, we may find the SVD of the data matrix. Instead of performing PCA in the original input space, we may perform PCA in the feature space. This is called kernel PCA: find the eigenvalues and eigenvectors of the Gram matrix, the $n \times n$ matrix $K$ whose $ij$-th entry is $K(x_i, x_j)$. For many applications we need online algorithms, i.e., algorithms that do not need to store the Gram matrix.

PCA in dot-product form
Assume we have centered observations (column vectors $x_i$; centered means $\sum_i x_i = 0$). PCA finds the principal axes by diagonalizing the covariance matrix:
$\lambda v = C v$,  (1)  (eigenvalue, eigenvector)
$C = \frac{1}{\ell} \sum_j x_j x_j^T$.  (2)  (covariance matrix)

PCA in dot-product form
Substituting equation (2) into (1), we get
$\lambda v = C v = \frac{1}{\ell} \sum_j x_j x_j^T v = \frac{1}{\ell} \sum_j (x_j \cdot v)\, x_j$,  (3), (4)
where $x_j \cdot v$ is a scalar. Thus all solutions $v$ with $\lambda \ne 0$ lie in the span of $x_1, x_2, \ldots, x_\ell$, i.e.
$v = \sum_i \alpha_i x_i$.  (5)

Kernel PCA algorithm
If we do PCA in feature space, the covariance matrix is
$C = \frac{1}{\ell} \sum_j \phi(x_j) \phi(x_j)^T$,  (6)
which can be diagonalized with nonnegative eigenvalues satisfying
$\lambda V = C V$.  (7)
We have shown that $V$ lies in the span of the $\phi(x_i)$, so $V = \sum_i \alpha_i \phi(x_i)$ and
$\lambda \sum_i \alpha_i \phi(x_i) = \frac{1}{\ell} \sum_j \phi(x_j) \phi(x_j)^T \sum_i \alpha_i \phi(x_i)$.  (8)

Kernel PCA
Applying the kernel trick, $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, and taking inner products of (8) with each $\phi(x_k)$ gives
$\ell \lambda K \alpha = K^2 \alpha$,  (9)
and we can finally write the expression as the eigenvalue problem
$K \alpha = \ell \lambda \alpha$.  (10)

Kernel PCA algorithm outline
1. Given a set of m-dimensional data $\{x_k\}$, calculate $K$, for example the Gaussian $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / d)$.
2. Carry out centering in feature space.
3. Solve the eigenvalue problem $K\alpha = \ell\lambda\alpha$.
4. For a test pattern $x$, extract a nonlinear component via
$\langle V^k, \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k K(x_i, x)$.  (11)
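The kernel PCA outline above translates fairly directly into code. The following sketch follows the four steps: build the Gram matrix, center it in feature space, solve the eigenvalue problem $K\alpha = \ell\lambda\alpha$, and project patterns with $\langle V^k, \phi(x) \rangle = \sum_i \alpha_i^k K(x_i, x)$. The Gaussian kernel width, data, and the normalization of the $\alpha^k$ (so the feature-space eigenvectors have unit norm) are assumptions made for this illustration, not details from the slides.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def kernel_pca(X, n_components, sigma=1.0):
    """Steps 1-3 of the outline: Gram matrix, feature-space centering, eigenproblem."""
    l = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    one = np.full((l, l), 1.0 / l)
    Kc = K - one @ K - K @ one + one @ K @ one          # centering in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)               # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # largest eigenvalues first
    # Scale each alpha^k so the corresponding feature-space eigenvector V^k has unit norm.
    alphas = eigvecs[:, :n_components] / np.sqrt(np.maximum(eigvals[:n_components], 1e-12))
    return alphas, Kc

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))
alphas, Kc = kernel_pca(X, n_components=2, sigma=2.0)

# Step 4 for the training patterns: <V^k, phi(x)> = sum_i alpha_i^k K(x_i, x).
# A genuinely new pattern would need the same feature-space centering first (omitted).
projections = Kc @ alphas
print(projections[:3])
```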
Stability of Kernel Algorithms
Our objective in learning is to improve generalization performance: cross-validation, Bayesian methods, generalization bounds, ...
Call $E_S[f(x)] = 0$ a pattern in a sample $S$. Is this pattern also likely to be present in new data, $E_P[f(x)] \approx 0$?
We can use concentration inequalities (McDiarmid's theorem) to prove:
Theorem: Let $S = \{x_1, \ldots, x_\ell\}$ be an IID sample from $P$ and define the sample mean of $f(x)$ as $\bar{f} = \frac{1}{\ell} \sum_i f(x_i)$. Then, with $R = \sup_x \|f(x)\|$,
$P\!\left( \|\bar{f} - E_P[f]\| \ge \frac{R}{\sqrt{\ell}} \left( 2 + \sqrt{2 \ln \tfrac{1}{\delta}} \right) \right) \le \delta$,
i.e. the probability that the sample mean and the population mean differ by less than $\epsilon$ is more than $1 - \delta$, independent of $P$!

Rademacher Complexity
Problem: we only checked the generalization performance for a single fixed pattern $f(x)$. What if we want to search over a function class $F$?
Intuition: we need to incorporate the complexity of this function class. Rademacher complexity captures the ability of the function class to fit random noise ($\sigma_i = \pm 1$, uniformly distributed):
$\hat{R}_\ell(F) = E_\sigma \left[ \sup_{f \in F} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \;\middle|\; x_1, \ldots, x_\ell \right]$  (empirical RC),
$R_\ell(F) = E_S E_\sigma \left[ \sup_{f \in F} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \right]$.

Generalization Bound
Theorem: Let $f$ be a function in $F$ which maps to $[0, 1]$ (e.g. loss functions). Then, with probability at least $1 - \delta$ over random draws of samples of size $\ell$, every $f$ satisfies
$E_P[f(x)] \le \hat{E}_{\text{data}}[f(x)] + R_\ell(F) + \sqrt{\frac{\ln(2/\delta)}{2\ell}} \le \hat{E}_{\text{data}}[f(x)] + \hat{R}_\ell(F) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}$.
Relevance: the expected pattern $E[f] = 0$ will also be present in a new data set if the last two terms are small:
- the complexity of the function class $F$ is small;
- the number of training data $\ell$ is large.

Linear Functions (in feature space)
Consider the function class $F_B = \{ f: x \mapsto \langle w, \phi(x) \rangle,\ \|w\| \le B \}$ with $k(x, y) = \langle \phi(x), \phi(y) \rangle$ and a sample $S = \{x_1, \ldots, x_\ell\}$. Then the empirical RC of $F_B$ is bounded by
$\hat{R}_\ell(F_B) \le \frac{2B}{\ell} \sqrt{\mathrm{tr}(K)}$.
Relevance: since $\{ x \mapsto \sum_{i=1}^{\ell} \alpha_i k(x_i, x) : \alpha^T K \alpha \le B^2 \} \subseteq F_B$, it follows that if we control the norm $\|w\|^2 = \alpha^T K \alpha$ in kernel algorithms, we control the complexity of the function class (regularization).

Margin Bound (classification)
Theorem: Choose $c > 0$ (the margin). $F$: $f(x, y) = -y\, g(x)$, $y = \pm 1$. $S = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$ an IID sample, and $\delta \in (0, 1)$ the probability of violating the bound. Then
$P_P[y \ne \mathrm{sign}(g(x))] \le \frac{1}{\ell c} \sum_{i=1}^{\ell} \xi_i + \frac{4}{\ell c} \sqrt{\mathrm{tr}(K)} + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}$  (probability of misclassification),
where $\xi_i = (c - y_i g(x_i))_+$ (slack variable) and $(f)_+ = f$ if $f \ge 0$, and $0$ otherwise.
Relevance: we can bound our classification error on new samples. Moreover, we have a strategy to improve generalization: choose the margin $c$ as large as possible such that all samples are correctly classified, $\xi_i = 0$ (e.g. support vector machines).

Next Part
- Constructing Kernels
- Kernels for Text: vector space kernels
- Kernels for Structured Data: subsequence kernels, trie-based kernels
- Kernels from Generative Models: P-kernels, Fisher kernels
