POSTED ON: 7/21/2011
Learning a Kernel Matrix for Nonlinear Dimensionality Reduction
By K. Weinberger, F. Sha, and L. Saul
Presented by Michael Barnathan

The Problem:
Data lies on or near a manifold.
  Lower dimensionality than the overall space.
  Locally Euclidean.
  Example: data on a 2D plane in R^3, or a small, nearly flat patch of a sphere.
Goal: Learn a kernel that will let us work in the lower-dimensional space.
  “Unfold” the manifold.
  First we need to know what it is!
    Its dimensionality.
    How it can vary.
Figure: 2D manifold on a sphere. (Wikipedia)

Background Assumptions:
Kernel Trick
  Mercer’s Theorem: continuous, symmetric, positive semidefinite kernel functions can be represented as dot (inner) products in a high-dimensional space (Wikipedia; implied in paper).
  So we replace the dot product with a kernel function.
    Or a “Gram matrix”: $K_{nm} = \phi(x_n)^T \phi(x_m) = k(x_n, x_m)$.
  The kernel provides an implicit mapping into a high-dimensional space.
    Consequence of Cover’s theorem: a nonlinear problem is then more likely to become linear.
  Example: SVMs: $x_i^T x_j \rightarrow \phi(x_i)^T \phi(x_j) = k(x_i, x_j)$.
Linear Dimensionality Reduction Techniques:
  SVD and derived techniques (PCA, ICA, etc.) remove linear correlations.
  This reduces the dimensionality.
Now combine these!
  Kernel PCA for nonlinear dimensionality reduction!
  Map the input to a higher dimension using a kernel, then use PCA.

The (More Specific) Problem:
Data described by a manifold.
Using kernel PCA, discover the manifold.
There’s only one detail missing: how do we find the appropriate kernel?
  This forms the basis of the paper’s approach.
  It is also a motivation for the paper…

Motivation:
Exploits properties of the data, not just its space.
Relates kernel discovery to manifold learning.
  With the right kernel, kernel PCA will allow us to discover the manifold.
  So it has implications for both fields.
  Another paper by the same authors focuses on applicability to manifold learning; this paper focuses on kernel learning.
Unlike previous methods, this approach is unsupervised; the kernel is learned automatically.
Not specific to PCA; it can learn any kernel.

Methodology – Idea:
Semidefinite programming (optimization).
Look for a locally isometric mapping from the space to the manifold.
  Preserves distances and angles between points.
  Acts as a rotation and translation on each neighborhood.
Fix the distances and angles between a point and its k nearest neighbors.
Intuition:
  Represent points as a lattice of “steel balls”.
  Neighborhoods are connected by “rigid rods” that fix angles and distances (local isometry constraint).
  Now pull the balls as far apart as possible (objective function).
  The lattice flattens -> lower dimensionality!
The “balls” and “rods” represent the manifold...
  If the data is well-sampled (Wikipedia).
  Shouldn’t be a problem in practice.

Optimization Constraints:
Isometry:
  For all neighbors $x_j$, $x_k$ of point $x_i$: $(\Phi(x_i) - \Phi(x_j)) \cdot (\Phi(x_i) - \Phi(x_k)) = (x_i - x_j) \cdot (x_i - x_k)$.
  If $x_i$ and $x_j$ are neighbors of each other or of another common point: $|\Phi(x_i) - \Phi(x_j)|^2 = |x_i - x_j|^2$.
  Let the Gram matrices be $G_{ij} = x_i \cdot x_j$ and $K_{ij} = \Phi(x_i) \cdot \Phi(x_j)$.
  We then have $K_{ii} + K_{jj} - K_{ij} - K_{ji} = G_{ii} + G_{jj} - G_{ij} - G_{ji}$.
Positive semidefiniteness (required for the kernel trick).
  No negative eigenvalues.
Centered on the origin ($\sum_{ij} K_{ij} = 0$).
  So eigenvalues measure the variance of the PCs.
  The dataset can be centered if it is not already.

Objective Function
We want to maximize pairwise distances.
This is an inversion of SSE/MSE!
So we have
  $T = \frac{1}{2N} \sum_{ij} |\Phi(x_i) - \Phi(x_j)|^2$
Which is just Tr(K)!
Proof: (Not given in paper)
  Recall $K_{ij} = \Phi(x_i) \cdot \Phi(x_j)$.
  $T = \frac{1}{2N} \sum_{ij} \left[ |\Phi(x_i)|^2 - 2\,\Phi(x_i) \cdot \Phi(x_j) + |\Phi(x_j)|^2 \right]$
  So $T = \frac{1}{2N} \sum_{ij} \left( K_{ii} - 2K_{ij} + K_{jj} \right) = \frac{1}{2N} \left[ N \sum_i K_{ii} + N \sum_j K_{jj} - \sum_{ij} 2K_{ij} \right]$.
  $\sum_i K_{ii} = \mathrm{Tr}(K)$, so $T = \frac{1}{2N} \left[ 2N\,\mathrm{Tr}(K) - 2 \sum_{ij} K_{ij} \right] = \mathrm{Tr}(K) - \frac{1}{N} \sum_{ij} K_{ij}$.
  Our constraint specifies $\sum_{ij} K_{ij} = 0$.
  Thus, we are left with: $T = \mathrm{Tr}(K)$.

Semidefinite Embedding (SDE)
Maximize Tr(K) subject to:
  $K \succeq 0$ (positive semidefinite).
  $\sum_{ij} K_{ij} = 0$ (centered).
  $K_{ii} + K_{jj} - K_{ij} - K_{ji} = G_{ii} + G_{jj} - G_{ij} - G_{ji}$ for all i, j that are neighbors of each other or of a common point.
This optimization is convex, so it has no local optima; any optimum found is global.
Use semidefinite programming to perform the optimization (no SDP details in paper).
Once we have the optimal kernel, perform kPCA.
This technique (SDE) is this paper’s contribution.

Experimental Setup
Four kernels:
  SDE (proposed)
  Linear
  Polynomial
  Gaussian
“Swiss Roll” Dataset.
  23 dimensions.
    3 meaningful (top right).
    20 filled with small noise (not shown).
  800 inputs.
  k = 4, p = 4, σ = 1.45 (σ of 4-neighborhoods).
“Teapot” Dataset.
  Same teapot, rotated through angles 0 ≤ i < 360 degrees.
  23,028 dimensions (76 x 101 x 3).
  Only one degree of freedom (angle of rotation).
  400 inputs.
  k = 4, p = 4, σ = 1541.
“The handwriting dataset”.
  No dimensionality or parameters specified (16x16x1 = 256D?).
  953 images.
  No images or kernel matrix shown.

Results – Dimensionality Reduction
Two measures:
  Learned kernels (SDE).
  “Eigenspectra”:
    Variance captured by individual eigenvalues.
    Normalized by the trace (sum of eigenvalues).
    Seems to indicate the manifold dimensionality.
Figure panels: “Swiss Roll”, “Teapot”, “Digits”.

Results – Large Margin Classification
Used SDE kernels with SVMs.
Results were very poor.
  Lowering the dimensionality can impair separability.
Error rates:
  90/10 training/test split.
  Mean of 10 experiments.
The classes are no longer linearly separable.

Strengths and Weaknesses
Strengths:
  Unsupervised convex kernel optimization.
  Generalizes well in theory.
  Relates manifold learning and kernel learning.
  Easy to implement; just solve the optimization.
  Intuitive (stretching a string).
Weaknesses:
  May not generalize well in practice (SVMs).
    Implicit assumption: lower dimensionality is better.
    Not always the case (as in SVMs, which rely on separability in higher dimensions).
  Robustness: what if a neighborhood contains an outlier?
  Offline algorithm: the entire Gram matrix is required.
    Only a problem if N is large.
  Paper doesn’t mention SDP details.
    No algorithm analysis, complexity, etc.
    Complexity is “relatively high”.
    In fact, no proof of convergence (according to the authors’ other 2004 paper).
    Isomap, LLE, et al. already have such proofs.

Possible Improvements
Introduce slack variables for robustness.
  “Rods” are no longer “rigid”, but are penalized for “bending”.
  Would introduce a “C” parameter, as in SVMs.
Incrementally accept minors of K for large values of N, and use incremental kernel PCA.
Convolve the SDE kernel with others for SVMs?
  SDE unfolds the manifold; the other kernel makes the problem linearly separable again.
  Only makes sense if SDE simplifies the problem.
Analyze the complexity of the SDP.

Conclusions
Using SDP, SDE can learn kernel matrices to “unfold” data embedded in manifolds.
  Without requiring kernel parameters (only the neighborhood size k).
Kernel PCA then reduces the dimensionality.
Excellent for nonlinear dimensionality reduction / manifold learning.
  Dramatic results when the difference in dimensionalities is high.
Poorly suited for SVM classification.
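
Numerical check of the objective function:
The trace identity from the Objective Function slide can be sanity-checked with a few lines of plain Python. This is only a minimal sketch under stated assumptions: the random 3D points stand in for the mapped points Φ(x_i), and the helper names (center, gram, trace, mean_pairwise_sq_dist) and the sample size are hypothetical choices for illustration, not anything taken from the paper. It does not solve the SDP; it only checks that, once the points are centered, (1/2N) Σ_ij |Φ(x_i) − Φ(x_j)|² equals Tr(K), and that K_ii + K_jj − K_ij − K_ji is the squared distance used in the isometry constraint.

```python
import random

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def center(points):
    """Subtract the mean so the points are centered on the origin."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    return [[p[d] - mean[d] for d in range(dim)] for p in points]

def gram(points):
    """Gram matrix K_ij = p_i . p_j."""
    return [[dot(p, q) for q in points] for p in points]

def trace(K):
    return sum(K[i][i] for i in range(len(K)))

def mean_pairwise_sq_dist(points):
    """Objective T = (1/2N) * sum over all i, j of |p_i - p_j|^2."""
    n = len(points)
    total = 0.0
    for p in points:
        for q in points:
            total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total / (2 * n)

random.seed(0)  # fixed seed, arbitrary choice for repeatability
raw = [[random.gauss(0, 1) for _ in range(3)] for _ in range(20)]

# Uncentered case: T = Tr(K) - (1/N) * sum_ij K_ij
K_raw = gram(raw)
sum_K_raw = sum(sum(row) for row in K_raw)
objective_raw = mean_pairwise_sq_dist(raw)
general_rhs = trace(K_raw) - sum_K_raw / len(raw)

# Centered case: sum_ij K_ij = 0, so T = Tr(K)
phi = center(raw)
K = gram(phi)
sum_K = sum(sum(row) for row in K)
objective = mean_pairwise_sq_dist(phi)

# Isometry quantity for one pair: K_ii + K_jj - K_ij - K_ji = |p_i - p_j|^2
i, j = 2, 7
lhs = K[i][i] + K[j][j] - K[i][j] - K[j][i]
sq_dist = sum((a - b) ** 2 for a, b in zip(phi[i], phi[j]))

print("sum of K entries (centered):", sum_K)
print("objective T:", objective, " Tr(K):", trace(K))
```

Centering does not change pairwise distances, so the objective is the same before and after centering; only the Gram matrix changes, which is why the centering constraint is what reduces the objective to the trace.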