
Learning a Kernel Matrix for Nonlinear Dimensionality Reduction
                 By K. Weinberger, F. Sha, and L. Saul
                 Presented by Michael Barnathan
The Problem:
   Data lies on or near a manifold.
       Lower dimensionality than overall space.
       Locally Euclidean.
        Example: data on a 2D plane in R3, or a flat area on a sphere.


   Goal: Learn a kernel that will let us work in the lower-
    dimensional space.
       “Unfold” the manifold.
       First we need to know what it is!
           Its dimensionality.
           How it can vary.
                                                    [Figure: a 2D manifold on a sphere (Wikipedia).]
Background Assumptions:
   Kernel Trick
       Mercer’s Theorem: Continuous, Symmetric, Positive Semi-Definite Kernel
        Functions can be represented as dot (inner) products in a high-dimensional
        space (Wikipedia; implied in paper).
       So we replace the dot product with a kernel function.
            Or “Gram Matrix”, Knm = φ(xn)·φ(xm) = k(xn, xm).
           Kernel provides mapping into high-dimensional space.
           Consequence of Cover’s theorem: Nonlinear problem then becomes linear.
        Example: SVMs: xi·xj -> φ(xi)·φ(xj) = k(xi, xj).
   Linear Dimensionality Reduction Techniques:
       SVD, derived techniques (PCA, ICA, etc.) remove linear correlations.
       This reduces the dimensionality.
   Now combine these!
       Kernel PCA for nonlinear dimensionality reduction!
       Map input to a higher dimension using a kernel, then use PCA.
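
A minimal sketch of kernel PCA (assuming numpy; the Gaussian kernel and random data are placeholders, not the paper's setup): the algorithm needs only the Gram matrix, centers it in feature space, and reads the embedding off its top eigenvectors.

import numpy as np

def kernel_pca(K, n_components=2):
    # Project onto the top principal components in feature space,
    # given only the Gram matrix K_nm = k(x_n, x_m).
    N = K.shape[0]
    one = np.ones((N, N)) / N
    Kc = K - one @ K - K @ one + one @ K @ one      # center in feature space
    vals, vecs = np.linalg.eigh(Kc)                 # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]   # keep the largest ones
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.maximum(vals, 0.0))    # embedding coordinates

# Toy usage with a Gaussian kernel on random data.
X = np.random.randn(100, 5)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
Y = kernel_pca(K)                                   # 100 x 2 nonlinear embedding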
The (More Specific) Problem:
   Data described by a manifold.
   Using kernel PCA, discover the manifold.

   There’s only one detail missing:
   How do we find the appropriate kernel?

   This forms the basis of the paper’s approach.
   It is also a motivation for the paper…
Motivation:
   Exploits properties of the data, not just its space.
   Relates kernel discovery to manifold learning.
       With the right kernel, kernel PCA will allow us to discover the
        manifold.
       So it has implications for both fields.
           Another paper by the same authors focuses on applicability to
            manifold learning; this paper focuses on kernel learning.
   Unlike previous methods, this approach is unsupervised;
    the kernel is learned automatically.
   Not specific to PCA; the learned kernel can be used with other kernel methods.
Methodology – Idea:
   Semidefinite programming (optimization)
       Look for a locally isometric mapping from the space to the manifold.
           Preserves distance, angles between points.
           Rotation and Translation on a neighborhood.
       Fix the distance and angles between a point and its k nearest neighbors.
   Intuition:
       Represent points as a lattice of “steel balls”.
       Neighborhoods connected by “rigid rods” that fix angles and distance (local
        isometry constraint).
       Now pull the balls as far apart as possible (obj. function).
       The lattice flattens -> Lower dimensionality!
   The “balls” and “rods” represent the manifold...
       If the data is well-sampled (Wikipedia).
       Shouldn’t be a problem in practice.
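
A minimal sketch of the “rigid rods” (an illustration assuming numpy; k and the brute-force neighbor search are placeholder choices): connect every point to its k nearest neighbors, and these are the pairs whose distances the unfolding must preserve.

import numpy as np

def neighborhood_edges(X, k=4):
    # Return the set of (i, j) pairs whose distances the embedding must preserve.
    N = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    edges = set()
    for i in range(N):
        for j in np.argsort(d2[i])[1:k + 1]:               # k nearest neighbors of i
            edges.add((min(i, int(j)), max(i, int(j))))
    return edges

The paper additionally fixes distances between points that share a common neighbor, which pins down the angles within each neighborhood as well.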
Optimization Constraints:
     Isometry:
         For all points xj, xk that are neighbors of each other or of a common
          point xi, distances must be preserved:
             |Φ(xj) - Φ(xk)|² = |xj - xk|²
         Let G and K be the Gram matrices of the inputs and of the features:
             Gij = xi · xj,   Kij = Φ(xi) · Φ(xj)
         We then have Kii + Kjj - Kij - Kji = Gii + Gjj - Gij - Gji.
     Positive Semidefiniteness (required for kernel trick).
         No negative eigenvalues: K ⪰ 0.
     Centered on the origin ( Σi Φ(xi) = 0, i.e. Σij Kij = 0 ).
         So eigenvalues measure variance of PCs.
         Dataset can be centered if not already.
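
A quick numerical check of the identity behind the isometry constraint (a sketch assuming numpy; the random feature vectors are placeholders): for any Gram matrix Kij = Φ(xi)·Φ(xj), the combination Kii + Kjj - Kij - Kji equals the squared distance between Φ(xi) and Φ(xj).

import numpy as np

Phi = np.random.randn(20, 7)          # stand-in feature vectors Phi(x_i)
K = Phi @ Phi.T                       # their Gram matrix, K_ij = Phi_i . Phi_j
i, j = 3, 11
lhs = K[i, i] + K[j, j] - K[i, j] - K[j, i]
rhs = np.sum((Phi[i] - Phi[j]) ** 2)
print(np.isclose(lhs, rhs))           # True: the constraint fixes |Phi_i - Phi_j|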
Objective Function
   We want to maximize pairwise distances.
   This is an inversion of SSE/MSE!
   So we have T = (1/2N) Σij |Φ(xi) - Φ(xj)|²,
    which is just Tr(K)!
   Proof: (Not given in paper)
     Recall Kij = Φ(xi) · Φ(xj).
     T = (1/2N) Σij ( |Φ(xi)|² - 2 Φ(xi)·Φ(xj) + |Φ(xj)|² )
       = (1/2N) Σij ( Kii - 2Kij + Kjj )
       = (1/2N) ( N Σi Kii + N Σj Kjj - 2 Σij Kij )
       = Tr(K) - (Σij Kij) / N.
     Our centering constraint specifies Σij Kij = 0. Thus, we are left with:
       T = Tr(K).
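
The same identity can be checked numerically (a sketch assuming numpy, not from the paper): once the feature vectors are centered, the pairwise squared distances scaled by 1/2N sum exactly to the trace of K.

import numpy as np

N = 30
Phi = np.random.randn(N, 5)
Phi -= Phi.mean(axis=0)                 # centering: sum_i Phi_i = 0, so sum_ij K_ij = 0
K = Phi @ Phi.T

pairwise_sq = ((Phi[:, None, :] - Phi[None, :, :]) ** 2).sum(-1)
T = pairwise_sq.sum() / (2 * N)
print(np.isclose(T, np.trace(K)))       # True: the objective reduces to Tr(K)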
Semidefinite Embedding (SDE)
   Maximize Tr(K) subject to:
        K ⪰ 0 (positive semidefinite).
        Σij Kij = 0 (centered on the origin).
        Kii + Kjj - Kij - Kji = Gii + Gjj - Gij - Gji for all i, j that are neighbors
         of each other or of a common point.
   This optimization is convex, and thus free of local optima: a globally
    optimal K can be found.
   Use semidefinite programming to perform the
    optimization (no SDP details in paper).
   Once we have the optimal kernel, perform kPCA.
   This technique (SDE) is this paper’s contribution.
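
Putting the pieces together, a minimal sketch of SDE as a semidefinite program (assuming numpy and the cvxpy modeling library; the paper does not prescribe a particular solver, and the brute-force neighbor search is only for illustration):

import numpy as np
import cvxpy as cp

def sde_kernel(X, k=4):
    N = X.shape[0]
    G = X @ X.T                                        # input Gram matrix
    # k-nearest-neighbor adjacency (brute force, fine for small N).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    nbrs = np.argsort(d2, axis=1)[:, 1:k + 1]
    adj = np.zeros((N, N), dtype=bool)
    for i in range(N):
        adj[i, nbrs[i]] = True
    adj = adj | adj.T
    common = (adj.astype(int) @ adj.astype(int)) > 0   # pairs sharing a neighbor
    pairs = [(i, j) for i in range(N) for j in range(i + 1, N)
             if adj[i, j] or common[i, j]]

    K = cp.Variable((N, N), PSD=True)                  # K must be PSD
    cons = [cp.sum(K) == 0]                            # centering constraint
    cons += [K[i, i] + K[j, j] - 2 * K[i, j] ==
             G[i, i] + G[j, j] - 2 * G[i, j]
             for (i, j) in pairs]                      # local isometry
    cp.Problem(cp.Maximize(cp.trace(K)), cons).solve() # maximize Tr(K)
    return K.value

The returned matrix would then be handed to kernel PCA, as in the earlier sketch.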
Experimental Setup
   Four kernels:
       SDE (proposed)
       Linear
       Polynomial
       Gaussian
   “Swiss Roll” Dataset.
       23 dimensions.
            3 meaningful dimensions (the 3D roll itself).
            20 filled with small-variance noise.
       800 inputs.
       k = 4, p = 4, σ = 1.45 (σ of 4-neighborhoods).
   “Teapot” Dataset.
       Same teapot, rotated 0 ≤ i < 360 degrees.
       23,028 dimensions (76 x 101 x 3).
       Only one degree of freedom (angle of rotation).
       400 inputs.
       k = 4, p = 4, σ = 1541.
   “The handwriting dataset”.
       No dimensionality or parameters specified (16x16x1 = 256D?)
       953 images. No images or kernel matrix shown.
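
A sketch of a comparable input set (assuming numpy and scikit-learn; this mimics the Swiss Roll setup described above rather than reproducing the paper's exact data):

import numpy as np
from sklearn.datasets import make_swiss_roll

X3, _ = make_swiss_roll(n_samples=800, random_state=0)   # the 3 meaningful dimensions
noise = 0.01 * np.random.randn(800, 20)                   # 20 small-noise dimensions
X = np.hstack([X3, noise])                                 # 800 inputs in 23 dimensions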
Results – Dimensionality Reduction
   Two measures:
        Learned Kernels (SDE):
            [Figure in paper: the kernel matrices learned by SDE for each dataset.]
        “Eigenspectra” (a sketch of the computation follows below):
            Variance captured by individual eigenvalues.
            Normalized by trace (sum of eigenvalues).
            Seems to indicate manifold dimensionality.
            [Figure in paper: eigenspectra for the “Swiss Roll”, “Teapot”, and “Digits” datasets.]
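
The normalized eigenspectrum referred to above could be computed as follows (a sketch assuming numpy; K is a learned kernel matrix):

import numpy as np

def normalized_eigenspectrum(K):
    vals = np.linalg.eigvalsh(K)[::-1]   # eigenvalues, largest first
    return vals / np.trace(K)            # each as a fraction of total variance

# A spectrum like [0.96, 0.03, 0.01, ...] suggests an essentially 1D manifold.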
Results – Large Margin Classification
   Used SDE kernels with SVMs.
   Results were very poor.
        Lowering dimensionality can impair separability.
    [Table in paper: error rates, 90/10 training/test split, mean of 10 experiments.]
    [Figure in paper: after unfolding, the classes are no longer linearly separable.]
Strengths and Weaknesses

   Strengths:
       Unsupervised convex kernel optimization.
       Generalizes well in theory.
       Relates manifold learning and kernel learning.
       Easy to implement; just solve optimization.
       Intuitive (stretching a string).
   Weaknesses:
       May not generalize well in practice (SVMs).
           Implicit assumption: lower dimensionality is better.
           Not always the case (as in SVMs due to separability in higher dimensions).
       Robustness – what if a neighborhood contains an outlier?
        Offline algorithm – entire Gram matrix required.
           Only a problem if N is large.
       Paper doesn’t mention SDP details.
           No algorithm analysis, complexity, etc. Complexity is “relatively high”.
           In fact, no proof of convergence (according to the authors’ other 2004 paper).
               Isomap, LLE, et al. already have such proofs.
Possible Improvements
   Introduce slack variables for robustness.
       “Rods” not “rigid”, but punished for “bending”.
       Would introduce a “C” parameter, as in SVMs (one possible formulation is
        sketched after this list).
   Incrementally process principal submatrices of K for large values of N,
    using incremental kernel PCA.
   Convolve SDE kernel with others for SVMs?
       SDE unfolds manifold, other kernel makes the problem linearly
        separable again.
       Only makes sense if SDE simplifies the problem.
   Analyze complexity of SDP.
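
One way the slack-variable idea might be written down (purely illustrative; this relaxation and the C parameter are this presentation's suggestion, not part of the paper), again using cvxpy on top of the earlier sketch:

import cvxpy as cp

def sde_kernel_slack(G, pairs, C=10.0):
    # G: input Gram matrix; pairs: neighbor pairs as in the earlier sketch.
    N = G.shape[0]
    K = cp.Variable((N, N), PSD=True)
    xi = cp.Variable(len(pairs), nonneg=True)             # how much each "rod" bends
    cons = [cp.sum(K) == 0]
    cons += [cp.abs((K[i, i] + K[j, j] - 2 * K[i, j]) -
                    (G[i, i] + G[j, j] - 2 * G[i, j])) <= xi[m]
             for m, (i, j) in enumerate(pairs)]
    # Trade off unfolding (Tr K) against total bending, with C playing the SVM role.
    cp.Problem(cp.Maximize(cp.trace(K) - C * cp.sum(xi)), cons).solve()
    return K.value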
Conclusions
   Using SDP, SDE can learn kernel matrices to “unfold” data
    embedded in manifolds.
        Without requiring parameters (beyond the neighborhood size k).
   Kernel PCA then reduces dimensionality.
   Excellent for nonlinear dimensionality reduction /
    manifold learning.
       Dramatic results when difference in dimensionalities is high.
   Poorly suited for SVM classification.
