Nearest Neighbor Editing and Condensing Techniques


             Organization
1. Nearest Neighbor Revisited
2. Condensing Techniques
3. Proximity Graphs and Decision Boundaries
4. Editing Techniques




                                              Last updated: Oct. 7, 2005
    Nearest Neighbour Rule

•   Non-parametric pattern classification.
•   Consider a two-class problem where each sample consists of two
    measurements (x, y).
•   k = 1: for a given query point q, assign the class of the nearest
    neighbour.
•   k = 3: compute the k nearest neighbours and assign the class by majority
    vote (a minimal code sketch follows below).
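
A minimal sketch of the rule above, assuming the training samples and labels
are NumPy arrays and using Euclidean distance (function and variable names are
illustrative, not from the slides):

    import numpy as np

    def knn_classify(X_train, y_train, q, k=1):
        """Assign q the majority class among its k nearest training samples."""
        dists = np.linalg.norm(X_train - q, axis=1)   # distance to every training sample
        nearest = np.argsort(dists)[:k]               # indices of the k closest samples
        labels, counts = np.unique(np.asarray(y_train)[nearest], return_counts=True)
        return labels[np.argmax(counts)]              # majority vote; k=1 is the plain NN rule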
          Example: Digit Recognition


•   Yann LeCun – MNIST Digit Recognition
     – Handwritten digits
     – 28x28 pixel images: d = 784
     – 60,000 training samples
     – 10,000 test samples
•   Nearest neighbour is competitive

     Classifier                                       Test Error Rate (%)
     Linear classifier (1-layer NN)                          12.0
     K-nearest-neighbors, Euclidean                           5.0
     K-nearest-neighbors, Euclidean, deskewed                 2.4
     K-NN, Tangent Distance, 16x16                            1.1
     K-NN, shape context matching                             0.67
     1000 RBF + linear classifier                             3.6
     SVM deg 4 polynomial                                      1.1
     2-layer NN, 300 hidden units                              4.7
     2-layer NN, 300 HU, [deskewing]                           1.6
     LeNet-5, [distortions]                                    0.8
     Boosted LeNet-4, [distortions]                            0.7
              Nearest Neighbour Issues
•   Expensive
     – To determine the nearest neighbour of a query point q, must compute
        the distance to all N training examples
          + Pre-sort training examples into fast data structures (kd-trees)
          + Compute only an approximate distance (LSH)
          + Remove redundant data (condensing)
•   Storage Requirements
     – Must store all training data P
          + Remove redundant data (condensing)
          - Pre-sorting often increases the storage requirements
•   High Dimensional Data
     – “Curse of Dimensionality”
          • Required amount of training data increases exponentially with
            dimension
          • Computational cost also increases dramatically
          • Partitioning techniques degrade to linear search in high dimension
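
The “pre-sort into fast data structures” remedy above can be sketched with a
kd-tree; this assumes SciPy is available, and the data sizes are made up purely
for illustration:

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(60000, 10))            # N training samples in d dimensions
    tree = cKDTree(X_train)                           # built once, reused for every query
    dist, idx = tree.query(rng.normal(size=10), k=3)  # k nearest neighbours of one query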
             Exact Nearest Neighbour

• Asymptotic error (infinite sample size) is less than twice the Bayes
  classification error
   – Requires a lot of training data

• Expensive for high dimensional data (d>20?)

• O(Nd) complexity for both storage and query time
   – N is the number of training examples, d is the dimension of each
     sample
   – This can be reduced through dataset editing/condensing
                       Decision Regions
Each cell contains one sample, and every location within the cell is closer to
that sample than to any other sample. A Voronoi diagram divides the space into
such cells.


Every query point will be assigned the classification of the sample within that
cell. The decision boundary separates the class regions based on the 1-NN
decision rule.
Knowledge of this boundary is sufficient to classify new points.
The boundary itself is rarely computed; many algorithms seek to retain only
those points necessary to generate an identical boundary.
                             Condensing
•   Aim is to reduce the number of training samples
•   Retain only the samples that are needed to define the decision boundary
•   This is reminiscent of a Support Vector Machine
•   Decision Boundary Consistent – a subset whose nearest neighbour
    decision boundary is identical to the boundary of the entire training set
•   Consistent Set – a subset of the training data that correctly classifies
    all of the original training data

•   Minimum Consistent Set – smallest consistent set




     (Figures: original data, condensed data, minimum consistent set.)
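
A small sketch of the consistency check implied by these definitions, assuming
X and y are NumPy arrays of training samples and labels (names illustrative):
a candidate subset is consistent if 1-NN on the subset classifies every
original training sample correctly.

    import numpy as np

    def is_consistent(X, y, subset_idx):
        Xs, ys = X[subset_idx], y[subset_idx]
        for xi, yi in zip(X, y):                      # every original training sample
            nearest = np.argmin(np.linalg.norm(Xs - xi, axis=1))
            if ys[nearest] != yi:                     # misclassified by the subset
                return False
        return True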
                               Condensing
•   Condensed Nearest Neighbour (CNN) [Hart 1968]
     – Incremental
     – Order dependent
     – Neither minimal nor decision boundary consistent
     – O(n³) for the brute-force method
     – Produces a consistent set
     – Can follow up with Reduced Nearest Neighbour [Gates 1972]
          • Remove a sample if doing so does not cause any incorrect
            classifications

     Algorithm (sketched in code below):
     1.   Initialize the subset with a single training example
     2.   Classify all remaining samples using the subset, and transfer any
          incorrectly classified samples to the subset
     3.   Return to 2 until no transfers occurred or the subset is full
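
A brute-force sketch of the CNN loop listed above, assuming X and y are NumPy
arrays (names and the choice of the first example are illustrative):

    import numpy as np

    def cnn_condense(X, y):
        subset = [0]                                  # 1. start with a single training example
        changed = True
        while changed and len(subset) < len(X):       # 3. repeat until no transfers occur
            changed = False
            for i in range(len(X)):
                if i in subset:
                    continue
                d = np.linalg.norm(X[subset] - X[i], axis=1)
                if y[subset][np.argmin(d)] != y[i]:   # 2. misclassified by the current subset
                    subset.append(i)                  #    ...so transfer it to the subset
                    changed = True
        return np.array(subset, dtype=int)

Because the procedure is order dependent, a different scan order or initial
example can return a different consistent subset.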
             Proximity Graphs
• Condensing aims to retain points along the
  decision boundary
• How to identify such points?
   – Neighbouring points of different classes

• Proximity graphs provide various definitions of
  “neighbour”
      NNG ⊆ MST ⊆ RNG ⊆ GG ⊆ DT
      NNG = Nearest Neighbour Graph
      MST = Minimum Spanning Tree
      RNG = Relative Neighbourhood Graph
      GG = Gabriel Graph
      DT = Delaunay Triangulation (neighbours of a 1NN-classifier)
              Proximity Graphs: Delaunay

•   The Delaunay Triangulation is the dual of the
    Voronoi diagram
•   Three points are each other's neighbours if their circumscribing sphere
    contains no other points
•   Voronoi condensing: retain those points that have at least one neighbour
    (as defined by the Delaunay Triangulation) of a different class
•   The decision boundary is identical
•   Conservative subset
•   Retains extra points
•   Expensive to compute in high
    dimensions
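
A sketch of Voronoi condensing using SciPy's Delaunay triangulation (assumes
SciPy, low-dimensional data, and NumPy arrays X and y; names illustrative):
keep a point if any of its Delaunay neighbours carries a different class label.

    import numpy as np
    from scipy.spatial import Delaunay

    def voronoi_condense(X, y):
        tri = Delaunay(X)
        indptr, indices = tri.vertex_neighbor_vertices
        keep = []
        for i in range(len(X)):
            neighbours = indices[indptr[i]:indptr[i + 1]]   # Delaunay neighbours of point i
            if np.any(y[neighbours] != y[i]):               # a neighbour of another class
                keep.append(i)
        return np.array(keep, dtype=int)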
               Proximity Graphs: Gabriel
•   The Gabriel graph is a subset of the
    Delaunay Triangulation (some decision
    boundary might be missed)
•   Points are neighbours only if their
    (diametral) sphere of influence is
    empty
•   Does not preserve the identical
    decision boundary, but most changes
    occur outside the convex hull of the
    data points
•   Can be computed more efficiently




     (Figure: green lines denote “Tomek links”; one edge is marked as not a
     Gabriel edge.)
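
A sketch of the Gabriel neighbour test described above, assuming X is a NumPy
array of samples (names illustrative): points i and j are neighbours iff the
sphere whose diameter is the segment between them contains no other sample.

    import numpy as np

    def gabriel_neighbours(X, i, j):
        m = (X[i] + X[j]) / 2.0                       # centre of the diametral sphere
        r2 = np.sum((X[i] - X[j]) ** 2) / 4.0         # squared radius
        for k in range(len(X)):
            if k != i and k != j and np.sum((X[k] - m) ** 2) < r2:
                return False                          # another point lies inside the sphere
        return True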
                  Proximity Graphs: RNG
•   The Relative Neighbourhood Graph (RNG)
    is a subset of the Gabriel graph
•   Two points are neighbours if the “lune” defined by the intersection of the
    two spheres centred at the points, each with radius equal to the distance
    between them, is empty
•   Further reduces the number of neighbours
•   Decision boundary changes are often
    drastic, and not guaranteed to be training
    set consistent




            (Figures: Gabriel edited vs. RNG edited – the RNG result is not
            training set consistent.)
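
A sketch of the RNG neighbour test, again assuming a NumPy array X (names
illustrative): i and j are neighbours iff no third point lies in the lune,
i.e. no other point is closer to both of them than they are to each other.

    import numpy as np

    def rng_neighbours(X, i, j):
        d_ij = np.linalg.norm(X[i] - X[j])
        for k in range(len(X)):
            if k in (i, j):
                continue
            if max(np.linalg.norm(X[k] - X[i]), np.linalg.norm(X[k] - X[j])) < d_ij:
                return False                          # X[k] lies inside the lune
        return True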
            Dataset Reduction: Editing




•   Training data may contain noise, overlapping classes
     – starting to make assumptions about the underlying distributions

•   Editing seeks to remove noisy points and produce smooth decision
    boundaries – often by retaining points far from the decision boundaries

•   Results in homogeneous clusters of points
                              Wilson Editing
•   Wilson 1972
•   Remove points that do not agree with the majority of their k nearest neighbours

     (Figures: for the earlier example and for the overlapping-classes case,
     the original data are shown alongside the result of Wilson editing with
     k = 7.)
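
A sketch of Wilson editing as described above, assuming NumPy arrays X and y
(names illustrative): a sample is kept only if the majority of its k nearest
neighbours among the remaining data agrees with its label.

    import numpy as np

    def wilson_edit(X, y, k=7):
        keep = []
        for i in range(len(X)):
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                             # exclude the point itself
            nearest = np.argsort(d)[:k]
            labels, counts = np.unique(y[nearest], return_counts=True)
            if labels[np.argmax(counts)] == y[i]:     # neighbourhood agrees with the label
                keep.append(i)
        return np.array(keep, dtype=int)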
                                 Multi-edit
•   Multi-edit [Devijver & Kittler ’79]
     – Repeatedly apply Wilson editing to random partitions
     – Classify with the 1-NN rule
•   Approximates the error rate of the Bayes decision rule

     Algorithm (sketched in code below):
     1.   Diffusion: divide the data into N ≥ 3 random subsets
     2.   Classification: classify subset S_i using 1-NN with S_{(i+1) mod N}
          as the training set (i = 1..N)
     3.   Editing: discard all samples incorrectly classified in (2)
     4.   Confusion: pool all remaining samples into a new data set
     5.   Termination: if the last I iterations produced no editing, end;
          otherwise go to (1)




                 (Figure: multi-edit, 8 iterations – the last 3 produced no
                 change.)
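
A sketch of the multi-edit loop above with N = 3 subsets and I = 3 quiet
iterations, assuming NumPy arrays X and y (names and parameter defaults are
illustrative):

    import numpy as np

    def multi_edit(X, y, n_subsets=3, quiet_iters=3, seed=0):
        rng = np.random.default_rng(seed)
        idx = np.arange(len(X))
        quiet = 0
        while quiet < quiet_iters and len(idx) > n_subsets:
            parts = np.array_split(rng.permutation(idx), n_subsets)  # 1. diffusion
            keep = []
            for i, part in enumerate(parts):
                train = parts[(i + 1) % n_subsets]                   # 2. classify with next subset
                for j in part:
                    d = np.linalg.norm(X[train] - X[j], axis=1)
                    if y[train][np.argmin(d)] == y[j]:               # 3. discard misclassified
                        keep.append(j)
            quiet = quiet + 1 if len(keep) == len(idx) else 0        # 5. count quiet iterations
            idx = np.array(keep, dtype=int)                          # 4. confusion: pool survivors
        return idx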
Combined Editing/Condensing
 •   First edit the data to remove noise and smooth the boundary
 •   Then condense to obtain a smaller subset
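
Reusing the wilson_edit and cnn_condense sketches from earlier (all of them
illustrative, with X and y the training arrays), the combined scheme is simply:

    edited = wilson_edit(X, y, k=7)                   # edit: remove noise, smooth the boundary
    kept = cnn_condense(X[edited], y[edited])         # condense: keep only boundary points
    X_sub, y_sub = X[edited][kept], y[edited][kept]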
Where are we with respect to NN?
• Simple method, pretty powerful rule
• Very popular in text mining (seems to work well for this
  task)
• Can be made to run fast
• Requires a lot of training data
• Edit to reduce noise, class overlap
• Condense to remove data that are not needed
 Problems when using k-NN in Practice

• What distance measure to use?
   – Often Euclidean distance is used
   – Locally adaptive metrics
   – More complicated with non-numeric data, or when different
     dimensions have different scales
• Choice of k?
   – Cross-validation
   – 1-NN often performs well in practice
   – k-NN needed for overlapping classes
   – Re-label all data according to k-NN, then classify with 1-NN
   – Reduce k-NN problem to 1-NN through dataset editing
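
A sketch of choosing k by cross-validation, assuming scikit-learn is available
(function name and the candidate list are illustrative):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    def choose_k(X, y, candidates=(1, 3, 5, 7, 9), folds=5):
        scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=folds).mean()
                  for k in candidates]
        return candidates[int(np.argmax(scores))]     # k with the best cross-validated accuracy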