				      Support Cluster Machine
                         Paper from ICML 2007
                         Read by Haiqin Yang
                               2007-10-18




This paper, Support Cluster Machine, was written by Bin Li, Mingmin Chi,
Jianping Fan, and Xiangyang Xue, and was published at ICML in 2007.
                                                                           1
                         Outline
 Background and Motivation

 Support Cluster Machine - SCM

 Kernel in SCM

 Experiments

 An Interesting Application: Privacy-preserving Data Mining

 Discussions




                                                               2
        Background and Motivation
 Large scale classification problem
    Decomposition methods: Osuna et al., 1997; Joachims, 1999; Platt, 1999;
     Collobert & Bengio, 2001; Keerthi et al., 2001
    Incremental algorithms: Cauwenberghs & Poggio, 2000;
     Fung & Mangasarian, 2002; Laskov et al., 2006
    Parallel techniques: Collobert et al., 2001; Graf et al., 2004
    Approximate formulas: Fung & Mangasarian, 2001; Lee & Mangasarian, 2001
    Choose representatives: Active learning (Schohn & Cohn, 2003);
     Cluster-Based SVM (Yu et al., 2003); Core Vector Machine (CVM;
     Tsang et al., 2005); Clustering SVM (Boley & Cao, 2004)
                                                                              3
   Support Cluster Machine - SCM
 Given training samples:
     {(x_i, y_i)}, i = 1, …, N, with x_i ∈ R^d and y_i ∈ {−1, +1}
 Procedure
    Cluster the samples of each class to obtain a compact generative
     model per class (a mixture of Gaussian components)
    Train an SVM-style classifier whose training units are the resulting
     clusters rather than the individual samples

                                   4
                SCM Solution

 Dual representation




 Decision function




                               5
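The dual representation and decision function on the slide above were embedded images that did not survive extraction. A standard soft-margin SVM dual, written over the m cluster units p_1, …, p_m with labels y_i, is an assumed reconstruction consistent with the SCM setup, not necessarily the paper's exact notation:

```latex
% Assumed SVM-style dual over cluster units p_1, ..., p_m
\max_{\alpha}\; \sum_{i=1}^{m} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m}
      \alpha_i \alpha_j \, y_i y_j \, K(p_i, p_j)
\qquad \text{s.t.}\quad \sum_{i=1}^{m} \alpha_i y_i = 0, \quad
  0 \le \alpha_i \le C

% Decision function on a test vector x (treated as a degenerate unit)
f(x) = \operatorname{sgn}\!\Big( \sum_{i=1}^{m} \alpha_i y_i \, K(p_i, x) + b \Big)
```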
                          Kernel
 Probability product kernel


 By Gaussian assumption, i.e.,
 Hence




                                   6
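The kernel formulas on the slide above were images. The probability product kernel with ρ = 1, and its closed form under the Gaussian assumption, are standard results and presumably what the slide showed:

```latex
K(p, p') = \int p(x)\, p'(x)\, dx
\qquad \text{(probability product kernel, } \rho = 1\text{)}

% Gaussian assumption: p = \mathcal{N}(\mu, \Sigma),\ p' = \mathcal{N}(\mu', \Sigma')
K(p, p') = \mathcal{N}(\mu - \mu';\, 0,\ \Sigma + \Sigma')
 = (2\pi)^{-d/2}\, \lvert \Sigma + \Sigma' \rvert^{-1/2}
   \exp\!\Big( -\tfrac{1}{2} (\mu - \mu')^{\top} (\Sigma + \Sigma')^{-1} (\mu - \mu') \Big)
```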
                      Kernel
 Property I


 That is
 Decision function

 Property II




                               7
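Since the property statements above were lost in extraction, here is a minimal numerical sketch of the Gaussian probability product kernel (ρ = 1); the function name `gaussian_ppk` and the zero-covariance point limit are illustrative assumptions, but the closed form itself is the standard identity ∫N(x; μ, Σ)N(x; μ′, Σ′)dx = N(μ − μ′; 0, Σ + Σ′):

```python
import numpy as np

def gaussian_ppk(mu1, S1, mu2, S2):
    """Probability product kernel (rho = 1) between two Gaussians:
    K = integral N(x; mu1, S1) N(x; mu2, S2) dx = N(mu1 - mu2; 0, S1 + S2)."""
    d = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    S = np.asarray(S1, dtype=float) + np.asarray(S2, dtype=float)
    k = len(d)
    norm = (2 * np.pi) ** (-k / 2) * np.linalg.det(S) ** -0.5
    return float(norm * np.exp(-0.5 * d @ np.linalg.solve(S, d)))

# Kernel between two cluster-level units (toy parameters)
K12 = gaussian_ppk([0, 0], np.eye(2), [1, 1], 2 * np.eye(2))

# Letting one covariance shrink to zero turns that unit into a point x,
# and the kernel reduces to evaluating the other density at x -- which is
# plausibly why testing units can be plain vectors (cf. the Discussions slide).
x = np.array([1.0, 1.0])
K1x = gaussian_ppk([0, 0], np.eye(2), x, np.zeros((2, 2)))
```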
                        Experiments
 Datasets
    Toydata
    MNIST – Handwritten digits (‘0’–‘9’) classification
    Adult – Privacy-preserving dataset
 Classification methods
    libSVM
    SVMTorch
    SVMlight
    CVM (Core Vector Machine)
    SCM
 Clustering algorithms
    Threshold Order Dependent (TOD)
    EM algorithm
 Model selection
 CPU: 3.0 GHz

                                                                    8
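The deck names Threshold Order Dependent (TOD) clustering but gives no algorithm. A minimal sketch, assuming the single-pass scheme suggested by the name (the paper's exact variant may differ; `tod_cluster` and its running-mean update are illustrative):

```python
import numpy as np

def tod_cluster(points, threshold):
    """Single-pass, order-dependent clustering: each point joins the nearest
    existing cluster if it lies within `threshold` of that cluster's mean,
    otherwise it starts a new cluster."""
    centers, counts, labels = [], [], []
    for p in np.asarray(points, dtype=float):
        if centers:
            dists = [np.linalg.norm(p - c) for c in centers]
            j = int(np.argmin(dists))
            if dists[j] < threshold:
                counts[j] += 1
                centers[j] += (p - centers[j]) / counts[j]  # running mean
                labels.append(j)
                continue
        centers.append(p.copy())  # open a new cluster at this point
        counts.append(1)
        labels.append(len(centers) - 1)
    return centers, labels

# Two well-separated groups of toy points collapse to two clusters
pts = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
centers, labels = tod_cluster(pts, threshold=1.0)
```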
                         Toydata
 Samples: 2500 per class, generated from a mixture of Gaussians
 Clustering algorithm: TOD
 Clustering results: 25 positive, 25 negative




                                                            9
                              MNIST
 Data description
    10 classes: Handwritten digits ‘0’-’9’
    Training samples: 60,000, about 6000 for each class
    Testing samples: 10,000
 Construct 45 one-vs-one binary classifiers (one per pair of the 10 digit classes)
 Results
    25 Clusters for EM algorithm




                                                           10
                         MNIST
 Test results for TOD algorithm




                                   11
    Privacy-preserving Data Mining
 Inter-Enterprise data mining
    Problem: Two parties owning confidential databases
    wish to build a decision-tree classifier on the union of
    their databases, without revealing any unnecessary
    information.
 Horizontally partitioned
   Records (users) split across companies
   Example: Credit card fraud detection model
 Vertically partitioned
   Attributes split across companies
   Example: Associations across websites


                                                               12
      Privacy-preserving Data Mining
 Randomization approach

   Original records:    30 | 70K | ...        50 | 40K | ...      ...
                              |                     |
                         Randomizer            Randomizer
                              |                     |
   Randomized records:  65 | 20K | ...        25 | 60K | ...      ...
                              |                     |
                        Reconstruct           Reconstruct
                        distribution          distribution        ...
                          of Age                of Salary
                              \                     /
                           Data Mining Algorithms  -->  Model

                                                                13
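The randomization step in the diagram can be sketched as additive zero-mean noise: individual values are masked while aggregates remain estimable. The ages list, noise range, and seed below are hypothetical, and real schemes reconstruct the full distribution, not just the mean:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

ages = [30, 50, 27, 44, 61, 35, 22, 58] * 50  # hypothetical records
# Each record is perturbed with zero-mean uniform noise before sharing.
noisy = [a + random.uniform(-15, 15) for a in ages]

true_mean = sum(ages) / len(ages)
noisy_mean = sum(noisy) / len(noisy)
# Individual rows no longer reveal true values, but because the noise is
# zero-mean, aggregate statistics of Age stay close to the originals.
```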
 Classification Example

[Figure: a small table of customer records (Age | Salary | Repeat/Single
visitor) and the decision tree learned from it; the root node tests
Age < 25, the other internal node tests Salary < 50K, and the leaves are
labeled Repeat and Single.]

                           14
      Privacy-preserving Dataset: Adult
 Data description
    Training samples: 30162
    Testing samples: 15060
    Percentage of positive samples: 24.78%
 Procedure
    Horizontally partition data into three subsets (parties)
    Cluster by TOD algorithm
    Obtain three positive and three negative GMMs (one pair per party)
    Combine them into a single positive and a single negative GMM with
     modified priors
    Classify with SCM
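The "modified priors" step above can be sketched as follows: each party's component weights are rescaled by that party's share of the total sample count, so the merged mixture is again a valid GMM. `combine_gmms`, the tuple layout, and the toy numbers are illustrative assumptions; covariances would be carried along the same way as the means:

```python
import numpy as np

def combine_gmms(parties):
    """Merge per-party GMMs into one global GMM.
    `parties` is a list of (n_samples, weights, means) tuples; each party's
    component priors are rescaled by its share of the total data so the
    combined priors again sum to 1."""
    total = sum(n for n, _, _ in parties)
    weights, means = [], []
    for n, w, m in parties:
        for wi, mi in zip(w, m):
            weights.append(wi * n / total)
            means.append(mi)
    return np.array(weights), np.array(means)

# Hypothetical parties: (sample count, component priors, component means)
parties = [
    (100, [0.5, 0.5], [[0.0], [2.0]]),
    (300, [1.0],      [[1.0]]),
]
w, mu = combine_gmms(parties)
```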


                                                                        15
     Privacy-preserving Dataset: Adult
 Partition results




 Experimental results




                                         16
                           Discussions
 Solved problems
    Large scale problems: downsample by clustering + classifier
    Privacy-preserving problems: hide individual information
 Differences from other methods
    Training units are generative models, testing units are vectors
    Training units contain complete statistical information
    Only one parameter for model selection
    Easy implementation
    Generalization ability is not clear, whereas the RBF kernel in SVM has
     the property that a larger width leads to a lower VC dimension




                                                                           17
                 Discussions

 Advantages of using priors and covariances




                                               18
Thank you!



               19
