Self-taught Learning

    Transfer Learning from Unlabeled Data

                          Rajat Raina

                     Honglak Lee, Roger Grosse
Alexis Battle, Chaitanya Ekanadham, Helen Kwong, Benjamin Packer,
                       Narut Sereewattanawoot
                           Andrew Y. Ng

                       Stanford University
The “one learning algorithm” hypothesis
    There is some evidence that the human brain uses essentially the
         same algorithm to understand many different input modalities.
          –     Example: Ferret experiments, in which the “input” for vision was plugged
                into the auditory part of the brain, and the auditory cortex learned to “see”
                [Roe et al., 1992].




                       If we could find this one learning algorithm,
                               we would be done. (Finally!)



                                        (Roe et al., 1992. Hawkins & Blakeslee, 2004)


   Finding a deep learning algorithm

    If the brain really is one learning algorithm, it would suffice
       to just:
         – Find a learning algorithm for a single layer, and
         – Show that it can build a small number of layers.
    We evaluate our algorithms:
         – Against biology (e.g., sparse RBMs for V2; poster yesterday, Lee et al.).
         – On applications (this talk).

   Supervised learning
   [Figure: train and test images; classes: Cars, Motorcycles.]

Supervised learning algorithms may not work well with limited labeled data.



   Learning in humans
    Your brain has $10^{14}$ synapses (connections).
    You will live for $10^{9}$ seconds.
    If each synapse requires 1 bit to parameterize, you need to
       “learn” $10^{14}$ bits in $10^{9}$ seconds.

    Or, $10^{5}$ bits per second.

                       Human learning is largely unsupervised,
                       and uses readily available unlabeled data.

                                        (Geoffrey Hinton, personal communication)
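
The estimate above is a single division. A minimal sketch of the arithmetic follows; the variable names are illustrative, and the one-bit-per-synapse figure is the slide's own simplifying assumption.

```python
# Back-of-the-envelope check of the learning-rate estimate.
synapses = 10**14          # connections in the brain (figure from the slide)
lifetime_seconds = 10**9   # approximate human lifetime (figure from the slide)
bits_per_synapse = 1       # simplifying assumption stated on the slide

bits_per_second = synapses * bits_per_synapse // lifetime_seconds
print(bits_per_second)     # 100000, i.e. 10^5 bits per second
```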



   “Brain-like” Learning
   [Figure: train and test images (Cars, Motorcycles), plus unlabeled images
   randomly downloaded from the Internet.]

   “Brain-like” Learning

   Labeled digits           +   Unlabeled English characters     ?
   Labeled webpages         +   Unlabeled newspaper articles     ?
   Labeled Russian speech   +   Unlabeled English speech         ?

   “Self-taught Learning”

   We call this setting, labeled data plus unlabeled data of a different kind,
   “self-taught learning.”


Recent history of machine learning
• 20 years ago: Supervised learning.
     [Figure labels: Cars, Motorcycles.]
• 10 years ago: Semi-supervised learning.
     [Figure labels: Cars, Motorcycles.]
• 10 years ago: Transfer learning.
     [Figure labels: Bus, Tractor, Aircraft, Helicopter; Cars, Motorcycles.]
• Next: Self-taught learning?
     [Figure labels: Natural scenes; Car, Motorcycle.]

   Self-taught Learning
Labeled examples:
   $\{(x_l^{(i)}, y^{(i)})\}_{i=1}^{m}$, where $x_l^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in \{1, \ldots, T\}$

Unlabeled examples:
   $\{x_u^{(i)}\}_{i=1}^{k}$, where $x_u^{(i)} \in \mathbb{R}^n$ and $k \gg m$



The unlabeled and labeled data:
• Need not share labels y.
• Need not share a generative distribution.

Advantage: Such unlabeled data is often easy to obtain.


   A self-taught learning algorithm
Overview: Represent each labeled or unlabeled input $x$ as a sparse linear
  combination of “basis vectors” $\{b_j\}_{j=1}^{s}$:

   $x \approx \sum_j a_j b_j$, where $b_j \in \mathbb{R}^n$ and $a_j \in \mathbb{R}$

Example:
   $x = 0.8 \, b_{87} + 0.3 \, b_{376} + 0.5 \, b_{411}$





Key steps:
1. Learn good bases $b_j$ using unlabeled data $x_u^{(i)}$.
2. Use these learnt bases to construct “higher-level” features for the
   labeled data.
3. Apply a standard supervised learning algorithm on these features.




    Learning the bases: Sparse coding
 Given only unlabeled data $x_u^{(i)}$, we find good bases $b$ using sparse
 coding:

   $$\min_{b,a} \; \sum_i \Big\| x_u^{(i)} - \sum_j a_j^{(i)} b_j \Big\|_2^2 \;+\; \beta \sum_i \big\| a^{(i)} \big\|_1$$

 The first term is the reconstruction error; the second is the sparsity penalty.

                                               (Efficient algorithms: Lee et al., NIPS 2006)


[Details: An extra normalization constraint on $\| b_j \|_2$ is required.]
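
The sparse coding objective above can be minimized by alternating between the activations and the bases. The following is a minimal sketch under stated assumptions, not the efficient algorithm of Lee et al.: it uses plain ISTA for the activations and a least-squares update for the bases, and it takes the normalization constraint to be $\| b_j \|_2 \le 1$ (the slide does not give its exact form). The function names, `beta` value, and iteration counts are illustrative.

```python
import numpy as np

def objective(X, B, A, beta):
    """Reconstruction error plus L1 sparsity penalty; columns of X are examples."""
    return np.sum((X - B @ A) ** 2) + beta * np.sum(np.abs(A))

def sparse_codes(X, B, beta, A0=None, n_iter=100):
    """ISTA for min_A ||X - B A||^2 + beta * ||A||_1 with the bases B held fixed."""
    L = 2.0 * np.linalg.norm(B, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    A = np.zeros((B.shape[1], X.shape[1])) if A0 is None else A0.copy()
    for _ in range(n_iter):
        Z = A - 2.0 * B.T @ (B @ A - X) / L        # gradient step on the squared error
        A = np.sign(Z) * np.maximum(np.abs(Z) - beta / L, 0.0)  # soft-thresholding
    return A

def learn_bases(X, n_bases, beta=0.1, n_outer=20, seed=0):
    """Alternate between sparse activations A and bases B on unlabeled data X."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((X.shape[0], n_bases))
    B /= np.linalg.norm(B, axis=0, keepdims=True)
    A = None
    for _ in range(n_outer):
        A = sparse_codes(X, B, beta, A0=A)         # warm start from the previous codes
        B = X @ A.T @ np.linalg.pinv(A @ A.T)      # least-squares basis update
        norms = np.linalg.norm(B, axis=0, keepdims=True)
        B /= np.maximum(norms, 1.0)                # assumed constraint: ||b_j||_2 <= 1
    A = sparse_codes(X, B, beta)                   # final codes for the final bases
    return B, A

# Toy usage on random data standing in for unlabeled examples.
X_u = np.random.default_rng(1).standard_normal((8, 50))
B, A = learn_bases(X_u, n_bases=12, beta=0.1, n_outer=5)
```

Rescaling the bases after an unconstrained least-squares step is a heuristic projection, so the objective is not guaranteed to decrease at every outer iteration; it is used here only to keep the sketch short.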


    Example bases
Natural images.           Learnt bases: “Edges”




Handwritten characters.   Learnt bases: “Strokes”




   Constructing features
   Using the learnt bases $b$, compute features for the
     examples $x_l$ from the classification task by solving:

        Features of $x_l$ $= \arg\min_a \Big\| x_l - \sum_j a_j b_j \Big\|_2^2 + \beta \| a \|_1$

     The first term is the reconstruction error; the second is the sparsity penalty.

     Example: $x_l = 0.8 \, b_{87} + 0.3 \, b_{376} + 0.5 \, b_{411}$

   Finally, learn a classifier using a standard supervised
     learning algorithm (e.g., SVM) over these features.
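
The two steps on this slide, computing sparse features with fixed bases and then training a classifier on them, can be sketched as follows. This is an illustration under assumptions: the features are computed with plain ISTA, the bases are a hypothetical identity matrix rather than learnt ones, and a nearest-centroid classifier stands in for the SVM mentioned above so that the example needs only NumPy.

```python
import numpy as np

def sparse_features(x, B, beta=0.1, n_iter=200):
    """ISTA for argmin_a ||x - B a||_2^2 + beta * ||a||_1 with the bases B held fixed."""
    L = 2.0 * np.linalg.norm(B, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    a = np.zeros(B.shape[1])
    for _ in range(n_iter):
        z = a - 2.0 * B.T @ (B @ a - x) / L
        a = np.sign(z) * np.maximum(np.abs(z) - beta / L, 0.0)  # soft-thresholding
    return a

class NearestCentroid:
    """Stand-in classifier (not an SVM): predicts the class with the closest mean."""
    def fit(self, F, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([F[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, F):
        d = ((F[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]

# Hypothetical bases and labeled examples, for illustration only.
B = np.eye(4)
X_l = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.9, 0.1, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0],
                [0.0, 0.0, 0.8, 0.1]])
y = np.array([0, 0, 1, 1])

F = np.stack([sparse_features(x, B) for x in X_l])  # "higher-level" features
pred = NearestCentroid().fit(F, y).predict(F)
```

In practice the bases would come from sparse coding on the unlabeled data, and any standard supervised learner could replace the stand-in classifier.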

   Image classification
   [Figure: platypus image from the Caltech101 dataset, with its feature visualization.]







   Image classification
   (15 labeled images per class)

      Features          Accuracy
      Baseline          16%
      PCA               37%
      Sparse coding     47%

      36.0% error reduction

   Other reported results:
      Fei-Fei et al., 2004:    16%
      Berg et al., 2005:       17%
      Holub et al., 2005:      40%
      Serre et al., 2005:      35%
      Berg et al., 2005:       48%
      Zhang et al., 2006:      59%
      Lazebnik et al., 2006:   56%


   Character recognition
   [Figure: example images of digits, handwritten English, and English font.]

   Handwritten English classification
   (20 labeled images per handwritten character)
   Bases learnt on digits

      Features          Accuracy
      Raw               54.8%
      PCA               54.8%
      Sparse coding     58.5%

      8.2% error reduction

   English font classification
   (20 labeled images per font character)
   Bases learnt on handwritten English

      Features               Accuracy
      Raw                    17.9%
      PCA                    14.5%
      Sparse coding          16.6%
      Sparse coding + Raw    20.2%

      2.8% error reduction
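
The slides do not state how "error reduction" is computed. One plausible reading, assumed here, is the relative drop in error rate from the raw-feature accuracy to the best sparse coding accuracy; that definition reproduces both character recognition figures. The function name is illustrative.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Percent reduction in error rate, with both accuracies given in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err

# Handwritten English: raw 54.8% -> sparse coding 58.5%
handwritten = relative_error_reduction(54.8, 58.5)
# English font: raw 17.9% -> sparse coding + raw 20.2%
font = relative_error_reduction(17.9, 20.2)

print(round(handwritten, 1), round(font, 1))  # 8.2 2.8
```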

   Text classification
   [Figure: example documents from Reuters newswire, webpages, and UseNet articles.]

   Webpage classification
   (2 labeled documents per class)
   Bases learnt on Reuters newswire

      Features          Accuracy
      Raw words         62.8%
      PCA               63.3%
      Sparse coding     64.3%

      4.0% error reduction

   UseNet classification
   (2 labeled documents per class)
   Bases learnt on Reuters newswire

      Features          Accuracy
      Raw words         61.3%
      PCA               60.7%
      Sparse coding     63.8%

      6.5% error reduction

   Shift-invariant sparse coding
   [Figure panels: sparse features, basis functions, reconstruction.]

                                   (Algorithms: Grosse et al., UAI 2007)
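
In shift-invariant sparse coding, each basis function can appear at any time offset, so the reconstruction is a sum of convolutions of sparse activation signals with the basis functions. The sketch below shows only that reconstruction step, not the learning algorithm of Grosse et al.; the basis functions, lengths, and activation values are hypothetical.

```python
import numpy as np

def reconstruct(activations, bases):
    """Sum over j of (activations[j] convolved with bases[j]); bases share one length."""
    return sum(np.convolve(a, b, mode="full") for a, b in zip(activations, bases))

# Two hypothetical basis functions of equal length.
bases = [np.array([1.0, 0.0, -1.0]), np.array([1.0, 2.0, 1.0])]

# Sparse features: one nonzero activation per basis, at different shifts.
signal_len = 8
acts = [np.zeros(signal_len), np.zeros(signal_len)]
acts[0][1] = 0.5   # basis 0 placed at shift 1, scaled by 0.5
acts[1][4] = 2.0   # basis 1 placed at shift 4, scaled by 2.0

x_hat = reconstruct(acts, bases)
print(x_hat)
```

Because the activations carry both the coefficient and the position of each basis function, one short basis can explain the same pattern wherever it occurs in the signal.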


   Audio classification

   Speaker identification
   (5 labels, TIMIT corpus, 1 sentence per speaker.)
   Bases learnt on different dialects

      Features          Accuracy
      Spectrogram       38.5%
      MFCCs             43.8%
      Sparse coding     48.7%

      8.7% error reduction

   Musical genre classification
   (5 labels, 18 seconds per genre.)
   Bases learnt on different genres, songs

      Features                Accuracy
      Spectrogram             48.4%
      MFCCs                   54.0%
      Music-specific model    49.3%
      Sparse coding           56.6%

      5.7% error reduction
                                                          (Details: Grosse et al., UAI 2007)
   Sparse deep belief networks
   Sparse RBM:
      h: Hidden layer
      W, b, c: Parameters
      v: Visible layer




  New



                       (Details: Lee et al., NIPS 2007. Poster yesterday.)
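
The slide names only the layers h and v and the parameters W, b, c. As a rough illustration of how a sparse RBM might be trained, the sketch below makes several assumptions that are not from the talk: binary units, b as the visible bias and c as the hidden bias, one step of contrastive divergence (CD-1), and a sparsity term that nudges the mean hidden activation toward a small target. Lee et al., NIPS 2007 describes the actual model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_sparse_step(V, W, b, c, rng, lr=0.1, target=0.05, sparsity_lr=0.1):
    """One CD-1 update on a batch V (rows are visible vectors), plus a sparsity nudge."""
    ph0 = sigmoid(V @ W + c)                           # P(h = 1 | v) on the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden states
    pv1 = sigmoid(h0 @ W.T + b)                        # P(v = 1 | h): reconstruction
    ph1 = sigmoid(pv1 @ W + c)                         # hidden probabilities on it
    n = V.shape[0]
    W = W + lr * (V.T @ ph0 - pv1.T @ ph1) / n
    b = b + lr * (V - pv1).mean(axis=0)
    c = c + lr * (ph0 - ph1).mean(axis=0)
    # Assumed sparsity regularizer: push mean hidden activation toward `target`.
    c = c + sparsity_lr * (target - ph0.mean(axis=0))
    return W, b, c

# Toy usage on random binary data.
rng = np.random.default_rng(0)
V = (rng.random((20, 6)) < 0.5).astype(float)
W = 0.01 * rng.standard_normal((6, 4))
b = np.zeros(6)
c = np.zeros(4)
for _ in range(10):
    W, b, c = cd1_sparse_step(V, W, b, c, rng)
```

Stacking such layers, with the hidden activations of one RBM serving as the visible input to the next, gives the deep belief network referred to in the slide title.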
   Sparse deep belief networks
   Image classification (Caltech101 dataset)

      Model                 Accuracy
      1-layer sparse DBN    44.5%
      2-layer sparse DBN    46.6%

      3.2% error reduction




                               (Details: Lee et al., NIPS 2007. Poster yesterday.)
    Summary
 • Self-taught learning: Unlabeled data does not share the labels of the
   classification task.
   [Figure: labeled images (Cars, Motorcycles) and unlabeled images.]
 • Use unlabeled data to discover features.
 • Use sparse coding to construct an easy-to-classify, “higher-level”
   representation.
   [Figure: input = 0.8 * basis + 0.3 * basis + 0.5 * basis.]


THE END
    Related Work
•    Weston et al., ICML 2006
     • Make stronger assumptions on the unlabeled data.

•    Ando & Zhang, JMLR 2005
     • For natural language tasks and character
        recognition, use heuristics to construct a transfer
        learning task using unlabeled data.



