Self-taught Learning
Transfer Learning from Unlabeled Data
Rajat Raina
Honglak Lee, Roger Grosse
Alexis Battle, Chaitanya Ekanadham, Helen Kwong, Benjamin Packer,
Narut Sereewattanawoot
Andrew Y. Ng
Stanford University
The “one learning algorithm” hypothesis
There is some evidence that the human brain uses essentially the
same algorithm to understand many different input modalities.
– Example: Ferret experiments, in which the “input” for vision was plugged
into the auditory part of the brain, and the auditory cortex learned to “see”
[Roe et al., 1992].
If we could find this one learning algorithm,
we would be done. (Finally!)
(Roe et al., 1992. Hawkins & Blakeslee, 2004)
Finding a deep learning algorithm
If the brain really is one learning algorithm, it would suffice
to just:
• Find a learning algorithm for a single layer, and
• Show that it can build a small number of layers.
We evaluate our algorithms:
• Against biology, e.g., sparse RBMs for V2: poster yesterday (Lee et al.).
• On applications: this talk.
Supervised learning
Train Test
Cars Motorcycles
Supervised learning algorithms may not work well with limited labeled data.
Learning in humans
Your brain has $10^{14}$ synapses (connections).
You will live for $10^{9}$ seconds.
If each synapse requires 1 bit to parameterize, you need to
“learn” $10^{14}$ bits in $10^{9}$ seconds.
Or, $10^{5}$ bits per second.
Human learning is largely unsupervised,
and uses readily available unlabeled data.
(Geoffrey Hinton, personal communication)
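The back-of-the-envelope estimate above can be checked in a few lines. This is only the slide's arithmetic restated; the variable names are illustrative.

```python
# Figures taken from the slide; names are illustrative.
synapses = 10**14          # connections in the brain
lifetime_seconds = 10**9   # about 32 years
bits_per_synapse = 1       # the slide's simplifying assumption

# Required learning rate if every synapse must be set within one lifetime.
bits_per_second = synapses * bits_per_synapse // lifetime_seconds
print(bits_per_second)     # 100000, i.e. 10^5 bits per second
```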
“Brain-like” Learning
Train Test
Cars Motorcycles
Unlabeled images
(randomly downloaded from the Internet)
“Brain-like” Learning
Labeled digits + unlabeled English characters = ?
Labeled webpages + unlabeled newspaper articles = ?
Labeled Russian speech + unlabeled English speech = ?
We call this setting “self-taught learning.”
Recent history of machine learning
• 20 years ago: Supervised learning
Cars Motorcycles
• 10 years ago: Semi-supervised learning.
Cars Motorcycles
• 10 years ago: Transfer learning.
Bus Tractor Aircraft Helicopter Cars Motorcycles
• Next: Self-taught learning?
Car Motorcycle
Natural scenes
Self-taught Learning
Labeled examples:
$\{(x_l^{(i)}, y^{(i)})\}_{i=1}^{m}$, with $x_l^{(i)} \in \mathbb{R}^n$, $y^{(i)} \in \{1, \ldots, T\}$
Unlabeled examples:
$\{x_u^{(i)}\}_{i=1}^{k}$, with $x_u^{(i)} \in \mathbb{R}^n$, $k \gg m$
The unlabeled and labeled data:
• Need not share labels y.
• Need not share a generative distribution.
Advantage: Such unlabeled data is often easy to obtain.
A self-taught learning algorithm
Overview: Represent each labeled or unlabeled input $x$
as a sparse linear combination of “basis vectors” $\{b_j\}_{j=1}^{s}$:
$x \approx \sum_j a_j b_j$, with $b_j \in \mathbb{R}^n$, $a_j \in \mathbb{R}$
Example: $x = 0.8\, b_{87} + 0.3\, b_{376} + 0.5\, b_{411}$
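As a rough illustration of what a sparse linear combination looks like numerically, the sketch below rebuilds the slide's example with three active bases. The dimensions `n` and `s` and the random basis matrix `B` are assumptions made for illustration, not values from the talk, and the slide's basis numbers are used directly as 0-based column indices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 196, 512                      # hypothetical input dimension and number of bases
B = rng.standard_normal((n, s))      # columns stand in for basis vectors b_j

a = np.zeros(s)                      # activations a_j, almost all zero
a[[87, 376, 411]] = [0.8, 0.3, 0.5]  # the three active bases from the slide

x = B @ a                            # x = sum_j a_j b_j
same = np.allclose(x, 0.8 * B[:, 87] + 0.3 * B[:, 376] + 0.5 * B[:, 411])
print(same, np.count_nonzero(a), "of", s, "activations are non-zero")
```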
Key steps:
1. Learn good bases $b_j$ using unlabeled data $x_u^{(i)}$.
2. Use these learnt bases to construct “higher-level” features for the
labeled data.
3. Apply a standard supervised learning algorithm on these features.
Learning the bases: Sparse coding
Given only unlabeled data $x_u^{(i)}$, we find good bases $b$ using sparse
coding:
$\min_{b,a} \sum_i \| x_u^{(i)} - \sum_j a_j^{(i)} b_j \|_2^2 + \beta \sum_i \| a^{(i)} \|_1$
(first term: reconstruction error; second term: sparsity penalty)
(Efficient algorithms: Lee et al., NIPS 2006)
[Details: An extra normalization constraint on $\| b_j \|_2$ is required.]
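A minimal sketch of how such an objective might be minimized is given below. It alternates between ISTA (iterative soft-thresholding) for the activations and a regularized least-squares update for the bases, followed by rescaling any basis whose norm exceeds 1. This is a simple stand-in chosen for brevity, not the efficient algorithm of Lee et al.; the function names, the sparsity weight `beta`, the constraint bound of 1, and the toy data are all assumptions.

```python
import numpy as np

def soft_threshold(Z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def sparse_codes(X, B, beta, n_iter=100):
    """ISTA for min_A ||X - B A||^2 + beta * sum|A|, with bases B held fixed."""
    L = 2.0 * np.linalg.norm(B, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    A = np.zeros((B.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * B.T @ (B @ A - X)            # gradient of the reconstruction term
        A = soft_threshold(A - grad / L, beta / L)
    return A

def objective(X, B, A, beta):
    """Reconstruction error plus sparsity penalty."""
    return np.sum((X - B @ A) ** 2) + beta * np.sum(np.abs(A))

def learn_bases(X, s, beta=0.1, n_outer=10, seed=0):
    """Alternate between activations and bases; X holds one example per column."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((X.shape[0], s))
    B /= np.linalg.norm(B, axis=0, keepdims=True)
    for _ in range(n_outer):
        A = sparse_codes(X, B, beta)
        # Least-squares basis update with a small ridge term for stability.
        B = np.linalg.solve(A @ A.T + 1e-6 * np.eye(s), A @ X.T).T
        # Heuristic projection onto the assumed constraint ||b_j||_2 <= 1.
        B /= np.maximum(np.linalg.norm(B, axis=0, keepdims=True), 1.0)
    return B

rng = np.random.default_rng(1)
X_u = rng.standard_normal((16, 200))   # toy stand-in for unlabeled data
B = learn_bases(X_u, s=8)
A = sparse_codes(X_u, B, beta=0.1)
print(B.shape, objective(X_u, B, A, 0.1))
```

The rescaling step is a shortcut; an exact treatment would solve the constrained least-squares problem for the bases.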
Example bases
Natural images. Learnt bases: “Edges”
Handwritten characters. Learnt bases: “Strokes”
Constructing features
Using the learnt bases $b$, compute features for the
examples $x_l$ from the classification task by solving:
Features of $x_l$ = $\arg\min_a \| x_l - \sum_j a_j b_j \|_2^2 + \beta \| a \|_1$
(first term: reconstruction error; second term: sparsity penalty)
Example: $x_l = 0.8\, b_{87} + 0.3\, b_{376} + 0.5\, b_{411}$
Finally, learn a classifier using a standard supervised
learning algorithm (e.g., SVM) over these features.
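The feature-construction and classification steps might look roughly like the sketch below. Several things here are assumptions: the bases are random unit-norm stand-ins rather than learnt ones, the labeled data is synthetic, ISTA is used as a simple solver for the L1-regularized problem, and a nearest-centroid classifier replaces the SVM so that the example needs only NumPy.

```python
import numpy as np

def sparse_features(X, B, beta, n_iter=200):
    """ISTA for argmin_a ||x - B a||_2^2 + beta ||a||_1, for each column x of X."""
    L = 2.0 * np.linalg.norm(B, 2) ** 2          # Lipschitz constant of the gradient
    A = np.zeros((B.shape[1], X.shape[1]))
    for _ in range(n_iter):
        Z = A - 2.0 * B.T @ (B @ A - X) / L      # gradient step on reconstruction error
        A = np.sign(Z) * np.maximum(np.abs(Z) - beta / L, 0.0)  # soft threshold
    return A

rng = np.random.default_rng(0)
n, s = 20, 30
B = rng.standard_normal((n, s))
B /= np.linalg.norm(B, axis=0)                   # unit-norm stand-ins for learnt bases

def make_examples(active, count):
    """Toy labeled inputs built from a few active bases plus small noise."""
    A_true = np.zeros((s, count))
    A_true[active] = rng.uniform(0.5, 1.0, size=(len(active), count))
    return B @ A_true + 0.01 * rng.standard_normal((n, count))

# Class 0 uses bases 0-2, class 1 uses bases 3-5 (purely synthetic).
X_l = np.hstack([make_examples([0, 1, 2], 10), make_examples([3, 4, 5], 10)])
y = np.array([0] * 10 + [1] * 10)

F = sparse_features(X_l, B, beta=0.1)            # one feature column per labeled example

# Nearest-centroid classifier over the sparse features (stand-in for an SVM).
centroids = np.stack([F[:, y == c].mean(axis=1) for c in (0, 1)])
dists = ((F.T[:, None, :] - centroids[None]) ** 2).sum(axis=2)
pred = np.argmin(dists, axis=1)
print("training accuracy:", np.mean(pred == y))
```

In practice the feature matrix `F` would be handed to whichever supervised learner is preferred; the classifier here is only a placeholder.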
Image classification
[Figure: platypus image from the Caltech101 dataset, shown next to a visualization of its features]
Results (15 labeled images per class):
  Baseline        16%
  PCA             37%
  Sparse coding   47%
  36.0% error reduction
Other reported results:
  Fei-Fei et al., 2004    16%
  Berg et al., 2005       17%
  Holub et al., 2005      40%
  Serre et al., 2005      35%
  Berg et al., 2005       48%
  Zhang et al., 2006      59%
  Lazebnik et al., 2006   56%
Character recognition
Handwritten English classification
(20 labeled images per handwritten character; bases learnt on digits)
  Raw             54.8%
  PCA             54.8%
  Sparse coding   58.5%
  8.2% error reduction
English font classification
(20 labeled images per font character; bases learnt on handwritten English)
  Raw                   17.9%
  PCA                   14.5%
  Sparse coding         16.6%
  Sparse coding + Raw   20.2%
  2.8% error reduction
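The slides do not define “error reduction.” The character-recognition figures are consistent with a relative reduction in error rate, measured from the raw-feature accuracy to the best sparse coding accuracy; that reading is an inference, and the helper below is hypothetical.

```python
def error_reduction(baseline_acc, new_acc):
    """Relative reduction in error rate (percent), given two accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err

# Handwritten English: raw 54.8% -> sparse coding 58.5%
handwritten = round(error_reduction(54.8, 58.5), 1)
# English font: raw 17.9% -> sparse coding + raw 20.2%
font = round(error_reduction(17.9, 20.2), 1)
print(handwritten, font)   # 8.2 2.8
```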
Text classification
Webpage classification
(2 labeled documents per class; bases learnt on Reuters newswire)
  Raw words       62.8%
  PCA             63.3%
  Sparse coding   64.3%
  4.0% error reduction
UseNet classification
(2 labeled documents per class; bases learnt on Reuters newswire)
  Raw words       61.3%
  PCA             60.7%
  Sparse coding   63.8%
  6.5% error reduction
Shift-invariant sparse coding
[Figure: sparse features, basis functions, and the resulting reconstruction]
(Algorithms: Grosse et al., UAI 2007)
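In a shift-invariant model, each basis function can appear at any time offset, so the reconstruction is a sum of convolutions between short basis functions and sparse activation signals. The sketch below shows only that reconstruction step, with random stand-ins for the bases and activations; the sizes `T`, `q`, and `s` are hypothetical, and no learning is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
T, q, s = 200, 25, 4                    # hypothetical signal length, basis length, number of bases
bases = rng.standard_normal((s, q))     # short basis functions (random stand-ins)

# Sparse features: each basis fires at only a few time offsets.
S = np.zeros((s, T - q + 1))
for j in range(s):
    onsets = rng.choice(T - q + 1, size=3, replace=False)
    S[j, onsets] = rng.uniform(0.5, 1.0, size=3)

# Reconstruction: sum over bases of (activation signal convolved with basis function).
# A full convolution of lengths T-q+1 and q has length T.
x = sum(np.convolve(S[j], bases[j]) for j in range(s))
print(x.shape, np.count_nonzero(S), "non-zero activations")
```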
Audio classification
Speaker identification
(5 labels, TIMIT corpus, 1 sentence per speaker; bases learnt on different dialects)
  Spectrogram     38.5%
  MFCCs           43.8%
  Sparse coding   48.7%
  8.7% error reduction
Musical genre classification
(5 labels, 18 seconds per genre; bases learnt on different genres, songs)
  Spectrogram            48.4%
  MFCCs                  54.0%
  Music-specific model   49.3%
  Sparse coding          56.6%
  5.7% error reduction
(Details: Grosse et al., UAI 2007)
Sparse deep belief networks
[Figure: sparse RBM with visible layer v, hidden layer h, and parameters W, b, c]
(Details: Lee et al., NIPS 2007. Poster yesterday.)
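For orientation, the sketch below computes the hidden-unit activation probabilities of a restricted Boltzmann machine and a simple measure of how far the average activation is from a low target. It makes several assumptions the slide does not state: binary visible units, `b` as the visible bias and `c` as the hidden bias, and a squared-deviation sparsity penalty with a hypothetical target of 0.05. It illustrates the general idea of a sparse RBM and is not the model from the cited poster.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden, batch = 64, 32, 100               # hypothetical sizes
W = 0.1 * rng.standard_normal((n_visible, n_hidden))   # weights
b = np.zeros(n_visible)                                # assumed: visible biases
c = -2.0 * np.ones(n_hidden)                           # assumed: hidden biases (negative, so units are rarely on)

V = rng.integers(0, 2, size=(batch, n_visible)).astype(float)  # toy binary visible vectors
H_prob = sigmoid(V @ W + c)          # P(h_j = 1 | v) for each example
V_prob = sigmoid(H_prob @ W.T + b)   # mean-field reconstruction of the visible layer

target = 0.05                        # hypothetical target mean activation
sparsity_penalty = np.sum((target - H_prob.mean(axis=0)) ** 2)
print(H_prob.mean(), sparsity_penalty)
```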
Sparse deep belief networks
Image classification
(Caltech101 dataset)
1-layer sparse DBN 44.5%
2-layer sparse DBN 46.6%
3.2% error reduction
(Details: Lee et al., NIPS 2007. Poster yesterday.)
Summary
• Self-taught learning: Unlabeled data does not share the labels of the
classification task.
[Figure: labeled images of cars and motorcycles; unlabeled images]
• Use unlabeled data to discover features.
• Use sparse coding to construct an easy-to-classify, “higher-level”
representation.
THE END
Related Work
• Weston et al., ICML 2006
  – Make stronger assumptions on the unlabeled data.
• Ando & Zhang, JMLR 2005
  – For natural language tasks and character recognition, use heuristics
    to construct a transfer learning task using unlabeled data.