# SP09 CS188 Lecture 29 -- Machine Learning (Part 1)

```
Slide 1: Unsupervised & Semi-supervised Learning
John Blitzer, John DeNero, Dan Klein

Slide 2: Recap: Classification
- Classification systems:
  - Supervised learning
  - Make a prediction given evidence
  - We've seen several methods for this
  - Useful when you have labeled data

Slide 3: Clustering
- Clustering systems:
  - Unsupervised learning
  - Detect patterns in unlabeled data
    - E.g. group emails or search results
    - E.g. find categories of customers
    - E.g. detect anomalous program executions
  - Useful when you don't know what you're looking for
  - Requires data, but no labels
  - Often get gibberish

Slide 4: Clustering
- Basic idea: group together similar instances
  - Example: 2D point patterns
- What could "similar" mean?
  - One option: small (squared) Euclidean distance

Slide 5: K-Means
- An iterative clustering algorithm:
  - Pick K random points as cluster centers (means)
  - Alternate:
    - Assign data instances to the closest mean
    - Assign each mean to the average of its assigned points
  - Stop when no points' assignments change

Slide 6: K-Means Example
[figure: K-means iterations on 2D points]
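# The K-Means procedure on Slide 5, sketched in plain Python (an
# illustration, not the course's reference code; points are assumed to be
# numeric tuples):

import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)          # pick K random points as initial means
    assignments = None
    while True:
        # Phase I: assign each data instance to its closest mean
        new_assignments = [
            min(range(k), key=lambda j: sq_dist(p, means[j])) for p in points
        ]
        if new_assignments == assignments:  # stop when no assignment changes
            return means, assignments
        assignments = new_assignments
        # Phase II: move each mean to the average of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))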

Slide 7: Example: K-Means
- [web demo] http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster/

Slide 8: K-Means as Optimization
- Consider the total distance to the means:
    phi(a, c) = sum_i dist(x_i, c_{a_i})
  (x_i: points, a_i: assignments, c_k: means)
- Each iteration reduces phi
- Two stages each iteration:
  - Update assignments: fix means c, change assignments a
  - Update means: fix assignments a, change means c

Slide 9: Phase I: Update Assignments
- For each point, re-assign it to the closest mean:
    a_i = argmin_k dist(x_i, c_k)
- Can only decrease total distance phi!

Slide 10: Phase II: Update Means
- Move each mean to the average of its assigned points:
    c_k = mean of {x_i : a_i = k}
- Also can only decrease total distance… (Why?)
- Fun fact: the point y with minimum squared Euclidean distance to a set of points {x} is their mean
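# An empirical check of Slides 9-10 (a sketch, not course code; the points
# and starting means are made up): run the two phases by hand and confirm
# that the total squared distance phi never increases after either phase.

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def phi(points, means, assignments):
    # total squared distance of each point to its assigned mean
    return sum(sq_dist(p, means[a]) for p, a in zip(points, assignments))

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
means = [(0.0, 0.0), (1.0, 0.0)]            # a deliberately bad start
assignments = [0, 1, 1, 1]
history = [phi(points, means, assignments)]
for _ in range(10):
    # Phase I: update assignments with means fixed
    assignments = [min(range(len(means)), key=lambda j: sq_dist(p, means[j]))
                   for p in points]
    history.append(phi(points, means, assignments))
    # Phase II: update means with assignments fixed
    for j in range(len(means)):
        members = [p for p, a in zip(points, assignments) if a == j]
        if members:
            means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    history.append(phi(points, means, assignments))
# each recorded phi is no larger than the one before it
assert all(b <= a for a, b in zip(history, history[1:]))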

Slide 11: Initialization
- K-means is non-deterministic:
  - Requires initial means
  - It does matter what you pick!
  - What can go wrong?
- Various schemes for preventing this kind of thing: variance-based split / merge, initialization heuristics

Slide 12: K-Means Getting Stuck
- A local optimum:
  [figure: a converged but poor clustering]
- Why doesn't this work out like the earlier example, with the purple taking over half the blue?

Slide 13: K-Means Questions
- Will K-means converge?
  - To a global optimum?
- Will it always find the true patterns in the data?
  - If the patterns are very very clear?
- Will it find something interesting?
- Do people ever use it?
- How many clusters to pick?

Slide 14: Agglomerative Clustering
- Agglomerative clustering:
  - First merge very similar instances
  - Incrementally build larger clusters out of smaller clusters
- Algorithm:
  - Maintain a set of clusters
  - Initially, each instance is in its own cluster
  - Repeat:
    - Pick the two closest clusters
    - Merge them into a new cluster
  - Stop when there's only one cluster left
- Produces not one clustering, but a family of clusterings represented by a dendrogram
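# Slide 14's algorithm in miniature (an illustrative sketch; "closest" here
# means single-link distance, one of the options discussed on the next
# slide). The recorded merges are the dendrogram.

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cluster_dist(c1, c2):
    # single-link: distance between the closest pair across the clusters
    return min(sq_dist(p, q) for p in c1 for q in c2)

def agglomerate(points):
    clusters = [[p] for p in points]        # initially, each instance alone
    merges = []                             # record of merges = the dendrogram
    while len(clusters) > 1:
        # pick the two closest clusters
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges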

Slide 15: Agglomerative Clustering
- How should we define "closest" for clusters with multiple elements?
- Many options:
  - Closest pair (single-link clustering)
  - Farthest pair (complete-link clustering)
  - Average of all pairs
  - Ward's method (min variance, like k-means)
- Different choices create different clustering behaviors

Slide 16: Clustering Application
[screenshot: news site front page]
- Top-level categories: supervised classification
- Story groupings: unsupervised clustering

Slide 17: Step 1: Agglomerative Clustering
- Separate clusterings for each global category
- Represent documents as vectors
  - Millions of dimensions (1 for each proper noun)
- How do we know when to stop?
- Example story clusters:
    Warren Buffet | Berkshire Hathaway | S&P 500 | GEICO
    Warren Buffet | Dow Jones | GEICO | Charlie Munger
    Chrysler | S&P 500 | General Motors | Fiat
    Chrysler | Barack Obama | Fiat | United Auto Workers

Slide 18: Step 2: K-Means Clustering
- Initialize means to the centers from the agglomerative step
  [figure: the same example clusters, refined by k-means]
- Why might this be a good idea?
  - Guaranteed to decrease squared distance from cluster means
  - Helps to "clean up" points that may not be assigned appropriately
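# A sketch of the two-step pipeline on Slides 17-18, with made-up 1-D data
# standing in for document vectors: agglomerative clustering down to k
# clusters, whose centers then initialize k-means.

def sq(a, b):  # squared distance between 1-D points
    return (a - b) ** 2

def agglomerative_centers(points, k):
    # Step 1: merge the closest clusters (single-link) until only k remain
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(sq(p, q) for p in clusters[ij[0]]
                                        for q in clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return [sum(c) / len(c) for c in clusters]

def kmeans_from(points, means, iters=10):
    # Step 2: standard k-means, but seeded with the agglomerative centers
    # (an empty cluster's mean defaults to 0 here to keep the sketch short)
    for _ in range(iters):
        assign = [min(range(len(means)), key=lambda j: sq(p, means[j]))
                  for p in points]
        means = [sum(p for p, a in zip(points, assign) if a == j) /
                 max(1, sum(1 for a in assign if a == j))
                 for j in range(len(means))]
    return means, assign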
Slide 19: Semi-supervised Learning
- For a particular task, labeled data is always better than unlabeled
  - Get a correction for every mistake
- But labeled data is usually much more expensive to obtain
  - Google News: manually label news story clusters every 15 minutes?
  - Other examples? Exceptions?
- Combine labeled and unlabeled data to build better models

Slide 20: Sentiment Analysis
[screenshot: a commercial sentiment analysis product]
- Other companies: http://www.jodange.com , http://www.brandtology.com , …

Slide 21: Sentiment Classification
- Product review: "Running with Scissors: A Memoir"
  Title: Horrible book, horrible.
  "This book was horrible. I read half of it, suffering from a headache the
  entire time, and eventually i lit it on fire. One less copy in the
  world... don't waste your money. I wish i had the time spent reading this
  book back so i could use it for better purposes. This book wasted my life"
- Linear classifier (perceptron): predicts Positive or Negative

Slide 22: Features for Sentiment Classification
- Recall the perceptron classification rule
- Features: counts of particular words & bigrams
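# Slide 22's setup in miniature (illustrative only -- the tiny training set
# below is invented for the example): a perceptron over unigram and bigram
# counts, with the usual mistake-driven update.

from collections import Counter

def features(text):
    words = text.lower().split()
    feats = Counter(words)                          # unigram counts
    feats.update(zip(words, words[1:]))             # bigram counts
    return feats

def predict(weights, feats):
    score = sum(weights.get(f, 0) * v for f, v in feats.items())
    return +1 if score > 0 else -1                  # positive vs. negative

def train(examples, epochs=10):
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            feats = features(text)
            if predict(weights, feats) != label:    # update only on mistakes
                for f, v in feats.items():
                    weights[f] = weights.get(f, 0) + label * v
    return weights

examples = [
    ("this book was horrible", -1),
    ("a fascinating and excellent read", +1),
    ("horrible waste of money", -1),
    ("excellent memoir", +1),
]
w = train(examples)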

Slide 23: Domain Adaptation
- Training data: labeled book reviews
- Test data: unlabeled kitchen appliance reviews
- Semi-supervised problem: can we build a good classifier for kitchen appliances?

Slide 24: Books & Kitchen Appliances
- Book review: "Running with Scissors: A Memoir"
  Title: Horrible book, horrible.
  "This book was horrible. I read half of it, suffering from a headache the
  entire time, and eventually i lit it on fire. One less copy in the
  world... don't waste your money. I wish i had the time spent reading this
  book back so i could use it for better purposes. This book wasted my life"
- Kitchen appliance review: "Avante Deep Fryer, Chrome & Black"
  Title: lid does not work well...
  "I love the way the Tefal deep fryer cooks, however, I am returning my
  second one due to a defective lid closure. The lid may close initially,
  but after a few uses it no longer stays closed. I will not be purchasing
  this one again."
- Error increase: 13% → 26%
Slide 25: Handling Unseen Words
- (1) Unsupervised: cluster words based on context
- (2) Supervised: use clusters in place of the words themselves
- Clustering intuition: contexts for the word "defective"
  - Unlabeled kitchen contexts:
    "Do not buy the Shark portable steamer …. Trigger mechanism …"
    "the very nice lady assured me that I must have a defective set …. What a disappointment!"
  - Unlabeled books contexts:
    "The book is so repetitive that I found myself yelling …. I …"
    "A disappointment …. Ender was talked about for <#> pages altogether."

Slide 26: Clustering: Feature Vectors for Words
- Approximately 1000 pivots, 1 million feature words
- Feature words: fascinating, defective, repetitive, …
- Pivots: excellent, awful, terrible, …

Slide 27: K-Means in Pivot Space
[figure: words plotted in pivot space -- excellent, fantastic, works_well, the, blender, defective, repetitive, novel]

Slide 28: Real-valued Linear Projections
[figure: the same words projected onto a single line]
- Position along the line gives a real-valued soft notion of "polarity" for each word

Slide 29: Some Results
[bar chart -- Train: Books, Test: Kitchen Appliances; accuracies 74.5, 78.9, and 87.7; legend: Supervised vs. Semi-supervised]

Slide 30: Learned Polarity Clusters

           negative                               positive
  books:   plot, <#>_pages, predictable           fascinating, grisham
  kitchen: poorly_designed, awkward_to,           espresso, years_now,
           the_plastic, leaking                   are_perfect, a_breeze
```