Parametric Mixture Models for Document Clustering and Topic Extraction



 Sepandar Kamvar and Christopher Manning

 Depts of Computer Science and Linguistics
 http://www.stanford.edu/~manning/
 manning@cs.stanford.edu
Document clustering

An information access tool
Useful for exploratory data analysis

E.g. from biology: assigning biological functions
 to genes (for which one might have other
 genomic or microarray evidence): identify the
 shared roles of genes and the relationships
 between genes and disease

Attributes: high dimensional, sparse count data
Dyadic clustering: words
and documents

Simultaneously cluster documents based on the
 words they contain, and the words central to those
 documents
This is potentially very useful for noticing
 the commonalities in a cluster
Documents                            Words
•Type I diabetes as a chronic …      •immunology
•Humoral autoimmune aspects of …     •diabetes
•Autoantibodies in type 1 diabetes   •autoantibodies
•The “natural” history of type I …   •diabetes mellitus
                                     •insulin-dependent
Techniques for clustering:
(1) Traditional clustering

 Hierarchical divisive or
  agglomerative clustering
  techniques:
   Single link, complete
     link, group average
 Central clustering:
   Vector quantization
     (LBG-VQ), K-means
Model-based: Probabilistic
latent variable models

Hofmann (1999) (& Puzicha): PLSA

Graphical model: the latent class Z generates both W and D (W ← Z → D)

  $P(w, d) = \sum_z P(z)\, P(w \mid z)\, P(d \mid z)$


Factorizes dyadic
data via small set
of hidden variables
Non-negative Matrix
Factorization

Lee and Seung (1999)

Represent a document by a
non-negative matrix factorization
        A≈WH
where W, H are constrained to
be non-negative.
Gives a part-based representation.
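
A minimal sketch of how such a factorization might be computed. The slide does not say which objective or update rule is used, so the multiplicative updates for the squared-error objective below are an assumed choice, and the function name `nmf`, the toy matrix, and all parameters are hypothetical.

```python
import numpy as np

def nmf(A, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative m x n matrix A as W @ H with W, H >= 0.

    Uses multiplicative updates for the squared-error objective
    ||A - WH||_F^2 (an assumed choice; the slides do not name one).
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps   # m x k basis, strictly positive start
    H = rng.random((k, n)) + eps   # k x n representation
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so signs never flip.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data (hypothetical): a 6 x 5 non-negative matrix of rank 2.
rng = np.random.default_rng(1)
A = rng.random((6, 2)) @ rng.random((2, 5))
W, H = nmf(A, k=2)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```

Because every update only rescales entries by non-negative factors, the non-negativity constraint is maintained without any explicit projection step.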
Aims

The vastly disparate approaches to clustering
 that have been explored:
  proximity-based, model-based, central
   clustering, hierarchical divisive and
   agglomerative, non-negative matrix
   factorizations
… make it difficult to compare techniques,
 except empirically
What kind of relationships are there?
Aims

A framework for cluster analysis which is:
  model-based: a subclass of mixture models
  Allows both probabilistic model-based and
    linear algebra views of clustering
  General enough to include common methods
To examine the particular nature of NLP
 clusters, looking at Medline
  Practical effectiveness of different methods
Clustering problems

Fundamental problems:
  Defining a suitable objective function Φ
      Best viewed in terms of probabilistic model
      Gives: Clustering paradigm
  Defining an efficient optimization algorithm
   to extremize Φ
      General optimization problem
        • Hard. Need approximate solutions
      Gives: Clustering algorithm
 Text Document Vectors
D1: Apolipoprotein E: a gene associated with many diseases
D2: The role of Apolipoprotein E in Alzheimer’s disease
D3: Genetics, Disease and Brain Injury: Apolipoprotein E
D4: Pain, Alzheimer’s disease, and other diseases
D5: Apolipoprotein E and Multiple Sclerosis

Document Vector for D4 (raw counts, then length-normalized) and Term-Document Matrix:

Term                             D4 count  D4 normalized     D1    D2    D3    D4    D5
T1:  apolipoprotein                  0          0           .447  .447  .408  0     .5
T2:  E                               0          0           .447  .447  .408  0     .5
T3:  gene, genetics                  0          0           .447  0     .408  0     0
T4:  associate, associated           0          0           .447  0     0     0     0
T5:  disease, diseases               2          .816        .447  .447  .408  .816  0
T6:  role                            0          0           0     .447  0     0     0
T7:  Alzheimer, Alzheimer’s          1          .408        0     .447  0     .408  0
T8:  brain                           0          0           0     0     .408  0     0
T9:  injury                          0          0           0     0     .408  0     0
T10: pain                            1          .408        0     0     0     .408  0
T11: multiple                        0          0           0     0     0     0     .5
T12: sclerosis                       0          0           0     0     0     0     .5
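
The document vectors above can be reproduced with a short sketch. The tokenizer and the hand-written mapping from surface forms to terms are simplifying assumptions made for illustration (the slides do not say how stemming or stop-word removal was done), and all names are hypothetical.

```python
import math

# Each term groups the surface forms that the slide maps to it.
TERMS = [
    ("apolipoprotein",), ("e",), ("gene", "genetics"),
    ("associate", "associated"), ("disease", "diseases"), ("role",),
    ("alzheimer", "alzheimer's"), ("brain",), ("injury",), ("pain",),
    ("multiple",), ("sclerosis",),
]
DOCS = [
    "Apolipoprotein E: a gene associated with many diseases",
    "The role of Apolipoprotein E in Alzheimer's disease",
    "Genetics, Disease and Brain Injury: Apolipoprotein E",
    "Pain, Alzheimer's disease, and other diseases",
    "Apolipoprotein E and Multiple Sclerosis",
]
LOOKUP = {form: i for i, forms in enumerate(TERMS) for form in forms}

def count_vector(text):
    """Raw term counts; words outside TERMS act as stop words."""
    counts = [0] * len(TERMS)
    for token in text.split():
        token = token.strip(",:.").lower()
        if token in LOOKUP:
            counts[LOOKUP[token]] += 1
    return counts

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

# One normalized column per document; matrix[t][d] is term t in document d.
columns = [normalize(count_vector(doc)) for doc in DOCS]
matrix = [[col[t] for col in columns] for t in range(len(TERMS))]
```

For D4 the raw counts are 2, 1, 1 for terms T5, T7, T10, so dividing by the norm sqrt(6) gives the .816 and .408 entries in the table.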
Model-based clustering

Data is assumed generated by a mixture model
Objective function Φ is likelihood P(D|Θ)
Here, make use of a limited class of parametric
 mixture models:
 Basis-membership mixture models
   (BMMMs!):
      P(D|Θ) = P(D|W,H)
        • W = representation vectors for each cluster
        • H = membership vectors for each data point
  By design, all models of this class correspond
   to matrix factorization problems D ≈ W H
Probabilistic models and
linear algebra

Hofmann – the PLSI promise: “In contrast to
 [LSA] which stems from linear algebra and
 performs a Singular Value Decomposition of co-
 occurrence tables, the proposed technique uses
 a generative latent class model to perform a
 probabilistic mixture decomposition. This results
 in a more principled approach.”
But: there’s an interesting large class of
 probabilistic mixture models which can also be
 seen as matrix factorizations
Advantages on both sides

Probabilistic mixture model formulation:
  Can choose clustering method for a data set
   in a principled way
  Explains different behavior of different
   models: they have an inherent bias
Matrix factorization formulation:
  Can use the full toolkit of constrained
   optimization methods for fast clustering
  Can examine convergence properties, etc.
Term-Document matrix

 $D_{id}$ = (normalized) weighted count of ($w_i$, $d_d$)

[Figure: the m × n matrix D, with one row per word $w_w$ and one column per document $d_d$]
Matrix Factorizations
  $D_{m \times n} = W_{m \times k}\, H_{k \times n}$
  W: basis; H: representation

$h_d$ is the representation of $d_d$ in terms of the basis W
If rank(W) ≥ rank(D) then we can always find H so that D = WH
If not, we find a reduced-rank approximation $\tilde{D}$
    Matrix Factorizations: SVD

  $D_{m \times n} = W_{m \times k}\, S_{k \times k}\, V^T_{k \times n}$
  W: basis; S: singular values; $V^T$: representation

    Best rank-k approximation according to the 2-norm
    Restrictions on representation: W, V orthonormal
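
A small numerical sketch of the rank-k claim, assuming NumPy; the random matrix and the helper name `truncated_svd` are hypothetical. It keeps the k largest singular triplets and checks that the 2-norm error equals the first discarded singular value (the Eckart-Young result).

```python
import numpy as np

def truncated_svd(D, k):
    """Return W (m x k), s (k,), Vt (k x n) for the k largest singular values."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
D = rng.random((6, 5))              # toy "term-document" matrix
k = 2
W, s, Vt = truncated_svd(D, k)
D_k = W @ np.diag(s) @ Vt           # rank-k approximation D ≈ W S V^T

all_singular_values = np.linalg.svd(D, compute_uv=False)
err_2norm = np.linalg.norm(D - D_k, 2)   # spectral norm of the residual
```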
Minimization Problem

Minimize
            $\lVert A - W S V^T \rVert$



Given:
 norm
     for SVD, the 2-norm
  constraints on W, V
     for SVD, W and V are orthonormal
Central clustering/vector
quantization = k-means

Goal: Vector data is to be quantized via set of k
 basis vectors which best represent them
We want basis vectors so as to maximize:
   $\Phi(D, W, H) = \sum_d S(d_d, w_{z(d)})$
A subclass of BMMM:
  H = P(z|d) ∈ {0,1}
  W = centroids
  D ≈ W H
Central clustering/vector
quantization: K-means
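  $D_{m \times n} = W_{m \times k}\, H_{k \times n}$
  W: basis; H: representation, e.g. one column of H is $(0, 1, 0, 0, 0)^T$

For k-means, $S(d_d, w_{z(d)}) = -\lVert d_d - w_{z(d)} \rVert^2$, i.e. minimize $\sum_d \lVert d_d - w_{z(d)} \rVert^2$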
Constraint: columns of H are unary
Same as mixture of Gaussians as variance → 0
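
To make the factorization view concrete, here is a hedged sketch of plain k-means written so that it returns W (centroids as columns) and a one-hot H. The function name, initialization scheme, and toy data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def kmeans_factorization(D, k, n_iter=50, seed=0):
    """k-means on the columns of D (m x n), returned as D ≈ W @ H.

    W is m x k (centroids); H is k x n with exactly one 1 per column.
    """
    rng = np.random.default_rng(seed)
    m, n = D.shape
    # Initialize centroids with k distinct documents (an arbitrary choice).
    W = D[:, rng.choice(n, size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every centroid to every document: k x n.
        dist = ((W[:, :, None] - D[:, None, :]) ** 2).sum(axis=0)
        z = dist.argmin(axis=0)              # hard assignment z(d)
        H = np.zeros((k, n))
        H[z, np.arange(n)] = 1.0             # one-hot columns
        for c in range(k):
            if (z == c).any():               # leave empty clusters unchanged
                W[:, c] = D[:, z == c].mean(axis=1)
    return W, H

# Toy data (hypothetical): two well-separated groups of two documents each.
D = np.array([[0.0, 0.1, 5.0, 5.1],
              [0.0, 0.1, 5.0, 5.1]])
W, H = kmeans_factorization(D, k=2)
approx = W @ H    # each document is replaced by its cluster centroid
```

The product `W @ H` simply copies the assigned centroid into each document's column, which is the sense in which hard clustering is a constrained matrix factorization.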
    Central clustering

  $D_{m \times n} = W_{m \times k}\, S_{k \times k}\, V^T_{k \times n}$
  W: basis
  S: diagonal matrix of cluster priors P(z), e.g. diag(.3, .1, .3, .2, .1)
  $V^T$: stochastic matrix P(d | z), e.g. one row is (0 0 0 .5 0 0 0 .5 0 0)

    If we normalize D so columns sum to 1, then $W^T$ is stochastic:
    W = P(w | z)
Probabilistic latent
variable models

Hofmann (& Puzicha): PLSA

Graphical model: W ← Z → D

$P(w, d) = \sum_z P(z)\, P(w \mid z)\, P(d \mid z)$

$D_{wd} = P(w, d)$, $W_{wz} = P(w \mid z)$, $H_{zd} = P(d, z)$
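
A compact sketch of fitting this model with EM, assuming NumPy and the standard (untempered) EM updates; the slide does not say which fitting procedure was used, so the updates are an assumed choice. The function name `plsa` and the toy count matrix are hypothetical.

```python
import numpy as np

def plsa(N, k, n_iter=100, seed=0):
    """Fit P(w,d) = sum_z P(z) P(w|z) P(d|z) to an m x n count matrix N.

    Returns Pz (k,), Pw_z (m x k), Pd_z (n x k).
    """
    rng = np.random.default_rng(seed)
    m, n = N.shape
    Pz = np.full(k, 1.0 / k)
    Pw_z = rng.random((m, k))
    Pw_z /= Pw_z.sum(axis=0)
    Pd_z = rng.random((n, k))
    Pd_z /= Pd_z.sum(axis=0)
    tiny = 1e-12
    for _ in range(n_iter):
        # E-step: posterior P(z | w, d), an m x n x k array.
        joint = Pz[None, None, :] * Pw_z[:, None, :] * Pd_z[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + tiny)
        # M-step: re-estimate parameters from expected counts.
        weighted = N[:, :, None] * post
        Pw_z = weighted.sum(axis=1)
        Pd_z = weighted.sum(axis=0)
        Pz = weighted.sum(axis=(0, 1))
        Pw_z /= Pw_z.sum(axis=0, keepdims=True) + tiny
        Pd_z /= Pd_z.sum(axis=0, keepdims=True) + tiny
        Pz /= Pz.sum()
    return Pz, Pw_z, Pd_z

# Toy counts (hypothetical): two blocks of words and documents.
N = np.array([[5, 4, 0, 0],
              [4, 5, 0, 0],
              [0, 0, 5, 4],
              [0, 0, 4, 5]], dtype=float)
Pz, Pw_z, Pd_z = plsa(N, k=2)
# Matrix form: D ≈ W S V^T with W = P(w|z), S = diag(P(z)), V = P(d|z).
D_hat = Pw_z @ np.diag(Pz) @ Pd_z.T
```

Here `np.diag(Pz) @ Pd_z.T` plays the role of H, whose entries are P(d, z).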
    “PLSA”

  $D_{m \times n} = W_{m \times k}\, S_{k \times k}\, V^T_{k \times n}$
  D: data P(w, d)
  W: stochastic basis P(w | z)
  S: diagonal matrix of cluster priors P(z), e.g. diag(.3, .1, .3, .2, .1)
  $V^T$: stochastic matrix P(d | z), e.g. one row is (0 .1 0 .3 0 0 .2 .3 0 .1); sparse (empirically)

    We simply relax the constraint on columns of V
Non-negative Matrix
Factorization
  $D_{m \times n} = W_{m \times k}\, H_{k \times n}$
  W: basis; H: representation, e.g. one column of H is $(.4, 0, 2, 0, .7)^T$


The same except we remove the stochastic
 constraints on D, H: H just has to be non-negative
 – if D is also non-negative, non-negativity of W
 follows, otherwise it’s inappropriate
Likelihood = factorization




       – also, diagonally scaled gradient ascent = EM
Pereira, Tishby and Lee
(1993)

The same form of aspect model:
  $\hat{p}(v \mid n) = \sum_{z \in Z} p(z \mid n)\, p(v \mid z)$
  $\hat{p}(v, n) = p(n) \sum_{z \in Z} p(z \mid n)\, p(v \mid z) = \sum_{z \in Z} p(z)\, p(n \mid z)\, p(v \mid z)$
(Assuming model and observed p(x) are equal.)
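
The equality of the two forms of the joint follows from Bayes' rule applied to z and n. The step below is a sketch added for clarity and is not spelled out on the slide:

```latex
\begin{aligned}
p(n)\, p(z \mid n) &= p(z, n) = p(z)\, p(n \mid z) \\
\Rightarrow\quad p(n) \sum_{z \in Z} p(z \mid n)\, p(v \mid z)
  &= \sum_{z \in Z} p(z)\, p(n \mid z)\, p(v \mid z)
\end{aligned}
```

With (v, n) in place of (w, d), the right-hand side has the same symmetric form as the PLSA decomposition of P(w, d).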
Pereira, Tishby and Lee
(1993)

Objective function Φ differs by not being simply
 maximum likelihood, but regularized
Their aim is to discourage a part-based
 representation by keeping cluster membership
 as uniform as possible via a maximum entropy
 criterion, minimizing:
    $\mathrm{Distortion}(P(v \mid n), P(v \mid z)) - H(z \mid n) / \beta$
Proximity-based clustering

Here, equivalences are only approximate
But the clustering would be equivalent in the
 limit of dense data generated by the assumed
 probabilistic model
Complete link HAC

At each stage, merge closest clusters, where
 closeness is defined as maximum pairwise
 distance of elements
Minimizes $\sum_z \max_{d_i, d_j \in z} \lVert d_i - d_j \rVert$
Approximately equivalent to k-means:
  As variance → 0, very likely that nearest
    points in same cluster nearer than distance
    between clusters
  So maximizes likelihood of same model
Single link HAC

At each stage, merge closest clusters, where
 closeness is defined as minimum pairwise
 distance of elements
Minimizes $\sum_z \max_{d_i \in z} \min_{d_j \in z,\, d_j \neq d_i} \lVert d_i - d_j \rVert$
Approximately equivalent to data generated by
 a branching random walk
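
The two agglomerative criteria differ only in how the distance between clusters is aggregated. The naive sketch below (pure Python, cubic-time, for illustration only) makes that explicit; the function name `hac` and the toy points are hypothetical.

```python
import math

def hac(points, k, linkage="complete"):
    """Merge clusters bottom-up until k remain; returns lists of point indices.

    linkage="complete": cluster distance is the maximum pairwise distance.
    linkage="single":   cluster distance is the minimum pairwise distance.
    """
    aggregate = max if linkage == "complete" else min
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(a, b):
        return aggregate(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the closest pair of clusters under the chosen linkage.
        a, b = min(
            ((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
            key=lambda pair: cluster_dist(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy data (hypothetical): points on a line with slowly growing gaps.
pts = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,)]
single = hac(pts, 2, linkage="single")
complete = hac(pts, 2, linkage="complete")
```

On this chain-like data single link keeps extending one long cluster, while complete link prefers two compact groups, which illustrates why the two methods suit different generative assumptions.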
Principal Direction Divisive
Partitioning (Boley, 1998)

Recursively partitions a data set
At each stage, divides the least cohesive cluster
 (the one with the largest scatter value) into two subsets
The principal direction of this cluster is found
The cluster is partitioned by a plane perpendicular
 to the principal direction, and passing through
 the center of gravity of the cluster
The result is a hierarchical structure of subsets
 arranged into a binary tree
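
A single PDDP split can be sketched as follows, assuming NumPy. This shows only one division step, not the full recursive procedure or the choice of which cluster to split next; the function name `pddp_split` and the toy data are hypothetical.

```python
import numpy as np

def pddp_split(D):
    """Split the columns of D (documents) into two subsets.

    Documents are projected onto the principal direction of the centered
    matrix; the sign of the projection says which side of the plane through
    the centroid each document falls on.
    """
    centroid = D.mean(axis=1, keepdims=True)
    A = D - centroid                          # centered term-document matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    direction = U[:, 0]                       # principal direction
    projection = direction @ A                # signed distance to the plane
    left = np.where(projection <= 0)[0]
    right = np.where(projection > 0)[0]
    return left, right, direction

# Toy data (hypothetical): two groups separated along the first coordinate.
D = np.array([[0.0, 0.2, 0.1, 5.0, 5.2, 5.1],
              [1.0, 1.1, 0.9, 1.0, 0.9, 1.1]])
left, right, direction = pddp_split(D)
```

The sign of a singular vector is arbitrary, so which group ends up as `left` or `right` is not meaningful; only the partition is.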
PDDP Example

[Figure: principal direction and splitting plane for example points labeled “Ships” and “Cats”; the data are centered by subtracting the mean from each point]

 The first singular vector (or principal
 direction) of the centered term-document
 matrix indicates the direction
 of greatest variance.
Direction of Greatest
Variance

Covariance matrix
  $C = (D - m e^T)(D - m e^T)^T = A A^T$
      A = centered term-document matrix
      m = mean/centroid = $(d_1 + d_2 + d_3 + \dots + d_n)/n$
      $e^T = [1\ 1\ \dots\ 1]$
Direction of greatest variance in data set:
 eigenvector corresponding to largest eigenvalue
 of covariance matrix = left singular vector of A for its largest singular value
$C = A A^T = (U S V^T)(U S V^T)^T = (U S V^T)(V S^T U^T) = U S^2 U^T$
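
The identity can be checked numerically. The sketch below assumes NumPy and uses a random matrix as stand-in data; variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((5, 8))                    # toy term-document matrix
m = D.mean(axis=1, keepdims=True)         # centroid of the documents
A = D - m                                 # same as D - m e^T by broadcasting
C = A @ A.T                               # (unnormalized) covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
top_eigvec = eigvecs[:, -1]               # direction of greatest variance

U, s, Vt = np.linalg.svd(A, full_matrices=False)
first_left_singular_vector = U[:, 0]

# Eigenvectors are defined only up to sign, so compare via |cosine|.
alignment = abs(top_eigvec @ first_left_singular_vector)
```

In practice this means the principal direction can be read off the SVD of A directly, without ever forming C.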
How could this possibly
work?

If at each stage a cluster was a grid of
 Gaussians, then this would find maximum
 likelihood model. But it’s not in general:
Choosing the right model:
random walk data
Random walk data,
spherical gaussian model
Data
Mixture of two Gaussians
K-means learns near
perfectly
Single-link crashes and
burns
Evaluation on text

It’s not straightforward to evaluate the soft
 clustering on text. Hofmann shows that clusters
 ‘make sense’, and that it can be used for
 smoothing in IR like LSA (but better)
           reflect    increase
           indicate   decline
           show       change
           follow     rise
           cite       improvement
But for clustering documents?
How not to think of text
clustering…

[Figure labels: big distance between cluster means; small cluster diameter]
Two Medline document
clusters

Two document clusters (hand-curated):
  1. Documents relating to the role of Human
   Herpesvirus 6 (HHV-6) in the pathogenesis of
   Multiple Sclerosis
  2. Documents relating to use of Beta Interferons
   (IFN-beta) in the treatment of MS.
Two Medline document
clusters

Recently, IFN-beta has been hypothesized to
 decrease the immune response to viral infection
 by HHV-6
So these overlap, because some articles in one
 cluster mention the other (but a minority)
Commonest words greatly overlap:
  patients, MS, multiple sclerosis, disease,
But some differentiate:
  ifn-beta, interferon vs. dna, herpesvirus
Two Medline document
clusters

Ave distance from mean for collection: 0.869
Ave dist. from mean for HHV-6:         0.776
Ave dist. from mean for Ifn-Beta:      0.868
Dist. between Diabetes and MS means: 0.550

All documents are far apart (remember that
   they’re length normalized)
The documents are overlapping, and it doesn’t
   look like there are clusters at all
The two clusters (bad view)
Two Medline document
clusters

Canonical angle between subspaces B and C is
 1.557 (very nearly $\pi/2 \approx 1.571$) [$\arccos(B^T C)$]
  So the document subspaces are almost
    orthogonal
Clusters are characterized by a very small
 number of dimensions in which all documents in
 the cluster are linearly dependent, but in which
 all documents in the cluster are orthogonal to all
 documents in other clusters
Examining the singular values shows that the clusters are ovoid
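
One common way to compute canonical (principal) angles between two subspaces is through the singular values of the product of orthonormal bases. The sketch below assumes NumPy and that B and C have full column rank; it is not necessarily the exact computation behind the 1.557 figure, and the function name and test matrices are hypothetical.

```python
import numpy as np

def canonical_angles(B, C):
    """Principal angles (radians, ascending) between span(B) and span(C).

    B and C hold basis vectors as columns and are assumed to have
    full column rank.
    """
    Qb, _ = np.linalg.qr(B)                   # orthonormal basis for span(B)
    Qc, _ = np.linalg.qr(C)                   # orthonormal basis for span(C)
    cosines = np.linalg.svd(Qb.T @ Qc, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

# Orthogonal coordinate subspaces: every angle is pi/2.
B = np.eye(4)[:, :2]
C = np.eye(4)[:, 2:]
orthogonal_angles = canonical_angles(B, C)

# Two lines in the plane at 45 degrees.
diagonal_angle = canonical_angles(np.array([[1.0], [0.0]]),
                                  np.array([[1.0], [1.0]]))
```

When the bases are already orthonormal the QR step changes nothing, and the computation reduces to the arccosine of the singular values of $B^T C$.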
The two clusters (good view)
Medline document clusters

For this sort of small data set PDDP wins…
Reporting MI between class and clusters
   CL HAC         0.02
   PLSI           0.54
   Kmeans         0.70
   PDDP           0.74
(note that MI is a rather harsh measure: PDDP
  has an “accuracy” of 95%)
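
For reference, mutual information between class labels and cluster labels can be computed as in the sketch below. The slides do not state the log base or any normalization, so base 2 and no normalization are assumptions, and the example label lists are hypothetical.

```python
import math
from collections import Counter

def mutual_information(classes, clusters):
    """Mutual information (in bits) between two equal-length label lists."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))
    class_counts = Counter(classes)
    cluster_counts = Counter(clusters)
    mi = 0.0
    for (c, k), n_ck in joint.items():
        # p(c,k) * log2( p(c,k) / (p(c) p(k)) ), written with raw counts.
        mi += (n_ck / n) * math.log2(n_ck * n / (class_counts[c] * cluster_counts[k]))
    return mi

# Perfect agreement on two balanced classes gives 1 bit.
perfect = mutual_information([0, 0, 1, 1], [1, 1, 0, 0])

# Hypothetical balanced case with 95% of each class in "its" cluster.
classes = [0] * 100 + [1] * 100
clusters = [0] * 95 + [1] * 5 + [0] * 5 + [1] * 95
mostly_right = mutual_information(classes, clusters)
```

In the hypothetical balanced example, 95% agreement already drops the score to roughly 0.71 bits, which illustrates why MI is a harsh measure compared with accuracy.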
OHSUMED

Larger study: using hand-judged documents in
 the OHSUMED corpus (Hersh et al. 1994; MedLine
 records from 1987–1991; also used in TREC 9)
  1. K-means
  2. PDDP
  3. PLSI
Although the dyadic nature of PLSI seems
 appealing, it hasn’t been effective when
 evaluated on hard clustering – but hopefully
 effective for finer grained topics [TBA]
Results

The hierarchical agglomerative clustering
 methods (SL and CL) again perform poorly – the
 data is very sparse, not dense
PDDP works well on this data: because of the
 high dimensionality, clusters are normally
 linearly separable, and other clusters are on
 different dimensions
  cf. the success of SVMs for text classification
  better if use extra clusters? [Boley does this]
      clusters are sometimes split
Results

K-means is simple but good…
  Given the shape of clusters, one would hope
   to get value from not having equal spherical
   covariance matrices. I’ve tried a little with
   ‘ovoid pancakes’ but no success so far (too
   many parameters?)
The End

Basis-Membership mixture models connect recent
 model-based clustering methods, and can provide
 an intuitive understanding of traditional clustering
The close connection with linear algebra methods
 connects to LSA etc., and efficient techniques
The simple parameterization is good for text data
Future:
  Show that one can exploit nature of document
   clusters to improve clustering
  Examine theme-based effectiveness of PLSA
Thank
 you!

						