Parametric Mixture Models for
Document Clustering and Topic
Extraction
Sepandar Kamvar and Christopher Manning
Depts of Computer Science and Linguistics
http://www.stanford.edu/~manning/
manning@cs.stanford.edu
Document clustering
An information access tool
Useful for exploratory data analysis
E.g. from biology: assigning biological functions
to genes (for which one might have other
genomic or microarray evidence): identify the
shared roles of genes and the relationships
between genes and disease
Attributes: high dimensional, sparse count data
Dyadic clustering: words
and documents
Cluster simultaneously documents based on
words they contain, and words central to those
documents
This is potentially particularly useful for noticing
the commonalities in a cluster
Documents Words
•Type I diabetes as a chronic … •immunology
•Humoral autoimmune aspects of … •diabetes
•Autoantibodies in type 1 diabetes •autoantibodies
•The “natural” history of type I … •diabetes mellitus
•insulin-dependent
Techniques for clustering:
(1) Traditional clustering
Hierarchical divisive or
agglomerative clustering
techniques:
Single link, complete
link, group average
Central clustering:
Vector quantization
(LGB-VQ), K-means
Model-based: Probabilistic
latent variable models
Hofmann (1999) (& Puzicha): PLSA
Latent class Z generates both the word W and the document D:
$P(w, d) = \sum_z P(z)\, P(w \mid z)\, P(d \mid z)$
Factorizes dyadic
data via small set
of hidden variables
Non-negative Matrix
Factorization
Lee and Seung (1999)
Represent a document by a
non-negative matrix factorization
A≈WH
where W, H are constrained to
be non-negative.
Gives a part-based representation.
Aims
The vastly disparate approaches to clustering
that have been explored:
proximity-based, model-based, central
clustering, hierarchical divisive and
agglomerative, non-negative matrix
factorizations
… make it difficult to compare techniques,
except empirically
What kind of relationships are there?
Aims
A framework for cluster analysis which is:
model-based: a subclass of mixture models
Allows both probabilistic model-based and
linear algebra views of clustering
General enough to include common methods
To examine the particular nature of NLP
clusters, looking at Medline
Practical effectiveness of different methods
Clustering problems
Fundamental problems:
Defining a suitable objective function Φ
Best viewed in terms of probabilistic model
Gives: Clustering paradigm
Defining an efficient optimization algorithm
to extremize Φ
General optimization problem
• Hard. Need approximate solutions
Gives: Clustering algorithm
Text Document Vectors
D1: Apolipoprotein E: a gene associated with many diseases
D2: The role of Apolipoprotein E in Alzheimer’s disease
D3: Genetics, Disease and Brain Injury: Apolipoprotein E
D4: Pain, Alzheimer’s disease, and other diseases
D5: Apolipoprotein E and Multiple Sclerosis
Document vector for D4 (raw counts, then length-normalized) and term-document matrix:
Term                         D4 raw  D4 norm |   D1    D2    D3    D4    D5
T1:  apolipoprotein             0      0     |  .447  .447  .408   0    .5
T2:  E                          0      0     |  .447  .447  .408   0    .5
T3:  gene, genetics             0      0     |  .447   0    .408   0     0
T4:  associate, associated      0      0     |  .447   0     0     0     0
T5:  disease, diseases          2     .816   |  .447  .447  .408  .816   0
T6:  role                       0      0     |   0    .447   0     0     0
T7:  Alzheimer, Alzheimer’s     1     .408   |   0    .447   0    .408   0
T8:  brain                      0      0     |   0     0    .408   0     0
T9:  injury                     0      0     |   0     0    .408   0     0
T10: pain                       1     .408   |   0     0     0    .408   0
T11: multiple                   0      0     |   0     0     0     0    .5
T12: sclerosis                  0      0     |   0     0     0     0    .5
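A minimal Python sketch of how the length-normalized columns above could be computed. The term map (which word forms collapse onto which term) and the tokenization are assumptions made for illustration; the slides do not say how terms were extracted.

```python
import math

# Titles D1..D5 from the slide.
docs = [
    "Apolipoprotein E: a gene associated with many diseases",
    "The role of Apolipoprotein E in Alzheimer's disease",
    "Genetics, Disease and Brain Injury: Apolipoprotein E",
    "Pain, Alzheimer's disease, and other diseases",
    "Apolipoprotein E and Multiple Sclerosis",
]

# Hypothetical term map: word form -> term index (T1..T12 become 0..11).
term_of = {
    "apolipoprotein": 0, "e": 1, "gene": 2, "genetics": 2,
    "associate": 3, "associated": 3, "disease": 4, "diseases": 4,
    "role": 5, "alzheimer": 6, "alzheimer's": 6, "brain": 7,
    "injury": 8, "pain": 9, "multiple": 10, "sclerosis": 11,
}

def doc_vector(text):
    """Count known terms in a title, then scale the vector to unit length."""
    counts = [0.0] * 12
    for token in text.lower().replace(",", " ").replace(":", " ").split():
        if token in term_of:
            counts[term_of[token]] += 1.0
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts]

# columns[j] is the column of the term-document matrix for document D(j+1).
columns = [doc_vector(d) for d in docs]
```

Words outside the term map (for example "a", "with", "many") are simply ignored here, which plays the role of a stop list.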
Model-based clustering
Data is assumed generated by a mixture model
Objective function Φ is likelihood P(D|Θ)
Here, make use of a limited class of parametric
mixture models:
Basis-membership mixture models
(BMMMs!):
P(D|Θ) = P(D|W,H)
• W = representation vectors for each cluster
• H = membership vectors for each data point
By design, all models of this class correspond
to matrix factorization problems D ≈ W H
Probabilistic models and
linear algebra
Hofmann – the PLSI promise: “In contrast to
[LSA] which stems from linear algebra and
performs a Singular Value Decomposition of co-
occurrence tables, the proposed technique uses
a generative latent class model to perform a
probabilistic mixture decomposition. This results
in a more principled approach.”
But: there’s an interesting large class of
probabilistic mixture models which can also be
seen as matrix factorizations
Advantages on both sides
Probabilistic mixture model formulation:
Can choose clustering method for a data set
in a principled way
Explains different behavior of different
models: they have an inherent bias
Matrix factorization formulation:
Can use the full toolkit of constrained
optimization methods for fast clustering
Can examine convergence properties, etc.
Term-Document matrix
$D_{id}$ = (normalized) weighted count of word $w_i$ in document $d_d$
D is m × n: the m rows are words $w_w$, the n columns are documents $d_d$
Matrix Factorizations
D = W × H
(m × n) = (m × k) × (k × n)
W: Basis; H: Representation
$h_d$ is the representation of $d_d$ in terms of the basis W
If rank(W) ≥ rank(D) then we can always find H so that D = WH
If not, we find a reduced rank approximation $\tilde{D}$
Matrix Factorizations: SVD
$D = W \times S \times V^T$
(m × n) = (m × k) × (k × k) × (k × n)
W: Basis; S: Singular Values; $V^T$: Representation
Best rank k approximation according to 2-norm
Restrictions on representation: W, V orthonormal
Minimization Problem
Minimize $\lVert A - W S V^T \rVert$
Given:
norm
for SVD, the 2-norm
constraints on W, V
for SVD, W and V are orthonormal
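As an illustration of the SVD case, here is a short NumPy sketch of the rank-k approximation. The matrix is random toy data, and the variable names are chosen here to mirror the slide notation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((8, 6))            # toy m x n matrix (made-up data)
k = 2

U, s, Vt = np.linalg.svd(D, full_matrices=False)
W = U[:, :k]                      # m x k basis with orthonormal columns
S = np.diag(s[:k])                # k x k diagonal matrix of singular values
Vk_t = Vt[:k, :]                  # k x n representation with orthonormal rows
D_k = W @ S @ Vk_t                # rank-k approximation of D

# In the 2-norm, the approximation error is the first discarded singular value.
err = np.linalg.norm(D - D_k, 2)
```

Changing the norm or the constraints on W and V in the minimization problem gives the other factorizations discussed in these slides.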
Central clustering/vector
quantization = k-means
Goal: Vector data is to be quantized via set of k
basis vectors which best represent them
We want basis vectors so as to maximize:
$\Phi(D, W, H) = \sum_d S(d_d, w_{z(d)})$
A subclass of BMMM:
H = P(z|d) = {0,1}
W = centroids
D ≈ W H
Central clustering/vector
quantization: K-means
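D = W × H
(m × n) = (m × k) × (k × n)
W: Basis; H: Representation, with a single 1 in each column, e.g. $(0, 1, 0, 0, 0)^T$
For k-means: $\sum_d \lVert d_d - w_{z(d)} \rVert^2$ (minimized)
Constraint: columns of H are unary
Same as mixture of Gaussians as variance → 0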
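The factorization view of k-means can be sketched with plain Lloyd iterations in NumPy. The function name `kmeans_factorization`, the initialization from random data columns, and the toy data are assumptions for illustration, not details from the slides.

```python
import numpy as np

def kmeans_factorization(D, k, n_iter=20, seed=0):
    """Lloyd's algorithm on the columns of D; returns W (m x k) and one-hot H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = D.shape
    W = D[:, rng.choice(n, size=k, replace=False)].copy()   # centroids start as data points
    for _ in range(n_iter):
        # Squared distance from every centroid to every document: k x n.
        dist = ((W[:, :, None] - D[:, None, :]) ** 2).sum(axis=0)
        z = dist.argmin(axis=0)                 # z(d): hard cluster assignment
        H = np.zeros((k, n))
        H[z, np.arange(n)] = 1.0                # unary columns: one 1 per document
        for c in range(k):
            if H[c].sum() > 0:                  # leave an empty cluster's centroid unchanged
                W[:, c] = D[:, z == c].mean(axis=1)
    return W, H

# Toy data: two tight, well-separated groups of five 2-d points each.
rng = np.random.default_rng(1)
D = np.hstack([rng.normal(0.0, 0.1, (2, 5)), rng.normal(5.0, 0.1, (2, 5))])
W, H = kmeans_factorization(D, k=2)
D_approx = W @ H                                # every column replaced by its centroid
```

Because each column of H picks out exactly one column of W, the product W H is the quantized data set.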
Central clustering
$D = W \times S \times V^T$
(m × n) = (m × k) × (k × k) × (k × n)
W: Basis; S: Cluster priors P(z) (diagonal); $V^T$: Stochastic matrix P(d | z)
If we normalize D so columns sum to 1, then $W^T$ is stochastic:
W = P(w | z)
Probabilistic latent
variable models
Hofmann (& Puzicha): PLSA
Latent class Z generates both the word W and the document D:
$P(w, d) = \sum_z P(z)\, P(w \mid z)\, P(d \mid z)$
$D_{wd} = P(w, d)$, $W_{wz} = P(w \mid z)$, $H_{zd} = P(d, z)$
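A compact NumPy sketch of EM for this aspect model, ending with the W and H matrices named above. The function name `plsa`, the random initialization, and the small count matrix are assumptions for illustration. It uses plain EM rather than the tempered variant.

```python
import numpy as np

def plsa(N, k, n_iter=50, seed=0):
    """EM for the symmetric aspect model on a word-by-document count matrix N (m x n)."""
    rng = np.random.default_rng(seed)
    m, n = N.shape
    Pz = np.full(k, 1.0 / k)                     # P(z)
    Pw_z = rng.random((m, k))
    Pw_z /= Pw_z.sum(axis=0)                     # P(w|z): each column sums to 1
    Pd_z = rng.random((n, k))
    Pd_z /= Pd_z.sum(axis=0)                     # P(d|z): each column sums to 1
    for _ in range(n_iter):
        # E-step: P(z|w,d) is proportional to P(z) P(w|z) P(d|z); shape m x n x k.
        joint = Pz[None, None, :] * Pw_z[:, None, :] * Pd_z[None, :, :]
        post = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)
        # M-step: re-estimate all parameters from expected counts n(w,d) P(z|w,d).
        expected = N[:, :, None] * post
        Pw_z = expected.sum(axis=1)              # m x k, unnormalized
        Pd_z = expected.sum(axis=0)              # n x k, unnormalized
        Pz = Pw_z.sum(axis=0)                    # total expected count per z
        Pw_z = Pw_z / np.maximum(Pz, 1e-12)
        Pd_z = Pd_z / np.maximum(Pz, 1e-12)
        Pz = Pz / Pz.sum()
    return Pz, Pw_z, Pd_z

# Toy counts: words 0-1 occur in documents 0-1, words 2-3 in documents 2-3.
N = np.array([[4, 3, 0, 0],
              [3, 4, 0, 0],
              [0, 0, 5, 2],
              [0, 0, 2, 5]], dtype=float)
Pz, Pw_z, Pd_z = plsa(N, k=2)
W = Pw_z                                         # W_wz = P(w|z)
H = Pz[:, None] * Pd_z.T                         # H_zd = P(d, z) = P(z) P(d|z)
P_wd = W @ H                                     # model estimate of P(w, d)
```

The last three lines are the point of the example: the fitted probabilistic model is read off directly as a non-negative factorization of the normalized count matrix.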
“PLSA”
$D = W \times S \times V^T$
(m × n) = (m × k) × (k × k) × (k × n)
D: Data P(w, d); W: Stochastic basis P(w | z); S: Cluster priors P(z); $V^T$: Stochastic matrix P(d | z)
Sparse!
We simply relax the constraint on columns of V (empirically)
Non-negative Matrix
Factorization
D = W × H
(m × n) = (m × k) × (k × n)
W: Basis; H: Representation (non-negative entries)
The same except we remove the stochastic constraints
on D, H: H just has to be non-negative
– if D is also non-negative, non-negativity of W
follows, otherwise it’s inappropriate
Likelihood = factorization
– also, diagonally scaled gradient ascent = EM
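One way to make the NMF factorization concrete is the Lee and Seung multiplicative update rule for the generalized KL divergence, sketched below in NumPy. Treating this divergence as the objective is an assumption here; the slide does not state which variant is meant, and the names `nmf_kl` and `kl_div` are made up for the example.

```python
import numpy as np

def nmf_kl(D, k, n_iter=200, seed=0, eps=1e-12):
    """Multiplicative updates that decrease the generalized KL divergence D(D || WH)."""
    rng = np.random.default_rng(seed)
    m, n = D.shape
    W = rng.random((m, k)) + 0.1                 # strictly positive start
    H = rng.random((k, n)) + 0.1
    for _ in range(n_iter):
        R = D / (W @ H + eps)                    # elementwise ratio D / (WH)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
        R = D / (W @ H + eps)
        W *= (R @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def kl_div(D, A, eps=1e-12):
    """Generalized KL divergence between non-negative matrices D and A."""
    return float(np.sum(D * np.log((D + eps) / (A + eps)) - D + A))

D = np.random.default_rng(3).random((6, 5))      # toy non-negative data
W, H = nmf_kl(D, k=2)
fit = kl_div(D, W @ H)
```

Each update multiplies by a non-negative factor, so W and H stay non-negative without any explicit projection step.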
Pereira, Tishby and Lee
(1993)
The same form of aspect model:
$\hat{p}(v \mid n) = \sum_{z \in Z} p(z \mid n)\, p(v \mid z)$
$\hat{p}(v, n) = p(n) \sum_{z \in Z} p(z \mid n)\, p(v \mid z) = \sum_{z \in Z} p(z)\, p(n \mid z)\, p(v \mid z)$
(Assuming model and observed p(x) are equal.)
Pereira, Tishby and Lee
(1993)
Objective function Φ differs by not being simply
maximum likelihood, but regularized
Their aim is to minimize a part-based
representation by keeping cluster membership
as uniform as possible by a maximum entropy
criterion:
$\mathrm{Distortion}(P(v \mid n) \,\|\, P(v \mid z)) - H(z \mid n)/\beta$
Proximity-based clustering
Here, equivalences are only approximate
But the clustering would be equivalent in the
limit of dense data generated by the assumed
probabilistic model
Complete link HAC
At each stage, merge closest clusters, where
closeness is defined as maximum pairwise
distance of elements
Minimizes $\sum_z \max_{d_i, d_j \in z} \lVert d_i - d_j \rVert$
Approximately equivalent to k-means:
As variance → 0, it is very likely that points in the
same cluster are nearer to each other than the distance
between clusters
So maximizes likelihood of same model
Single link HAC
At each stage, merge closest clusters, where
closeness is defined as minimum pairwise
distance of elements
Minimizes $\sum_z \max_{d_i \in z} \min_{d_j \ne d_i,\, d_j \in z} \lVert d_i - d_j \rVert$
Approximately equivalent to assuming data generated by
a branching random walk
Principal Direction Divisive
Partitioning (Boley, 1998)
Recursively partitions a data set
At each stage divides the least cohesive cluster
(the one with the largest scatter value) into two subsets
The principal direction of this cluster is found
The cluster is partitioned by a plane perpendicular
to the principal direction, and passing through
the center of gravity of the cluster
The result is a hierarchical structure of subsets
arranged into a binary tree
PDDP Example
[Figure: two groups of points (Ships, Cats); the data is centered by subtracting the mean from each point, then split along the principal direction]
The first singular vector (or principal
direction) of the centered term-
document matrix indicates the direction
of greatest variance.
Direction of Greatest
Variance
Covariance matrix
$C = (D - m e^T)(D - m e^T)^T = A A^T$
A = centered term-document matrix
$m$ = mean/centroid = $(d_1 + d_2 + d_3 + \dots + d_n)/n$
$e^T = [1\ 1\ \dots\ 1]$
Direction of greatest variance in data set:
eigenvector corresponding to largest eigenvalue
of covariance matrix = left singular vector for the largest singular value of A
$C = A A^T = (U S V^T)(U S V^T)^T = (U S V^T)(V S^T U^T) = U S^2 U^T$
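A single PDDP split can be sketched in NumPy directly from these formulas. The function name `pddp_split` and the toy data are assumptions; choosing which cluster to split next (by scatter value) is left out of this sketch.

```python
import numpy as np

def pddp_split(D):
    """Split the columns of D (m x n) by the side of the principal hyperplane they fall on."""
    mean = D.mean(axis=1, keepdims=True)         # centroid m
    A = D - mean                                 # centered matrix A = D - m e^T
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    u = U[:, 0]                                  # principal direction (first left singular vector)
    proj = u @ A                                 # signed distance of each point from the hyperplane
    return proj > 0, u, s[0]

# Toy data: two tight groups of five 2-d points each.
rng = np.random.default_rng(1)
D = np.hstack([rng.normal(0.0, 0.1, (2, 5)), rng.normal(5.0, 0.1, (2, 5))])
right, u, s0 = pddp_split(D)                     # boolean mask: which side each column is on
```

The sign of `u` returned by the SVD is arbitrary, so which group ends up labeled `True` carries no meaning; only the partition does.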
How could this possibly
work?
If at each stage a cluster was a grid of
Gaussians, then this would find maximum
likelihood model. But it’s not in general:
Choosing the right model:
random walk data
Random walk data,
spherical gaussian model
Data
Mixture of two Gaussians
K-means learns near
perfectly
Single-link crashes and
burns
Evaluation on text
It’s not straightforward to evaluate the soft
clustering on text. Hofmann shows that clusters
‘make sense’, and that it can be used for
smoothing in IR like LSA (but better)
reflect increase
indicate decline
show change
follow rise
cite improvement
But for clustering documents?
How not to think of text
clustering…
[Figure: compact, well-separated clusters: big distance between cluster means, small cluster diameter]
Two Medline document
clusters
Two document clusters (hand-curated):
1. Documents relating to the role of Human
Herpesvirus 6 (HHV-6) in the pathogenesis of
Multiple Sclerosis
2. Documents relating to use of Beta Interferons
(IFN-beta) in the treatment of MS.
Two Medline document
clusters
Recently, IFN-beta has been hypothesized to
decrease the immune response to viral infection
by HHV-6
So these overlap, because some articles in one
cluster mention the other (but a minority)
Commonest words greatly overlap:
patients, MS, multiple sclerosis, disease,
But some differentiate:
ifn-beta, interferon vs. dna, herpesvirus
Two Medline document
clusters
Ave distance from mean for collection: 0.869
Ave dist. from mean for HHV-6: 0.776
Ave dist. from mean for Ifn-Beta: 0.868
Dist. between Diabetes and MS means: 0.550
All documents are far apart (remember that
they’re length normalized)
The documents are overlapping, and it doesn’t
look like there are clusters at all
The two clusters (bad view)
Two Medline document
clusters
Canonical angle between subspaces B and C is
1.557 (very nearly π/2 ≈ 1.571) [$\arccos(B^T C)$]
So the document subspaces are almost
orthogonal
Clusters are characterized by a very small
number of dimensions in which all documents in
the cluster are linearly dependent, but in which
all documents in the cluster are orthogonal to all
documents in other clusters
Clusters are ovoid, examining singular values
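For reference, canonical (principal) angles between two subspaces can be computed as below. This is a generic NumPy sketch, not the computation used for the figure above: it assumes B and C hold spanning vectors as columns, and it orthonormalizes them first so that the arccos formula applies.

```python
import numpy as np

def canonical_angles(B, C):
    """Principal angles (radians, ascending) between the column spaces of B and C."""
    Qb, _ = np.linalg.qr(B)                      # orthonormal basis for span(B)
    Qc, _ = np.linalg.qr(C)                      # orthonormal basis for span(C)
    sigma = np.linalg.svd(Qb.T @ Qc, compute_uv=False)
    return np.arccos(np.clip(sigma, -1.0, 1.0))  # clip guards against rounding above 1

# Two coordinate planes in R^4 that share no direction.
B = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
C = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
angles = canonical_angles(B, C)                  # every angle is pi/2: orthogonal subspaces
```

Angles close to π/2, as reported for the two Medline clusters, mean the subspaces share almost no direction.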
The two clusters (good view)
Medline document clusters
For this sort of small data set PDDP wins…
Reporting MI between class and clusters
CL HAC 0.02
PLSI 0.54
Kmeans 0.70
PDDP 0.74
(note that MI is a rather harsh measure: PDDP
has an “accuracy” of 95%)
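To see why MI is a harsh measure, the sketch below computes mutual information between class labels and cluster labels. The units (bits) and the balanced two-class example are assumptions; the slides give neither the log base nor the class sizes.

```python
import math
from collections import Counter

def mutual_information(classes, clusters):
    """Mutual information in bits between two labelings of the same items."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))
    n_class = Counter(classes)
    n_cluster = Counter(clusters)
    mi = 0.0
    for (c, k), n_ck in joint.items():
        mi += (n_ck / n) * math.log2(n_ck * n / (n_class[c] * n_cluster[k]))
    return mi

# Made-up example: 200 documents, two equal classes, 5 errors in each class (95% accuracy).
classes = ["A"] * 100 + ["B"] * 100
clusters = [0] * 95 + [1] * 5 + [0] * 5 + [1] * 95
mi_95 = mutual_information(classes, clusters)
mi_perfect = mutual_information(classes, [0] * 100 + [1] * 100)
```

In this balanced setting a perfect clustering scores 1 bit, while 95% accuracy already drops to roughly 0.71 bits.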
OHSUMED
Larger study: using hand-judged documents in
the OHSUMED corpus (Hersh et al. 1994, 1987–
1991 MedLine, also used in TREC 9)
1. K-means
2. PDDP
3. PLSI
Although the dyadic nature of PLSI seems
appealing, it hasn’t been effective when
evaluated on hard clustering – but hopefully
effective for finer grained topics [TBA]
Results
The hierarchical agglomerative clustering
methods (SL and CL) again perform poorly: the
data is very sparse, not dense
PDDP works well on this data: because of the
high dimensionality, clusters are normally
linearly separable, and other clusters are on
different dimensions
cf. the success of SVMs for text classification
better if use extra clusters? [Boley does this]
clusters are sometimes split
Results
K-means is simple but good…
Given the shape of clusters, one would hope
to get value from not having equal spherical
covariance matrices. I’ve tried a little with
‘ovoid pancakes’ but no success so far (too
many parameters?)
The End
Basis-Membership mixture models connect recent
model-based clustering methods, and can provide
an intuitive understanding of traditional clustering
The close connection with linear algebra methods
connects to LSA etc., and efficient techniques
The simple parameterization is good for text data
Future:
Show that one can exploit nature of document
clusters to improve clustering
Examine theme-based effectiveness of PLSA
Thank
you!