					Cluster Analysis and Latent Class Analysis
                           Stephen Fisher

         stephen.fisher@sociology.ox.ac.uk


Course website: http://malroy.econ.ox.ac.uk/fisher/iss




                                                          Outline

• Cluster analysis

• Mixture models

• Latent Class Analysis

Aims: To provide a basic introduction to different methods of identifying
groups in multivariate data, whether the variables are measured at the interval
or ratio level (cluster analysis) or are categorical (latent class analysis).




                                               Cluster Analysis
Cluster analysis is used to classify a set of items into two or more mutually
exclusive unknown groups based on combinations of interval variables.
The goal of cluster analysis is to organize items into groups in such a way
that the degree of similarity is maximized for the items within a group and
minimized between groups.
Examples include the identification of:

• groups of countries sharing common economic characteristics which we
  could perhaps label varieties of capitalism, or maybe welfare regimes;

• types of product in a market which have similar characteristics;

• types of people in society, such as Essex man and Worcester woman.

Cluster analysis is more often used when you are interested in the
characteristics of the individual items in your data, rather than aiming to
‘replace names with variables’ or test causal hypotheses.

Cluster analysis is more often viewed as an exploratory technique for
“generating rather than testing hypotheses” (Everitt, Cluster Analysis: 10).
 - To this extent the aim is to summarize the data as simply, practically and
   effectively as possible, rather than to estimate particular quantities.

One effective method of clustering is to look at a scatter plot of the first two
principal components and use the inter-ocular test (that is, simply eyeballing
the plot for distinct groupings).
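For illustration only (not part of the original notes), a minimal R sketch of this
informal check, assuming the data are held in a numeric data frame X (a
hypothetical name):

    pca <- prcomp(X, scale. = TRUE)   # principal components of the standardized variables
    plot(pca$x[, 1], pca$x[, 2],
         xlab = "First principal component",
         ylab = "Second principal component")
    # inter-ocular test: look for visually separated clumps of points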




                                 Hierarchical Cluster Analysis

Probably the most commonly used form of cluster analysis involves creating
links between pairs of items to create groups, and between items and groups,
to build up clusters. This approach is called agglomerative hierarchical
clustering, and techniques in this category generally conform to the following
process.

1. Identify a set of N items with interval level measurements on K variables.

2. Define a measure of the distance between two items.

     • The most natural distance is Euclidean:
       D_ij = √( ∑_{k=1}^{K} (x_ik − x_jk)^2 ) for objects i and j.
     • Generalized (or Mahalanobis) distance has the advantage of being
       invariant to rescaling of any of the x variables:
       D_ij = √( (x_i − x_j)′ S^{-1} (x_i − x_j) ), where S is the sample
       covariance matrix.
     • There are many other distances including city-block distance and
       weighted Euclidean distance.

     • Distance is essentially a measure of dissimilarity. It is also possible to
       do cluster analysis with matrices of similarity.

3. Define a linkage (or amalgamation) rule for how distance between two
   objects (either a cluster or an item) can be measured.
     • Single linkage: The distance between two objects is the distance be-
       tween the closest two items in each object.
     • Complete linkage: The distance between two objects is the largest
       distance between any possible pair formed by items from two different
       objects.
     • Centroid linkage: The distance between two objects is the distance
       between the centroids of those objects.
     • Group (or pair-group) Average linkage: The distance between two
       objects is the average of all the distances between all possible pairs
       formed by taking one item from each object. (This is similar to, but not
       precisely the same as, centroid linkage).

4. Objects are linked sequentially to form groups. At each stage of this
   clustering process the two objects with the shortest distance between them
   are combined, and the distances between the resulting set of groups are
   recomputed (a small R sketch of steps 1-4 follows).
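As a hedged illustration of steps 1-4 (not in the original notes), base R can
carry out agglomerative hierarchical clustering roughly as follows, again
assuming a numeric data frame X:

    d  <- dist(X, method = "euclidean")   # step 2: pairwise Euclidean distances
    hc <- hclust(d, method = "average")   # step 3: group-average linkage
    # other linkage rules: "single", "complete", "centroid", "ward.D2"
    plot(hc)                              # step 4 summarized as a dendrogram
    groups <- cutree(hc, k = 3)           # cut the tree into, say, three clusters
    table(groups)                         # cluster sizes

The choice of distance (dist) and of linkage rule (the method argument to
hclust) corresponds to steps 2 and 3 above.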

The Ward Method does not group objects according to the distance between
objects (with whatever linkage) but according to the amount of information
that would be lost as a result of grouping two objects. Information can be
measured as the sum of the squared deviations from the cluster centroid.

The clustering process can be summarized in a dendrogram: a tree diagram
that plots the sequential linkage between items (which are spread evenly
along the x-axis in a convenient order) and objects, with the distance between
those objects at the point of linkage represented on the y-axis.

• Inspection of the dendrogram can be used to determine whether the sample
  is clustered, and if so how many clusters there are and which items are in
  each cluster.

• There are two popular indices of the distinctness of the clustering (Duda &
  Hart, and Calinski & Harabasz) which are used as the basis of ‘stopping
  rules’ in Stata. They both essentially use the kind of information that the
  Ward method is based on.




                 Problems with Hierarchical Cluster Analysis


• Single linkage clustering is prone to chaining.

• Complete, centroid, and group-average linkage clustering tend to create
  spherical clusters.

• In general the mathematical properties, such as robustness, of hierarchical
  clustering techniques are poor. Although single linkage is the most robust,
  the problem of chaining is sufficiently great that it is still relatively
  unpopular.

• There are stopping rules and goodness of fit tests, but they are rather ad
  hoc.


All these issues suggest it may be useful to use a variety of different methods
and to scrutinize any apparent solution in as much detail as possible.

If you have a prior theory about what clusters should exist in your data, then
you may be able to apply tests of construct validity by examining whether
the members of each cluster have theoretically prescribed characteristics
according to variables that were not used to define the clusters.
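A small illustrative sketch in R, assuming the cluster assignments in groups
from the earlier example and an external variable z (hypothetical) that was not
used in the clustering:

    table(groups, z)              # do the clusters differ on z as theory prescribes?
    chisq.test(table(groups, z))  # a rough test of association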




                                     K-means partition method

This is an algorithm to produce exactly K clusters.

1. Start with K randomly chosen points to define the centres of the K clusters.

2. Assign each item to the closest point.

3. Calculate the mean (centroid) of each cluster.

4. Use the K means to define the centres of K new clusters and reassign
   each item to the cluster with the closest centre.

5. Repeat the previous two steps until the cluster assignments no longer
   change between iterations (a small R sketch of the algorithm follows).
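A minimal R sketch of the algorithm (illustrative only), assuming a numeric
data frame X:

    set.seed(1)                                # results depend on the random starting centres
    km <- kmeans(X, centers = 3, nstart = 25)  # K = 3 clusters, best of 25 random starts
    km$centers                                 # the K cluster means (centroids)
    table(km$cluster)                          # cluster sizes

Because the solution depends on the random starting points, it is usual to
rerun the algorithm from several starts (the nstart argument) and keep the
best partition.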




                                                Mixture models

It may be unreasonable to assume that the items in your data can be perfectly
divided into compact groups.

Fuzzy clustering/partition allows for items to be members of more than one
cluster.

Mixture models assume that the data are generated by random draws from a
set of K separate probability distributions.

If these distributions are multivariate normal then the mixture model has the
form

                    f(x) = ∑_{k=1}^{K} p_k N(µ_k, Σ_k),                    (1)


where the mean and covariance matrix of each sub-distribution (µ_k, Σ_k) and
the proportion p_k of the overall population that each sub-distribution
comprises are parameters that can be estimated by maximum likelihood.
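One common route in R (a software choice, not part of the original notes) is
the mclust package, which fits such Gaussian mixtures by maximum likelihood
using the EM algorithm; a minimal sketch:

    library(mclust)
    fit <- Mclust(X, G = 1:5)   # mixtures with 1 to 5 components, chosen by BIC
    summary(fit)                # selected number of components and covariance structure
    head(fit$z)                 # estimated probabilities of component membership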

If we have a set of J binary variables, then we use the multivariate Bernoulli
distribution:
                  Pr(x) = ∑_{k=1}^{K} p_k ∏_{j=1}^{J} π_jk^{x_j} (1 − π_jk)^{1 − x_j}.                    (2)


This mixture model with a set of sub-populations each following a particular
multivariate Bernoulli distribution is a latent class model.




                                         Latent Class Analysis

The aim of latent class analysis is to see whether the association between a
set of observed (or manifest) categorical variables A, B, C, . . . can be explained
by an unobserved (or latent) typology X.

Examples include:

• Validating class schema with measures of job characteristics (Evans and
  Mills, ESR 1998)

• Accounting for the under-reporting of extreme party (Sinn Fein) support
  (Breen, BJPS 2000)

• Identifying different types of cultural consumer (Chan and Goldthorpe)

• Dealing with measurement error and survey misclassification in transition
  matrices (Hagenaars).

Note that LCA can be used in either an exploratory or confirmatory fashion.

The model is similar to Factor Analysis, but with categorical manifest and
latent variables. Both are driven by the assumption that the association be-
tween any pair of manifest variables can be explained by the latent variable,
i.e. the manifest variables are conditionally independent given the latent vari-
able.
  - Whether or not this principle makes as much intuitive sense with
    categorical variables as it does in the Factor Analysis context is
    questionable.

The latent class model with three manifest variables A, B and C has the form:

   Pr(A = a, B = b, C = c) = ∑_{x=1}^{n} Pr(A = a|X = x) Pr(B = b|X = x) Pr(C = c|X = x) Pr(X = x)    (3)

This can also be expressed in the form of a log-linear model.


            log(m_abc) = µ + λ_x^X + λ_a^A + λ_b^B + λ_c^C + λ_ax^AX + λ_bx^BX + λ_cx^CX            (4)

Note that the number of latent classes x = 1, . . . , n of X can be varied by the
researcher. Typically one fits models with different numbers of latent classes
and then chooses the model with the best fit to the data.

Interpretation of the model is normally via the estimated conditional
probabilities Pr(A = a|X = x), Pr(B = b|X = x), . . . and the estimated relative
sizes of each latent class Pr(X = x).
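As an illustration (not part of the original notes), one convenient R option is
the poLCA package; a minimal sketch assuming the manifest variables A, B and
C are coded as integers starting at 1 in a data frame df:

    library(poLCA)
    f <- cbind(A, B, C) ~ 1                   # manifest variables only, no covariates
    fits <- lapply(2:4, function(n)           # models with 2, 3 and 4 latent classes
      poLCA(f, data = df, nclass = n, verbose = FALSE))
    sapply(fits, function(m) m$bic)           # compare the models, e.g. by BIC

The fitted objects report the estimated conditional response probabilities and
latent class sizes used for interpretation.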

It is sometimes interesting to see how the latent classification is related to
other variables, especially if the aim of the model is to confirm a theory. To
do this you first need to allocate items to classes.
 - For any given latent class model, each item has an estimated probability
   of being a member of each latent class, and researchers usually allocate
   each item to the latent class it is most likely to belong to.
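Continuing the hedged poLCA sketch above, a modal allocation can be obtained
from the estimated membership probabilities:

    best  <- fits[[2]]                            # say, the three-class model
    head(best$posterior)                          # probability of each item belonging to each class
    alloc <- apply(best$posterior, 1, which.max)  # allocate each item to its most likely class
    table(alloc)                                  # class sizes under modal allocation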




                         Extensions to Latent Class Analysis

• Placing specific restrictions on the parameters (e.g. setting some of the
  conditional probabilities to zero or equal to each other)

• Latent class models for ordered categorical variables

• Ordered latent classes

• Latent trait models, which have observed categorical variables and a latent
  continuum.

• Various scaling methods, structural equation modelling, and generalized
  latent variable modelling




                                                          Software

Stata, R and SPSS all have routines for cluster analysis.

Latent class analysis is probably easiest in Latent Gold or Lem.

You can fit latent class models in Stata with gllamm but it is complicated, not
least because it requires that you learn something about generalized latent
variable modelling (Skrondal and Rabe-Hesketh, 2004).

Latent class and mixed models can be estimated using the Mixed Mode Latent
Class Regression (mmlcr) package, and others, in R.




                                               Further Reading

As usual there are several available textbooks. A short summary of the main
Cluster Analysis techniques is:

Brian Everitt and Graham Dunn, Applied Multivariate Data Analysis, (Edward
Arnold).

The Stata 8 and 9 manuals are very useful for explaining the theory and
practice of cluster analysis, with lots of examples, and for pointing you to
relevant literature.

Alan McCutcheon’s Latent Class Analysis (small green Sage book #64) is an
excellent introduction.



