
Cluster Analysis and Latent Class Analysis

Stephen Fisher
stephen.fisher@sociology.ox.ac.uk
Course website: http://malroy.econ.ox.ac.uk/fisher/iss
Stephen Fisher, Intermediate Social Statistics Lectures

Outline

• Cluster analysis
• Mixture models
• Latent Class Analysis

Aims: To provide a basic introduction to different methods of identifying groups in multivariate data, whether measurement is at the interval or ratio level (cluster analysis) or is categorical (latent class analysis).

Cluster Analysis

Cluster analysis is used to classify a set of items into two or more mutually exclusive groups, not known in advance, based on combinations of interval variables. The goal is to organize items into groups so that similarity is maximized among the items within a group and minimized between groups.

Examples include the identification of:

• groups of countries sharing common economic characteristics which we could perhaps label varieties of capitalism, or maybe welfare regimes;
• types of product in a market which have similar characteristics;
• types of people in society, such as Essex man and Worcester woman.

Cluster analysis is more often used when you are interested in the characteristics of the individual items in your data, rather than aiming to ‘replace names with variables’ or test causal hypotheses.

Cluster analysis is more often viewed as an exploratory technique for “generating rather than testing hypotheses” (Everitt, Cluster Analysis: 10).

- To this extent the aim is to summarize the data as simply, practically and effectively as possible, rather than to estimate particular quantities.

One effective method of clustering is to look at a scatter plot of the first two principal components and use the inter-ocular test, i.e. see whether any groups are obvious to the eye.
Hierarchical Cluster Analysis

Probably the most commonly used form of cluster analysis involves creating links between pairs of items to create groups, and between items and groups, to build up clusters. This approach is called agglomerative hierarchical clustering, and techniques in this category generally conform to the following process.

1. Identify a set of N items with interval level measurements on K variables.

2. Define a measure of the distance between two items.
• The most natural distance is Euclidean, written here in squared form: $D_{ij} = \sum_{k=1}^{K} (x_{ik} - x_{jk})^2$ for objects i and j.
• Generalized (or Mahalanobis) distance has the advantage of being invariant to rescaling of any of the x variables: $D_{ij} = (x_i - x_j)^\top S^{-1} (x_i - x_j)$, where S is the sample covariance matrix.
• There are many other distances including city-block distance and weighted Euclidean distance.
• Distance is essentially a measure of dissimilarity. It is also possible to do cluster analysis with matrices of similarity.

3. Define a linkage (or amalgamation) rule for how distance between two objects (either a cluster or an item) can be measured.
• Single linkage: The distance between two objects is the distance between the closest pair of items, one taken from each object.
• Complete linkage: The distance between two objects is the largest distance between any possible pair formed by taking one item from each object.
• Centroid linkage: The distance between two objects is the distance between the centroids of those objects.
• Group (or pair-group) average linkage: The distance between two objects is the average of the distances between all possible pairs formed by taking one item from each object. (This is similar to, but not precisely the same as, centroid linkage.)

4. Objects are linked sequentially to form groups.
At each stage of this clustering process the objects with the shortest distance between them are combined, and the distances between the resulting set of groups are recomputed.

The Ward method does not group objects according to the distance between objects (with whatever linkage) but according to the amount of information that would be lost as a result of grouping two objects. Information loss can be measured as the increase in the sum of squared deviations of items from their cluster centroid.

The clustering process can be summarized in a dendrogram, a tree diagram which plots the sequential linkage between items (which are spread evenly along the x-axis in a convenient order) and objects according to the distance between those objects at the point of linkage, represented on the y-axis.

• Inspection of the dendrogram can be used to determine whether the sample is clustered, and if so how many clusters there are and which items are in each cluster.
• There are two popular indices of distinctness of the clustering (Duda & Hart and Caliński & Harabasz) which are used as the basis of ‘stopping rules’ in Stata. They both essentially use the kind of information that the Ward method is based on.

Problems with Hierarchical Cluster Analysis

• Single linkage clustering is prone to chaining, where items are added one at a time to a single long, straggly cluster.
• Complete, centroid, and group-average linkage clustering tend to create spherical clusters.
• In general the mathematical properties, such as robustness, of hierarchical clustering techniques are poor. Although single linkage is the most robust, the problem of chaining is sufficiently great that it is still relatively unpopular.
• There are stopping rules and goodness of fit tests, but they are rather ad hoc.
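The agglomerative procedure and the chaining problem can be illustrated with a minimal sketch in Python. This is not how Stata or any other package implements clustering; the function name `agglomerate`, the stopping criterion (a fixed number of clusters) and the six made-up items in `ITEMS` are all assumptions chosen for illustration. Only single and complete linkage with Euclidean distance are covered.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def agglomerate(items, n_clusters, linkage="single"):
    """Merge the closest pair of clusters until n_clusters remain.

    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    # Single linkage uses the closest pair of items, complete the farthest.
    pick = min if linkage == "single" else max
    clusters = [[i] for i in range(len(items))]  # every item starts alone

    def cluster_distance(a, b):
        return pick(dist(items[i], items[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the two clusters with the shortest distance between them.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical data: six items on a line with slowly widening gaps.
ITEMS = [(0, 0), (10, 0), (21, 0), (33, 0), (46, 0), (60, 0)]

single = agglomerate(ITEMS, 2, linkage="single")
complete = agglomerate(ITEMS, 2, linkage="complete")
print(single)    # [[0, 1, 2, 3, 4], [5]]
print(complete)  # [[0, 1, 2, 3], [4, 5]]
```

On these items single linkage chains five of the six items together one at a time and leaves the last item alone, while complete linkage produces two more compact groups. The same data can therefore give different two-cluster solutions depending on the linkage rule.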
All these issues suggest it may be useful to use a variety of different methods and to scrutinize any apparent solution in as much detail as possible.

If you have a prior theory about what clusters should exist in your data, then you may be able to apply tests of construct validity by examining whether the members of each cluster have theoretically prescribed characteristics according to variables that were not used to define the clusters.

K-means partition method

This is an algorithm to produce exactly K clusters.

1. Start with K randomly chosen points to define the centres of the K clusters.
2. Assign each item to the closest point.
3. Calculate the mean (centroid) of each cluster.
4. Use the K means to define the centres of K new clusters and reassign each item to the cluster with the closest centre.
5. Repeat the previous two steps until there is no change in cluster membership between steps.

Mixture models

It may be unreasonable to assume that the items in your data can be perfectly divided into compact groups.

Fuzzy clustering/partition allows for items to be members of more than one cluster.

Mixture models assume that the data are generated by random draws from a set of K separate probability distributions. If these distributions are multivariate normal then the mixture model has the form

$$ f(x) = \sum_{k=1}^{K} p_k N(\mu_k, \Sigma_k), \tag{1} $$

where the mean and covariance matrix of each sub-distribution $(\mu_k, \Sigma_k)$, and the proportion $p_k$ of the overall population that each sub-distribution comprises, are parameters that can be estimated by maximum likelihood.

If we have a series of J binary variables, then we use the multivariate Bernoulli distribution:

$$ \Pr(x) = \sum_{k=1}^{K} p_k \prod_{j=1}^{J} \pi_{jk}^{x_j} (1 - \pi_{jk})^{1 - x_j}, \tag{2} $$

where $\pi_{jk}$ is the probability that variable j equals 1 in sub-population k.

This mixture model, with a set of sub-populations each following a particular multivariate Bernoulli distribution, is a latent class model.

Latent Class Analysis

The aim of latent class analysis is to see whether the association between a set of observed (or manifest) categorical variables A, B, C, ... can be explained by an unobserved (or latent) typology X.

Examples include:

• Validating class schema with measures of job characteristics (Evans and Mills, ESR 1998)
• Accounting for the under-reporting of extreme party (Sinn Fein) support (Breen, BJPS 2000)
• Identifying different types of cultural consumer (Chan and Goldthorpe)
• Dealing with measurement error and survey misclassification in transition matrices (Hagenaars).

Note that LCA can be used in either an exploratory or confirmatory fashion.

The model is similar to Factor Analysis, but with categorical manifest and latent variables. Both are driven by the assumption that the association between any pair of manifest variables can be explained by the latent variable, i.e. the manifest variables are conditionally independent given the latent variable.

- Whether or not this principle makes as much intuitive sense with categorical variables as it does in the Factor Analysis context is questionable.

The latent class model with three manifest variables A, B and C has the form:

$$ \Pr(A = a, B = b, C = c) = \sum_{x=1}^{n} \Pr(A = a \mid X = x) \Pr(B = b \mid X = x) \Pr(C = c \mid X = x) \Pr(X = x) \tag{3} $$

This can also be expressed in the form of a log-linear model,

$$ \log(m_{abcx}) = \mu + \lambda^{X}_{x} + \lambda^{A}_{a} + \lambda^{B}_{b} + \lambda^{C}_{c} + \lambda^{AX}_{ax} + \lambda^{BX}_{bx} + \lambda^{CX}_{cx}, \tag{4} $$

where $m_{abcx}$ is the expected count in a cell of the full (unobserved) cross-classification of A, B, C and X.

Note that the number of latent classes x = 1, ..., n of X can be varied by the researcher. Typically one fits models with different numbers of latent classes and then chooses the model with the best fit to the data.
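Equation (3) can be evaluated directly once the class sizes and conditional probabilities are known. The sketch below does this in Python, and also applies Bayes' rule to obtain the probability of membership of each latent class for a given response pattern. The function names and the parameter values in `CLASS_SIZES` and `COND_PROBS` are hypothetical, invented for illustration rather than estimated from any data; fitting the model is a separate step that this sketch does not attempt.

```python
from itertools import product


def joint_probability(pattern, class_sizes, cond_probs):
    """Pr(A=a, B=b, C=c, ...) summed over latent classes, as in equation (3).

    cond_probs[v][x][value] holds Pr(variable v = value | X = x).
    """
    total = 0.0
    for x, size in enumerate(class_sizes):
        term = size  # Pr(X = x)
        for v, value in enumerate(pattern):
            term *= cond_probs[v][x][value]
        total += term
    return total


def class_membership(pattern, class_sizes, cond_probs):
    """Pr(X = x | pattern) for every latent class, by Bayes' rule."""
    joint = joint_probability(pattern, class_sizes, cond_probs)
    posterior = []
    for x, size in enumerate(class_sizes):
        term = size
        for v, value in enumerate(pattern):
            term *= cond_probs[v][x][value]
        posterior.append(term / joint)
    return posterior


# Hypothetical two-class model with three binary manifest variables.
CLASS_SIZES = [0.6, 0.4]  # Pr(X = x)
COND_PROBS = [
    [[0.9, 0.1], [0.2, 0.8]],  # A: [Pr(0|x), Pr(1|x)] for x = class 0, class 1
    [[0.8, 0.2], [0.3, 0.7]],  # B
    [[0.7, 0.3], [0.1, 0.9]],  # C
]

p_111 = joint_probability((1, 1, 1), CLASS_SIZES, COND_PROBS)
post_111 = class_membership((1, 1, 1), CLASS_SIZES, COND_PROBS)
all_patterns = sum(
    joint_probability(p, CLASS_SIZES, COND_PROBS)
    for p in product((0, 1), repeat=3)
)
```

With these made-up values the pattern (1, 1, 1) has probability 0.6 × 0.1 × 0.2 × 0.3 + 0.4 × 0.8 × 0.7 × 0.9 = 0.2052, almost all of which comes from the second class, and the probabilities of the eight possible patterns sum to one.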
Interpretation of the model is normally via the estimated conditional probabilities Pr(A = a|X = x), Pr(B = b|X = x), ... and the estimated relative sizes of each latent class Pr(X = x).

It is sometimes interesting to see how the latent classification is related to other variables, especially if the aim of the model is to confirm a theory. To do this you first need to allocate items to classes.

- For any given latent class model, each item has an estimated probability of being a member of each latent class, and researchers usually allocate each item to the latent class it is most likely to be a member of.

Extensions to Latent Class Analysis

• Placing specific restrictions on the parameters (e.g. setting some of the conditional probabilities to be zero or equal to each other)
• Latent class models for ordered categorical variables
• Ordered latent classes
• Latent trait models, which have observed categorical variables and a latent continuum.
• Various scaling methods, structural equation modelling, and generalized latent variable modelling

Software

Stata, R and SPSS all have routines for cluster analysis.

Latent class analysis is probably easiest in Latent Gold or Lem.

You can fit latent class models in Stata with gllamm but it is complicated, not least because it requires that you learn something about generalized latent variable modelling (Skrondal and Rabe-Hesketh, 2004).

Latent class and mixed models can be estimated using the Mixed Mode Latent Class Regression (mmlcr) package, and others, in R.

Further Reading

As usual there are several available text books.

A short summary of the main Cluster Analysis techniques is: Brian Everitt and Graham Dunn, Applied Multivariate Data Analysis (Edward Arnold).
The Stata 8 and 9 manuals are very useful for explaining the theory and practice of cluster analysis with lots of examples, and for pointing you to relevant literature.

Alan McCutcheon’s Latent Class Analysis (small green Sage book #64) is an excellent introduction.

