Posted on: 9/3/2010. Public Domain.
Cluster Analysis of Gamer Segmentation Data

I. Introduction

This memorandum describes the segmentation analysis of the gamer segmentation data set, which contains responses from people who were asked about their attitudes and behaviors with respect to video gaming. These attitudes and behaviors were characterized by five factors called Escapism, Anti-Social, Communal, Reality and Slacking. Cluster analysis finds seven segments, which are interpreted in section 2.4. In the technical analysis that follows, section 1 discusses the factor analysis and gives the factor loadings. Section 2 describes the cluster analysis procedure and the segments derived, followed by their interpretation and validation.

II. Technical Analysis

1. FACTOR ANALYSIS

Factor analysis is used to identify groups of variables that are strongly correlated. We therefore use the five factors it yields: Escapism, Anti-Social, Communal, Reality and Slacking.

Table 1. Gaming attitudes loading on Factor 1: Escapism.

Gaming Attitude                                                               Factor Loading
Q.15 (E): Playing games allows me to take out my aggressions                  .704
Q.15 (F): Playing games allows me to be someone I am not                      .701
Q.15 (S): Playing games allows me to experience the subject whenever I want   .667
Q.15 (C): Playing games allows me to escape my problems                       .620
Q.15 (K): Playing games is another way to get closer to the things I like     .567
Q.15 (I): Playing games allows me to express my creativity                    .558
Q.15 (O): I play games for a rush or excitement                               .482

Table 2. Gaming attitudes loading on Factor 2: Anti-Social.

Gaming Attitude                                                                         Factor Loading
Q.15 (Q): I lie about the amount of time I spend playing games                          .742
Q.15 (H): Others are jealous of the time I spend playing games                          .708
Q.15 (R): Experiencing something in a game is better than experiencing the real thing   .612
Q.15 (L): I spend too much time playing games                                           .591

Table 3. Gaming attitudes loading on Factor 3: Communal.
Gaming Attitude                                                        Factor Loading
Q.15 (G): When playing games I like to compete against other players   .756
Q.15 (N): Playing games is a big part of hanging out with my friends   .606
Q.15 (B): I hate playing games that are too easy                       .555

Table 4. Gaming attitudes loading on Factor 4: Reality.

Gaming Attitude                                                             Factor Loading
Q.15 (P): I prefer to play games that deal with subjects I know about       .700
Q.15 (M): Realistic graphics are essential to a great game                  .565
Q.15 (K): Playing games is another way to get closer to the things I like   .500
Q.15 (A): Winning is very important whenever I play a game                  .460

Table 5. Gaming attitudes loading on Factor 5: Slacking.

Gaming Attitude                                              Factor Loading
Q.15 (J): I play games to kill time                          .707
Q.15 (D): I play games only when I have nothing else to do   .688

Note: Standardization is not required for this data set, since all five factors are measured on the same scale (from “1” to “5”).

2. CLUSTER ANALYSIS

The cluster analysis has three main components: the similarity/distance measure, the clustering procedure, and the number of clusters. The first two are decided beforehand, while the number of clusters is decided post hoc. Before clustering, only a range for the number of clusters can be specified.

2.1 Clustering procedures:

The clustering procedure specifies how clusters are formed and how cluster membership is determined. Several high-level strategies are available for forming clusters, such as hierarchical, K-means, optimization, statistical model-based, and fuzzy clustering, but none of them is perfect.

Hierarchical methods require neither an initial number of clusters nor initial cluster centers. The clusters generated are nested, which allows a more straightforward evaluation of the trade-offs between more and fewer clusters. However, hierarchical methods are sensitive to outliers, and they cannot correct poor classifications made at an early step.
Nonhierarchical methods are less sensitive to outliers, to the choice of distance measure and to the inclusion of inappropriate variables, so they are appropriate when the objective is to partition the data. However, nonhierarchical methods require the number of clusters to be specified beforehand, and their solutions are very sensitive to the choice of cluster seeds.

2.2 Linkage Method:

Agglomerative hierarchical cluster procedures require a method for determining which two clusters to join at each step of the algorithm. Several linkage methods are available, such as single linkage, complete linkage, average linkage, the centroid method, the median method, and Ward’s method. Here Ward’s method is used in the hierarchical cluster procedure.

Ward’s method is a minimum variance method. At each step it joins the pair of clusters whose merger gives the smallest increase in the within-cluster sum of squared distances from the cluster mean. Since Ward’s method always tries to minimize within-group variance, it tends to combine small clusters (the variance of small clusters is small). It is therefore biased towards creating clusters with approximately the same number of objects.

An important feature of Ward’s method is its tendency to produce cluster solutions similar to those obtained from k-means clustering. Since our clustering procedure is hierarchical followed by k-means, using Ward’s method in the hierarchical procedure generates solutions that are more compatible with the next step (the k-means method).

Similarity / Distance measures:

Clustering methods require some way to measure the distance between pairs of data objects. There are several measures to choose from, including correlational measures, distance measures (Euclidean distance, city-block distance, Mahalanobis distance) and association measures. Euclidean distance is frequently applied when variables are on the same scale; its squared form reduces computation time, because no square root is taken, and works better with the centroid and Ward’s clustering methods.
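The minimum-variance criterion of Ward’s method can be illustrated with a small sketch. It relies on the standard identity that merging clusters A and B raises the total within-cluster sum of squares by n_A n_B / (n_A + n_B) times the squared Euclidean distance between their centroids. The function names and the two-factor sample data below are hypothetical and chosen only for illustration; they are not taken from the gamer data set or from SPSS.

```python
from itertools import combinations


def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_euclidean(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_ss(points):
    """Sum of squared distances from each point to its cluster mean."""
    c = centroid(points)
    return sum(sq_euclidean(p, c) for p in points)


def ward_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b merge."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * sq_euclidean(centroid(a), centroid(b))


def ward_step(clusters):
    """Index pair of the two clusters with the smallest merge cost."""
    return min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
    )


# Hypothetical scores on a 1-5 scale (two factors only, for brevity).
clusters = [
    [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)],  # larger cluster
    [(4.0, 4.0)],                                      # singleton
    [(5.0, 4.0)],                                      # singleton
]
i, j = ward_step(clusters)
print("Ward's method would join clusters", i, "and", j)
```

In this toy example the two singletons are joined first. The factor n_A n_B / (n_A + n_B) grows with cluster size, which is one way to see why the method favors merging small clusters.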
Since all five factors are measured on the same scale, we can safely use squared Euclidean distance in the hierarchical clustering procedure.

2.3 Number of cases and range of number of clusters:

Clustering methods require some way to validate the solutions. We split the 1017 cases into two sub-samples of 509 and 508 cases and use cross-validation to validate the cluster solution. The cluster analysis discussed below is based on the 509-case sub-sample. Each sub-sample is expected to cluster into 2 to 9 segments.

2.3.1 Determine appropriate number of clusters by Ward’s Procedure

There are no objective statistical criteria for determining how many clusters are appropriate in a cluster analysis. We can still get some hints from the tools provided by the hierarchical procedure. For each step of the algorithm, the agglomeration schedule in SPSS gives the distance between the clusters joined at that step. It also indicates how cases and clusters are merged together. A relatively large jump in the agglomeration schedule indicates that dissimilar groups have been joined, so the clusters at the previous step may be the better solution. Table 6 lists the last 12 steps of the agglomeration schedule.

Table 6.
Agglomeration Schedule

        Cluster Combined                    Difference  Difference of       Stage Cluster First Appears Next
Stage   Cluster 1  Cluster 2  Coefficients  of CDs      Differences of CDs  Cluster 1     Cluster 2     Stage
497     9          12         623.472       23.458                          493           459           503
498     2          41         646.930       26.823      3.365               174           491           507
499     8          18         673.753       28.865      2.042               490           485           503
500     20         24         702.618       34.02       5.155               494           486           504
501     28         65         736.638       41.902      7.882               489           477           502
502     7          28         778.540       58.994      17.092              496           501           506
503     8          9          837.534       75.584      16.59               499           497           505
504     1          20         913.118       110.95      35.366              488           500           505
505     1          8          1024.068      140.017     29.067              504           503           507
506     7          32         1164.085      195.251     55.234              502           495           508
507     1          2          1359.336      388.07      192.819             505           498           508
508     1          7          1747.406                                      507           506           0

From Table 6, the difference of differences for the step from stage 504 to stage 505 (35.366) is larger than the values before and after it (16.59 and 29.067), and the same holds for the step from stage 502 to stage 503 (17.092, against 7.882 and 16.59). With 509 cases, stopping after stage 504 leaves 509 - 504 = 5 clusters, and stopping after stage 502 leaves 7 clusters. Figure 1 plots the coefficients (CD index) against the number of clusters. We look for a significant change in values, i.e. the “knee” in the graph. It seems that we need to try clustering the data into 5 or 7 segments.

Figure 1. The Relations of Coefficients and Number of clusters.
[Line plot titled “Cluster Validity”: CD index (y-axis, 0 to 2000) against number of clusters (x-axis, 1 to 15).]

We then compute the RS index for each number of clusters derived by the hierarchical procedure. Figure 2 shows the RS indices for 2 to 9 clusters. The “knee” appears to be at 4 clusters.

Figure 2. The RS indices for each number of clusters derived by Ward’s procedure.
[Line plot: RS index (y-axis, 0 to 0.6) against number of clusters (x-axis, 2 to 9).]

2.3.2 Determine appropriate number of clusters by K-means Procedure

After the hierarchical procedure is finished, all 509 cases have been classified into solutions of 2 to 9 clusters. The seeds (centroids) of each derived cluster can be obtained with “Compare Means” in SPSS.
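As a rough sketch of this step outside SPSS, the seeds are simply the per-cluster means of the factor scores, and they can be passed to a basic K-means pass as starting centers. The code below assumes that RS denotes the usual R-squared ratio (between-cluster sum of squares over total sum of squares), which the memorandum does not define, and it assumes a stop rule of "assignments unchanged, or a maximum number of iterations". All function names and the sample data are hypothetical.

```python
def mean_point(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeds_from_labels(points, labels):
    """Per-cluster means of a prior (e.g. hierarchical) solution."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return [mean_point(groups[lab]) for lab in sorted(groups)]


def kmeans(points, seeds, max_iter=10):
    """Assign each point to its nearest center, recompute the centers,
    and stop when assignments no longer change (assumed stop rule)."""
    centers = list(seeds)
    assign = None
    for _ in range(max_iter):
        new_assign = [
            min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points
        ]
        if new_assign == assign:
            break
        assign = new_assign
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = mean_point(members)
    return assign, centers


def rs_index(points, assign):
    """R-squared: 1 - within-cluster SS / total SS (assumed meaning of RS)."""
    grand = mean_point(points)
    ss_total = sum(sq_dist(p, grand) for p in points)
    ss_within = 0.0
    for k in set(assign):
        members = [p for p, a in zip(points, assign) if a == k]
        c = mean_point(members)
        ss_within += sum(sq_dist(p, c) for p in members)
    return 1.0 - ss_within / ss_total


# Hypothetical two-factor scores; the last case is deliberately misplaced
# in the prior solution to show K-means reassigning it.
points = [(1.0, 1.0), (1.5, 1.0), (1.0, 1.5),
          (4.0, 4.0), (4.5, 4.0), (4.0, 4.5), (3.8, 3.9)]
prior_labels = [1, 1, 1, 2, 2, 2, 1]

seeds = seeds_from_labels(points, prior_labels)
assign, centers = kmeans(points, seeds)
print("assignments:", assign)
print("RS before:", rs_index(points, prior_labels))
print("RS after: ", rs_index(points, assign))
```

Repeating this for each candidate number of clusters and tabulating RS would give the kind of comparison used to choose the final solution.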
These seeds are used as the initial centers for K-means clustering.

K-means is a nonhierarchical clustering procedure. The number of clusters and the number of iterations are specified prior to clustering. The basic algorithm proceeds as follows. First, initial cluster seeds (centers) are selected. Then data points are assigned to the nearest cluster center. Centers are then recalculated based on the associated data points, and data points are reassigned to the new cluster centers. The algorithm terminates according to a predefined stop rule.

We then compute the RS index for each number of clusters derived by the K-means procedure. Table 7 and Figure 3 show the RS indices for 2 to 9 clusters. The gain in RS at 7 clusters (0.034638) is larger than the gains at 6 clusters (0.03397) and at 8 clusters (0.023664), whereas the gains at 4 and 5 clusters (0.08086 and 0.065592) simply continue a declining trend. The change at 7 clusters is therefore the more significant one, and we derive 7 clusters.

Table 7. RS index and change of RS by number of clusters (K-means procedure)

Number of Clusters   2          3          4          5          6          7          8          9
RS                   0.255993   0.379124   0.459984   0.525576   0.559546   0.594184   0.617848   0.631563
Change of RS                    0.123131   0.08086    0.065592   0.03397    0.034638   0.023664   0.013714

Figure 3. The RS indices for each number of clusters derived by K-means procedure.
[Line plot titled “cluster validity”: RS (y-axis, 0 to 0.7) against number of clusters (x-axis, 2 to 9).]

2.4. Interpret clusters

The final cluster centers listed in Table 8 are the cluster means of the 7 clusters derived by K-means following Ward’s procedure (1017 cases). For the five factors, we consider values above the mean as “High”, values below the mean as “Low”, and values close to the mean as “moderate”. Combined with the crosstabulation tables for each cluster in the Appendix, the derived clusters can be described as follows:

• Cluster 1: the persons in this group have high scores on Anti-Social, moderate scores on Escapism and Slacking, and low scores on Communal and Reality. The persons in this group tend to be younger than average. This group has the highest percentage of single persons and a low rate of having children.
The percentage of students is also high compared to other groups. Education level, annual income and SES are polarized: the percentages at the high levels are not low, and are even above average, but the percentages at the middle levels are low and the percentages at the low levels are high.
• Cluster 2: the persons in this group have low scores on all five factors, and it is the smallest group (41 cases). They are mostly adults, with only a few teenagers. Naturally, they have a higher percentage of married persons and persons with children, but they are no more likely than other groups to have more than 3 children. Most of the persons in this group have professional jobs, and far fewer of them are students compared to other groups. They are well educated and have high incomes. In particular, the percentage of this group with an annual income between $50,000 and $100,000 is much higher than in other groups. The group also has high SES.
• Cluster 3: the persons in this group have high scores on all five factors. Many characteristics of this group are the opposite of Cluster 2. The persons in this group are young. Most of them are aged 13 to 15 or 20 to 29, ages at which people tend to be less settled. The percentage of married persons and persons with children is not high. The percentages of Black respondents, students and unemployed persons in this group are higher than in other groups. Education level and annual income are also lower than in other groups. Most persons in this group have an income under $50,000. They also have lower SES than other groups.
• Cluster 4: the persons in this group have low scores on Escapism, Anti-Social and Reality, a moderately low score on Communal, and a high score on Slacking. The persons in this group are average persons. They have lower income and SES but a slightly higher education level than average.
• Cluster 5: the persons in this group have high scores on Escapism, Communal and Reality, a moderate score on Anti-Social, and a low score on Slacking. The persons in this group are average persons, too. They tend to be younger and have a slightly lower education level than average. A high percentage of them are students, and their annual income and SES are average.
• Cluster 6: the persons in this group have high scores on Escapism, Reality and Slacking, a moderately low score on Communal, and a low score on Anti-Social. The persons in this group tend to be older, even older than Cluster 2. They are well educated and have good jobs and high annual incomes. Their SES is higher than average but lower than that of Cluster 2 and Cluster 7.
• Cluster 7: the persons in this group have a moderately low score on Escapism and low scores on Anti-Social, Communal, Reality and Slacking. The persons in this group are middle-aged and tend to be single. They have a higher education level and income than average, and the highest percentage in the very high income level (more than $100,000). They also have slightly higher SES than average. However, there are also high percentages of low income and low SES in this group, so it appears to be polarized as well.

Table 8. Final cluster centers from K-means solutions with initial centers from Ward’s method

                                           Cluster Number of Case
                                           1        2        3        4        5        6        7        Total
Q.B What is your age-group?                4.57     5.00     4.24     4.77     4.33     5.15     4.95     4.68
Q.A What is your marital status? Are you   1.69     1.74     1.63     1.80     1.69     1.72     1.68     1.71
Q.B How many children do you have living   6.82     5.03     6.44     6.60     6.65     6.69     6.95     6.63
Q.D Which of the following best describe   1.56     2.00     1.44     1.51     1.66     1.42     1.39     1.52
Q.E Which of the following best describe   7.84     5.76     8.43     7.42     7.96     7.13     8.09     7.71
Q.F What was the last grade of school yo   3.57     3.98     3.19     3.71     3.34     3.83     3.73     3.59
Q.G What is your approximate annual hous   5.07     4.88     4.79     4.82     4.81     4.83     5.17     4.91
SES                                        1.67     2.10     1.54     1.71     1.67     1.81     1.75     1.71
Escapism                                   3.1335   1.5540   4.0360   2.3873   3.7299   3.4181   3.0542   3.1979
Anti-Social                                2.9372   1.4573   3.5216   1.4855   2.3985   1.8129   1.8376   2.2947
Communal                                   3.1759   1.7967   3.9712   3.1734   4.1273   3.1656   2.9951   3.3569
Reality                                    3.0766   1.8598   4.0378   3.0289   3.9303   3.8374   3.1916   3.4267
Slacking                                   3.0879   1.6951   4.0324   3.6243   2.5727   3.7055   1.8832   3.1052

Table 9. Number of Cases in each Cluster

Cluster   Frequency   Percent   Valid Percent   Cumulative Percent
1         199         19.6      19.6            19.6
2         41          4.0       4.0             23.6
3         139         13.7      13.7            37.3
4         173         17.0      17.0            54.3
5         165         16.2      16.2            70.5
6         163         16.0      16.0            86.5
7         137         13.5      13.5            100.0
Total     1017        100.0     100.0

Figure 4. Percentage of each cluster in a pie graph.
[Pie chart; legend: Cluster Number of Case, 1 to 7.]

Figure 5. Average factors scores for each cluster.
[Chart of mean factor scores (y-axis “Mean”, 1.00 to 4.00) by Cluster Number of Case (x-axis, 1 to 7); series: Escapism, Anti-Social, Communal, Reality, Slacking.]

2.5. Validate solution

Validation involves assuring that a cluster solution is applicable to the population and is not merely a product of the data or an artifact of the clustering procedure. As mentioned before, the validation procedure in this report is cross-validation. The whole sample (1017 cases) is randomly divided into two subsets: a calibration set (group0) and a validation set (group1). First we cluster the calibration set (group0) and determine the cluster centers (hierarchical procedure followed by K-means). Then we assign the cases in the validation set (group1) to the calibration clusters based on the closest calibration cluster center. We call this cluster solution C1.
Now we cluster the validation set (group1) with the same number of clusters as the calibration set (group0). We call this cluster solution C2.

Table 10. Cluster Number of Case * Cluster Number of Case Crosstabulation

                         Cluster Number of Case
Cluster Number of Case   1     2     3     4     5     6     7     Total
1                        0     95    0     2     1     0     2     100
2                        7     0     3     0     0     2     0     12
3                        0     0     0     89    5     1     0     95
4                        0     0     3     0     98    0     0     101
5                        0     0     69    0     3     0     0     72
6                        0     0     1     0     0     68    0     69
7                        0     0     0     1     0     0     58    59
Total                    7     95    76    92    107   71    60    508

The calibration and validation cluster solutions are very similar. Apart from a relabeling of the clusters, each row of Table 10 has most of its cases in a single column (for example, 95 of the 100 cases in row 1 fall in column 2). This strong agreement indicates the appropriateness of the clusters.
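The agreement visible in Table 10 can be quantified with a short sketch. Because cluster numbers are arbitrary in each solution, the code looks for the one-to-one pairing of row clusters and column clusters that maximizes the number of matched cases, then reports the matched share. This is only one possible agreement measure and is an assumption here, since the memorandum does not name one. The counts are copied from Table 10; the function and variable names are hypothetical.

```python
from itertools import permutations

# Counts from Table 10. The source does not state which solution (C1 or C2)
# is on the rows; the matched count does not depend on that choice.
crosstab = [
    [0, 95, 0, 2, 1, 0, 2],
    [7, 0, 3, 0, 0, 2, 0],
    [0, 0, 0, 89, 5, 1, 0],
    [0, 0, 3, 0, 98, 0, 0],
    [0, 0, 69, 0, 3, 0, 0],
    [0, 0, 1, 0, 0, 68, 0],
    [0, 0, 0, 1, 0, 0, 58],
]


def best_matching(table):
    """Pair each row cluster with a distinct column cluster so that the
    number of cases falling in the paired cells is as large as possible.
    Brute force over all pairings, which is fine for 7 clusters (5040)."""
    k = len(table)
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(table[r][perm[r]] for r in range(k)),
    )
    matched = sum(table[r][best[r]] for r in range(k))
    return best, matched


mapping, matched = best_matching(crosstab)
total = sum(sum(row) for row in crosstab)
agreement = matched / total

for r, c in enumerate(mapping):
    print(f"row cluster {r + 1} <-> column cluster {c + 1}: {crosstab[r][c]} cases")
print(f"matched {matched} of {total} cases ({agreement:.1%})")
```

With the counts in Table 10, the best pairing matches 484 of the 508 validation cases, about 95.3%.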