Cluster Analysis of gamer segmentation data by yyc62487


									               Cluster Analysis of gamer segmentation data

I. Introduction
        This memorandum describes the segmentation analysis of the gamer segmentation
data set, which contains responses from people who were asked about their attitudes and
behaviors with respect to video gaming. These attitudes and behaviors were then
characterized by five factors called Escapism, Anti-Social, Communal, Reality and
Slacking. Cluster analysis finds seven segments, which are explained in Section 2.4.
        In the next part, Section 1 discusses the factor analysis and gives the factor
loadings. Section 2 describes the cluster analysis procedure and the segments derived,
followed by their explanation and validation.


II. Technical Analysis
1. FACTOR ANALYSIS
        Factor analysis is used to identify groups of variables that are strongly
correlated. We therefore use the five factors given by the factor analysis: Escapism,
Anti-Social, Communal, Reality and Slacking. Tables 1 to 5 list the gaming attitudes that
load on each factor.

Table 1. Gaming attitudes loading on Factor 1: Escapism.

Gaming Attitude                                                                        Factor Loading
   Q.15 (E): Playing games allows me to take out my aggressions                             .704
   Q.15 (F): Playing games allows me to be someone I am not                                 .701
   Q.15 (S): Playing games allows me to experience the subject whenever I want              .667
   Q.15 (C): Playing games allows me to escape my problems                                  .620
   Q.15 (K): Playing games is another way to get closer to the things I like                .567
   Q.15 (I): Playing games allows me to express my creativity                               .558
   Q.15 (O): I play games for a rush or excitement                                          .482


Table 2. Gaming attitudes loading on Factor 2: Anti-Social

Gaming Attitude                                                                        Factor Loading
   Q.15 (Q):   I lie about the amount of time I spend playing games                         .742
   Q.15 (H):   Others are jealous of the time I spend playing games                         .708
   Q.15 (R):   Experiencing something in a game is better than experiencing the real thing  .612
   Q.15 (L):   I spend too much time playing games                                          .591




Table 3. Gaming attitudes loading on Factor 3: Communal.

Gaming Attitude                                                                Factor Loading
   Q.15 (G): When playing games I like to compete against other players             .756
   Q.15 (N): Playing games is a big part of hanging out with my friends             .606
   Q.15 (B): I hate playing games that are too easy                                 .555

Table 4. Gaming attitudes loading on Factor 4: Reality.

Gaming Attitude                                                                Factor Loading
   Q.15 (P): I prefer to play games that deal with subjects I know about            .700
   Q.15 (M): Realistic graphics are essential to a great game                       .565
   Q.15 (K): Playing games is another way to get closer to the things I like        .500
   Q.15 (A): Winning is very important whenever I play a game                       .460

Table 5. Gaming attitudes loading on Factor 5: Slacking.

Gaming Attitude                                                                Factor Loading
   Q.15 (J): I play games to kill time                                              .707
   Q.15 (D): I play games only when I have nothing else to do                       .688



Note: Standardization is not required for this data set, since all five factors are
measured on the same scale (from “1” to “5”).


2. CLUSTER ANALYSIS
        The cluster analysis involves three main choices: the similarity/distance
measure, the clustering procedure, and the number of clusters. The first two are decided
beforehand, while the number of clusters is decided post hoc; before clustering, only a
range for the number of clusters can be specified.

2.1     Clustering procedures: The clustering procedure specifies how clusters are formed
and how cluster membership is determined. Several high-level strategies are available for
forming clusters, such as hierarchical, K-means, optimization, statistical model-based,
and fuzzy clustering, but none of them is perfect.
        Hierarchical methods require neither an initial number of clusters nor initial
cluster centers. The clusters generated are nested, which allows for a more
straightforward evaluation of the trade-offs between more or fewer clusters. However,
hierarchical methods are sensitive to outliers, and they cannot correct poor early
classifications.
        Nonhierarchical methods are less sensitive to outliers, the choice of distance
measure and inappropriate variables, so they are appropriate when the objective is to
partition the data. However, to use nonhierarchical methods we must specify the number of
clusters beforehand, and the solutions are very sensitive to the choice of cluster seeds.

2.2     Linkage Method: Agglomerative hierarchical cluster procedures require a
method for determining which two clusters to join at each step of the algorithm. Several
linkage methods are available, such as single linkage, complete linkage, average
linkage, the centroid method, the median method, and Ward’s method. Here Ward’s method is
used in the hierarchical cluster procedure.
        Ward’s method is a minimum variance method. At each step, it joins the pair of
clusters whose merger produces the smallest increase in the within-cluster sum of squared
distances from the cluster means. Since Ward’s method always tries to minimize
within-group variance, it tends to combine small clusters (the variance of small clusters
is small). Therefore, it is biased towards creating clusters with approximately the same
number of objects. An important feature of Ward’s method is its tendency to produce
cluster solutions similar to those obtained from k-means clustering. Since our clustering
procedure is hierarchical followed by k-means, using Ward’s method in the hierarchical
procedure generates solutions that are more compatible with the next step (the k-means
method).
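
As a rough illustration of the merging rule, the sketch below computes, for every pair of clusters, the increase in the within-cluster sum of squares that a merge would cause, and picks the pair with the smallest increase (which is equivalent to minimizing the total within-cluster sum of squares after the merge). The function names and the toy data are hypothetical; the analysis in this memorandum was run in SPSS, not with this code.

```python
from itertools import combinations

def centroid(points):
    """Mean vector of a list of equal-length tuples."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def best_merge(clusters):
    """Indices (i, j) of the pair whose merge raises the sum of squares least."""
    return min(combinations(range(len(clusters)), 2),
               key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))

# Toy data (made up): three clusters of points in two dimensions.
clusters = [[(1.0, 1.0), (1.0, 2.0)], [(1.5, 1.5)], [(5.0, 5.0), (6.0, 5.0)]]
pair = best_merge(clusters)
print(pair)
```

Repeating this step until one cluster remains, and recording the cost at each step, gives a schedule of the same kind as the agglomeration schedule discussed in Section 2.3.1.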

        Similarity / Distance measures: Clustering methods require some way to
measure the distance between pairs of data objects. There are several types of measure to
choose from, including correlational measures, distance measures (such as Euclidean,
city-block and Mahalanobis distance) and association measures.
        Euclidean distance is frequently applied when the variables are on the same
scale. Its squared form avoids taking a square root, which reduces computation time, and
it works well with the centroid and Ward’s clustering methods. Since all five factors are
measured on the same scale, we can safely use squared Euclidean distance in the
hierarchical clustering procedure.
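
For concreteness, the following minimal sketch computes two of the distance measures named above for a pair of respondents described by their five factor scores. The two score vectors are invented for illustration and do not come from the data set.

```python
def squared_euclidean(x, y):
    """Sum of squared coordinate differences (no square root taken)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def city_block(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical scores on Escapism, Anti-Social, Communal, Reality, Slacking.
respondent_a = (4, 2, 3, 5, 1)
respondent_b = (2, 2, 4, 3, 1)

d_sq = squared_euclidean(respondent_a, respondent_b)
d_cb = city_block(respondent_a, respondent_b)
print(d_sq, d_cb)
```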



2.3     Number of cases and range of number of clusters: Clustering methods require
some way to validate the solutions. We split the 1017 cases into two sub-samples of 509
and 508 cases, and use cross-validation to validate the cluster solution. The cluster
analysis discussed below is based on the 509-case sub-sample. Each sub-sample is expected
to cluster into 2 to 9 segments.
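
A random split of this kind can be sketched as follows. The function name, the case numbering from 1 to 1017 and the seed are assumptions made for the example; the memorandum does not say how the split was drawn.

```python
import random

def split_cases(case_ids, seed=0):
    """Shuffle the case ids and cut them into two halves (first half larger)."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    half = (len(shuffled) + 1) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical case ids 1..1017.
calibration, validation = split_cases(range(1, 1018))
print(len(calibration), len(validation))
```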

2.3.1 Determine appropriate number of clusters by Ward’s Procedure
         There are no objective statistical criteria for determining how many clusters are
appropriate in a cluster analysis. Yet we can still get some hints from the tools provided
by the hierarchical procedure.
         For each step of the algorithm, the agglomeration schedule in SPSS gives the
distance between the clusters joined at that step. It also indicates how cases and
clusters are merged together. A relatively large jump in the agglomeration schedule
indicates that dissimilar groups are being merged, so the clusters in the previous step
may be a better solution. Table 6 lists the last 12 steps of the agglomeration schedule.




Table 6. Agglomeration Schedule

                                                         Difference of     Stage Cluster
          Cluster Combined                 Difference    Differences     First Appears        Next
  Stage  Cluster 1  Cluster 2  Coefficients      of CDs         of CDs  Cluster 1  Cluster 2   Stage
    497          9         12       623.472      23.458                       493        459     503
    498          2         41       646.930      26.823          3.365        174        491     507
    499          8         18       673.753      28.865          2.042        490        485     503
    500         20         24       702.618      34.020          5.155        494        486     504
    501         28         65       736.638      41.902          7.882        489        477     502
    502          7         28       778.540      58.994         17.092        496        501     506
    503          8          9       837.534      75.584         16.590        499        497     505
    504          1         20       913.118     110.950         35.366        488        500     505
    505          1          8      1024.068     140.017         29.067        504        503     507
    506          7         32      1164.085     195.251         55.234        502        495     508
    507          1          2      1359.336     388.070        192.819        505        498     508
    508          1          7      1747.406                                   507        506       0



         From Table 6, we find that the difference of differences at stage 504 (35.366) is
larger than the preceding and following values, and the same holds at stage 502 (17.092).
With 509 cases, the solution after stage s contains 509 - s clusters, so these stages
correspond to 5 and 7 clusters. Figure 1 plots the coefficients (CD index) against the
number of clusters. We search for a significant change in values, i.e. the “knee” in the
graph. It seems that we need to try clustering the data into 5 or 7 segments.
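
The differences used in this argument can be recomputed from the coefficients in the agglomeration schedule. The sketch below copies the twelve coefficients from Table 6 and looks for stages whose difference of differences exceeds both neighbouring values. The variable names are illustrative, and the "larger than both neighbours" rule is only one simple way to formalize the visual inspection described above.

```python
# Coefficients of the last 12 stages, copied from Table 6.
coefficients = {
    497: 623.472, 498: 646.930, 499: 673.753, 500: 702.618,
    501: 736.638, 502: 778.540, 503: 837.534, 504: 913.118,
    505: 1024.068, 506: 1164.085, 507: 1359.336, 508: 1747.406,
}
N_CASES = 509  # size of the sub-sample being clustered

stages = sorted(coefficients)

# First difference: jump in the coefficient from stage s to stage s + 1.
diff = {s: coefficients[s + 1] - coefficients[s] for s in stages[:-1]}

# Second difference: how much larger that jump is than the previous one.
diff2 = {s: diff[s] - diff[s - 1] for s in stages[1:-1]}

# Stages whose second difference exceeds both neighbouring values.
peaks = [s for s in stages[2:-2]
         if diff2[s] > diff2[s - 1] and diff2[s] > diff2[s + 1]]

# After stage s, N_CASES - s clusters remain.
candidate_k = [N_CASES - s for s in peaks]
print(peaks, candidate_k)
```

With the Table 6 values, the peaks fall at stages 502 and 504, which correspond to 7 and 5 clusters.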

Figure 1. The Relations of Coefficients and Number of clusters.

[Chart "Cluster Validity": CD index (vertical axis, 0 to 2000) against number of clusters (horizontal axis, 1 to 15).]

      We then compute the RS index for each number of clusters derived by the hierarchical
procedure. Figure 2 shows the RS indices for 2 to 9 clusters. The “knee” appears to be at
4 clusters.
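
The memorandum does not define the RS index. A common definition in the cluster validity literature is the R-squared ratio of the between-cluster sum of squares to the total sum of squares, and the sketch below assumes that definition. The function name and the toy data are hypothetical.

```python
def rs_index(points, labels):
    """R-squared: (total SS - within-cluster SS) / total SS (assumed definition)."""
    dims = range(len(points[0]))
    grand = [sum(p[d] for p in points) / len(points) for d in dims]
    total_ss = sum((p[d] - grand[d]) ** 2 for p in points for d in dims)
    within_ss = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        center = [sum(p[d] for p in members) / len(members) for d in dims]
        within_ss += sum((p[d] - center[d]) ** 2 for p in members for d in dims)
    return (total_ss - within_ss) / total_ss

# Toy data (made up): four points forming two tight pairs.
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 1.0), (5.0, 3.0)]
rs_two = rs_index(points, [0, 0, 1, 1])   # two clusters
rs_one = rs_index(points, [0, 0, 0, 0])   # everything in one cluster
print(rs_two, rs_one)
```

Under this definition RS is 0 for a single cluster and rises towards 1 as clusters are added, so the quantity of interest is where the gain from one more cluster levels off.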

Figure 2. The RS indices for each number of clusters derived by Ward’s procedure.

[Chart: RS index (vertical axis, 0 to 0.6) against number of clusters (horizontal axis, 2 to 9).]


2.3.2 Determine appropriate number of clusters by K-means Procedure
        After the hierarchical procedure is finished, all 509 cases have been classified
into 2 to 9 clusters. The seeds (centroids) of each derived cluster can be obtained with
“Compare mean” in SPSS. These seeds are used as the initial centers for K-means
clustering.
        K-means is a nonhierarchical clustering procedure. The number of clusters and the
number of iterations are specified prior to clustering. The basic algorithm proceeds as
follows:
        First, initial cluster seeds (centers) are selected. Then data points are assigned to
the nearest cluster center. Centers are then recalculated based on the associated data
points. Data points are reassigned to the new cluster centers. The algorithm terminates
according to a predefined stop rule.
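
The steps above can be sketched in a few lines. This is a minimal illustration, not the SPSS implementation: the stop rule used here (assignments no longer change, or a maximum number of iterations is reached) is an assumption, and the seeds and data points are invented, whereas in the analysis the seeds are the cluster centroids from the Ward solution.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means(points, seeds, max_iter=10):
    """K-means started from the given seeds; returns (labels, centers)."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest current center.
        new_labels = [min(range(len(centers)),
                          key=lambda k: sq_dist(p, centers[k]))
                      for p in points]
        if new_labels == labels:      # assumed stop rule: no reassignment
            break
        labels = new_labels
        # Recompute each center as the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:               # an empty cluster keeps its old center
                centers[k] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers

# Toy data and seeds (made up for illustration).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = k_means(points, seeds=[(0, 0), (10, 10)])
print(labels, centers)
```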
        Then we compute the RS index for each number of clusters derived by the K-means
procedure. Figure 3 shows the RS indices for 2 to 9 clusters, and Table 7 lists the values.
The change at 7 clusters is more notable than the change at 4 or 5 clusters: the change of
RS at 7 clusters (0.034638) is larger than the changes at 6 clusters (0.03397) and 8
clusters (0.023664), whereas the changes at 4 and 5 clusters continue a steady decline. We
therefore decide to derive 7 clusters.
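
The comparison of RS changes can be checked with a short calculation on the values reported in Table 7. The rule used below (flag a number of clusters when its gain in RS is larger than the gain at the previous number of clusters) is an assumed way of expressing the argument in code, and the variable names are illustrative.

```python
# RS index by number of clusters, copied from Table 7.
rs = {2: 0.255993, 3: 0.379124, 4: 0.459984, 5: 0.525576,
      6: 0.559546, 7: 0.594184, 8: 0.617848, 9: 0.631563}

# Gain in RS obtained by going from k - 1 to k clusters.
change = {k: rs[k] - rs[k - 1] for k in range(3, 10)}

# Numbers of clusters whose gain is larger than the gain one step earlier.
rebounds = [k for k in range(4, 10) if change[k] > change[k - 1]]
print(rebounds)
```

With these values the gains decline everywhere except at 7 clusters, which is the only entry flagged.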

Table 7. RS indices and change of RS for each number of clusters (K-means procedure)

Number of Clusters         2         3         4         5         6         7         8         9
RS                  0.255993  0.379124  0.459984  0.525576  0.559546  0.594184  0.617848  0.631563
Change of RS                  0.123131  0.080860  0.065592  0.033970  0.034638  0.023664  0.013714

Figure 3. The RS indices for each number of clusters derived by K-means procedure.

[Chart "cluster validity": RS (vertical axis, 0 to 0.7) against number of clusters (horizontal axis, 2 to 9).]




2.4. Interpret clusters
        The final cluster centers listed in Table 8 are the cluster means of the 7 clusters
derived by K-means following Ward’s procedure (1017 cases). We consider values (for
the five factors) above the mean as “high”, below the mean as “low”, and close to the mean
as “moderate”. Combined with the crosstabulation tables for each cluster in the Appendix,
the derived clusters can be described as follows:
      • Cluster 1: the persons in this group have high scores on Anti-Social, moderate
          scores on Escapism and Slacking, and low scores on Communal and Reality. They
          tend to be younger than average. This group has the highest percentage of
          single persons and a low rate of having children. The percentage of students
          is also high compared to other groups. Education level, annual income and SES
          are polarized: the percentages at the high levels are not low, and are even
          higher than average, but the percentages at the middle levels are low and the
          percentages at the low levels are high.
      • Cluster 2: the persons in this group have low scores on all five factors, and
          it is the smallest group (41 cases). They are mostly adults, with only a few
          teenagers. Naturally, they have a higher percentage of married persons and
          persons with children, but they do not tend to have more than 3 children
          compared to other groups. Most of the persons in this group have professional
          jobs, and far fewer of them are students compared to other groups. They are
          well educated and have high income. In particular, the percentage of this
          group with an annual income of $50,000-100,000 is much higher than in other
          groups. As expected, the group has high SES, too.
      • Cluster 3: the persons in this group have high scores on all five factors.
          Many characteristics of this group are the opposite of the second cluster.
          The persons in this group are young. Most of them are aged 13-15 or 20-29,
          ages at which people tend to be less settled. The percentage of married
          persons and persons having children is not high. The percentages of Black
          persons, students and unemployed persons in this group are higher than in
          other groups. Education level and annual income are also lower than in other
          groups. Most persons in this group have an income under $50,000. They also
          have lower SES than other groups.
      • Cluster 4: the persons in this group have low scores on Escapism, Anti-Social
          and Reality, a moderately low score on Communal and a high score on Slacking.
          The persons in this group are average persons. They have lower income and SES
          but a slightly higher education level than average.
      • Cluster 5: the persons in this group have high scores on Escapism, Communal
          and Reality, a moderate score on Anti-Social and a low score on Slacking. The
          persons in this group are average persons, too. They tend to be younger and
          have a slightly lower education level than average. A high percentage of them
          are students, and they have average annual income and SES.
      • Cluster 6: the persons in this group have high scores on Escapism, Reality
          and Slacking, a moderately low score on Communal, and a low score on
          Anti-Social. The persons in this group tend to be older, even older than
          cluster 2. They are well educated and have good jobs and high annual income.
          Their SES is higher than average but lower than that of cluster 2 and
          cluster 7.
      • Cluster 7: the persons in this group have a moderately low score on Escapism
          and low scores on Anti-Social, Communal, Reality and Slacking. The persons in
          this group are middle-aged and tend to be single. They have a higher education
          level and income than average. They have the highest percentage at the very
          high income level (more than $100,000). They also have slightly higher SES
          than average. However, there are also high percentages of low income and low
          SES in this group, so this group appears to be polarized as well.
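
The high/moderate/low labelling can be illustrated with the factor means from Table 8. The memorandum does not state how close to the mean a value must be to count as “moderate”, so the tolerance of 0.15 below is a hypothetical choice, and labels for borderline values may differ from the descriptions above.

```python
FACTORS = ["Escapism", "Anti-Social", "Communal", "Reality", "Slacking"]

# Factor means for the total sample and for each cluster, copied from Table 8.
TOTAL_MEAN = [3.1979, 2.2947, 3.3569, 3.4267, 3.1052]
CENTERS = {
    1: [3.1335, 2.9372, 3.1759, 3.0766, 3.0879],
    2: [1.5540, 1.4573, 1.7967, 1.8598, 1.6951],
    3: [4.0360, 3.5216, 3.9712, 4.0378, 4.0324],
    4: [2.3873, 1.4855, 3.1734, 3.0289, 3.6243],
    5: [3.7299, 2.3985, 4.1273, 3.9303, 2.5727],
    6: [3.4181, 1.8129, 3.1656, 3.8374, 3.7055],
    7: [3.0542, 1.8376, 2.9951, 3.1916, 1.8832],
}
TOLERANCE = 0.15  # hypothetical cutoff for "close to the mean"

def label(value, mean, tol=TOLERANCE):
    """Classify a cluster mean relative to the total-sample mean."""
    if value > mean + tol:
        return "high"
    if value < mean - tol:
        return "low"
    return "moderate"

profiles = {
    cluster: {f: label(v, m) for f, v, m in zip(FACTORS, center, TOTAL_MEAN)}
    for cluster, center in CENTERS.items()
}
for cluster, profile in profiles.items():
    print(cluster, profile)
```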



Table 8. Final cluster centers from K-means solutions with initial centers from Ward’s method

                                                           Cluster Number of Case
                                                  1        2        3        4        5        6        7    Total
Q.B What is your age-group?                    4.57     5.00     4.24     4.77     4.33     5.15     4.95     4.68
Q.A What is your marital status? Are you       1.69     1.74     1.63     1.80     1.69     1.72     1.68     1.71
Q.B How many children do you have living       6.82     5.03     6.44     6.60     6.65     6.69     6.95     6.63
Q.D Which of the following best describe       1.56     2.00     1.44     1.51     1.66     1.42     1.39     1.52
Q.E Which of the following best describe       7.84     5.76     8.43     7.42     7.96     7.13     8.09     7.71
Q.F What was the last grade of school yo       3.57     3.98     3.19     3.71     3.34     3.83     3.73     3.59
Q.G What is your approximate annual hous       5.07     4.88     4.79     4.82     4.81     4.83     5.17     4.91
SES                                            1.67     2.10     1.54     1.71     1.67     1.81     1.75     1.71
Escapism                                     3.1335   1.5540   4.0360   2.3873   3.7299   3.4181   3.0542   3.1979
Anti-Social                                  2.9372   1.4573   3.5216   1.4855   2.3985   1.8129   1.8376   2.2947
Communal                                     3.1759   1.7967   3.9712   3.1734   4.1273   3.1656   2.9951   3.3569
Reality                                      3.0766   1.8598   4.0378   3.0289   3.9303   3.8374   3.1916   3.4267
Slacking                                     3.0879   1.6951   4.0324   3.6243   2.5727   3.7055   1.8832   3.1052


Table 9. Number of Cases in each Cluster

         Cluster   Frequency   Percent   Valid Percent   Cumulative Percent
 Valid      1         199        19.6         19.6               19.6
            2          41         4.0          4.0               23.6
            3         139        13.7         13.7               37.3
            4         173        17.0         17.0               54.3
            5         165        16.2         16.2               70.5
            6         163        16.0         16.0               86.5
            7         137        13.5         13.5              100.0
          Total      1017       100.0        100.0




Figure 4. Percentage of each cluster in a pie graph.

[Pie chart: share of cases in each of the 7 clusters (legend: Cluster Number of Case, 1 to 7).]




Figure 5. Average factors scores for each cluster.

[Chart: mean score (vertical axis, 1.00 to 4.00) of Escapism, Anti-Social, Communal, Reality and Slacking for each Cluster Number of Case (horizontal axis, 1 to 7).]




2.5. Validate solution
    Validation involves ensuring that a cluster solution is applicable to the population
and is not merely a product of the data and/or an artifact of the clustering procedure. As
mentioned before, the validation procedure in this report is cross-validation.
    The whole sample (1017 cases) is randomly divided into two subsets: a calibration set
(group0) and a validation set (group1). First we cluster the calibration set (group0) and
determine its cluster centers (hierarchical procedure followed by K-means). Then we
assign the cases in the validation set (group1) to the calibration clusters based on the
closest calibration cluster center. We call this cluster solution C1. Next we cluster the
validation set (group1) with the same number of clusters as the calibration set (group0).
We call this cluster solution C2. Table 10 cross-tabulates the two solutions.

Table 10. Cluster Number of Case * Cluster Number of Case Crosstabulation

                                        Cluster Number of Case
 Cluster Number of Case       1      2      3      4      5      6      7   Total
            1                 0     95      0      2      1      0      2     100
            2                 7      0      3      0      0      2      0      12
            3                 0      0      0     89      5      1      0      95
            4                 0      0      3      0     98      0      0     101
            5                 0      0     69      0      3      0      0      72
            6                 0      0      1      0      0     68      0      69
            7                 0      0      0      1      0      0     58      59
          Total               7     95     76     92    107     71     60     508


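      The calibration and validation cluster solutions are very similar: after matching
each row cluster in Table 10 to the column cluster it overlaps most, 484 of the 508 cases
(95.3%) fall in matching clusters. This strong agreement indicates the appropriateness of
the clusters.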
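
The degree of agreement can be computed directly from the counts in Table 10. Because the cluster numbers in the two solutions are arbitrary, the sketch below first matches each row cluster to the column cluster with which it shares the most cases, then counts the cases that fall in matched cells. This matching rule is an assumption made for the illustration; it works here because the resulting match is one-to-one.

```python
# Cell counts copied from Table 10 (rows and columns are the two solutions).
table = [
    [0, 95, 0, 2, 1, 0, 2],
    [7, 0, 3, 0, 0, 2, 0],
    [0, 0, 0, 89, 5, 1, 0],
    [0, 0, 3, 0, 98, 0, 0],
    [0, 0, 69, 0, 3, 0, 0],
    [0, 0, 1, 0, 0, 68, 0],
    [0, 0, 0, 1, 0, 0, 58],
]

# For each row cluster, the column cluster it overlaps most (0-based index).
match = [row.index(max(row)) for row in table]

# The matching is only meaningful if no column is claimed twice.
one_to_one = len(set(match)) == len(match)

agree = sum(row[j] for row, j in zip(table, match))
total = sum(sum(row) for row in table)
rate = agree / total
print(one_to_one, agree, total, round(rate, 3))
```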



