1. Database Theory by fanzhongqing


									      COT5230 Data Mining


               Week 4
      Data Mining and Statistics
        Clustering Techniques



         MONASH
AUSTRALIA’S   INTERNATIONAL              UNIVERSITY



                      Data Mining and Statistics, Clustering Techniques   4.1
                  References

 Elder, John F. IV; Pregibon, Daryl; A Statistical
  Perspective on KDD; pp.87-93. Proceedings of
  the First International Conference on Knowledge
  Discovery & Data Mining (Ed. Fayyad, U.M. &
  Uthurusamy, R.), AAAI Press, Menlo Park,
  California 1995.
 Berry & Linoff (1997) Data Mining Techniques: For
  Marketing, Sales, and Customer Support, Wiley.
 Berson A. & Smith S.J. (1997) Data Warehousing,
  Data Mining and OLAP, McGraw-Hill.

The Link between Pattern and Approach


 Data mining aims to reveal knowledge about the
  data under consideration
 This knowledge takes the form of patterns within
  the data which embody our understanding of the
  data
   – Patterns are also referred to as structures, models and
     relationships
 The approach chosen is inherently linked to the
  pattern revealed


   A Taxonomy of Approaches to Data
             Mining - 1

 It is not expected that all the approaches will work
  equally well with all data sets
 Visualization of data sets can be combined with,
  or used prior to, modeling and assists in selecting
  an approach and indicating what patterns might
  be present




  A Taxonomy of Approaches to Data
            Mining - 2

Verification-driven     Discovery-driven
                        Predictive (Supervised)    Informative (Unsupervised)
Query and reporting     Regression                 Clustering
Statistical analysis    Classification             Association
                                                   Deviation detection (outliers)




      Verification-driven Data Mining
               Techniques - 1
 Verification-driven data mining techniques require
  the user to postulate some hypothesis
   – Simple query and reporting, or statistical analysis
     techniques then confirm this hypothesis
 Statistics has been neglected to a degree in data
  mining in comparison to less traditional
  techniques such as
   – neural networks, genetic algorithms and rule-based
     approaches to classification
 Many of these “less traditional” techniques also
  have a statistical interpretation

      Verification-driven Data Mining
               Techniques - 2

 The reasons for this neglect are various
 Statistical techniques are most useful for
  well-structured problems
 Many data mining problems are not well
  structured:
   – the statistical techniques break down or require large
     amounts of time and effort to be effective




Problems with Statistical Approaches - 1

 Traditional statistical models often highlight linear
  relationships but not complex non-linear
  relationships
 Exploring all possible higher-dimensional
  relationships usually takes an unacceptably
  long time
   – the non-linear statistical methods require knowledge
     about
      » the type of non-linearity
      » the ways in which the variables interact
   – This knowledge is often not known in complex multi-
     dimensional data mining problems

Problems with Statistical Approaches - 2

 Statisticians have traditionally focussed on model
  estimation, rather than model selection
 For these reasons less traditional, more
  exploratory, techniques are often chosen for
  modern data mining
 The current high level of interest in data mining
  centres on many of the newer techniques, which
  may be termed discovery-driven
 Lessons from statistics should not be forgotten.
  Estimation of uncertainty and checking of
  assumptions are as important as ever!

       Discovery-driven Data Mining
               Techniques

 Discovery-driven data mining techniques can also
  be broken down into two broad areas:
   – those techniques which are considered predictive,
     sometimes termed supervised techniques
   – those techniques which are termed informative,
     sometimes termed unsupervised techniques
 Predictive techniques build patterns by making a
  prediction of some unknown attribute given the
  values of other known attributes


 Informative techniques do not present a solution
  to a known problem
   – they present interesting patterns for consideration by
     some expert in the domain
   – the patterns may be termed “informative patterns”
 The main predictive and informative patterns are:
   – Regression
   – Classification
   – Clustering
   – Association


                   Regression

 Regression is a predictive technique which
  discovers relationships between input and output
  patterns, where the values are continuous or real
  valued
 Many traditional statistical regression models are
  linear
 Neural networks, though biologically inspired, are
  in fact non-linear regression models
 Non-linear relationships occur in many multi-
  dimensional data mining applications

 An Example of a Regression Model - 1
 Consider a mortgage provider that is concerned
  with retaining mortgages once taken out
 They may also be interested in how profit on
  individual loans is related to customers paying off
  their loans at an accelerated rate
   – For example, a customer may pay an additional
     amount each month and thus pay off their loan in 15
     years instead of 25 years
 A graph of the relationship between profit and the
  elapsed time between when a loan is actually
  paid off and when it was originally contracted to
  be paid off appears on the next slide
   An Example of a Regression Model - 2



[Figure: profit plotted against the number of years early the loan is paid off, with a linear fit and a non-linear (curved) fit drawn through the data.]




 An Example of a Regression Model - 3

 Linear regression on the data does not match the
  real pattern of the data
 The curved line represents what might be
  produced by a non-linear approach (perhaps a
  neural network)
 This curved line fits the data much better. It could
  be used as the basis on which to predict
  profitability
   – Decisions on exit fees and penalties for certain
     behaviors may be based on this kind of analysis.
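
The comparison on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the data points are invented (they are not the lecture's data), and a quadratic least-squares fit stands in for the non-linear model, where the slide suggests a neural network might be used. The helper names `fit_polynomial`, `predict` and `sum_squared_error` are hypothetical.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ..., c_degree] for c0 + c1*x + c2*x^2 + ...
    """
    n = degree + 1
    # Normal equations A c = b, where A[i][j] = sum(x^(i+j)) and b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


def predict(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))


def sum_squared_error(coeffs, xs, ys):
    return sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys))


# Invented data: years a loan was paid off early vs. profit (arbitrary units).
years_early = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
profit = [10.0, 9.5, 8.0, 5.5, 2.0, -2.5, -8.0, -14.5, -22.0, -30.5, -40.0]

linear = fit_polynomial(years_early, profit, degree=1)
quadratic = fit_polynomial(years_early, profit, degree=2)

linear_sse = sum_squared_error(linear, years_early, profit)
quadratic_sse = sum_squared_error(quadratic, years_early, profit)
print("linear SSE:", linear_sse)
print("quadratic SSE:", quadratic_sse)
```

On data with this kind of curvature the straight line leaves a much larger sum of squared errors than the curved fit, which is the sense in which the curved line "fits the data much better".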


     Exploratory Data Analysis (EDA)
 Classical statistics has a dogma that the data
  may not be viewed prior to modeling [Elde95]
   – the aim is to avoid choosing biased hypotheses
 During the 1970s the term Exploratory Data
  Analysis (EDA) was used to express the notion
  that both the choice of model and hints as to
  appropriate approaches could be data-driven
 Elder and Pregibon describe the dichotomy thus:
  “On the one side the argument was that hypotheses and the like must
  not be biased by choosing them on the basis of what the data seemed
  to be indicating. On the other side was the belief that pictures and
  numerical summaries of data are necessary in order to understand
  how rich a model the data can support.”
     EDA and the Domain Expert - 1




 It is a very hard problem to include common
  sense based on some knowledge of the domain
  in automated modeling systems
   – chance discoveries occur when exploring data that
     may not have occurred otherwise
   – these can also change the approach to the subsequent
     modeling



       EDA and the Domain Expert - 2


 The obstacles to entirely automating the process
  are:
     – It is hard to quantify a procedure to capture “the
       unexpected” in plots
     – Even if this could be accomplished, one would need to
       describe how this maps into the next analysis step in
       the automated procedure
     What is needed is a way to represent meta-
    knowledge about the problem at hand and the
    procedures commonly used


      An Interactive Approach to DM


 A domain expert is someone who has meta-
  knowledge about the problem
 An interactive exploration and a querying and/or
  visualization system guided by a domain expert
  goes beyond current statistical methods
 Current thinking on statistical theory recognizes
  such an approach as being potentially able to
  provide a more effective way of discovering
  knowledge about a data set


        Automatic Cluster Detection

 If there are many competing patterns, a data set
  can appear to contain just noise
 Subdividing a data set into clusters where
  patterns can be more easily discerned can
  overcome this
 When we have no idea how to define the clusters,
  automatic cluster detection methods can be
  useful
 Finding clusters is an unsupervised learning task


 Automatic Cluster Detection - example


 The Hertzsprung-Russell diagram, which graphs
  a star's luminosity against its temperature, reveals
  three clusters
   – It is interesting to note that each of the clusters has a
     different relationship between luminosity and
     temperature.
 In most data mining situations the variables to
  consider and the clusters that may be formed are
  not so easily determined


         The Hertzsprung-Russell diagram

[Figure: luminosity (Sun = 1) plotted against temperature (degrees Kelvin, axis labelled from 40,000 to 2,500), showing three labelled clusters: Red Giants, Main Sequence and White Dwarfs.]

            The K-Means Technique


 K, the number of clusters that are to be formed,
  must be decided before beginning
   – Step 1
      » Select K data points to act as the seeds (or initial centroids)
   – Step 2
      » Each record is assigned to the centroid which is nearest, thus
        forming a cluster
   – Step 3
      » The centroids of the new clusters are then calculated. Go back
        to Step 2
   – This is continued until the clusters stop changing
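
The three steps above can be sketched as follows. This is a minimal illustration, assuming numeric records and Euclidean distance; the function names and the six sample records are hypothetical, and the seeds are passed in by the caller rather than chosen at random.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def k_means(records, seeds, max_iterations=100):
    centroids = list(seeds)                  # Step 1: K seeds act as initial centroids
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each record to its nearest centroid.
        new_assignment = [nearest(r, centroids) for r in records]
        if new_assignment == assignment:     # the clusters have stopped changing
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its cluster.
        for k in range(len(centroids)):
            members = [r for r, a in zip(records, assignment) if a == k]
            if members:                      # an empty cluster keeps its old centroid
                centroids[k] = mean_point(members)
    return centroids, assignment


# Invented records forming two obvious groups.
records = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = k_means(records, seeds=[records[0], records[3]])
print(centroids)
print(assignment)
```

Different seeds can lead to different final clusters, so in practice the procedure is often run several times from different starting points.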


Assign Each Record to the Nearest
            Centroid


[Figure: records plotted on axes X1 and X2.]

     Calculate the New Centroids


[Figure: records plotted on axes X1 and X2.]



Determine the New Cluster Boundaries


[Figure: records plotted on axes X1 and X2.]



   Similarity, Association and Distance

 The method just described assumes that each
  record can be described as a point in a metric
  space
   – This is not easily done for many data sets (e.g.
     categorical and some numeric variables)
 The records in a cluster should have a natural
  association. A measure of similarity is required.
   – Euclidean distance is often used, but it is not always
     suitable
   – Euclidean distance treats changes in each dimension
     equally, but in databases changes in one field may be
     more important than changes in another

               Types of Variables

 Categories
   – e.g. Food Group: Grain, Dairy, Meat, etc.
 Ranks
   – e.g. Food Quality: Premium, High Grade, Medium, Low
 Intervals
   – e.g. Temperature: the distance between two temperatures is meaningful
 True Measures
   – The measures have a meaningful zero point so ratios
     have meaning as well as distances
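
A small worked example of the difference between intervals and true measures, using temperature. It assumes the Celsius scale as the interval variable and the Kelvin scale as the true measure; the two sample temperatures are arbitrary.

```python
def celsius_to_kelvin(celsius):
    return celsius + 273.15


cool_c, warm_c = 10.0, 20.0
cool_k, warm_k = celsius_to_kelvin(cool_c), celsius_to_kelvin(warm_c)

# Differences agree on both scales, so distances are meaningful for intervals.
difference_c = warm_c - cool_c
difference_k = warm_k - cool_k

# Ratios disagree: only the Kelvin scale has a meaningful zero point.
ratio_c = warm_c / cool_c
ratio_k = warm_k / cool_k

print(difference_c, difference_k)
print(ratio_c, ratio_k)
```

The Celsius ratio of 2 does not mean that 20 degrees is "twice as hot" as 10 degrees; the Kelvin ratio is only slightly above 1.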


           Measures of Similarity



 Euclidean distance
 Angle between two vectors (from origin to data
  point)
 The number of features in common
 Mahalanobis distance
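
The four measures listed above can each be written in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, "features in common" is taken to mean the number of positions at which two categorical records agree, and the Mahalanobis distance is written out for the two-dimensional case only, with the covariance matrix supplied by the caller.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def angle_between(a, b):
    """Angle in radians between the vectors from the origin to a and to b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    # Clamp to guard against rounding just outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def features_in_common(a, b):
    """Number of positions at which two categorical records agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance between 2-D points for a 2x2 covariance matrix."""
    (sxx, sxy), (syx, syy) = cov
    det = sxx * syy - sxy * syx
    dx, dy = a[0] - b[0], a[1] - b[1]
    # d^2 = diff^T * inverse(cov) * diff, with the 2x2 inverse written out.
    d2 = (syy * dx * dx - (sxy + syx) * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


print(euclidean((1, 0), (0, 1)))
print(angle_between((1, 0), (0, 1)))
print(features_in_common(("Grain", "Premium", "Red"), ("Grain", "Low", "Red")))
print(mahalanobis_2d((0, 0), (2, 0), [[4.0, 0.0], [0.0, 1.0]]))
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance; with unequal variances, a difference along a high-variance dimension counts for less.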




            Weighting and Scaling


 Weighting allows some variables to assume
  greater importance than others.
   – The domain expert must decide if certain variables
     deserve a greater weighting
   – Statistical weighting techniques also exist
 Scaling attempts to apply a common range to
  variables so that differences are comparable
  between variables
   – This can also be statistically based
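
A sketch of one common statistically based scaling (z-scores) followed by a weighted Euclidean distance. The customer records, the field names and the weights are invented for illustration; in particular, the choice to weight income twice as heavily as age stands in for a domain expert's judgment.

```python
import math
import statistics


def standardize(column):
    """Rescale a column to zero mean and unit standard deviation (z-scores)."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    return [(v - mean) / sd for v in column]


def weighted_euclidean(a, b, weights):
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


# Invented customer records: age in years and annual income in dollars.
ages = [25, 35, 45, 55]
incomes = [30000, 32000, 90000, 95000]

scaled = list(zip(standardize(ages), standardize(incomes)))

# Unscaled, the income difference swamps the age difference.
raw_distance = math.dist((ages[0], incomes[0]), (ages[1], incomes[1]))
# After scaling, both fields contribute on a comparable footing.
scaled_distance = math.dist(scaled[0], scaled[1])
# Hypothetical expert weighting: income counts twice as much as age.
weighted_distance = weighted_euclidean(scaled[0], scaled[1], weights=(1.0, 2.0))

print(raw_distance, scaled_distance, weighted_distance)
```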



  Variations of the K-Means Technique
 There are problems with the simple K-means method
   – It does not deal well with overlapping clusters.
   – The clusters can be pulled off centre by outliers.
   – Records are either in or out of the cluster so there is no
     notion of likelihood of being in a particular cluster or not
 A Gaussian Mixture Model varies the approach
  already outlined by attaching a weighting based
  on a probability distribution to records which are
  close to or distant from the centroids initially
  chosen. There is then less chance of outliers
  distorting the situation. Each record contributes to
  some degree to each of the centroids
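
The idea that each record contributes to some degree to each centroid can be sketched as a single soft update step. This is a deliberately simplified illustration, not a full Gaussian mixture fit: it assumes spherical Gaussians with one shared, fixed standard deviation `sigma` and equal mixing proportions, and the function names are hypothetical.

```python
import math


def responsibilities(point, centroids, sigma=1.0):
    """Weights (summing to 1) of one record for each centroid."""
    squared = [math.dist(point, c) ** 2 for c in centroids]
    closest = min(squared)
    # Subtracting the smallest squared distance avoids underflow to all zeros.
    densities = [math.exp(-(s - closest) / (2 * sigma ** 2)) for s in squared]
    total = sum(densities)
    return [d / total for d in densities]


def soft_update(records, centroids, sigma=1.0):
    """Move each centroid to the responsibility-weighted mean of ALL records."""
    resp = [responsibilities(r, centroids, sigma) for r in records]
    dims = len(records[0])
    new_centroids = []
    for k in range(len(centroids)):
        weight = sum(row[k] for row in resp)
        new_centroids.append(tuple(
            sum(row[k] * r[d] for row, r in zip(resp, records)) / weight
            for d in range(dims)
        ))
    return new_centroids, resp


# Invented one-dimensional-looking data laid out along the X1 axis.
records = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
centroids, resp = soft_update(records, centroids=[(0.0, 0.0), (5.0, 0.0)])
print(centroids)
print(resp)
```

A record lying exactly midway between two centroids receives a weight of 0.5 for each, where simple K-means would have to place it wholly in one cluster.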
        Agglomeration Methods - 1


 A true unsupervised technique would not pre-
  determine the number of clusters
 A hierarchical technique would offer a hierarchy
  of clusters from large to small. This can be
  achieved in a number of ways
 An agglomerative technique starts out by
  considering each record as a cluster and
  gradually builds larger clusters by merging the
  records which are near each other
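
A minimal sketch of the agglomerative idea. It assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest pair of records); the slides do not say which linkage rule is intended, so that choice, the function names and the sample records are all assumptions.

```python
import math


def single_linkage(cluster_a, cluster_b):
    """Distance between two clusters: that of their closest pair of records."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerate(records):
    """Repeatedly merge the two nearest clusters; return the merge history."""
    clusters = [[r] for r in records]        # every record starts as its own cluster
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters that are closest together.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        distance = single_linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        history.append((distance, merged))
        # Replace the two clusters with their union (j > i, so remove j first).
        clusters.pop(j)
        clusters.pop(i)
        clusters.append(merged)
    return history


# Invented records: two close pairs that are far from each other.
records = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
history = agglomerate(records)
for distance, members in history:
    print(distance, members)
```

The merge history, read from first to last, is the cluster tree: small, tight clusters are formed first and the final merge joins everything into one cluster. Cutting the history at a chosen distance yields a particular number of clusters.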


        Agglomeration Methods - 2
 An example of an agglomerative cluster tree:




              Evaluating Clusters


 We desire clusters to have members which are
  close to each other and we also want the clusters
  to be widely spaced
 Variance measures are often used. Ideally, we
  want to minimize within-cluster variance and
  maximize between-cluster variance
 But variance is not the only important factor; for
  example, it will favor not merging clusters in a
  hierarchical technique
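
One way to make the variance criterion concrete is with sums of squared distances. This sketch assumes Euclidean distance and measures "within" as squared distance from each record to its cluster centroid and "between" as size-weighted squared distance from each cluster centroid to the overall centroid; the function names and sample clusters are hypothetical.

```python
def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_cluster_ss(clusters):
    """Sum of squared distances from each record to its own cluster centroid."""
    return sum(
        squared_distance(p, centroid(cluster)) for cluster in clusters for p in cluster
    )


def between_cluster_ss(clusters):
    """Size-weighted squared distances from cluster centroids to the overall centroid."""
    overall = centroid([p for cluster in clusters for p in cluster])
    return sum(len(c) * squared_distance(centroid(c), overall) for c in clusters)


# Invented clustering: two tight, well separated clusters.
separate = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
# The same records after merging everything into one cluster.
merged = [[(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]]

print(within_cluster_ss(separate), between_cluster_ss(separate))
print(within_cluster_ss(merged), between_cluster_ss(merged))
```

Merging the two clusters moves all of the between-cluster spread into the within-cluster total, so a criterion based on variance alone always scores the unmerged clustering as better. That is why variance by itself cannot decide where to stop in a hierarchical technique.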


Strengths and Weaknesses of Automatic Cluster Detection


 Strengths
   – is an undirected knowledge discovery technique
   – works well with many types of data
   – is relatively simple to carry out
 Weaknesses
   – Can be difficult to choose the distance measures and
     weightings
   – Can be sensitive to initial parameter choices
   – The clusters found can be difficult to interpret



								