Data Mining in Market Research by upg41041

VIEWS: 18 PAGES: 55

									Data Mining in Market Research
• What is data mining?
   – Methods for finding interesting structure in large databases
       • E.g. patterns, prediction rules, unusual cases
   – Focus on efficient, scalable algorithms
       • Contrasts with emphasis on correct inference in statistics
   – Related to data warehousing, machine learning
• Why is data mining important?
   – Well marketed; now a large industry; pays well
   – Handles large databases directly
   – Can make data analysis more accessible to end users
       • Semi-automation of analysis
       • Results can be easier to interpret than e.g. regression models
       • Strong focus on decisions and their implementation
CRISP-DM Process Model
          Data Mining Software
• Many providers of data mining software
   – SAS Enterprise Miner, SPSS Clementine, Statistica Data Miner,
     MS SQL Server, Polyanalyst, KnowledgeSTUDIO, …
   – See http://www.kdnuggets.com/software/suites.html for a list
   – Good algorithms important, but also need good facilities for
     handling data and meta-data
• We’ll use:
   – WEKA (Waikato Environment for Knowledge Analysis)
      • Free (GPLed) Java package with GUI
      • Online at www.cs.waikato.ac.nz/ml/weka
      • Witten and Frank, 2000. Data Mining: Practical Machine Learning
        Tools and Techniques with Java Implementations.
   – R packages
      • E.g. rpart, class, tree, nnet, cclust, deal, GeneSOM, knnTree,
        mlbench, randomForest, subselect
           Data Mining Terms
• Different names for familiar statistical concepts,
  from database and AI communities
   – Observation = case, record, instance
   – Variable = field, attribute
   – Analysis of dependence vs interdependence =
     Supervised vs unsupervised learning
   – Relationship = association, concept
   – Dependent variable = response, output
   – Independent variable = predictor, input
 Common Data Mining Techniques
• Predictive modeling
   – Classification
       • Derive classification rules
       • Decision trees
   – Numeric prediction
       • Regression trees, model trees
• Association rules
• Meta-learning methods
   – Cross-validation, bagging, boosting
• Other data mining methods include:
   – artificial neural networks, genetic algorithms, density estimation,
     clustering, abstraction, discretisation, visualisation, detecting
     changes in data or models
                  Classification
• Methods for predicting a discrete response
  – One kind of supervised learning
  – Note: in biological and other sciences, classification
    has long had a different meaning, referring to cluster
    analysis
• Applications include:
  – Identifying good prospects for specific marketing or
    sales efforts
     • Cross-selling, up-selling – when to offer products
     • Customers likely to be especially profitable
     • Customers likely to defect
  – Identifying poor credit risks
  – Diagnosing customer problems
  Weather/Game-Playing Data
• Small dataset
  – 14 instances
  – 5 attributes
    •   Outlook - nominal
    •   Temperature - numeric
    •   Humidity - numeric
    •   Wind - nominal
    •   Play
         – Whether or not a certain game would be played
         – This is what we want to understand and predict
       ARFF file for the weather data.
% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute   outlook { sunny, overcast, rainy }
@attribute   temperature numeric
@attribute   humidity numeric
@attribute   windy { true, false }
@attribute   play? { yes, no }

@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
   German Credit Risk Dataset
• 1000 instances (people), 21 attributes
  – “class” attribute describes people as good or bad
    credit risks
  – Other attributes include financial information and
    demographics
     • E.g. checking_status, duration, credit_history, purpose,
       credit_amount, savings_status, employment, Age, housing,
       job, num_dependents, own_telephone, foreign_worker
• Want to predict credit risk
• Data available at UCI machine learning data
  repository
     • http://www.ics.uci.edu/~mlearn/MLRepository.html
  – and on 747 web page
     • http://www.stat.auckland.ac.nz/~reilly/credit-g.arff
       Classification Algorithms
• Many methods available in WEKA
   – 0R, 1R, NaiveBayes, DecisionTable, ID3, PRISM,
     Instance-based learner (IB1, IBk), C4.5 (J48), PART,
     Support vector machine (SMO)
• Usually train on part of the data, test on the rest
• Simple method – Zero-rule, or 0R
   – Predict the most common category
      • Class ZeroR in WEKA
   – Too simple for practical use, but a useful baseline for
     evaluating performance of more complex methods
         1-Rule (1R) Algorithm
• Based on single predictor
   – Predict mode within each value of that predictor
• Look at error rate for each predictor on training
  dataset, and choose best predictor
• Called OneR in WEKA
• Must group numerical predictor values for this
  method
   – Common method is to split at each change in the
     response
   – Collapse buckets until each contains at least 6
     instances
      1R Algorithm (continued)
• Biased towards predictors with more categories
  – These can result in over-fitting to the training data
• But found to perform surprisingly well
  – Study on 16 widely used datasets
     • Holte (1993), Machine Learning 11, 63-91
  – Often error rate only a few percentages points higher
    than more sophisticated methods (e.g. decision trees)
  – Produced rules that were much simpler and more
    easily understood
         Naïve Bayes Method
• Calculates probabilities of each response value,
  assuming independence of attribute effects
                             P  A X P B X P C X P X 
         P  X A, B, C  
                                      P ( A, B, C )
• Response value with highest probability is
  predicted
• Numeric attributes are assumed to follow a
  normal distribution within each response value
  – Contribution to probability calculated from normal
    density function
  – Instead can use kernel density estimate, or simply
    discretise the numerical attributes
       Naïve Bayes Calculations




• Observed counts and probabilities above
   – Temperature and humidity have been discretised
• Consider new day
   –   Outlook=sunny, temperature=cool, humidity=high, windy=true
   –   Probability(play=yes) α 2/9 x 3/9 x 3/9 x 3/9 x 9/14= 0.0053
   –   Probability(play=no) α 3/5 x 1/5 x 4/5 x 3/5 x 5/14= 0.0206
   –   Probability(play=no) = 0.0206/(0.0053+0.0206) = 79.5%
        • “no” four times more likely than “yes”
         Naïve Bayes Method
• If any of the component probabilities are zero,
  the whole probability is zero
  – Effectively a veto on that response value
  – Add one to each cell’s count to get around this
    problem
     • Corresponds to weak positive prior information
• Naïve Bayes effectively assumes that attributes
  are equally important
  – Several highly correlated attributes could drown out
    an important variable that would add new information
• However this method often works well in practice
                Decision Trees
• Classification rules can be expressed in a tree
  structure
   – Move from the top of the tree, down through various
     nodes, to the leaves
   – At each node, a decision is made using a simple test
     based on attribute values
   – The leaf you reach holds the appropriate predicted value
• Decision trees are appealing and easily used
   – However they can be verbose
   – Depending on the tests being used, they may obscure
     rather than reveal the true pattern
• More info online at http://recursive-partitioning.com/
If x=1 and y=1
   then class = a
If z=1 and w=1
   then class = a
Otherwise class = b




                Decision tree with a replicated subtree
Problems with Univariate Splits
   Constructing Decision Trees
• Develop tree recursively
  – Start with all data in one root node
  – Need to choose attribute that defines first split
     • For now, we assume univariate splits are used
  – For accurate predictions, want leaf nodes to be as
    pure as possible
  – Choose the attribute that maximises the average
    purity of the daughter nodes
     • The measure of purity used is the entropy of the node
     • This is the amount of information needed to specify the value
       of an instance in that node, measured in bits
                                            n
               entropy p1 , p2 , , pn     pi log 2 pi
                                           i 1
      Tree stumps for the weather data



(a)                       (b)




(c)                       (d)
            Weather Example
• First node from outlook split is for “sunny”, with
  entropy – 2/5 * log2(2/5) – 3/5 * log2(3/5) = 0.971
• Average entropy of nodes from outlook split is
   5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971= 0.693
• Entropy of root node is 0.940 bits
• Gain of 0.247 bits
• Other splits yield:
   – Gain(temperature)=0.029 bits
   – Gain(humidity)=0.152 bits
   – Gain(windy)=0.048 bits
• So “outlook” is the best attribute to split on
Expanded tree stumps for weather data
 (a)                    (b)




             (c)
Decision tree for the weather data
     Decision Tree Algorithms
• The algorithm described in the preceding slides
  is known as ID3
  – Due to Quinlan (1986)
• Tends to choose attributes with many values
  – Using information gain ratio helps solve this problem
• Several more improvements have been made to
  handle numeric attributes (via univariate splits),
  missing values and noisy data (via pruning)
  – Resulting algorithm known as C4.5
     • Described by Quinlan (1993)
  – Widely used (as is the commercial version C5.0)
  – WEKA has a version called J4.8
             Classification Trees
• Described (along with regression trees) in:
   – L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, 1984.
     Classification and Regression Trees.
• More sophisticated method than ID3
   – However Quinlan’s (1993) C4.5 method caught up with CART in
     most areas
• CART also incorporates methods for pruning, missing
  values and numeric attributes
   – Multivariate splits are possible, as well as univariate
       • Split on linear combination Σcjxj > d
   – CART typically uses Gini measure of node purity to determine
     best splits
       • This is of the form Σp(1-p)
   – But information/entropy measure also available
             Regression Trees
• Trees can also be used to predict numeric
  attributes
  – Predict using average value of the response in the
    appropriate node
     • Implemented in CART and C4.5 frameworks
  – Can use a model at each node instead
     • Implemented in Weka’s M5’ algorithm
     • Harder to interpret than regression trees
• Classification and regression trees are
  implemented in R’s rpart package
  – See Ch 10 in Venables and Ripley, MASS 3rd Ed.
              Problems with Trees
• Can be unnecessarily verbose
• Structure often unstable
    – “Greedy” hierarchical algorithm
        • Small variations can change chosen splits at high level nodes, which then
          changes subtree below
        • Conclusions about attribute importance can be unreliable
• Direct methods tend to overfit training dataset
    – This problem can be reduced by pruning the tree
• Another approach that often works well is to fit the tree, remove all
  training cases that are not correctly predicted, and refit the tree on
  the reduced dataset
    – Typically gives a smaller tree
    – This usually works almost as well on the training data
    – But generalises better, e.g. works better on test data
• Bagging the tree algorithm also gives more stable results
    – Will discuss bagging later
  Classification Tree Example
• Use Weka’s J4.8 algorithm on German
  credit data (with default options)
  – 1000 instances, 21 attributes
• Produces a pruned tree with 140 nodes,
  103 leaves
•   === Run information ===

•   Scheme:      weka.classifiers.j48.J48 -C 0.25 -M 2
•   Relation: german_credit
•   Instances: 1000
•   Attributes: 21

•   Number of Leaves :         103

•   Size of the tree :         140

•   === Stratified cross-validation ===
•   === Summary ===

•   Correctly Classified Instances      739       73.9 %
•   Incorrectly Classified Instances     261      26.1 %
•   Kappa statistic                 0.3153
•   Mean absolute error                0.3241
•   Root mean squared error               0.4604
•   Relative absolute error           77.134 %
•   Root relative squared error        100.4589 %
•   Total Number of Instances          1000

•   === Detailed Accuracy By Class ===

•   TP Rate FP Rate Precision Recall F-Measure Class
•    0.883 0.597   0.775 0.883 0.826 good
•    0.403 0.117   0.596 0.403 0.481 bad

•   === Confusion Matrix ===

•    a b <-- classified as
•   618 82 | a = good
•   179 121 | b = bad
             Cross-Validation
• Due to over-fitting, cannot estimate prediction
  error directly on the training dataset
• Cross-validation is a simple and widely used
  method for estimating prediction error
• Simple approach
  – Set aside a test dataset
  – Train learner on the remainder (the training dataset)
  – Estimate prediction error by using the resulting
    prediction model on the test dataset
• This is only feasible where there is enough data
  to set aside a test dataset and still have enough
  to reliably train the learning algorithm
         k-fold Cross-Validation
• For smaller datasets, use k-fold cross-validation
   – Split dataset into k roughly equal parts
              Tr   Tr   Test   Tr   Tr   Tr     Tr   Tr   Tr


   – For each part, train on the other k-1 parts and use this part as
     the test dataset
   – Do this for each of the k parts, and average the resulting
     prediction errors
• This method measures the prediction error when training
  the learner on a fraction (k-1)/k of the data
• If k is small, this will overestimate the prediction error
   – k=10 is usually enough
    Regression Tree Example
• data(car.test.frame)
• z.auto <- rpart(Mileage ~ Weight,
  car.test.frame)
• post(z.auto,FILE=“”)
• summary(z.auto)
                                  24.58
                                    |
                                  n=60


                    Weight>=2568
                                      Weight< 2568

                   22.47                      30.93
                   n=45                       n=15


    Weight>=3088
                            Weight< 3088

20.41                                     24.43
n=22                                      n=23


                            Weight>=2748
                                              Weight< 2748

                           23.8                        25.63
                           n=15                         n=8
•   Call:
•   rpart(formula = Mileage ~ Weight, data = car.test.frame)
•    n= 60

•              CP nsplit rel error            xerror           xstd
•   1   0.59534912      0 1.0000000         1.0322233     0.17981796
•   2   0.13452819      1 0.4046509         0.6081645     0.11371656
•   3   0.01282843      2 0.2701227         0.4557341     0.09178782
•   4   0.01000000      3 0.2572943         0.4659556     0.09134201

•   Node number 1: 60 observations, complexity param=0.5953491
•    mean=24.58333, MSE=22.57639
•    left son=2 (45 obs) right son=3 (15 obs)
•    Primary splits:
•       Weight < 2567.5 to the right, improve=0.5953491, (0 missing)

•   Node number 2: 45 observations, complexity param=0.1345282
•    mean=22.46667, MSE=8.026667
•    left son=4 (22 obs) right son=5 (23 obs)
•    Primary splits:
•       Weight < 3087.5 to the right, improve=0.5045118, (0 missing)
•   …(continued on next page)…
•   Node number 3: 15 observations
•    mean=30.93333, MSE=12.46222

•   Node number 4: 22 observations
•    mean=20.40909, MSE=2.78719

•   Node number 5: 23 observations, complexity param=0.01282843
•    mean=24.43478, MSE=5.115312
•    left son=10 (15 obs) right son=11 (8 obs)
•    Primary splits:
•       Weight < 2747.5 to the right, improve=0.1476996, (0 missing)

•   Node number 10: 15 observations
•    mean=23.8, MSE=4.026667

•   Node number 11: 8 observations
•    mean=25.625, MSE=4.984375
     Regression Tree Example
           (continued)
• plotcp(z.auto)
• z2.auto <- prune(z.auto,cp=0.1)
• post(z2.auto, file="", cex=1)
Complexity Parameter Plot
                                            size of tree

                               1      2                      3       4




                         1.2
                         1.0
  X-val Relative Error

                         0.8
                         0.6
                         0.4




                               Inf   0.28                  0.042   0.011

                                                cp
                                  24.58
                                    |
                                  n=60


                    Weight>=2568
                                      Weight< 2568

                   22.47                      30.93
                   n=45                       n=15


    Weight>=3088
                            Weight< 3088

20.41                                     24.43
n=22                                      n=23


                            Weight>=2748
                                              Weight< 2748

                           23.8                        25.63
                           n=15                         n=8
Pruned Regression Tree
                     Endpoint = Mileage



                                          24.58
                                            |
                                          n=60




                      Weight>=2568
                                                  Weight< 2568



                   22.47                                     30.93
                   n=45                                      n=15




    Weight>=3088
                           Weight< 3088



 20.41                               24.43
 n=22                                n=23
           Classification Methods
• Project the attribute space into decision regions
    – Decision trees: piecewise constant approximation
    – Logistic regression: linear log-odds approximation
    – Discriminant analysis and neural nets: linear & non-linear separators
• Density estimation coupled with a decision rule
    – E.g. Naïve Bayes
• Define a metric space and decide based on proximity
    – One type of instance-based learning
    – K-nearest neighbour methods
        • IBk algorithm in Weka
    – Would like to drop noisy and unnecessary points
        • Simple algorithm based on success rate confidence intervals available in Weka
             – Compares naïve prediction with predictions using that instance
             – Must choose suitable acceptance and rejection confidence levels
• Many of these approaches can produce probability distributions as well
  as predictions
    – Depending on the application, this information may be useful
        • Such as when results reported to expert (e.g. loan officer) as input to their
          decision
   Numeric Prediction Methods
• Linear regression
• Splines, including smoothing splines and
  multivariate adaptive regression splines (MARS)
• Generalised additive models (GAM)
• Locally weighted regression (lowess, loess)
• Regression and Model Trees
  – CART, C4.5, M5’
• Artificial neural networks (ANNs)
  Artificial Neural Networks (ANNs)
• An ANN is a network of many simple processors (or units), that are
  connected by communication channels that carry numeric data
• ANNs are very flexible, encompassing nonlinear regression models,
  discriminant models, and data reduction models
    – They do require some expertise to set up
    – An appropriate architecture needs to be selected and tuned for each
      application
• They can be useful tools for learning from examples to find patterns
  in data and predict outputs
    – However on their own, they tend to overfit the training data
    – Meta-learning tools are needed to choose the best fit
• Various network architectures in common use
    – Multilayer perceptron (MLR)
    – Radial basis functions (RBF)
    – Self-organising maps (SOM)
• ANNs have been applied to data editing and imputation, but not
  widely
 Meta-Learning Methods - Bagging
• General methods for improving the performance of most learning
  algorithms
• Bootstrap aggregation, bagging for short
    – Select B bootstrap samples from the data
        • Selected with replacement, same # of instances
            – Can use parametric or non-parametric bootstrap
    – Fit the model/learner on each bootstrap sample
    – The bagged estimate is the average prediction from all these B models
• E.g. for a tree learner, the bagged estimate is the average prediction
  from the resulting B trees
• Note that this is not a tree
    – In general, bagging a model or learner does not produce a model or
      learner of the same form
• Bagging reduces the variance of unstable procedures like
  regression trees, and can greatly improve prediction accuracy
    – However it does not always work for poor 0-1 predictors
Meta-Learning Methods - Boosting
• Boosting is a powerful technique for improving
  accuracy
• The “AdaBoost.M1” method (for classifiers):
  – Give each instance an initial weight of 1/n
  – For m=1 to M:
     • Fit model using the current weights, & store resulting model m
     • If prediction error rate “err” is zero or >= 0.5, terminate loop.
     • Otherwise calculate αm=log((1-err)/err)
         – This is the log odds of success
     • Then adjust weights for incorrectly classified cases by
       multiplying them by exp(αm), and repeat
  – Predict using a weighted majority vote: ΣαmGm(x),
    where Gm(x) is the prediction from model m
Meta-Learning Methods - Boosting

• For example, for the German credit dataset:
  – using 100 iterations of AdaBoost.M1 with the
    DecisionStump algorithm,
  – 10-fold cross-validation gives an error rate of
    24.9% (compared to 26.1% for J4.8)
                 Association Rules
• Data on n purchase baskets in form (id, item1, item2, …, itemk)
    – For example, purchases from a supermarket
• Association rules are statements of the form:
    – “When people buy tea, they also often buy coffee.”
• May be useful for product placement decisions or cross-selling
  recommendations
• We say there is an association rule i1 ->i2 if
    – i1 and i2 occur together in at least s% of the n baskets (the support)
    – And at least c% of the baskets containing item i 1 also contain i2 (the
      confidence)
• The confidence criterion ensures that “often” is a large enough
  proportion of the antecedent cases to be interesting
• The support criterion should be large enough that the resulting rules
  have practical importance
    – Also helps to ensure reliability of the conclusions
                 Association rules
• The support/confidence approach is widely used
   – Efficiently implemented in the Apriori algorithm
       • First identify item sets with sufficient support
       • Then turn each item set into sets of rules with sufficient confidence
• This method was originally developed in the database
  community, so there has been a focus on efficient
  methods for large databases
   – “Large” means up to around 100 million instances, and about ten
     thousand binary attributes
• However this approach can find a vast number of rules,
  and it can be difficult to make sense of these
• One useful extension is to identify only the rules with
  high enough lift (or odds ratio)
Classification vs Association Rules
• Classification rules predict the value of a
  pre-specified attribute, e.g.
     • If outlook=sunny and humidity=high then play =no
• Association rules predict the value of an
  arbitrary attribute (or combination of
  attributes)
     • E.g. If temperature=cool then humidity=normal
     • If humidity=normal and play=no then windy=true
     • If temperature=high and humidity=high then
       play=no
     Clustering – EM Algorithm
• Assume that the data is from a mixture of normal
  distributions
   – I.e. one normal component for each cluster
• For simplicity, consider one attribute x and two
  components or clusters
   – Model has five parameters: (p, μ1, σ1, μ2, σ2) = θ
• Log-likelihood:
                l  X    log  p1 xi   1  p 2 xi ,
                            n


                           i 1

       where j xi  is the normal density for the j th component.
• This is hard to maximise directly
   – Use the expectation-maximisation (EM) algorithm instead
      Clustering – EM Algorithm
• Think of data as being augmented by a latent 0/1
  variable di indicating membership of cluster 1
• If the values of this variable were known, the log-
  likelihood would be:

         l  X , D    log d i1  xi   1  d i 2  xi 
                          n


                         i 1
• Starting with initial values for the parameters, calculate
  the expected value of di
• Then substitute this into the above log-likelihood and
  maximise to obtain new parameter values
   – This will have increased the log-likelihood
• Repeat until the log-likelihood converges
     Clustering – EM Algorithm
• Resulting estimates may only be a local
  maximum
   – Run several times with different starting points
     to find global maximum (hopefully)
• With parameter estimates, can calculate
  segment membership probabilities for
  each case
                                   1 xi 
          P Di  1 , xi  
                              1 xi   2  xi 
      Clustering – EM Algorithm
• Extending to more latent classes is easy
   – Information criteria such as AIC and BIC are often used to decide
     how many are appropriate
• Extending to multiple attributes is easy if we assume they
  are independent, at least conditioning on segment
  membership
   – It is possible to introduce associations, but this can rapidly
     increase the number of parameters required
• Nominal attributes can be accommodated by allowing
  different discrete distributions in each latent class, and
  assuming conditional independence between attributes
• Can extend this approach to a handle joint clustering and
  prediction models, as mentioned in the MVA lectures
    Clustering - Scalability Issues
•   k-means algorithm is also widely used
•   However this and the EM-algorithm are slow on large databases
•   So is hierarchical clustering - requires O(n2) time
•   Iterative clustering methods require full DB scan at each iteration
•   Scalable clustering algorithms are an area of active research
•   A few recent algorithms:
     – Distance-based/k-Means
         •   Multi-Resolution kd-Tree for K-Means [PM99]
         •   CLIQUE [AGGR98]
         •   Scalable K-Means [BFR98a]
         •   CLARANS [NH94]
     – Probabilistic/EM
         • Multi-Resolution kd-Tree for EM [Moore99]
         • Scalable EM [BRF98b]
         • CF Kernel Density Estimation [ZRL99]
           Ethics of Data Mining
• Data mining and data warehousing raise ethical and legal
  issues
• Combining information via data warehousing could violate
  Privacy Act
   – Must tell people how their information will be used when the data
     is obtained
• Data mining raises ethical issues mainly during application
  of results
   – E.g. using ethnicity as a factor in loan approval decisions
   – E.g. screening job applications based on age or sex (where not
     directly relevant)
   – E.g. declining insurance coverage based on neighbourhood if this
     is related to race (“red-lining” is illegal in much of the US)
• Whether something is ethical depends on the application
   – E.g. probably ethical to use ethnicity to diagnose and choose
     treatments for a medical problem, but not to decline medical
     insurance

								
To top