Chapter 4 An Excel-based Data Mining Tool (iData Analyzer)


                                    Jason C. H. Chen, Ph.D.
                                        Professor of MIS
                                School of Business Administration
                                       Gonzaga University
                                       Spokane, WA 99223
                                   chen@jepson.gonzaga.edu

                                Objectives
• This chapter introduces the iData Analyzer (iDA) and shows how to use two
  of the learner models contained in the iDA data mining software.
• Section 4.1 overviews the iDA model for knowledge discovery.
• Section 4.2 introduces ESX, an exemplar-based data mining tool capable of
  both supervised learning and unsupervised clustering.
• The chapter also shows how to represent datasets in the iDAV format and how
  to use ESX to perform unsupervised clustering and to build supervised
  learning models.
                                4.1 The iData Analyzer
       • iDA provides support for the business or technical
         analyst by offering a visual learning environment,
         an integrated tool set, and data mining process
         support.
       • iDA consists of the following components:
               – Preprocessor
               – Heuristic agent (for large datasets)
               – ESX
               – Neural Network
               – Rule Maker
               – Report Generator
         See p.107 and Appendix A-2 for installation instructions.
                                Limitations
       • The commercial version of iDA is limited to the size of a single MS Excel
         spreadsheet, i.e., up to 65,536 rows and 256 columns.
       • The iDA input format uses the first three rows of a
         spreadsheet to house information about individual
         attributes
               – Up to 65,533 data instances in attribute-value format
                 can be mined
               – The student version allows a maximum of 7,000 data
                 instances (i.e., 7003 rows)
   After completing the installation, if the security setting is set to High,
   change it to Medium and click OK.
Figure 4.1 – The iDA system architecture

Data enters through the Excel interface and is passed to the PreProcessor. If the
dataset is large, the heuristic agent selects a subset of the data before mining.
The chosen mining technique (neural networks or ESX) is then applied. If an
explanation or a set of rules is requested, RuleMaker generates rules; otherwise
the results pass directly to the report generator, which writes all output to
Excel sheets.
         4.2 ESX: A Multipurpose Tool for
                   Data Mining
       • ESX can help create target data, find irregularities
         in data, perform data mining, and offer insight
         about the practical value of discovered knowledge.
       • Features of the ESX learner model:
               – It supports both supervised learning and unsupervised clustering.
               – It provides an automated method for dealing with missing attribute
                 values.
               – It does not make statistical assumptions about the nature of the data
                 to be processed.
               – It can point out inconsistencies and unusual values in the data.
         Figure 4.3 An ESX concept hierarchy

         Root Level:       Root
         Concept Level:    C1   C2   ...   Cn
         Instance Level:   I11 I12 ... I1j   I21 I22 ... I2k   ...   In1 In2 ... Inl
         4.3 iDAV Format for Data Mining
        First Row: attribute names
        Second Row: C = categorical; R = real-valued
        Third Row: attribute usage (see Table 4.2 below)

Table 4.2 – Values for Attribute Usage

   Character     Usage
       I         The attribute is used as an input attribute.
       U         The attribute is not used (a categorical attribute whose values are
                 mostly unique has little predictive value).
       D         The attribute is not used for classification or clustering, but
                 attribute-value summary information is displayed in all output reports.
       O         The attribute is used as the output attribute. For supervised learning
                 with ESX, exactly one categorical attribute is selected as the output
                 attribute.
 Table 4.1 – Credit Card Promotion Database: iDAV Format

Income Range Magazine Promo Watch Promo Life Ins Promo Credit Card Ins.    Sex      Age
      C            C             C             C              C             C        R
      I             I            I              I              I            I         I
  40-50,000       Yes           No             No             No           Male      45
  30-40,000       Yes           Yes           Yes             No          Female     40
  40-50,000        No           No             No             No           Male      42
  30-40,000       Yes           Yes           Yes            Yes           Male      43
  50-60,000       Yes           No            Yes             No          Female     38
  20-30,000        No           No             No             No          Female     55
  30-40,000       Yes           No            Yes            Yes           Male      35
  20-30,000        No           Yes            No             No           Male      27
  30-40,000       Yes           No             No             No           Male      43
  30-40,000       Yes           Yes           Yes             No          Female     41
  40-50,000        No           Yes           Yes             No          Female     43
  20-30,000        No           Yes           Yes             No           Male      29
  50-60,000       Yes           Yes           Yes             No          Female     39
  40-50,000        No           Yes            No             No           Male      55
  20-30,000        No           No            Yes            Yes          Female     19
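To make the three iDAV header rows concrete, the Python sketch below writes the
first two instances of Table 4.1 in iDAV layout. It is illustrative only: iDA
itself reads Excel (.xls) sheets, and the file name used here is arbitrary.

    import csv

    # iDAV layout: row 1 = attribute names, row 2 = data types (C/R), row 3 = usage (I/U/D/O)
    names = ["Income Range", "Magazine Promo", "Watch Promo", "Life Ins Promo",
             "Credit Card Ins", "Sex", "Age"]
    types = ["C", "C", "C", "C", "C", "C", "R"]   # all categorical except the real-valued Age
    usage = ["I", "I", "I", "I", "I", "I", "I"]   # every attribute is used as an input attribute
    rows = [
        ["40-50,000", "Yes", "No", "No", "No", "Male", 45],
        ["30-40,000", "Yes", "Yes", "Yes", "No", "Female", 40],
        # ... remaining instances from Table 4.1
    ]

    with open("CreditCardPromotion-iDAV.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(names)
        writer.writerow(types)
        writer.writerow(usage)
        writer.writerows(rows)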
                    4.4 A Five-step Approach for
                      Unsupervised Clustering

           Step 1: Enter the Data to be Mined
           Step 2: Perform a Data Mining Session
           Step 3: Read and Interpret Summary Results
           Step 4: Read and Interpret Individual Class Results
           Step 5: Visualize Individual Class Rules



                          Step 1: Enter The Data To Be Mined




              Step 2: Perform A Data Mining Session




            Figure 4.5 – Unsupervised settings for ESX (#4,p.116)




     Value for instance similarity:
     A value closer to 100 encourages the formation of new clusters.
     A value closer to 0 favors placing new instances in existing clusters.
  The real-valued tolerance setting helps determine the similarity criteria
  for real-valued attributes. A setting of 1.0 is usually appropriate.
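A rough way to see how this setting behaves is the sketch below: a simple
incremental clusterer (not ESX's actual algorithm) in which an instance joins the
most similar existing cluster only if that similarity clears a threshold, and
otherwise starts a new cluster. Raising the threshold, like raising the instance
similarity value, produces more clusters.

    def similarity(instance, cluster):
        """Average fraction of attribute values shared with the cluster's members."""
        scores = [sum(a == b for a, b in zip(instance, member)) / len(instance)
                  for member in cluster]
        return sum(scores) / len(scores)

    def incremental_cluster(instances, threshold=0.7):
        """Assign each instance to the most similar cluster, or start a new one."""
        clusters = []
        for inst in instances:
            best, best_sim = None, 0.0
            for c in clusters:
                s = similarity(inst, c)
                if s > best_sim:
                    best, best_sim = c, s
            if best is not None and best_sim >= threshold:
                best.append(inst)        # similar enough: join an existing cluster
            else:
                clusters.append([inst])  # otherwise form a new cluster
        return clusters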
         #6 A message box indicating that eight clusters were formed.




                 This tells us the data has been successfully mined.
        #6, #7 (p.116) As a general rule, an unsupervised clustering that produces
        more than five or six clusters is likely to be less than optimal.




    #8 and #9, Repeat steps 1-4. For step 5, set the similarity value to 55




    Re-rule feature


                                                    Covering set rules:
                                                    RuleMaker will generate a
                                                    set of best-defining rules for
                                                    each class.




   Minimum rule correctness (50-100): if set to 80, the generated rules must have
   an error rate of at most 20%.
   Minimum rule coverage (10-100): if set to 10, RuleMaker will generate rules that
   cover 10% or more of the instances in each class.
   Attribute significance (start with 80-90): values close to 100 allow RuleMaker
   to consider only those attribute values most highly predictive of class
   membership when generating rules.
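The sketch below shows how these settings could be applied as filters over a set
of candidate rules. It illustrates what the thresholds mean, not how RuleMaker is
implemented; a rule is assumed to be a predicate over an instance, and instances
are dictionaries whose class label is stored under a "class" key.

    def rule_stats(rule, instances, target_class):
        """Return (correctness %, coverage %) for a rule that predicts target_class."""
        covered = [x for x in instances if rule(x)]
        members = [x for x in instances if x["class"] == target_class]
        correct = [x for x in covered if x["class"] == target_class]
        correctness = 100.0 * len(correct) / len(covered) if covered else 0.0
        coverage = 100.0 * len(correct) / len(members) if members else 0.0
        return correctness, coverage

    def keep_rule(rule, instances, target_class,
                  min_correctness=80.0, min_coverage=10.0):
        """A minimum correctness of 80 means the error rate may be at most 20%."""
        correctness, coverage = rule_stats(rule, instances, target_class)
        return correctness >= min_correctness and coverage >= min_coverage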
  #10 (p.117) Set minimum rule coverage at 30




                           A Production Rule for the
                        Credit Card Promotion Database
              IF Sex = Female & 19 <= Age <= 43
             THEN Life Insurance Promotion = Yes
                   Rule Accuracy: 100.00%
                   Rule Coverage: 66.67%
    Question: Can we assume that two-thirds of all females in
       the specified age range will take advantage of the
                           promotion?
  • Rule accuracy is a between-class measure.
  • Rule coverage is a within-class measure.
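The answer is no: coverage is measured within the class, so 66.67% says that
two-thirds of the people who accepted the life insurance promotion are females
aged 19-43, not that two-thirds of such females accept the promotion. The short
Python check below recomputes both figures from Table 4.1 (only the Sex, Age, and
Life Ins Promo columns are needed).

    # (sex, age, life_ins_promo) transcribed from Table 4.1
    data = [
        ("Male", 45, "No"),    ("Female", 40, "Yes"), ("Male", 42, "No"),
        ("Male", 43, "Yes"),   ("Female", 38, "Yes"), ("Female", 55, "No"),
        ("Male", 35, "Yes"),   ("Male", 27, "No"),    ("Male", 43, "No"),
        ("Female", 41, "Yes"), ("Female", 43, "Yes"), ("Male", 29, "Yes"),
        ("Female", 39, "Yes"), ("Male", 55, "No"),    ("Female", 19, "Yes"),
    ]

    covered = [x for x in data if x[0] == "Female" and 19 <= x[1] <= 43]
    in_class = [x for x in data if x[2] == "Yes"]

    # Rule accuracy (between-class): of the covered instances, how many are in class Yes?
    accuracy = 100.0 * sum(x[2] == "Yes" for x in covered) / len(covered)

    # Rule coverage (within-class): of the class Yes instances, how many does the rule cover?
    coverage = 100.0 * sum(x[0] == "Female" and 19 <= x[1] <= 43 for x in in_class) / len(in_class)

    print(round(accuracy, 2), round(coverage, 2))   # 100.0 66.67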

                               Output Reports:
                            Unsupervised Clustering
       • RES SUM: This sheet contains summary statistics
         about attribute values and offers several heuristics
         to help us determine the quality of a data mining
         session.
       • RES CLS: This sheet contains information about the clusters formed as a
         result of an unsupervised mining session.
       • RUL TYP: Instances are listed by their cluster
         number. The typicality of instance i is the average
         similarity of i to the other members of its cluster.
       • RES RUL: The production rules generated for each
         cluster are contained in this sheet.
        #10 (p.117) Set minimum rule coverage at 30




  Figure 4.7 Rules for the credit card promotion database




                       Step 3: Read and Interpret
                         Summary Results (p.117)
                                (Sheet1 RES SUM)

       • Class Resemblance Scores (RES)
       • Domain Resemblance Score
       • Domain Predictability




      Step 3: Read and Interpret Summary Results (p.119)




                Instances of Class 1
                have a best within-
                class fit

                                                                                        Similarity value
                                                                                        (within the class)




In general, the within-class RES scores should be higher than the domain RES.
It should be true for most of the classes.
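As a rough illustration of how these scores can be compared, the sketch below
assumes a resemblance score is the average pairwise similarity within a set of
instances and that similarity is the fraction of matching attribute values; both
are assumptions made for illustration only.

    from itertools import combinations

    def similarity(a, b):
        """Fraction of attribute values two instances share."""
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def resemblance(instances):
        """Average similarity over all pairs in the set (needs at least two instances)."""
        pairs = list(combinations(instances, 2))
        return sum(similarity(a, b) for a, b in pairs) / len(pairs)

    def clustering_looks_good(clusters, all_instances):
        """Most clusters should resemble themselves more than the domain as a whole."""
        domain = resemblance(all_instances)
        return sum(resemblance(c) > domain for c in clusters) > len(clusters) / 2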
    Figure 4.9 - Step 3: Read and Interpret Summary Results (cont.)




Figure 4.9 – Statistics for numerical attributes and common categorical attribute values
Step 3: Read and Interpret Summary Results (cont.)




                      Step 4: Read and Interpret
                     Individual Class Results (p.121)
                                (Sheet1 RES CLS)

     • Typicality
              – is defined as the average similarity of an instance to all
                other members of its cluster or class
     • Class Predictability is a within-class measure.
              – the percent of class instances having a particular value for a
                categorical attribute
     • Class Predictiveness is a between-class measure
               – it is defined as the probability that an instance resides in a specified
                 class given that the instance has a particular value for the chosen
                 attribute
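A minimal sketch of the two conditional measures just defined, assuming each
instance is a dictionary whose class label is stored under a "class" key:

    def class_predictability(instances, cls, attr, value):
        """Within-class: fraction of instances in cls that have attr == value."""
        members = [x for x in instances if x["class"] == cls]
        return sum(x[attr] == value for x in members) / len(members)

    def class_predictiveness(instances, cls, attr, value):
        """Between-class: fraction of instances with attr == value that belong to cls."""
        holders = [x for x in instances if x[attr] == value]
        return sum(x["class"] == cls for x in holders) / len(holders)

For example, a class predictability of 0.40 for Credit Card Ins = Yes within the
class of life insurance promotion acceptors matches the 2/5 figure reported later
for the supervised session.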

Figure 4.10 – Class 3 Summary Results




                                within-class   between-class




Figure 4.11 – Necessary and sufficient attribute values for Class 3




             Step 5: Visualize Individual Class Rules




                                   IF life ins Promo = Yes
                                   THEN Class = 3
                                          :rule accuracy 77.78%
                                          :rule coverage 100.00%

                    4.5 A Six-Step Approach for
                        Supervised Learning
       • Step 1: Choose an Output Attribute
               – Launch a fresh life insurance promotion
       • Step 2: Perform the Mining Session
       • Step 3: Read and Interpret Summary
         Results
       • Step 4: Read and Interpret Test Set Results
       • Step 5: Read and Interpret Class Results
       • Step 6: Visualize and Interpret Class Rules
                    Step 2: Perform the Mining Session
Filename: CreditCardPromotion-supervised.xls




 O: output; D: Display-Only
Step 2(#4): Select the number of instances for training and a real-valued tolerance setting (p.127)




  Step 3 – Read and Interpret Summary Results




        Domain statistics for categorical attributes tell us that 80% of the
        training instances represent individuals without credit card insurance.

  Step 3 – Read and Interpret Summary Results (cont.)




               Step 4 - Read and Interpret Test Set Results




  Step 5 - Read and Interpret Results for Individual Classes (p.130)




     Sheet1 RUL TYP




     In class Yes (Life Ins. Promo), instances with Credit Card Ins = Yes make up
     40% (2/5) of the class.

                   Step 6 – Visualize and Interpret Class Rules (p.130)
     Re-rule feature
                                                     Covering set rules:
                                                     RuleMaker will generate a
                                                     set of best-defining rules for
                                                     each class.




               4.6 Techniques for Generating
                           Rules

                 1.         Define the scope of the rules.
                 2.         Choose the instances.
                 3.         Set the minimum rule correctness.
                 4.         Define the minimum rule coverage.
                 5.         Choose an attribute significance value.



                                4.7 Instance Typicality
                                  Typicality Scores

          • Identify prototypical and outlier instances.
          • Select a best set of training instances.
           • Compute individual instance classification confidence scores.
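A small sketch of the definition from Section 4.4, Step 4 (the typicality of an
instance is its average similarity to the other members of its cluster or class),
with similarity again assumed to be the fraction of matching attribute values:

    def typicality(index, cluster):
        """Average similarity of cluster[index] to the other members of its cluster."""
        target = cluster[index]
        others = [m for j, m in enumerate(cluster) if j != index]
        sims = [sum(a == b for a, b in zip(target, m)) / len(target) for m in others]
        return sum(sims) / len(sims)

Instances with the highest typicality are prototypical, those with the lowest are
outlier candidates, and sorting the scores is one way to pick a best set of
training instances or to weight classification confidence.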


    Figure 4.13 Instance Typicality




                4.8 Special Considerations and
                            Features
       • Avoid Mining Delays
       • The Quick Mine Feature
                – Supervised learning with more than 2,000 training instances: iDA asks
                  whether to use the "quick mine" feature.
                – Unsupervised clustering with more than 2,000 data instances: ESX is
                  given a random selection of 500 instances (see the sampling sketch
                  after this list).
       • Erroneous and Missing Data
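The sampling idea behind quick mining can be sketched as follows; the function is
an illustration built around the thresholds quoted above, not iDA's
implementation.

    import random

    def quick_mine_sample(instances, limit=2000, sample_size=500, seed=None):
        """Mine everything when the dataset is small; otherwise take a random subset."""
        if len(instances) <= limit:
            return list(instances)
        rng = random.Random(seed)
        return rng.sample(list(instances), sample_size)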
                                Homework
     • Use ESX (and iDA) to perform a supervised data mining session using the
       CardiologyCategorical.xls data file.
    • Save output file as
      CardiologyCategorical-supervised.xls
    • Lab#4 (p.141)
    • Turn in
              – 1. The spreadsheet file (CardiologyCategorical-supervised.xls) that
                contains the outcome of the data mining session.
             – 2. Word file that includes (and explains) answers to all
               questions (a. thru n.)

				