Design and Evaluation of Interactive Visual Data Mining Modeling

Document Sample
Design and Evaluation of Interactive Visual Data Mining Modeling Powered By Docstoc
					                   Interactive Visual Data Mining
                             Modeling
                                 to
                     Enhance Understanding and
                    Effectiveness of the Process

                           Dr. Yan Liu
Department of Biomedical, Industrial and Human Factors Engineering
                    Wright State University
                              Outline
   Background
   Research Objectives and Significance
   Conceptual Model
   Hypotheses
   Design of Interactive Visual Decision Tree Classification and
    Association Rules Modeling
   Experiments
   Conclusions




                                                                    1
                  Data Mining (DM): What and Why
   What Is DM
       A synonym for knowledge discovery in databases (KDD)
       Nontrivial process of identifying valid, novel, potentially useful, and ultimately
        understandable patterns in data (Fayyard et al., 1996)
       Lying at the interface of database management, machine learning, pattern
        recognition, statistics and visualization
   Why Is DM Useful
       Rapid development of information techniques produces vast amounts of data
       Knowledge discovered from data can be use for competitive advantage
                 Classification, prediction, association, clustering, etc.




                                                                                      2
              Data Mining Process


           Business                   Data
         Understanding            Understanding



 Deployment                Data
                                                Data
                                             Preparation


              Evaluation          Modeling



CRISP-DM (CRoss Industry Standard Process for DM)
               (Holsheimer,1999)                           3
                    Data Mining Process (Cont’d)
   Business Understanding
       Understand the objectives
       Define performance criteria
            Objective or subjective
       Assess current situations of the organization
            Background knowledge, data sources, resources, etc.
   Data Understanding
       Collect data
            From scratch or existing databases
       Describe data
            Volume, identities of attributes, format, etc.
       Explore/survey data
            Distributions of attributes, relations among a small number of attributes, results of
             simple aggregations, etc.
            Statistical analysis, data visualization, database queries can be useful tools
       Verify data quality
            Incomplete data, missing values, errors, etc.                                    4
                    Data Mining Process (Cont’d)
   Data Preparation
       “Garbage in, garbage out”
       Select data
            Based on relevance, technical constraints
       Clean data
            remove errors, fill in missing data with default values or estimates by modeling
       Construct data
            Generate new attributes (records), merge tables, transform data, etc.
       Reduce data
            Obtain a dataset much smaller yet retaining enough important information




                                                                                            5
                            Data Mining Process
   Modeling
       Select appropriate modeling techniques
            Supervised Learning (directed DM)
                Explanatory variables and one or more dependent variables
                Learn the relationship between the explanatory variables and dependent
                 variables
                Decision trees, artificial neural networks, Bayesian classification, support
                 vector machines (SVMs), regression, etc.
            Unsupervised learning
                No distinction between explanatory and dependent variables
                Understand relationships within the dataset
                Cluster analysis, association rules, self-organizing map, principal component
                 analysis, etc.
       Generate test design
            Test models’ quality and validity
       Build models
       Assess models
            According to domain knowledge, success criteria and test design
                                                                                          6
                   Data Mining Process (Cont’d)
   Evaluation
       Evaluate results
            With respect to the project objectives
       Review process
            Overlooked important factors or tasks
   Deployment
       Plan deployment
       Plan monitoring and maintenance
       Produce final report




                                                      7
                                     DM Modeling
   Automatic DM Modeling
       Little interaction with the computing machine
       May lead to unproductive process (Meneses & Grinstein,2001)
       Downplay the key roles played by humans (Brachman & Anand,1996)
            Understanding of the outputs
            Guiding the discovery process
   Interactive Visual DM Modeling (IVDMM)
       Involvement of Users in Constructing Models
            Data visualization
            Visualization of intermediate outcomes
       May Better Understandings of Models
       May Improve Effectiveness of Modeling



                                                                          8
                    Objectives and Significance

   Objectives
       Identify the key visualization components in IVDMM
       Study how to display them
   Significance of Research
       Theoretical
            Framework of integration of analytical DM techniques and information
             visualization
            Research topics for future exploration
       Practical
            Innovative visualization methods
            Design guidelines




                                                                                    9
Conceptual Model of IVDMM




                            10
                                  Hypotheses

   Hypotheses I, II, and III
       Compared to automatic model construction, interactive visual model
        construction through integrated visualization of model-building process and
        data visualization
            Improves the effectiveness of modeling (Hypothesis I)
            Enhances understanding of the applied DM algorithm (Hypothesis II)
            Brings users greater satisfaction (Hypothesis III)
   Hypothesis IV
       Compared to nonintegrated visualization of derived models and data
        visualization, integrated visualization of derived models and data visualization
            Promotes better understanding of models (Hypothesis IV)




                                                                                  11
Interactive Visual Decision Tree Classification
Interactive Visual Association Rules Modeling




                                                  12
    Interactive Visual Decision Tree Classification
                        (IVDT)
   Classification
       Distinguish data classes in order to predict
        unknown class labels
   Decision Tree Classifier
       Tree structure consisting of leaf and decision
        nodes
       Process
            Selecting the splits
                Information gain, Gini index, Chi-square
            Determining the size of the tree
                Pruning
                Classification accuracy in validation
                  dataset



                                                            13
Logic Diagram of the Proposed IVDT   14
   Visualization of Trees
       Overall structure
            Numbers of layers and nodes, relationships among nodes
       Content information associated with nodes and splits
            Number of records and class distribution


   Car Dataset
       University of California Irvine (UCI) Machine Learning Repository
       Input attributes
            Buying Price, Maintenance Price, Number of Doors, Number of Persons, Safety
       Class attribute
            Acceptability




                                                                                     15
                            Icicle Plot of a Tree Model


• Familiarity (Barlow & Neville, 2001)
• Ease of identification of tree structures (Barlow & Neville, 2001)
• Effective demonstration of node sizes and class distribution (Cleveland & McGill,
1987)
• Relatively effective in space usage (Liu & Salvendy, 2005)                   16
   Interactive Visual Decision Tree Construction




• Length and positions along a common aligned scale are effective in
visualizing quantitative information (Cleveland & McGill, 1987)
                                                                       17
   Interactive Visual Decision Tree Construction




                                                    18
    Integration of Data and Tree Visualizations




• Drill-through is an effective integration technique (Sprenger et al., 1998;
Thearling et al., 2002)

                                                                                19
         Interactive Visual Association Rules
                   Modeling (IVAR)
   Association Rules Modeling                                           Antecedent => Consequent
       Finds interesting associations or correlation
        relationships among items (binary attributes)                           Thursdays
       In the form of “if-then” statements                                        =>
       Measures
            Support (A=>B) = Pr (A and B)
            Confidence (A=>B) = Pr (B|A)                                         =>        +
       Process
            Generate all combinations of items whose support
             values meet the support threshold (large itemsets)
             From each large itemset, generate all rules that satisfy
             the confidence threshold



                                                                                            20
Logic Diagram of the Proposed IVAR
                                     21
   Visualization of Association Rules
        Mappings of items
             One-to-one, one-to-many, many-to-one, many-to-many
        Interestingness measures of rules
             Support, confidence

   Zoo Dataset
        University of California Irvine (UCI) Machine Learning Repository
        Items
             No Leg, Two Legs, Four Legs, Six Legs, Backbone, Toothed, Egg, Feather




                                                                                       22
            Rule-to-Item Matrix View of Association Rules


• Can display all types of mappings
• Effective in visualizing interestingness measures (Cleveland & McGill, 1987)
• Easily view rules associated with specific items (Wong et al., 1999)         23
   Interactive Visual Association Rules Derivation




                  Visualization of Single Items



                                                      24
   Interactive Visual Association Rules Derivation




            Itemset-to-Item Matrix View of Two Itemsets


• Consistent displays of graphics enhance their dissemination, accuracy, and
ease of comprehension (Hix & Hartson, 1993)                                    25
   Interactive Visual Association Rules Derivation




             Visualization of Rules with Two Items
                                                      26
   Interactive Visual Association Rules Derivation




                                                      27
Experiments




              28
                                Experiment One
   Hypotheses
       Compared to automatic model construction, interactive visual model
        construction through integrated visualization of model-building process and data
        visualization
            Improves the effectiveness of modeling (Hypothesis I)
            Enhances understanding of the applied DM algorithm (Hypothesis II)
            Brings users greater satisfaction (Hypothesis III)
   Participants
       20 Graduate engineering students
            10 males, 10 females
            Age: mean 27.1; std. dev = 3.63
   Apparatus
       Dell UltraSharp 1905FP 19-inch monitor



                                                                                  29
   Procedure
       Tutorial
       Training
       Practice
       Formal Task
            20 minutes
            Datasets
                German Credit dataset (Stalog database)
                Flag dataset (UCI machine learning repository)

       Post-study
   Independent Variable
       Type of DM modeling construction
            Decision Tree Classification
                Compare IVDT with SAS Enterprise Miner (EM)-tree
                Also compare with C4.5, CHAID (Chi-square Automatic Interaction
                 Detector), and CART (Classification and Regression Tree) in hypothesis I
            Association rules modeling
                Compare IVAR with automatic modeling applied in current DM tools
                                                                                       30
   Dependent Variables (Cont.)
       Effectiveness of modeling
             Effectiveness of decision tree classification (EDT) =
                ncorrect    nleaf (Eq. 1) α=1; β=0.5
             Effectiveness of association rules modeling (EAR) =
               nI

               r   n
               i 1
                      i    NI   (Eq. 2) r=1,2,3,4, or 5; α=2

       Understanding of applied algorithms
             Algorithm Understanding Questionnaire (Dunsmore and Roper, 2000)
                 7 questions – 6 multiple-choice questions and 1 ordering question

       Satisfaction with task
             Modeling satisfaction questionnaire (Hackman and Oldham, 1975)
                9 seven-point likert-scale questions

   Statistical Design
        One-way between-subject design
        Yij     i   ij
                                                                                      31
   Results
       Hypothesis I
            Decision trees classification




                               F(1,18)=18.61; P=0.0004
                                                         32
    Results
          Hypothesis I
               Decision trees classification

                      IVDT       EM-tree        C4.5    CHAID   CART    IVDT-C4.5

    # of accurate     231.8       232.1         237      224    229       243.9
       records        (2.31)      (3.63)                                  (2.56)

# of leaf nodes         9.8        17.6          43       9      22        32.8
                      (1.40)      (6.62)                                  (4.61)

        EDT           226.9       223.3         215.5   219.5   218.0     227.5
                      (1.98)      (1.41)                                  (2.76)

% Decrease in                     1.6%          3.3%    5.1%    4.0%      -0.2%
   EDT




                                                                                   33
   Results
       Hypothesis I
            Association rules modeling




                         F(1,18)=397.57; P<0.0001
                                                    34
   Results
       Hypothesis II
            Decision trees classification




                            F(1,18)=24.59; p<0.0001
                                                      35
   Results
       Hypothesis II
            Association rules modeling




                              F(1,18)=29.14; p<0.0001
                                                        36
   Results
       Hypothesis III
            Decision tree classification

    Satisfaction Overall     Trust     Worthwhile Independent Challenge  Gained
     Measures                                       Thought             Knowledge
        IVDT         5.5       5.7         5.4        5.5        5.6        5.9
                   ( 0.53)   (0.82 )     (0.52)     (0.85)     (0.52)     (0.88)
     EM-tree         3.6       4.5         4.6        3.9        4.8        4.8
                   (1.07)    (0.85)      (1.26)     (0.88)     (1.23)     (1.23)
      P-value      <.0001     0.01          0.11    0.0006      0.06     <.0001
    % difference 52.8%       26.7%       17.4%      41.0%      16.7%     51.3%


                                Cronbach alpha =0.89




                                                                                   37
   Results
       Hypothesis III
            Association rules modeling

    Satisfaction Overall    Trust    Worthwhile Independent Challenge  Gained
     Measures                                     Thought             Knowledge
        IVAR        5.6       5.5        5.3        5.7        5.8        5.5
                  ( 0.70)   (0.85)     (0.82)     (0.82)     (1.14)     (0.62)
     Automatic       4.7      4.9        4.6        4.6        4.7        4.1
                   (1.18)   (0.78)     (1.18)     (0.84)     (0.95)     (0.67)
      P-value      0.03     0.11          0.10     0.01       0.05     <.0001
    % difference 19.2%      12.2%      20.5%      23.9%      23.4%     34.2%


                                 Cronbach alpha =0.86




                                                                                 38
                                Experiment Two
   Hypothesis
       Compared to nonintegrated visualization of derived models and data
        visualization, integrated visualization of derived models and data visualization
            Promotes better understanding of models (Hypothesis IV)
   Participants
       10 students who took part in experiment one
            5 males, 5 females
            Age: mean 26.7; std. dev 3.09
   Apparatus
       Same as experiment one




                                                                                   39
   Procedure
       Tutorial
       Training
       Practice
       Formal Task
            5 tasks (at most 5 minutes for each task)
            Datasets (same as Experiment one)
       Post-study
   Independent Variables
       Type of Visualization Interface (integrated vs. nonintegrated)
       Type of task (5 levels)
            Decision tree classification
                Find characteristics of the data corresponding to leaf nodes
                Compare characteristics of training data and validation data
            Association rules modeling
                Find characteristics of the data supporting rules



                                                                                40
   Dependent Variables
       Task completion time
       Correction rates
   Statistical Design
       Two-factorial within-subject design Yijkmn     i   j   ij   k  Pm  Qn   ijkmn
   Results
       Decision tree classification
             Integrated interface significantly reduced task completion time (p<.0001) and
              increased correction rates (p<.0001)
             The advantage of the integrated interface over the nonintegrated one was more
              significant when examining data for nodes at lower levels and when comparing
              characteristics of training and validation data
       Association rules modeling
             Integrated interface significantly reduced task completion time (p<.0001) and
              increased correction rates (p=0.05)
             The advantage of the integrated interface over the nonintegrated one was more
              significant when examining data supporting rules with larger support
                                                                                                 41
                                  Conclusion
   Compared to Automatic Modeling, IVDMM Can
       Improve effectiveness of modeling
       Enhance understanding of applied algorithms and derived models
       Bring users greater satisfaction
   IVDMM is to Complement Automatic Modeling, Instead of
    Replacing It
       Automatic modeling takes much less time
       Effectiveness of modeling process can be greatly affected by users’ perceptual
        and cognitive abilities
       Combining IVDMM and automatic modeling can be advantageous




                                                                                 42

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:8
posted:4/16/2012
language:
pages:43