DSC433/533, Fall 2008
Information Analysis for Managerial Decisions
Instructor: Dejing Dou
Office hours: Wednesdays 3:30pm-5:00pm
Oct. 28 and Oct. 30, 2008

Chapter 6: Classification and Prediction
 • What is classification? What is prediction?
    – Issues regarding classification and prediction
 • Classification by decision tree induction
 • Bayesian classification
 • Classification by neural networks
 • Classification based on concepts from association rule mining
 • Other classification methods
 • Prediction
 • Classification accuracy
          Classification vs. Prediction
   Classification:
    – classifies data (constructs a model) based on the training set
      and the values (class labels) in a classifying attribute and uses
      it in classifying new data
    – predicts categorical class labels (discrete or nominal)
   Prediction:
    – models continuous-valued functions, i.e., predicts unknown or
        missing values (e.g. linear, multiple and nonlinear regressions)
   Typical Applications
    –   credit approval
    –   target marketing
    –   medical diagnosis
    –   treatment effectiveness analysis
    Classification—A Two-Step Process
   Model construction: describing a set of predetermined classes
    – Each tuple/sample is assumed to belong to a predefined class, as determined
      by the class label attribute
     – The set of tuples used for model construction is the training set
    – Supervised learning vs. unsupervised learning (clustering)
    – The model is represented as classification rules, decision trees, Bayesian
      network, neural network, or mathematical formulae
   Model usage: for classifying future or unknown objects
     – Estimate accuracy of the model
           The known label of each test sample is compared with the model's
            classification result
           The accuracy rate is the percentage of test set samples that are
            correctly classified by the model
           The test set is independent of the training set; otherwise
            over-fitting will occur

    – If the accuracy is acceptable, use the model to classify (predict) data tuples
      whose class labels are not known
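The accuracy check in the model-usage step can be sketched in a few lines of Python (a minimal sketch; the rule model and the tiny test set below are hypothetical, echoing the tenure example on the next slide):

```python
def accuracy(model, test_set):
    """Fraction of test samples whose known label matches the model's output."""
    correct = sum(1 for x, label in test_set if model(x) == label)
    return correct / len(test_set)

# Hypothetical rule model from the construction step:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
model = lambda x: "yes" if x["rank"] == "professor" or x["years"] > 6 else "no"

# Hypothetical held-out test samples (independent of the training set)
test = [({"rank": "professor", "years": 5}, "yes"),
        ({"rank": "assistant prof", "years": 3}, "no"),
        ({"rank": "assistant prof", "years": 7}, "yes")]

print(accuracy(model, test))  # 1.0
```

If the accuracy on such held-out data is acceptable, the model is then used on tuples whose labels are unknown.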
                 Model Construction

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no          Classifier (Model):
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes         IF rank = 'professor'
Jim    Associate Prof   7       yes         OR years > 6
Dave   Assistant Prof   6       no          THEN tenured = 'yes'
Anne   Associate Prof   3       no
           Use the Model in Prediction

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
          Supervised vs. Unsupervised
   Supervised learning (classification)
     – Supervision: The training data (observations, measurements,
       etc.) are accompanied by labels indicating the class of the
       observations
     – New data is classified based on the training set

   Unsupervised learning (clustering)
     – The class labels of training data are unknown
    – Given a set of measurements, observations, etc. with the aim of
      establishing the existence of classes or clusters in the data

Issues Regarding Classification and
   Prediction : Data Preparation
   Data cleaning
     – Preprocess data in order to reduce noise and handle missing
       values
   Relevance analysis (feature selection)
    – Remove the irrelevant or redundant attributes
   Data transformation
    – Generalize and/or normalize data

         Evaluating Classification Methods
   Predictive accuracy
   Speed
    – time to construct the model
    – time to use the model
   Robustness
    – handling noise and missing values
   Scalability
    – efficiency in disk-resident databases
   Interpretability:
    – understanding and insight provided by the model
   Other goodness measures besides accuracy
    – decision tree size
    – compactness of classification rules
                  Training Dataset

This follows an example from Quinlan's ID3 (Iterative Dichotomiser):

    age     income   student   credit_rating   buys_computer
    <=30    high     no        fair            no
    <=30    high     no        excellent       no
    31…40   high     no        fair            yes
    >40     medium   no        fair            yes
    >40     low      yes       fair            yes
    >40     low      yes       excellent       no
    31…40   low      yes       excellent       yes
    <=30    medium   no        fair            no
    <=30    low      yes       fair            yes
    >40     medium   yes       fair            yes
    <=30    medium   yes       excellent       yes
    31…40   medium   no        excellent       yes
    31…40   high     yes       fair            yes
    >40     medium   no        excellent       no

     Output: A Decision Tree for "buys_computer"

                      age?
           /           |            \
       <=30         31..40          >40
         |             |              |
     student?         yes      credit_rating?
      /     \                   /         \
     no     yes            excellent     fair
     |       |                 |           |
     no     yes                no         yes
    Algorithm for Decision Tree Induction
   Basic algorithm (a greedy algorithm; a version of ID3)
     – Tree is constructed in a top-down, recursive, divide-and-conquer manner
     – At start, all the training examples are at the root
     – Attributes are categorical (discrete-valued); continuous-valued
       attributes are discretized in advance
     – Examples are partitioned recursively based on selected attributes
     – Test attributes are selected on the basis of a heuristic or statistical
       measure (e.g., information gain)
   Conditions for stopping partitioning (any of them)
     – All samples for a given node belong to the same class
     – There are no remaining attributes for further partitioning –
       majority voting is employed for classifying the leaf
     – There are no samples left
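The greedy top-down procedure above can be sketched in Python (a minimal ID3-style sketch, not any particular system's exact implementation; attributes are referenced by column index, the class label is the last element of each row, and the third stopping condition is handled implicitly since only observed attribute values recurse):

```python
from collections import Counter
from math import log2

def entropy(rows):
    """I(s1,...,sm) over the class-label column (last element of each row)."""
    n = len(rows)
    return -sum(c / n * log2(c / n)
                for c in Counter(r[-1] for r in rows).values())

def gain(rows, col):
    """Information gained by branching on the attribute in column `col`."""
    parts = {}
    for r in rows:
        parts.setdefault(r[col], []).append(r)
    return entropy(rows) - sum(len(p) / len(rows) * entropy(p)
                               for p in parts.values())

def id3(rows, cols):
    """Top-down recursive divide-and-conquer tree construction."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:      # stop: all samples belong to one class
        return labels[0]
    if not cols:                   # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(cols, key=lambda c: gain(rows, c))   # greedy selection step
    parts = {}
    for r in rows:
        parts.setdefault(r[best], []).append(r)
    return {(best, v): id3(p, [c for c in cols if c != best])
            for v, p in parts.items()}
```

Calling `id3(rows, [0, 1, ...])` on labeled tuples returns a nested dict keyed by (column, value) pairs, with class labels at the leaves.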
        Attribute Selection Measure:
        Information Gain (ID3/C4.5)
   Select the attribute with the highest information gain
   Let S contain si tuples of class Ci, for i = 1, …, m. The expected
    information needed to classify an arbitrary tuple into one of the
    m classes is

        I(s1, s2, …, sm) = - Σ_{i=1..m} (si/s) log2(si/s)

   The entropy of attribute A with values {a1, a2, …, av}, which is the
    information still required to partition S into subsets {S1, S2, …, Sv}
    (where Sj contains sij samples of class Ci), is

        E(A) = Σ_{j=1..v} ((s1j + … + smj)/s) · I(s1j, …, smj)

   The information gained by branching on attribute A is

        Gain(A) = I(s1, s2, …, sm) - E(A)

    The smaller the entropy E(A) (the information still required), the
    greater the purity of the partitions, and the larger the gain.
               Attribute Selection by
          Information Gain Computation
   Class P: buys_computer = "yes" (9 samples)
   Class N: buys_computer = "no" (5 samples)

        I(p, n) = I(9, 5) = 0.940

   Compute the entropy for age:

        age      pi   ni   I(pi, ni)
        <=30     2    3    0.971
        30…40    4    0    0
        >40      3    2    0.971

        E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

    Here (5/14) I(2,3) means "age <=30" has 5 out of 14 samples, with 2
    yes'es and 3 no's:

        I(2,3) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971

   Hence

        Gain(age) = I(p, n) - E(age) = 0.246

    Similarly,

        Gain(income) = 0.029
        Gain(student) = 0.151
        Gain(credit_rating) = 0.048

    so age, with the highest gain, is selected as the root split.
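The computation above can be checked mechanically. A Python sketch over the 14-sample training set from the slides (the gains it prints match the slide's figures up to rounding):

```python
from math import log2

# The 14-sample training set from the slides:
# (age, income, student, credit_rating, buys_computer)
DATA = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(rows):
    """I(s1,...,sm): expected information to classify a tuple."""
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def gain(rows, attr):
    """Gain(A) = I(s1,...,sm) - E(A), with E(A) the weighted partition info."""
    idx = ATTRS[attr]
    parts = {}
    for r in rows:
        parts.setdefault(r[idx], []).append(r)
    e_a = sum(len(p) / len(rows) * info(p) for p in parts.values())
    return info(rows) - e_a

for a in ATTRS:
    print(a, round(gain(DATA, a), 3))
# age ~0.247, income ~0.029, student ~0.152, credit_rating ~0.048
```

As on the slide, age has the highest gain and becomes the root split.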
            Other Attribute Selection Measures
   Gain ratio (C4.5), Gini index (CART)
     – All attributes can be continuous-valued
     – Assume there exist several possible split values for each attribute
     – May need other tools, such as clustering, to get the possible split
       values
     – Can be modified for categorical attributes
   χ2, MDL (Minimum Description Length), etc.

     Output: A Decision Tree for "buys_computer"

                      age?
           /           |            \
       <=30         31..40          >40
         |             |              |
     student?         yes      credit_rating?
      /     \                   /         \
     no     yes            excellent     fair
     |       |                 |           |
     no     yes                no         yes
       Attribute Selection by
  Information Gain Computation (the age <=30 branch)

    age    income   student   credit_rating   buys_computer
    <=30   high     no        fair            no
    <=30   high     no        excellent       no
    <=30   medium   no        fair            no
    <=30   low      yes       fair            yes
    <=30   medium   yes       excellent       yes

    I(2,3) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971

    E(student) = (2/5) I(2,0) + (3/5) I(0,3) = 0
    E(income) = (2/5) I(0,2) + (2/5) I(1,1) + (1/5) I(1,0) = 0.4
    E(credit_rating) = (3/5) I(1,2) + (2/5) I(1,1) = 0.951

    Gain(student) = 0.971
    Gain(income) = 0.571
    Gain(credit_rating) = 0.020

    so student is selected to split this branch.
                      Tree Pruning
   Overfitting: An induced tree may overfit the training data
     – Too many branches, some of which may reflect anomalies due to
       noise or outliers
     – Poor accuracy for unseen samples
   Two approaches to avoid overfitting
     – Prepruning: Halt tree construction early—do not split a node if
       this would result in the goodness measure (e.g., information gain,
       χ2) falling below a threshold. The node becomes a leaf.
          Difficult to choose an appropriate threshold
     – Postpruning: Remove branches from a "fully grown" tree—get a
       sequence of progressively pruned trees
          Use a set of data different from the training data to decide
           which is the "best pruned tree"
             Approaches to Determine the
                   Final Tree Size
   Separate training (2/3) and testing (1/3) sets
   Use all the data for training
     – but apply a statistical test (e.g., χ2) to estimate whether
       expanding or pruning a node may improve the entire tree
   Use the minimum description length (MDL) principle
     – MDL takes the "best" decision tree to be the one that requires
       the fewest bits to encode
     – halt growth of the tree when the encoding is minimized

     Extracting Classification Rules from Trees
   Represent the knowledge in the form of IF-THEN rules
     – One rule is created for each path from the root to a leaf
     – Each attribute-value pair along a path forms a conjunction
     – The leaf node holds the class prediction

   Rules are easier for humans to understand
   Example
     IF age = "<=30" AND student = "no" THEN buys_computer = "no"
     IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
     IF age = "31…40"                    THEN buys_computer = "yes"
     IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
     IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
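The path-to-rule translation described above can be sketched in Python. The nested-dict tree encoding, keyed by (attribute, value) pairs, is an assumption for illustration; walking it reproduces the five rules on this slide:

```python
def extract_rules(tree, conds=()):
    """One IF-THEN rule per root-to-leaf path; each attribute-value
    pair along the path joins the antecedent as a conjunct."""
    if not isinstance(tree, dict):          # leaf: the class prediction
        ante = " AND ".join(f'{a} = "{v}"' for a, v in conds)
        return [f'IF {ante} THEN buys_computer = "{tree}"']
    rules = []
    for (attr, value), subtree in tree.items():
        rules.extend(extract_rules(subtree, conds + ((attr, value),)))
    return rules

# The buys_computer tree from the earlier slide
tree = {("age", "<=30"): {("student", "no"): "no",
                          ("student", "yes"): "yes"},
        ("age", "31..40"): "yes",
        ("age", ">40"): {("credit_rating", "excellent"): "no",
                         ("credit_rating", "fair"): "yes"}}

for r in extract_rules(tree):
    print(r)
```

Each printed line corresponds to one of the example rules above.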

             Enhancements to basic
             decision tree induction
   Allow for continuous-valued attributes (ID3 -> C4.5)
     – Dynamically define new discrete-valued attributes that
       partition the continuous attribute values into a discrete set of
       intervals
   Handle missing attribute values
     – Assign the most common value of the attribute
     – Assign a probability to each of the possible values
   Attribute construction
     – Create new attributes based on existing ones that are sparsely
       represented – grouping of categorical attribute values
     – This reduces fragmentation, repetition, and replication
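The C4.5-style handling of a continuous-valued attribute (finding the best binary split threshold) can be sketched as follows. A minimal sketch: candidate thresholds are the midpoints between adjacent sorted values, each scored by information gain; the sample ages and labels are hypothetical:

```python
from math import log2

def entropy(labels):
    """Information needed to classify a sample, from label counts."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Try the midpoint between each adjacent pair of distinct sorted
    values as a binary split; return (threshold, gain) maximizing gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                       # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        g = base - (len(left) / len(pairs) * entropy(left)
                    + len(right) / len(pairs) * entropy(right))
        if g > best[1]:
            best = (t, g)
    return best

# Hypothetical ages with buys_computer labels
t, g = best_threshold([25, 28, 35, 42, 45], ["no", "no", "yes", "yes", "no"])
print(t, round(g, 3))  # 31.5 0.42
```

The chosen threshold defines a new discrete-valued attribute (value <= t vs. value > t), which then participates in induction like any categorical attribute.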
       Classification in Large Databases
   Classification—a classical problem extensively studied by
    statisticians and machine learning researchers
   Scalability: Classifying data sets with millions of examples
    and hundreds of attributes with reasonable speed
   Why decision tree induction in data mining?
    – relatively faster learning speed (than other classification methods)
    – convertible to simple and easy to understand classification rules
    – can use SQL queries for accessing databases
    – comparable classification accuracy with other methods

     Scalable Decision Tree Induction
     Methods in Data Mining Studies
   SLIQ (EDBT‘96 — Mehta et al.)
    – builds an index for each attribute. Only class list and the current
      attribute list reside in memory.
   SPRINT (VLDB‘96 — J. Shafer et al.)
    – constructs an attribute list data structure
   PUBLIC (VLDB‘98 — Rastogi & Shim)
    – integrates tree splitting and tree pruning: stops growing a
      branch that would be pruned anyway
   RainForest (VLDB‘98 — Gehrke, Ramakrishnan & Ganti)
    – separates the scalability aspects from the criteria that determine the
      quality of the tree
    – builds an AVC-list (attribute, value, class label)
                Data Cube-Based
              Decision-Tree Induction
   Integration of generalization with decision-tree induction
    (RIDE‘97, Kamber et al.)
   Attribute-oriented induction uses concept hierarchies
   Classification at different concept levels.
    – Low-level concepts (e.g., precise temperature, humidity, outlook)
      can result in quite large and bushy classification-trees
     – High-level concepts can result in a useless decision tree
     – Some intermediate level set by a domain expert or threshold
   Cube-based multi-level classification
    – Relevance analysis at multi-levels.
    – Information-gain analysis with dimension + level.
 Let's Practice: Extracting Rules
    from the Tree with Weka

                      age?
           /           |            \
       <=30         31..40          >40
         |             |              |
     student?         yes      credit_rating?
      /     \                   /         \
     no     yes            excellent     fair
     |       |                 |           |
     no     yes                no         yes
             One Path, One Rule
IF age = "<=30" AND student = "no" THEN buys_computer = "no"

IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"

IF age = "31…40"    THEN buys_computer = "yes"

IF age = ">40" AND credit_rating = "excellent" THEN
  buys_computer = "no"

IF age = ">40" AND credit_rating = "fair" THEN buys_computer =
  "yes"
               Examples for C4.5
 Try class1.xls in Weka using J48 (Weka's
  implementation of C4.5)

 Part is the class label

 How about Charles Club, using Florence as
  the class label?
Presentation of Classification Results

Visualization of a Decision Tree in
         SGI/MineSet 3.0

           Decision Tree In Practice
   To explore a large dataset to pick out useful
    variables

   To predict future states of important variables in an
    industrial process

   To form directed clusters of customers for a
    recommendation system
            Decision Tree as a Data
               Exploration Tool
   The attribute selection process helps pick out the
    variables that are likely to be important for
    predicting targets.

   I.e., the attributes that appear in the classification rules
    are likely the relevant ones.
                Boston Globe case
   Goal: estimate a town‘s expected home delivery
    circulation level based on various demographic and
    geographic characteristics

   Problem: Find a handful of attributes for regression.

                Boston Globe case
   The number of subscribing households in a given
    city or town may not make a good target because
    towns and cities vary in size

   A better target (class label)?

               Boston Globe case
   Penetration: the proportion of households that
    subscribe to the paper.

   Find factors among the hundreds in the town signature that
    separate towns with high penetration (top one
    third -> good) from low penetration (bottom one
    third -> bad)
                Boston Globe case
   The rules look like:

    If median home value <= $226K
     | sub-to-pop child ratio >= 0.61: good (97%)
     | sub-to-pop child ratio < 0.61 (50% vs. 50%)
      | % population age 18-24 < 0.09: good (100%)
      | % population age 18-24 >= 0.09: bad (100%)
    If median home value > $226K: good (99%)

What does the above mean?
                 Boston Globe case
   Median home value gives the best first split, making it the most
    important factor
     – towns under $226K are poorer prospects

   Next: a family of derived variables comparing the
    subscriber base in the town to the town population as a
    whole (e.g., child ratio, % of population age)

   Others: mean years of school completed, percentage of the
    population in blue collar occupations, percentage of high-status
    occupations
            Applying Decision Trees for Prediction
   Predicting the future is one of the most important
    applications of data mining.
   Analyzing trends in historical data in order to predict
    future behavior.
    – A major bank studied customer data in order to spot early
      warning signs of attrition in its checking accounts: ATM
      withdrawals, payroll direct deposits, balance inquiries, visits to
      the teller, …
    – A manufacturer of diesel engines tried to forecast diesel engine
      sales based on historical truck registration data.
   Major difference from a statistical study of cycles: multiple
    attributes rather than one attribute
               Nestle Coffee Roasters:
                 Process Control
   Roaster variables: temperature of air at various exhaust
    points, the speed of various fans, the rate at which gas is
    burned, the amount of water introduced to quench the
    beans, and the positions of various flaps and valves.

   A lot of ways for things to go wrong: beans too light in color, a
    costly and damaging roaster fire.

   Goals of the simulator: help operators keep the roaster running
    properly. Data from 60 sensors, every 30 seconds.
                Nestle Coffee Roasters:
                  Process Control
   Using the simulator to try out new recipes, a large number of new
    recipes could be evaluated without interrupting production

   The simulator could be used to train new operators and expose them
    to routine problems and their solutions. Operators could try out
    different approaches to resolving a problem.

   The simulator could track the operation of the actual roaster and
    project several minutes into the future to avoid problems.

                Nestle Coffee Roasters:
                  Process Control
   The simulation was built using a training set of 34,000 cases and
    evaluated on another 40,000 cases.

   For each case, the simulator generated projected snapshots 60 steps
    into the future.

   The size of the error increases with time. For example, the error rate
    for product temperature turned out to be 2/3 degree per minute of
    projection, but even 30 minutes into the future the simulator does
    considerably better than random guessing.

