Lecture 7: Classification
ITCS 6163

Chapter 7. Classification and Prediction
•   What is classification? What is prediction?
•   Classification by decision tree induction
•   Bayesian Classification
•   Other Classification Methods (SVM)
•   Classification accuracy
•   Prediction
•   Summary
         Classification problem
• Given:
  – Tuples, each assigned a class label.
• Develop a model for each class
   – Example:
   – Good creditor : (age in [25,40]) AND (income > 50K)
                      AND (status = MARRIED)
• Applications:
   – Credit approval (good, bad)
   – Store locations (good, fair, poor)
   – Emergency situations (emergency, non-emergency)
            Classification vs. Prediction
• Classification:
   – predicts categorical class labels
   – classifies data (constructs a model) based on the training set
     and the values (class labels) in a classifying attribute and
     uses it in classifying new data
• Prediction:
   – models continuous-valued functions, i.e., predicts unknown
     or missing values
• Typical Applications
   –   credit approval
   –   target marketing
   –   medical diagnosis
   –   treatment effectiveness analysis
             Classification—A Two-Step
                        Process
• Model construction: describing a set of predetermined classes
   – Each tuple/sample is assumed to belong to a predefined class, as
     determined by the class label attribute
   – The set of tuples used for model construction: training set
   – The model is represented as classification rules, decision trees, or
     mathematical formulae
• Model usage: for classifying future or unknown objects
   – Estimate accuracy of the model
      • The known label of test sample is compared with the classified
         result from the model
      • Accuracy rate is the percentage of test set samples that are
         correctly classified by the model
      • Test set is independent of training set, otherwise over-fitting will
         occur
         Supervised vs. Unsupervised
                  Learning
• Supervised learning (classification)
   – Supervision: The training data (observations,
     measurements, etc.) are accompanied by labels indicating
     the class of the observations
   – New data is classified based on the training set
• Unsupervised learning (clustering)
   – The class labels of the training data are unknown
   – Given a set of measurements, observations, etc. with the
     aim of establishing the existence of classes or clusters in
     the data
        Chapter 7. Classification and
                 Prediction
•   What is classification? What is prediction?
•   Classification by decision tree induction
•   Bayesian Classification
•   Other Classification Methods
•   Classification accuracy
•   Prediction
•   Summary
          Classification by Decision Tree
                      Induction
• Decision tree
   –   A flow-chart-like tree structure
   –   Internal node denotes a test on an attribute
   –   Branch represents an outcome of the test
   –   Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
   – Tree construction
      • At start, all the training examples are at the root
      • Partition examples recursively based on selected attributes
   – Tree pruning
      • Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
   – Test the attribute values of the sample against the decision tree
             Training Dataset

This follows an example from Quinlan's ID3.

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
        Output: A Decision Tree for "buys_computer"

                         age?
            /             |             \
         <=30           31..40           >40
           |              |               |
       student?          yes       credit_rating?
        /     \                      /         \
       no     yes               excellent     fair
        |      |                     |          |
       no     yes                   no         yes
       Algorithm for Decision Tree
                Induction
• Basic algorithm (a greedy algorithm)
   – Tree is constructed in a top-down recursive divide-and-conquer manner
   – At start, all the training examples are at the root
   – Attributes are categorical (if continuous-valued, they are discretized in
     advance)
   – Examples are partitioned recursively based on selected attributes
   – Test attributes are selected on the basis of a heuristic or statistical
     measure (e.g., information gain)
• Conditions for stopping partitioning
   – All samples for a given node belong to the same class
   – There are no remaining attributes for further partitioning – majority
     voting is employed for classifying the leaf
   – There are no samples left
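
As an illustration (not part of the original slides), a minimal Python sketch of this greedy top-down procedure for categorical attributes; the names build_tree and entropy are made up for the example:

from collections import Counter
import math

def entropy(labels):
    # info(S) = - sum_i p_i * log2(p_i) over the class distribution of S
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    # rows: list of dicts {attribute: value}; labels: parallel list of class labels
    if len(set(labels)) == 1:              # all samples in the same class -> leaf
        return labels[0]
    if not attributes:                     # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]

    def remainder(a):                      # expected information after splitting on a
        total = 0.0
        for v in set(r[a] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[a] == v]
            total += len(subset) / len(rows) * entropy(subset)
        return total

    best = min(attributes, key=remainder)  # smallest remainder = largest information gain
    tree = {best: {}}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        tree[best][v] = build_tree(sub_rows, sub_labels,
                                   [a for a in attributes if a != best])
    return tree

# e.g. build_tree(rows, labels, ["age", "income", "student", "credit_rating"]) on the
# training dataset of the next slide builds a tree like the buys_computer tree shown there.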
              Decision trees

Training set:

Salary   Education   Class
10000    HS          R
40000    C           A
15000    C           R
75000    G           A
18000    G           A

Resulting tree: the root tests "Salary < 20000". If no (N), the class is A.
If yes (Y), test "Education = G": if yes, the class is A; if no, the class is R.
               Decision trees
• Pros:
  – Fast to build and to apply.
  – The resulting rules are easy to interpret.
  – Handle high-dimensional data.
• Cons:
  – Cannot model correlations between attributes.
  – Only axis-parallel cuts (decision boundaries).
         Decision trees (cont.)
• Machine learning:
  – ID3 (Quinlan, 1986)
  – C4.5 (Quinlan, 1993)
  – CART (Breiman, Friedman, Olshen, Stone,
    Classification and Regression Trees, 1984)
• Database:
  – SLIQ (Mehta, Agrawal and Rissanen, EDBT'96)
  – SPRINT (Shafer, Agrawal, Mehta, VLDB'96)
  – RainForest (Gehrke, Ramakrishnan, Ganti, VLDB'98)
               Decision trees
• Finding the best tree is NP-Hard
• We look at non-backtracking algorithms (never
   look back at a previous decision)
• Assume we have a test with n outcomes that
   partitions T into subsets T1, T2,…, Tn
  If the test is to be evaluated without exploring
   subsequent dimensions of the Ti’s, the only
   information available for guidance is the
   distribution of classes in T and its subsets.
      Decision tree algorithms
• Building phase:
  – Recursively split nodes using best splitting
    attribute and value for node
• Pruning phase:
  – Smaller (yet imperfect) tree achieves better
    prediction accuracy.
  – Prune leaf nodes recursively to avoid over-
    fitting.
   Predictor variables (attributes)
• Numerically ordered: values are ordered and can be represented on the
  real line. (E.g., salary.)
• Categorical: takes values from a finite set with no natural ordering.
  (E.g., color.)
• Ordinal: takes values from a finite set whose values possess a clear
  ordering, but the distances between them are unknown. (E.g., preference
  scale: good, fair, bad.)
                Binary Splits
Recursive (binary) partitioning:
  – Univariate split on a numerically ordered or ordinal X:   X <= c
  – Split on a categorical X:   X ∈ A
  – Linear combination split on numerical attributes:   Σ ai Xi <= c
  c and A are chosen to maximize separation.
             Some probability...
S = set of cases
freq(Ci, S) = number of cases in S that belong to class Ci
Gain is an entropy-based measure.
Prob("this case belongs to Ci") = freq(Ci, S) / |S|
Information conveyed: -log2(freq(Ci, S) / |S|)
Entropy = expected information:
  info(S) = - Σi (freq(Ci, S) / |S|) · log2(freq(Ci, S) / |S|)
                       Gain
For a test X that partitions T into subsets T1, …, Tn:

  infoX(T) = Σi (|Ti| / |T|) · info(Ti)

  gain(X) = info(T) - infoX(T)
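
A small Python sketch of these two measures (illustrative code, assuming class labels are given as plain lists):

import math
from collections import Counter

def info(labels):
    # info(S) = - sum_i (freq(Ci,S)/|S|) * log2(freq(Ci,S)/|S|)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_x(partitions):
    # info_X(T) = sum_i (|Ti|/|T|) * info(Ti); partitions = list of label lists
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * info(p) for p in partitions)

def gain(labels, partitions):
    # gain(X) = info(T) - info_X(T)
    return info(labels) - info_x(partitions)

# 9 "Play" and 5 "Don't" cases give info(T) = 0.94 bits, as in the next example
print(round(info(["Play"] * 9 + ["Don't"] * 5), 2))   # 0.94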
                                    Example

Outlook    Temp   Humidity   Windy   Class
sunny      75     70         Y       Play
sunny      80     90         Y       Don't
sunny      85     85         N       Don't
sunny      72     95         N       Don't
sunny      69     70         N       Play
overcast   72     90         Y       Play
overcast   83     78         N       Play
overcast   64     65         Y       Play
overcast   81     75         N       Play
rain       71     80         Y       Don't
rain       65     70         Y       Don't
rain       75     80         Y       Play
rain       68     80         N       Play
rain       70     96         N       Play

info(T) (9 Play, 5 Don't):
  info(T) = -9/14 log(9/14) - 5/14 log(5/14) = 0.94 (bits)

Test Outlook:
  infoOutlook = 5/14 (-2/5 log(2/5) - 3/5 log(3/5))
              + 4/14 (-4/4 log(4/4))
              + 5/14 (-3/5 log(3/5) - 2/5 log(2/5)) = 0.69 (bits)
  gainOutlook = 0.94 - 0.69 = 0.25

Test Windy:
  infoWindy = 7/14 (-4/7 log(4/7) - 3/7 log(3/7))
            + 7/14 (-5/7 log(5/7) - 2/7 log(2/7)) = 0.92 (bits)
  gainWindy = 0.94 - 0.92 = 0.02

Outlook is the better test.
               Problem with Gain

Strong bias towards tests with many outcomes.
Example: Z = Name, so each Ti holds exactly one case (each name is unique)
and info(Ti) = 0:
  infoZ(T) = Σi (|Ti| / |T|) · info(Ti) = Σi (1/|T|) · 0 = 0

Maximal gain! (but a useless division: overfitting)
                              Split

split-info(X) = - Σi (|Ti| / |T|) · log2(|Ti| / |T|)

gain-ratio(X) = gain(X) / split-info(X)

gain(X) <= log(k)        (k = number of classes)
split-info(X) <= log(n)  (n = number of outcomes of the test)
For a test with many outcomes, split-info is large, so the ratio becomes small.
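
And a corresponding sketch of split-info and the gain ratio (again illustrative names, not from the slides):

import math

def split_info(partition_sizes):
    # split-info(X) = - sum_i (|Ti|/|T|) * log2(|Ti|/|T|)
    total = sum(partition_sizes)
    return -sum(s / total * math.log2(s / total) for s in partition_sizes if s)

def gain_ratio(gain_x, partition_sizes):
    # gain-ratio(X) = gain(X) / split-info(X)
    return gain_x / split_info(partition_sizes)

# A test like Z = Name with 14 singleton outcomes has a large split-info,
# so even a maximal gain produces a small ratio:
print(round(split_info([1] * 14), 2))   # log2(14) = 3.81 bits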
       Extracting Classification Rules from
                      Trees
•   Represent the knowledge in the form of IF-THEN rules
•   One rule is created for each path from the root to a leaf
•   Each attribute-value pair along a path forms a conjunction
•   The leaf node holds the class prediction
•   Rules are easier for humans to understand
•   Example
    IF age = “<=30” AND student = “no” THEN buys_computer = “no”
    IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
    IF age = “31…40”                          THEN buys_computer = “yes”
    IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
    IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
            OVERFITTING
• Decision trees can grow until there is a leaf for each training example.
• Extremes:
  – Overfitted: “Whatever I haven’t seen can’t be
    classified”
  – Too General: “If it is green, it is a tree”
               Avoid Overfitting in
                 Classification
• The generated tree may overfit the training data
   – Too many branches, some may reflect anomalies due to
     noise or outliers
    – The result is poor accuracy on unseen samples
• Two approaches to avoid overfitting
   – Prepruning: Halt tree construction early—do not split a
     node if this would result in the goodness measure falling
     below a threshold
       • Difficult to choose an appropriate threshold
   – Postpruning: Remove branches from a “fully grown”
     tree—get a sequence of progressively pruned trees
       • Use a set of data different from the training data to
         decide which is the “best pruned tree”
            Approaches to Determine the
                  Final Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross validation, e.g., 10-fold cross validation
• Use all the data for training
   – but apply a statistical test (e.g., chi-square) to estimate
     whether expanding or pruning a node may improve the
     entire distribution
• Use minimum description length (MDL) principle:
   – halting growth of the tree when the encoding is minimized
      Enhancements to basic decision
             tree induction
• Allow for continuous-valued attributes
   – Dynamically define new discrete-valued attributes that
     partition the continuous attribute value into a discrete set of
     intervals
• Handle missing attribute values
   – Assign the most common value of the attribute
   – Assign probability to each of the possible values
• Attribute construction
   – Create new attributes based on existing ones that are
     sparsely represented
   – This reduces fragmentation, repetition, and replication
         Classification in Large Databases
• Classification—a classical problem extensively studied by statisticians and
  machine learning researchers
• Scalability: Classifying data sets with millions of examples and hundreds of
  attributes with reasonable speed
• Why decision tree induction in data mining?
    –   relatively faster learning speed (than other classification methods)
    –   convertible to simple and easy to understand classification rules
    –   can use SQL queries for accessing databases
    –   comparable classification accuracy with other methods
   Scalable Decision Tree Induction
   Methods in Data Mining Studies
• SLIQ (EDBT’96 — Mehta et al.)
   – builds an index for each attribute and only class list and the
     current attribute list reside in memory
• SPRINT (VLDB’96 — J. Shafer et al.)
   – constructs an attribute list data structure
• PUBLIC (VLDB’98 — Rastogi & Shim)
   – integrates tree splitting and tree pruning: stop growing the tree
     earlier
• RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
   – separates the scalability aspects from the criteria that
     determine the quality of the tree
   – builds an AVC-list (attribute, value, class label)
                               SPRINT
For large data sets.

Age   Car Type   Risk
23    Family     H
17    Sports     H
43    Sports     H
68    Family     L
32    Truck      L
20    Family     H

Resulting tree: the root tests "Age < 25". If yes, the class is H.
If no, test "Car = Sports": if yes, the class is H; if no, the class is L.
   Gini Index (IBM IntelligentMiner)
• If a data set T contains examples from n classes, the gini index gini(T)
  is defined as
     gini(T) = 1 - Σj=1..n pj²
  where pj is the relative frequency of class j in T.
• If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively,
  the gini index of the split data is defined as
     ginisplit(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2)
• The attribute that provides the smallest ginisplit(T) is chosen to split the
  node (need to enumerate all possible splitting points for each
  attribute).
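
A minimal Python sketch of both formulas (illustrative names; the example reproduces the 0.333 value of the Car Type in {Sports} split used in the SPRINT example later):

def gini(labels):
    # gini(T) = 1 - sum_j p_j^2
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    # gini_split(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2)
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Risk labels from the SPRINT data: Car Type in {Sports} vs. the rest
print(round(gini_split(["H", "H"], ["H", "H", "L", "L"]), 3))   # 0.333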
                      SPRINT
Partition(S)
       if all points of S are in the same class then
              return
       for each attribute A do
              evaluate splits on A
       use the best split found to partition S into S1 and S2
       Partition(S1)
       Partition(S2)
         SPRINT Data Structures

Training set:

Tuple   Age   Car Type   Risk
0       23    Family     H
1       17    Sports     H
2       43    Sports     H
3       68    Family     L
4       32    Truck      L
5       20    Family     H

Attribute lists (one per attribute; lists on numeric attributes are kept sorted):

Age   Risk   Tuple          Car Type   Risk   Tuple
17    H      1              Family     H      0
20    H      5              Sports     H      1
23    H      0              Sports     H      2
32    L      4              Family     L      3
43    H      2              Truck      L      4
68    L      3              Family     H      5
       Splits

The attribute lists are split according to the candidate split Age < 27.5:

Group 1 (Age < 27.5):
Age   Risk   Tuple          Car Type   Risk   Tuple
17    H      1              Family     H      0
20    H      5              Sports     H      1
23    H      0              Family     H      5

Group 2 (Age >= 27.5):
Age   Risk   Tuple          Car Type   Risk   Tuple
32    L      4              Sports     H      2
43    H      2              Family     L      3
68    L      3              Truck      L      4
                Histograms
For continuous attributes, two class histograms are associated with each node:
  – Cbelow: class distribution of the tuples already processed
    (below the current split point)
  – Cabove: class distribution of the tuples still to be processed
                               Example

Age attribute list (sorted):
Age   Risk   Tuple
17    H      1
20    H      5
23    H      0
32    L      4
43    H      2
68    L      3

Scanning the list updates the histograms; after processing the first three
tuples, for example, Cbelow = (3 H, 0 L) and Cabove = (1 H, 2 L).

Each candidate split position i (i tuples below, 6-i above) is evaluated as
  ginisplit_i = (i/6) · gini(S1) + ((6-i)/6) · gini(S2)

  ginisplit0 = 0.444        ginisplit4 = 0.417
  ginisplit1 = 0.400        ginisplit5 = 0.267
  ginisplit2 = 0.333        ginisplit6 = 0.444
  ginisplit3 = 0.222

The best split for Age is at position 3, i.e. Age < 27.5.
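
A sketch of this one-pass evaluation over a sorted attribute list, maintaining the Cbelow/Cabove histograms (illustrative code; candidate thresholds are taken as midpoints between consecutive values):

from collections import Counter

def candidate_splits(sorted_values, labels):
    # Scan the pre-sorted list once; C_below holds the classes already processed,
    # C_above the classes still to be processed.
    n = len(labels)
    c_below, c_above = Counter(), Counter(labels)

    def gini(hist, size):
        return 1.0 - sum((c / size) ** 2 for c in hist.values()) if size else 0.0

    results = []
    for i in range(1, n):
        c_below[labels[i - 1]] += 1
        c_above[labels[i - 1]] -= 1
        g = i / n * gini(c_below, i) + (n - i) / n * gini(c_above, n - i)
        threshold = (sorted_values[i - 1] + sorted_values[i]) / 2   # midpoint
        results.append((threshold, round(g, 3)))
    return results

ages  = [17, 20, 23, 32, 43, 68]
risks = ["H", "H", "H", "L", "H", "L"]
print(candidate_splits(ages, risks))
# [(18.5, 0.4), (21.5, 0.333), (27.5, 0.222), (37.5, 0.417), (55.5, 0.267)]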
   Splitting categorical attributes
A single scan through the attribute list collects counts in a count matrix,
one cell per combination of attribute value and class label.
                                     Example

Car Type attribute list:
Car Type   Risk   Tuple
Family     H      0
Sports     H      1
Sports     H      2
Family     L      3
Truck      L      4
Family     H      5

Count matrix:
           H   L
Family     2   1
Sports     2   0
Truck      0   1

ginisplit(Family) = 3/6 gini(S1) + 3/6 gini(S2), where
  gini(S1) = 1 - [(2/3)² + (1/3)²] = 4/9 and gini(S2) = 1 - [(2/3)² + (1/3)²] = 4/9,
  so ginisplit(Family) = 0.444
ginisplit(Sports) = 2/6 gini(S1) + 4/6 gini(S2), where
  gini(S1) = 1 - (2/2)² = 0 and gini(S2) = 1 - [(2/4)² + (2/4)²] = 0.5,
  so ginisplit(Sports) = 0.333
ginisplit(Truck) = 1/6 gini(S1) + 5/6 gini(S2), where
  gini(S1) = 1 - (1/1)² = 0 and gini(S2) = 1 - [(4/5)² + (1/5)²] = 0.32,
  so ginisplit(Truck) = 0.267

The best split on Car Type is Car Type = Truck.
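
A sketch of evaluating the binary splits of a categorical attribute directly from this count matrix (illustrative code; the printed values match the example above):

def gini_from_counts(counts):
    # counts: {class: count} for one side of the split
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def gini_split_for_value(matrix, value):
    # matrix: {attribute value: {class: count}}; split is {value} vs. the rest
    s1 = matrix[value]
    s2 = {}
    for v, counts in matrix.items():
        if v != value:
            for cls, c in counts.items():
                s2[cls] = s2.get(cls, 0) + c
    n1, n2 = sum(s1.values()), sum(s2.values())
    return (n1 / (n1 + n2) * gini_from_counts(s1)
            + n2 / (n1 + n2) * gini_from_counts(s2))

car_counts = {"Family": {"H": 2, "L": 1}, "Sports": {"H": 2}, "Truck": {"L": 1}}
for v in car_counts:
    print(v, round(gini_split_for_value(car_counts, v), 3))
# Family 0.444, Sports 0.333, Truck 0.267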
         Example (2 attributes)

Age   Risk   Tuple          Car Type   Risk   Tuple
17    H      1              Family     H      0
20    H      5              Sports     H      1
23    H      0              Sports     H      2
32    L      4              Family     L      3
43    H      2              Truck      L      4
68    L      3              Family     H      5

The winner is the split Age < 27.5 (ginisplit = 0.222, versus 0.267 for the
best Car Type split).

Left child (Y): tuples 1, 5 and 0, all of class H (a leaf).

Right child (N): the remaining attribute lists:

Age   Risk   Tuple          Car Type   Risk   Tuple
32    L      4              Sports     H      2
43    H      2              Family     L      3
68    L      3              Truck      L      4
          Performing the split
• Create 2 child nodes
• Split attribute lists for winning attribute
• For the remaining attribute lists:
   – Insert Tuple Ids in Hash Table (which child)
   – Scan lists of attributes and probe hash table
     (may be too large and need several steps).
               Drawbacks
• Large explosion of space (possibly tripling
  the size of database).
• Costly Hash-Join.
        Chapter 7. Classification and
                 Prediction
•   What is classification? What is prediction?
•   Classification by decision tree induction
•   Bayesian Classification
•   Other methods (SVM)
•   Classification accuracy
•   Prediction
•   Summary
             Bayes Theorem
• Given training data D, the posterior probability of a hypothesis h, P(h|D),
  follows Bayes theorem:

     P(h|D) = P(D|h) P(h) / P(D)

• MAP (maximum a posteriori) hypothesis:

     hMAP = argmax over h in H of P(h|D) = argmax over h in H of P(D|h) P(h)

• Practical difficulty: requires initial knowledge of many probabilities,
  significant computational cost
      Naïve Bayes Classifier (I)
• A simplifying assumption: attributes are conditionally independent given
  the class:

     P(Cj | V) ∝ P(Cj) · Πi=1..n P(vi | Cj)

• Greatly reduces the computation cost: only the class distribution and the
  per-class attribute-value counts are needed.
                   Example
Outlook    Temp        Humidity Windy       Class
sunny             75          70        Y       Play
sunny             80          90        Y       Don't
sunny             85          85        N       Don't
sunny             72          95        N       Don't
sunny             69          70        N       Play
overcast          72          90        Y       Play
overcast          83          78        N       Play
overcast          64          65        Y       Play
overcast          81          75        N       Play
rain              71          80        Y       Don't
rain              65          70        Y       Don't
rain              75          80        Y       Play
rain              68          80        N       Play
rain              70          96        N       Play
        Naive Bayesian Classifier (II)
• Given the training set, we can compute the probabilities
  (P = Play, N = Don't):

Outlook       P     N        Humidity   P     N
sunny         2/9   3/5      high       3/9   4/5
overcast      4/9   0        normal     6/9   1/5
rain          3/9   2/5

Temperature   P     N        Windy      P     N
hot           2/9   2/5      true       3/9   3/5
mild          4/9   2/5      false      6/9   2/5
cool          3/9   1/5
                                         Example

Using the weather training data above, classify the new instance
E = {outlook = sunny, temp ∈ [64,70], humidity ∈ [65,70], windy = Y} = {E1, E2, E3, E4}

Pr[Play | E] = (Pr[E1 | Play] × Pr[E2 | Play] × Pr[E3 | Play] × Pr[E4 | Play] × Pr[Play]) / Pr[E]
             = (2/9 × 3/9 × 6/9 × 3/9 × 9/14) / Pr[E] = 0.0105 / Pr[E]

Pr[Don't | E] = (3/5 × 1/5 × 1/5 × 3/5 × 5/14) / Pr[E] = 0.005 / Pr[E]

Normalizing (Pr[E] cancels): Pr[Play | E] ≈ 67.7 %, Pr[Don't | E] ≈ 32.3 %
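
The same computation in a few lines of plain Python, with the conditional probabilities read off the count table two slides back (no library assumed):

p_play = [2/9, 3/9, 6/9, 3/9]     # Pr[E1..E4 | Play]
p_dont = [3/5, 1/5, 1/5, 3/5]     # Pr[E1..E4 | Don't]
prior_play, prior_dont = 9/14, 5/14

score_play, score_dont = prior_play, prior_dont
for p in p_play:
    score_play *= p               # 0.0105 (before dividing by Pr[E])
for p in p_dont:
    score_dont *= p               # 0.0051

total = score_play + score_dont   # Pr[E] cancels when normalizing
print(round(score_play / total, 3), round(score_dont / total, 3))
# roughly 0.67 and 0.33, i.e. the 67.7 % / 32.3 % above up to rounding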
        Bayesian Belief Networks (I)

[Figure: a belief network over the variables FamilyHistory (FH), Smoker (S),
LungCancer (LC), Emphysema, PositiveXRay and Dyspnea.]

The conditional probability table for the variable LungCancer:

       (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC     0.8       0.5        0.7        0.1
~LC    0.2       0.5        0.3        0.9
     Bayesian Belief Networks (II)
• A Bayesian belief network allows conditional independencies to be
  expressed among subsets of the variables
• A graphical model of causal relationships
• Several cases of learning Bayesian belief networks
   – Given both network structure and all the variables: easy
   – Given network structure but only some variables
   – When the network structure is not known in advance
   Bayesian Networks: Another Example (Friedman & Goldszmidt)
Variables : Burglary, Earthquake, Alarm, Neighbor call, Radio
announcement.
Burglary and Earthquake are independent (P(B,E) = P(B)*P(E))
Burglary and Radio announcement are independent given
Earthquake (P(B,R|E) = P(B|E)*P(R|E))
So, P(A,R,E,B)=P(A|R,E,B)*P(R|E,B)*P(E|B)*P(B)
can be reduced to:
       P(A,R,E,B) = P(A|E,B)*P(R|E)*P(E)*P(B)
                           Example (cont.)

[Figure: the network has edges Earthquake → Alarm, Burglary → Alarm,
Earthquake → Radio announcement, and Alarm → Neighbor call.]

Each node is conditionally independent of all nondescendants
given its parents.
                Example (cont.)
Associated with each node is a set of conditional probability
distributions. For example, the "Alarm" node might have the
following probability distribution:

E    B    P(A|E,B)   P(!A|E,B)
E    B    0.90       0.10
E    !B   0.20       0.80
!E   B    0.90       0.10
!E   !B   0.01       0.99
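
As a small illustration of how the factored form P(A,R,E,B) = P(A|E,B)*P(R|E)*P(E)*P(B) is evaluated: the Alarm table below is taken from this slide, while the priors P(E), P(B) and the Radio table are made-up numbers, not from the slides.

# CPT for Alarm from the table above; the other numbers are assumed for illustration
p_alarm = {(True, True): 0.90, (True, False): 0.20,
           (False, True): 0.90, (False, False): 0.01}   # P(A=true | E, B)
p_radio = {True: 0.95, False: 0.001}                    # assumed P(R=true | E)
p_e, p_b = 0.002, 0.01                                  # assumed priors P(E), P(B)

def joint(a, r, e, b):
    # P(A,R,E,B) = P(A|E,B) * P(R|E) * P(E) * P(B)
    pa = p_alarm[(e, b)] if a else 1 - p_alarm[(e, b)]
    pr = p_radio[e] if r else 1 - p_radio[e]
    return pa * pr * (p_e if e else 1 - p_e) * (p_b if b else 1 - p_b)

# e.g. alarm and radio announcement during an earthquake, with no burglary:
print(joint(a=True, r=True, e=True, b=False))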
        Chapter 7. Classification and
                 Prediction
•   What is classification? What is prediction?
•   Issues regarding classification and prediction
•   Classification by decision tree induction
•   Bayesian Classification
•   Other Methods
•   Classification accuracy
•   Prediction
•   Summary
    Extending linear classification

Problem: all the algorithms we covered (plus many other ones) can only
represent linear boundaries between classes.

[Figure: a single linear boundary separating Age <= 25 from Age > 25 - too
simplistic for many real cases.]
        Nonlinear class boundaries

Support vector machines (SVM) -- a misnomer, since they are algorithms, not
machines --
Idea: use a non-linear mapping to transform the instance space into a new space.
Example (a cubic transformation of two attributes a1 and a2):

       x = w1 a1³ + w2 a1² a2 + w3 a1 a2² + w4 a2³
                              SVMs
Based on an algorithm that finds a maximum margin hyperplane (a linear model).

[Figure: the convex hull (tightest enclosing polygon) of each class, the
shortest line connecting the two hulls, and the maximum margin hyperplane
orthogonal to that line, with the support vectors lying closest to it.]
                    SVMs (cont.)
• We have assumed that the two classes are linearly separable,
so their convex hulls cannot overlap.
• The maximum margin hyperplane (MMH) is the one that is
as far away as possible from both convex hulls. It is orthogonal
to the shortest line connecting the hulls.
• The instances closest to the MMH (minimum distance to the
line) are called support vectors (SV). (At least one for each
class, often more.)
   – Given the SVs, we can easily construct the MMH.
   – All other training points can be deleted without any
   effect on the MMH
                    SVMs (cont.)
A hyperplane that separates the two classes can be written as
       x = w0 + w1 a1 + w2 a2
for a two-attribute case.
However, the equation that defines the MMH can be written in terms of the SVs.
Write the class value y of a training instance (point) as 1 (yes) or -1 (no).
Then the MMH is:
       x = b + Σ (i ∈ SVs) αi yi (a(i) · a)
yi is the class value of the point a(i); b and the αi are numerical values to
be determined; a is a test point.
                    SVMs (cont.)
So, now... use the training values to determine b and the αi in
       x = b + Σ (i ∈ SVs) αi yi (a(i) · a)        (a(i) · a is a dot product)

This is a standard constrained quadratic optimization problem, for which
off-the-shelf software packages exist (Fletcher, Practical Methods of
Optimization, 1987).
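
A minimal sketch using such an off-the-shelf solver; this assumes the scikit-learn library (not mentioned in the slides), whose SVC class solves the underlying constrained quadratic optimization:

from sklearn.svm import SVC

# toy two-attribute training set with class values +1 / -1
X = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]]
y = [-1, -1, -1, 1, 1, 1]

clf = SVC(kernel="poly", degree=3)   # cubic mapping, as in the earlier example
clf.fit(X, y)

print(clf.support_vectors_)          # the support vectors found by the solver
print(clf.predict([[4.0, 4.0]]))     # classify a test point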
        Chapter 7. Classification and
                 Prediction
•   What is classification? What is prediction?
•   Issues regarding classification and prediction
•   Classification by decision tree induction
•   Bayesian Classification
•   Other Methods
•   Classification accuracy
•   Prediction
•   Summary
     Classification Accuracy: Estimating
                  Error Rates
• Partition: Training-and-testing
   – use two independent data sets, e.g., training set (2/3), test
     set(1/3)
   – used for data set with large number of samples
• Cross-validation
   – divide the data set into k subsamples
   – use k-1 subsamples as training data and one sub-sample as
     test data --- k-fold cross-validation
   – for data set with moderate size
• Bootstrapping (leave-one-out)
   – for small size data
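
A minimal sketch of k-fold cross-validation (illustrative; train_fn and test_fn stand in for any classifier's training and prediction routines):

import random

def k_fold_indices(n, k, seed=0):
    # shuffle the sample indices and deal them into k disjoint folds
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(train_fn, test_fn, data, labels, k=10):
    # train on k-1 folds, test on the held-out fold, average the accuracies
    # (assumes len(data) >= k, so every fold is non-empty)
    accuracies = []
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        train_x = [d for i, d in enumerate(data) if i not in held_out]
        train_y = [l for i, l in enumerate(labels) if i not in held_out]
        model = train_fn(train_x, train_y)
        correct = sum(test_fn(model, data[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)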
     Boosting and Bagging

• Boosting increases classification accuracy
  – Applicable to decision trees or Bayesian
    classifier
• Learn a series of classifiers, where each
  classifier in the series pays more attention to
  the examples misclassified by its
  predecessor
• Boosting requires only linear time and
  constant space
        Chapter 7. Classification and
                 Prediction
•   What is classification? What is prediction?
•   Issues regarding classification and prediction
•   Classification by decision tree induction
•   Bayesian Classification
•   Classification accuracy
•   Prediction
•   Summary
               What Is
              Prediction?
• Prediction is similar to classification
   – First, construct a model
   – Second, use model to predict unknown value
      • Major method for prediction is regression
          – Linear and multiple regression
          – Non-linear regression
• Prediction is different from classification
    – Classification predicts categorical class labels
   – Prediction models continuous-valued functions
              Predictive Modeling in
                    Databases
• Predictive modeling: Predict data values or construct generalized linear
  models based on the database data.
• One can only predict value ranges or category distributions
• Method outline:
   – Minimal generalization
   – Attribute relevance analysis
   – Generalized linear model construction
   – Prediction
• Determine the major factors which influence the prediction
   – Data relevance analysis: uncertainty measurement, entropy analysis,
     expert judgement, etc.
• Multi-level prediction: drill-down and roll-up analysis
      Regression Analysis and Log-Linear
           Models in Prediction
• Linear regression: Y = α + β X
   – The two parameters α and β specify the line and are estimated
     from the data at hand,
   – using the least squares criterion on the known values Y1, Y2, …,
     X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
   – Many nonlinear functions can be transformed into the above.
• Log-linear models:
   – The multi-way table of joint probabilities is approximated by
     a product of lower-order tables.
   – Probability: p(a, b, c, d) = αab · βac · γad · δbcd
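
A small sketch of the least squares estimates for α and β (plain Python; the data points are made up for illustration):

def fit_line(xs, ys):
    # least-squares estimates of alpha and beta in Y = alpha + beta * X
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = mean_y - beta * mean_x
    return alpha, beta

# toy data: years of experience vs. salary (in $1000s)
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
alpha, beta = fit_line(xs, ys)
print(round(alpha, 1), round(beta, 1))   # about 23.2 and 3.5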
        Chapter 7. Classification and
                 Prediction
•   What is classification? What is prediction?
•   Issues regarding classification and prediction
•   Classification by decision tree induction
•   Bayesian Classification
•   Other Classification Methods
•   Classification accuracy
•   Prediction
•   Summary
              Summary
• Classification is an extensively studied problem (mainly in
  statistics, machine learning & neural networks)
• Classification is probably one of the most widely used data
  mining techniques with a lot of extensions
• Scalability is still an important issue for database applications:
  thus combining classification with database techniques should
  be a promising topic
• Research directions: classification of non-relational data, e.g.,
  text, spatial, multimedia, etc.

								