									Chapter 7: Transformations
                       Attribute Selection
• Adding irrelevant attributes confuses learning algorithms, so avoid
  such attributes
• Both divide-and-conquer (decision tree) and separate-and-conquer (rule)
  algorithms suffer from this; Naïve Bayes does not
• So first choose the attributes to be considered and then proceed: this
  is a form of dimensionality reduction
• Scheme-independent selection:
    – Use just enough attributes to divide up the instance space in a way that
      separates all the training instances. For example, in Table 1, if we were
      to drop outlook, instances 1 and 4 would become inseparable, which is not
      good. This is a very tedious procedure
• Using machine learning algorithms for attribute selection
    – Decision tree: Apply DT on all attributes, and select only those that are
      actually used in the decisions---the selected attributes can then be used
      in another chosen learning algorithm
    – Use a linear SVM algorithm that ranks attributes based on their weights to
      choose the attributes (recursive feature elimination)
    – Using instance-based learning methods
        • Sample instances randomly from the training set
        • Check neighboring records of the same and different classes (near hits and
          near misses)
        • If a near hit has a different value for a certain attribute, that attribute appears
          to be irrelevant, so its weight is reduced
        • If a near miss has a different value, the attribute appears to be relevant, so
          its weight is increased
        • After repeating this procedure many times, selection takes place: only
          attributes with positive weights are chosen.
• Searching the attribute space:
   –   Fig 7.1
   –   Forward selection (start with empty set and keep expanding)
   –   Backward elimination (start with all, and start eliminating one by one)
   –   Bidirectional search---combination of the above two
• Scheme-specific selection
   – Cross-validation is used to measure the effectiveness of a subset of attributes
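
Forward selection combined with a cross-validated evaluation can be sketched as follows. This is a minimal illustration, not a definitive implementation: the choice of a 1-nearest-neighbour learner, leave-one-out cross-validation, and the names `loo_accuracy` and `forward_selection` are all assumptions made for the example.

```python
def loo_accuracy(rows, labels, attrs):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the attribute indices in `attrs`."""
    if not attrs:
        return 0.0
    correct = 0
    for i, row in enumerate(rows):
        best_j, best_d = None, float("inf")
        for j, other in enumerate(rows):
            if i == j:
                continue  # the held-out instance may not vote for itself
            d = sum((row[a] - other[a]) ** 2 for a in attrs)
            if d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(rows)


def forward_selection(rows, labels):
    """Start with the empty set and greedily add the attribute that most
    improves the cross-validated score; stop when nothing improves it."""
    remaining = set(range(len(rows[0])))
    selected, best_score = [], 0.0
    while remaining:
        scored = [(loo_accuracy(rows, labels, selected + [a]), a)
                  for a in sorted(remaining)]
        # Prefer the higher score; break ties in favour of the lower index.
        score, attr = max(scored, key=lambda s: (s[0], -s[1]))
        if score <= best_score:
            break
        selected.append(attr)
        remaining.remove(attr)
        best_score = score
    return selected, best_score


# Attribute 0 separates the classes; attribute 1 is irrelevant noise.
rows = [(0.0, 5.0), (0.1, 1.0), (0.2, 9.0), (1.0, 5.1), (1.1, 0.9), (1.2, 9.1)]
labels = ["a", "a", "a", "b", "b", "b"]
selected, score = forward_selection(rows, labels)
```

Backward elimination would follow the same pattern, starting from all attributes and removing one at a time.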
             Discretizing Numeric Attributes
•   Global discretization: Used in the 1R learning scheme: Sort the instances by the
    attribute’s value and place range boundaries at the points where the class
    value changes, subject to a minimum number of instances per range
•   Local discretization: Used in decision trees: When a specific attribute is
    used to split a node, the value at which to break it is decided at that
    node
•   Transforming numeric attribute into k binary variables
•   Unsupervised discretization: Ignores the classes of the training set and
    breaks the value range into intervals, e.g., equal-interval binning or
    equal-frequency binning. This runs the risk of destroying distinctions within an
    interval or bin
•   Supervised discretization: takes the classes into account when forming
    the intervals
•   Proportional k-interval discretization: the number of bins is chosen in a
    data-dependent fashion by setting it to the square root of the number of
    instances, with equal-frequency binning
Example: sorted attribute values with their class (Y/N)
    64 Y  65 N  68 Y  69 Y  70 Y  71 N  72 N  72 Y  75 Y  75 Y  80 N  81 Y  83 Y  85 N

Proportional binning (number of bins = 4)
    Bin    Range    Y   N
    Bin1   64-68    2   1
    Bin2   69-71    2   1
    Bin3   72-75    3   1
    Bin4   80-85    2   2

Equal-frequency binning (number of bins = 3)
    Bin    Range    Y   N
    Bin1   64-70    4   1
    Bin2   71-75    3   2
    Bin3   80-85    2   2
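
The class counts in the example above can be checked with a short script. This is a sketch only: the bin ranges are copied from the example rather than computed, rounding the square root to the nearest integer is an assumption, and the names `proportional_bin_count` and `class_counts` are illustrative.

```python
import math
from collections import Counter

# (value, class) pairs copied from the example above.
data = [(64, "Y"), (65, "N"), (68, "Y"), (69, "Y"), (70, "Y"), (71, "N"),
        (72, "N"), (72, "Y"), (75, "Y"), (75, "Y"), (80, "N"), (81, "Y"),
        (83, "Y"), (85, "N")]


def proportional_bin_count(n):
    """Number of bins for proportional k-interval discretization:
    the square root of the number of instances, rounded (assumption)."""
    return round(math.sqrt(n))


def class_counts(data, ranges):
    """Tally the class labels that fall in each inclusive (low, high) range."""
    return [Counter(label for value, label in data if low <= value <= high)
            for low, high in ranges]


k = proportional_bin_count(len(data))  # sqrt(14) is about 3.74
proportional = class_counts(data, [(64, 68), (69, 71), (72, 75), (80, 85)])
equal_freq = class_counts(data, [(64, 70), (71, 75), (80, 85)])
```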
   Entropy-based Discretization
• One example: Order the values of the attribute, and for each
  possible break-point determine the information value (entropy) of
  the resulting split (p. 298-299). Split at the point where this value
  is the smallest, which is where the information gain is largest.
    – Over all break-points, find the one with the smallest value (A);
    – Repeat this procedure for each of the parts formed by breaking at A;
    – Repeat this step recursively until a stopping criterion is met
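
The recursive procedure can be sketched as below. This is a minimal illustration under stated assumptions: the stopping criterion here is a simple depth limit (`max_depth`) plus stopping on pure parts, which stands in for whatever criterion is actually used, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(pairs):
    """Return (cut, info) for the break-point with the smallest weighted
    entropy, or None if all values are equal."""
    pairs = sorted(pairs)
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot break between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or info < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, info)
    return best


def discretize(pairs, max_depth=2):
    """Recursively collect cut points. max_depth is a placeholder
    stopping criterion chosen for this sketch."""
    labels = [label for _, label in pairs]
    if max_depth == 0 or len(set(labels)) <= 1:
        return []
    split = best_split(pairs)
    if split is None:
        return []
    cut = split[0]
    left = [p for p in pairs if p[0] < cut]
    right = [p for p in pairs if p[0] > cut]
    return sorted(discretize(left, max_depth - 1) + [cut]
                  + discretize(right, max_depth - 1))


toy = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
cuts = discretize(toy)
```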
                Some Useful Transformations
• Examples:
     – Subtracting one date attribute from another to obtain a new age
       attribute
     – Converting two attributes A and B to A/B, a new attribute representing
       the ratio
     – Reducing several nominal attributes to one by concatenating their values,
       producing a single attribute with k1 × k2 values
• Principal component analysis: Use a special coordinate system that
  depends on the given cloud of points, as follows: place the first axis
  in the direction of greatest variance of the points; place the 2nd
  axis perpendicular to it; in the multi-dimensional case, choose among
  the perpendicular directions the 2nd axis that maximizes the variance
  along it; and so on. Finally, keep the axes that account for the
  most variance: the principal components
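
The first axis (the direction of greatest variance) can be found from the covariance matrix of the points. The sketch below uses power iteration in plain Python, which is one possible way to get the leading direction and is an assumption of this example, as are the function names and the starting vector; a real implementation would normally use a linear algebra library.

```python
import math


def covariance_matrix(points):
    """Sample covariance matrix of a list of equal-length points."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def first_component(points, iterations=200):
    """Approximate the direction of greatest variance by power iteration.
    Returns (unit_vector, variance_along_it)."""
    cov = covariance_matrix(points)
    d = len(cov)
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary starting guess
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # degenerate case: start vector carries no variance
        v = [x / norm for x in w]
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return v, variance


# Points lying on the line y = x: the first axis should point along (1, 1).
axis, variance = first_component([(0, 0), (1, 1), (2, 2), (3, 3)])
```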
        Random Projections
• Since PCA is expensive (cubic in the number of
  dimensions), an alternative is to use a random
  projection of the data into a subspace with
  a predetermined number of dimensions
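
A random projection can be sketched in a few lines. The use of Gaussian entries scaled by 1/sqrt(k), the fixed seed, and the name `random_projection` are assumptions of this example rather than details given above.

```python
import math
import random


def random_projection(points, k, seed=0):
    """Project d-dimensional points onto k random directions."""
    rng = random.Random(seed)
    d = len(points[0])
    # k x d matrix of Gaussian entries; the 1/sqrt(k) scaling keeps
    # squared lengths roughly unchanged on average.
    matrix = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)]
              for _ in range(k)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in matrix]
            for p in points]


points = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
projected = random_projection(points, k=2)
```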
      Text to attribute vector
• Convert a document to a vector over the words
  that occur in the document; each entry could be
  the frequency of the word or just the
  absence/presence of the word
• In other words, a document is
  characterized by the words that appear
  often in it.
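
A small sketch of the conversion, supporting both word frequencies and presence/absence. The whitespace tokenizer with punctuation stripping and the names `tokenize` and `to_vectors` are simplifying assumptions for illustration.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lower-case the text and split on whitespace, dropping punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def to_vectors(documents, binary=False):
    """Return (vocabulary, vectors). Each vector holds word frequencies,
    or 0/1 presence flags when binary=True."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[(1 if c[w] else 0) if binary else c[w] for w in vocabulary]
               for c in counts]
    return vocabulary, vectors


docs = ["The cat sat.", "The cat saw the dog."]
vocabulary, freq = to_vectors(docs)
_, presence = to_vectors(docs, binary=True)
```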
              Time series
• In time series data, we may sometimes replace
  the attribute values by the differences between
  successive values, etc.
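
Replacing values by the difference between successive values is a one-liner; the function name is illustrative.

```python
def successive_differences(values):
    """Return the differences between each value and the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


diffs = successive_differences([3, 5, 4, 8])
```

The result has one element fewer than the input, since the first value has no predecessor.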
      Automatic Data Cleansing
• Data mining techniques themselves can sometimes help to solve the
  problem of cleansing the corrupted data
• By discarding misclassified instances from the training set,
  relearning, and then repeating until there are no more misclassified
  instances, decision trees induced from data can be improved
• Robust regression: by removing outliers, linear regression is
  improved
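
One simple way to illustrate outlier removal for linear regression: fit a line, drop points whose residual is far larger than the typical residual, and refit. This is only a sketch; the cutoff rule (a multiple of the median absolute residual), the single pass, and the function names are assumptions of this example, not a method given above.

```python
import statistics


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_without_outliers(points, factor=3.0):
    """Fit once, discard points with unusually large residuals, refit.
    `factor` is an arbitrary threshold chosen for this sketch."""
    slope, intercept = fit_line(points)
    residuals = [abs(y - (slope * x + intercept)) for x, y in points]
    cutoff = factor * statistics.median(residuals)
    kept = [p for p, r in zip(points, residuals) if r <= cutoff]
    return fit_line(kept), kept


# Points on y = 2x + 1, with one corrupted value at x = 5.
points = [(x, 2 * x + 1) for x in range(10)]
points[5] = (5, 100)
(slope, intercept), kept = fit_without_outliers(points)
```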
     Combining Multiple Models
• Bagging, boosting, and stacking are prominent methods for combining
  multiple models
• Bagging: Models receive equal weight; the combined output is, for
  example, the majority vote over the individual models' predictions.
• Boosting: Similar to bagging, except that it assigns different weights
  to different models' outputs
• Option tree (Fig. 7.10) and Fig. 7.11 (a negative value means play=yes;
  a positive value means play=no)
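
The difference between equal-weight and weighted combination of model outputs can be sketched as follows. Only the voting step is shown; how the models are trained and how the weights are obtained are left out, and the function names and example weights are made up for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Bagging-style combination: every model's prediction counts equally."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Boosting-style combination: each model's prediction counts
    in proportion to that model's weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


predictions = ["yes", "no", "yes"]
equal = majority_vote(predictions)
weighted = weighted_vote(predictions, [0.2, 0.9, 0.3])
```

With equal weights the two "yes" predictions win; with the example weights the single, heavily weighted "no" wins instead.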
