Data Mining Strategies classes by hcj


Data Mining Strategies/classes (Week 2)

[Diagram: Data Mining, with nodes Clustering, Learning, Classification, Estimation, and Prediction]
    Supervised learning (Predictive)

• Learning to assign objects to classes
  given data examples
• Learner (classifier)

        A typical supervised text learning scenario.
Unsupervised Learning (Descriptive)

The target is not predefined, e.g.
  gathering the items of a database into groups
  so that items in the same group are
  similar (clustering).
• Clustering
• Association Rule Discovery
          Formally: What is Data?
• Collection of data objects and their
  attributes
• An attribute is a property or
  characteristic of an object
   – Examples: eye color of a
       person, temperature, etc.
   – Attribute is also known as
       variable, field, characteristic,
       or feature
• A collection of attributes describes
  an object
   – Object is also known as
       record, point, case, sample,
       entity, or instance

In the table, each column is an attribute and each row is an object.

   Tid  Refund  Marital Status  Taxable Income  Cheat
   1    Yes     Single          125K            No
   2    No      Married         100K            No
   3    No      Single          70K             No
   4    Yes     Married         120K            No
   5    No      Divorced        95K             Yes
   6    No      Married         60K             No
   7    Yes     Divorced        220K            No
   8    No      Single          85K             Yes
   9    No      Married         75K             No
   10   No      Single          90K             Yes
 Types of Data sets in Data Mining
• Training data set
   – represents the history of a set of transactions or
     objects in an application area, e.g. credit card
     transactions, medical diagnosis data, etc.
   – The source of learning the knowledge
• Test data set
   – represents a set of objects similar to the
     training data. However, its main use is to
     evaluate the knowledge (rules) produced from
     the training data set.
                 Attribute Values
• Attribute values are numbers or symbols assigned to an attribute

• Distinction between attributes and attribute values
   – Same attribute can be mapped to different attribute values
        • Example: height can be measured in feet or meters

    – Different attributes can be mapped to the same set of values
       • Example: Attribute values for ID and age are integers

    – The values used to represent an attribute may have properties
      that are not properties of the attribute itself, and vice versa.
               Types of Attributes
•     There are different types of attributes
    –   Integer
       • Examples: ID numbers, zip codes
    –   Real
        •   Examples: height, weight, etc.
    –   Ordinal / Categorical
        •   Examples: rankings (e.g., taste of potato chips on a scale from
            1 to 10), grades, height in {tall, medium, short}
        •   Finite set of possible values, e.g. name, job
    –    Interval
        • Examples: calendar dates, temperatures in Celsius or Fahrenheit
   Properties of Attribute Values
• The type of an attribute depends on which of the following properties it
  possesses:
   – Distinctness:              = ≠
   – Order:             < >
   – Addition:                  + -
   – Multiplication:            * /

    – Categorical attribute: distinctness
    – Real and integer attributes: distinctness & order
    – Interval attribute: distinctness & order
          Discrete and Continuous
•   Discrete Attribute
    –   Has only a finite or countably infinite set of values
    –   Examples: zip codes or the set of words in a collection of documents
    –   Often represented as integer variables.
    –   Note: binary attributes are a special case of discrete attributes

• Continuous Attribute
   – Has real numbers as attribute values
   – Examples: temperature, height, or weight.
   – Practically, real values can only be measured and represented using a
       finite number of digits.
   – Continuous attributes are typically represented as floating-point
       variables.

Typically, categorical and ordinal attributes are discrete, while interval
and ratio attributes are continuous.
           Asymmetric Attributes
• For asymmetric attributes, only presence -- a non-zero attribute
  value -- is regarded as important.
• E.g. Transaction data for association rule discovery
   – “Bread”, “Coke” etc are in fact (asymmetric) attributes and only
      their presence (i.e. value 1 or true) is important.

            TID    Items
            1      Bread, Coke, Milk
            2      Beer, Bread
            3      Beer, Coke, Diaper, Milk
            4      Beer, Bread, Diaper, Milk
            5      Coke, Diaper, Milk
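
The transaction table above can be turned into asymmetric binary attributes, one per item, where 1 means the item is present. The following is a minimal sketch of that idea; the names `transactions`, `items`, and `to_binary_row` are illustrative and do not come from the slides.

```python
# Transactions from the table above: TID -> set of items bought.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# One binary attribute per distinct item, in a fixed (alphabetical) order.
items = sorted(set().union(*transactions.values()))


def to_binary_row(basket):
    """Return 1 where the item is present in the basket, 0 otherwise."""
    return [1 if item in basket else 0 for item in items]


binary_table = {tid: to_binary_row(basket) for tid, basket in transactions.items()}

print(items)
for tid, row in binary_table.items():
    print(tid, row)
```

Only the 1 entries carry information here: two baskets that both lack Beer are not considered similar for that reason alone.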
                  Types of data
• Record
  – Data Matrix
  – Document Data
  – Transaction Data

• Graph
  – World Wide Web
  – Molecular Structures

• Ordered
  –   Spatial Data
  –   Temporal Data
  –   Sequential Data
  –   Genetic Sequence Data
                               Record Data
• Data that consists of a collection of records, each of which consists
  of a fixed set of attributes
                    Tid  Refund  Marital Status  Taxable Income  Cheat
                    1    Yes     Single          125K            No
                    2    No      Married         100K            No
                    3    No      Single          70K             No
                    4    Yes     Married         120K            No
                    5    No      Divorced        95K             Yes
                    6    No      Married         60K             No
                    7    Yes     Divorced        220K            No
                    8    No      Single          85K             Yes
                    9    No      Married         75K             No
                    10   No      Single          90K             Yes
                          Data Matrix
• If data objects have the same fixed set of numeric attributes, then
  the data objects can be thought of as points in a multi-dimensional
  space, where each dimension represents a distinct attribute

• Such a data set can be represented by an m × n matrix, where there
  are m rows, one for each object, and n columns, one for each
  attribute

      Projection of x Load   Projection of y Load   Distance   Load   Thickness
      10.23                  5.27                   15.22      2.7    1.2
      12.65                  6.25                   16.22      2.2    1.1
                     Document Data
• Each document becomes a ‘term’ (word) vector,
   – each term is a component (attribute) of the vector,
   – the value of each component is the number of times the
     corresponding term occurs in the document.

[Table: term counts per document, one column per term]

        Document 1    3       0       5    0       2      6      0    2       0         2
        Document 2    0       7       0    2       1      0      0    3       0         0
        Document 3    0       1       0    0       1      2      2    0       3         0
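
As a rough illustration of how a document becomes a term vector, the sketch below counts how often each vocabulary term occurs in a text. The vocabulary and the sample sentence are hypothetical, and the whitespace tokenizer is an assumption made for brevity.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Count occurrences of each vocabulary term in the text.

    Tokenization is a plain lowercase whitespace split (an assumption;
    real systems usually also strip punctuation and apply stemming).
    """
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["data", "mining", "rule"]
doc = "Data mining finds rule after rule in data"

print(term_vector(doc, vocabulary))
```

Each position in the resulting list corresponds to one term, so documents with the same vocabulary can be compared component by component.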
                  Transaction Data
• A special type of record data, where
   – each record (transaction) involves a set of items.
   – For example, consider a grocery store. The set of products
     purchased by a customer during one shopping trip constitutes a
     transaction, while the individual products that were purchased
     are the items.

            TID    Items
            1      Bread, Coke, Milk
            2      Beer, Bread
            3      Beer, Coke, Diaper, Milk
            4      Beer, Bread, Diaper, Milk
            5      Coke, Diaper, Milk
        Data with Relationships among Objects
• Examples: Generic graph and HTML Links

[Figure: a generic graph]

    <a href="papers/papers.html#bbbb">Data Mining</a>
    <a href="papers/papers.html#aaaa">Graph Partitioning</a>
    <a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
    <a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>

Web search engines collect and process Web pages to extract their contents.
It is well known, however, that the links to and from each page provide a great deal of
information about the relevance of a Web page to a query, and thus must also be taken into
consideration.
Data with Objects That Are Graphs
       E.g. Chemical Data
• Benzene Molecule: C6H6

                      Substructure mining:

                      Which substructures occur
                      frequently in a set of compounds?

                      Ascertain whether the presence of
                      any of these substructures is
                      associated with the presence or
                      absence of certain chemical
                      properties, such as melting point or
                      heat of formation.
        Common Tasks of Data Mining
• Classification: finding the description of several predefined
  classes and classifying a data item into one of them.
• Clustering: identifying a finite set of categories or clusters
  to describe the data.
• Association: finding correlations between items in a database.
• Deviation and change detection: discovering the most significant
  changes in the data.
• Summarization: finding a compact description for a subset of data.
• Given:
  – Database of tuples, each assigned a class label
• Develop a model/profile for each class
  – Example profile (good credit):
  – (25 <= age <= 40 and income > 40k) then
    (married = YES)

• Sample applications:
  – Credit card approval (good, bad)
  – Bank locations (good, fair, poor)
  – Treatment effectiveness (good, fair, poor)
      Data Mining Classification
• Predictive Modelling :
   – Based on the features present in the class-labeled
     training data, develop a description or model for each
     class. It is used for
      • better understanding of each class, and
      • prediction of certain properties of unseen data
   – If the field being predicted is a numeric (continuous)
     variable, then the prediction problem is a regression problem
   – If the field being predicted is categorical, then the
     prediction problem is a classification problem
   – Predictive Modelling is based on inductive learning
     (supervised learning)
Predictive Modelling (Classification):
[Figure: two scatter plots of debt (vertical axis) against income (horizontal axis); * and o mark the two classes. Left panel: Linear Classifier. Right panel: Non Linear Classifier.]
Linear classifier rule: a*income + b*debt < t => No loan!
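
The linear rule a*income + b*debt < t => No loan can be sketched as a small function. The coefficient values and the threshold below are hypothetical defaults chosen only to make the example run; in practice they would be learned from training data.

```python
def loan_decision(income, debt, a=1.0, b=-2.0, t=10.0):
    """Apply the linear rule a*income + b*debt < t => "No loan".

    a, b and t are hypothetical values (not from the slides); income and
    debt are assumed to be in the same units, e.g. thousands.
    """
    score = a * income + b * debt
    return "No loan" if score < t else "Loan"


print(loan_decision(income=50, debt=10))  # score = 50 - 20 = 30
print(loan_decision(income=20, debt=10))  # score = 20 - 20 = 0
```

A non-linear classifier would replace the straight-line score with a curved decision boundary, at the cost of a higher risk of fitting noise.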
       Predictive Modelling (Classification)
• Task: determine which of a fixed set of classes an example belongs to
• Input: training set of examples annotated with class values.
• Output: induced hypotheses (model/concept description/classifiers)

    Learning: induce classifiers from training data
             Training Data -> Learning System -> Classifiers (Derived Hypotheses)

    Prediction: use the hypotheses to classify any
    example described in the same manner

  Data to be classified -> Classifier -> Decision on class
          Classification Algorithms
Basic Principle (Inductive Learning Hypothesis): Any
hypothesis found to approximate the target function well over a
sufficiently large set of training examples will also approximate
the target function well over other unobserved examples.
 Typical Algorithms:
•   Decision trees
•   Rule-based induction
•   Neural networks
•   Memory (case) based reasoning
•   Genetic algorithms
•   Bayesian networks
                  Decision Tree: Example
Day          Outlook Temperature           Humidity         Wind     Play Tennis
1            Sunny      Hot               High              Weak     No
2            Sunny      Hot               High              Strong   No
3            Overcast   Hot               High              Weak     Yes
4            Rain       Mild              High              Weak     Yes
5            Rain       Cool              Normal            Weak     Yes
6            Rain       Cool              Normal            Strong   No
7            Overcast   Cool              Normal            Strong   Yes
8            Sunny      Mild              High              Weak     No
9            Sunny      Cool              Normal            Weak     Yes
10           Rain       Mild              Normal            Weak     Yes
11           Sunny      Mild              Normal            Strong   Yes
12           Overcast   Mild              High              Strong   Yes
13           Overcast   Hot               Normal            Weak     Yes
14           Rain       Mild              High              Strong   No


Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
                Strong -> No
                Weak   -> Yes
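
The tree above can be written as nested conditions and checked against the 14 training examples. This is a hand-coded sketch of the finished tree, not the induction algorithm that produces it; the function name `play_tennis` is illustrative. Temperature is left out because the tree never tests it.

```python
def play_tennis(outlook, humidity, wind):
    """Classify one day using the decision tree shown above."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    # Remaining branch: outlook == "Rain"
    return "Yes" if wind == "Weak" else "No"


# (Outlook, Humidity, Wind, Play Tennis) for days 1..14 from the table.
examples = [
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "High", "Strong", "No"),
    ("Overcast", "High", "Weak", "Yes"),
    ("Rain", "High", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Strong", "No"),
    ("Overcast", "Normal", "Strong", "Yes"),
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"),
    ("Sunny", "Normal", "Strong", "Yes"),
    ("Overcast", "High", "Strong", "Yes"),
    ("Overcast", "Normal", "Weak", "Yes"),
    ("Rain", "High", "Strong", "No"),
]

correct = sum(play_tennis(o, h, w) == label for o, h, w, label in examples)
print(f"{correct} of {len(examples)} training examples classified correctly")
```

Perfect agreement with the training table says nothing yet about unseen days.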
      Issues in Classification

• Consider the error of hypothesis h over
  – training data: error_training (h)
  – entire distribution D of data: error_D (h)
    Hypothesis h overfits the training data if there
    is an alternative hypothesis h’ such that
      error_training (h) < error_training (h’) and
      error_D (h) > error_D (h’)
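
The overfitting condition can be expressed as a direct comparison of four error rates. In this sketch the function name and arguments are illustrative, and error_D is assumed to be estimated on held-out data, since the true distribution D is not observable.

```python
def overfits(h_train_err, h_dist_err, alt_train_err, alt_dist_err):
    """Return True if h overfits relative to the alternative hypothesis h'.

    h overfits when it has lower training error than h' but higher error
    over the whole distribution (estimated here on held-out data, which
    is an assumption of this sketch).
    """
    return h_train_err < alt_train_err and h_dist_err > alt_dist_err


# Hypothetical error rates: h fits the training data perfectly but
# generalizes worse than the simpler alternative h'.
print(overfits(h_train_err=0.00, h_dist_err=0.30,
               alt_train_err=0.10, alt_dist_err=0.20))
```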
                   Preventing Overfitting

• Problem: We don’t want these algorithms to fit to noise in the training data
• Reduced-error pruning :
  – breaks the samples into a training set and a test set. The tree is
    induced completely on the training set.
  – Working backwards from the bottom of the tree, the subtree
    starting at each nonterminal node is examined.
     • If the error rate on the test cases improves by pruning it, the subtree is
       removed. The process continues until no improvement can be made by
       pruning a subtree.
     • The error rate of the final tree on the test cases is used as an estimate of
       the true error rate.
           Evaluation of Classification Systems
Training Set: examples with class values for learning.
Test Set: examples with class values for evaluating.
Evaluation: Hypotheses are used to infer classification of examples in the
test set; inferred classification is compared to known classification.
Accuracy: percentage of examples in the test set that are classified correctly.
[Figure: diagram of predicted classes with regions labelled True Positives, False Positives, and False Negatives]
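
A minimal sketch of the evaluation step, assuming the known and inferred classifications are available as two parallel lists. The labels and example values are hypothetical.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(actual, predicted):
    """Fraction of test examples whose inferred class matches the known class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical test set: known classes and the classifier's output.
actual = ["Yes", "Yes", "No", "No", "Yes"]
predicted = ["Yes", "No", "No", "Yes", "Yes"]

print(confusion_counts(actual, predicted))
print(accuracy(actual, predicted))
```

Accuracy alone hides which kind of mistake was made; the four counts separate false positives from false negatives.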
Mining Association Rules
                What Is Association Mining?
•   Association rule mining:
    –   Finding frequent patterns, associations, correlations, or
        causal structures among sets of items or objects in
        transaction databases, relational databases, and other
        information repositories.
•   Applications:
    –   Basket data analysis, cross-marketing, catalog design,
        loss-leader analysis, clustering, classification, etc.
•   Examples.
    –   Rule form: “Body => Head [support, confidence]”.
    –   buys(x, “diapers”) => buys(x, “beers”) [0.5%, 60%]
    –   major(x, “CS”) ^ takes(x, “DB”) => grade(x, “A”) [1%,
               Association Rule: Basic Concepts
• Given: (1) database of transactions, (2) each transaction is a list
  of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items
  with that of another set of items
   – E.g., 98% of people who purchase tires and auto accessories
      also get automotive services done
• Applications
   – * => Maintenance Agreement (What the store should do to
      boost Maintenance Agreement sales)
   – Home Electronics => * (What other products should the
      store stock up on?)
   – Attached mailing in direct marketing
   – Detecting “ping-pong”ing of patients, faulty “collisions”
      Rule Measures: Support and Confidence
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
• Find all the rules X & Y => Z with
  minimum confidence and support
   – support, s, probability that a transaction
     contains {X, Y, Z}
   – confidence, c, conditional probability
     that a transaction having {X, Y} also
     contains Z

Transaction ID   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F

Let minimum support be 50% and minimum confidence be 50%; then we have
   – A => C (50%, 66.6%)
   – C => A (50%, 100%)
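
The two measures can be checked on the four transactions above with a few lines of code. This is a sketch under the definitions given on the slide; the helper names `support` and `confidence` are illustrative.

```python
# The four transactions from the table above.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions.values() if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Support of the whole rule divided by the support of its antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


print(support({"A", "C"}))        # 2 of 4 transactions
print(confidence({"A"}, {"C"}))   # 2 of the 3 transactions containing A
print(confidence({"C"}, {"A"}))   # 2 of the 2 transactions containing C
```

Both rules share the same support because they involve the same itemset {A, C}; only the antecedent, and therefore the confidence, differs.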
Support: A measure of the frequency with
 which an itemset occurs in a DB.
  supp(A) = # records that contain A
If an itemset has support higher than some
   specified threshold we say that the itemset is
   supported or frequent (some authors use the term
Support threshold is normally set reasonably low
   (say) 1%.
Confidence: A measure, expressed as a ratio, of
 the support for an AR compared to the support
 of its antecedent.
         conf(A => B) = supp(A ∪ B) / supp(A)

We say that we are confident in a rule if
 its confidence exceeds some
 threshold (normally set reasonably
 high, say, 80%).
• Given:
  – Data points and number of desired clusters K
• Group the data points into K clusters
  – Data points within clusters are more similar than across
    clusters

• Sample applications:
   – Customer segmentation
   – Market basket customer analysis
   – Attached mailing in direct marketing
   – Clustering companies with similar growth
  Popular Algorithms: K-means

• Assign initial means
• Assign each point to the cluster for the closest mean
• Compute new mean for each cluster
• Iterate until criterion function converges
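
The four steps above can be sketched in plain Python. This version assumes Euclidean distance and stops when the means no longer change, which is one possible convergence criterion; the sample points and initial means are hypothetical.

```python
import math


def kmeans(points, means, max_iter=100):
    """Plain K-means: assign points to the nearest mean, then recompute means.

    `means` holds the initial means (K = len(means)). Iteration stops when
    the means stop changing or after max_iter rounds (both are assumptions
    of this sketch, not prescribed by the slides).
    """
    clusters = [[] for _ in means]
    for _ in range(max_iter):
        # Assign each point to the cluster with the closest mean.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: math.dist(p, means[i]))
            clusters[nearest].append(p)
        # Compute the new mean of each cluster; keep the old mean if empty.
        new_means = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else means[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_means == means:
            break
        means = new_means
    return means, clusters


# Hypothetical 2-D data with two visible groups, K = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_means, clusters = kmeans(points, means=[(1, 1), (8, 8)])
print(final_means)
print(clusters)
```

The result depends on the initial means; different starting points can lead to different clusterings.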
              Good Reference
More on these topics and other related to
 KDD and Data mining :
