Data Mining Query Languages by ocn20264


									Data Mining Query
Kristen LeFevre
April 19, 2004
With Thanks to Zheng Huang and Lei Chen

 Introduce the problem of querying
  data mining models
 Overview of three different solutions
  and their contributions
 Topic for Discussion: What would an
  ideal solution support?
Problem Description

   You guys are armed with two powerful tools
       Database management systems
       Efficient and effective data mining algorithms
        and frameworks
   Generally, this work asks:
       “How can we merge the two?”
       “How can we integrate data mining more
        closely with traditional database systems,
        particularly querying?”
Three Different Answers

 DMQL: A Data Mining Query
  Language for Relational Databases
  (Han et al, Simon Fraser University)
 Integrating Data Mining with SQL
  Databases: OLE DB for Data Mining
  (Netz et al, Microsoft)
 MSQL: A Query Language for
  Database Mining (Imielinski &
  Virmani, Rutgers University)
Some Common Ground

   Create and manipulate data mining models
    through a SQL-based interface (“Command-
    driven” data mining)
   Abstract away the data mining particulars
   Data mining should be performed on data in
    the database (should not need to export to
    a special-purpose environment)
   Approaches differ on what kinds of models
    should be created, and what operations we
    should be able to perform

   Commands specify the following:
       The set of data relevant to the data mining
        task (the training set)
       The kinds of knowledge to be discovered
         •   Generalized relation
         •   Characteristic rules
         •   Discriminant rules
         •   Classification rules
         •   Association rules

   Commands Specify the following:
       Background knowledge
         • Concept hierarchies based on attribute
           relationships, etc.
       Various thresholds
         • Minimum support, confidence, etc.
                            Syntax
                             use database <database_name>
Specify background
                             {use hierarchy <hierarchy_name> for
Specify rules to be
discovered                   <rule_spec>
Relevant attributes or
                             related to <attr_or_agg_list>
Collect the set of           from <relation(s)>
relevant data to mine        [where <conditions>]
                             [order by <order list>]
Specify threshold            {with [<kinds of>] threshold =
parameters                     <threshold_value> [for <attribute(s)>]}
 Syntax <rule_spec>
find classification rules [as <rule_name>]
  [according to <attributes>]

Find association rules [as <rule_name>]

generalize data [into <relation_name>]

 use database Hospital
 find association rules as Heart_Health
 related to Salary, Age, Smoker,
 from Patient_Financial f, Patient_Medical m
 where f.ID = m.ID and m.age >= 18
 with support threshold = .05
 with confidence threshold = .7

 DMQL provides a display in
  command to view resulting rules, but
  no advanced way to query them
 Suggests that a GUI interface might
  aid in the presentation of these results
  in different forms (charts, graphs, etc.)

 Focus on Association Rules
 Seeks to provide a language both to
  selectively generate rules, and
  separately to query the rule base
 Expressive rule generation language,
  and techniques for optimizing some
   Get-Rules and Select-Rules Queries
     Get-Rules operator generates rules over
      elements of argument class C, which satisfy
      conditions described in the “where” clause
    [Project Body, Consequent,
      confidence, support]
    GetRules(C) [as R1]
    [into <rulebase_name>]
    [where <conds>]
    [sql-group-by clause]

                      <conds> may contain a number of
                       conditions, including:
                              restrictions on the attributes in the body or
in, has, and is are rule        • “rule.body HAS {(Job = „Doctor‟}”
   subset, superset,
     and equality               • “rule1.consequent IN rule2.body”
                                • “rule.consequent IS {Age = *}”
                              pruning conditions (restrict by support,
                               confidence, or size)
                              Stratified or correlated subqueries
       where Body has {Age = *}
       and Support > .05 and Confidence > .7
       and not exists ( GetRules(Patients)
                       Support > .05 and
                       Confidence > .7
                       and R2.Body HAS R1.Body)

Retrieve all rules with descriptors of the form “Age = x” in the body,
except when there is a rule with equal or greater support and
confidence with a rule containing a superset of the descriptors in
the body
              GetRules(C) R1
              where <pruning-conds>
              and not exists ( GetRules(C) R2
                                  where <same pruning-conds>
                                  and R2.Body HAS R1.Body)

              GetRules(C) R1
              where <pruning-conds>
              and consequent is {(X=*)}
 stratified   and consequent in (SelectRules(R2)
                                        where consequent is {(X=*)}

   Nested Get-Rules Queries and their
       Stratified (non-corrolated) queries are
        evaluated “bottom-up.” The subquery is
        evaluated first, and replaced with its results
        in the outer query.
       Correlated queries are evaluated either top-
        down or bottom-up (like “loop-unfolding”),
        and there are rules for choosing between the
        two options
where Body has {Age = *}
and Support > .05 and Confidence > .7
and not exists ( GetRules(Patients)
                Support > .05 and
                Confidence > .7
                and R2.Body HAS R1.Body)

Top-Down Evaluation
where Body has {Age = *}
and Support > .05 and Confidence > .7

For each rule produced by the outer, evaluate the
 not exists ( GetRules(Patients)
            Support > .05 and
            Confidence > .7
            and R2.Body HAS R1.Body)

Bottom-Up Evaluation
not exists ( GetRules(Patients)
           Support > .05 and
           Confidence > .7
           and R2.Body HAS R1.Body)
For each rule produced by the inner, evaluate the
 where Body has {Age = *}
 and Support > .05 and Confidence > .7
                 Choosing between the two
                    In general, evaluate the expression with more
                     restrictive conditions first
                    Heuristic rules
                      • Evaluate the query with higher support threshold first
                      • Next consider confidence threshold
Meant to prevent
                      • A (length = x) expression is in general more restrictive
unconstrained           than (length > x), which is more restrictive than (length
queries from being      < x)
evaluated first       • “Body IS (constant expression)” is more restrictive than
                        “Body HAS”, which is more restrictive than “Body IN”
                      • Next consider “Consequent IN” expressions
                      • Descriptors of for (A = a) are more restrictive than
                        wildcards such as (A = *)
               OLE DB for DM
                  An extension to the OLE DB interface for
                   Microsoft SQL Server
                  Seeks to support the following ideas:
                      Define a model by specifying the set of
                       attributes to be predicted, the attributes used
                       for the prediction, and the algorithm
                      Populate the model using the training data
None of the
                      Predict attributes for new data using the
seemed to
                       populated model
support this          Browse the mining model (not fully
                       addressed because it varies a lot by model
   Defining a Mining Model
       Identify the set of data attributes to be
        predicted, the set of attributes to be used for
        prediction, and the algorithm to be used for
        building the model
   Populating the Model
       Pull the information into a single rowset
        using views, and train the model using the
        data and algorithm specified
       Supports complex objects, so rowset may be
        hierarchical (see paper for more complex

   Using the mining model to predict
       Defines a new operator prediction join.
        A model may be used to make
        predictions on datasets by taking the
        prediction join of the mining model
        and the data set.
     OLE DB for DM
CREATE MINING MODEL [Heart_Health Prediction]
[ID] Int Key,
[Age] Int,
[Smoker] Int,
[Salary] Double discretized,
[HeartAttack] Int PREDICT, %Prediction column
USING [Decision_Trees_101]

Identifies the source columns for the training
data, the column to be predicted, and the data
mining algorithm.
     OLE DB for DM
INSERT INTO [Heart_Health Prediction]
([ID], [Age], [Smoker], [Salary])
SELECT [ID], [Age], [Smoker], [Salary] FROM
  Patient_Medical M, Patient_Financial F

The INSERT represents using a tuple for
training the model (not actually inserting it into
the rowset).
      OLE DB for DM
  [Heart_Health Prediction].[HeartAttack]
FROM [Heart_Health Prediction]
SELECT [ID], [Age], [Smoker], [Salary]
FROM Patient_Medical M, Patient_Financial F
WHERE M.ID = F.ID) as t
ON [Heart_Health Prediction].Age = t.Age AND
  [Heath_Health Prediction].Smoker = t.Smoker
  AND [Heart_Health Prediction].Salary =

Prediction join connects the model and an actual data
table to make predictions
Key Ideas

 Important to have an API for creating
  and manipulating data mining models
 The data is already in the DBMS, so it
  makes sense to do the data mining
  where the data is
 Applications already use SQL, so a
  SQL extension seems logical
Key Ideas

   Need a method for defining data mining
    models, including algorithm specification,
    specification of various parameters, and
    training set specification (DMQL, MSQL,
   Need a method of querying the models
   Need a way of using the data mining model
    to interact with other data in the database,
    for purposes such as prediction (ODBDM)
Discussion Topic:
What Functionality would
and Ideal Solution

To top