Integrating Data Mining with SQL Databases: OLE DB for Data Mining

                 Amir Netz    Surajit Chaudhuri    Usama Fayyad1    Jeff Bernhardt
                                    Microsoft Corporation

The integration of data mining with traditional database systems is key to making it convenient, easy to deploy in real applications, and to growing its user base. In this paper we describe the new API for data mining proposed by Microsoft as extensions to the OLE DB standard. We illustrate the basic notions that motivated the API's design and describe the key components of an OLE DB for Data Mining provider. We also include examples of its usage and treat the problems of data representation and integration with the SQL framework. We believe this new API will go a long way in enabling deployment of data mining in enterprise data warehouses. A reference implementation of a provider is available with the recent release of the Microsoft SQL Server 2000 database system.

1. Introduction

   Progress in database technology has provided the foundation that made massive data warehouses of business data ubiquitous. Increasingly, such data warehouses are built using relational database technology. Recently, there has been tremendous commercial interest in mining the information in these data warehouses. However, building mining applications on data over relational databases is nontrivial and requires significant work on the part of application builders. Therefore, what is needed is strong system support to ease the burden of developing mining applications against relational databases. In this paper, we introduce the "OLE DB for Data Mining" (henceforth abbreviated as OLE DB DM) Application Programming Interface (API). As we will demonstrate in the rest of this paper, if a data source supports OLE DB DM (i.e., if the data source is an OLE DB DM "provider"), then the task of building mining applications is simplified because of the native support for DM primitives in OLE DB DM.
   The case for supporting such an API is best understood by reviewing the current state of the art. Traditionally, data mining tasks are performed assuming that the data is in memory. In the last decade, there has been an increased focus on scalability, but research work in the community has concentrated on scaling analysis over data that is stored in the file system. This tradition of performing data mining outside the DBMS has led to a situation where starting a data mining project requires its own separate environment. Data is dumped or sampled out of the database, and then a series of Perl, Awk, and special-purpose programs are used for data preparation. This typically results in the familiar large trail of droppings in the file system, creating an entire new data management problem outside the database. In particular, such "export" creates nightmares of data consistency with the data in warehouses. This is ironic, because database systems were created to solve exactly such data management problems.
   Once the data is adequately prepared, the mining algorithms are used to generate data mining models. However, past work in data mining has barely addressed the problem of how to deploy models, e.g., once a model is generated, how to store, maintain, and refresh it as data in the warehouse is updated, how to programmatically use the model to do predictions on other data sets, and how to browse models. Over the life cycle of an enterprise application, such deployment and management of models remains one of the most important tasks. The objective of proposing OLE DB DM is to alleviate problems in the deployment of models and to ease data preparation by working directly on relational data.
   Another motivation for introducing OLE DB DM is to enable enterprise application developers to participate in building data mining solutions. Easing the ability to develop integrated solutions is critical for the growth of data mining technology in the enterprise space. In enterprises, data mining needs to be perceived as a value-added component alongside traditional decision support techniques, e.g., traditional SQL as well as OLAP querying environments [3]. In order to ensure the success of data mining in this regard, it is important to recognize that a typical enterprise application developer is hardly an expert in statistics and pattern recognition. Rather, he or

1 Author's current affiliation: digiMine, Inc.
she is comfortable using traditional database APIs/tools based on SQL, OLE DB, and other well-known standards and protocols. Therefore, it is extremely important that the infrastructure for supporting data mining solutions be aligned with the traditional database development environment and with APIs for database access. Consequently, we decided to exploit the well-known OLE DB API to ensure that data mining models and operations gain the status of first-class objects in the mainstream database development environment.
   Our proposal builds upon OLE DB [7], an application programming interface that has been gaining popular use for universal data access, not only from relational systems but from any data source that can be viewed as a "set of tables" (rowsets). It is important to point out that OLE DB for DM is independent of any particular provider or software and is meant to establish a uniform API. Another feature of our proposal is that it is not specialized to any specific mining model but is structured to cater to all well-known mining models. Any party interested in using this interface is encouraged to do so by building its own provider, as with OLE DB (and ODBC). However, we should also add that one of the strengths of our proposal is that we have implemented this API as part of the recent release of Microsoft SQL Server 2000 (Analysis Server component). The proposal also incorporates feedback from a multitude of vendors in the data mining space who participated in preview meetings and calls for reviews over the past 18 months.
   Our proposal is not focused on any particular KDD algorithms per se, since our intent is not to propose new algorithms but to suggest a system infrastructure that makes it possible to "plug in" any algorithm and enable it to participate in the entire life cycle of data mining. This paper is our attempt to summarize the basic philosophy and design decisions leading to the current specification of OLE DB DM. The complete specification may be obtained from [9].
   The rest of the paper is organized as follows. We first introduce the overall architecture and design philosophy of OLE DB for Data Mining (OLE DB DM) in Section 2. In Section 3, we discuss, with illustrative examples of their usage, the primitives and basic building blocks in the API. Section 4 presents related work, and we conclude in Section 5.

2. Overview and Design Philosophy

   OLE DB DM provides a set of extensions to OLE DB to make it "mining-enabled." OLE DB is an object-oriented specification for a set of data access interfaces designed for record-oriented data stores. In particular, by virtue of supporting command strings, OLE DB can provide full support for SQL. In this paper, we will primarily introduce extensions that leverage this aspect of the OLE DB API to make sure that support for mining concepts is built in.
   Figure 1 shows how the Microsoft SQL Server database management system exposes OLE DB DM. In this specific implementation, the "core" relational engine exposes the traditional OLE DB interface. However, the analysis server component exposes an OLE DB DM interface (along with OLE DB for OLAP) so that data mining solutions may be built on top of the analysis server. Of course, this is only one example of how OLE DB DM may be exposed. For example, in the future, data mining support may be deeply integrated in the "core" relational database system.
   Our approach in defining OLE DB DM is to maintain the SQL metaphor and to reuse many of the existing notions whenever possible. Due to the widespread use of SQL in the developer community, we decided to expose data mining interfaces in a language-based API. This decision reflects a central design principle: building data mining solutions (applications) should be easy. Thus, our design philosophy requires relational database management systems to invest resources in providing the "drivers" to support OLE DB DM, so that data mining solutions may be developed with ease by data mining application developers. Since there are likely to be many more application developers than DBMS vendors, such a design philosophy inflicts the least pain.
   The key challenge is to support the basic operations of data mining without introducing much change in the programming model and environment that a database developer is used to. Fundamentally, this requires supporting operations on data mining models. The key operations to support on a data mining model are the ability to:
1.  Define a mining model, i.e., identify the set of attributes of data to be predicted, the set of attributes to be used for prediction, and the algorithm used to build the mining model.
2.  Populate a mining model from training data using the algorithm specified in (1).
3.  Predict attributes for new data using a mining model that has been populated.
4.  Browse a mining model for reporting and visualization applications.
   In developing OLE DB DM, we have decided to represent a data mining model (henceforth model or DMM) as analogous to a table in SQL.2 Thus, it can be defined via the CREATE statement (used to create a table

2 Although viewed conceptually and programmatically as tables, models have important differences from tables. A data mining model is populated by consuming a rowset, but its own internal structure can be more abstract; e.g., a decision tree is a tree-like structure, much more compact than the training data set used to create it.
in SQL). It can be populated, possibly repeatedly, via the INSERT INTO statement (used to add rows to a SQL table), and it can be emptied (reset) via the DELETE statement. A mining model may be used to make predictions on datasets. This is modeled through a prediction join between a mining model and a data set, just as the traditional join operation is used to pull together information from multiple tables. Finally, the content of a mining model can be browsed via SELECT (used to identify a subset of data in SQL). Since the mining model is a native named object at the server, it can participate in interactions with other objects using the primitives listed above (see also Section 3). Note that this achieves the desirable goal of avoiding excessive data movement, extraction, and copying, thus resulting in better performance and manageability compared to the traditional "mining outside the database server" framework. Finally, once the DMM is created and optimized, deployment within the enterprise becomes as easy as writing SQL queries. This significantly minimizes the cost of deployment by eliminating the need to develop a proprietary infrastructure and interface to deploy the data mining solution. Model deployment and maintenance costs have received little attention in the literature, though they are an expensive and difficult task in real applications. In the next section, we review the details of the language extensions that provide an extensible framework for incorporating support for data mining.

3. Basic Components of OLE DB DM

   OLE DB DM has only two notions beyond the traditional OLE DB definition: cases and models.
•  Input data is in the form of a set of cases (caseset). A case captures the traditional view of an "observation" by machine learning algorithms as consisting of all information known about a basic entity being analyzed for mining. Of course, structurally, a set of cases is no different from a table.3 Cases are explained further in Section 3.1.
•  A data mining model (DMM) is treated as if it were a special type of "table": (a) We associate a caseset with a DMM, along with additional meta-information, while creating (defining) the DMM. (b) When data (in the form of cases) is inserted into the data mining model, a mining algorithm processes it and the resulting abstraction (or DMM) is saved instead of the data itself. Once a DMM is populated, it can be used for prediction, or its content can be browsed for reporting and visualization purposes. Fundamental operations on models include CREATE, INSERT INTO, PREDICTION JOIN, SELECT, DELETE FROM, and DROP.

   It should be noted that OLE DB DM has all the traditional OLE DB interfaces. In particular, schema rowsets specify the capabilities of an OLE DB DM provider. This is the standard mechanism in OLE DB whereby a provider describes information about itself to potential consumers. This information describes the supported capabilities (e.g., prediction, segmentation, sequence analysis), the types of data distributions supported, the limitations of the provider (size or time limitations), support for incremental model maintenance, etc. Other schema rowsets provide metadata on the columns of a mining model, on its contents, and on the supported services. The details of these schema rowsets are beyond the scope of this paper.

   The rest of this section is organized as follows. In Section 3.1, we describe how input data is modeled as a caseset. In Section 3.2, we describe how data mining models may be defined (created) in a relational framework and the kind of meta-information on casesets that mining needs to be aware of. Section 3.3 describes operations on mining models, notably insertion (populating the model), prediction, and browsing.

3.1. Data as Cases

   To ensure that a data mining model is accurate and meaningful, data mining algorithms require that all information related to the entity to be mined be consolidated before invoking the algorithms. Traditionally, mining algorithms also view the input as a single table that represents such consolidated information, where each row represents an instance of an entity. Data mining algorithms are designed so that they consume one entity instance at a time. Having all the information related to a single entity instance together in a rowset has two important benefits. First, traditional data mining algorithms can be leveraged with relative ease. Second, it increases scalability, as it eliminates the need for data mining algorithms to do considerable bookkeeping.
   In a relational database, data is often normalized, and hence the information related to an entity is scattered across multiple tables. Therefore, in order to use data mining, a key step is to be able to pull the information related to an entity into a single rowset using views. Furthermore, as is well recognized, data is often hierarchical in nature, and so ideally we require the view definition language to be powerful enough to produce a hierarchical rowset. This collection of data pertaining to a single entity is called a case, and the set of all relevant cases is referred to as a caseset. As mentioned above, a caseset may need

3 However, like many other data sets, cases are more naturally represented as hierarchical data, i.e., nested tables are more natural.
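The fundamental operations on a mining model listed above can be sketched as SQL-style statements. The model name [Age Prediction], the column lists, and the caseset placeholders below are hypothetical, and the statements are patterned only loosely on the OLE DB DM specification [9]:

```sql
-- Populate (train) the model from a caseset:
INSERT INTO [Age Prediction]
    ([Customer ID], [Gender], [Age], ...)
    <caseset query>

-- Predict the Age column for new cases via a prediction join:
SELECT t.[Customer ID], [Age Prediction].[Age]
FROM   [Age Prediction] PREDICTION JOIN <new caseset> AS t
ON     [Age Prediction].[Gender] = t.[Gender] AND ...

-- Browse the model's content:
SELECT * FROM [Age Prediction]

-- Empty (reset) and remove the model:
DELETE FROM [Age Prediction]
DROP MINING MODEL [Age Prediction]
```

Note how each statement reuses a familiar SQL verb, so a database developer's existing programming model carries over to mining models unchanged.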
 Customer | Gender | Hair  | Age | Age   | Product Purchases            | Car_Ownership
 ID       |        | Color |     | Prob. | Name  Quantity  Type         | Car    Prob.
 ---------+--------+-------+-----+-------+------------------------------+--------------
 1        | Male   | Black | 35  | 100%  | TV      1       Electronic   | Truck  100%
          |        |       |     |       | VCR     1       Electronic   | Van     50%
          |        |       |     |       | Ham     2       Food         |
          |        |       |     |       | Beer    6       Beverage     |

                Table 1: A nested table representation of the 3-way join result
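A hierarchical rowset like the one in Table 1 can be produced from the three normalized tables with the Data Shaping Service's SHAPE syntax. The statement below is a sketch, with table and column names assumed from the example rather than taken from the paper:

```sql
SHAPE
  {SELECT [Customer ID], [Gender], [Hair Color], [Age], [Age Prob]
   FROM Customers}
APPEND
  ({SELECT [Customer ID], [Product Name], [Quantity], [Product Type]
    FROM [Product Purchases]}
   RELATE [Customer ID] TO [Customer ID]) AS [Product Purchases],
  ({SELECT [Customer ID], [Car], [Car Prob]
    FROM [Car Ownership]}
   RELATE [Customer ID] TO [Customer ID]) AS [Car Ownership]
```

Each parent row then carries two table-valued columns holding only that customer's purchases and cars, avoiding the row replication that a flat 3-way join would produce.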

to be hierarchical. In the rest of this section, we elaborate on these two issues through examples and further discussion.
   Consider a schema consisting of three tables representing customers. The Customers table has some customer properties. The Product Purchases table stores, for each customer, what products they bought. The Car Ownership table lists the cars owned by each customer.4 Consider a customer, say Customer ID 1, who happens to be male, has black hair, and is believed to be 35 years old with 100% certainty. Assume this customer has bought a TV, a VCR, Beer (quantity 6), and Ham (quantity 2). Further assume we know this customer owns a truck (100%) and we believe he also has a van (50% certainty). Suppose one were to issue a join query over the three tables to get all the information we know about the set of customers. Readers familiar with SQL will quickly realize that this join query will return a table of 12 rows. Furthermore, these 12 rows contain a great deal of replication. More importantly, since the information about an entity instance is scattered among multiple rows, the quality of output from data mining algorithms is negatively impacted by such flattened representations.
   Table 1 shows how a nested table can succinctly describe the above join result. As illustrated in Table 1, a customer case can include not only simple columns but also multiple tables (nested tables as columns). Each of these tables inside the case can have a variable number of rows and a different number of columns. Some of the columns in the example have a direct one-to-one relationship with the case (such as "Gender" and "Age"), while others have a one-to-many relationship with the case and therefore exist in nested tables. The reader can easily identify the two nested tables contained in the sample case in Table 1:
   • the Product Purchases table, containing the columns Product Name, Product Quantity, and Product Type;
   • the Car Ownership table, containing the columns Car and Car Probability.
   Nested tables are a familiar notion in the relational database world, though not universally adopted in commercial systems. In OLE DB DM, we have chosen to use the Data Shaping Service5 to generate a hierarchical rowset. It should be emphasized that this is a logical definition and does not have any implications for data storage. In other words, it is not necessary for the storage subsystem to support nested records. Cases are only instantiated as nested rowsets prior to training or predicting with data mining models. Note that the same physical data may be used to generate different casesets for different analysis purposes. For example, if one chooses to mine models over all products, each product then becomes a single case, and customers would need to be represented as columns of the case.
   Once a hierarchical caseset has been prepared using the traditional view definition mechanism and the Data Shaping Service in OLE DB, we can use this caseset to populate a data mining model, or we can predict values of unknown columns using a pre-existing model. In order to do that, we need a way to represent data mining models. We illustrate this in the next section.

3.2. Creating/Defining Data Mining Models

   In OLE DB DM, a data mining model (DMM) is treated as a first-class object of the provider, much like a table. In this section, our focus is on the definition (creation) of data mining models. In particular, the DMM definition includes a description of the columns of data over which the model will be created, including meta-information on relationships between columns. It is assumed that a data set is represented as a nested table, as discussed in Section 3.1. In Section 3.3, we will describe how other operations on data mining models may be supported in OLE DB DM.
   As discussed earlier, a data mining model is trained over a single caseset and is used to predict columns of a caseset. Therefore, defining (creating) a data mining model requires us to specify:
•  The name of the model.

4 The columns Age Probability and Car_Ownership.Probability are unusual in that they represent a degree of certainty.
5 The Data Shaping Service is part of the Microsoft Data Access Components (MDAC) products.
•  The algorithm (and parameters) used to build the model.
•  The algorithm for prediction using the model.
•  The columns the caseset must have and the relationships among these columns.
•  The columns to be used as "source" columns and the columns to be filled in by the mining model ("prediction columns").

   The following example illustrates the syntax:

   CREATE MINING MODEL [Age Prediction] (      %Name of model
   [Customer ID]       LONG KEY,
   [Gender]            TEXT DISCRETE,
   [Age]               DOUBLE DISCRETIZED PREDICT,
                                               %prediction column
   [Product Purchases] TABLE(
   [Product Name]      TEXT KEY,
   [Quantity]          DOUBLE NORMAL CONTINUOUS,
   [Product Type]      TEXT DISCRETE
                       RELATED TO [Product Name] ))
   USING [Decision_Trees_101]
                                               %Mining algorithm used

   The annotations in the example identify the source columns used for prediction, the column to be predicted (Age), and the mining algorithm used. Specifically, the DMM is named [Age Prediction] (the square brackets [ ] are name delimiters by convention). The PREDICT keyword indicates that the Age column is to be predicted. [Product Purchases] is a nested table, with [Product Name] as its key. The USING clause specifies the mining algorithm to use; parameters to the algorithm could be included as part of this clause. When we contrast this example with the CREATE TABLE syntax in SQL, a key difference is the substantially richer meta-information that adorns each column reference above.
   The model columns describe all of the information about a specific case. For example, assume that each case in the DMM represents a customer. The columns of the DMM will include all known and desired information about the customer (e.g., Table 1). Several kinds of information may be specified for columns:
   • Information about the role of a column (Section 3.2.1): Is it a key? Does it contain an attribute? How does it relate to other columns or nested tables? Is it a qualifier on other columns?
   • For columns that represent attributes, what type of attribute is it? Attribute types are described in Section 3.2.2.
   • Additional information about data distributions that can be provided as hints to the DMM (Section 3.2.3).
   These details are discussed in the rest of this subsection, and we conclude the section with a comprehensive example. Finally, note that despite the similarities with CREATE TABLE with respect to programmability, mining models are conceptually different, as they represent only summarized information for a data set. Also, unlike traditional data tables, they are capable of prediction (see Section 3.3).

3.2.1 Content Types of Columns – Column Specifiers
   Each column in a case can represent one of the following content types:
   KEY: Columns that identify a row. For example, "Customer ID" uniquely identifies customer cases; "Product Name" uniquely identifies a row in the "Product Purchases" table.
   ATTRIBUTE: A direct attribute of the case. This type of column represents some value for the case. Example attributes are the age, gender, or hair color of the customer, or the quantity of a specific product the customer purchased.
   RELATION: Information used to classify attributes, other relations, or key columns. For example, "Product Type" classifies "Product Name." A given relation value must always be consistent for all of the instance values of the other columns it describes; for example, the product "Ham" must always be shown as "Food" for all cases. In the CREATE MINING MODEL command syntax, relations are identified in the column definition by using a RELATED TO clause to indicate the column being classified.
   QUALIFIER: A special value associated with an attribute that has a predefined meaning for the provider, for example, the probability that the attribute value is correct. These qualifiers are all optional and apply only if the data has uncertainties attached to it or if the output of previous predictions is chained as input to a subsequent DMM training step. In the CREATE MINING MODEL command syntax, qualifiers are identified by using an OF clause to indicate the attribute column they modify. There are many qualifiers. We give a few examples here:
      (a) PROBABILITY: The probability [0,1] of the associated value.
      (b) VARIANCE: A number that describes the variance of the value of an attribute.
      (c) SUPPORT: A float that represents a weight (case replication factor) to be associated with the value.
      (d) PROBABILITY_VARIANCE: The variance associated with the probability estimator used for PROBABILITY.
      (e) ORDER: Specifies the order of a column (see ORDERED below).
      (f) TABLE: A nested table in the case consists of a special column with the data type TABLE. For any given case row, the value of a TABLE type column contains the entire contents of the associated nested table. The value of a TABLE type column is in itself a table containing all of the columns for the nested table. In the CREATE MINING MODEL command syntax, nested tables are described by a set of columns, all of which are contained within the definition of a named TABLE type column.
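Putting the RELATED TO and OF clauses together, the [Age Prediction] model of Section 3.2 could carry the certainty columns of Table 1 as qualifiers. This is an illustrative sketch in the syntax described above; the qualifier column names are assumed:

```sql
CREATE MINING MODEL [Age Prediction] (
    [Customer ID]   LONG KEY,
    [Gender]        TEXT DISCRETE,
    [Age]           DOUBLE DISCRETIZED PREDICT,
    [Age Prob]      DOUBLE PROBABILITY OF [Age],    %qualifier column
    [Car Ownership] TABLE (
        [Car]       TEXT KEY,
        [Car Prob]  DOUBLE PROBABILITY OF [Car] )   %qualifier column
) USING [Decision_Trees_101]
```

Here [Age Prob] and [Car Prob] are QUALIFIER columns modifying [Age] and [Car] respectively, so the uncertain ages and car ownerships of Table 1 can be consumed directly during training.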
                                                                syntax of such prediction columns.
3.2.2 Content Types of Columns – Attribute Types
   For a column that is specified as an attribute, it is possible to further indicate the type of attribute. The following keywords specify the attribute type to the DM provider:
• DISCRETE: The attribute values are discrete (categorical), with no ordering implied by the values (often called states). "Area Code" is a good example.
• ORDERED: Columns that define an ordered set of values with a total ordering, but no distance or magnitude semantics are implied. A ranking of skill level (say, one through five) is an ordered set, but a skill level of five isn't necessarily five times better than a skill level of one.
• CYCLICAL: A set of values that have a cyclical ordering. Day of the week is a good example, since day number one follows day number seven.
• CONTINUOUS: Attributes with numeric values (integer or float). Values are naturally ordered and have implicit distance and magnitude semantics. Salary is a typical example.
• DISCRETIZED: The data that will be inserted into the model is continuous, but it should be transformed into and modeled as a number of ORDERED states by the provider. Some data mining algorithms cannot accept CONTINUOUS attributes as input, or they may not be able to predict CONTINUOUS values.
• SEQUENCE_TIME: A time measurement range. The format is not restricted, e.g., a period number is acceptable. This is typically used to associate a sequence time with individual attribute values, such as purchase time.

could be modeled as binary, where the significant information is whether the age is known or not.

3.2.4 Prediction Columns
   Output columns are columns in a dataset that the DMM is capable of predicting. Because these columns will be predicted over cases in which their values are missing or unspecified, additional statistical information may be associated with such predictions. We discuss this additional information here. Section 3.2.3 presents the syntax of such prediction columns.
   Attribute or Table type columns can be input columns, output columns, or both. The data mining provider builds a DMM capable of predicting or explaining output column values based on the values of the input columns. Predictions may convey not only simple information such as "estimated age is 21," but also additional statistical information associated with the prediction, such as confidence level and standard deviation. Further, the prediction may actually be a collection of predictions, such as "the set of products that the customer is likely to buy." Each of the predictions in the collection may also include a set of statistics.
   A prediction can be expressed as a histogram. A histogram provides multiple possible prediction values, each accompanied by a probability and other statistics. When histogram information is required, each prediction (which by itself can be part of a collection of predictions) may have a collection of possible values that constitutes a histogram.
   Since there is a wealth of information we can associate with predictions, it is often necessary to extract only a portion of the predicted information. For example, a user may want to see only the "best estimate," "top 3 estimates," or "estimates with probability greater than 55%." Neither every provider nor every DMM can support all of the possible requests. Therefore, it is necessary for the output column to define whatever information may be extracted from it.
   OLE DB DM defines a set of standard transformation functions on output columns. These functions are discussed in detail in the OLE DB DM specification document [9]. The basic mechanism we used to accommodate the flexibility in output is to draw on the
3.2.3 Distribution Hints                                        familiar notion of user-defined functions (UDF) used in
    An attribute, continuous or discrete, may have an           OLAP [10,15]. Each provider ships a set of functions that
associated distribution. These distributions are used as        can be referenced in the prediction query. Some UDF’s
hints to the DMM and can specify prior knowledge about          are scalar-valued, such as probability, or support. Others
the data. For example, a continuous attribute can be            have tables as values, such as histogram and hence return
normal (Gaussian), log normal, or uniform. A discrete can       nested tables when invoked.
be binomial, multinomial, Poisson, mixture. Other hints
could include: Not_Null which indicates there should            3.2.5 Detailed Example of DMM Column Specification
never be a null value, model existence only which means            Now that we have described what the column models
the information of interest is not in the value of an           are, it may be useful to provide an example. Column
attribute, but in the fact that a value is present. Hence Age   descriptions allow the provider to better model the
training data it is given. We now return to our running        between a data mining model and the given (test) data set.
example and indicate how content types of some of the          Prediction join between the cases from the test data test
columns in the example can now be classified as shown          with the DMM and selecting the “Age” column from the
below (content type in italics):                               DMM returns a predicted “Age” for each of the test
         (a) Gender: discrete attribute                        cases.
         (b) Age: continuous attribute                            We need to use “prediction join” since the process of
         (c) Age Probability: probability modifier of age      combining actual cases with all possible model cases is
         (d) Customer Loyalty: ordered attribute               not as simple as the semantics of a normal SQL equi-join.
         (e) Product Name: Key (for Product Purchases          For the instance when the schema of the actual case table
             table)                                            matches the schema of the model, NATURAL
         (f) Product Type: Relation of Product Name            PREDICTION JOIN can be used, obviating the need for
                                                               the ON clause of the join. Columns from the source query
   The example above includes a few additional columns         will be matched to columns from the DMM based on the
to illustrate a few more content types.                        names of the columns.

3.3. Operations on Data Model
                                                                  SELECT t.[Customer ID], [Age Prediction].[Age]
                                                                  FROM [Age Prediction]
Populating a Mining Model: Insert Into                            PREDICTION JOIN (SHAPE {
Once a mining model is defined, the next step is to               SELECT [Customer ID], [Gender], FROM Customers
populate a mining model by consuming a caseset that               ORDER BY [Customer ID]}
satisfies the specification in the Create Mining Model            APPEND ({SELECT [CustID], [Product Name], [Quantity]
statement. In OLE DB DM, we use INSERT to model                   FROM Sales
instantiation of the mining model. As against a traditional       ORDER BY [CustID]}
table, insertion does not result in addition of the tuple to      RELATE [Customer ID] To [CustID])
the rowset. Rather, the insertion corresponds to                  AS [Product Purchases]) as t
                                                                  ON [Age Prediction].Gender = t.Gender and
consuming the observation represented by a case using             [Age Prediction].[Product Purchases].[Product Name]
the data mining model. The following example illustrates          = t.[Product Purchases].[Product Name] and
the syntax. Note the use of the SHAPE statement to create         [Age Prediction].[Product Purchases].[Quantity]
the nested table from the input data:                             = t.[Product Purchases].[Quantity]

  INSERT INTO [Age Prediction]                                     The DMM content can be thought of as a “truth table”
  ([Customer ID], [Gender], [Age],                             containing a row for every possible combination of the
  [Product Purchases]([Product Name], [Quantity], [Product     distinct values for each column in the DMM. In other
  Type])) SHAPE                                                words, it contains every possible case along with every
  {SELECT [Customer ID], [Gender], [Age]                       possible associated set of output columns. With this view
  FROM Customers
  ORDER BY [Customer ID]}
                                                               in mind, a DMM can be used to look up predicted values
  APPEND (                                                     and statistics by using the attribute values of a case as a
  {SELECT [CustID], [Product Name], [Quantity], [Product       key for the join. It is important to point out that this
  Type] FROM Sales ORDER BY [CustID]}                          analogy is not quite accurate as the set of all possible
  RELATE [Customer ID] To [CustID])                            cases (combinations of attribute values) is huge, and
  AS [Product Purchases]                                       possibly infinite when one considers continuous
                                                               attributes. However, this logical view allows us to map
                                                               prediction into a familiar basic operation in the relational
Using Data Model to Predict: Prediction Join                   world.
   We now explain how predictions can be obtained from             When a model is joined with a table, predicted values
models in OLE DB DM that has been populated. The               are obtained for each case that “matches” the model (i.e.
basic operation of obtaining prediction on a dataset D         cases for which the model is capable of making a
using a DMM M is modeled as a “prediction join”                prediction). A SELECT statement associated with the
between D and M. Of course, the caseset needs to match         PREDICTION JOIN specifies which predicted values
the schema of the DMM. Since the model does not                should be extracted and returned in the result set of the
actually contain data details, the semantics of the            query. User-defined functions (UDF) in the projection
Prediction Join are different than those of a equi-join on     columns are used to further enhance the set of predicted
tables. The following example illustrates how prediction       values with additional statistical information. We have
join may be used. In this example, the value for “Age” is      discussed this aspect in Section 3.2.4.
not known and is being predicted using the prediction join
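   To tie the column specifications of Section 3.2 to the operations above, the running example can be sketched end to end. The sketch below is illustrative rather than normative: the algorithm name [Microsoft Decision Trees] and the functions PredictProbability and PredictHistogram are examples of provider-shipped names and UDFs, not part of the OLE DB DM syntax itself, and the exact data types are assumptions.

```sql
-- Defining the running-example model with explicit content types.
-- KEY, DISCRETE, CONTINUOUS, and RELATED TO are the column qualifiers
-- discussed in Section 3.2; PREDICT marks output columns.
CREATE MINING MODEL [Age Prediction] (
    [Customer ID]       LONG    KEY,
    [Gender]            TEXT    DISCRETE,
    [Age]               LONG    CONTINUOUS PREDICT,
    [Product Purchases] TABLE (
        [Product Name]  TEXT    KEY,
        [Quantity]      DOUBLE  CONTINUOUS,
        [Product Type]  TEXT    DISCRETE RELATED TO [Product Name]
    )
) USING [Microsoft Decision Trees]

-- A prediction query that extracts more than the single best estimate:
-- provider-shipped UDFs appear in the projection list alongside the
-- predicted column. Because the source columns are named as in the model,
-- NATURAL PREDICTION JOIN is used and no ON clause is needed.
SELECT t.[Customer ID],
       [Age Prediction].[Age],                      -- best estimate
       PredictProbability([Age Prediction].[Age]),  -- scalar-valued UDF
       PredictHistogram([Age Prediction].[Age])     -- table-valued UDF (nested table)
FROM [Age Prediction]
NATURAL PREDICTION JOIN
    (SELECT [Customer ID], [Gender] FROM Customers) AS t
```

   Note that the prediction query supplies only a subset of the model's input columns; the provider predicts from whatever inputs are present in the joined cases.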
Browsing Model Content: Select

   In addition to listing the column structure of a DMM, a different type of browsing is to navigate the content of the model viewed as a graph. Using a set of input cases, the content of a DMM is learned by the data mining algorithm. The content of a DMM is the set of rules, formulas, classifications, distributions, nodes, or any other information that was derived from a specific set of data using a data mining technique. The content varies according to the specific data mining technique used to create the DMM. The DMM content of a decision tree–based classification will differ from that of a segmentation model, which, in turn, differs from that of a multi-regression DMM.
   The most popular way to express DMM content is by viewing it as a directed graph (that is, a set of nodes with connecting edges). Note that decision trees fit such a view nicely. Each node in the tree may have relationships to other nodes. A node may have one or more parent nodes and zero or more child nodes. The depth of the graph may vary depending on the specific node.
   The tree navigation we adopted is similar to the mechanism already defined in the OLE DB for OLAP specification [10]. On querying (SELECT * FROM <mining model>.CONTENT), the content of a mining model is exposed through the MINING_MODEL_CONTENT schema rowset. Unfortunately, a detailed discussion of the MINING_MODEL_CONTENT schema rowset is beyond the scope of this paper (see Appendix A of [9]).

4. Related Work

   The focus of OLE DB DM is not on new KDD algorithms. Therefore, we do not discuss here the wide body of literature on KDD algorithms and scalable techniques for such algorithms.
   Compared to the wide body of work on KDD algorithms, as well as the more recent work on scalable algorithms, there has been relatively little work on the problem of integrating data mining with relational databases.
   The work by Sarawagi et al. [13] addresses the problem of how to implement the derivation of association rules efficiently using SQL. Although significant for its performance implications, this direction of work is orthogonal to the issue of how to enable, deploy, and expose data mining models and algorithms as first-class objects in the database application API. A similar effort relating to generalized association rules and sequences is also focused on implementing particular algorithms in SQL [14].
   Research projects such as Quest [1] and DBMiner [6], as well as commercial mining tools, provide application interfaces and UI to support mining and allow access to data in the warehouse. However, such tools do not provide the ability to deal with arbitrary mining models, or integration of the mining application interfaces with SQL and with relational data access APIs (ODBC, OLE DB).
   The paper by Meo et al. [8] is an example of specialized extensions to SQL to provide support for association rules. Such an approach is narrowly focused and does not provide an extensible framework for supporting models derived from other mining techniques, e.g., clustering and classification. Moreover, no support is provided for the management of mining models or for using such models subsequently in other contexts.
   More recently, CRISP-DM [4] ( has been proposed as a standard data mining process model. This initiative is complementary to OLE DB DM. CRISP-DM provides a set of methodologies and best practices, and attempts to define the various activities involved in the KDD process. It does not address issues of integration with the database system.
   A related effort, the Predictive Model Markup Language (PMML), provides an open standard for how models should be persisted in XML. The basic idea of PMML is to enable portability and model sharing [5]. PMML does not address database integration issues but specifies a persistence format for models. We are currently working with the PMML group to use the PMML format as an open persistence format. In fact, in the model browsing methods briefly discussed in Section 3.3.3, we use PMML-inspired XML strings in exposing the content of data mining model nodes.
   As mentioned earlier, OLE DB DM is an extension to OLE DB, a specification of data access interfaces. OLE DB generalized the widely used ODBC connectivity standard by enabling access to record-oriented data stores that are not necessarily SQL data sources.

5. Concluding Remarks

   We presented the highlights of OLE DB DM, a new API for integrating data mining into SQL databases. A reference OLE DB DM provider is scheduled to ship with the next release of a commercial database system: Microsoft SQL Server 2000.
   The target user base of OLE DB DM is the community of enterprise developers who are familiar with SQL and database connectivity APIs such as ODBC and OLE DB. For this reason, a key design goal was to keep the set of extensions as close to SQL's look and feel as possible. OLE DB DM introduces data mining objects and entities by drawing straightforward analogies to familiar relational objects. We have introduced only two new concepts: Data Mining Models (DMM) and Cases.
   We hope that OLE DB DM will help steer our community toward establishing a standard in the field. The advantages of such a standard include integration with the database, so that data mining no longer needs to involve taking data outside the database; support for deploying models once they are created; and an infrastructure for managing data and models while leveraging the existing SQL infrastructure. Establishing standards also reduces costs, increases the ability to interoperate and partner, and reduces the risks for both providers and consumers. Consumers are protected since they are no longer dependent on a proprietary interface, so other vendors can step in to provide replacements and enhancements. Providers are protected from the need to build infrastructure from scratch, as partners can step in to build different complementary components. To move toward a standard, an open forum such as a standards committee elected by the community will be needed. We hope that this proposal will encourage further discussions along this direction.

Acknowledgement: We are grateful to Venkatesh Ganti for his thoughtful comments on the draft.

6. References

[1] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant: The Quest data mining system. Proc. of the 2nd int'l conference on knowledge discovery in databases and data mining, August 1996.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A. I. Verkamo: Fast discovery of association rules. In Advances in knowledge discovery and data mining, pp. 307-328, AAAI Press, Menlo Park, CA, 1996.
[3] S. Chaudhuri and U. Dayal: An overview of data warehousing and OLAP technology. SIGMOD Record 26(1), pp. 65-74, 1997.
[4] P. Chapman, R. Kerber, J. Clinton, T. Khabaza, T. Reinartz, R. Wirth: The CRISP-DM process model. Technical report, 1999.
[5] DMG Organization: PMML 1.0 – Predictive Model Markup Language, 1999.
[6] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski, D. Li, Y. Lu, A. Rajan, N. Stefanovic, B. Xia, O. R. Zaiane: DBMiner: A system for mining knowledge in large relational databases. Proc. of int'l conf. on data mining and knowledge discovery (KDD'96), Portland, Oregon, pp. 250-255, August 1996.
[7] P. Hipson: OLE DB and ADO developer's guide. McGraw-Hill, 1999.
[8] R. Meo, G. Psaila, S. Ceri: An extension to SQL for mining association rules. Data Mining and Knowledge Discovery, Vol. 2, Number 2, pp. 195-224, 1998.
[9] Introduction to OLE DB for Data Mining, Microsoft Corporation.
[10] OLE DB for OLAP, Microsoft Corporation.
[11] R. Ramakrishnan and J. Gehrke: Database Management Systems (2nd edition), 1999.
[12] J. Shanmugasundaram, U. M. Fayyad and P. S. Bradley: Compressed data cubes for OLAP aggregate query approximation on continuous dimensions. Proc. 5th int'l conf. on knowledge discovery and data mining, pp. 223-232, 1999.
[13] S. Sarawagi, S. Thomas, R. Agrawal: Integrating mining with relational database systems: alternatives and implications. Proc. of SIGMOD Conference, pp. 343-354, 1998.
[14] S. Thomas, S. Sarawagi: Mining generalized association rules and sequential patterns using SQL queries. Proc. of the 4th int'l conference on knowledge discovery in databases and data mining, New York, August 1998.
[15] E. Thomsen, G. Spofford, D. Chase: Microsoft OLAP Solutions. John Wiley & Sons, 1999.