Integrating Data Mining with SQL Databases OLEDB for Data
Document Sample


Integrating Data Mining with SQL Databases: OLE DB for Data Mining
Amir Netz Surajit Chaudhuri Usama Fayyad1 Jeff Bernhardt
Microsoft Corporation
Abstract
The integration of data mining with traditional database in memory. In the last decade, there has been an increased
systems is key to making it convenient, easy to deploy in focus on scalability but research work in the community
real applications, and to growing its user base. In this has focused on scaling analysis over data that is stored in
paper we describe the new API for data mining proposed the file system. This tradition of performing data mining
by Microsoft as extensions to OLE DB standard. We outside the DBMS has led to a situation where starting a
illustrate the basic notions that motivated the API's design data mining project requires its own separate
and describe the key components of an OLE DB for Data environment. Data is dumped or sampled out of the
Mining provider. We also include examples of the usage database, and then a series of Perl, Awk, and special
and treat the problems of data representation and purpose programs are used for data preparation. This
integration with the SQL framework. We believe this new typically results in the familiar large trail of droppings in
API will go a long way in enabling deployment of data the file system, creating an entire new data management
mining in enterprise data-warehouses. A reference problem outside the database. In particular, such “export”
implementation of a provider is available with the recent creates nightmares of data consistency with data in
release of Microsoft SQL Server 2000 database system. warehouses. This is ironic because database systems were
created to solve exactly such data management problems.
Once the data is adequately prepared, the mining
1. Introduction algorithms are used to generate data mining models.
However, past work in data mining has barely addressed
Progress in database technology has provided the the problem of how to deploy models, e.g., once a model
foundation that made massive data warehouses of is generated, how to store, maintain, and refresh it as data
business data ubiquitous. Increasingly, such data in the warehouse is updated, how to programmatically use
warehouses are built using relational database technology. the model to do predictions on other data sets, and how to
Recently, there has been a tremendous commercial browse models. Over the life cycle of an enterprise
interest in mining the information in these data application, such deployment and management of models
warehouses. However, building mining applications on remains one of the most important tasks. The objective of
data over relational databases is nontrivial and requires proposing OLE DB DM is to alleviate problems in
significant work on the part of application builders. deployment of models and to ease the data preparation by
Therefore, what is needed is a strong system support to working directly on relational data.
ease the burden of developing mining applications against Another motivation to introduce OLE DB DM is to
relational databases. In this paper, we introduce the enable enterprise application developers to participate in
“OLE DB for Data Mining” (henceforth abbreviated as building data mining solutions. Easing the ability to
OLE DB DM in this paper) Application Programming develop integrated solutions is critical for the growth of
Interface (API). As we will demonstrate in the rest of this data mining technology in the enterprise space. In
paper, if a data source supports the OLE DB DM (i.e., if enterprises, data mining needs to be perceived as a value-
the data source is an OLE DB DM “provider”), then the added component along with traditional decision support
task of building mining applications is simplified because techniques, e.g., traditional SQL as well as OLAP
of the native support for DM primitives in OLE DB DM. querying environments [3]. In order to ensure success of
The case for supporting such an API is best understood data mining in this regard, it is important to recognize that
by reviewing the current state of the art. Traditionally, a typical enterprise application developer is hardly an
data mining tasks are performed assuming that the data is expert in statistics and pattern recognition. Rather, he or
1
Author’s current affiliation: digiMine, Inc.
she is comfortable using traditional database APIs/tools OLE DB API to make sure that support for mining
based on SQL, OLE DB, and other well-known standards concepts are built in.
and protocols. Therefore, it is extremely important that Figure 1 shows how Microsoft SQL Server database
the infrastructure for supporting data mining solutions is management system exposes OLE DB DM. In this
aligned with traditional database development specific implementation, the “core” relational engine
environment and with APIs for database access. exposes the traditional OLE DB interface. However, the
Consequently, we decided to exploit the well-known OLE analysis server component exposes an OLE DB DM
DB API to ensure that data mining models and operations interface (along with OLE DB for OLAP) so that data
gain the status of first-class objects in the mainstream mining solutions may be built on top of the analysis
database development environment. server. Of course, this is only one example of how OLE
Our proposal builds upon OLE DB [7], an application DB DM may be exposed. For example, in the future, data
programming interface that has been gaining popular use mining support may be deeply integrated in the “core”
for universal data access not only from relational systems relational database system.
but also from any data source that can be viewed as a “set Our approach in defining OLE DB DM is to maintain
of tables” (rowsets) as well. It is important to point out the SQL metaphor and to reuse many of the existing
that OLE DB for DM is independent of any particular notions whenever possible. Due to the widespread use of
provider or software and is meant to establish a uniform SQL in the developer community, we decided to expose
API. Another feature of our proposal is that it is not data mining interfaces in a language-based API. This
specialized to any specific mining model but is structured decision reflects a central design principle: building data
to cater to all well-known mining models. Any party mining solutions (applications) should be easy. Thus, our
interested in using this interface is encouraged to do so by design philosophy requires relational database
building its own provider, as with OLE DB (and ODBC). management systems to invest resources to provide the
However, we should also add that one of the strengths of “drivers” to support OLE DB DM so that data mining
our proposal is that we have implemented this API as part solutions may be developed with ease by data mining
of the recent release of Microsoft SQL Server 2000 application developers. Since there are likely to be many
(Analysis Server component). The proposal also more application developers than DBMS vendors, such a
incorporates feedback from a multitude of vendors in the design philosophy inflicts the least pain.
data mining space who participated in preview meetings The key challenge is to be able to support the basic
and calls for reviews over the past 18 months. operations of data mining without introducing much
Our proposal is not focused on any particular KDD change in programming model and environment that a
algorithms per se since our intent is not to propose new database developer is used to. Fundamentally, this
algorithms, but to suggest a system infrastructure that requires supporting operations on data mining models.
makes it possible to “plug in” any algorithm and enable it These key operations to support on a data mining model
to participate in the entire life-cycle of data mining. This are the ability to:
paper is our attempt to summarize the basic philosophy 1. Define a mining model, i.e., identify the set of
and design decisions leading to the current specification attributes of data to be predicted, the set of
of OLE DB DM. A complete specification may be attributes to be used for prediction, and the
obtained from [9]. algorithm used to build the mining model.
The rest of the paper is organized as follows. We first 2. Populate a mining model from training data using
introduce the overall architecture and design philosophy the algorithm specified in (1).
of OLE DB for Data Mining (OLE DB DM) in Section 2. 3. Predict attributes for new data using a mining
In Section 3, we discuss, with illustrative examples of model that has been populated.
their usage, the primitives and basic building blocks in the 4. Browse a mining model for reporting and
API. Section 4 presents related work and we conclude in visualization applications.
Section 5. In developing OLE DB DM, we have decided to
represent a data mining model (henceforth model or
DMM) as analogous to a table in SQL.2 Thus, it can be
2. Overview and Design Philosophy defined via the CREATE statement (used to create a table
OLE DB DM provides a set of extensions to OLE DB
to make it “mining-enabled.” OLE DB is an object- 2
Although viewed conceptually and programmatically as
oriented specification for a set of data access interfaces
tables, models have important differences with tables. A data
designed for record-oriented data stores. In particular, by mining model is populated by consuming a rowset but its own
virtue of supporting command strings, OLE DB can internal structure can be more abstract, e.g., a decision-tree is a
provide full support to SQL. In this paper, we will tree-like structure, much more compact than the training data set
primarily introduce extensions that leverage this aspect of used to create it.
in SQL). It can be populated, possibly repeatedly via the and visualization purposes. Fundamental operations
INSERT INTO statement (used to add rows in a SQL on models include CREATE, INSERT INTO,
table), and it can be emptied (reset) via the DELETE PREDICTION JOIN, SELECT, DELETE FROM,
statement. A mining model may be used to make and DROP.
predictions on datasets. This is modeled through
prediction join between a mining model and a data set, It should be noted that OLE DB DM has all the
just as the traditional join operation is used to pull traditional OLE DB interfaces. In particular, schema
together information from multiple tables. Finally, the rowsets specify the capabilities of an OLE DB DM
content of a mining model can be browsed via SELECT provider. This is the standard mechanism in OLE DB
(used to identify a subset of data in SQL). Since the whereby a provider describes information about itself to
mining model is a native named object at the server, the potential consumers. This information describes the
mining model can participate in interaction with other supported capabilities (e.g. prediction, segmentation,
objects using the primitives listed above (see also Section sequence analysis, etc.), types of data distributions
3). Note that this achieves the desirable goal of avoiding supported, limitations of the provider (size or time
excessive data movement, extraction, copying and thus limitations), support for incremental model maintenance,
resulting in better performance and manageability etc. Other schema rowsets provide metadata on the
compared to the traditional “mining outside the database columns of a mining model, on its contents, and the
server” framework. Finally, once the DMM is created and supported services. The details of these schema rowsets
optimized, deployment within the enterprise becomes as are beyond the scope of this paper.
easy as writing SQL queries. This significantly minimizes
the cost of deployment by eliminating the need for The rest of this section is organized as follows. In
developing a proprietary infrastructure and interface to Section 3.1, we describe how input data is modeled as a
deploy the data mining solution. Model deployment and caseset. In Section 3.2, we describe how data mining
maintenance costs have received little attention in the models may be defined (created) in a relational
literature, though they are an expensive and difficult task framework and the kind of meta-information on casesets
in real applications. In the next section, we review the that mining needs to be aware of. Section 3.3 describes
details of the language extensions that provide an operations on mining models, notably insertion
extensible framework for incorporating support for data (populating the model), prediction, and browsing.
mining.
3. Basic Components of OLE DB DM 3.1. Data as Cases
To ensure that a data mining model is accurate and
OLE DB DM has only two notions beyond traditional meaningful, data mining algorithms require that all
OLE DB definition: cases and models. information related to the entity to be mined be
• Input data is in the form of a set of cases (caseset). A consolidated before invoking the algorithms.
case captures the traditional view of an “observation” Traditionally, mining algorithms also view the input as a
by machine learning algorithms as consisting of all single table that represents such consolidated information
information known about a basic entity being where each row represents an instance of an entity. Data
analyzed for mining. Of course, structurally, a set of mining algorithms are designed so that they consume an
cases is no different from a table.3 Cases are entity instance at a time. Having all the information
explained further in Section 3.1. related to a single entity instance together in a rowset, has
• Data mining model (DMM) is treated as if it were a two important benefits. First, traditional data mining
special type of “table:” (a) We associate a caseset algorithms can be leveraged with relative ease. Next, it
with a DMM and additional meta-information while increases scalability as it eliminates the need for data
creating (defining) a DMM. (b) When data (in the mining algorithms to do considerable bookkeeping.
form of cases) is inserted into the data mining model, In a relational database, data is often normalized and
a mining algorithm processes it and the resulting hence the information related to an entity scattered in
abstraction (or DMM) is saved instead of the data multiple tables. Therefore, in order to use data mining, a
itself. Once a DMM is populated, it can be used for key step is to be able to pull the information related to an
prediction, or its content can be browsed for reporting entity into a single rowset using views. Furthermore, as is
well recognized, data is often hierarchical in nature and so
ideally we require the view definition language to be
3 powerful enough to be able to produce a hierarchical
However, like many other data sets, cases are more naturally
represented as hierarchical data, i.e., nested tables are more rowset. This collection of data pertaining to a single entity
natural. is called a case, and the set of all relevant cases is referred
to as a caseset. As mentioned above, a caseset may need
Customer Gender Hair Color Age Age Product Purchases Car_Ownership
ID Prob. Name Quantity Type Car Prob.
1 Male Black 35 100% TV 1 Electronic Truck 100%
VCR 1 Electronic Van 50%
Ham 2 Food
Beer 6 Beverage
Table 1: A nested table representation of the 3-way join result
to be hierarchical. In the rest of this section, we elaborate Nested table is a familiar notion in the relational
on these two issues through examples and further database world, though not adopted universally in
discussions. commercial systems. In OLE DB DM, we have chosen to
Consider a schema consisting of 3 tables representing use the Data Shaping Service5 to generate a hierarchical
customers. The Customers table has some customer rowset. It should be emphasized that it is a logical
properties. The Product Purchases table stores for each definition and does not have any implications for data
customer what products they bought. The Car Ownership storage. In other words, it is not necessary for the storage
table lists the cars owned by each customer.4 Consider a subsystem to support nested records. Cases are only
customer, say Customer ID 1, who happens to be a male, instantiated as nested rowsets prior to training/predicting
has black hair, and believed to be 35 years old with 100% data mining models. Note that the same physical data may
certainty. Assume this customer has bought a TV, a VCR, be used to generate different casesets for different
Beer (quantity 6), and Ham (quantity 2). Further assume analysis purposes. For example, if one chooses to mine
we know this customer owns a truck (100%) and we models over all products, each product then becomes a
believe he also has a van (50% certainty). Suppose one single case and customers would need to be represented
were to issue a join query over the 3 tables to get all the as columns of the case.
information we know about the set of customers. Readers Once a hierarchical caseset has been prepared using
familiar with SQL will quickly realize that this join query the traditional view definition mechanism and the Data
will return a table of 12 rows. Furthermore, these 12 rows Shaping Service in OLE DB, we can use this caseset to
contain lots of replication. More importantly, since the populate a data mining model or we can predict values of
information about an entity instance is scattered among unknown columns using a pre-existing model. In order to
multiple rows, the quality of output from data mining do that, we need a way to represent data mining models.
algorithms is negatively impacted by such flattened We illustrate this in the next section.
representation.
Table 1 shows how a nested table can succinctly 3.2. Creating/Defining Data Mining Models
describe the above join result. As illustrated in Table 1, a In OLE DB DM, a data mining model (DMM) is
customer case can include not only simple columns but treated as a first class object of the provider, much like a
also multiple tables (nested tables as columns). Each of table. In this section, our focus is on definition (creation)
these tables inside the case can have a variable number of of data mining models. In particular, the DMM definition
rows and a different number of columns. Some of the includes a description of the columns of data over which
columns in the example have a direct one-to-one the model will be created, including meta-information on
relationship with the case (such as “Gender” and “Age”), relationships between columns. It is assumed that a data
while others have a one-to-many relationship with the set is represented as a nested table as discussed in Section
case and therefore exist in nested tables. The reader can 3.1. In Section 3.3, we will describe how other operations
easily identify the two tables (nested) contained in the on data mining models may be supported in OLE DB
sample case in Table 1. DM.
• Product Purchases table containing the columns As discussed earlier, a data mining model is trained
Product Name, Product Quantity, and Product over a single caseset and is used to predict columns of a
Type. caseset. Therefore, defining (creating) a data mining
• Car Ownership table containing the columns Cars model requires us to specify:
and Car Probability. • The name of the model.
4 5
The columns Age Probability and Car_Ownership.Probability The Data Shaping Service is part of the Microsoft Data Access
are unusual in that they represent degree of certainty. Components (MDAC) products.
• The algorithm (and parameters) used to build the The above details are discussed in the rest of this
model. subsection and we conclude this section with a
• The algorithm for prediction using the model. comprehensive example. Finally, note that despite the
• The columns the caseset must have and the similarities with CREATE TABLE with respect to
relationships among these columns. programmability, mining models are conceptually
• Identifying columns to be used as “source” columns different as they represent only summarized information
and the columns to be filled by the mining model for a data set. Also, unlike traditional data tables, they are
(“prediction columns”). capable of prediction (See Section 3.3).
The following example illustrates the syntax: 3.2.1 Content Types of Columns – Column Specifiers
Each column in a case can represent one of the
CREATE MINING MODEL [Age Prediction] ( following content types:
%Name of Model KEY: the columns that identify a row. For example,
[Customer ID] LONG KEY, “Customer ID” uniquely identifies customer cases,
[Gender] TEXT DISCRETE, “Product Name” uniquely identifies a row in “Product
[Age] DOUBLE DISCRETIZED PREDICT, Purchases” table.
%prediction column ATTRIBUTE: A direct attribute of the case. This type
[Product Purchases] TABLE(
of column represents some value for the case. Example
[Product Name] TEXT KEY,
[Quantity] DOUBLE NORMAL CONTINUOUS, attributes are: the age, gender, or hair-color of the
[Product Type] TEXT DISCRETE customer or the quantity of a specific product the
RELATED TO [Product Name] )) customer purchased.
USING [Decision_Trees_101] RELATION: Information used to classify attributes,
%Mining Algorithm used other relations, or key columns. For example, “Product
Type” classifies “Product Name.” A given relation value
The annotations with the example identify the source must always be consistent for all of the instance values of
columns used for prediction, the column to be predicted the other columns it describes—for example, the product
(age) and the mining algorithm used. Specifically, the “Ham” must always be shown as “Food” for all cases. In
DMM is named [Age Prediction] (the square brackets [ ] the CREATE MINING MODEL command syntax,
are name delimiters by convention). The PREDICT relations are identified in the column definition by using a
keyword indicates that the Age column is to be predicted. RELATED TO clause to indicate the column being
The [Product Purchases] is a nested table, with [Product classified.
Name] being its key. The USING clause specifies the QUALIFIER: A special value associated with an
mining algorithm to use. Parameters to the algorithm attribute that has a predefined meaning for the provider.
could be included as part of this clause. When we contrast Take for example the probability that the attribute value is
this example with the CREATE TABLE syntax in SQL, a correct. These qualifiers are all optional and apply only if
key difference is the use of substantially more meta the data has uncertainties attached to it or if the output of
information that adorns a column reference above. previous predictions is being chained as input to a
The model columns describe all of the information subsequent DMM training step. In the CREATE MINING
about a specific case. For example, assume that each case MODEL command syntax, modifiers are identified by
in the DMM represents a customer. The columns of the using an OF clause to indicate the attribute column they
DMM will include all known and desired information modify. There are many qualifiers. We give a few
about the customer (e.g. Table 1). Several kinds of examples here:
information may be specified for columns: (a) PROBABILITY: the probability [0,1] of the
• Information about the role of a column (Section associated value.
3.2.1): Is it a key? Does it contain an attribute? (b) VARIANCE: A number that describes the
How does it relate to other columns or nested variance of the value of an attribute.
tables? Is it a qualifier on other columns? (c) SUPPORT: A float that represents a weight
• For columns that represent attributes, what type of (case replication factor) to be associated with the
attribute is it? Attribute types are described in value.
Section 3.2.2. (d) PROBABILITY_VARIANCE: The variance
• Additional information about data distributions associated with the probability estimator used for
that can be provided as hints to the DMM (Section PROBABILITY.
3.2.3). (e) ORDER: Specifies the order of a column. (See
ORDERED below.)
(f) TABLE: A nested table in the case consists of a could be modeled as binary where the significant
special column with the data type TABLE. For any information is whether the age is known or not.
given case row, the value of a TABLE type column
contains the entire contents of the associated nested 3.2.4 Prediction Columns
table. The value of a TABLE type column is in Output columns are columns in a dataset that the
itself a table containing all of the columns for the DMM is capable of predicting. Because these columns
nested table. In the CREATE MINING MODEL will be predicted over cases in which their values are
command syntax, nested tables are described by a missing or unspecified, additional statistical information
set of columns, all of which are contained within may be associated with such predictions. We discuss this
the definition of a named TABLE type column. additional information here. Section 3.2.3 presents the
syntax of such prediction columns.
3.2.2 Content Types of Columns – Attribute Types Attribute or Table type columns can be input columns,
For a column that is specified as an attribute, it is output columns, or both. The data mining provider builds
possible to further indicate the type of attribute. The a DMM capable of predicting or explaining output
following keywords specify the attribute type to the DM column values based on the values of the input columns.
provider. Predictions may convey not only simple information such
• DISCRETE: The attribute values are discrete as “estimated age is 21;” but they may also convey
(categorical) with no ordering implied by the values additional statistical information associated with the
(often called states). “Area Code” is a good example. prediction such as confidence level and standard
• ORDERED: Columns that define an ordered set of deviation. Further, the prediction may actually be a
values with a total ordering but no distance or collection of predictions, such as “the set of products that
magnitude semantics are implied. A ranking of skill the customer is likely to buy.” Each of the predictions in
level (say one through five) is an ordered set, but a the collection may also include a set of statistics.
skill level of five isn't necessarily five times better A prediction can be expressed as a histogram. A
than a skill level of one. histogram provides multiple possible prediction values,
• CYCLICAL: A set of values that have cyclical each accompanied by a probability and other statistics.
ordering. Day of the week is a good example, since When histogram information is required, each prediction
day number one follows day number seven. (which by itself can be part of a collection of predictions)
• CONTINOUS: Attributes with numeric values may have a collection of possible values that constitutes a
(integer of float). Values are naturally ordered and histogram.
have implicit distance and magnitude semantics. Since there is a wealth of information we can associate
Salary is a typical example. with predictions, it is often necessary to extract only a
• DISCRETIZED: The data that will be inserted into portion of the predicted information. For example, a user
the model is continuous, but it should be transformed may want to see only the “best estimate,” “top 3
into and modeled as a number of ORDERED states estimates,” or “estimates with probability greater than
by the provider. Some data mining algorithms cannot 55%.” Neither every provider nor every DMM can
accept CONTINUOUS attributes as input, or they support all of the possible requests. Therefore, it is
may not be able to predict CONTINUOUS values. necessary for the output column to define whatever
• SEQUENCE_TIME: A time measurement range. The information may be extracted out of it.
format is not restricted, e.g., a period number is OLE DB DM defines a set of standard transformation
acceptable. This is typically used to associate a functions on output columns. These functions are
sequence time with individual attribute values such as discussed in detail in the OLE DB DM specification
purchase time. document [9]. The basic mechanism we used to
accommodate the flexibility in output is to draw on the
3.2.3 Distribution Hints familiar notion of user-defined functions (UDF) used in
An attribute, continuous or discrete, may have an OLAP [10,15]. Each provider ships a set of functions that
associated distribution. These distributions are used as can be referenced in the prediction query. Some UDF’s
hints to the DMM and can specify prior knowledge about are scalar-valued, such as probability, or support. Others
the data. For example, a continuous attribute can be have tables as values, such as histogram and hence return
normal (Gaussian), log normal, or uniform. A discrete can nested tables when invoked.
be binomial, multinomial, Poisson, mixture. Other hints
could include: Not_Null which indicates there should 3.2.5 Detailed Example of DMM Column Specification
never be a null value, model existence only which means Now that we have described what the column models
the information of interest is not in the value of an are, it may be useful to provide an example. Column
attribute, but in the fact that a value is present. Hence Age descriptions allow the provider to better model the
training data it is given. We now return to our running between a data mining model and the given (test) data set.
example and indicate how content types of some of the Prediction join between the cases from the test data test
columns in the example can now be classified as shown with the DMM and selecting the “Age” column from the
below (content type in italics): DMM returns a predicted “Age” for each of the test
(a) Gender: discrete attribute cases.
(b) Age: continuous attribute We need to use “prediction join” since the process of
(c) Age Probability: probability modifier of age combining actual cases with all possible model cases is
(d) Customer Loyalty: ordered attribute not as simple as the semantics of a normal SQL equi-join.
(e) Product Name: Key (for Product Purchases For the instance when the schema of the actual case table
table) matches the schema of the model, NATURAL
(f) Product Type: Relation of Product Name PREDICTION JOIN can be used, obviating the need for
the ON clause of the join. Columns from the source query
The example above includes a few additional columns will be matched to columns from the DMM based on the
to illustrate a few more content types. names of the columns.
3.3. Operations on Data Model
SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
Populating a Mining Model: Insert Into PREDICTION JOIN (SHAPE {
Once a mining model is defined, the next step is to SELECT [Customer ID], [Gender], FROM Customers
populate a mining model by consuming a caseset that ORDER BY [Customer ID]}
satisfies the specification in the Create Mining Model APPEND ({SELECT [CustID], [Product Name], [Quantity]
statement. In OLE DB DM, we use INSERT to model FROM Sales
instantiation of the mining model. As against a traditional ORDER BY [CustID]}
table, insertion does not result in addition of the tuple to RELATE [Customer ID] To [CustID])
the rowset. Rather, the insertion corresponds to AS [Product Purchases]) as t
ON [Age Prediction].Gender = t.Gender and
consuming the observation represented by a case using [Age Prediction].[Product Purchases].[Product Name]
the data mining model. The following example illustrates = t.[Product Purchases].[Product Name] and
the syntax. Note the use of the SHAPE statement to create [Age Prediction].[Product Purchases].[Quantity]
the nested table from the input data: = t.[Product Purchases].[Quantity]
INSERT INTO [Age Prediction] The DMM content can be thought of as a “truth table”
([Customer ID], [Gender], [Age], containing a row for every possible combination of the
[Product Purchases]([Product Name], [Quantity], [Product distinct values for each column in the DMM. In other
Type])) SHAPE words, it contains every possible case along with every
{SELECT [Customer ID], [Gender], [Age] possible associated set of output columns. With this view
FROM Customers
ORDER BY [Customer ID]}
in mind, a DMM can be used to look up predicted values
APPEND ( and statistics by using the attribute values of a case as a
{SELECT [CustID], [Product Name], [Quantity], [Product key for the join. It is important to point out that this
Type] FROM Sales ORDER BY [CustID]} analogy is not quite accurate as the set of all possible
RELATE [Customer ID] To [CustID]) cases (combinations of attribute values) is huge, and
AS [Product Purchases] possibly infinite when one considers continuous
attributes. However, this logical view allows us to map
prediction into a familiar basic operation in the relational
Using Data Model to Predict: Prediction Join world.
We now explain how predictions can be obtained from When a model is joined with a table, predicted values
models in OLE DB DM that has been populated. The are obtained for each case that “matches” the model (i.e.
basic operation of obtaining prediction on a dataset D cases for which the model is capable of making a
using a DMM M is modeled as a “prediction join” prediction). A SELECT statement associated with the
between D and M. Of course, the caseset needs to match PREDICTION JOIN specifies which predicted values
the schema of the DMM. Since the model does not should be extracted and returned in the result set of the
actually contain data details, the semantics of the query. User-defined functions (UDF) in the projection
Prediction Join are different than those of a equi-join on columns are used to further enhance the set of predicted
tables. The following example illustrates how prediction values with additional statistical information. We have
join may be used. In this example, the value for “Age” is discussed this aspect in Section 3.2.4.
not known and is being predicted using the prediction join
Browsing Model Content: Select tools provide application interfaces and UI to support
In addition to listing the column structure of a DMM, a mining and allow access to data in the warehouse.
different type of browsing is to navigate the content of the However, such tools do not provide the ability to deal
model viewed as a graph. Using a set of input cases, the with arbitrary mining models and integration of the
content of a DMM is learned by the data mining mining application interfaces with SQL as well as
algorithm. The content of a DMM is the set of rules, relational data access APIs (ODBC, OLE DB).
formulas, classifications, distributions, nodes, or any other The paper by Meo et al. [8] is an example of
information that was derived from a specific set of data specialized extensions to SQL to provide support for
using a data mining technique. Content type varies association rules. Such an approach is narrowly focused
according to the specific data mining technique used to and does not provide an extensible framework for
create the DMM. The DMM content of a decision tree– supporting models derived from other mining techniques,
based classification will differ from a segmentation e.g., clustering and classification. Moreover, no support is
model, which, in turn, differs from a multi-regression provided for management of mining models and for using
DMM. such models subsequently in other contexts.
The most popular way to express DMM content is by More recently, CRISP-DM [4] (http://www.crisp-
viewing it as a directed graph (that is, a set of nodes with dm.org/) has been proposed as a standard for Data Mining
connecting edges). Note that decision trees fit such a view process model. This initiative is complementary to OLE
nicely. Each node in the tree may have relationships to DB DM. CRISP-DM provides a set of methodologies,
other nodes. A node may have one or more parent nodes best practices, and attempts to define the various activities
and zero or more child nodes. The depth of the graph may involved in the KDD process. It does not address issues of
vary depending on the specific node. integration with the database system.
Tree navigation we adopted is similar to the already A related effort, called Predictive Model Markup
defined mechanism in the OLE DB for OLAP Language (PMML), provides an open standard for how
specification [10]. On querying (SELECT * FROM models should be persisted in XML. The basic idea of
<mining model>.CONTENT), the content of a mining PMML is to enable portability and model sharing [5].
model is exposed through the PMML does not address database integration issue but
MINING_MODEL_CONTENT schema rowset. specifies a persistence format for models. We are
Unfortunately, a detailed discussion of currently working with the PMML group to use PMML
MINING_MODEL_CONTENT schema rowset is beyond format as an open persistence format. In fact, in the model
the scope of this paper (see Appendix A of [9]). browsing methods, briefly discussed in Section 3.3.3, we
use PMML inspired XML strings in exposing the content
of data mining model nodes.
4. Related Work
As mentioned earlier, OLE DB DM is an extension to
OLE DB, a specification of data access interfaces. OLE
The focus of OLE DB DM is not about new KDD
DB generalized the widely used ODBC connectivity
algorithms. Therefore, we do not discuss here the wide
standard by enabling access to record-oriented data stores
body of literature on KDD algorithms and scalable
that are not necessarily SQL data sources.
techniques for such algorithms.
Compared to the wide body of work on KDD
algorithms as well as the more recent work on scalable 5. Concluding Remarks
algorithms, there has been relatively much less work on
the problem of integration of data mining with relational We presented the highlights of OLE DB DM, a new
databases. API for integrating data mining into SQL databases. A
The work by Sarawagi et al. [13] address the problem reference OLE DB DM provider is scheduled to ship with
of how to implement derivation of association rules the next release of a commercial database system:
efficiently using SQL. Although significant from the Microsoft SQL Server 2000.
performance implications, this direction of work is The target user base of OLE DB DM is the community
orthogonal to the issue of how to enable, deploy, and of enterprise developers who are familiar with SQL and
expose data mining models and algorithms as first class database connectivity APIs such as ODBC and OLE DB.
objects in the database application API. A similar effort For this reason, a key design goal was to keep the set of
relating to generalized association rules and sequences is extensions as close to SQL’s look and feel as possible.
also focused on implementing particular algorithms in OLE DB DM introduces data mining objects and entities
SQL [14]. by drawing straightforward analogies to familiar
Research projects such as Quest relational objects. We have only introduced two new
(http://www.almaden.ibm.com/cs/quest) [1] and DBMiner concepts: Data Mining Models (DMM) and Cases.
[6] (http://db.cs.sfu.ca) as well as commercial mining
We hope that OLE DB DM will help steer our [10] OLE DB for OLAP, Microsoft Corporation,
community to establish a standard in the field. The http://www.microsoft.com/data/oledb
advantages of such a standard include integration with the [11] R. Ramakrishnan and J. Gehrke. Principles of Database
database so that data mining no longer needs to involve Management (2nd Edition), 1999.
[12] J. Shanmugusundaram, U. M. Fayyad and P. S. Bradley.
taking data outside the database, addressing issues of Compressed data cubes for OLAP aggregate query
model deployment once models are created, and approximation on continuous dimensions. In Proc. 5th intl.
providing an infrastructure for managing data and models conf. on knowledge discovery and data mining, pp. 223-
while leveraging the existing SQL infrastructure. 232, 1999.
Establishing standards also reduces the costs, increases [13] Sunita Sarawagi, Shiby Thomas, Rakesh Agrawal:
the ability to interoperate and partner, and reduces the Integrating mining with relational database systems:
risks for both providers and consumers. Consumers are alternatives and implications. Proc. of SIGMOD
protected since they are no longer dependent on a Conference, pp. 343-354, 1998.
proprietary interface so other vendors can step in to [14] S. Thomas, S. Sarawagi Mining generalized association
rules and sequential patterns using SQL queries. Proc. of
provide replacements and enhancements. Providers are the 4th Int'l Conference on Knowledge Discovery in
protected from the need to build infrastructure from Databases and Data Mining, New York, August 1998.
scratch as partners can step in to build different [15] E. Thomsen, G, Spofford, D. Chase, Microsoft OLAP
complementary components. To move towards a standard, solutions, Wiley, John & Sons, 1999.
an open forum such as standards committee elected by the
community will be needed. We hope that this proposal
will encourage further discussions along this direction.
Acknowledgement: We are grateful to Venkatesh
Ganti for his thoughtful comments on the draft.
6. References
[1] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer,
R. Srikant: The Quest data mining system, Proc. of the 2nd
int'l conference on knowledge discovery in databases and
data mining, August, 1996.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A. I.
Verkamo: Fast discovery of association rules. In Advances
in knowledge discovery and data mining, pp. 307-328,
AAAI Press, Menlo Park, CA, 1996.
[3] S. Chaudhuri and U. Dayal: An overview of data
warehousing and OLAP technology. SIGMOD Record
26(1), pp. 65-74, 1997.
[4] P. Chapman, R. Kerber, J. Clinton, T. Khabaza, T.
Reinartz, R. Wirth: The CRISP-DM process model,
Technical Report. http://www.crisp-dm.org/ 1999.
[5] DMG Organization, PMML 1.0-Predictive Model Markup
Languange, http://www.dmg.org/public/techreports/pmml-
1_0.html, 1999.
[6] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski,
D. Li, Y. Lu, A. Rajan, N. Stefanovic, B. Xia, O. R.
Zaiane: DBMiner: A system for mining Knowledge in
Large Relational Databases. Proc. of int'l conf. on data
mining and knowledge discovery (KDD'96), Portland,
Oregon, pp. 250-255, August 1996.
[7] P. Hipson: OLE DB and ADO developer's guide, McGraw-
Hill; 1999.
[8] Rosa Meo, Giuseppe Psaila, Stefano Ceri: An extension to
SQL for mining association rules. 195-224, Data Mining
and Knowledge Discovery, Vol2, Number 2, 1998.
[9] Introduction to OLE DB for Data Mining, Microsoft
Corporation, http://www.microsoft.com/data/oledb
Related docs
Other docs by vsm12170
Sample Multiple Choice Questions for a First Midterm- CSCI 570 Oracle - Download as PDF
Views: 17 | Downloads: 0
Get documents about "