                      DATA WAREHOUSE & DATA MINING
1. Introduction.

2. Data Warehouse.

3. Specific Rules for the Basic Structure of a Warehouse.

4. Data Marts.

5. Data Warehouse Backend Process.

6. Advantages.

7. Data Mining: Definition.

8. Data Mining: Process.

9. Data Mining Architecture.

10. Data Mining: Knowledge Discovery Process (KDD).

11. Partition Algorithm to Discover All Frequent Sets from the Data Warehouse by Using Data Mining.

12. Conclusion.

                                          ABSTRACT

    Recent developments in information systems technologies have resulted in the computerization
of many applications in various business areas. Data has become a critical resource in many
organizations, and therefore efficient access to data, sharing the data, extracting information from
the data, and making use of the information have become urgent needs. As a result, there have
been many efforts not only at integrating the various data sources scattered across several sites,
but also at extracting information from these databases in the form of patterns and trends. These
data sources may be databases managed by Database Management Systems, or they could be
data warehoused in a repository built from multiple data sources.
    In this way any organization can become a learning system. Discovering associations
between items in a large database is one such data mining activity. In finding associations,
support is used as an indicator of whether an association is interesting. In this paper, we
discuss two user-specified constraints that every rule must satisfy, show how the problem of
mining association rules can be decomposed into two subproblems, and examine the important
properties of frequent sets, maximal frequent sets, and border sets. We discuss algorithms that
find all associations with minimum I/O operations and are, at the same time, computationally
efficient.

    We also discuss the concepts involved in Data Warehouse and Data Mining, including
metadata, which is data about business data. The concept of data marts and their usefulness in
decision making is also discussed, along with applications of the Data Warehouse and its merits
and demerits.

INTRODUCTION

    Recent developments in information systems technologies have resulted in the computerization
of many applications in various business areas. Data has become a critical resource in many
organizations, and therefore efficient access to data, sharing the data, extracting information from
the data, and making use of the information have become urgent needs. As a result, there have
been many efforts not only at integrating the various data sources scattered across several sites,
but also at extracting information from these databases in the form of patterns and trends. These
data sources may be databases managed by Database Management Systems, or they could be
data warehoused in a repository built from multiple data sources.

    Data mining is the process of extracting information and patterns, often previously unknown,
from large quantities of data, using various techniques from areas such as Machine Learning,
Pattern Recognition, and Statistics. Data could be in files, relational databases, or other types
of databases such as multimedia databases. Data may be structured or unstructured.

    In this way any organization can become a learning system. Discovering associations
between items in a large database is one such data mining activity. In finding associations,
support is used as an indicator of whether an association is interesting. In this paper, we
discuss two user-specified constraints that every rule must satisfy, show how the problem of
mining association rules can be decomposed into two subproblems, and examine the important
properties of frequent sets, maximal frequent sets, and border sets. We discuss algorithms that
find all associations with minimum I/O operations and are, at the same time, computationally
efficient.


  DATA WAREHOUSE & DATA MINING
   Every day organizations, both large and small, generate billions of bytes of data related to all
aspects of their business. But because it is locked up in a variety of systems, most of this data is
extremely difficult to access. Only a very small part of the data that is captured, processed, and
stored is available to decision makers.

    A large amount of the right information is the key to survival in today’s competitive
environment. And this kind of information can be made available only if there’s a totally integrated
enterprise data warehouse.

DATA WAREHOUSE

    A data warehouse is a repository of integrated information, available for queries and analysis.
For such a repository, data and information are extracted from heterogeneous sources and
consolidated in a single source. This makes it much easier and more efficient to query the data.

    There are two fundamentally different types of information systems in enterprises: operational
systems and informational systems.

    Operational systems help us run daily enterprise operations, for example ERP (Enterprise
Resource Planning) systems. Informational systems analyze data and help make decisions on
how the enterprise will operate. Not only do informational systems have a different focus from
operational ones, they often have a different scope altogether.

There are some specific rules that govern the basic structure of a warehouse, namely that
such a structure should be:

    1. Time dependent: that is, containing information collected over time, which implies there
       must always be a connection between the information in the warehouse and the time
       when it was entered. This is one of the most important aspects of the warehouse as it
       relates to data mining, because information can then be stored according to period.
    2. Non-volatile: that is, data in a data warehouse is never updated but used only for
       queries. Such data can only be loaded from other databases, such as the
       operational database. End users who want to update data must use the operational
       database, as only the latter can be updated, changed, or deleted. This means that a data
       warehouse will always be filled with historical data.
    3. Subject oriented: that is, built around all the existing applications of the operational data.
       Not all the information in the operational database is useful for a data warehouse, since
       the data warehouse is designed specifically for decision support while the operational
       database contains information for day-to-day operations.
    4. Integrated: that is, it reflects the business information of the organization. In an
       operational data environment you will find many types of information being used in a
       variety of applications, and some applications will use different names for the same
       entities. However, in a data warehouse it is essential to integrate this information and
       make it consistent; only one name must exist to describe each individual entity.
    5. Designed for decision support: a data warehouse is built especially for decision support
       queries; therefore only data that is needed for decision support is extracted from the
       operational data and stored in the warehouse.
    DATA MARTS
         Data marts are partitions of the overall data warehouse and may contain some overlapping
    data. The task of implementing a data warehouse can be a very big effort, taking a significant
    amount of time. One feasible option is to start with a set of data marts, one for each component
    department. A set of smaller, manageable databases of this kind is called a set of data marts;
    a data mart can be either stand-alone or dependent.

        Stand-alone data mart: a data mart with minimal or no impact on the enterprise operational
    database.

       Dependent data mart: similar to the stand-alone data mart, except that management of the
    data sources by the enterprise database is required. These data sources include operational
    databases and external sources of data.

    DATA WAREHOUSE BACKEND PROCESS

    Data Extraction: gathers data from multiple, heterogeneous, external sources.

    Data Cleaning: detects errors in the data and rectifies them when possible.

    Data Transformation: converts data from legacy or host format to warehouse format.

    Loading: sorts, summarizes, consolidates, computes views, checks integrity, and builds indices
    and partitions.

    Refresh: propagates the updates from the data sources to the warehouse.
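
    As a concrete illustration of this back-end pipeline, the following Python sketch (using the
    pandas library) walks through the extract-clean-transform-load sequence in miniature. The file
    names, column names, and source systems are invented for illustration only:

import pandas as pd

# Extraction: gather data from multiple, heterogeneous, external sources
# (here, two hypothetical CSV exports from operational systems).
orders = pd.read_csv("orders_export.csv")
customers = pd.read_csv("crm_export.csv")

# Cleaning: detect errors in the data and rectify them when possible.
orders = orders.drop_duplicates()
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")
orders = orders.dropna(subset=["amount"])

# Transformation: convert from the source (legacy/host) format to the
# warehouse format, and join the sources into a single fact table.
orders["order_date"] = pd.to_datetime(orders["order_date"])
fact = orders.merge(customers, on="customer_id", how="left")

# Loading: sort, summarize, and consolidate into a table the warehouse serves.
monthly = (fact.assign(month=fact["order_date"].dt.strftime("%Y-%m"))
               .groupby(["region", "month"])["amount"].sum()
               .reset_index()
               .sort_values(["region", "month"]))
monthly.to_csv("warehouse/monthly_sales.csv", index=False)

# Refresh: re-running this pipeline propagates source updates to the warehouse.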

    ADVANTAGES

           Because data warehouses are free from the restrictions of the transactional environment,
    there is increased efficiency in query processing.
           Artificial intelligence techniques, which may include genetic algorithms and neural
    networks, are used for classification and are employed to discover knowledge from the data
    warehouse that may be unexpected or difficult to specify in queries.


                  DATA MINING: A Definition
    Data mining is an iterative process of discovery. It seeks to find new patterns hidden in the
    data stored in the large databases common to major firms. These databases, containing
    operational, marketing, and customer data, form an untapped resource. Data mining is the
    procedure to unlock and exploit the patterns in those databases. Data mining has two principal
    activities: finding patterns in data and describing those patterns clearly. Successful data mining
    provides insights into your data, explains your data, and enables you to make profitable
    predictions from it. The patterns that data mining discovers can have various forms:

• Trends in data over time

• Clusters of data defined by important combinations of variables

• Evolution of these clusters over time

A typical example of data mining is a marketing database of direct mailing campaigns. Data
mining can isolate where mailings have succeeded and failed.

DATA MINING: Process

Data I/O
Reading and writing data is a critical capability in data mining, and a key issue is the set of
supported formats. Most data actually still originates in text files; studies show as much as 50%
of data mining is performed on text sources. Commonly supported formats include:

• Value separated files. This includes common formats like comma separated values (CSV) and
tab separated values (TSV) as well as the use of user-defined delimiters.
• Fixed-format files. These allow you to read data from a file containing fixed-width columns.
• Proprietary vendor formats. This includes formats like Excel and statistical environments like
S-PLUS & SAS, as well as older formats like Lotus & Quattro.

• Database. The most common formats are Oracle, DB2, Microsoft SQL Server and Sybase.
Most tools supply ODBC access, so any DBMS supporting ODBC can be accessed as well.
Some tools also provide so-called "native" drivers to the more common databases. This is often
deemed to be an advantage because native drivers provide more direct access to the database
(more efficient).
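
To make the list concrete, here is a short Python sketch using the pandas library; all file and
table names are hypothetical, and SQLite stands in for a commercial DBMS reachable through a
native or ODBC driver:

import pandas as pd
import sqlite3

# Value-separated files: comma, tab, or a user-defined delimiter.
csv_data  = pd.read_csv("sales.csv")
tsv_data  = pd.read_csv("sales.tsv", sep="\t")
pipe_data = pd.read_csv("sales.dat", sep="|")

# Fixed-format files: each column occupies a fixed character width.
fixed = pd.read_fwf("legacy_report.txt", widths=[10, 8, 12],
                    names=["account", "branch", "balance"])

# Proprietary vendor formats, e.g. an Excel workbook.
budget = pd.read_excel("budget.xlsx", sheet_name="Q1")

# Databases: issue SQL through a driver connection.
conn = sqlite3.connect("warehouse.db")
db_data = pd.read_sql("SELECT * FROM transactions", conn)
conn.close()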
Data exploration
The main purpose of the exploration phase of data mining is to provide you with a high-level
understanding of the structure of your data. For example, charts and descriptive statistics are
helpful for examining and visualizing the distribution of your data, as they reveal variables in your
data that are nearly constant or variables that have large numbers of missing values. Correlations
are helpful for determining whether two variables in your data are related, while cross tabulations
can determine the distribution of data among the levels in categorical variables. Tables are useful
for viewing both the values and the data types for each column. Finally, you can compare
columns, rows, or even cells in two inputs, helpful when you are looking for small- or large-scale
differences. Most data mining environments include the following basic functionality:
• Chart 1-D. Basic one-dimensional graphs of the variables in a data set. Common chart types
are pie charts, bar charts, column charts, dot charts, histograms, and box plots.
• Correlations. Computes correlations and covariances for pairs of variables in a data set.
• Cross-tabulate. This produces tables of counts for various combinations of levels in categorical
variables.
• Descriptive Statistics. This computes basic descriptive statistics for the variables in a data set
and displays them with one-dimensional charts.
• Table View. This displays data in a tabular format, allowing you to see both the data values and
the data types (continuous, categorical, string, or date).
• Compare. This creates an absolute, relative, or Boolean comparison of each column, row, or
cell for two inputs.
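
A rough Python equivalent of these exploration steps, again with pandas and invented input
files, might look like this:

import pandas as pd

df = pd.read_csv("customers.csv")          # hypothetical data set

# Descriptive statistics: near-constant variables and missing values stand out.
print(df.describe(include="all"))
print(df.isna().sum())

# Correlations and covariances for pairs of numeric variables.
print(df.corr(numeric_only=True))
print(df.cov(numeric_only=True))

# Cross-tabulation: counts for combinations of categorical levels.
print(pd.crosstab(df["region"], df["segment"]))

# Table view: data values together with their inferred types.
print(df.head())
print(df.dtypes)

# Compare: cell-by-cell differences between two identically labeled inputs.
previous = pd.read_csv("customers_last_month.csv")
print(df.compare(previous))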

Data cleaning
Most data, especially business data, is notoriously "dirty." Data mining tools provide several key
ways of addressing this issue and cleaning your data:

• Missing Values functionality allows you to deal with your data’s missing values in one of five
different ways. You can filter all rows containing missing values from your dataset, attempt to
generate sensible values for those that are missing based on the distributions of data in the
columns, replace the missing values with the means of the corresponding columns, carry a
previous observation forward, or replace the missing values with a constant you choose.

• Outlier Detection functionality detects multidimensional outliers in your data. Based on the
information returned by Outlier Detection, you may choose to filter certain rows that are flagged
by the component as outliers.
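
The five missing-value strategies and a simple outlier filter can be sketched in Python as
follows (the input file is hypothetical, and a z-score rule stands in for whatever multidimensional
outlier detector a given tool provides):

import pandas as pd

df = pd.read_csv("dirty_data.csv")              # hypothetical input

# 1. Filter all rows containing missing values.
no_missing = df.dropna()

# 2. Replace missing values with the means of the corresponding columns.
mean_filled = df.fillna(df.mean(numeric_only=True))

# 3. Carry a previous observation forward.
carried = df.ffill()

# 4. Replace missing values with a constant you choose.
constant_filled = df.fillna(0)

# 5. Generate sensible values from each column's observed distribution
#    (here by sampling; assumes every column has at least one observed value).
def sample_fill(col):
    missing = col.isna()
    draws = col.dropna().sample(missing.sum(), replace=True).to_numpy()
    col = col.copy()
    col[missing] = draws
    return col

sampled = df.apply(sample_fill)

# Outlier detection: flag rows more than 3 standard deviations from the mean.
numeric = df.select_dtypes("number")
z_scores = (numeric - numeric.mean()) / numeric.std()
without_outliers = df[~(z_scores.abs() > 3).any(axis=1)]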



Data manipulation
Data manipulation is critical for transforming your data from their original format into a form that is
compatible with model building. For instance, CRM models are typically built from a flat structure
where every customer is represented in a single row. Transactional systems, on the other hand,
store customer data in highly normalized data structures; for example, customers and their
transactions may be in separate tables with one-to-many relationships.
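
A minimal sketch of that flattening step, with an invented schema, is shown below; it aggregates
a normalized transactions table into one row per customer:

import pandas as pd

# Normalized, one-to-many transactional data (hypothetical schema).
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "category":    ["food", "tools", "food", "food", "toys", "tools"],
    "amount":      [12.0, 40.0, 7.5, 9.0, 25.0, 15.0],
})

# Aggregate: per-customer summary statistics (one row per customer).
totals = tx.groupby("customer_id")["amount"].agg(["sum", "count", "mean"])

# Unstack/pivot: spend per category becomes one column per category.
per_category = tx.pivot_table(index="customer_id", columns="category",
                              values="amount", aggfunc="sum", fill_value=0)

# Join: the flat, model-ready structure used for CRM model building.
flat = totals.join(per_category)
print(flat)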

Row manipulations

• Aggregate
• Append
• Filter Rows
• Partition
• Sample
• Shuffle
• Sort
• Split
• Stack
• Unstack



Column manipulations

• Bin
• Create Columns
• Filter Columns
• Join
• Modify Columns
• Transpose
• Normalize




                                    Data Mining Architecture

Many data mining tools currently operate outside of the warehouse, requiring extra steps for
extracting, importing, and analyzing the data. Furthermore, when new insights require operational
implementation, integration with the warehouse simplifies the application of results from data
mining. The resulting analytic data warehouse can be applied to improve business processes
throughout the organization, in areas such as promotional campaign management, fraud
detection, new product rollout, and so on. The figure below illustrates an architecture for
advanced analysis in a large data warehouse.



                           Figure - Integrated Data Mining Architecture




The ideal starting point is a data warehouse containing a combination of internal data tracking all
customer contact coupled with external market data about competitor activity. Background
information on potential customers also provides an excellent basis for prospecting. This
warehouse can be implemented in a variety of relational database systems: Sybase, Oracle,
Redbrick, and so on, and should be optimized for flexible and fast data access.
    An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user
    business model to be applied when navigating the data warehouse. The multidimensional
    structures allow users to analyze the data the way they want to view their business, summarizing
    by product line, region, and other key perspectives. The Data Mining Server
    must be integrated with the data warehouse and the OLAP server to embed ROI-focused
    business analysis directly into this infrastructure. An advanced, process-centric metadata
    template defines the data mining objectives for specific business issues like campaign
    management, prospecting, and promotion optimization. Integration with the data warehouse
    enables operational decisions to be directly implemented and tracked. As the warehouse grows
    with new decisions and results, the organization can continually mine the best practices and
    apply them to future decisions.

    This design represents a fundamental shift from conventional decision support systems. Rather
    than simply delivering data to the end user through query and reporting software, the Advanced
    Analysis Server applies users’ business models directly to the warehouse and returns a proactive
    analysis of the most relevant information. These results enhance the metadata in the OLAP
    Server by providing a dynamic metadata layer that represents a distilled view of the data.
    Reporting, visualization, and other analysis tools can then be applied to plan future actions and
    confirm the impact of those plans.

              DATA MINING AND THE KNOWLEDGE DISCOVERY IN DATABASES (KDD) PROCESS

    The terms Knowledge Discovery in Databases (KDD) and data mining are often used
    interchangeably, but over the past few years KDD has been used to refer to a process consisting
    of many steps. The following definitions make the difference between them clear:

    Knowledge discovery in databases (KDD) is the process of finding useful information and
    patterns in data.

    Data mining is the use of algorithms to extract the information and patterns derived by the
    KDD process.

    The various steps involved in KDD are as follows:

            Selection: The data needed for the data mining process may be obtained from many
    different and heterogeneous data sources, such as databases, files, data warehouses, and
    non-electronic sources.
            Preprocessing: The data to be used by the process may have incorrect or missing
    entries. Erroneous data may be corrected or removed, whereas missing data must be supplied
    or predicted.
            Transformation: Data from different sources must be converted into a common format for
    processing. Some data may be encoded or transformed into more usable formats. Data reduction
    may be used to reduce the number of possible data values being considered.
            Data mining: Based on the data-mining task being performed, this step applies algorithms
    to the transformed data to generate the desired results.
            Interpretation/evaluation: How the data mining results are presented to the user is
    extremely important, because the usefulness of the results depends on it. Various visualization
    techniques employed for this purpose are:

    1. Graphical

    2. Geometric

    3. Icon-based

    4. Pixel-based

    5. Hierarchical

    6. Hybrid

          PARTITION ALGORITHM TO DISCOVER ALL FREQUENT SETS FROM THE DATA
                          WAREHOUSE BY USING DATA MINING

    BASICS

         Let A = {l1, l2, ..., lm} be a set of items. Let T, the transaction database, be a set of
    transactions, where each transaction t is a set of items; thus, t is a subset of A. A transaction t
    is said to support an item li if li is present in t, and t is said to support a subset of items X ⊆ A
    if t supports each item l in X. An itemset X ⊆ A has support s in T, denoted by s(X)T, if s% of
    the transactions in T support X. Support can also be defined as fractional support, meaning the
    proportion of transactions supporting X in T. For a given transaction database T, an association
    rule is an expression of the form X ⇒ Y, where X and Y are subsets of A. The rule X ⇒ Y holds
    with confidence τ if τ% of the transactions in T that support X also support Y. The rule X ⇒ Y
    has support σ in T if σ% of the transactions in T support X ∪ Y.

         Each rule has a left-hand side and a right-hand side. The left-hand side is called the
    antecedent and the right-hand side is called the consequent. In general, both the left-hand side
    and the right-hand side may contain multiple items. Confidence (or predictability) measures how
    much a particular item is dependent on another. Support does not depend on the direction (or
    implication) of the rule; it depends only on the set of items in the rule.
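
       These definitions translate directly into code. The toy transactions below are invented purely
    to illustrate the arithmetic; in Python:

# Support and confidence computed from their definitions (toy data).
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # Of the transactions supporting X, the fraction that also support Y.
    return support(X | Y, transactions) / support(X, transactions)

print(support({"bread", "milk"}, transactions))       # 2/4 = 0.50
print(confidence({"bread"}, {"milk"}, transactions))  # (2/4)/(3/4) = 0.67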

       The discovery of association rules is the most well-studied problem in data mining. Many
    interesting algorithms have been proposed recently; here we shall discuss the partition algorithm
    for finding associations. The features of an efficient algorithm are (a) reducing the I/O operations
    and (b) at the same time being efficient in computing.

    PARTITION ALGORITHM
    The partition algorithm is based on the observation that the frequent sets are normally very few
in number compared to the set of all itemsets. The algorithm uses two scans of the database to
discover all frequent sets and executes in two phases. In the first phase, it logically divides the
database into a number of non-overlapping partitions; the partitions are considered one at a time,
and all locally frequent itemsets for each partition are generated. The union of these local frequent
sets is a superset of all globally frequent itemsets, i.e., it may contain false positives. In the
second phase, the database is scanned once more to count the actual support of each candidate
and retain only the globally frequent sets.

                                           PARTITION ALGORITHM

P = partition_database(T); n = number of partitions

for i = 1 to n do begin                      // Phase I
    read_in_partition(Ti in P)
    Li = all frequent itemsets of Ti, generated with the Apriori method in main memory
end

for (k = 2; Lik ≠ ∅ for some i = 1, 2, ..., n; k++) do begin    // merge phase
    CGk = ∪ i=1..n Lik
end

for i = 1 to n do begin                      // Phase II
    read_in_partition(Ti in P)
    for all candidates c ∈ CG compute s(c)Ti
end

LG = { c ∈ CG | s(c)T ≥ σ }

Answer = LG
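
One way to render this pseudocode as runnable Python is sketched below. The Apriori-style
subroutine is a deliberately simplified in-memory version, all function names are our own, and
transactions are represented as sets of items:

def local_frequent_sets(partition, min_support):
    # All locally frequent itemsets of one partition, found levelwise
    # in main memory (a simplified stand-in for the Apriori method).
    min_count = min_support * len(partition)
    items = {item for t in partition for item in t}
    current = {frozenset([i]) for i in items
               if sum(i in t for t in partition) >= min_count}
    frequent, k = set(current), 2
    while current:
        # Join step: build candidate k-sets from frequent (k-1)-sets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = {c for c in candidates
                   if sum(c <= t for t in partition) >= min_count}
        frequent |= current
        k += 1
    return frequent

def partition_algorithm(partitions, min_support):
    # Phase I and merge: the union of locally frequent sets is a superset
    # of all globally frequent sets (it may contain false positives).
    candidates = set()
    for part in partitions:
        candidates |= local_frequent_sets(part, min_support)
    # Phase II: one more pass over every partition counts global support.
    total = sum(len(p) for p in partitions)
    counts = {c: sum(c <= t for p in partitions for t in p) for c in candidates}
    return {c for c in candidates if counts[c] >= min_support * total}

With the example below, the call would be partition_algorithm([T1, T2, T3], 0.20), where each Ti
is a list of transactions represented as sets; at 20% local support on 5-transaction partitions,
appearing in a single transaction already makes an itemset locally frequent.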

Example: Let us take the database T with support σ = 20% and partition it, for the sake of
illustration, into three partitions T1, T2, T3, each containing 5 transactions. The first partition T1
contains transactions 1 to 5, T2 contains transactions 6 to 10 and, similarly, T3 contains
transactions 11 to 15. We fix the local support equal to the given support, that is, 20%. Thus, any
itemset that appears in just one of the transactions in a partition is a locally frequent set in that
partition.

A1 A2 A3 A4 A5 A6 A7 A8 A9

 1  0  0  0  1  1  0  1  0
 0  1  0  1  0  0  0  1  0
 0  0  0  1  1  0  1  0  0
 0  1  1  0  0  0  0  0  0
 0  0  0  0  1  1  1  0  0
 0  1  1  1  0  0  0  0  0
 0  1  0  0  0  1  1  0  1
 0  0  0  0  1  0  0  0  0
 0  0  0  0  0  0  0  1  0
 0  0  1  0  1  0  1  0  0
 0  0  0  0  1  1  0  1  0
 0  1  0  1  0  1  1  0  0
 1  0  1  0  1  0  1  0  0
 0  1  1  0  0  0  0  0  1

L1 := { {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {1,5}, {1,6}, {1,8}, {2,3}, {2,4}, {2,8}, {4,5}, {4,7}, {4,8},
{5,6}, {5,7}, {5,8}, {6,7}, {6,8}, {1,5,6}, {1,5,8}, {1,6,8}, {2,4,8}, {4,5,7}, {5,6,7}, {5,6,8}, {1,5,6,8} }

Similarly

L2 := { {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {2,3}, {2,4}, {2,6}, {2,7}, {2,9}, {3,4}, {3,5}, {3,7}, {5,7},
{6,7}, {6,9}, {7,9}, {2,3,4}, {2,6,7}, {2,6,9}, {2,7,9}, {3,5,7}, {6,7,9}, {2,6,7,9} }

L3 := { {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {1,3}, {1,5}, {1,7}, {2,3}, {2,4}, {2,6}, {2,7}, {2,9}, {3,5},
{3,7}, {3,9}, {4,6}, {4,7}, {5,6}, {5,7}, {5,8}, {6,7}, {6,8}, {1,3,5}, {1,3,7}, {1,5,7}, {2,3,9},
{2,4,6}, {2,4,7}, {2,6,7}, {3,5,7}, {4,6,7}, {5,6,8}, {1,3,5,7}, {2,4,6,7} }

In phase II, we have the candidate set

C = L1 ∪ L2 ∪ L3

C := { {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {1,3}, {1,5}, {1,6}, {1,7}, {1,8}, {2,3}, {2,4}, {2,6}, {2,7},
{2,8}, {2,9}, {3,4}, {3,5}, {3,7}, {3,9}, {4,5}, {4,6}, {4,7}, {4,8}, {5,6}, {5,7}, {5,8}, {6,7}, {6,8},
{6,9}, {7,9}, {1,3,5}, {1,3,7}, {1,5,6}, {1,5,7}, {1,5,8}, {1,6,8}, {2,3,4}, {2,3,9}, {2,4,6}, {2,4,7},
{2,4,8}, {2,6,7}, {2,6,9}, {2,7,9}, {3,5,7}, {4,5,7}, {4,6,7}, {5,6,7}, {5,6,8}, {6,7,9}, {1,3,5,7},
{1,5,6,8}, {2,4,6,7}, {2,6,7,9} }
                                   CONCLUSION
    Is data mining as useful in science as in commerce? Certainly, data mining in science has much
    in common with that for business data. One difference, though, is that there is a lot of existing
    scientific theory and knowledge. Hence, there is less chance of knowledge emerging purely from
    data. However, empirical results can be valuable in science (especially where it borders on
    engineering), as in suggesting causality relationships or for modeling complex phenomena.

    Another difference is that in commerce, rules are soft (sociological or cultural) and assume
    consistent behavior. For example, the plausible myth that "30% of people who buy babies'
    nappies also buy beer" is hardly fundamental, but one might profitably apply it as a selling tactic
    (until perhaps the fashion changes from beer to lager). On the other hand, scientific rules or laws
    are, in principle, testable objectively. Any results from data mining techniques must sit within the
    existing domain knowledge. Hence, the involvement of a domain expert is crucial to the data
    mining process. Naive data mining often yields "obvious" results. The challenge is to incorporate
    rules known a priori into the empirical induction, remembering that the whole KDD process is
    exploratory and iterative.


				