papers

Document Sample
papers Powered By Docstoc
					DATAWAREHOUSING AND DATAMINING

CONTENTS: 1.INTRODUCTION 2.DATA WAREHOUSING
        DEFINITION CHARACTERISTICS PROCESS USES MODEL EXAMPLE CRITERIA ADVANTAGES

3.DATA MINING
          DEFINITION CHARACTERISTICS TECHNIQUES WHO NEEDS DATA MINING? PROCESS WHAT CAN DATA MINING DO? ELEMENTS ADVANTAGES PROBLEMS APPLICATIONS

4.CONCLUSION AND EXTENTION OF WORK

ABSTRACT:

Data warehousing, like data mining, is a relatively new term although the concept itself has been around for years. Data has become the software engineers and developers lifeblood, and data warehouses are playing an increasingly prominent role in supporting the day-to-day business decisions. As is well known, in mining, enormous quantities of debris have to be removed before diamonds or gold can be found. The analogy that, with a computer, we can automatically find the one „information-diamond‟ among the tons of data-debris in our database is of course very attractive. Thus In this paper, Data Warehousing, Mining and their underlying technologies are examined in depth.

INTRODUCTION:
Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. A data warehouse is a special kind of database that is intended for deep analysis and strategic planning. Distinct from day-to-day operational databases or online transaction processing systems that capture point-of-sales data, for example, a data warehouse combines disparate data sources possibly including external sources such as census data to provide one location for decision support data. This aggregated data is placed on a common basis for the same period of time, for the same portion of an organization, and for the same currency to provide the consolidated, summary data that facilitates high-level business analysis. Furthermore, data warehouses can be used in concert with online analytical processing (OLAP) and data mining tools to discover trends and unknown patterns within the detail data. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis. Dramatic technological advances are making this vision a reality for many companies. And, equally dramatic advances in data analysis software are allowing users to access this data freely. The data analysis software is what supports data mining. Data warehousing is the science of transforming data into meaningful business information. Data warehousing integrates and relates data throughout an enterprise, regardless of source, location, format, or communication requirements. Through the use of OLAP (On-Line Analytical Processing) technology, the data warehouse environment can satisfy the requirements of the organization's operational and analytical activities. The true return on investment of implementing a data warehouse environment is its impact on business decisions.

Data mining is the nontrivial discovery of meaningful, new correlations, patterns, and trends, and the extraction of implicit, previously unknown, and potentially useful information from large amounts of data. It uses pattern recognition technologies in conjunction with machine learning, statistical, and visualization techniques to discover and present knowledge in a form that is easily comprehensible.

Most organizations have large databases that contain a wealth of potentially accessible information. However, it is usually very difficult to access this information. The unbridled growth of data will inevitably let to a situation in which it is increasingly difficult to access the desired information: it will always be like looking for a needle in a haystack, only the amount of hay will be growing all the time. Against this background, the great interest that is being shown in a new field of „ Data Mining ‟ or Knowledge Discovery in Database is understandable.

DATA WAREHOUSING
A data warehouse is a copy of transaction data specifically structured for querying and reporting. The form of the stored data has nothing to do with whether something is a data warehouse. A data warehouse can be normalized or denormalized. It can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc. Data warehouse data often gets changed. And data warehouses often focus on a specific activity or entity. Before going to the concept of data warehousing in depth, some of the key terms should be familiar.

Data:
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:


Operational or transactional data such as, sales, cost, inventory, payroll, and accounting



Non-operational data, such as industry sales, forecast data, and macro economic data



Meta data - data about the data itself, such as logical database design or data dictionary definitions

Information:
The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.

Knowledge:
Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

Data warehouse Definition:


A data warehouse can be defined as any centralized data repository which can be queried for business benefit



. Warehousing makes it possible to
o o o

Extract archived operational data Overcome inconsistencies between different legacy data formats Integrate data throughout an enterprise, regardless of location, format, or communication requirements

o

Incorporate additional or expert information.

Characteristics of a data warehouse:


Subject-oriented - data organized by subject instead of application e.g.
o

An insurance company would organize their data by customer, premium, and claim, instead of by different products (auto, life, etc.)

o 

Contains only the information necessary for decision support processing

Integrated - encoding of data is often inconsistent e.g.
o

Gender might be coded as "m" and "f" or 0 and 1 but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention



Time-variant - the data warehouse contains a place for storing data that are five to 10 years old, or older e.g.
o o

This data is used for comparisons, trends, and forecasting These data are not updated



Non-volatile - data are not updated or changed in any way once they enter the data warehouse



Data are only loaded and accessed

Data warehousing Processes:
Data warehousing is the process of constructing and using data warehouses. Data

warehouse is a repository of information collected from multiple sources, stored, under a unified

schema and which usually resides at a single site. Data warehouses are constructed via a process data cleaning, data transformation, data integration, data loading and periodic data refreshing. Data cleaning: Routines attempt to fill in missing values, smooth out noise while identifiers outliers and inconsistencies in the data. Data integration: It combines data from multiple sources into a coherent data store. Data Transformation: Data are transformed or consolidated into forms appropriate for mining. Data loading and refreshing: Data loading is the loading of data and refreshing refers to refreshing of data. Data –A subject oriented, integrated, time-variant and non-volatile Warehousing: Collection of data in support of management‟s decision-making process. Subject oriented: A data warehouse is organized around major subjects, such as customer, product and sales. Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources. Time-variant: Every key structure in the warehouse contains implicitly or explicitly an element of time. Non-volatile: A data warehouse is always a physically a separate store of data transformed from the application data found in the operational environment.


Insulate data - i.e. the current operational information
o o

Preserves the security and integrity of mission-critical OLTP applications Gives access to the broadest possible base of data



Retrieve data - from a variety of heterogeneous operational databases
o

Data is transformed and delivered to the data warehouse/store based on a selected model (or mapping definition)

o

Metadata - information describing the model and definition of the source data elements



Data cleansing - removal of certain aspects of operational data, such as low-level transaction information, which slow down the query times.



Transfer - processed data transferred to the data warehouse, a large database on a high performance box

Uses Of a data warehouse


A central store against which the queries are run
o

Uses very simple data structures with very little assumptions about the relationships between data



A data mart is a small warehouse which provides subsets of the main store, Summarized information
o o

Depending on the requirements of a specific group/department Marts often use multidimensional databases which can speed up query processing as they can have data structures which are reflect the most likely questions.

Data Warehouse model

Structure of data inside the data warehouse:

An example of levels of summarization of data:

Criteria for a data warehouse:


Load Performance
o o

Require incremental loading of new data on a periodic basis Must not artificially constrain the volume of data



Load Processing
o

Data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update



Data Quality Management
o

Ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size



Query Performance
o

Must not be slowed or inhibited by the performance of the data warehouse RDBMS



Terabyte Scalability
o

Data warehouse sizes are growing at astonishing rates so RDBMS must not have any architectural limitations. It must support modular and parallel management.



Mass User Scalability
o

Access to warehouse data must not be limited to the elite few has to support hundreds, even thousands, of concurrent users while maintaining acceptable query performance.



Networked Data Warehouse
o

Data warehouses rarely exist in isolation, users must be able to look at and work with multiple warehouses from a single client workstation



Warehouse Administration
o

Large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility



The RDBMS must Integrate Dimensional Analysis
o

Dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools



Advanced Query Functionality End users require advanced analytic calculations, sequential and comparative analysis,

and consistent access to detailed and summarized data.

Advantages Of Data Warehousing:
 A data warehouse may present relevant information from which to measure to performance.  A data warehouse may enhance business activity.  A data warehouse facilities customer relationship management  A data warehouse may bring about cost reduction.

DATA MINING
Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. The non-trivial extraction of implicit, previously unknown, and potentially useful information from data. Variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but as it stands of low value as no direct use can be made of it; it is the hidden information in the data that is useful.

Characteristics Of a data mining system


Large quantities of data

o

Volume of data so great it has to be analyzed by automated techniques e.g. POS, satellite information, credit card transactions etc.



Noisy, incomplete data
o o

Imprecise data is characteristic of all data collection Databases - usually contaminated by errors, cannot assume that the data they contain is entirely correct e.g. some attributes rely on subjective or measurement judgments

 

Complex data structure - conventional statistical analysis not possible Heterogeneous data stored in legacy systems

Who needs data mining?
Who (ever) has information fastest and uses it will win.


Businesses are looking for new ways to let end users find the data they need to:
o o o

Make decisions Serve customers and Gain the competitive edge

Data Mining Process

Stages


Data pre-processing
o o

Heterogeneity resolution Data cleansing

o 

Data warehousing

Data Mining Tools applied
o

Extraction of patterns from the pre-processed data



Interpretation and evaluation
o

User bias i.e. can direct DM tools to areas of interest
   

Attributes of interest in databases Goal of discovery Domain knowledge Prior knowledge or belief about the domain

Techniques used:


Data mining encompasses a number of different technical approaches, such as:
o o o o

Clustering, Data summarization, Learning classification rules, Finding dependency net works,



Set oriented approaches/Databases
o

make use of DBMSs to discover knowledge, SQL is limiting



Statistics
o

can be used in several data mining stages


data cleansing i.e. the removal of erroneous or irrelevant data known as outliers

   

EDA, exploratory data analysis e.g. frequency counts, histograms etc. data selection - sampling facilities and so reduce the scale of computation attribute re-definition e.g. Body Mass Index, BMI, which is Weight/Height2 data analysis - measures of association and relationships between attributes, interestingness of rules, classification etc.



Visualization
o

enhances EDA, makes patterns more visible e.g. NETMAP a commercial data mining tool uses this technique.

o

The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.



Clustering i.e. Cluster Analysis

o

Clustering and segmentation is basically partitioning the database so that each partition or group is similar according to some criteria or metric

o

Clustering according to similarity is a concept which appears in many disciplines e.g. in chemistry the clustering of molecules

o

Data mining applications make use of clustering according to similarity e.g. to segment a client/customer base

o

It provides sub-groups of a population for further analysis or action - very important when dealing with very large databases

o

Can be used for profile generation for target marketing i.e. where previous response to mailing campaigns can be used to generate a profile of people who responded and this can be used to predict response and filter mailing lists to achieve the best response

What can data mining do?
Companies with a strong consumer focus - retail, financial, communication, and marketing organizations, primarily use data mining today. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. And, it enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detail transactional data. With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments. For example, The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four jump shots and made each one! Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game.

By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knick's defense and then finds Williams for an open jump shot.

Elements:
Data mining consists of five major elements:
    

Extract, transform, and load transaction data onto the data warehouse system. Store and manage the data in a multidimensional database system. Provide data access to business analysts and information technology professionals. Analyze the data by application software. Present the data in a useful format, such as a graph or table.

Advantages Of Data Mining:
 Data Mining enables the extractions of previously hidden information needed by management and represent a significant step forward in the whole area of information management.  We have identified pollution, found numerous interesting patterns via different techniques, identified what various techniques can do and which technique work well on our dataset, and identified interesting clusters of clients in the Database.

Problems in Data Mining:
     Lack of long term machine Legal and privacy restriction Timing problems Files are hard to connect for technical reason Interpretation problems

Applications:



Medicine - drug side effects, hospital cost analysis, genetic sequence analysis, prediction etc.

 

Finance - stock market prediction, credit assessment, fraud detection etc. Marketing/sales - product analysis, buying patterns, sales prediction, target mailing, identifying `unusual behaviour' etc.

  

Knowledge Acquisition Scientific discovery - superconductivity research, etc. Engineering - automotive diagnostic expert systems, fault detection etc.

CONCLUSION AND EXTENTION OF WORK:
Data mining may be used to automatically perform knowledge discovery by giving the mining algorithm loose cues about potential relationships and letting the algorithm work on the data to discover the relationships and items to focus on further. If a data warehouse is present in the environment, either it or a data mart, would be the database used by OLAP for example, if we are analyzing sales, the data warehouse could contain 3 years of item level detail. Data mining could perform the advanced analytics required to determine what items are generally purchased together (market basket analysis) or what shopper demographics lead to the highest sales volume. These pieces of knowledge could lead to better product placement and marketing strategies. Likewise, an analyst could perform OLAP on the data warehouse to determine what products sell the most and what customers buy the most. While this information lacks the correlations that data mining would yield, it is nonetheless still valuable. This could also drive promotions strategy as well as vendor management strategies. So it is concluded that, “Data Warehousing and Data Mining require a completely new interpretation of database technology.”

Futures in Data Warehousing:
The last few years have seen database in the open systems arena grow by a couple of orders of magnitude. These order of magnitude changes are of major significant. With each multiple of 10 the techniques required to architect, design, build and manage large systems have to be revisited, and more often than not have to be replaced with completely new techniques.

As the size of the database grows, the estimate of what constitutes a very large database (VLDB) continues to grow. The acronym VLDB has become so stretched in the range that it covers that a new acronym is required to distinguish these future systems from what has happened to date. We define an extremely large database (ELDB). The future shape of data warehouses will be very different from that of the data warehouses that are being created today. The requirement to handle vast quantities of multiformat data, while satisfying a user base of thousands, will drive solutions toward distributed data bases and three-tier architectures.

Future of Data Mining:
    Data mining is becoming the focus of a great deal of attention – Hundreds of companies world wide have started such projects. New opportunities provided by these techniques are constantly being discovered. The exponential growth of data in the world, it is to be expected that this trend towards Data mining will continue for a considerable time. It is supported by other techniques such as Multiprocessing Machines, Client/Server and extremely well organized databases. All of these together make it possible to fully maximize the benefit of working with data mining. We believe that the future will see the development of many new applications based on Data Warehouses and using Data Mining techniques. Data Mining allows for the creation of a self -learning organization and it is this that convinces us that, “WE ARE AT THE BEGINNING OF A VERY PROMISING ERA”.


				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:35
posted:1/26/2010
language:English
pages:17