DATAWAREHOUSING AND DATAMINING
CONTENTS: 1.INTRODUCTION 2.DATA WAREHOUSING
DEFINITION CHARACTERISTICS PROCESS USES MODEL EXAMPLE CRITERIA ADVANTAGES
DEFINITION CHARACTERISTICS TECHNIQUES WHO NEEDS DATA MINING? PROCESS WHAT CAN DATA MINING DO? ELEMENTS ADVANTAGES PROBLEMS APPLICATIONS
4.CONCLUSION AND EXTENTION OF WORK
Data warehousing, like data mining, is a relatively new term although the concept itself has been around for years. Data has become the software engineers and developers lifeblood, and data warehouses are playing an increasingly prominent role in supporting the day-to-day business decisions. As is well known, in mining, enormous quantities of debris have to be removed before diamonds or gold can be found. The analogy that, with a computer, we can automatically find the one „information-diamond‟ among the tons of data-debris in our database is of course very attractive. Thus In this paper, Data Warehousing, Mining and their underlying technologies are examined in depth.
Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. A data warehouse is a special kind of database that is intended for deep analysis and strategic planning. Distinct from day-to-day operational databases or online transaction processing systems that capture point-of-sales data, for example, a data warehouse combines disparate data sources possibly including external sources such as census data to provide one location for decision support data. This aggregated data is placed on a common basis for the same period of time, for the same portion of an organization, and for the same currency to provide the consolidated, summary data that facilitates high-level business analysis. Furthermore, data warehouses can be used in concert with online analytical processing (OLAP) and data mining tools to discover trends and unknown patterns within the detail data. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis. Dramatic technological advances are making this vision a reality for many companies. And, equally dramatic advances in data analysis software are allowing users to access this data freely. The data analysis software is what supports data mining. Data warehousing is the science of transforming data into meaningful business information. Data warehousing integrates and relates data throughout an enterprise, regardless of source, location, format, or communication requirements. Through the use of OLAP (On-Line Analytical Processing) technology, the data warehouse environment can satisfy the requirements of the organization's operational and analytical activities. The true return on investment of implementing a data warehouse environment is its impact on business decisions.
Data mining is the nontrivial discovery of meaningful, new correlations, patterns, and trends, and the extraction of implicit, previously unknown, and potentially useful information from large amounts of data. It uses pattern recognition technologies in conjunction with machine learning, statistical, and visualization techniques to discover and present knowledge in a form that is easily comprehensible.
Most organizations have large databases that contain a wealth of potentially accessible information. However, it is usually very difficult to access this information. The unbridled growth of data will inevitably let to a situation in which it is increasingly difficult to access the desired information: it will always be like looking for a needle in a haystack, only the amount of hay will be growing all the time. Against this background, the great interest that is being shown in a new field of „ Data Mining ‟ or Knowledge Discovery in Database is understandable.
A data warehouse is a copy of transaction data specifically structured for querying and reporting. The form of the stored data has nothing to do with whether something is a data warehouse. A data warehouse can be normalized or denormalized. It can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc. Data warehouse data often gets changed. And data warehouses often focus on a specific activity or entity. Before going to the concept of data warehousing in depth, some of the key terms should be familiar.
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:
Operational or transactional data such as, sales, cost, inventory, payroll, and accounting
Non-operational data, such as industry sales, forecast data, and macro economic data
Meta data - data about the data itself, such as logical database design or data dictionary definitions
The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.
Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.
Data warehouse Definition:
A data warehouse can be defined as any centralized data repository which can be queried for business benefit
. Warehousing makes it possible to
o o o
Extract archived operational data Overcome inconsistencies between different legacy data formats Integrate data throughout an enterprise, regardless of location, format, or communication requirements
Incorporate additional or expert information.
Characteristics of a data warehouse:
Subject-oriented - data organized by subject instead of application e.g.
An insurance company would organize their data by customer, premium, and claim, instead of by different products (auto, life, etc.)
Contains only the information necessary for decision support processing
Integrated - encoding of data is often inconsistent e.g.
Gender might be coded as "m" and "f" or 0 and 1 but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention
Time-variant - the data warehouse contains a place for storing data that are five to 10 years old, or older e.g.
This data is used for comparisons, trends, and forecasting These data are not updated
Non-volatile - data are not updated or changed in any way once they enter the data warehouse
Data are only loaded and accessed
Data warehousing Processes:
Data warehousing is the process of constructing and using data warehouses. Data
warehouse is a repository of information collected from multiple sources, stored, under a unified
schema and which usually resides at a single site. Data warehouses are constructed via a process data cleaning, data transformation, data integration, data loading and periodic data refreshing. Data cleaning: Routines attempt to fill in missing values, smooth out noise while identifiers outliers and inconsistencies in the data. Data integration: It combines data from multiple sources into a coherent data store. Data Transformation: Data are transformed or consolidated into forms appropriate for mining. Data loading and refreshing: Data loading is the loading of data and refreshing refers to refreshing of data. Data –A subject oriented, integrated, time-variant and non-volatile Warehousing: Collection of data in support of management‟s decision-making process. Subject oriented: A data warehouse is organized around major subjects, such as customer, product and sales. Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources. Time-variant: Every key structure in the warehouse contains implicitly or explicitly an element of time. Non-volatile: A data warehouse is always a physically a separate store of data transformed from the application data found in the operational environment.
Insulate data - i.e. the current operational information
Preserves the security and integrity of mission-critical OLTP applications Gives access to the broadest possible base of data
Retrieve data - from a variety of heterogeneous operational databases
Data is transformed and delivered to the data warehouse/store based on a selected model (or mapping definition)
Metadata - information describing the model and definition of the source data elements
Data cleansing - removal of certain aspects of operational data, such as low-level transaction information, which slow down the query times.
Transfer - processed data transferred to the data warehouse, a large database on a high performance box
Uses Of a data warehouse
A central store against which the queries are run
Uses very simple data structures with very little assumptions about the relationships between data
A data mart is a small warehouse which provides subsets of the main store, Summarized information
Depending on the requirements of a specific group/department Marts often use multidimensional databases which can speed up query processing as they can have data structures which are reflect the most likely questions.
Data Warehouse model
Structure of data inside the data warehouse:
An example of levels of summarization of data:
Criteria for a data warehouse:
Require incremental loading of new data on a periodic basis Must not artificially constrain the volume of data
Data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update
Data Quality Management
Ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size
Must not be slowed or inhibited by the performance of the data warehouse RDBMS
Data warehouse sizes are growing at astonishing rates so RDBMS must not have any architectural limitations. It must support modular and parallel management.
Mass User Scalability
Access to warehouse data must not be limited to the elite few has to support hundreds, even thousands, of concurrent users while maintaining acceptable query performance.
Networked Data Warehouse
Data warehouses rarely exist in isolation, users must be able to look at and work with multiple warehouses from a single client workstation
Large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility
The RDBMS must Integrate Dimensional Analysis
Dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools
Advanced Query Functionality End users require advanced analytic calculations, sequential and comparative analysis,
and consistent access to detailed and summarized data.
Advantages Of Data Warehousing:
A data warehouse may present relevant information from which to measure to performance. A data warehouse may enhance business activity. A data warehouse facilities customer relationship management A data warehouse may bring about cost reduction.
Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. The non-trivial extraction of implicit, previously unknown, and potentially useful information from data. Variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but as it stands of low value as no direct use can be made of it; it is the hidden information in the data that is useful.
Characteristics Of a data mining system
Large quantities of data
Volume of data so great it has to be analyzed by automated techniques e.g. POS, satellite information, credit card transactions etc.
Noisy, incomplete data
Imprecise data is characteristic of all data collection Databases - usually contaminated by errors, cannot assume that the data they contain is entirely correct e.g. some attributes rely on subjective or measurement judgments
Complex data structure - conventional statistical analysis not possible Heterogeneous data stored in legacy systems
Who needs data mining?
Who (ever) has information fastest and uses it will win.
Businesses are looking for new ways to let end users find the data they need to:
o o o
Make decisions Serve customers and Gain the competitive edge
Data Mining Process
Heterogeneity resolution Data cleansing
Data Mining Tools applied
Extraction of patterns from the pre-processed data
Interpretation and evaluation
User bias i.e. can direct DM tools to areas of interest
Attributes of interest in databases Goal of discovery Domain knowledge Prior knowledge or belief about the domain
Data mining encompasses a number of different technical approaches, such as:
o o o o
Clustering, Data summarization, Learning classification rules, Finding dependency net works,
Set oriented approaches/Databases
make use of DBMSs to discover knowledge, SQL is limiting
can be used in several data mining stages
data cleansing i.e. the removal of erroneous or irrelevant data known as outliers
EDA, exploratory data analysis e.g. frequency counts, histograms etc. data selection - sampling facilities and so reduce the scale of computation attribute re-definition e.g. Body Mass Index, BMI, which is Weight/Height2 data analysis - measures of association and relationships between attributes, interestingness of rules, classification etc.
enhances EDA, makes patterns more visible e.g. NETMAP a commercial data mining tool uses this technique.
The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
Clustering i.e. Cluster Analysis
Clustering and segmentation is basically partitioning the database so that each partition or group is similar according to some criteria or metric
Clustering according to similarity is a concept which appears in many disciplines e.g. in chemistry the clustering of molecules
Data mining applications make use of clustering according to similarity e.g. to segment a client/customer base
It provides sub-groups of a population for further analysis or action - very important when dealing with very large databases
Can be used for profile generation for target marketing i.e. where previous response to mailing campaigns can be used to generate a profile of people who responded and this can be used to predict response and filter mailing lists to achieve the best response
What can data mining do?
Companies with a strong consumer focus - retail, financial, communication, and marketing organizations, primarily use data mining today. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. And, it enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detail transactional data. With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments. For example, The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four jump shots and made each one! Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game.
By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knick's defense and then finds Williams for an open jump shot.
Data mining consists of five major elements:
Extract, transform, and load transaction data onto the data warehouse system. Store and manage the data in a multidimensional database system. Provide data access to business analysts and information technology professionals. Analyze the data by application software. Present the data in a useful format, such as a graph or table.
Advantages Of Data Mining:
Data Mining enables the extractions of previously hidden information needed by management and represent a significant step forward in the whole area of information management. We have identified pollution, found numerous interesting patterns via different techniques, identified what various techniques can do and which technique work well on our dataset, and identified interesting clusters of clients in the Database.
Problems in Data Mining:
Lack of long term machine Legal and privacy restriction Timing problems Files are hard to connect for technical reason Interpretation problems
Medicine - drug side effects, hospital cost analysis, genetic sequence analysis, prediction etc.
Finance - stock market prediction, credit assessment, fraud detection etc. Marketing/sales - product analysis, buying patterns, sales prediction, target mailing, identifying `unusual behaviour' etc.
Knowledge Acquisition Scientific discovery - superconductivity research, etc. Engineering - automotive diagnostic expert systems, fault detection etc.
CONCLUSION AND EXTENTION OF WORK:
Data mining may be used to automatically perform knowledge discovery by giving the mining algorithm loose cues about potential relationships and letting the algorithm work on the data to discover the relationships and items to focus on further. If a data warehouse is present in the environment, either it or a data mart, would be the database used by OLAP for example, if we are analyzing sales, the data warehouse could contain 3 years of item level detail. Data mining could perform the advanced analytics required to determine what items are generally purchased together (market basket analysis) or what shopper demographics lead to the highest sales volume. These pieces of knowledge could lead to better product placement and marketing strategies. Likewise, an analyst could perform OLAP on the data warehouse to determine what products sell the most and what customers buy the most. While this information lacks the correlations that data mining would yield, it is nonetheless still valuable. This could also drive promotions strategy as well as vendor management strategies. So it is concluded that, “Data Warehousing and Data Mining require a completely new interpretation of database technology.”
Futures in Data Warehousing:
The last few years have seen database in the open systems arena grow by a couple of orders of magnitude. These order of magnitude changes are of major significant. With each multiple of 10 the techniques required to architect, design, build and manage large systems have to be revisited, and more often than not have to be replaced with completely new techniques.
As the size of the database grows, the estimate of what constitutes a very large database (VLDB) continues to grow. The acronym VLDB has become so stretched in the range that it covers that a new acronym is required to distinguish these future systems from what has happened to date. We define an extremely large database (ELDB). The future shape of data warehouses will be very different from that of the data warehouses that are being created today. The requirement to handle vast quantities of multiformat data, while satisfying a user base of thousands, will drive solutions toward distributed data bases and three-tier architectures.
Future of Data Mining:
Data mining is becoming the focus of a great deal of attention – Hundreds of companies world wide have started such projects. New opportunities provided by these techniques are constantly being discovered. The exponential growth of data in the world, it is to be expected that this trend towards Data mining will continue for a considerable time. It is supported by other techniques such as Multiprocessing Machines, Client/Server and extremely well organized databases. All of these together make it possible to fully maximize the benefit of working with data mining. We believe that the future will see the development of many new applications based on Data Warehouses and using Data Mining techniques. Data Mining allows for the creation of a self -learning organization and it is this that convinces us that, “WE ARE AT THE BEGINNING OF A VERY PROMISING ERA”.