DATA MINING AND DATA WAREHOUSING
Computer technologies are changing the practice of research and business, and, very slowly, the content and practice of education are beginning to follow suit. This paper discusses how work in data mining and data warehousing is contributing new approaches to education and learning. It provides an introduction to the basic technologies of data mining and data warehousing. Examples of profitable applications illustrate their relevance to today’s business environment, and a basic description shows how data mining and data warehouse architecture can evolve to deliver the value of data mining to end users. The paper is organized in two sections. The first section deals with data mining and covers its architecture, functions, techniques and benefits. The techniques mentioned include neural networks, decision trees, genetic algorithms, the nearest neighbor method, rule induction and evolutionary programming; among these, the neural network technique is explained in detail. The second section deals with data warehousing and covers its architecture, model, schemas and benefits. Finally, we conclude that data mining and data warehousing have an endless path.
DEFINITION OF DATA MINING
ARCHITECTURE VIEW OF DATA MINING
DATA MINING FUNCTIONS
DATA MINING TECHNIQUES
BENEFITS OF DATA MINING
DEFINITION OF DATA WAREHOUSING
ARCHITECTURE VIEW OF DATA WAREHOUSING
DATA WAREHOUSE MODEL
DATA WAREHOUSE SCHEMAS
BENEFITS OF DATA WAREHOUSING
Generally, data mining, or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Data warehousing, like data mining, is a relatively new term, although the concept itself has been around for years. Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. Warehouses are much larger than other kinds of databases; sizes ranging from several gigabytes to terabytes are common. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data.
DEFINITION OF DATA MINING
Data mining refers to extracting or “mining” knowledge from large amounts of data. There are many other terms for data mining, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Data mining is also treated as a synonym for another popularly used term, knowledge discovery in databases, or KDD. KDD consists of an iterative sequence of the following steps:
1. Data cleaning, to remove noise and inconsistent data.
2. Data integration, where multiple data sources may be combined.
3. Data selection, where data relevant to the analysis task are retrieved from the database.
4. Data transformation, where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations.
5. Data mining, an essential process where intelligent methods are applied in order to extract data patterns.
6. Pattern evaluation, to identify the truly interesting patterns representing knowledge, based on some interestingness measures.
7. Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user.
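The first four KDD steps prepare the data before any mining takes place. The following is a minimal sketch of that preparation, not a definitive implementation; the record layout, the field names (customer, amount, region) and the helper functions are all assumptions made for illustration.

```python
# Hypothetical raw records from two sources; field names are illustrative only.
store_a = [
    {"customer": "c1", "amount": 20.0, "region": "north"},
    {"customer": "c2", "amount": None, "region": "south"},  # noisy: missing amount
]
store_b = [
    {"customer": "c1", "amount": 5.0, "region": "north"},
    {"customer": "c3", "amount": 12.5, "region": "south"},
]

def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, region):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if r["region"] == region]

def transform(records):
    """Data transformation: aggregate the amount per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

# Steps 1 to 4 applied in order; the result is what a mining step would consume.
integrated = integrate(clean(store_a), clean(store_b))
prepared = transform(select(integrated, "north"))
print(prepared)
```

In practice the steps are iterative, so the output of pattern evaluation may send the analyst back to an earlier step with different selection or transformation choices.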
ARCHITECTURE VIEW OF DATA MINING
The architecture of a typical data mining system may have the following major components:
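[Figure: Architecture of a typical data mining system, showing the graphical user interface, pattern evaluation, knowledge base, data mining engine, database or data warehouse server, data cleaning, data integration and filtering, and the underlying database and data warehouse.]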
Database, data warehouse, or other information repository: This is one or a set of databases, data warehouses, spreadsheets or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.
Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.
Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns.
Graphical user interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
DATA MINING FUNCTIONS:
The functions used in data mining are:
ASSOCIATION: These rules correlate the presence of a set of items with a range of values for another set of variables. Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records that returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. In this rule, A, B and C are said to be on the opposite side of the rule from D and E. Associations can involve any number of items on either side of the rule. Example: An x-ray image containing characteristics A and B is likely to also exhibit characteristic C.
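The confidence factor described above can be computed directly from a set of records. The sketch below assumes each record is a Python set of item labels; the function name and the sample records are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def confidence(records, lhs, rhs):
    """Fraction of records containing all lhs items that also contain all rhs items."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [r for r in records if lhs <= set(r)]
    if not with_lhs:
        return 0.0  # the rule never applies, so there is nothing to measure
    with_both = [r for r in with_lhs if rhs <= set(r)]
    return len(with_both) / len(with_lhs)

# Illustrative records: four contain A, B and C; three of those also contain D and E.
records = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C"},
    {"A", "D"},
]

conf = confidence(records, {"A", "B", "C"}, {"D", "E"})
print(f"confidence of A,B,C -> D,E: {conf:.0%}")
```

With these sample records the rule "A, B and C imply D and E" has a confidence factor of 75%, since three of the four records that contain the left-hand items also contain the right-hand items.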
CLASSIFICATION: The goal is to work from an existing set of events or transactions to create a hierarchy of classes. The database contains one or more attributes that denote the class of a tuple; these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class. When learning classification rules, the system has to find the rules that predict the class from the predicting attributes. The user first defines conditions for each class, and the data mining system then constructs descriptions for the classes. In essence, given a case or tuple with certain known attribute values, the system should be able to predict which class the case belongs to. Once classes are defined, the system should infer the rules that govern the classification, and so should be able to find the description of each class. The descriptions should refer only to the predicting attributes of the training set, so that the positive examples satisfy the description and none of the negative examples do. A rule is said to be correct if its description covers all the positive examples and none of the negative examples of a class. A rule is generally presented as "if the left-hand side (LHS), then the right-hand side (RHS)", meaning that in all instances where the LHS is true, the RHS is also true or is very probable. The categories of rules are:
Exact rule: permits no exceptions, so each object of the LHS must be an element of the RHS.
Strong rule: allows some exceptions, but the exceptions have a given limit.
Probabilistic rule: relates the conditional probability P(RHS|LHS) to the probability P(RHS).
Example: A population may be divided into five ranges of credit worthiness based on a history of previous credit transactions.
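The three rule categories can be told apart by counting exceptions and comparing P(RHS|LHS) with P(RHS). The following sketch is one possible reading of those definitions; the attribute names (late_payments, risk), the sample cases and the exception limit are assumptions, not part of any particular system.

```python
def rule_statistics(cases, lhs, rhs):
    """Count exceptions and estimate P(RHS|LHS) and P(RHS) from the cases.

    Assumes at least one case satisfies the LHS.
    """
    covered = [c for c in cases if lhs(c)]
    hits = sum(1 for c in covered if rhs(c))
    return {
        "exceptions": len(covered) - hits,
        "p_rhs_given_lhs": hits / len(covered),
        "p_rhs": sum(1 for c in cases if rhs(c)) / len(cases),
    }

def categorize(stats, max_exceptions):
    """Exact rules allow no exceptions; strong rules allow up to a given limit."""
    if stats["exceptions"] == 0:
        return "exact"
    if stats["exceptions"] <= max_exceptions:
        return "strong"
    return "not strong"

# Hypothetical training cases: (predicting attribute, predicted attribute).
cases = [
    {"late_payments": 3, "risk": "high"},
    {"late_payments": 4, "risk": "high"},
    {"late_payments": 5, "risk": "high"},
    {"late_payments": 3, "risk": "low"},   # the single exception to the rule
    {"late_payments": 0, "risk": "low"},
    {"late_payments": 0, "risk": "low"},
    {"late_payments": 1, "risk": "low"},
    {"late_payments": 0, "risk": "high"},
]

# Rule under test: if late_payments >= 3 then risk is high.
stats = rule_statistics(
    cases,
    lhs=lambda c: c["late_payments"] >= 3,
    rhs=lambda c: c["risk"] == "high",
)
print(stats, categorize(stats, max_exceptions=1))
```

Here P(RHS|LHS) is 0.75 while P(RHS) is 0.5, so knowing the left-hand side raises the probability of the right-hand side, which is the relationship a probabilistic rule expresses.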
SEQUENTIAL/TEMPORAL PATTERN: Sequential/temporal pattern functions analyze a collection of records over a period of time, for example to identify trends. Where the identity of the customer who made a purchase is known, an analysis can be made of the collection of related records of the same structure (i.e. each consisting of a number of items drawn from a given collection of items). The records are related by the identity of the customer who made the repeated purchases. Such a situation is typical of a direct mail application where, for example, a catalogue merchant has, for each customer, the sets of products that the customer buys in every purchase order. A sequential pattern function analyzes such collections of related records and detects frequently occurring patterns of products bought over time. Sequential pattern mining functions are quite powerful and can be used to detect the set of customers associated with some frequent buying patterns. Example: If a patient underwent cardiac bypass surgery for blocked arteries and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months.
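A very small version of the catalogue merchant case can be sketched as follows. It counts, for every ordered pair of products, how many customers bought the first product in an earlier order and the second in a later one. This is a deliberately simplified stand-in for a real sequential pattern algorithm; the customer identifiers, product names and the min_customers threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_sequences(histories, min_customers):
    """Return ordered product pairs (a, b) bought in that order by enough customers."""
    counts = Counter()
    for orders in histories.values():
        seen = set()
        # combinations keeps list order, so `earlier` always precedes `later`.
        for earlier, later in combinations(orders, 2):
            for a in earlier:
                for b in later:
                    seen.add((a, b))
        counts.update(seen)  # each customer counts at most once per pair
    return {pair: n for pair, n in counts.items() if n >= min_customers}

# Each customer maps to a time-ordered list of purchase orders (sets of products).
histories = {
    "c1": [{"tent"}, {"sleeping bag"}, {"stove"}],
    "c2": [{"tent", "boots"}, {"stove"}],
    "c3": [{"stove"}, {"tent"}],
}

patterns = frequent_sequences(histories, min_customers=2)
print(patterns)
```

Only the pair (tent, stove) is supported by two customers here; customer c3 bought the same products in the opposite order and therefore does not support it.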
CLUSTERING/SEGMENTATION: Clustering and segmentation are the processes of creating a partition so that all the members of each set of the partition are similar according to some metric. A cluster is a set of objects grouped together because of their similarity or proximity. Objects are often decomposed into an exhaustive and/or mutually exclusive set of clusters. In clustering, a given population of events or items is partitioned into sets of “similar” elements. Clustering according to similarity is a very powerful technique; the key to it is translating some intuitive measure of similarity into a quantitative measure. When learning is unsupervised, the system has to discover its own classes, i.e. the system clusters the data in the database. The system has to discover subsets of related objects in the training set and then find descriptions of each of these subsets. There are a number of approaches for forming clusters. One approach is to form rules that dictate membership in the same group based on the level of similarity between members. Another approach is to build set functions that measure some property of partitions as functions of some parameter of the partition.
Example: The adult population in the US may be categorized into five groups, from “most likely to buy” to “least likely to buy”.
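One common way to turn similarity into a quantitative measure is to use the distance between points and to group each point with its nearest cluster center. The sketch below uses k-means, a technique the text above does not name and which is chosen here only as an illustration; the sample points and the starting centroids are assumptions.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance serves as the quantitative similarity measure.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Two visually obvious groups of two-dimensional points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The resulting partition is both exhaustive and mutually exclusive: every point belongs to exactly one cluster. No class labels are supplied, which is what makes the learning unsupervised.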
DATA MINING TECHNIQUES:
The most commonly used techniques are:
Neural networks: A computing model based on the architecture of the brain. A neural network consists of multiple simple processing units connected by adaptive weights. An artificial neural network is a non-linear predictive model that learns through training and resembles biological neural networks in structure.
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.
Rule induction: The extraction of useful if-then rules from data based on statistical significance.
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
Evolutionary programming: At present this is the youngest and evidently the most promising branch of data mining. The underlying idea of the method is that the system automatically formulates hypotheses about the dependence of the target variable on other variables, in the form of programs expressed in an internal programming language.
The neural network technique is discussed in more detail below.
Neural networks are an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations to model nervous system learning. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyze. This expert can then be used to provide projections given new situations of interest and answer "what if" questions. Neural networks have broad applicability to real world business problems and have already been successfully applied in many industries. Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, including sales forecasting, industrial process control, customer research, data validation, risk management and target marketing. Neural networks use a set of processing elements (or nodes) analogous to neurons in the brain. These processing elements are interconnected in a network that can then identify patterns in data once it is exposed to the data, i.e. the network learns from experience just as people do. This distinguishes neural networks from traditional computing programs that simply follow instructions in a fixed sequential order. The structure of a neural network looks something like the following:
[Figure: Structure of a neural network]
The bottom layer represents the input layer, in this case with five inputs labeled X1 through X5. In the middle is the hidden layer, with a variable number of nodes. It is the hidden layer that performs much of the work of the network. The output layer in this case has two nodes, Z1 and Z2, representing the output values we are trying to determine from the inputs. Each node in the hidden layer is fully connected to the inputs, which means that what is learned in a hidden node is based on all the inputs taken together. Statisticians maintain that the network can pick up the interdependencies in the model. The following diagram provides some detail inside a hidden node:
Simply speaking, a weighted sum is performed: X1 times W1 plus X2 times W2, and so on through X5 times W5. This weighted sum is performed for each hidden node and each output node and is how interactions are represented in the network. Example: A neural network can be used to predict sales (output) in shops based on past sales, price and season (inputs).
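The weighted sum at each node can be written out in a few lines. The sketch below runs one forward pass through a network with five inputs, three hidden nodes and two outputs. The weights are arbitrary illustrative values rather than trained ones, and the sigmoid squashing function applied after each weighted sum is an assumption, since the description above mentions only the sum itself.

```python
import math

def weighted_sum(inputs, weights):
    """X1*W1 + X2*W2 + ... as performed inside every hidden and output node."""
    return sum(x * w for x, w in zip(inputs, weights))

def sigmoid(v):
    """Assumed activation: squashes the weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def forward(inputs, hidden_weights, output_weights):
    """Each hidden node sees all inputs; each output node sees all hidden nodes."""
    hidden = [sigmoid(weighted_sum(inputs, w)) for w in hidden_weights]
    return [sigmoid(weighted_sum(hidden, w)) for w in output_weights]

x = [1.0, 0.5, -1.0, 2.0, 0.0]            # inputs X1..X5
hidden_weights = [                         # one row of W1..W5 per hidden node
    [0.2, -0.4, 0.1, 0.3, 0.5],
    [-0.3, 0.8, 0.2, -0.1, 0.4],
    [0.5, 0.5, -0.5, 0.0, 0.1],
]
output_weights = [                         # one row per output node Z1, Z2
    [0.6, -0.2, 0.1],
    [-0.5, 0.3, 0.9],
]

z = forward(x, hidden_weights, output_weights)
print(z)
```

Training, which is not shown here, would consist of adjusting the weights until the outputs Z1 and Z2 match known results for historical inputs.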
BENEFITS OF DATA MINING:
Data mining automates the process of finding predictive information in large databases. Data mining tools sweep through the databases and identify previously hidden patterns in one step, providing the automated discovery of previously unknown patterns. The databases analyzed can be larger, with more columns and rows. High performance data mining allows users to explore the full depth of a database without pre-selecting a subset of variables. Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products are developed.
When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes.
DEFINITION OF DATA WAREHOUSING
A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions. Once gathered, the data are stored for a long time, permitting access to historical data. Thus, data warehouses provide the user with a single consolidated interface to data, making decision support queries easier to write.
DATA MART: A data mart is a segment of a data warehouse that can provide data for reporting and analysis on a section, unit, department or operation in the company, e.g. sales, payroll, and production. Data marts are sometimes complete individual data warehouses, which are usually smaller than the corporate data warehouse.
DATA WAREHOUSE ARCHITECTURE
A Data Warehouse Architecture (DWA) is a way of representing the overall structure of data, communication, processing and presentation that exists for end-user computing within the enterprise. The architecture is made up of a number of interconnected parts.
[Figure: Data warehouse architecture. External data sources and operational databases are extracted, cleaned, transformed, loaded and refreshed into the data warehouse, which is described by a metadata repository and serves OLAP, data mining and visualization tools.]
CREATING A WAREHOUSE
Data is extracted from operational databases and external sources, cleaned to minimize errors and fill in missing information when possible, and transformed to reconcile semantic mismatches. Transforming data is typically accomplished by defining a relational view over the tables in the data sources. Loading data consists of materializing such views and storing them in the warehouse. Unlike a standard view in a relational DBMS, therefore, the view is stored in a database (the warehouse) that is different from the databases containing the tables it is defined over. The cleaned and transformed data is finally loaded into the warehouse. Data is partitioned and indexes are built for efficiency. Loading a terabyte of data sequentially can take weeks, and loading even gigabytes can take hours. Parallelism is therefore important for loading warehouses. After data is loaded into a warehouse, additional measures must be taken to ensure that the data in the warehouse is periodically refreshed to reflect updates to the data sources, and that old data is periodically purged.
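The extract, clean, transform and load sequence can be sketched with two separate SQLite databases standing in for an operational source and the warehouse. The table names, column names and sample rows are hypothetical, and cleaning is reduced here to trimming and lower-casing names and skipping rows with a missing amount, which is a simplification of what a real warehouse load would do.

```python
import sqlite3

# Operational source database (hypothetical schema and data).
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, ' alice ', 10.0),
        (2, 'BOB', NULL),
        (3, 'alice', 5.0);
""")

# Extract, clean and transform: the query plays the role of a relational view
# defined over the source tables.
rows = source.execute("""
    SELECT LOWER(TRIM(customer)) AS customer, SUM(amount) AS total
    FROM orders
    WHERE amount IS NOT NULL
    GROUP BY LOWER(TRIM(customer))
""").fetchall()

# Load: materialize the view's result in a different database, the warehouse,
# and build an index for efficient access.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customer_totals (customer TEXT, total REAL)")
warehouse.executemany("INSERT INTO customer_totals VALUES (?, ?)", rows)
warehouse.execute("CREATE INDEX idx_customer ON customer_totals (customer)")
warehouse.commit()

loaded = warehouse.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
print(loaded)
```

A periodic refresh would repeat the extraction against the updated source and replace or merge the stored rows, while a purge would delete rows older than some retention limit.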
MAINTAINING A WAREHOUSE:
An important task in maintaining a warehouse is keeping track of the data currently stored in it; this bookkeeping is done by storing information about the warehouse data in the system catalogs. The system catalogs associated with a warehouse are very large and are often stored and managed in a separate database called a metadata repository. The size and complexity of the catalogs is in part due to the size and complexity of the warehouse itself and in part because a lot of administrative information must be maintained. The data in a warehouse is typically accessed and analyzed using a variety of tools, including OLAP query engines, data mining algorithms, information visualization tools, statistical packages, and report generators.
DATA WAREHOUSE MODEL
Data warehousing is the process of extracting and transforming operational data into informational data and loading it into a central data store or warehouse.
The structure of data inside the data warehouse
Once the data is loaded it is accessible via desktop query and analysis tools by the decision makers. The data within the actual warehouse itself has a distinct structure with the emphasis on different levels of summarization.
Current detail data: The heart of a data warehouse is its current detail data, where the bulk of the data resides. It reflects the most recent happenings, which are usually the most interesting. The data is voluminous because it is stored at the lowest level of granularity, and it is almost always stored on disk, which is fast to access but expensive and complex to manage.
Older detail data is stored on some form of mass storage; it is infrequently accessed and is stored at a level of detail consistent with the current detail data.
Lightly summarized data is data distilled from the low level of detail found at the current detail level, and it is generally stored on disk. When building the data warehouse, one has to consider the unit of time over which summarization is done and the contents, that is, which attributes the summarized data will contain.
Highly summarized data is compact and easily accessible and can even be found outside the warehouse.
Metadata is the final component of the data warehouse and is really of a different dimension, in that it is not the same as data drawn from the operational environment. Instead it is used as a directory to help the DSS analyst locate the contents of the data warehouse, as a guide to the mapping of data as it is transformed from the operational environment to the data warehouse environment, and as a guide to the algorithms used for summarization between the current detail data, the lightly summarized data and the highly summarized data.
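The relationship between the levels of summarization can be illustrated with a few rows of detail data. In this sketch the weekly and monthly units of time, the row layout and the sample figures are all assumptions; the point is only that each summary level is distilled from the detail level by grouping over a coarser key.

```python
from collections import defaultdict

# Hypothetical current detail data: (month, week number, sale amount).
detail = [
    ("2024-01", 1, 100.0),
    ("2024-01", 1, 50.0),
    ("2024-01", 2, 25.0),
    ("2024-02", 5, 10.0),
]

def summarize(rows, key):
    """Total the amount column for each distinct value of key(row)."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)

# Lightly summarized data: one total per week.
weekly = summarize(detail, key=lambda r: (r[0], r[1]))
# Highly summarized data: one total per month.
monthly = summarize(detail, key=lambda r: r[0])

print(weekly)
print(monthly)
```

The choice of grouping key corresponds to the unit of time mentioned above, and the choice of which column to total corresponds to the attributes the summarized data will contain.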
DATA WAREHOUSE SCHEMAS
The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains a large central table (the fact table) containing the bulk of the data with no redundancy, and a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
Snowflake schema: The snowflake schema is a variant of the star schema model in which some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake. The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancies.
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
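A star schema can be sketched with one fact table and two dimension tables in SQLite. The table names (fact_sales, dim_product, dim_time), the columns and the sample rows are hypothetical; they are meant only to show how an aggregate query joins the central fact table to its dimensions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Dimension tables: one small table per dimension.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, quarter TEXT);

    -- Central fact table: holds the bulk of the data and refers to each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        time_id INTEGER REFERENCES dim_time(time_id),
        units INTEGER
    );

    INSERT INTO dim_product VALUES (1, 'pen'), (2, 'ink');
    INSERT INTO dim_time VALUES (1, 'Q1'), (2, 'Q2');
    INSERT INTO fact_sales VALUES (1, 1, 10), (1, 2, 15), (2, 1, 4), (1, 1, 5);
""")

# Aggregate query: total units per product and quarter.
totals = db.execute("""
    SELECT p.name, t.quarter, SUM(f.units)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time t ON t.time_id = f.time_id
    GROUP BY p.name, t.quarter
    ORDER BY p.name, t.quarter
""").fetchall()
print(totals)
```

A snowflake variant would split a dimension such as dim_product into further normalized tables (for example a separate product category table), at the cost of one more join per query.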
BENEFITS OF DATA WAREHOUSING:
Data warehouses are designed to perform well with aggregate queries running on large amounts of data. The structure of a data warehouse is easier for end users to navigate, understand and query against than that of relational databases designed primarily to handle large numbers of transactions.
Data warehouses enable queries that cut across different segments of a company's operation. E.g. production data could be compared against inventory data even if they were originally stored in different databases with different structures. Queries that would be complex in very normalized databases could be easier to build and maintain in data warehouses, decreasing the workload on transaction systems. Data warehousing is an efficient way to manage and report on data that is from a variety of sources, non-uniform and scattered throughout a company. Data warehousing is an efficient way to manage demand for lots of information from lots of users. Data warehousing provides the capability to analyze large amounts of historical data for nuggets of wisdom that can provide an organization with competitive advantage.
Data mining and data warehousing have many and varied fields of application, some of which are listed below.
1. Retail/Marketing
2. Banking
3. Insurance and Health Care
4. Transportation
5. Financial Data Analysis
6. Biomedical and DNA Data Analysis
7. Telecommunication Industry
Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users. Comprehensive data warehouses that integrate operational data with customer, supplier and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis on an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users' ability to effectively analyze and act on the information they contain. Over the next few years, the growth of data warehousing and data mining is going to be enormous, with new products and technologies coming out frequently.
“TECHNOLOGY WILL ALWAYS BE IMPROVING.” This is today, and this is real, not fantasy. What does tomorrow hold? Watch this space. Thus, data mining and data warehousing have an endless path.