Introduction - Wright State University

                            Dr. Yan Liu
Department of Biomedical, Industrial and Human Factors Engineering
                     Wright State University
                       What Is Data Mining (DM)?

   Two Views of DM
       As a synonym for “knowledge discovery in databases (KDD)”
            Discovery of interesting (non-trivial, implicit, previously unknown and potentially
             useful) patterns or knowledge from large amounts of data
       As an essential step in KDD
            Extraction of patterns or models from data using algorithms
                Decision tree, neural network, association rules, statistical methods, etc.

          [Figure: KDD Process (adapted from Han & Kamber, 2006)]

                             What Is DM? (Cont.)
   Watch Out: Is Everything DM?
       Database system or information retrieval system
            Deductive query and information retrieval in databases
       Small machine learning system or statistical analysis tool
            Do not handle large amounts of data
   Machine Learning
       An important technique applied in DM
       Design and development of algorithms that allow computers to "learn" from
        given facts
       e.g. Decision tree, neural network, cluster analysis

               Why Not Traditional Data Analysis?

   Huge Amounts of Data
       Algorithms must be highly scalable to handle data volumes such as terabytes
   High-dimensionality of Data
       Microarray data may have tens of thousands of dimensions
   High Complexity of Data
       Data streams and sensor data
       Time-series data, temporal data, sequence data
       Graphs, social networks and multi-linked data
       Heterogeneous databases and legacy databases
       Spatial, spatiotemporal, multimedia, text and Web data
       etc.

                                      Flat File
   Two-Dimensional File
       Rows (corresponding to data records)
       Columns (corresponding to attributes)
       e.g. MS Excel spreadsheets, .txt files, etc.
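
As a rough illustration of the row/column structure described above, the sketch below reads a small comma-separated flat file with Python's standard `csv` module. The file contents, the column names, and the helper `read_flat_file` are invented for the example.

```python
import csv
import io

# Hypothetical flat-file contents; in practice this would be read from a .txt or .csv file.
FLAT_FILE = """cust_ID,name,age
C1,Candy Smith,31
C2,John Doe,45
"""

def read_flat_file(text):
    """Return (attributes, records): the header columns and one dict per row."""
    reader = csv.DictReader(io.StringIO(text))
    records = list(reader)
    return reader.fieldnames, records

attributes, records = read_flat_file(FLAT_FILE)
print(attributes)           # the columns (attributes)
print(records[0]["name"])   # one attribute value of the first row (data record)
```

Each row becomes one data record and each header entry becomes one attribute, which is all the structure a flat file carries.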

                         Relational Databases

   Consists of a Collection of Tables
       Each table has a unique name, a set of attributes (columns or fields), and a
        number of data records (rows or tuples)
       Each record represents an object identified by a unique key and described by a
        set of attribute values
   Entity-Relationship (ER) Data Model
       Represents database as a set of entities and their relationships

Customer
cust_ID   name           address                        age   income    …
C1        Smith, Candy   1223 Lake Ave., Chicago, IL    31    $78,000   …
…         …              …                              …     …         …

Item
item_ID   name        brand     category          type   …
I3        hi-res-TV   Toshiba   High resolution   TV     …
…         …           …         …                 …      …

Employee
empl_ID   name          category             group     salary     …
E55       Jones, Jane   home entertainment   manager   $118,000   …
…         …             …                    …         …          …

Branch
branch_ID   name          address
B1          City Square   396 Michigan Ave., Chicago, IL
…           …             …

Sales (used to represent relationships between tables Customer and Employee)
trans_ID   cust_ID   empl_ID   date         time    amount
T100       C1        E55       12/12/2008   15:45   $1,347
…          …         …         …            …       …

Item_Sold
trans_ID   item_ID   qty
T100       I3        1
…          …         …

Works_At
empl_ID   branch_ID
E55       B1
…         …

 Fragments of a Relational Database of the AllElectronics Company
   AllElectronics is an international company with branches around the
   world. Each branch has its own databases. You are asked to analyze the sales
   for the entire company. What would you do?

Short-term solution: create and execute SQL (Structured Query Language) queries

  Long-term solution: build a data warehouse
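
The short-term SQL approach can be sketched as follows: run the same aggregate query against each branch database and combine the results in application code. This is only a minimal sketch using Python's standard `sqlite3` module. The in-memory databases, the simplified `sales` table layout, and the sample rows are assumptions made for illustration.

```python
import sqlite3

def make_branch_db(rows):
    """Create an in-memory database standing in for one branch's sales data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (trans_ID TEXT, cust_ID TEXT, empl_ID TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    return conn

# Hypothetical branch databases (rows are made up for illustration).
branches = {
    "B1": make_branch_db([("T100", "C1", "E55", 1347.0), ("T101", "C2", "E55", 200.0)]),
    "B2": make_branch_db([("T200", "C3", "E60", 500.0)]),
}

def total_sales(conn):
    """Run the same aggregate query against one branch database."""
    (total,) = conn.execute("SELECT COALESCE(SUM(amount), 0) FROM sales").fetchone()
    return total

per_branch = {name: total_sales(conn) for name, conn in branches.items()}
company_total = sum(per_branch.values())
print(per_branch, company_total)
```

Every new question requires revisiting every branch database in this way, which is one reason a data warehouse is the preferred long-term solution.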

                             Data Warehouse
   What Is a Data Warehouse?
       A repository of data collected from multiple sources, stored under a unified
        schema, and usually residing at a single site
       Constructed through a process of data cleaning, data integration, data
        transformation, data loading, and periodic data refreshing
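
The construction steps listed above (cleaning, integration, transformation, loading) can be sketched in a few lines. The two source layouts, the field names, and the helpers `clean`, `to_unified`, and `load` are hypothetical; a real warehouse would use dedicated ETL tooling and would also handle periodic refreshing.

```python
# Two hypothetical sources with different field names and formats.
source_a = [
    {"id": "T100", "amt": "$1,347", "branch": "B1"},
    {"id": "T101", "amt": None, "branch": "B1"},       # missing amount
]
source_b = [{"trans": "T200", "amount_usd": 500.0, "site": "B2"}]

def clean(rows, required):
    """Data cleaning: drop rows in which a required field is missing."""
    return [r for r in rows if r.get(required) is not None]

def to_unified(row, id_key, amount_key, branch_key):
    """Integration and transformation: map a source row onto the unified schema."""
    amount = row[amount_key]
    if isinstance(amount, str):
        # Convert a formatted string such as "$1,347" to a number.
        amount = amount.replace("$", "").replace(",", "")
    return {
        "trans_ID": row[id_key],
        "amount": float(amount),
        "branch_ID": row[branch_key],
    }

def load(warehouse, rows):
    """Loading: append the transformed rows to the warehouse table."""
    warehouse.extend(rows)

warehouse = []
load(warehouse, [to_unified(r, "id", "amt", "branch") for r in clean(source_a, "amt")])
load(warehouse, [to_unified(r, "trans", "amount_usd", "site")
                 for r in clean(source_b, "amount_usd")])
print(warehouse)
```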

              Key Characteristics of a Data Warehouse
   Subject-Oriented
       Organized around major subjects (e.g. customer, item, sales)
       Focuses on the modeling and analysis of data for decision makers (not on daily
        transaction processing)
       Provides a concise view around particular subject issues for decision support
   Integrated
       Constructed by integrating multiple, heterogeneous data sources
            Relational databases, flat files, on-line transaction records
   Time-Variant
       All data in the data warehouse is associated with a particular time period
       The time horizon for the data warehouse is significantly longer than that of
        operational systems
            Operational database: records current data
            Data warehouse: provides information from a historical perspective

     Key Characteristics of a Data Warehouse (Cont.)
   Non-Volatile
       A physically separate store of data transformed from the operational environment
       Operational update of data does not occur in the data warehouse environment
            Does not require transaction processing, recovery, and concurrency control
            Requires only two data-access operations
                Initial loading of data
                Later access of data

                                  Data Cubes
   What Is a Data Cube?
       Allows data to be modeled and viewed in multiple dimensions
       Data warehouses are based on a multidimensional data model
   Dimensions
       Perspectives or entities with respect to which an organization wants to keep
        records
       e.g. AllElectronics may create a sales data warehouse in order to keep records of
        the store’s sales with respect to the dimensions time, item, branch, and location

A 3-D Data Cube Representation of Sales Data for AllElectronics
           (Dimensions of Item, Time, and Location)

A 4-D Data Cube Representation of Sales Data for AllElectronics
       (Dimensions of Item, Time, Location, and Supplier)
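
To make the multidimensional model concrete, the sketch below builds a small 3-D cube keyed by (item, time, location) and then sums out dimensions, an operation commonly called roll-up. The sales facts and the helper names `build_cube` and `roll_up` are invented for illustration, and a plain dictionary stands in for a real cube engine.

```python
from collections import defaultdict

# Hypothetical sales facts: (item, quarter, location, amount).
facts = [
    ("TV", "Q1", "Chicago", 100.0),
    ("TV", "Q1", "Chicago", 50.0),
    ("TV", "Q2", "Toronto", 70.0),
    ("Phone", "Q1", "Chicago", 30.0),
]

def build_cube(rows):
    """Aggregate amounts into cells keyed by (item, time, location)."""
    cube = defaultdict(float)
    for item, time, location, amount in rows:
        cube[(item, time, location)] += amount
    return dict(cube)

def roll_up(cube, keep):
    """Sum the cube over every dimension whose index is not listed in `keep`."""
    result = defaultdict(float)
    for cell, amount in cube.items():
        result[tuple(cell[i] for i in keep)] += amount
    return dict(result)

cube = build_cube(facts)
by_item = roll_up(cube, keep=(0,))           # total sales per item
by_item_time = roll_up(cube, keep=(0, 1))    # total sales per item and quarter
print(by_item)
print(by_item_time)
```

Adding a fourth dimension such as supplier only lengthens the cell key; the aggregation logic stays the same.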

              Data Mining: Classification Schemes
   General Functionalities
        Descriptive Tasks
             Characterize the general properties of the data
        Predictive Tasks
             Perform inference on the current data in order to make predictions
   Different Views Lead to Different Classifications
        Data view: kinds of data to be mined
             Relational, data warehouse, transactional, multimedia, etc.
        Knowledge view: kinds of knowledge to be discovered
             Characterization, discrimination, association, classification, regression, clustering,
              trend/deviation, outlier analysis, etc.
        Method view: kinds of techniques utilized
             Machine learning, statistics, visualization, database-oriented
        Application view: kinds of applications adapted
             Retail, telecommunication, banking, fraud analysis

                             Class Description
   Classes
       e.g. Customers of a bank can be classified into those with “good credit” and
        “bad credit”; grades of students in a class include “A”, “B”, “C”, and “D”
   Data Characterization
       Summarize the data in each class
       e.g. summarize the distributions of age, educational level, and household income
        of customers that have “good credit” or “bad credit”
   Data Discrimination
       Compare data in different classes
       e.g. compare customers with “good credit” and those with “bad credit” in their
        distributions of age, educational level, and household income
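
The difference between characterization and discrimination can be sketched with per-class summaries. The customer records and the helpers `characterize` and `discriminate` below are hypothetical, and mean values stand in for the fuller distributions mentioned above.

```python
from statistics import mean

# Hypothetical customers: (credit class, age, household income).
customers = [
    ("good", 45, 90000), ("good", 38, 72000), ("good", 52, 108000),
    ("bad", 24, 28000), ("bad", 31, 35000),
]

def characterize(rows):
    """Data characterization: summarize each class by size, mean age, and mean income."""
    summary = {}
    for label in sorted({r[0] for r in rows}):
        members = [r for r in rows if r[0] == label]
        summary[label] = {
            "n": len(members),
            "mean_age": mean(r[1] for r in members),
            "mean_income": mean(r[2] for r in members),
        }
    return summary

def discriminate(summary, a, b):
    """Data discrimination: difference between the summaries of two classes."""
    return {k: summary[a][k] - summary[b][k] for k in ("mean_age", "mean_income")}

summary = characterize(customers)
print(summary)
print(discriminate(summary, "good", "bad"))
```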

               Mining Frequent Patterns, Associations,
                         and Correlations
    Frequent Patterns
         Patterns that occur frequently in data
              Itemsets: a set of items that frequently appear together in a transactional dataset
              Subsequences: a set of events that frequently occur in a particular sequence
              Substructures: a set of structures (such as graphs, trees, lattices) that appear
               frequently
    Association Mining
         Discovery of frequent patterns, associations and correlations

    Association Rules
    Computer => Software (support=1%, confidence=50%)
    Age(20,29] and Income(20K, 29K] => CD Player (support=2%, confidence=60%)
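
As a rough sketch of frequent itemset discovery, the code below counts every itemset of up to two items by brute force and keeps those whose support meets a minimum threshold. This is not a scalable algorithm such as Apriori; the transactions, the threshold, and the function name `frequent_itemsets` are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to `max_size` items whose
    support (fraction of transactions containing them) is >= min_support."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[frozenset(combo)] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in frequent.items():
    print(sorted(itemset), support)
```

Brute-force counting grows quickly with the number of items per transaction, which is why practical miners prune the candidate itemsets.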

                     Classification and Prediction
   Classification
       Process of finding a model that describes and distinguishes data classes, for the
        purpose of being able to use the model to predict the class of objects whose class
        label (categorical, unordered) is unknown
   Prediction
       Models continuous-valued functions to predict the missing or unavailable
        numerical data values
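
The contrast between the two tasks can be sketched with two deliberately simple models: a 1-nearest-neighbor classifier that predicts a categorical label, and a least-squares line that predicts a numerical value. Both techniques and all data are chosen here only for illustration; the slides do not prescribe a particular method.

```python
def classify_1nn(training, point):
    """Classification: return the label of the closest training example.
    `training` is a list of (features, label) pairs."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    _, label = min(training, key=lambda ex: sq_dist(ex[0], point))
    return label

def fit_line(xs, ys):
    """Prediction: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: (age, income in $1000s) -> credit class.
training = [((25, 20), "bad"), ((30, 25), "bad"), ((45, 80), "good"), ((50, 95), "good")]
print(classify_1nn(training, (48, 90)))   # predicted class label

# Hypothetical data: years of experience -> salary in $1000s.
slope, intercept = fit_line([1, 2, 3, 4], [30, 40, 50, 60])
print(slope * 5 + intercept)              # predicted numerical value for x = 5
```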

                                 Cluster Analysis
   Functions
       Analyze data without consulting a known class label
       Divide data into groups (clusters) so that objects within the same cluster are
        similar while those belonging to different clusters differ substantially
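
One common way to form such groups is k-means, sketched below on one-dimensional data. The slides do not name a specific clustering algorithm, so k-means, the starting centers, and the sample values are all assumptions for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """A basic k-means loop on 1-D data, starting from the given centers.
    Returns the final centers and the clusters from the last assignment step."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to its cluster mean (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

No class labels are used anywhere: the groups emerge only from how close the values are to one another.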

                                 Outlier Analysis
   Function
       Identify objects that do not comply with the general pattern of the data

Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases
of extremely large amounts for a given account number in comparison to regular
charges incurred by the same account.
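
The credit card example can be sketched with a simple robust rule: flag any charge whose distance from the account's median charge is many times the median absolute deviation (MAD). The rule, the threshold of 5, and the sample charges are assumptions for illustration, not a recommended fraud detector.

```python
from statistics import median

def flag_outliers(amounts, threshold=5.0):
    """Flag amounts whose distance from the median exceeds `threshold` times
    the median absolute deviation (MAD)."""
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        # At least half of the amounts equal the median; this simple rule gives up.
        return []
    return [a for a in amounts if abs(a - med) / mad > threshold]

# Hypothetical charges for one account, including one extremely large purchase.
charges = [42.0, 55.0, 38.0, 61.0, 47.0, 5200.0, 50.0]
print(flag_outliers(charges))
```

The median and MAD are used instead of the mean and standard deviation because a single extreme charge distorts them far less.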

                              Evolution Analysis
   Function
       Describes and models regularities or trends for objects whose behavior changes
        over time

Suppose you have the major stock market (time-series) data of the last several years
available from the New York Stock Exchange and you would like to invest in shares
of high-tech industrial companies. A data mining study of stock exchange data may
identify stock evolution regularities for overall stocks and for the stocks of particular
companies. Such regularities may help predict future trends in stock market prices,
contributing to your decision making regarding stock investments.
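
A very small sketch of trend detection in a time series: smooth the series with a moving average, then compare the start and end of the smoothed curve. The price series, the window size, and the helper names are invented, and real evolution analysis would use far richer models than this.

```python
def moving_average(series, window):
    """Smooth a time series with a simple trailing moving average."""
    return [
        sum(series[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(series))
    ]

def trend_direction(series, window=3):
    """Label the overall trend by comparing the first and last smoothed values."""
    smoothed = moving_average(series, window)
    if smoothed[-1] > smoothed[0]:
        return "up"
    if smoothed[-1] < smoothed[0]:
        return "down"
    return "flat"

# Hypothetical closing prices for one stock.
prices = [10.0, 11.0, 10.5, 12.0, 12.5, 13.0]
print(moving_average(prices, 3))
print(trend_direction(prices))
```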

                         Interestingness of Patterns
   Not all patterns generated by a DM system are interesting
   When Is a Pattern Interesting?
       Can be easily understood
       Valid on test data with some degree of certainty
       Potentially useful
       Novel
   Measures of Interestingness
       Objective
            Support(X ⇒ Y) = P(X and Y)
            Confidence(X ⇒ Y) = P(Y | X)
       Subjective
            Based on user beliefs in the data
            Unexpected (contrary to user’s belief)
            Actionable (offer strategic information on which user can act)
            Supports a hypothesis that the user sought to confirm
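
The two objective measures can be computed directly from their definitions by estimating the probabilities as fractions of transactions. The transactions and the function names `support` and `confidence` below are made up for illustration.

```python
def support(transactions, itemset):
    """P(X and Y): fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """P(Y | X): support of X and Y together divided by the support of X."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

# Rule: computer => software
rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)
```

Here two of the four transactions contain both items, and two of the three transactions containing a computer also contain software.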
