Introduction - Wright State University

					                             Introduction

                            Dr. Yan Liu
Department of Biomedical, Industrial and Human Factors Engineering
                     Wright State University
                       What Is Data Mining (DM)?

   Two Views of DM
       As a synonym for “knowledge discovery in databases (KDD)”
            Discovery of interesting (non-trivial, implicit, previously unknown and potentially
             useful) patterns or knowledge from large amounts of data
       As an essential step in KDD
            Extraction of patterns or models from data using algorithms
                Decision tree, neural network, association rules, statistical methods, etc.




                               KDD Process
                   (Adapted from Han & Kamber (2006))
   [Figure: Databases -> Cleaning and Integration -> Data Warehouse -> Selection and
    Transformation -> Task-relevant Data -> Modeling -> Patterns -> Evaluation and
    Presentation]

                             What Is DM? (Cont.)
   Watch Out: Is Everything DM?
       Database system or information retrieval system
            Deductive query and information retrieval in databases
       Small machine learning system or statistical analysis tool
            Do not handle large amounts of data
   Machine Learning
       An important technique applied in DM
       Design and development of algorithms that allow computers to "learn" from
        given facts
       e.g. Decision tree, neural network, cluster analysis




               Why Not Traditional Data Analysis?

   Huge Amounts of Data
       Algorithms must be highly scalable to handle volumes such as terabytes of data
   High-dimensionality of Data
       Micro-array may have tens of thousands of dimensions
   High Complexity of Data
       Data streams and sensor data
       Time-series data, temporal data, sequence data
       Graphs, social networks and multi-linked data
       Heterogeneous databases and legacy databases
       Spatial, spatiotemporal, multimedia, text and Web data
       etc.


                                      Flat File
   Two-Dimensional File
       Rows (corresponding to data records)
       Columns (corresponding to attributes)
       e.g. MS Excel spreadsheets, .txt files, etc.
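
The row/column structure of a flat file can be illustrated with a minimal Python sketch. The file content below is made up for illustration (only the column names cust_ID, name and age echo the Customer table used later in these slides), and an in-memory string stands in for a real .txt or .csv file.

```python
import csv
import io

# Hypothetical flat-file content; a real file would be opened with open(path, newline="").
flat_file = io.StringIO(
    "cust_ID,name,age\n"
    "C1,Candy Smith,31\n"
    "C2,Jane Jones,45\n"
)

# Each row becomes one data record (a dict); each column header is an attribute name.
records = list(csv.DictReader(flat_file))
attributes = list(records[0].keys())

print(attributes)         # the columns
print(records[0]["age"])  # one attribute value of the first record (read as text)
```

Note that a flat file carries no type information, so every value is read as a string until it is converted.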




                         Relational Databases

   Consists of a Collection of Tables
       Each table has a unique name, a set of attributes (columns or fields), and a
        number of data records (rows or tuples)
       Each record represents an object identified by a unique key and described by a
        set of attribute values
   Entity-Relationship (ER) Data Model
       Represents database as a set of entities and their relationships




Customer
cust_ID    name               address                        age       income    …
  C1    Smith, Candy 1223 Lake Ave., Chicago, IL              31       $78,000   …
  …          …                  …                             …           …      …
Item
 item_ID      name               brand                 category         type     …
    I3      hi-res-TV           Toshiba             High resolution      TV      …
    …           …                  …                      …              …       …
Employee
empl_ID    name                   category                 group       salary    …
  E55   Jones, Jane           home entertainment          manager     $118,000   …
   …        …                        …                       …           …       …
Branch
branch_ID        name                       address
   B1         City Square         396 Michigan Ave., Chicago, IL
    …              …                           …
Sales (used to represent relationships between tables Customer and Employee)
trans_ID cust_ID empl_ID    date              time         amount
  T100     C1      E55   12/12/2008          15:45         $1,347
   …       …        …        …                  …            …
Item_Sold
trans_ID item_ID        qty
  T100      I3           1
   …        …           …
Works_At
empl_ID      branch_ID
  E55           B1
   …             …

 Fragments of a Relational Database of the AllElectronics Company
   AllElectronics is an international company, with branches around the
   world. Each branch has its own databases. You are asked to analyze the sales
   for the entire company. What would you do?

Short-term solution: create and execute SQL (Structured Query Language) queries

  Long-term solution: build a data warehouse
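
The short-term solution can be sketched with Python's built-in sqlite3 module. This is only an illustration under simplifying assumptions: the Sales and Works_At tables are reduced to a few columns, all rows except T100 are made up, and the data is assumed to sit in a single database already (in the scenario above, the per-branch databases would first have to be queried or combined).

```python
import sqlite3

# In-memory database standing in for one branch database (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sales (trans_ID TEXT PRIMARY KEY, cust_ID TEXT, empl_ID TEXT, amount REAL);
CREATE TABLE Works_At (empl_ID TEXT, branch_ID TEXT);
INSERT INTO Sales VALUES
    ('T100', 'C1', 'E55', 1347),
    ('T101', 'C2', 'E55', 200),
    ('T102', 'C1', 'E60', 53);
INSERT INTO Works_At VALUES ('E55', 'B1'), ('E60', 'B2');
""")

# Total sales per branch: join each sale to the branch of the employee who made it.
rows = conn.execute("""
SELECT w.branch_ID, SUM(s.amount) AS total
FROM Sales AS s
JOIN Works_At AS w ON s.empl_ID = w.empl_ID
GROUP BY w.branch_ID
ORDER BY w.branch_ID
""").fetchall()

for branch_id, total in rows:
    print(branch_id, total)
```

Repeating such queries against every branch database and merging the results by hand is what makes this a short-term fix; a data warehouse performs the integration once, under one schema.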




                             Data Warehouse
   What Is a Data Warehouse?
       A repository of data collected from multiple sources, stored under a unified
        schema, and that usually resides at a single site
       Constructed through a process of data cleaning, data integration, data
        transformation, data loading, and periodic data refreshing




              Key Characteristics of a Data Warehouse
   Subject-Oriented
       Organized around major subjects (e.g. customer, item, sales)
       Focuses on the modeling and analysis of data for decision makers (not on daily
        transaction processing)
       Provides a concise view around particular subject issues for decision support
   Integrated
       Constructed by integrating multiple, heterogeneous data sources
            Relational databases, flat files, on-line transaction records
   Time-Variant
       All data in the data warehouse is identified with a particular time period
       The time horizon for the data warehouse is significantly longer than that of
        operational systems
            Operational database: record current data
            Data warehouse: provide information from a historical perspective



     Key Characteristics of a Data Warehouse (Cont.)
   Non-Volatile
       A physically separate store of data transformed from the operational
        environment
       Operational update of data does not occur in the data warehouse environment
            Does not require transaction processing, recovery, and concurrency control
             mechanisms
            Requires only two operations in data accessing
                Initial loading of data
                Later access of data




                                  Data Cubes
   What Is a Data Cube?
       Allows data to be modeled and viewed in multiple dimensions
       Data warehouses are based on a multidimensional data model
   Dimensions
       Perspectives or entities with respect to which an organization wants to keep
        records
       e.g. AllElectronics may create a sales data warehouse in order to keep records of
        the store’s sales with respect to dimensions of time, item, branch, and location
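
Viewing the same sales data along different dimensions can be sketched in plain Python. The fact records and the helper name `aggregate` are invented for illustration; a real warehouse would use an OLAP engine rather than a loop like this.

```python
from collections import defaultdict

# Made-up fact records: (time, item, location, sales amount).
facts = [
    ("Q1", "TV", "Chicago", 400),
    ("Q1", "TV", "New York", 300),
    ("Q1", "computer", "Chicago", 500),
    ("Q2", "TV", "Chicago", 450),
]
DIMENSIONS = ("time", "item", "location")


def aggregate(facts, keep):
    """Sum the sales measure over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)


by_time_item = aggregate(facts, ("time", "item"))  # 2-D view
by_item = aggregate(facts, ("item",))              # 1-D view
grand_total = aggregate(facts, ())                 # 0-D view: a single cell

print(by_time_item)
print(by_item)
print(grand_total)
```

Each choice of `keep` corresponds to one view of the cube; dropping a dimension sums its values away.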




A 3-D Data Cube Representation of Sales Data for AllElectronics
           (Dimensions of Item, Time, and Location)


A 4-D Data Cube Representation of Sales Data for AllElectronics
       (Dimensions of Item, Time, Location, and Supplier)



              Data Mining: Classification Schemes
   General Functionalities
        Descriptive Tasks
             Characterize the general properties of the data
        Predictive Tasks
             Perform inference on the current data in order to make predictions
   Different Views Lead to Different Classifications
        Data view: kinds of data to be mined
             Relational, data warehouse, transactional, multimedia, etc.
        Knowledge view: kinds of knowledge to be discovered
             Characterization, discrimination, association, classification, regression, clustering,
              trend/deviation, outlier analysis, etc.
        Method view: kinds of techniques utilized
             Machine learning, statistics, visualization, database-oriented
        Application view: kinds of applications adapted
             Retail, telecommunication, banking, fraud analysis
                             Class Description
   Classes
       e.g. Customers of a bank can be classified into those with “good credit” and
        “bad credit”; Grades of students in a class include “A”, “B”, “C”, and “D”
   Data Characterization
       Summarize the data in each class
       e.g. summarize the distributions of age, educational level, and household income
        of customers that have “good credit” or “bad credit”
   Data Discrimination
       Compare data in different classes
       e.g. compare customers with “good credit” and those with “bad credit” in their
        distributions of age, educational level, and household income
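
Characterization and discrimination can be illustrated with a small Python sketch. The customer records are made up, the function name `characterize` is hypothetical, and the class mean is used as the only summary statistic for brevity; a fuller description would look at whole distributions.

```python
from statistics import mean

# Made-up customer records for illustration only.
customers = [
    {"credit": "good", "age": 45, "income": 80000},
    {"credit": "good", "age": 35, "income": 60000},
    {"credit": "bad", "age": 25, "income": 30000},
    {"credit": "bad", "age": 31, "income": 40000},
]


def characterize(rows, label, attrs=("age", "income")):
    """Data characterization: summarize one class by the mean of each attribute."""
    members = [r for r in rows if r["credit"] == label]
    return {a: mean(r[a] for r in members) for a in attrs}


good = characterize(customers, "good")
bad = characterize(customers, "bad")

# Data discrimination: compare the two class summaries attribute by attribute.
difference = {a: good[a] - bad[a] for a in good}

print(good, bad, difference)
```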




               Mining Frequent Patterns, Associations,
                         and Correlations
    Frequent Patterns
         Patterns that occur frequently in data
              Itemsets: a set of items that frequently appear together in a transactional dataset
              Subsequences: a set of events that frequently occur in a particular sequence
              Substructures: a set of structures (such as graphs, trees, lattices) that appear
               frequently
    Association Mining
         Discovery of frequent patterns, associations and correlations

    Association Rules
    Computer => Software (support=1%, confidence=50%)
    Age(20,29] and Income(20K, 29K] => CD Player (support=2%, confidence=60%)
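
Finding frequent itemsets can be sketched by brute-force counting of item pairs. This is a simplification for illustration and not a full algorithm such as Apriori; the transactions, the function name `frequent_pairs`, and the 50% support threshold are all assumptions of the sketch.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions; each is the set of items bought together.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "CD player"},
    {"software", "printer"},
]


def frequent_pairs(transactions, min_support):
    """Return every item pair whose share of transactions is at least min_support."""
    counts = Counter()
    for t in transactions:
        # Sorting gives each pair one canonical order, so it is counted under one key.
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


pairs = frequent_pairs(transactions, min_support=0.5)
print(pairs)
```

Enumerating all pairs does not scale to large item catalogs, which is why practical miners prune candidates instead of counting everything.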



                     Classification and Prediction
   Classification
       Process of finding a model that describes and distinguishes data classes, for the
        purpose of being able to use the model to predict the class of objects whose class
        label (categorical, unordered) is unknown
   Prediction
       Models continuous-valued functions to predict the missing or unavailable
        numerical data values
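
The difference between the two tasks can be shown with two deliberately simple models in Python: a nearest-neighbor rule for a categorical class label and a least-squares line for a numeric value. Both are stand-ins chosen for brevity; the data and the function names are made up for illustration.

```python
def nearest_neighbor_class(train, x):
    """Classification: return the class label of the training point closest to x."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label


def fit_line(points):
    """Prediction: least-squares fit of y = a + b*x to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Made-up training data: (income in $K, credit class).
labeled = [(20, "bad credit"), (30, "bad credit"), (70, "good credit"), (90, "good credit")]
label = nearest_neighbor_class(labeled, 65)  # output is a categorical label

# Made-up numeric data lying on the line y = 1 + 2x.
a, b = fit_line([(1, 3), (2, 5), (3, 7)])
predicted = a + b * 4                        # output is a continuous value

print(label, predicted)
```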




                                 Cluster Analysis
   Functions
       Analyze data without consulting a known class label
       Divide data into groups (clusters) so that objects within the same cluster are
        similar while those belonging to different clusters differ substantially
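
One common way to form such groups is k-means, sketched here for one-dimensional data. The slides do not name a specific clustering algorithm, so k-means is an assumption of this example, as are the data, the starting centers, and the fixed iteration count.

```python
def k_means_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center to its cluster mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster happens to be empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# No class labels are given; the grouping comes from the data alone.
centers, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centers=[0.0, 5.0])
print(centers, clusters)
```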




                                 Outlier Analysis
   Function
       Identify objects that do not comply with the general pattern of the data

Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases
of extremely large amounts for a given account number in comparison to regular
charges incurred by the same account.
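
The credit card example can be sketched as a comparison of each charge against the account's typical charge. The rule used here (flag anything above five times the median) is an arbitrary assumption chosen for illustration, as are the charge amounts and the function name `flag_outliers`.

```python
from statistics import median


def flag_outliers(charges, factor=5.0):
    """Flag charges far above the account's typical (median) charge.

    The factor of 5 is an illustrative threshold, not a recommended setting.
    """
    typical = median(charges)
    return [c for c in charges if c > factor * typical]


# Made-up charges for a single account.
charges = [42.0, 18.5, 60.0, 35.0, 2500.0, 27.0]
suspicious = flag_outliers(charges)
print(suspicious)
```

The median is used instead of the mean because a single extreme charge would pull the mean toward itself and could hide the very purchase being sought.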




                              Evolution Analysis
   Function
       Describes and models regularities or trends for objects whose behavior changes
        over time

Suppose you have the major stock market (time-series) data of the last several years
available from the New York Stock Exchange and you would like to invest in shares
of high-tech industrial companies. A data mining study of stock exchange data may
identify stock evolution regularities for overall stocks and for the stocks of particular
companies. Such regularities may help predict future trends in stock market prices,
contributing to your decision making regarding stock investments.
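
A very simple way to expose a trend in time-series data is a moving average, sketched below. The prices are made up, the window size of 3 is arbitrary, and comparing the first and last smoothed values is only a crude trend indicator, not a forecasting method.

```python
def moving_average(prices, window):
    """Average each run of `window` consecutive prices to smooth out short-term noise."""
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]


# Made-up closing prices for one stock over seven periods.
prices = [10, 11, 12, 14, 13, 15, 17]
smoothed = moving_average(prices, window=3)

# Crude trend indicator: compare the ends of the smoothed series.
trend = "up" if smoothed[-1] > smoothed[0] else "down or flat"
print(smoothed, trend)
```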




                         Interestingness of Patterns
   Not all patterns generated by a DM system are interesting
   When Is a Pattern Interesting?
       Can be easily understood
       Valid on test data with some degree of certainty
       Potentially useful
       Novel
   Measures of Interestingness
       Objective
            Support ( X ⇒Y ) = P( X and Y )
            Confidence ( X ⇒Y ) = P(Y | X )
       Subjective
            Based on user beliefs in the data
            Unexpected (contrary to user’s belief)
            Actionable (offer strategic information on which user can act)
            Support a hypothesis that user sought to confirm
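
The two objective measures can be computed directly from transaction data, with probabilities estimated as relative frequencies. The transactions below are made up, and the function names `support` and `confidence` are chosen for this sketch.

```python
def support(transactions, x, y):
    """Support(X => Y) = P(X and Y): share of transactions containing both X and Y."""
    both = sum(1 for t in transactions if x <= t and y <= t)
    return both / len(transactions)


def confidence(transactions, x, y):
    """Confidence(X => Y) = P(Y | X): share of transactions with X that also contain Y.

    Assumes at least one transaction contains X; otherwise the ratio is undefined.
    """
    has_x = [t for t in transactions if x <= t]
    return sum(1 for t in has_x if y <= t) / len(has_x)


# Made-up transactions; `x <= t` tests whether itemset x is a subset of transaction t.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software"},
    {"printer"},
]
s = support(transactions, {"computer"}, {"software"})
c = confidence(transactions, {"computer"}, {"software"})
print(s, c)
```

Here one of four transactions contains both items, and one of the two transactions containing a computer also contains software.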

				