					       Data Mining Fundamentals
             A First View



Instructor: Su-Hsien Huang
                          Data Mining
• Definition
   – The process of employing one or more computer learning
     techniques to automatically analyze and extract knowledge
     from data.
• DM uses induction-based learning
   – The process of forming general concept definitions by
     observing specific examples of concepts to be learned.
• Knowledge Discovery in Databases (KDD)
   – Used with DM
   – The application of the scientific method to data mining. Data
     mining is one step of the KDD process.

            What Can Computers Learn?
• Four Levels of Learning
   –   Facts
   –   Concepts
   –   Principles
   –   Procedures
• Concepts are the output of a data mining session
• Three concept views
   – Classical view
   – Probabilistic view
   – Exemplar view

                            Classical view

• The concept definition is applied directly to reach an answer

 IF Annual Income >= 30,000 & Years at Current Position >= 5 & Owns Home = True
                         THEN Good Credit Risk = True
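
The rule above can be read as a crisp membership test: an applicant either satisfies every condition or is not an example of the concept. A minimal Python sketch, where the function and parameter names are illustrative assumptions rather than part of the slides:

```python
def is_good_credit_risk(annual_income, years_at_current_position, owns_home):
    """Classical view: every condition must hold for the concept to apply."""
    return (
        annual_income >= 30_000
        and years_at_current_position >= 5
        and owns_home
    )


# One applicant who meets all three conditions, and one who fails the tenure test.
print(is_good_credit_risk(32_000, 6, True))
print(is_good_credit_risk(52_000, 3, True))
```

There is no partial membership here: missing a single condition is enough to reject the applicant.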




                 The Probabilistic View

• Helps with the decision-making process

                 The mean annual income for individuals
           who consistently make loan payments on time is $30,000


      Most individuals who are good credit risks have been working for
                  the same company for at least five years



            The majority of good credit risks own their own home




               The Exemplar View

• Associates a probability of concept membership
  with each classification
                         Exemplar #1:
                    Annual Income=32,000
              Number of Years at Current Position =6
                          Homeowner
                         Exemplar #2:
                    Annual Income=52,000
             Number of Years at Current Position =16
                            Renter

                         Exemplar #3:
                    Annual Income=28,000
             Number of Years at Current Position =12
                         Homeowner
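
One way to use stored exemplars is to compare a new applicant with each of them and report the closest match. The sketch below assumes a simple hand-picked distance (income difference in units of 10,000, plus the difference in years, plus 1 if the housing status differs). The slides do not specify a similarity measure, and the names `EXEMPLARS`, `distance`, and `closest_exemplar` are illustrative.

```python
# Exemplars of the concept "good credit risk", taken from the slide.
EXEMPLARS = {
    "Exemplar #1": {"income": 32_000, "years": 6, "home": "Homeowner"},
    "Exemplar #2": {"income": 52_000, "years": 16, "home": "Renter"},
    "Exemplar #3": {"income": 28_000, "years": 12, "home": "Homeowner"},
}


def distance(a, b):
    """Assumed dissimilarity measure; it is not defined in the slides."""
    income_gap = abs(a["income"] - b["income"]) / 10_000
    years_gap = abs(a["years"] - b["years"])
    home_gap = 0 if a["home"] == b["home"] else 1
    return income_gap + years_gap + home_gap


def closest_exemplar(instance):
    """Return (name, distance) of the stored exemplar nearest to the instance."""
    name = min(EXEMPLARS, key=lambda n: distance(instance, EXEMPLARS[n]))
    return name, distance(instance, EXEMPLARS[name])


# Hypothetical new applicant, invented for illustration.
applicant = {"income": 30_000, "years": 5, "home": "Homeowner"}
print(closest_exemplar(applicant))
```

A small distance suggests the applicant resembles a known good credit risk. How small is small enough remains a modelling decision.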
               Supervised Learning

• Build a learner model using data instances of
  known origin.
• Use the model to determine the outcome of new
  instances of unknown origin.
• Decision Tree
  – A tree structure where non-terminal nodes represent
    tests on one or more attributes and terminal nodes
    reflect decision outcomes

                 Example: Supervised Learning
Table 1.1 • Hypothetical Training Data for Disease Diagnosis
(Diagnosis is the output attribute)

Patient ID#   Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis

1             Yes           Yes     Yes              Yes          Yes        Strep throat
2             No            No      No               Yes          Yes        Allergy
3             Yes           Yes     No               Yes          No         Cold
4             Yes           No      Yes              No           No         Strep throat
5             No            Yes     No               Yes          No         Cold
6             No            No      No               Yes          No         Allergy
7             No            No      Yes              No           No         Strep throat
8             Yes           No      No               Yes          Yes        Allergy
9             No            Yes     No               Yes          Yes        Cold
10            Yes           Yes     No               Yes          Yes        Cold




              Example: Decision Tree

                          Swollen Glands
                         /              \
                       No                Yes
                       |                  |
                     Fever       Diagnosis = Strep Throat
                    /     \
                  No       Yes
                  |         |
   Diagnosis = Allergy   Diagnosis = Cold

Only two attributes matter: Swollen Glands and Fever.
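
The tree can be written as two nested tests and checked against the training data. The sketch below, with the illustrative function name `diagnose`, copies only the Fever and Swollen Glands columns of Table 1.1, since the tree tests no other attribute.

```python
def diagnose(swollen_glands, fever):
    """Walk the decision tree: test Swollen Glands first, then Fever."""
    if swollen_glands == "Yes":
        return "Strep throat"
    if fever == "Yes":
        return "Cold"
    return "Allergy"


# (patient ID, fever, swollen glands, diagnosis) copied from Table 1.1.
TRAINING_DATA = [
    (1, "Yes", "Yes", "Strep throat"),
    (2, "No", "No", "Allergy"),
    (3, "Yes", "No", "Cold"),
    (4, "No", "Yes", "Strep throat"),
    (5, "Yes", "No", "Cold"),
    (6, "No", "No", "Allergy"),
    (7, "No", "Yes", "Strep throat"),
    (8, "No", "No", "Allergy"),
    (9, "Yes", "No", "Cold"),
    (10, "Yes", "No", "Cold"),
]

# Collect the IDs of any training instances the tree gets wrong.
misclassified = [
    pid for pid, fever, glands, actual in TRAINING_DATA
    if diagnose(glands, fever) != actual
]
print(misclassified)  # prints [] because the tree fits all ten instances
```

Fitting the training data perfectly says nothing yet about new patients; that is what separate test instances are for.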
                    Example: Classifying New Instances

Table 1.2 • Data Instances with an Unknown Classification

Patient ID#   Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis   Classified by the tree as

11            No            No      Yes              Yes          Yes        ?           Strep Throat
12            Yes           Yes     No               No           Yes        ?           Cold
13            No            No      No               No           Yes        ?           Allergy




              Production Rules
IF Swollen Glands = Yes
 THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes
 THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No
 THEN Diagnosis = Allergy
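
Production rules lend themselves to a data-driven representation: each rule is a set of attribute conditions plus a conclusion. The sketch below (the names `RULES`, `apply_rules`, and `UNKNOWN` are illustrative) applies the three rules to the unclassified patients 11 to 13 from Table 1.2.

```python
# Each rule pairs its conditions (attribute -> required value) with a diagnosis.
RULES = [
    ({"Swollen Glands": "Yes"}, "Strep Throat"),
    ({"Swollen Glands": "No", "Fever": "Yes"}, "Cold"),
    ({"Swollen Glands": "No", "Fever": "No"}, "Allergy"),
]


def apply_rules(instance):
    """Return the conclusion of the first rule whose conditions all hold."""
    for conditions, diagnosis in RULES:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return diagnosis
    return None  # no rule fired


# Patients 11-13 from Table 1.2 (only the attributes the rules test).
UNKNOWN = {
    11: {"Swollen Glands": "Yes", "Fever": "No"},
    12: {"Swollen Glands": "No", "Fever": "Yes"},
    13: {"Swollen Glands": "No", "Fever": "No"},
}

for patient_id, attributes in UNKNOWN.items():
    print(patient_id, apply_rules(attributes))
```

Because the three rules are mutually exclusive and cover every Yes/No combination of the two attributes, the order in which they are tried does not change the result.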



                            Unsupervised Clustering

     • A data mining method that builds models from
       data without predefined classes.
     • Example

Table 1.3 • Acme Investors Incorporated

Customer ID   Account Type   Margin Account   Transaction Method   Trades/Month   Sex   Age     Favorite Recreation   Annual Income

1005          Joint          No               Online               12.5           F     30–39   Tennis                40–59K
1013          Custodial      No               Broker               0.5            F     50–59   Skiing                80–99K
1245          Joint          No               Online               3.6            M     20–29   Golf                  20–39K
2110          Individual     Yes              Broker               22.3           M     30–39   Fishing               40–59K
1001          Individual     Yes              Online               5.0            M     40–49   Golf                  60–79K




             Unsupervised Clustering

• Questions
  – What attribute similarities group customers of Acme
    Investors together?
  – What differences in attribute values segment the
    customer database?
• Unsupervised Clustering
  – Some methods require an initial number of clusters
  – Others need no initial number of clusters


       Example: Unsupervised Clustering
IF Margin Account = Yes & Age = 20-29 & Annual Income = 40-59K
THEN Cluster = 1
{accuracy = 0.80, coverage = 0.50}
Interpretation: younger investors

IF Account Type = Custodial & Favorite Recreation = Skiing & Annual
   Income = 80-99K
THEN Cluster = 2
{accuracy = 0.95, coverage = 0.35}
Possible action: spend advertising money in ski magazines
to promote custodial accounts

IF Account Type = Joint & Trades/Month > 5 & Transaction
   Method = Online
THEN Cluster = 3
{accuracy = 0.82, coverage = 0.65}
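
The accuracy and coverage figures come from the full customer database, which the slides do not include, so they cannot be recomputed here. What can be sketched is how each rule's conditions are tested against the five sample customers of Table 1.3. The field names and the helper `matching_customers` are illustrative assumptions, and the income bracket in the second rule is written as 80-99K to match the categories used in the table.

```python
# The five sample customers from Table 1.3.
CUSTOMERS = [
    {"id": 1005, "account_type": "Joint", "margin": "No", "method": "Online",
     "trades_per_month": 12.5, "age": "30-39", "recreation": "Tennis", "income": "40-59K"},
    {"id": 1013, "account_type": "Custodial", "margin": "No", "method": "Broker",
     "trades_per_month": 0.5, "age": "50-59", "recreation": "Skiing", "income": "80-99K"},
    {"id": 1245, "account_type": "Joint", "margin": "No", "method": "Online",
     "trades_per_month": 3.6, "age": "20-29", "recreation": "Golf", "income": "20-39K"},
    {"id": 2110, "account_type": "Individual", "margin": "Yes", "method": "Broker",
     "trades_per_month": 22.3, "age": "30-39", "recreation": "Fishing", "income": "40-59K"},
    {"id": 1001, "account_type": "Individual", "margin": "Yes", "method": "Online",
     "trades_per_month": 5.0, "age": "40-49", "recreation": "Golf", "income": "60-79K"},
]

# Cluster number -> test of the rule's IF part.
CLUSTER_RULES = {
    1: lambda c: (c["margin"] == "Yes" and c["age"] == "20-29"
                  and c["income"] == "40-59K"),
    2: lambda c: (c["account_type"] == "Custodial" and c["recreation"] == "Skiing"
                  and c["income"] == "80-99K"),
    3: lambda c: (c["account_type"] == "Joint" and c["trades_per_month"] > 5
                  and c["method"] == "Online"),
}


def matching_customers(rule):
    """Return the IDs of the sample customers that satisfy the rule's conditions."""
    return [c["id"] for c in CUSTOMERS if rule(c)]


for cluster, rule in CLUSTER_RULES.items():
    print(cluster, matching_customers(rule))
```

Among the five sample rows, only customer 1013 satisfies the second rule and only customer 1005 satisfies the third. None satisfies the first, which is unsurprising for so small a sample.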
       Is Data Mining Appropriate for Me?
• Can we clearly define the problem?
• Does potentially meaningful data exist?
• Does the data contain hidden knowledge, or is the data
  factual and useful for reporting purposes only?
• Will the cost of processing be less than the increase in
  profit?
• Four types of knowledge
   –   Shallow Knowledge
   –   Multidimensional Knowledge
   –   Hidden Knowledge
   –   Deep Knowledge
            Data Mining vs. Data Query

• Use data query if you already know, or almost know, what
  you are looking for.
   – A list of all Acme Department Store customers who
     used a credit card to buy a gas grill
• Use data mining to find regularities in data that
  are not obvious.
   – Develop a general profile for credit customers who
     take advantage of promotions offered with their credit
     card billing
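
A data query states its condition in full and returns exactly the matching records. A minimal sketch of the gas grill query, using hypothetical purchase records and field names invented for illustration (the slides include no store data):

```python
# Hypothetical purchase records; invented purely to make the query runnable.
PURCHASES = [
    {"customer": "A", "item": "gas grill", "payment": "credit card"},
    {"customer": "B", "item": "gas grill", "payment": "cash"},
    {"customer": "C", "item": "lawn chair", "payment": "credit card"},
    {"customer": "D", "item": "gas grill", "payment": "credit card"},
]


def credit_card_grill_buyers(purchases):
    """List customers who bought a gas grill with a credit card."""
    return sorted({
        p["customer"] for p in purchases
        if p["item"] == "gas grill" and p["payment"] == "credit card"
    })


print(credit_card_grill_buyers(PURCHASES))
```

The profiling task in the second bullet has no such ready-made condition to filter on. Discovering which conditions characterize those customers is the data mining step.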
                   Expert System

• A computer program that emulates the problem-
  solving skills of one or more human experts
• A knowledge engineer is a person trained to interact with an
  expert in order to capture the expert's knowledge.




Expert System vs. Data Mining

  Data mining path:
    Data -> Data Mining Tool -> rule
      If Swollen Glands = Yes
      Then Diagnosis = Strep Throat

  Expert system path:
    Human Expert -> Knowledge Engineer -> Expert System Building Tool -> rule
      If Swollen Glands = Yes
      Then Diagnosis = Strep Throat
            Data Mining Process Model

• Assemble a collection of data to analyze
• Present these data to a data mining software
  program
• Interpret the results
• Apply the results to a new problem




              Data Mining Process Model

Operational Database -> SQL Queries -> Data Warehouse -> Data Mining -> Interpretation & Evaluation -> Result Application




                   Mining the Data

•   Should learning be supervised or unsupervised?
•   Which instances will be used for building the model and which for testing it?
•   Which attributes will be selected?
•   What parameter settings should be used to best
    represent the data?
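
The question of which instances to build with and which to test with is often settled by a random split. A sketch using the ten patient IDs of Table 1.1; the 70/30 ratio, the fixed seed, and the function name `split_instances` are assumptions made for illustration, not choices stated in the slides.

```python
import random


def split_instances(instance_ids, build_fraction=0.7, seed=0):
    """Shuffle the IDs reproducibly and split them into build and test sets."""
    ids = list(instance_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * build_fraction)
    return ids[:cut], ids[cut:]


patient_ids = range(1, 11)  # patients 1-10 from Table 1.1
build_ids, test_ids = split_instances(patient_ids)
print("build:", sorted(build_ids))
print("test: ", sorted(test_ids))
```

Fixing the seed keeps the split repeatable, so a model built today can be compared with one built tomorrow on the same instances.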




                 Mining Applications

•   Fraud Detection
•   Health Care
•   Business and Finance
•   Scientific Applications
•   Sports and Gaming




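                                Example

[Figure: scatter plot with Intrinsic (Predicted) Value on the vertical axis and Actual Value on the horizontal axis. Points marked "_" lie in the region labeled "Quit Card unless aggressive marketing"; points marked "X" lie in the region labeled "unaggressive marketing is appropriate".]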


