eBusiness Tutorial

					   Issues in Data Mining Applications

                     How to Make A Decision
                About Your Own Data Mining Tool?

Authors:   Nemanja Jovanovic, nemko@sezampro.yu
           Valentina Milenkovic, tina@eunet.yu
           Prof. Dr. Veljko Milutinovic, vm@etf.bg.ac.yu
Data Mining vs. Knowledge Mining = ?

Instead of a foreword

“….If you are not able
to swim in the ocean of the data,
you will get drowned….”

Tutorial Content:
    This Tutorial will guide you through the following sections:

 What does Data Mining really mean?
 Successful Data Mining
 Comparison of fourteen DM tools
 How to improve existing Data Mining applications?
 Potential applications
 Myths and facts about Data Mining
 Two case studies
 The future of DM applications

Other definitions of Data Mining

 Data mining is the (semi)automatic discovery of patterns,
  associations, anomalies, and changes in data.
 Data mining extracts information
       from a database that the user did not know existed.
 Also, data mining is the search for relationships
  and global patterns that exist in large databases
  but are `hidden' among the vast amount of data.

The Foundations of Data Mining

 Massive data collection
 Powerful multiprocessor computers
 Data Mining algorithms

    [Figure: chart labeled "... of data" against the years 1970, 1980, 1990, 2000]

Evolution Of Data Mining
Evolutionary Step        Business Question          Enabling Technologies   Product Providers      Characteristics

Data Collection          What was my average        Computers,              IBM,                   Retrospective,
(1960s)                  total revenue over the     tapes,                  CDC                    static data delivery
                         last 5 years?              disks

Data Access              What were unit sales       RDBMS,                  Oracle, Sybase,        Retrospective,
(1980s)                  in New England             SQL,                    Informix, IBM,         dynamic data delivery
                         last March?                ODBC                    Microsoft              at record level

Data Navigation          What were unit sales       OLAP,                   Pilot, IRI,            Retrospective,
(1990s)                  in New England last        multidimensional        Arbor, Redbrick,       dynamic data delivery
                         March?                     databases,                                     at multiple levels
                         Drill down to Boston.      data warehouses         Technologies

Data Mining              What’s likely to           Advanced algorithms,    Lockheed,              Prospective, proactive
(2000)                   happen to Boston unit      multiprocessors,        IBM, SGI,              information delivery
                         sales next month?          massive databases       numerous startups

Data Mining context

 Application domain
 Data mining problem type
 Technical aspect
 Data mining tools and techniques

Data Mining Techniques
   Artificial Neural Networks
   Decision Trees
   Genetic Algorithms
   Rule Induction
   K-Nearest Neighbor (k-NN)
   Data Visualization

          [Figure: diagram mapping an input pattern to an output]
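
As an illustration of one technique from the list above, K-Nearest Neighbor can be sketched in a few lines. This is only a minimal sketch: the records, the feature meaning, and the function name `knn_predict` are hypothetical, and Euclidean distance with a simple majority vote is one common choice among several.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature records, e.g. (age, income in thousands)
train = [
    ((25, 30), "no"), ((30, 35), "no"), ((28, 32), "no"),
    ((50, 90), "yes"), ((55, 95), "yes"), ((48, 85), "yes"),
]
print(knn_predict(train, (52, 88)))
print(knn_predict(train, (27, 31)))
```

In practice the features would usually be rescaled first, since unscaled features with large ranges dominate the distance.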

Examples of DM projects to stimulate your imagination

   Here are six examples of how data mining is helping corporations
    to operate more efficiently and profitably in today's business environment.
     – Targeting a set of consumers
       who are most likely to respond to a direct mail campaign

     – Predicting the probability of default for consumer loan applications

     – Reducing fabrication flaws in VLSI chips

     – Predicting audience share for television programs

     – Predicting the probability that a cancer patient
       will respond to radiation therapy

     – Predicting the probability that an offshore oil well is actually going
       to produce oil

Successful Data Mining

   Come up with a precise formulation of the problem you are trying
    to solve and use the right data
   Have a clearly articulated business problem and then determine
    whether data mining is the proper solution technology
   Understand and deliver the fundamentals
   Get your technology folks involved, too
   Visualizing the data mining output in a meaningful way
    is very important
   Allow the user to interact with the visualization

Comparison of fourteen DM tools

   Evaluated by four undergraduates inexperienced at data mining,
    a relatively experienced graduate student, and
    a professional data mining consultant
   Run under MS Windows 95, MS Windows NT, or
    Macintosh System 7.5
   Use one of four technologies:
    Decision Trees, Rule Induction, Neural Networks, or Polynomial Networks
   Solve four problems: two binary classification problems,
    a multi-class classification problem, and a noiseless estimation problem
   Prices range from $75 to $25,000

Comparison of fourteen DM tools

   The Decision Tree products were
         - CART
         - Scenario
         - See5
         - S-Plus
   The Rule Induction tools were
         - WizWhy
         - DataMind
         - DMSK
   Neural Networks were built from three programs
         - NeuroShell2
         - PcOLPARS
         - PRW
   The Polynomial Network tools were
         - ModelQuest Expert
         - Gnosis
         - a module of NeuroShell2
         - KnowledgeMiner

Criteria for evaluating DM tools

A list of 20 criteria for evaluating DM tools, put into 4 categories:

   Capability measures what a desktop tool can do,
    and how well it does it
         - Handles missing data
         - Considers misclassification costs
         - Allows data transformations
         - Quality of testing options
         - Has programming language
         - Provides useful output reports
         - Visualisation


   + excellent capability   ✓ good capability   - some capability   “blank” no capability

Criteria for evaluating DM tools

   Learnability/Usability shows how easy a tool is to learn and use

         -   Tutorials
         -   Wizards
         -   Easy to learn
         -   User’s manual
         -   Online help
         -   Interface

Criteria for evaluating DM tools

   Interoperability shows a tool’s ability to interface
    with other computer applications

         - Importing data
         - Exporting data
         - Links to other applications

   Flexibility

         - Model adjustment flexibility
         - Customizable work environment
         - Ability to write or change code

Data Input & Output Model

                               + excellent capability
                               ✓ good capability
                               - some capability
                                “blank” no capability
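
One way to turn per-criterion ratings like these into a decision about your own tool is a weighted score. The sketch below is only an illustration: the numeric values assigned to the rating symbols (with the good-capability mark written as ✓), the weights, and the two tools with their ratings are all assumptions, not results from the evaluation.

```python
# Assumed numeric values for the rating symbols (not from the source).
RATING = {"+": 3, "✓": 2, "-": 1, "": 0}

def weighted_score(ratings, weights):
    """Sum of weight * rating value over all weighted criteria."""
    return sum(weights[c] * RATING[ratings.get(c, "")] for c in weights)

# Hypothetical weights reflecting what matters for one particular project.
weights = {"Handles missing data": 3, "Importing data": 2, "Tutorials": 1}

# Hypothetical ratings for two made-up tools; a missing entry means "blank".
tools = {
    "Tool A": {"Handles missing data": "+", "Importing data": "-", "Tutorials": "✓"},
    "Tool B": {"Handles missing data": "✓", "Importing data": "+"},
}

ranking = sorted(tools, key=lambda t: weighted_score(tools[t], weights), reverse=True)
for name in ranking:
    print(name, weighted_score(tools[name], weights))
```

Changing the weights changes the ranking, which is the point: the best tool depends on which criteria matter for the project at hand.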

A classification of data sets

   Pima Indians Diabetes data set
     –   768 cases of Native American women from the Pima tribe
           some of whom are diabetic, most of whom are not
     –   8 attributes plus the binary class variable for diabetes per instance
   Wisconsin Breast Cancer data set
     –   699 instances of breast tumors
           some of which are malignant, most of which are benign
     –   10 attributes plus the binary malignancy variable per case
   The Forensic Glass Identification data set
     –   214 instances of glass collected during crime investigations
     –   10 attributes plus the multi-class output variable per instance
   Moon Cannon data set
     –   300 solutions to the equation:
                      $x = 2 v^2 \sin(\theta)\cos(\theta) / g$
                      (v: muzzle velocity, θ: firing angle, g: gravitational acceleration)
     –   the data were generated without adding noise
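
A noise-free data set of this kind can be generated in a few lines, reading the equation as the standard projectile range formula. The sketch below makes several assumptions that are not in the source: the value used for lunar gravity, the input ranges for velocity and angle, and the uniform sampling.

```python
import math
import random

G_MOON = 1.62  # m/s^2, approximate lunar surface gravity (assumed value)

def cannon_range(v, theta, g=G_MOON):
    """Horizontal range x = 2 v^2 sin(theta) cos(theta) / g for launch speed v."""
    return 2 * v**2 * math.sin(theta) * math.cos(theta) / g

def make_dataset(n=300, seed=0):
    """Return n noise-free (v, theta, x) rows; the input ranges are assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = rng.uniform(10.0, 100.0)           # muzzle velocity, m/s
        theta = rng.uniform(0.0, math.pi / 2)  # firing angle, radians
        rows.append((v, theta, cannon_range(v, theta)))
    return rows

data = make_dataset()
print(len(data), data[0])
```

Because no noise is added, a tool that can represent the underlying function should be able to fit these rows almost exactly, which is what makes the problem a useful estimation benchmark.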

Evaluation of fourteen DM tools

Strengths and Weaknesses

             Strengths                               Weaknesses
   Ease of use                              Difficult file I/O
          (Scenario, WizWhy..)                       (OLPARS, CART)
   Data visualisation                       Limited visualisation
    (S-plus, MineSet...)                      (PRW, See5, WizWhy)
   Depth of algorithms (tree options)       Narrow analysis path
    (CART, See5, S-plus..)                           (Scenario)
   Multiple neural network

How to improve existing DM applications

The top ten points:
 Database integration
   – no more flat files
   – make use of the millions of dollars spent on data warehousing
 Automated model scoring
   – without scoring, DM is pretty useless
   – should be integrated with the driving applications
 Exporting models to other applications
   – close the loop between DM and applications
     that need to use the results (scores)
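
To make the scoring and export points concrete, here is a minimal sketch of a model exported as plain data and then applied by another application. Everything in it is hypothetical: the JSON layout, the logistic-regression form of the model, the coefficient values, and the field names.

```python
import json
import math

# Hypothetical exported model: logistic-regression coefficients as JSON.
exported = json.dumps({
    "intercept": -2.0,
    "coefficients": {"num_purchases": 0.5, "months_inactive": -0.3},
})

def score(record, model_json):
    """Return the predicted probability for one record using the exported model."""
    model = json.loads(model_json)
    z = model["intercept"] + sum(
        weight * record[name] for name, weight in model["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))

# The "driving application" scores its own customer records.
customers = [
    {"id": 1, "num_purchases": 6, "months_inactive": 1},
    {"id": 2, "num_purchases": 1, "months_inactive": 8},
]
scores = {c["id"]: round(score(c, exported), 3) for c in customers}
print(scores)
```

The scoring function needs only the exported coefficients, not the mining tool itself, which is what closing the loop between DM and the consuming application amounts to.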

How to improve existing DM applications

 Business templates
   – a cross-selling-specific application is more valuable
        than a general modeling tool
 Effort knob
   – it is relevant in a way that tuning parameters are not
 Incorporate financial information
   – financial information is very important and often available,
      and should be provided as input to the DM application

How to improve existing DM applications

 Computed target columns
   – allow the user to interactively create a new target variable
 Time-series data
   – a year’s worth of monthly balance information is qualitatively
     different than twelve distinct non-time-series variables
 Use versus View
   – do not visually present the full model to the user,
     only the most important levels
 Wizards
   – not necessary, but desirable
   – prevent human error by keeping the user on track
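
The time-series point above can be illustrated with a small sketch: the same twelve balances, treated as an ordered series, yield features such as a trend that twelve unordered columns cannot express. The function name, the chosen features, and the example balances are assumptions made for illustration.

```python
def balance_features(monthly_balances):
    """Summarise a series of monthly balances as level, trend, and range."""
    n = len(monthly_balances)
    mean = sum(monthly_balances) / n
    # Least-squares slope of balance against month index 0..n-1.
    t_mean = (n - 1) / 2
    slope = (
        sum((t - t_mean) * (b - mean) for t, b in enumerate(monthly_balances))
        / sum((t - t_mean) ** 2 for t in range(n))
    )
    spread = max(monthly_balances) - min(monthly_balances)
    return {"mean": mean, "slope_per_month": slope, "range": spread}

# Two hypothetical customers with the same 12 values in opposite order.
growing = [100 * m for m in range(1, 13)]
shrinking = list(reversed(growing))
print(balance_features(growing))
print(balance_features(shrinking))
```

Both customers have the same mean and range, yet one balance is rising and the other falling; only a feature that respects the month order tells them apart.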

Potential Applications
Data mining has many varied fields of application,
some of which are listed below.

                 Retail/Marketing
   Identify buying patterns from customers
   Find associations among customer demographic characteristics
   Predict response to mailing campaigns
   Market basket analysis
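
Market basket analysis usually starts from two simple measures, support and confidence. The sketch below computes them for item pairs; the baskets are hypothetical, and real tools typically use algorithms such as Apriori to handle larger item sets efficiently.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions; each is the set of items in one basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
]

item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)

def support(pair):
    """Fraction of baskets containing both items of the pair."""
    return pair_counts[tuple(sorted(pair))] / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    return both / item_counts[antecedent]

print(support(("bread", "milk")), confidence("butter", "milk"))
```

Confidence is not symmetric: in this toy data every basket with butter also has milk, but not every basket with milk has butter.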

Potential Applications

                   • Banking
   Detect patterns of fraudulent credit card use
   Identify `loyal' customers
   Determine credit card spending by customer groups
   Find hidden correlations between different financial indicators
   Identify stock trading rules from historical market data

Potential Applications

                  • Insurance and Health Care

   Claims analysis - i.e., which medical procedures are claimed together
   Predict which customers will buy new policies
   Identify behaviour patterns of risky customers
   Identify fraudulent behaviour

Potential Applications

                   • Transportation
   Determine the distribution schedules among outlets
   Analyse loading patterns
                   • Medicine
   Characterise patient behaviour to predict office visits
   Identify successful medical therapies for different illnesses
   Predict the effectiveness of surgical procedures or
    medical tests

Potential Applications

                   • Sport
   Choose the best players for different circumstances
   Predict the results of relevant matches
   Produce a better list of seeded players for groups or tournaments
      DM report from an NBA game
        When Price was Point-Guard, J.Williams missed 0% (0)
        of his jump field-goal attempts and made 100% (4)
        of his jump field-goal-attempts.
       The total number of such field-goal-attempts was 4.

DM and Customer Relationship Management

 CRM is a process that manages the interactions
       between a company and its customers
 Users of CRM software applications are database marketers
 Goals of database marketers are:
      identifying market segments, which requires significant data
         about prospective customers and their buying behaviors
      building and executing campaigns
   Tightly integrating the two disciplines presents an opportunity
    for companies to gain competitive advantage

DM and Customer Relationship Management

   How Data Mining helps Database Marketing
   Scoring
   The role of Campaign Management Software
   Increasing the customer lifetime value
   Combining Data Mining and Campaign Management

DM and Customer Relationship Management

   Evaluating the benefits of a Data Mining model

               Gains chart                 Profitability chart
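
The numbers behind a gains chart can be computed directly from model scores and observed outcomes. The following is a minimal sketch with hypothetical scores and responses; the function name `cumulative_gains` and the choice of five bins are assumptions.

```python
def cumulative_gains(scores, responded, n_bins=5):
    """Return the cumulative share of responders captured after each bin.

    Customers are sorted by descending model score and cut into n_bins
    equal-sized groups (the last group absorbs any remainder).
    """
    ranked = [r for _, r in sorted(zip(scores, responded), key=lambda p: -p[0])]
    total = sum(ranked)
    size = len(ranked) // n_bins
    gains = []
    for i in range(1, n_bins + 1):
        end = len(ranked) if i == n_bins else i * size
        gains.append(sum(ranked[:end]) / total)
    return gains

# Hypothetical scores and outcomes (1 = responded) for ten customers.
scores    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1,   1,   0,   1,   0,   0,   1,   0,   0,   0]
print(cumulative_gains(scores, responded))
```

A model with no predictive value would capture roughly 20% of responders in each 20% of the file; the further the curve rises above that diagonal, the more benefit the model delivers.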

Myths and Facts about Data Mining

   Myth: DM produces surprising results
    that will utterly transform your business.

   Myth: DM techniques are so sophisticated
    that they can substitute for domain knowledge
    or for experience in analysis and model building.

   Myth: DM tools automatically find the patterns
    you are looking for, without being told what to do.

Myths and Facts about Data Mining

   Myth: Data mining is more effective with more data,
    so all existing data should be brought into any data-mining effort.

   Myth: Building a DM model on a sample of a database
    is ineffective, because sampling loses the information
    in the unused data.
   Myth: Data mining is another fad that will soon fade,
    allowing us to return to standard business practice.

Myths and Facts about Data Mining

   Myth: DM is useful only in certain areas,
    such as marketing, sales, and fraud detection.

   Myth: The methods used in DM are fundamentally different
    from the older quantitative model-building techniques.

   Myth: Data mining is an extremely complex process.

   Myth: Only massive databases are worth mining.

Data Mining Examples

 Bass Brewers
  “We’ve been brewing beer since 1777. With increased competition
  comes a demand to make faster, better-informed decisions.”
 Northern Bank
        “The information is now more accessible, paperless and timely.”
 TSB Group Plc
        “We are using Holos because of its flexibility and its excellent
  multidimensional database”

Data Mining Examples

   Delphic Universities
          “Real value is added to data by multidimensional manipulation
          (being able to easily compare many different views
    of the available information in one report) and by modeling.”
   Harvard - Holden
          “Sybase technology has allowed us to develop an information
    system that will preserve this legacy into the twenty-first century”
   J.P.Morgan
          “The promise of data mining tools like Information Harvester is
          that they are able to quickly wade through massive amounts
    of data to identify relationships or trending information
          that would not have been available without the tool”

Case study of Breast Cancer Survival Analysis

 Case study of the influence of various patient characteristics
          on survival rates for breast cancer
 The survival analysis technique employed is Cox Regression
          (this technique is useful in situations
          where some of the patients do not die during the observation
          period)
 A linear regression technique could be used instead
  if all patients had died during the observation period

Case study of Breast Cancer Survival Analysis

 The observation period runs for 133.8 months
 The modeling sample contains 746 patients
        (50 patients died during the observation period and 696
        survived beyond the end of the observation period)
 In this example, we are testing only four predictors:
       Age, in years, at the start of the observation period (22 to 88)
       Pathological tumor size, in centimeters (0.10 to 7.00)
       Number of positive axillary lymph nodes (0 to 35)
       Estrogen receptor status (positive vs. negative)

    Case study of Breast Cancer Survival Analysis

 The Cox Regression used a backward stepwise likelihood-ratio
  variable selection method
 Significance criteria were set at 0.05 for inclusion in the model,
  and 0.10 for removal from the model
 Printout from the final step of the stepwise regression analysis:

________________ Variables in the Equation ______________
Variable         B     S.E.     Wald   df    Sig       R    Exp(B)
AGE         -.0314    .0121   6.7486    1  .0094  -.0893     .9691
PATHSIZE     .3975    .1175  11.4476    1  .0007   .1259    1.4881
LNPOS        .1372    .0361  14.4100    1  .0001   .1443    1.1471
The column labeled "Sig" shows the statistical significance of included variables
The column labeled "R" shows the degree of unique correlation with the dependent variable
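
Two of the remaining columns follow directly from B and S.E.: the Wald statistic is (B / S.E.) squared, and Exp(B) is the hazard ratio for a one-unit increase in the predictor. The sketch below recomputes both from the printed values. Because B and S.E. are rounded to four decimals in the printout, the recomputed Wald values agree with the table only approximately.

```python
import math

# (B, S.E.) pairs copied from the printout above.
coefficients = {
    "AGE": (-0.0314, 0.0121),
    "PATHSIZE": (0.3975, 0.1175),
    "LNPOS": (0.1372, 0.0361),
}

for name, (b, se) in coefficients.items():
    wald = (b / se) ** 2          # Wald chi-square statistic with 1 df
    hazard_ratio = math.exp(b)    # the Exp(B) column
    print(f"{name:9s} Wald={wald:7.3f}  Exp(B)={hazard_ratio:.4f}")

hazard_ratios = {name: math.exp(b) for name, (b, _) in coefficients.items()}
```

Read this way, each additional positive lymph node multiplies the estimated hazard by about 1.15, each additional centimeter of tumor size by about 1.49, and each additional year of age by about 0.97.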

Case study of Breast Cancer Survival Analysis

Some key things to note are:

   Estrogen status was removed as a predictor because
    it did not reach the 0.05 significance criterion for inclusion
   Number of positive axillary lymph nodes was the strongest
    predictor of survival rates over the course of the observation
    period (R=.1443 / Sig.=.0001), followed by pathological tumor size
    (R=.1259 / Sig.=.0007)
   Age, although significant, is somewhat less influential
          than the other two predictors (R=-.0893 / Sig.=.0094)
   Note that both the number of positive axillary lymph nodes and
    the pathological tumor size are positively correlated with the
    dependent variable, which means that they are directly associated
    with more rapid mortality.
   Age is negatively correlated with the dependent variable, which
    means that younger age is predictive of somewhat shorter survival.

Case study of Breast Cancer Survival Analysis

   All patients survive through
    the tenth month of the observation
    period
   At the fortieth month,
    the mortality rate increases and
    continues at this fairly constant
    increased rate
    through the forty-fifth month
   At the forty-fifth month,
    there is a five-month period
    without additional mortality
   11% of the original sample has

    [Chart: cumulative survival function during the observation period]

Case study of Breast Cancer Survival Analysis

Conclusions and Implications

 The case study presented here is relatively simple,
         and is for illustrative purposes only.
 With the addition of more candidate predictors
         (progesterone receptor status, histologic grade, blood type
  etc.), an even more powerful model could emerge.
 By understanding the influence of patient characteristics
         on mortality rates over time, we are in a better position to
  estimate survival times for individual patients, and to defend
  using different or more aggressive therapeutic approaches for
  some patients.

Securities Brokerage Case Study

 The following four pages are derived
  from a copyrighted case study
  originally created by SmartDrill Data Mining
  (Marlborough, MA, U.S.A.).
 Their website is:
 And the original case study appears in its entirety here:

Securities Brokerage Case Study

 Predictive market segmentation model designed to identify
  and profile high-value brokerage customer segments
  as targets for special marketing communications efforts.
 The dependent variable for this ordinal CHAID model
  is brokerage account commission dollars during the past 12 months
 We begin by splitting the client's entire customer file
  into a modeling sample and a validation sample.
       (Once the model is built using the modeling sample,
         we apply it to the validation sample to see how well it works
         on a sample other than the one on which it was built).
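
The split into a modeling sample and a validation sample can be sketched as a seeded random partition. The 50/50 proportion, the seed, and the customer IDs below are assumptions; the case study does not state how the file was divided.

```python
import random

def split_file(customers, validation_fraction=0.5, seed=42):
    """Randomly split records into (modeling_sample, validation_sample)."""
    shuffled = list(customers)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]

customer_ids = list(range(1000))     # hypothetical customer file
modeling, validation = split_file(customer_ids)
print(len(modeling), len(validation))
```

Fixing the seed makes the split reproducible, so the same validation sample can be reused when comparing alternative models.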

Securities Brokerage Case Study

 The resulting CHAID model has 55 segments.
 However, the results are summarized in the following comb chart,
  showing the segment indexes (indexes of average dollar value)

     Securities Brokerage Case Study
                              Part of the gains chart: Average Annual Brokerage Commission Dollars

 Gains chart provides
quantitative detail useful
for financial and marketing

 We have highlighted the
top 20% of the file in blue

 The top 20% of the file
is worth an average
of about $334 per account,
which is nearly three times
the average account value
for the entire sample.
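
The arithmetic behind such a gains chart can be sketched as follows. Only the $334 figure for the top 20% comes from the case study; the three segments, their sizes, and the other dollar values are hypothetical, chosen so that the top segment is worth nearly three times the overall average.

```python
# Hypothetical segments: (name, number of accounts, average commission in $).
segments = [
    ("A", 200, 334.0),
    ("B", 300, 120.0),
    ("C", 500, 30.0),
]

total_accounts = sum(n for _, n, _ in segments)
overall_avg = sum(n * avg for _, n, avg in segments) / total_accounts

rows, cum_accounts, cum_dollars = [], 0, 0.0
for name, n, avg in sorted(segments, key=lambda s: s[2], reverse=True):
    cum_accounts += n
    cum_dollars += n * avg
    rows.append({
        "segment": name,
        "index": round(100 * avg / overall_avg),   # 100 = average account
        "cum_pct_of_file": 100 * cum_accounts / total_accounts,
        "cum_avg": cum_dollars / cum_accounts,
    })

for row in rows:
    print(row)
```

An index of 100 marks an average account, so a segment index near 300 corresponds to "nearly three times the average account value".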

Securities Brokerage Case Study

 Using the information in the gains chart,
       we can better plan our communications/promotion budget.
 In general, the best segments represent customers
       who are experienced, aggressive, self-directed traders.
 Other decisions that the gains chart
       and the segmentation rules can help us make:
      We might wish to conduct some market research among customers
       in under-performing segments, or among under-performing customers
       in the better segments
      We can use the segment definitions to help us identify possible issues
         and question areas to include in the survey
   Before we try to apply such a model, we perform a validation
    against a holdout sample, to confirm that it is a good model.

The future of DM applications

 Different opinions
 Very little functionality in DB systems to support DM applications
 Data mining, as a vital application,
  is just one more advance in the on-going research process
   Data mining will not go away

                        The End
