University of California, Davis

					                                   269 Business Intelligence Technologies

                                         Homework 6
                          You should work individually on this homework.

Due Date: Davis: 3/1; Sac: 3/3; Bay: 3/12

Delivery Method: Please hand in paper report before class

To save trees, please use a new Word file for your solution; don’t include the
problem description in your new file.

Points for Questions 1-6: 14, 8, 3, 5, 5, 15 (total 50).

Question 1.
A winery maintains a dataset containing information about customers who subscribe to
its tasting events and special offers for wine cases. The winery occasionally mails tasting
samples of new wines in an effort to increase sales. The chief marketing officer is aiming
to send samples of a newly produced wine to customers who are NOT likely to place an
order for the new wine (when probability of purchase is less than 0.5). Based on a survey
conducted amongst its customers and their willingness to buy a case of the new wine
before tasting it, the firm collected three attributes – whether customers prefer dry wine,
whether they prefer red wine, and their age. The classification problem is to predict whether
customers will buy or not buy wine, with certain confidence/probability. The following
classification tree was induced:

   Prefer Dry?
      Yes: Preference for Red?
              No:   Buy (with probability 0.9)
              Yes:  Not buy (with probability 0.93)
      No:  Age
              <50:  Buy (with probability 0.6)
              >=50: Not buy (with probability 0.95)




Answer the following 4 questions:

   1) Suppose the above tree was generated by data mining software. If you had to
       choose a single attribute to predict whether a customer is likely to place an
       order, which attribute would you use (check one and briefly justify your
       choice)?
              A. Preference for red wine
              B. Age (whether the customer is older or younger than 50 years old)
              C. Preference for dry wine
              D. Impossible to determine given the information provided.


   2) Read off (from the tree) the rule from the highlighted path.

   3) The winery manager, Barbara, wants to consult you on whether she should send a
      sample to a new customer named George. She tells you that George prefers dry
      wine and strongly prefers red wine. What’s your recommendation (on whether to
      send a sample to George)? Justify. (Note: The chief marketing officer is aiming to
      send samples of a newly produced wine to customers who are NOT likely to place
      an order for the new wine (when probability of purchase is less than 0.5).)

   4) Assume that the cost of producing and shipping a wine sample to a customer is $8
      and the revenue from each order is $100. Assume the company ascertained that
      the probability of purchase will become 17% for a customer who receives a
      sample of a new wine. Would you suggest shipping a sample to customers
      preferring dry and red wines? Explain your answer.
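One way to frame the comparison in part 4 is as expected profit per customer, with and without a sample. The sketch below is only an illustration: the function name `expected_profit` is hypothetical, and it assumes that the "Not buy with probability 0.93" leaf implies a purchase probability of 0.07 when no sample is sent.

```python
def expected_profit(p_purchase, revenue_per_order, sample_cost=0.0):
    """Expected profit per customer: purchase probability times revenue, minus sample cost."""
    return p_purchase * revenue_per_order - sample_cost


REVENUE = 100.0        # revenue from each order (from the question)
SAMPLE_COST = 8.0      # cost of producing and shipping one sample (from the question)
P_WITH_SAMPLE = 0.17   # purchase probability after receiving a sample (from the question)

# Assumption: the dry-and-red leaf ("Not buy" with probability 0.93) means the
# purchase probability without a sample is 1 - 0.93.
P_WITHOUT_SAMPLE = 1 - 0.93

with_sample = expected_profit(P_WITH_SAMPLE, REVENUE, SAMPLE_COST)
without_sample = expected_profit(P_WITHOUT_SAMPLE, REVENUE)

print(f"expected profit with sample:    {with_sample:.2f}")
print(f"expected profit without sample: {without_sample:.2f}")
```

Comparing the two numbers, rather than looking at the sample cost alone, is what matters for the recommendation.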

Question 2.

Now that the NBA season has begun, an NBA specialist tries to predict whether each team
has a chance to win the championship (so this is the dependent variable). He decides to
use two predictors (i.e. independent variables) – 1) whether a team won more than 55
games during the past season and 2) whether the team as a whole is healthy. The dataset
below contains information about the top eight teams and whether the specialist thinks
they have a chance to win it all.

Team                  Win less than 55 games?   Team healthy?   Chance to win the championship?
Phoenix Suns          No                        Fair            No
Detroit Pistons       No                        Excellent       No
San Antonio Spurs     No                        Fair            Yes
Miami Heat            No                        Fair            Yes
Denver Nuggets        Yes                       Fair            Yes
Seattle Supersonics   Yes                       Excellent       No
Houston Rockets       Yes                       Excellent       Yes
Dallas Mavericks      No                        Excellent       No
Use the examples in the above dataset to determine which attribute you should split
on first in order to build a decision tree that predicts whether a team has a chance to win.
Explain each step and show all relevant computations. To simplify your computation,
you are given that the information gain if splitting on “team healthy” is 0.189.

At most, you need the following logarithm values for answering the question:

$\log_2 \frac{2}{3} = -0.585$, $\log_2 \frac{1}{3} = -1.585$, $\log_2 \frac{1}{2} = -1$, $\log_2 \frac{3}{5} = -0.737$, $\log_2 \frac{2}{5} = -1.322$,
$\log_2 \frac{1}{4} = -2$, $\log_2 \frac{3}{4} = -0.415$
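As a way to check hand computations, the entropy and information gain calculation can be sketched in Python. This is a minimal sketch, not part of the required solution: the helper names `entropy` and `information_gain` and the column keys are hypothetical, and the rows simply restate the table above.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# (win less than 55 games?, team healthy?, chance to win?) in table order
teams = [
    ("No", "Fair", "No"),         # Phoenix Suns
    ("No", "Excellent", "No"),    # Detroit Pistons
    ("No", "Fair", "Yes"),        # San Antonio Spurs
    ("No", "Fair", "Yes"),        # Miami Heat
    ("Yes", "Fair", "Yes"),       # Denver Nuggets
    ("Yes", "Excellent", "No"),   # Seattle Supersonics
    ("Yes", "Excellent", "Yes"),  # Houston Rockets
    ("No", "Excellent", "No"),    # Dallas Mavericks
]
rows = [{"win_lt_55": w, "healthy": h, "chance": c} for w, h, c in teams]

for attr in ("win_lt_55", "healthy"):
    print(attr, round(information_gain(rows, attr, "chance"), 3))
```

The value printed for `healthy` should agree with the 0.189 given in the question, which is a useful sanity check before comparing it with the other attribute.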



Question 3.


In preparation for a direct marketing campaign you have built two models using two
different techniques to predict customer response. The performance of the two models is
featured in the cumulative gains chart below. Which model should you use if you wish to
target the top 30% customers? Which model should you use if you wish to target the top
60% customers? Explain.

Question 4.

The table shows the ratings of different customers for different movies (higher is better):

Customer    M1   M2    M3       M4
C1          1    2     3        5
C2          2    2     2        3
C3          3    4     1        2
C4          5    3     5        4
C5          2    2

If you want to pick one movie from M3 and M4 to recommend to C5, which movie are
you going to recommend? Please use 3-nearest neighbors, use Euclidean distance as the
distance function, and use average (not weighted by the distance) as the combination
function.
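The procedure described above can be sketched as follows. This is an illustrative sketch under one assumption: distances are computed only over the movies C5 has already rated (M1 and M2). The function names `nearest_neighbors` and `predict` are hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

ratings = {
    "C1": {"M1": 1, "M2": 2, "M3": 3, "M4": 5},
    "C2": {"M1": 2, "M2": 2, "M3": 2, "M4": 3},
    "C3": {"M1": 3, "M2": 4, "M3": 1, "M4": 2},
    "C4": {"M1": 5, "M2": 3, "M3": 5, "M4": 4},
}
target = {"M1": 2, "M2": 2}  # C5's known ratings


def nearest_neighbors(target, ratings, k):
    """Return the k customers closest to target on the movies target has rated."""
    shared = sorted(target)

    def distance(name):
        return dist([target[m] for m in shared], [ratings[name][m] for m in shared])

    return sorted(ratings, key=distance)[:k]


def predict(movie, neighbors, ratings):
    """Unweighted average of the neighbors' ratings for one movie."""
    return sum(ratings[n][movie] for n in neighbors) / len(neighbors)


neighbors = nearest_neighbors(target, ratings, k=3)
predictions = {m: predict(m, neighbors, ratings) for m in ("M3", "M4")}
print(neighbors)
print(predictions)
```

The movie with the higher predicted rating is the one to recommend.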

Question 5.


Here is a neural network with three input nodes and one output node.



                          output
                            |
                       output node
                     /      |      \
                 w1 /    w2 |       \ w3
                   /        |        \
                 x1         x2         x3



If a training example {x1 = 2, x2 = 3, x3 = 1} is fed to the network and the weights are
{w1 = 0.2, w2 = 0.5, w3 = 0.4}, and the transfer function is $f(x) = x^2$, what will the
output be?
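The forward pass for a single output node can be sketched in a few lines. This sketch assumes there is no bias term and that the output node applies the transfer function to the weighted sum of its inputs; the function name `forward` is hypothetical.

```python
def forward(inputs, weights, transfer):
    """Weighted sum of the inputs, passed through the transfer function."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return transfer(net)


inputs = [2, 3, 1]          # x1, x2, x3 from the question
weights = [0.2, 0.5, 0.4]   # w1, w2, w3 from the question

output = forward(inputs, weights, lambda net: net ** 2)
print(output)
```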

Question 6.

A bank selected 12 inputs to predict whether each applicant defaults. The output variable
(BAD) indicates whether an applicant defaulted on the home equity line of credit. The
following table describes the variables used.

     Name       Type      Description
     BAD        Binary    1 = applicant defaulted on loan or seriously delinquent
                          0 = applicant paid loan
     CLAGE      Numeric   Age of oldest credit line in months
     CLNO       Numeric   Number of credit lines
     DEBTINC    Numeric   Debt-to-income ratio
     DELINQ     Numeric   Number of delinquent credit lines
     DEROG      Numeric   Number of major derogatory reports
     JOB        Nominal   Occupational categories
     LOAN       Numeric   Amount of the loan request
     MORTDUE    Numeric   Amount due on existing mortgage
     NINQ       Numeric   Number of recent credit inquiries
     REASON     Binary    DebtCon = debt consolidation
                          HomeImp = home improvement
     VALUE      Numeric   Value of current property
     YOJ        Numeric   Years at present job



Question 6.1.a: Download the file HMEQ.arff I posted online. Load HMEQ.arff into
WEKA. What’s the class distribution for BAD (i.e. percentage of 1s vs. percentage of
0s)? Build a classification model to predict whether a customer will default using the
decision tree model J48 (classifiers -> trees -> J48). Please use the default parameter
settings for J48, and choose Percentage split (66%) under Test options. Please report the
classification accuracy rate of the model, as well as the confusion matrix. Compare the
accuracy with the original class distribution, and use the confusion matrix to comment
on how good the model is.
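When commenting on the model, it can help to set its accuracy against the accuracy of always predicting the majority class, and to look at how many defaulters are actually caught. The sketch below shows that arithmetic. The counts are made-up placeholders, not WEKA output, and the function name `summarize` is hypothetical.

```python
def summarize(tn, fp, fn, tp):
    """Accuracy, majority-class baseline, and recall on the positive (BAD = 1) class."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Baseline: always predict whichever actual class is more common.
    baseline = max(tn + fp, fn + tp) / total
    recall_bad = tp / (tp + fn)
    return accuracy, baseline, recall_bad


# Placeholder counts for illustration only (rows = actual class, columns = predicted):
#                 predicted 0   predicted 1
#   actual 0          80             5
#   actual 1          10             5
accuracy, baseline, recall_bad = summarize(tn=80, fp=5, fn=10, tp=5)
print(f"accuracy={accuracy:.3f} baseline={baseline:.3f} recall on BAD=1: {recall_bad:.3f}")
```

In this made-up case the accuracy is no better than the baseline and most defaulters are missed, which is the kind of pattern the confusion matrix reveals and the accuracy rate alone hides.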



6.1.b Now, go back to the Preprocess window where you have the HMEQ.arff file open.
Under Filter, choose filters-> supervised->instance->Resample. In the parameter window
for Resample, change biasToUniformClass to 1 while leaving other parameters
unchanged (i.e. invertSelection=False, noReplacement=False, sampleSizePercent=100).
Click on Apply and now report the class distribution of BAD. Use the same model setup
as in 6.1.a to build the model. Please report the classification accuracy rate of the model
built, as well as the confusion matrix. Please compare with the prior class distribution and
comment on how good the model is, and also use confusion matrix to comment on how
good the model is, especially compared to the model in 6.1.a.



6.1.c Now, go back to the Preprocess window, where you still have the data from the
resampling in 6.1.b. Click on Undo once to go back to the original data set before
your Resample in 6.1.b (or you can open HMEQ.arff again). Now, let’s use this data
to generate a training set as well as a test set. Open the parameter window for
Resample, set biasToUniformClass=0, invertSelection=False, noReplacement=True,
sampleSizePercent=20. Click Apply to get a data set for testing, then click Save to save
that data set as test.arff. Click on Undo once to go back to the data set right before you
sampled 20% for testing. Open the parameter window for Resample, set
biasToUniformClass=0, invertSelection=True, noReplacement=True,
sampleSizePercent=20. Click on Apply, and this will give you the rest of the data (other
than the instances we put in test.arff). Then, open the parameter window for Resample,
change biasToUniformClass to 1 while setting invertSelection=False, noReplacement=False,
sampleSizePercent=100. Clicking Apply will generate a training set that has a balanced
class distribution. Then save it as train.arff. Now, please explore both train.arff and
test.arff, and report the class distribution for BAD, as well as the number of instances in
each file.

Question 6.2.

Now, please download the other two data sets I posted online, HMEQ-train.arff and
HMEQ-test.arff. Open HMEQ-train.arff in WEKA, and run the following models using
the test option “Supplied test set”, which you set to be the HMEQ-test.arff file.

Model 1: J48, use default setting

Model 2: IBk (classifiers->lazy->IBk), set KNN=4, distanceWeighting=Weight by
1/distance.

Model 3: MultilayerPerceptron (classifiers->functions-> MultilayerPerceptron), use
default setting

Compare the results of these three models on accuracy rate and confusion matrix.




Question 6.3.

Download the Lift.xls file I posted online. The file contains the confidence of the
predictions for the J48 decision tree model and the neural networks (ANN) model. Please
draw lift curves to compare the performance of the tree and ANN in picking out the
default (BAD = 1) customers. (Note: The type of chart you should use is Scatter with
Smooth Lines. In one single chart, please include the baseline, the lift curve from the tree,
and the lift curve from the ANN. Some of the Excel functions that you might use are
ROW() and COUNTIF(). You can use other functions if you prefer.)
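The points behind such a curve can also be computed outside Excel. The sketch below ranks customers by predicted confidence and records the share of defaulters captured at each depth. It assumes each score is the model's confidence that BAD = 1; the scores and labels are made up for illustration, and the function name `cumulative_gains` is hypothetical.

```python
def cumulative_gains(scores, actuals):
    """Points (fraction of customers targeted, fraction of positives captured)."""
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(actuals)
    points = [(0.0, 0.0)]
    captured = 0
    for i, (_, actual) in enumerate(ranked, start=1):
        captured += actual
        points.append((i / len(ranked), captured / total_positives))
    return points


# Made-up example: confidence that BAD = 1, and the actual BAD value.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
actuals = [1, 0, 1, 0, 0]

points = cumulative_gains(scores, actuals)
for x, y in points:
    print(f"{x:.1f}  {y:.1f}")
```

The baseline is the diagonal from (0, 0) to (1, 1); a model's curve is better the further it sits above that line.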

Question 6.4.

Now, the bank wants to focus on the good customers. Assume that the customers in the
Lift.xls file are all the customers the bank wants to consider when it is deciding which
customers to mail the application for a new equity product. Assume the cost of sending
application materials to a customer is $3, the revenue the bank obtains from a non-
defaulting customer is $1000, and the loss the bank incurs from a defaulting customer is
$1500. If the actual response of the customer is 0, then the profit for that customer is
1000-3; if it’s 1, then the profit is -1500-3. Please decide for the bank the number of
customers to whom the bank should send the application, and the maximum profit the
bank can achieve. Is the maximum profit the bank can achieve higher under the tree
model or the ANN model? Please also experiment with the given cost/profit values (i.e.
$1000, $1500, $3), and identify a set of values under which the other model gives you the
higher maximum profit.
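The search for the best mailing size can be sketched as a running total over customers ranked from least to most likely to default. This sketch assumes each score is the model's confidence that BAD = 1; the data is made up, and the function name `best_mailing` is hypothetical.

```python
def best_mailing(p_default, actual_bad, revenue=1000, loss=1500, mail_cost=3):
    """Return (number of customers to mail, maximum cumulative profit)."""
    # Mail the customers the model considers safest first.
    order = sorted(range(len(p_default)), key=lambda i: p_default[i])
    best_n, best_profit, running = 0, 0.0, 0.0
    for n, i in enumerate(order, start=1):
        running += (-loss if actual_bad[i] else revenue) - mail_cost
        if running > best_profit:
            best_n, best_profit = n, running
    return best_n, best_profit


# Made-up example: predicted confidence of default and actual BAD value.
p_default = [0.1, 0.9, 0.3, 0.7, 0.2]
actual_bad = [0, 1, 0, 1, 0]

n, profit = best_mailing(p_default, actual_bad)
print(n, profit)
```

Running the same function on each model's scores, and then again with different `revenue`, `loss`, and `mail_cost` values, is one way to approach the comparison the question asks for.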

				