
Stat Project by jennyyingdi


Stat 474
Project #1

The consumer credit department of a bank wants to automate the decision making
process for approval of home equity lines of credit. To do this, they will follow the
recommendations of the Equal Credit Opportunity Act to create an empirically derived
and statistically sound credit scoring model. The model will be based on data collected
from recent applicants granted credit through the current process of loan underwriting.
The model will be built from predictive modeling tools, but the created model must be
sufficiently interpretable so as to provide a reason for any adverse actions (rejections).

The Home Equity dataset (DMNN.HMEQ on SODA) contains baseline and loan
performance information for 5,960 recent home equity loans. The target (BAD) is a
binary variable indicating whether an applicant eventually defaulted or was seriously
delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant,
12 input variables were recorded.

Name       Model Role   Measurement Level   Description
BAD        Target       Binary              1 = client defaulted on loan; 0 = loan repaid
CLAGE      Input        Interval            Age of oldest trade line in months
CLNO       Input        Interval            Number of credit lines
DEBTINC    Input        Interval            Debt-to-income ratio
DELINQ     Input        Interval            Number of delinquent credit lines
DEROG      Input        Interval            Number of major derogatory reports
JOB        Input        Nominal             Six occupational categories
LOAN       Input        Interval            Amount of the loan request
MORTDUE    Input        Interval            Amount due on existing mortgage
NINQ       Input        Interval            Number of recent credit inquiries
REASON     Input        Binary              DebtCon = debt consolidation; HomeImp = home improvement
VALUE      Input        Interval            Value of current property
YOJ        Input        Interval            Years at present job

The credit scoring model will give a probability of a given loan applicant defaulting on
loan repayment. A threshold will be selected such that all applicants whose probability
of default is in excess of the threshold will be recommended for rejection.

Construct the project diagram in SAS Enterprise Miner, configuring each node as described below:

In the target profiler, set up a profit matrix to reflect that for every two dollars loaned to a
person who does not default, three dollars are eventually returned (a profit of one dollar). A
bad loan of two dollars loses the two-dollar principal.
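SAS Enterprise Miner captures this in the target profiler's decision matrix. As a language-neutral check of the arithmetic, here is a small Python sketch; the function and dictionary names are illustrative, not part of Enterprise Miner:

```python
# Profit matrix implied by the assignment: a $2 loan to a non-defaulter
# returns $3 (profit $1); a $2 loan to a defaulter loses the $2 principal;
# a rejected application earns nothing either way.
PROFIT = {
    ("accept", 0): 1.0,   # BAD = 0: loan repaid
    ("accept", 1): -2.0,  # BAD = 1: default, principal lost
    ("reject", 0): 0.0,
    ("reject", 1): 0.0,
}

def expected_profit(decision, p_default):
    """Expected profit of a decision given a predicted default probability."""
    return ((1 - p_default) * PROFIT[(decision, 0)]
            + p_default * PROFIT[(decision, 1)])

# Break-even: accept while (1 - p) * 1 - 2p > 0, i.e. while p < 1/3.
assert abs(expected_profit("accept", 1 / 3)) < 1e-9
```

Under this matrix the profit-maximizing cutoff is a default probability of 1/3, which is the kind of threshold the assignment asks you to select.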

In the Data Partition node, assign 67% of the cases for training the model, and the
remaining 33% for validating the model.
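The same partition can be sketched outside Enterprise Miner. This stand-in uses a plain random shuffle with a fixed seed; the Data Partition node can also stratify by the target, which is omitted here for brevity:

```python
import random

def partition(indices, train_frac=0.67, seed=12345):
    """Assign cases at random to training (67%) and validation (33%)."""
    rng = random.Random(seed)        # fixed seed for a repeatable split
    idx = list(indices)
    rng.shuffle(idx)
    cut = round(train_frac * len(idx))
    return idx[:cut], idx[cut:]

# For the 5,960 HMEQ cases: 3,993 training, 1,967 validation.
train, valid = partition(range(5960))
```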

In the Impute node, use the tree-based imputation method for both interval and class variables.
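Tree-based imputation predicts each missing value from the other inputs. As a minimal illustration of the idea (not Enterprise Miner's actual algorithm, which grows a full tree per variable), this sketch fits a one-split decision stump from a single fully observed predictor and imputes the branch mean:

```python
def stump_impute(x, z):
    """Crude stand-in for tree-based imputation: fit a one-split decision
    stump that predicts x from a fully observed input z, then fill each
    missing x (None) with the mean of its branch."""
    pairs = [(zi, xi) for zi, xi in zip(z, x) if xi is not None]
    best = None  # (sse, cut, left_mean, right_mean)
    for cut in sorted({zi for zi, _ in pairs})[:-1]:
        left = [xi for zi, xi in pairs if zi <= cut]
        right = [xi for zi, xi in pairs if zi > cut]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((xi - ml) ** 2 for xi in left)
               + sum((xi - mr) ** 2 for xi in right))
        if best is None or sse < best[0]:
            best = (sse, cut, ml, mr)
    _, cut, ml, mr = best
    return [xi if xi is not None else (ml if zi <= cut else mr)
            for zi, xi in zip(z, x)]
```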

Specify one of the two Tree nodes as a CHAID tree and the other as a C4.5 tree. In both
Tree nodes, set the minimum number of observations in a leaf to 20 and the number of
observations required for a split search to 50. In addition, for the CHAID tree, set the
maximum depth of the tree to 10 and the maximum number of branches to 4. For the C4.5
tree, set the maximum depth to 20, but keep the maximum number of branches at the
default value of 2.
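Neither CHAID nor C4.5 is available in most open-source toolkits (scikit-learn, for instance, grows CART trees), so as a sketch the settings above are recorded here as plain configuration dictionaries; the keys are hypothetical, and only the values come from the assignment:

```python
# Hypothetical configuration keys; only the values come from the assignment.
COMMON = dict(leaf_size=20,    # minimum observations in a leaf
              split_size=50)   # observations required for a split search

CHAID_TREE = dict(criterion="chi-square", max_depth=10, max_branch=4, **COMMON)
C45_TREE   = dict(criterion="entropy",    max_depth=20, max_branch=2, **COMMON)
```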

In the Transform Variables node, select the transformation “Optimal Binning for
Relationship to Target” for the following variables:
       LOAN                  YOJ
       MORTDUE               DEROG
       VALUE                 DELINQ
Specify 4 bins for each of the transformations. Retain any other default settings.
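Optimal binning searches for cutpoints that maximize the association between the binned input and the target. As a greedy stand-in (not Enterprise Miner's exact algorithm), this sketch repeatedly splits the bin whose best cutpoint most reduces the squared error of the binary target, until four bins exist:

```python
def optimal_bins(x, y, n_bins=4):
    """Greedy stand-in for "Optimal Binning for Relationship to Target":
    split the bin whose best cutpoint most reduces the squared error of
    the binary target until n_bins bins exist; return sorted cutpoints."""
    def sse(ys):
        m = sum(ys) / len(ys)
        return sum((yi - m) ** 2 for yi in ys)

    def best_split(pts):
        # pts: list of (x, y); returns (gain, cut) or None if unsplittable
        best = None
        for cut in sorted({xv for xv, _ in pts})[:-1]:
            left = [yv for xv, yv in pts if xv <= cut]
            right = [yv for xv, yv in pts if xv > cut]
            gain = sse([yv for _, yv in pts]) - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, cut)
        return best

    bins, cuts = [sorted(zip(x, y))], []
    while len(bins) < n_bins:
        cand = [(best_split(b), i) for i, b in enumerate(bins)]
        cand = [(s, i) for s, i in cand if s is not None]
        if not cand:
            break
        (gain, cut), i = max(cand)
        b = bins.pop(i)
        bins += [[p for p in b if p[0] <= cut], [p for p in b if p[0] > cut]]
        cuts.append(cut)
    return sorted(cuts)
```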

In the Variable Selection node, use the following settings:
      Target Model = R-Square
      For the R-square selection criterion settings:
              Use AOV16 Variables = Yes
              Use Group Variables = Yes
Retain any other default settings.
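The R-square criterion begins with a univariate screen: the squared correlation of each (possibly transformed) input with the target. A minimal sketch of that statistic, with the AOV16 and group-variable refinements omitted:

```python
def r_squared(x, y):
    """Squared Pearson correlation of one input with the target -- the
    univariate screen behind the Variable Selection node's R-square
    criterion (AOV16 binning and class-level grouping are omitted)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)
```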

In all of the Regression nodes, use Stepwise Regression for variable selection, with
significance levels for both Entry and Stay set to 0.10. Also set the Maximum Number
of Steps to 15. Retain any other default settings.
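The entry/stay logic of stepwise selection can be sketched independently of the model being fit. Here `p_value(selected, var)` is a hypothetical callable standing in for the partial chi-square test a real logistic regression would compute:

```python
def stepwise(candidates, p_value, slentry=0.10, slstay=0.10, max_steps=15):
    """Entry/stay skeleton of stepwise selection as configured in the
    Regression nodes: each step admits the best candidate with p < SLENTRY,
    then drops any selected variable whose p >= SLSTAY, for at most
    max_steps steps."""
    selected = []
    for _ in range(max_steps):
        changed = False
        outs = [v for v in candidates if v not in selected]
        if outs:
            best = min(outs, key=lambda v: p_value(selected, v))
            if p_value(selected, best) < slentry:
                selected.append(best)
                changed = True
        for v in list(selected):  # backward "stay" check
            if p_value([w for w in selected if w != v], v) >= slstay:
                selected.remove(v)
                changed = True
        if not changed:
            break
    return selected
```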

Model assessment is performed on the validation data set, using Total Expected Profit as the criterion.

What to show in your write-up:
•   An explanation of the diagram used. Specifically, there are 5 different predictive
    models being compared; briefly describe the differences between these models.
•   From the Input Data Source node:
      1.     the status of the variables (model role and measurement type);
      2.     the profit matrix used.
•   From the Model Comparison node: the chart of Total Expected Profit for the
    Validation data set, and an assessment of which model is better, and why.
      1.     If a Decision Tree model is selected as the best model, display the entire
             tree, and list the variables used in order of importance.
      2.     If a Logistic Regression model is selected as best, provide a summary of
             the results of the stepwise regression, showing the variables included in
              the model, their corresponding partial χ² tests and p-values, and the
             overall test of the significance of the model.
•   For any model that is fit, the model provides the expected profit for each case in
    the data set. If you sort the cases by expected profit in descending order and
    divide the sorted cases into 10 deciles, the Total Expected Profit within each
    decile is what is displayed on the profit chart. For the model you select as best,
    determine the percentage of top cases that should be selected to generate the
    maximum Total Expected Profit. (This is the point on the chart where the curve
    levels off.) Also specify the Total Expected Profit at that point.
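Given per-case expected profits from any fitted model, the depth at which cumulative profit peaks can be computed directly. This sketch mirrors the decile chart described above; the function name is illustrative:

```python
def best_depth(profits):
    """Sort cases by expected profit (descending), cut into 10 deciles,
    and return the selection depth (% of cases) where cumulative Total
    Expected Profit peaks -- the point where the profit chart levels off."""
    ranked = sorted(profits, reverse=True)
    n = len(ranked)
    best = (float("-inf"), 0)
    for d in range(1, 11):
        total = sum(ranked[: round(d * n / 10)])
        if total > best[0]:
            best = (total, d * 10)
    return best[1], best[0]  # (% of cases selected, profit at that depth)
```

For example, with the $1 / -$2 profit matrix above, a portfolio of six good and four bad loans peaks at a 60% selection depth.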

