Project # 1
The consumer credit department of a bank wants to automate the decision-making
process for approval of home equity lines of credit. To do this, they will follow the
recommendations of the Equal Credit Opportunity Act to create an empirically derived
and statistically sound credit scoring model. The model will be based on data collected
from recent applicants granted credit through the current process of loan underwriting.
The model will be built from predictive modeling tools, but the created model must be
sufficiently interpretable so as to provide a reason for any adverse actions (rejections).
The Home Equity dataset (DMNN.HMEQ on SODA) contains baseline and loan
performance information for 5,960 recent home equity loans. The target (BAD) is a
binary variable indicating whether an applicant eventually defaulted or was seriously
delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant,
12 input variables were recorded.
BAD      Target  Binary    1=client defaulted on loan
CLAGE    Input   Interval  Age of oldest trade line in months
CLNO     Input   Interval  Number of credit lines
DEBTINC  Input   Interval  Debt-to-income ratio
DELINQ   Input   Interval  Number of delinquent credit lines
DEROG    Input   Interval  Number of major derogatory reports
JOB      Input   Nominal   Six occupational categories
LOAN     Input   Interval  Amount of the loan request
MORTDUE  Input   Interval  Amount due on existing mortgage
NINQ     Input   Interval  Number of recent credit inquiries
REASON   Input   Binary    DebtCon=debt consolidation, HomeImp=home improvement
VALUE    Input   Interval  Value of current property
YOJ      Input   Interval  Years at present job
The credit scoring model will give a probability of a given loan applicant defaulting on
loan repayment. A threshold will be selected such that all applicants whose probability
of default is in excess of the threshold will be recommended for rejection.
Construct the following diagram in SAS Enterprise Miner:
In the target profiler, set up a profit matrix reflecting that every two dollars loaned to
an applicant who does not default eventually returns three dollars (a net profit of one
dollar), while a bad loan of two dollars loses the two dollars loaned.
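To make the payoff description concrete, here is a small sketch of the implied profit matrix and the break-even default probability. This is my own arithmetic from the description above, not output from Enterprise Miner; the variable names are illustrative.

```python
# Hedged sketch: the profit matrix implied by "lend $2, get $3 back if good,
# lose the $2 if bad", and the break-even default probability it implies.

# Per two dollars loaned:
profit_good = 3 - 2   # good loan: $3 returned on $2 loaned -> net $1
loss_bad = -2         # bad loan: the $2 principal is lost
profit_reject = 0     # a rejected applicant yields nothing either way

# Profit matrix: rows = decision (accept, reject), cols = outcome (good, bad)
profit_matrix = [[profit_good, loss_bad],
                 [profit_reject, profit_reject]]

def expected_profit_accept(p_default):
    """Expected profit of accepting an applicant with default probability p."""
    return profit_good * (1 - p_default) + loss_bad * p_default

# Break-even: 1*(1-p) - 2*p = 0  ->  p = 1/3
threshold = profit_good / (profit_good - loss_bad)
print(profit_matrix)        # [[1, -2], [0, 0]]
print(round(threshold, 4))  # 0.3333
```

Under this matrix, the profit-maximizing threshold rejects any applicant whose predicted default probability exceeds 1/3.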
In the Data Partition node, assign 67% of the cases for training the model, and the
remaining 33% for validating the model.
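The 67/33 partition has a rough scikit-learn analogue, sketched below on a toy stand-in for the HMEQ table (the `hmeq` frame and its contents are invented for illustration; Enterprise Miner stratifies the partition by the target, which `stratify` mimics).

```python
# Sketch of the 67/33 training/validation split with a toy stand-in
# for the HMEQ data (not the real dataset).
import pandas as pd
from sklearn.model_selection import train_test_split

hmeq = pd.DataFrame({
    "LOAN": range(1000),
    "BAD": [1 if i % 5 == 0 else 0 for i in range(1000)],  # ~20% bad, like HMEQ
})
train, valid = train_test_split(hmeq, train_size=0.67,
                                stratify=hmeq["BAD"],  # keep BAD rate equal in both parts
                                random_state=1)
print(len(train), len(valid))  # 670 330
```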
In the Impute node, use the tree-based imputation method for both interval and class variables.
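Tree-based imputation predicts each missing value from the other inputs with a tree model. A rough scikit-learn sketch for interval inputs is below (Enterprise Miner's implementation differs in detail, and class variables would need a classifier-based variant; the data here is a toy example).

```python
# Rough analogue of tree-based imputation: fill each missing value using a
# decision-tree regressor trained on the remaining inputs.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, np.nan],   # missing value to be imputed
              [4.0, 8.0]])
imputer = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=3),
                           random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)  # no NaNs remain
```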
Specify one of the two Tree nodes as a CHAID tree, and the other one as a C4.5 tree.
In both Tree nodes, set the minimum number of observations in a leaf to 20, and set
the number of observations required for a split search to 50. In addition, for the CHAID
tree, set the maximum depth of the tree to 10 and the maximum number of
branches to 4. For the C4.5 tree, set the maximum depth to 20, but keep the
maximum number of branches at the default value of 2.
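The two Tree-node configurations have only approximate scikit-learn analogues: sklearn grows CART-style binary trees, so CHAID's multi-way chi-square splits and C4.5's gain-ratio criterion are not available, but the shared leaf/split-search settings map directly (entropy is the closest stand-in for C4.5's information-based criterion).

```python
# Approximate sklearn counterparts of the two Tree-node configurations
# (CART binary trees, not true CHAID or C4.5).
from sklearn.tree import DecisionTreeClassifier

chaid_like = DecisionTreeClassifier(min_samples_leaf=20,   # >= 20 obs per leaf
                                    min_samples_split=50,  # >= 50 obs to search a split
                                    max_depth=10)          # CHAID depth limit
                                    # (4-way branching has no CART equivalent)
c45_like = DecisionTreeClassifier(min_samples_leaf=20,
                                  min_samples_split=50,
                                  max_depth=20,            # C4.5 depth limit
                                  criterion="entropy")     # closest to C4.5's gain ratio
print(chaid_like.get_params()["max_depth"],
      c45_like.get_params()["criterion"])
```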
In the Transform Variables node, select the transformation “Optimal Binning for
Relationship to Target” for the following variables:
Specify 4 bins for each of the transformations. Retain any other default settings.
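One way to mimic "Optimal Binning for Relationship to Target" with 4 bins is a depth-2 decision tree on a single input: the tree picks the cutpoints that best separate the target, giving up to 4 target-driven bins. This is a sketch of the idea on invented data, not Enterprise Miner's exact algorithm.

```python
# Sketch of target-optimal binning: a depth-2 tree on one input yields
# up to 3 cutpoints, i.e. up to 4 bins chosen for their relationship to BAD.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.uniform(0, 60, size=500).reshape(-1, 1)  # e.g. a DEBTINC-like input (toy data)
y = rng.binomial(1, x.ravel() / 60)              # default odds rise with the input

tree = DecisionTreeClassifier(max_depth=2)       # depth 2 -> at most 4 leaves (bins)
tree.fit(x, y)
cuts = sorted(t for t in tree.tree_.threshold if t != -2)  # -2 marks leaf nodes
print(cuts)  # the cutpoints defining the bins
```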
In the Variable Selection node, use the following settings:
Target Model = R-Square
For the R-square selection criterion settings:
Use AOV16 Variables = Yes
Use Group Variables = Yes
Retain any other default settings.
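The spirit of R-square-based variable selection can be sketched as a greedy forward search that keeps adding the variable giving the largest R-square improvement. The sketch below is my own minimal version on synthetic data; Enterprise Miner's AOV16 and group-variable handling are product-specific and omitted.

```python
# Minimal sketch of forward R-square variable selection (not the exact
# Variable Selection node algorithm).
import numpy as np

def forward_r2_select(X, y, min_improvement=0.005):
    """Greedily add the column that most improves R-square of an OLS fit."""
    selected, remaining, best_r2 = [], list(range(X.shape[1])), 0.0
    while remaining:
        r2s = {}
        for j in remaining:
            A = np.column_stack([np.ones(len(y)), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            r2s[j] = 1 - resid.var() / y.var()
        j_best = max(r2s, key=r2s.get)
        if r2s[j_best] - best_r2 < min_improvement:
            break  # stop when no candidate improves R-square enough
        best_r2 = r2s[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=300)  # only cols 0, 3 matter
print(forward_r2_select(X, y))  # [0, 3]
```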
In all of the Regression nodes, use Stepwise Regression for variable selection, with
significance levels for both Entry and Stay set to 0.10. Also set the Maximum Number
of Steps to 15. Retain any other default settings.
Model assessment is performed on the validation data set, using Total Expected Profit as the criterion.
What to show in your write-up:
An explanation of the diagram used. Specifically, there are 5 different predictive
models being compared. Briefly describe the differences between these models.
From the Input Data Source node:
1. the status of the variables (model role and measurement type);
2. the profit matrix used.
From the Model Comparison node: the chart of Total Expected Profit for the
Validation data set, and an assessment of which model is best, and why.
1. If a Decision Tree model is selected as the best model, display the entire
tree, and list the variables used in order of importance.
2. If a Logistic Regression model is selected as best, provide a summary of
the results of the stepwise regression, showing the variables included in
the model, their corresponding partial chi-square tests and p-values, and the
overall test of the significance of the model.
Any model that is fit will provide the Total Expected Profit for each
case in the data set. If you sort these cases by Total Expected Profit, and then
divide these sorted cases into 10 deciles, the Total Expected Profit within each
decile is what is displayed on the profit chart. For the model you select as best,
determine the percentage of top cases that should be selected to generate
maximum Total Expected Profit. (This would be the point on the chart where the
curve levels off.) Also specify the Total Expected Profit at that point.
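The decile calculation described above can be sketched as follows. The scores here are invented (not the HMEQ results), and the per-case profit uses the $1-net-good / $2-lost-bad payoffs from the profit matrix section; the point where cumulative profit peaks is the depth at which the curve levels off.

```python
# Sketch of the decile profit chart: sort cases best-first by expected profit,
# split into 10 deciles, and find the depth where cumulative profit peaks.
import numpy as np

rng = np.random.default_rng(3)
p_default = np.sort(rng.uniform(0, 1, size=1000))        # toy model scores, best first
expected_profit = 1 * (1 - p_default) - 2 * p_default    # $1 if good, -$2 if bad

deciles = np.array_split(expected_profit, 10)            # 10 deciles, best to worst
decile_profit = np.array([d.sum() for d in deciles])     # profit within each decile
cumulative = np.cumsum(decile_profit)                    # running Total Expected Profit

best_depth = (np.argmax(cumulative) + 1) * 10            # % of top cases to accept
print(best_depth, round(cumulative.max(), 2))
```

With uniform toy scores the curve peaks near the top 30% of cases, consistent with the break-even default probability of 1/3 implied by the profit matrix.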