Auto Loan Credit Model
MGS 8040 Data Mining Spring 2007
Auto Loan Credit Model
Project Report
Yuanyuan Ding Will Munji Geetanjali Talreja Mukund Thiyagarajan
-1-
Auto Loan Credit Model
TABLE OF CONTENTS 1.0 2.0 3.0 4.0 5.0 6.0 PROJECT OVERVIEW PROJECT DATASET DATA PROCESSING ANALYSIS AND PRESENTATION TOOLS MODEL DEVELOPMENT MODEL RESULTS AND DISCUSSION 3 3 3 3 4 4
4 6 9 9
Model Scorecard KS Tests Suggestions for Scorecard Use Suggestions for Model Monitoring
7.0 8.0
JUDGING THE EFFECTIVENESS OF THE MODEL JUDGING THE EFFECTIVENESS OF THE MODEL
10 10
A GLOBAL CLASSIFICATION RATE WAS PERFORMED FOR THE EVALUATION OF THE MODEL AND THE RESULTS ARE AS FOLLOWS: 10 9.0 CONCLUSION 11 12 13 16
APPENDIX A – MODEL DEVELOPMENT FLOWCHART APPENDIX B - DATA DICTIONARY OF THE MERGED FILE (XTRA1 + DATA1) APPENDIX C – SAS REGRESSION OUTPUT (FINAL REGRESSION)
-2-
Auto Loan Credit Model
1.0 Project Overview This project involves the analysis of a dataset of auto-finance consumer data, including credit–bureau information, loan-application data, and company data regarding consumers‟ performance since acquisition. The goal of this project was to analyze the dataset and create a prediction model to evaluate the credit-worthiness of new loan applicants.
2.0 Project Dataset The dataset contained 14,042 observations including 55 independent variables, some of which were metric and others, categorical with each observation representing a separate auto-loan. The dataset consists of binary dependent variable „GOOD‟, which indicated whether the customer‟s auto loan was not charged-off. The rate of „BAD‟ performance (1-GOOD) in the original dataset was found to be 32.76%. 3.0 Data Processing The dataset was thoroughly reviewed by printing ad-hoc frequency tables and cross-tabs of independent variables against the dependent variable. Based on this review, five text-only variables and the loan account number were excluded from the analysis. The variables were then formatted to identify ranges that would split up variable data into 2 – 10% segments. Using the ratio of GOOD/BAD of these segments, increasing or decreasing trends were then identified based on an independent variables relationship to the dependent variable. In order to increase the sensitivity of the model, each segment was then assigned a binary indicator (“dummy”) variable, each representing 0:1 membership in a subcategory of the original variable. Neutral groups were defined with a ratio that was similar to those of special cases. Variables that did not follow a logical pattern were excluded from further analysis. 4.0 Analysis and Presentation Tools This project utilized SAS® v.9.13 for data processing and analysis, and Microsoft® Excel and Visio 2003 for results presentation.
-3-
Auto Loan Credit Model
5.0 Model Development The model was developed by following the steps outlined in the flowchart show in Appendix A. The major steps in development included Splitting the dataset into 2 equal halves Designating one dataset as Training, and the other as Validation Analyzing the Training dataset by formatting the data into 2 – 10% bands, creating “dummy variables” Creating “dummy” macros for processing the data in the regression analysis Performing numerous of iterations of the regression analysis and successively removing or dropping whole variables or “dummies” that were statistically insignificant (in general, judged to be high (> .05) p-values on regression results). Despite having relatively high p-values, a few variables were retained in the model because of their trends or prior/external knowledge about the variable. Refining the model (taking into consideration variables that are least correlated with each other and most correlated with the dependent variable) Scoring the Validation dataset and the [original] combined dataset Performing KS tests on the scored datasets to determine the effectiveness of the models in separating GOODs from BADs Developing a monitoring report and a model score card from the final regression indicating the variables and coefficients included in the model
6.0 Model Results and Discussion Model Scorecard The final model is depicted in the following scorecard (Figure 6.1). The scorecard reflects a plainEnglish explanation of the regression results in a form that is easily transferable to a production loan application system. The intercept (score of 591) indicates the base score assigned to each applicant. Some variables have a number of predefined ranges, each with a different coefficient. Positive coefficients would increase the applicant‟s score while negative coefficients would decrease an applicant‟s score. Each range affects the applicant‟s score in an additive way.
-4-
Auto Loan Credit Model
The scorecard contains a few variables whose ranges and score adjustments may not match intuition – these variables are may be unusually correlated with other variables, and a further refinement of the model should result in more intuitive scoring, as well as higher overall K-S tests on training and holdout samples.
Variable Intercept Average Age of Trades (AGEAVG1), months
Range
Points 591.8
28 - 54 55+ 0 - 10 10.01 - 18
49.8 89.6 -131.8 -135.3 -95.7 -59.6 42.5 66.7 92.5 23.2 54.6 46.0 24.5 16.2 54.2 63.2 41.0 -21.3 47.9 89.7 78.8 25.1 -65.9 67.1 -36.2 -135.4 -72.0
Ratio of Sat Trades to Total Trades (HSATRT)
18.01 - 30 30.01+ 0+ 6 - 119 120+ 1+ 4801 - 7200 7201+ 46 - 58 59 - 78
# of Inquiries in Last 12 Mos. (INQ012) Time at Job, months (VJOBMOS) S&V Net Trade In (NTTRIN) V Base Wholesale Value (BSWHOL), $
Age of Oldest Revolving Trades (RVOLDT), months
79 - 102 103 - 150 151+
# of Revolving Trades (RVTRDS) # of Open Revolving Trades (ORVTB01)
1+ 1-2 3+ 0 1-6 16+ 11+ 0 - 4399 12+ 9 - 11
# of Trades Ever Rated Bad Debt (HST79X) Time since Most Recent 30DPD+ Rating (TIME291), months S&V Book Retail Value (BKRETL), $ Vehicle age (based on Vehicle Model Year MODLYR), Years
-5-
Auto Loan Credit Model
8 0-4 Figure 6.1 – Model Scorecard KS Tests
-60.9 38.2
The Kolmogorov-Smirnov test, also known as the KS test, was used to evaluate the delinquent and nondelinquent (or good/bad) scoring results. The greater the separation of these plots, the better the model. The KS test was run on the Training dataset as well as the Validation dataset. The difference of the KS index between the Training and Validation dataset was less than 10%, indicating the model was not over fitted to the Training dataset.
The Training dataset (Fig 6.2), the best regression model produced a K-S test of 0.226 on score group 649-699 range. The Validation dataset (Fig 6.3), the best regression model produced a K-S test of 0.212 on score group 649-699 range. The Combined dataset (Fig 6.4), the best regression model produced a K-S test of 0.216 on score group 649-699 range.
K S T est fo r T rai i g set nn
100 90 80
ati C u m u l ve P ercen tag e
70 60 50 40 30 20 10 0
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9 29 to 20 0 to
E
99
94
89
84
79
74
69
64
59
54
49
44
39
34
O
to
to
to
to
to
to
to
to
to
to
to
to
to
M
9
0
9
0
9
0
9
0
9
0
9
0
9
0
to
or
94
90
84
80
74
70
64
60
54
50
44
40
34
30
10
00
S co re cu to ff
Figure 6.2 – KS Test for Training Dataset
-6-
24
9
24
R
9
Auto Loan Credit Model
KS Test for Validation set
100 90 80 70 60 50 40 30 20 10 0 1000 949 to 900 to 849 to 800 to 749 to 700 to 649 to 600 to 549 to 500 to 449 to 400 to 349 to 300 to or 999 949 899 849 799 749 699 649 599 549 499 449 399 349 MORE
Cumulative Percentage
Score Cutoff
Figure 6.3 – KS Test for Validation Dataset
-7-
Auto Loan Credit Model
KS Test for full set
100 90 80 70 60 50 40 30 20 10 0 1000 or MORE 949 to 999 900 to 949 849 to 899 800 to 849 749 to 799 700 to 749 649 to 699 600 to 649 549 to 599 500 to 549 449 to 499 400 to 449 349 to 399 300 to 349 249 to 299
Cumulative Percentage
Score Cutoff
Figure 6.4 – KS Test for Combined Dataset
-8-
Auto Loan Credit Model
25
20
15
KS Index
Full set Validation set Training set
10
5
0
or 94 MOR 9 E to 90 99 0 9 to 84 94 9 9 to 80 89 0 9 to 74 84 9 9 to 70 79 0 9 to 64 74 9 9 to 60 69 0 9 to 54 64 9 9 to 50 59 0 9 to 44 54 9 9 to 40 49 0 9 to 34 44 9 9 to 30 39 0 9 to 24 34 9 9 to 20 29 0 9 to 24 9
10 00
Score Cutoff
Suggestions for Scorecard Use Without any input from the client, we recommend a cut-off score of 649, which matches the maximum KS test value from the combined (training + validation) sample. This cut-off score could allow the client to separate customers into two categories – good vs. bad.
Alternatively, the client could implement a tiered application of the scorecard for different levels of loan pricing to as follows: 500-649 - applicants in an „average‟ category f risk 700-849 – applicants in a „low‟ category of risk 850+ - applicants in a „very low‟ category of risk
Suggestions for Model Monitoring We recommend that model performance be evaluated on a quarterly basis, comparing actual performance of new applicants since last refinement of the Model with performance predicted by the
-9-
Auto Loan Credit Model
Model. The organization should set a threshold for Model error at or below some percentage over a given period subject to the organizations balance between risk and loan origination.
Furthermore, we recommend that general attributes of applicant data be monitored, because the predictive power of the model should decrease as the organization changes the distribution of certain customer attributes. For example, a strategic shift or prolonged marketing campaign toward attracting sub-prime borrowers would affect the distributions of these attributes. Such skewing would negatively influence the predictive power of the model, necessitating a re-fit to the most current applicant data.
It is therefore possible that regression estimates suggested by a further refinement of the scorecard could produce higher K-S tests while maintaining a low difference between the samples. We recommend a thorough analysis and correction for violations of underlying model assumptions such as normality and multicollinearity. A reduction of multicollinearity between predictors could lead to more intuitive patterns of scorecard coefficients.
7.0 Judging the Effectiveness of the Model
Our model results indicate that the recommended cut-off score for separating GOOD from BAD applicants is 649, which incidentally is the lower bound of the range that had the maximum KS separation. At this score, XX%. Adopting a particular cut-off would require a cost/benefit analysis of delinquent vs. non-
delinquent customers. If the client were to lower the cut-off score, they would need to determine whether allowing more lower-scored applicants to be acquired would increase total revenues or not. 8.0 Judging the Effectiveness of the Model
A Global Classification rate was performed for the evaluation of the Model and the results are as follows:
True Good Predicted Good Predicted Bad Total
True Bad
Total
6138 3304 9442
2601 1999 4600
8739 5303 14042
GCR = ([(6138+1999)]/14042) = 58%
- 10 -
Auto Loan Credit Model
From the table, it has been found that, The Bad Rate Percent at the maximum separation of 649+ was 28% whereas the Bad percent was 33% if everyone in the group were accepted.
9.0 Conclusion
While good, the KS test separation observed in all three datasets (Training, Validation and Combined) could have been slightly higher. Factors that could influence and yield a higher score include improving the “dummy” variables and overall variable selection for the model.
These not withstanding, once implemented, this model will realize an overall and substantial decrease in delinquency rates of auto loan applications.
- 11 -
Auto Loan Credit Model
Appendix A – Model Development Flowchart
Data1.sd2 Xtra1.sd2
Step 1
ü ü
Combine Data Sets (Merge)
SORT / DATA-MERGE original Data1.sd2 and Xtra1.sd2 into Combo.sd2 PROC FREQ with applying leveling formats to variables, then FREQ * dependent variable GOOD
Combo.sd2
Step 2
ü
Split into two datasets
Split data set into 2 using if-then-else logic and random numbers Use Tran.sd2 for model development Keep Valid.sd2 aside for model validation
ü ü
Train.sd2
Valid.sd2
Prepare Data
ü ü ü
Remove dummies / variables based on p / chi sq values
Step 3 Split up variables amongst group members FORMAT / PROC FREQ the data so each group contained 2 – 10 % Created Dummy Variables, ratios assigned to Dummies and Neutral variables and Created Dummy Macro program REG / DATA – run regression on formatted data
Perform regression
ü ü
Refine Model
Step 4
Score Train, Valid & Combo data sets
ü ü ü ü ü
SCORE / DATA FORMAT score output Final Model consists of X dummies (X variables) REG / SCORE / FREQ / K-S test, dropped insignificant entire groups REG / SCORE / FREQ / K-S test, merged insignificant dummy variables into adjacent neutral groups Score Train dataset, Valid dataset and Combo datasets using coefficients Perform KS tests Create Scorecard and interpret results
Perform KS tests
ü
Analyze results Scorecard
ü ü
Implement
- 12 -
Auto Loan Credit Model
Appendix B - Data Dictionary of the Merged File (Xtra1 + Data1) Alphabetic List of Variables and Attributes # Variable Type Len Format Num Num Num Num Num Num Num Num Num Num Num Num Num Num Num Num Num Num Num Num Num Num Num Num 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 Label MATCH KEY (ACCOUNT #) Average Age of Trds Age of Oldest Trade Ratio of Bal. to HC for All Open Auto Tr PERF. (C/O IN 12 MOS.) S&V BOOK RETAIL VALUE Time Since Bankruptcy # of Open Bank Rev Trds With Bal>=$1000 # of Bank Rev. Trds Ever 30 DPD # of Bank Rev. Trds Ever 60 DPD # of Bank Rev. Trds Ever 90 DPD # of Bank Rev. Trds Ever 120+ DPD Age of Oldest Bank Rev. Trade # of Open Bank Rev. Trds # of Bank Rev. Trds V BASE RETAIL VALUE V BASE WHOLESALE VALUE Credit Bureau Type # of Fin. Trds CONTRACT PRICE PAID Ratio of Currently Sat Trds to Open Trds # of Trds Currently Rated Sat. S&V DOWN PAYMENT PERF. (NOT CHARGED OFF)
20 ACCTNO 2 1 AGEAVG AGEOTD
14 AUUTIL 25 BAD 54 BKRETL 18 BKTIME 52 BRBAL1 10 BRHS2X 11 BRHS3X 12 BRHS4X 13 BRHS5X 9 8 7 BROLDT BROPEN BRTRDS
31 BSRETL 30 BSWHOL 19 CBTYPE 15 CFTRDS 37 CONTPR 48 CSORAT 6 CURSAT
28 DWNPMT 26 GOOD
- 13 -
Auto Loan Credit Model
Alphabetic List of Variables and Attributes # 5 Variable HSATRT Type Len Format Num Num Num Num Char Num Num Char Num Char Num Num Num Num Num Num Num Num Num Num Num Num Num Num Num 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 Label Ratio of Sat. Trds to Total Trds # of Trds Never 90DPD+ # of Trds Ever Rated Bad Debt # of Inqs. in Last 12 Mos. V MAKE OF AUTO V MILEAGE S&V MONTHLY GROSS PAY V MODEL OF AUTO V MODEL YEAR V NEW USED INDICATOR S&V NET TRADE IN # of Open Rev. Trds # of Open Rev. Trds With Bal. > $0 # of Derogatory Public Records # of Retail Trds Ever Rated Sat. Age of Oldest Rev. Trade # of Rev. Trds # of Trds Rated 30 DPD+ in the Last 24 M # of Trds Rated 60 DPD+, 24 Mos. # of Trds Rated 90 DPD+, 24 Mos. S&V TERM OF LOAN Mos. Since Most Recent 30DPD+ Rating Mos. Since Most Recent 60DPD+ Rating Mos. Since Most Recent 90DPD+ Rating Total Bal
44 HST03X 43 HST79X 16 INQ012 32 MAKE 34 MILEAG 36 MNGPAY 33 MODEL 55 MODLYR 35 NEWUSE 29 NTTRIN 40 OREVTR 42 ORVTB0 17 PUBREC 53 REHSAT 38 RVOLDT 41 RVTRDS 49 T2924X 50 T3924X 51 T4924X 27 TERM 45 TIME29 46 TIME39 47 TIME49 4 TOTBAL
- 14 -
Auto Loan Credit Model
Alphabetic List of Variables and Attributes # 3 Variable TRADES Type Len Format Num Num Num Char 8 8 8 4 8 8 TEN. # of Trds # of Trds Opened in Last 24 Mos. CUSTOMER AGE CHECKING/SAVINGS ACCOUNTS TIME AT JOB IN MONTHS TIME AT RESIDENCE IN MONTHS Label
39 TROP24 21 VAGE 24 VDDASAV
23 VJOBMOS Num 22 VRESMOS Num
- 15 -
Auto Loan Credit Model
Appendix C – SAS Regression Output (Final Regression) The SAS System The REG Procedure Model: bgscore Dependent Variable: GOOD PERF. (NOT CHARGED OFF) Number of Observations Read 14042 Number of Observations Used 14042
Analysis of Variance Source Model Error DF 29 Sum of Squares 194.45290 Mean F Value Pr > F Square 6.70527 32.41 <.0001
14012 2898.63925 0.20687
Corrected Total 14041 3093.09215
Root MSE
0.45483
R-Square 0.0629 Adj R-Sq 0.0609
Dependent Mean 0.67241 Coeff Var 67.64131
Parameter Estimates Variable Intercept Intercept Label DF Parameter Standard t Value Pr > |t| Estimate Error 1 1 1 1 1 1 1 0.59182 0.04978 0.08956 -0.13175 -0.13527 -0.09568 -0.05956 0.02055 0.01060 0.01570 0.01844 0.02221 0.02183 0.02192 28.79 4.70 5.70 -7.14 -6.09 -4.38 -2.72 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 0.0066
AGEAVG1 AGEAVG : 28 - 54 AGEAVG2 AGEAVG : 55 - 999999 HSATRT1 HSATRT2 HSATRT3 HSATRT4 HSATRT : 0 - 10 HSATRT : 10.01 - 18 HSATRT : 18.01 - 30 HSATRT : 30.01 - 999999
- 16 -
Auto Loan Credit Model
Parameter Estimates Variable inq0121 Label inq012 : 0 - 9999 DF Parameter Standard t Value Pr > |t| Estimate Error 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.04249 0.06672 0.09250 0.02321 0.05462 0.04604 0.02448 0.01619 0.05422 0.06316 0.04098 -0.02129 0.04790 0.08967 0.07882 0.02509 -0.06593 0.06713 -0.03621 -0.13543 -0.07200 -0.06085 0.03823 0.02362 0.00883 0.01658 0.00921 0.01046 0.01427 0.01620 0.01665 0.01804 0.01764 0.02099 0.01233 0.01092 0.01956 0.01787 0.01020 0.02064 0.00984 0.01243 0.02401 0.01469 0.01528 0.01003 1.80 7.56 5.58 2.52 5.22 3.23 1.51 0.97 3.01 3.58 1.95 -1.73 4.39 4.58 4.41 2.46 -3.19 6.82 -2.91 -5.64 -4.90 -3.98 3.81 0.0721 <.0001 <.0001 0.0118 <.0001 0.0013 0.1308 0.3307 0.0027 0.0003 0.0509 0.0841 <.0001 <.0001 <.0001 0.0139 0.0014 <.0001 0.0036 <.0001 <.0001 <.0001 0.0001
vjobmos1 vjobmos : 6 - 119 vjobmos2 vjobmos : 120 - 9999 NTTRIN1 NTTRIN : 1 - 999999
BSWHOL1 BSWHOL : 4801 - 7200 BSWHOL2 BSWHOL : 7201 - 999999 RVOLDT1 RVOLDT : 46 - 58 RVOLDT2 RVOLDT : 59 - 78 RVOLDT3 RVOLDT : 79 - 102 RVOLDT4 RVOLDT : 103 - 150 RVOLDT5 RVOLDT : 151 - 999999 RVTRDS1 RVTRDS : 1 - 99999 ORVTB01 ORVTB0 : 1 - 2 ORVTB02 ORVTB0 : 3 - 99999 HST79X1 HST79X2 HST79X3 TIME291 BKRETL1 HST79X : 0 - 0 HST79X : 1 - 6 HST79X : 16 - 9999999 TIME29 : 11 - 9999999 BKRETL : 0 - 4399
MODLYR1 MODLYR : 0 - 1986 MODLYR2 MODLYR : 1987 - 1989 MODLYR3 MODLYR : 1990 - 1990 MODLYR4 MODLYR : 1994 - 9999999
- 17 -