
MGS 8040: Data Mining
Final Project Report
Auto Loan Credit Model

Project Team:
Yu Meng
Wenxuan Ye
Yang Wang

1. PROJECT OVERVIEW

This project develops a logistic regression model and an accompanying scorecard
for a financial company to determine whether to approve a customer's auto loan
application, helping the company increase its acceptance rate while reducing
overall risk.

SAS 9 and MS Excel were used in the project.

2. DATA PREPARATION

a) Data Source

The data was collected by an auto loan company from its current and previous
customers. It contains 14,042 applicants for car loans and 55 variables,
including the applicant's age, credit history, and other information. See
Appendix A for a list of the variables. The dependent variable is a binary
variable named "GOOD", which indicates that the customer's loan was not
charged off. In the original dataset the rate of "GOOD" performance was 67.24%
and the rate of "BAD" (1 - GOOD) was 32.76%.

b) Data Processing

The data was provided in two files, data1.sd2 and xtral.sd2. We started the
project by merging these two files on the common variable ACCTNO. We then
randomly split the combined dataset into a training dataset named "trainp"
and a validation dataset named "validp", each containing approximately 50%
of the observations. The training dataset was used to build the model and the
validation dataset was used for testing purposes.
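The merge-and-split step can be sketched as follows. This is an illustrative Python sketch, not the SAS code used in the project; the record layout keyed on ACCTNO is a simplified assumption.

```python
import random

def merge_and_split(data1, xtra1, seed=42, train_frac=0.5):
    """Merge two record sets on the common key ACCTNO, then
    randomly split the result into ~50/50 train/validation sets."""
    # Index the second file by ACCTNO for fast lookup.
    extra = {rec["ACCTNO"]: rec for rec in xtra1}
    # Keep accounts present in both files, combining their fields.
    merged = [{**rec, **extra[rec["ACCTNO"]]}
              for rec in data1 if rec["ACCTNO"] in extra]
    # Shuffle reproducibly, then cut into the two halves.
    rng = random.Random(seed)
    rng.shuffle(merged)
    cut = int(len(merged) * train_frac)
    return merged[:cut], merged[cut:]  # trainp, validp
```

Usage: `trainp, validp = merge_and_split(data1_records, xtra1_records)`.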

Fortunately the data provided was already clean and no variables had to be
dropped, so we could proceed directly to further analysis. The next step was to
define and create dummy variables. We started by assigning appropriate formats
to selected variables so that each category contained approximately 7% to 11%
of the observations; this ensured proper representation of every category
without giving too much weight to any one of them. Five text-only variables and
the account number were excluded from the analysis. We then cross-tabulated
each formatted variable against the dependent variable GOOD, and imported the
cross-tab charts into MS Excel to compute the GOOD/BAD ratio for each segment.
Taking increasing/decreasing trends into account, neutral groups and dummy
groups were created from these ratios; neutral groups were defined as those
whose ratio was similar to that of the special cases. Variables that did not
follow a logical pattern were excluded from further analysis.
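The GOOD/BAD ratio computation behind these cross tabs can be sketched as follows. This is an illustrative Python sketch, not the SAS/Excel workflow used in the project; the `bin_fn` argument is a hypothetical stand-in for a SAS format.

```python
def good_bad_ratio(records, bin_fn):
    """Cross-tabulate GOOD against a binned variable and return,
    per category, a (good, bad, good/bad ratio) tuple."""
    counts = {}
    for rec in records:
        cat = bin_fn(rec)                 # which category this record falls in
        good, bad = counts.get(cat, (0, 0))
        if rec["GOOD"] == 1:
            good += 1
        else:
            bad += 1
        counts[cat] = (good, bad)
    # A category with no bads gets an infinite ratio.
    return {cat: (g, b, g / b if b else float("inf"))
            for cat, (g, b) in counts.items()}
```

Categories with similar ratios would then be merged into one dummy group, as described above.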

Example: dummy creation process using the variable CONTPR (Contract Price Paid).

[Cross table of CONTPR by GOOD not reproduced here.]
See Appendix B for an example of SAS program for a dummy creating process.
3. REGRESSION

Dummy creation was the most time-consuming and complicated part of the whole
project. Once it was done, a logistic regression was performed on all of the
dummy variables. In each iteration we dropped variables whose p-values
exceeded a threshold that we tightened step by step: 0.8, 0.5, 0.2, 0.1, and
finally 0.05. We also examined the coefficient of each variable and compared it
with the cross-tab results to identify variables whose coefficient patterns ran
contrary to the observed trend or to business logic. For each such variable we
either eliminated the dummy or combined it with the neighboring category,
whichever kept the model logical. After 5 iterations we were left with 18
variables (31 dummy variables). All of the dummy variables have p-values less
than 0.05 and pairwise correlations less than 0.3, which kept the variables as
uncorrelated as possible.
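The pairwise-correlation screen can be sketched as follows. This is an illustrative Python sketch (the project did this screening on SAS output); the 0.3 threshold is the one stated above.

```python
def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def correlated_pairs(columns, threshold=0.3):
    """Return name pairs of dummy columns whose absolute correlation
    meets or exceeds the threshold, flagging candidates to drop or merge."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                flagged.append((a, b))
    return flagged
```

Each flagged pair would be reviewed against the cross-tab trends before deciding which dummy to keep.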

4. RESULTS AND DISCUSSIONS

a) Scorecard

The following scorecard shows how the variable values are treated in the model.
The intercept score of 761 is the base score assigned to each applicant. Each
variable has a number of predefined ranges, each carrying a different number of
points: positive points increase the applicant's score while negative points
decrease it.

Table 1: Scorecard

Variable                                        Interval          Model Points
Intercept                                                                  761
MILEAG: V mileage                               0 TO 70K                    37
                                                70,001 TO 80K               30
                                                Over 80K                     0
MNGPAY: S&V monthly gross pay                   $1 TO $1000                -45
                                                $1001 TO $1250             -61
                                                $1251 TO $1600             -33
                                                Over $1600                   0
MODLYR: V model year                            1969 TO 1993                 0
                                                1994                        54
                                                1995                        81
                                                Over 1995                   92
NTTRIN: S&V net trade in                        0                          -58
                                                $1 TO $1000                  0
                                                Over $1000                  47
OREVTR: # of open rev. trades                   special case                 0
                                                1                           46
                                                2+                           0
RVOLDT: age of oldest rev. trade                0 TO 35                    -51
                                                36+                          0
T4924X: # of trades rated 90 DPD+, 24 mos.      0 TO 3                       0
                                                4 TO 6                     -49
                                                Over 6                     -53
TIME29: mos. since most recent 30 DPD           special case                 0
                                                0 TO 5                     -78
                                                6 TO 9                     -56
                                                10+                          0
AGEOTD: age of oldest trade                     -5 TO 72                   -64
                                                Over 72                      0
BKTIME: time since bankruptcy                   Under 60                     0
                                                60+                         97
BSRETL: V base retail value                     0 TO 8000                    0
                                                8001 TO 11000               31
                                                Over 11000                   0
BSWHOL: V base wholesale value                  0 TO 1500                  -77
                                                Over 1500                    0
CONTPR: contract price paid                     1 TO 10000                   0
                                                Over 10000                 -48
CURSAT: # of trades currently rated sat.        -6 TO 1                      0
                                                2                           95
                                                Over 2                     109
DWNPMT: S&V down payment                        1 TO 999                   -62
                                                1000 TO 1800                 0
                                                1801 TO 2700                78
                                                Over 2700                  128
HSATRT: ratio of sat. trades to total trades    0 TO 27.78                   0
                                                Over 27.78                  42
HST03X: # of trades never 90 DPD+               0 TO 3                       0
                                                4 TO 5                      39
                                                Over 5                       0
HST79X: # of trades ever rated bad debt         0                           64
                                                1 TO 3                      32
                                                4+                           0
INQ012: # of inqs. in last 12 mos.              -4 TO 2                      0
                                                Over 2                     -21
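Applying the scorecard is simple additive arithmetic: start from the intercept and add the points for the band each variable value falls into. A minimal sketch in Python (illustrative only; the table below reproduces just three rows of Table 1, and the upper-bound encoding of the intervals is an assumption of this sketch):

```python
# Each entry maps a variable to (upper_bound, points) pairs; the first
# band whose upper bound is >= the value applies. Only a few Table 1
# rows are reproduced here for demonstration.
SCORECARD = {
    "MILEAG": [(70000, 37), (80000, 30), (float("inf"), 0)],
    "MODLYR": [(1993, 0), (1994, 54), (1995, 81), (float("inf"), 92)],
    "DWNPMT": [(999, -62), (1800, 0), (2700, 78), (float("inf"), 128)],
}
INTERCEPT = 761  # base score from Table 1

def score(applicant):
    """Total score = intercept + points from each variable's band."""
    total = INTERCEPT
    for var, bands in SCORECARD.items():
        value = applicant[var]
        for upper, points in bands:
            if value <= upper:
                total += points
                break
    return total
```

For example, an applicant with 60,000 miles, a 1994 model year, and a $3,000 down payment would score 761 + 37 + 54 + 128 = 980 on these three variables.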


b) K-S Test

To assess the final model, we applied the K-S test to the delinquent and
non-delinquent (GOOD/BAD) scoring results. The K-S test measures the distance
between the cumulative distribution functions of the two classes, GOOD and
BAD. The score that produces the maximum difference between the cumulative
distributions of Good and Bad is taken as the cutoff value for accepting or
rejecting an applicant; the greater the separation, the better the model. We
applied the K-S test to the training dataset, the validation dataset, and the
merged dataset, as shown below.
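The cumulative-difference computation behind Tables 2 to 4 can be sketched as follows: accumulate good and bad counts down the score bands, convert to cumulative fractions, and take the maximum gap. This is an illustrative Python sketch with made-up band counts, not the project's data.

```python
def ks_statistic(bands):
    """bands: list of (bad, good) counts ordered from highest score
    band to lowest. Returns (max difference, index of cutoff band)."""
    total_bad = sum(b for b, _ in bands)
    total_good = sum(g for _, g in bands)
    cum_bad = cum_good = 0
    best, best_idx = 0.0, 0
    for i, (bad, good) in enumerate(bands):
        cum_bad += bad
        cum_good += good
        # Gap between the two cumulative distribution functions here.
        diff = cum_good / total_good - cum_bad / total_bad
        if diff > best:
            best, best_idx = diff, i
    return best, best_idx
```

The band where `diff` peaks corresponds to the cutoff range reported in Table 5.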



Table 2: K-S Test analysis for training set

  BGSCORE         BAD   GOOD   Total   Cum. Good   Cum. Bad   Cum. Ratio Good   Cum. Ratio Bad   Difference
 1000 OR MORE       3     59     62         59          3         0.01241          0.00133       1.1087%
 950 TO 999          15      76     91     135     18   0.02840   0.00796    2.0446%
 900 TO 949          18     172    190     307     36   0.06459   0.01592    4.8676%
 850 TO 899          40     285    325     592     76   0.12455   0.03360    9.0954%
 800 TO 849          89     437    526    1029    165   0.21649   0.07294   14.3551%
 750 TO 799         166     589    755    1618    331   0.34042   0.14633   19.4086%
 700 TO 749         254     716    970    2334    585   0.49106   0.25862   23.2438%
 650 TO 699         317     694   1011    3028    902   0.63707   0.39876   23.8309%
 600 TO 649         452     743   1195    3771   1354   0.79339   0.59859   19.4808%
 550 TO 599         363     496    859    4267   1717   0.89775   0.75906   13.8686%
 500 TO 549         257     287    544    4554   1974   0.95813   0.87268    8.5453%
 450 TO 499         142     111    253    4665   2116   0.98149   0.93546    4.6030%
 400 TO 449          92      64    156    4729   2208   0.99495   0.97613    1.8823%
 350 TO 399          35      16     51    4745   2243   0.99832   0.99160    0.6716%
 300 TO 349          13       7     20    4752   2256   0.99979   0.99735    0.2442%
 250 TO 299           4       1      5    4753   2260   1.00000   0.99912    0.0884%
 200 TO 249           2       0      2    4753   2262   1.00000   1.00000    0.0000%
Figure 1: K-S Test based on training set




Table 3: K-S Test analysis for validation set

      BGSCORE           BAD    GOOD   Total   Cum. Good   Cum. Bad   Cum. % Good   Cum. % Bad   Difference
  1000 OR MORE              3      69     72      69      3   0.01462   0.00133     1.3292%
   950 TO 999              17      92    109     161     20   0.03412   0.00887     2.5252%
   900 TO 949              21     176    197     337     41   0.07141   0.01817     5.3240%
   850 TO 899              54     281    335     618     95   0.13096   0.04211     8.8850%
   800 TO 849              97     433    530    1051    192   0.22272   0.08511    13.7610%
   750 TO 799             175     577    752    1628    367   0.34499   0.16268    18.2311%
   700 TO 749             265     697    962    2325    632   0.49269   0.28014    21.2547%
   650 TO 699             319     729   1048    3054    951   0.64717   0.42154    22.5628%
   600 TO 649             420     717   1137    3771   1371   0.79911   0.60771    19.1397%
   550 TO 599             371     461    832    4232   1742   0.89680   0.77216    12.4637%
   500 TO 549             224     279    503    4511   1966   0.95592   0.87145     8.4469%
    450 TO 499             149     128    277    4639   2115   0.98305   0.93750     4.5547%
    400 TO 449              84      55    139    4694   2199   0.99470   0.97473     1.9968%
    350 TO 399              39      15     54    4709   2238   0.99788   0.99202     0.5860%
    300 TO 349              12       8     20    4717   2250   0.99958   0.99734     0.2236%
    250 TO 299               5       2      7    4719   2255   1.00000   0.99956     0.0443%
    200 TO 249               1       0      1    4719   2256   1.00000   1.00000     0.0000%




Figure 2: K-S test based on validation set
Table 4: K-S test analysis for merged set


   BGSCORE        BAD    GOOD   Cum. Total   Cum. Bad   Cum. Good   Cum. % Bad   Cum. % Good   Difference
 1000 OR MORE      12     129       141          12        129       0.00261      0.01366       1.1054%
 950 TO 999         34     158     333        46      287   0.01000   0.03040     2.0396%
 900 TO 949         41     340     714        87      627   0.01891   0.06641     4.7492%
 850 TO 899        115     576    1405       202     1203   0.04391   0.12741     8.3496%
 800 TO 849        188     856    2449       390     2059   0.08478   0.21807    13.3286%
 750 TO 799        365    1164    3978       755     3223   0.16413   0.34135    17.7217%
 700 TO 749        517    1377    5872      1272     4600   0.27652   0.48718    21.0663%
 650 TO 699        671    1452    7995      1943     6052   0.42239   0.64097    21.8575%
 600 TO 649        860    1436   10291      2803     7488   0.60935   0.79305    18.3704%
 550 TO 599        723     977   11991      3526     8465   0.76652   0.89653    13.0004%
 500 TO 549        506     547   13044      4032     9012   0.87652   0.95446     7.7937%
 450 TO 499        285     252   13581      4317     9264   0.93848   0.98115     4.2670%
 400 TO 449        179     127   13887      4496     9391   0.97739   0.99460     1.7207%
 350 TO 399         70      31   13988      4566     9422   0.99261   0.99788     0.5273%
 300 TO 349         25      16   14029      4591     9438   0.99804   0.99958     0.1533%
 250 TO 299          7       4   14040      4598     9442   0.99957   1.00000     0.0435%
 200 TO 249          2       0   14042      4600     9442   1.00000   1.00000     0.0000%
   0 TO 200           0       0   14042      4600     9442   1.00000   1.00000     0.0000%
Figure 3: K-S test on merged set




Table 5: K-S Test summary table

Set              Cutoff score   Cum% Good   Cum% Bad   K-S score
Training set     650 - 700         63.71%     39.88%      23.83%
Validation set   650 - 700         64.72%     42.15%      22.56%
Merged set       650 - 700         64.10%     42.24%      21.86%


From this table, we can see that for the training, validation, and merged sets
the maximum separation is around 22%, occurring in the 650 - 700 score range,
which is acceptable for this project. The variation across the three sets is
within 2%, which is relatively good. This suggests the model is stable and
applicable.
c) Implementation

Upon delivery of the model, the client can use it to score potential
applicants. As indicated by the K-S test, the cutoff score separating Good from
Bad applicants falls somewhere in the 650 to 700 range; the recommended cutoff
point is 650. At this score there is a high percentage of Good customers and a
manageable number of Bad customers. Any customer scoring below 650 should be
considered risky to finance and offered an appropriately higher interest rate.
Any score above 650 should be considered less risky and given a lower interest
offer commensurate with the score: the higher the score, the less risky the
customer and the lower the rate should be. Of course, the client would have to
analyze what mix of Good and Bad customers above 650 gives the business the
most profit.
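This policy can be sketched as a simple decision rule. The 650 cutoff is the one recommended above, while the rate tiers are hypothetical placeholders that the client would set from its own profitability analysis.

```python
def loan_decision(score, cutoff=650):
    """Map a scorecard score to a decision and a hypothetical rate
    tier; higher scores get lower-rate offers."""
    if score < cutoff:
        # Risky applicant: higher rate, or decline after review.
        return ("refer/high-rate", None)
    # Hypothetical tiers above the cutoff, best rate for best scores.
    if score >= 850:
        return ("approve", "tier 1")
    if score >= 750:
        return ("approve", "tier 2")
    return ("approve", "tier 3")
```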

d) Monitoring Report

To monitor the performance of the model after implementation, the client should
periodically rerun the K-S test to make sure the scorecard is still valid and
the model remains effective. Redevelopment may be necessary if the model fails
to validate. For example, once the model has been in use for six months, a
preliminary report on the GOOD vs. BAD score distribution could be generated to
check whether the credit score model is still appropriately separating the good
accounts from the bad ones.

Table 7: Monitoring Report

Actual vs. Expected Score Distribution

Score Above   Expected Distribution   Actual Distribution   Difference
>1000                  1.00413%
>950                   2.37146%
>900                   5.08475%
>850                  10.00570%
>800                  17.44054%
>750                  28.32930%
>700                  41.81740%
>650                  56.93633%
>600                  73.28728%
>550                  85.39382%
>500                  92.89275%
>450                  96.71699%
>400                  98.89617%
>350                  99.61544%
>300                  99.90742%
>250                  99.98576%
>200                 100.00000%
>Low                 100.00000%

(The Actual and Difference columns are to be filled in after implementation.)
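Once actual scores are collected, filling in the Difference column of Table 7 is a per-row subtraction. A sketch in Python, with made-up actual percentages for illustration:

```python
def monitoring_report(expected, actual):
    """expected/actual: dicts mapping a score threshold (e.g. '>650')
    to the cumulative % of applicants at or above it. Returns, per
    threshold, (expected %, actual %, actual - expected)."""
    return {thr: (exp, actual[thr], round(actual[thr] - exp, 5))
            for thr, exp in expected.items() if thr in actual}
```

Large or growing differences would signal that the score distribution has shifted and the model should be revalidated.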
Appendixes:
A. Data Dictionary

1.    ACCTNO         account number; match key

2.    AGEAVG         average age of trades (financial contracts)

3.    AGEOTD         age of oldest trade

4.    AUUTIL         ratio of balance to high credit for all open auto transactions

5.    BAD            performance (charge-off within 12 months)

6.    BKRETL         S&V book retail value; retail book value of the car being
                     purchased.

7.    BKTIME         time since bankruptcy

8.    BRBAL1         # of open bank revolving trades (credit cards) with
                     balances >= $1000

9.    BRHS2X         # of bank revolving trades ever 30 DPD (days past due)

10.   BRHS3X         # of bank revolving trades ever 60 DPD

11.   BRHS4X         # of bank revolving trades ever 90 DPD

12.   BRHS5X         # of bank revolving trades ever 120+ DPD

13.   BROLDT         age of oldest bank revolving trade

14.   BROPEN         # of open bank revolving trades

15.   BRTRDS         # of bank revolving trades

16.   BSRETL         vehicle base retail value

17.   BSWHOL         vehicle base wholesale value

18.   CBTYPE         credit bureau type

19.   CFTRDS         # of financial trades

20.   CONTPR         contract price paid

21.   CSORAT         ratio of currently satisfactory trades to open trades

22.   CURSAT         # of trades currently rated satisfactory
23.   DWNPMT   S&V down payment

24.   GOOD     performance (not charged off)

25.   HSATRT   ratio of satisfactory trades to total trades

26.   HST03X   # of trades never 90 DPD+

27.   HST79X   # of trades ever rated bad debt

28.   INQ012   # of inquiries in last 12 months

29.   MAKE     vehicle make of auto

30.   MILEAG   vehicle mileage

31.   MNGPAY   S&V monthly gross pay

32.   MODEL    vehicle model of auto

33.   MODLYR   vehicle model year

34.   NEWUSE   vehicle new/used indicator

35.   NTTRIN   S&V net trade in

36.   OREVTR   # of open revolving trades

37.   ORVTBO   # of open revolving trades with balances > $0

38.   PUBREC   # of derogatory public records

39.   REHSAT   # of retail trades ever rated satisfactory

40.   RVOLDT   age of oldest revolving trade

41.   RVTRDS   # of revolving trades

42.   T2924X   # of trades rated 30 DPD+ in the last 24 months

43.   T3924X   # of trades rated 60 DPD+ in the last 24 months

44.   T4924X   # of trades rated 90 DPD+ in the last 24 months

45.   TERM     S&V term of loan

46.   TIME29   months since most recent 30 DPD+ rating

47.   TIME39   months since most recent 60 DPD+ rating

48.   TIME49   months since most recent 90 DPD+ rating
49.    TOTBAL                 total balance

50.    TRADES                 # of trades

51.    TROP24                 # of trades opened in the last 24 months

52.    VAGE                   customer age

53.    VDDASAV                checking/savings accounts

54.    VJOBMOS                time at job in months

55.    VRESMOS                time at residence in months



B. SAS program for a dummy creating process

1) Frequency

proc freq data=save.trainp;
  tables contpr;
run;



2) Cross table and applying format

proc format;
  value contpr
    0<-3000='$1 TO $3000'
    3000<-4500='$3001 TO $4500'
    4500<-5500='$4501 TO $5500'
    5500<-6000='$5501 TO $6000'
    6000<-6500='$6001 TO $6500'
    6500<-7000='$6501 TO $7000'
    7000<-7500='$7001 TO $7500'
    7500<-8000='$7501 TO $8000'
    8000<-9000='$8001 TO $9000'
    9000<-10000='$9001 TO $10000'
    10000<-11000='$10001 TO $11000'
    11000<-12000='$11001 TO $12000'
    12000<-high='$12001+';
run;



ods html file="C:\Users\^_^&^_^\Documents\My SAS Files\Project 1\Output\tabs.html";

proc freq data=save.trainp;
  tables contpr*good;
  format contpr contpr.;
run;

ods html close;



3) Create dummy variables

%macro dummy(var,first,last,tot);
  if (&first<=&var<=&last) then &var.&tot=1;
  else &var.&tot=0;
  label &var.&tot="&var:&first-&last";
%mend dummy;



%dummy (contpr,1,4500,1)

%dummy (contpr,4501,6000,2)

%dummy (contpr,7501,10000,3)

%dummy (contpr,10001,9999999,4)



C. Project Flow
