									                         PAKDD COMPETITION 2009

               Predictive Modeling Credit Risk Assessment
                                        Wenjun Zhang
                         Zhejiang University of Science and Technology

Abstract: The PAKDD 2009 data mining competition required the analysis of real-world data from a major
Brazilian retail chain, with a focus on the robustness of model performance against degradation over time.
To achieve a high Area Under the Receiver Operating Characteristic Curve (AUC) for classification
and a robust model, we use Stochastic Gradient Boosting with a slow learning
strategy as the data mining algorithm in this project.
Keywords: PAKDD 2009 Competition; Credit Risk Assessment; Stochastic Gradient Boosting

1    Introduction
The PAKDD 2009 Data Mining Competition presents a problem in the well-known application area of
credit scoring (see [1]). Offering credit to potential clients is an important service for
stimulating consumption in the market. On the other hand, credit scoring involves some difficulties
that modelers often overlook. The aim is therefore to build a scoring
model that is robust against the performance degradation caused by gradual market changes
over a few years of business operation.

2    Data Observations
Some observations about the dataset are:
(1) Diversity of the modeling variable types. Among the 30 modeling variables, 22 are
numeric (including ordered and continuous variables) and 7 are categorical.
(2) Different scales of the numeric variables.
(3) Missing data in the dataset.
(4) Misrecorded and dirty data.
(5) Outlier data.
Modeling on this dataset is therefore a complex task. The following sections explain the adopted technique,
Stochastic Gradient Boosting.

3    Variable Selection
To limit the effect of gradual market changes over a few years of business operation as far as possible, we
excluded three variables that show large offsets across the three datasets (modeling, leaderboard and prediction). Furthermore,
the model retains relatively high accuracy without these three variables.
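
The paper does not state how the offsets between datasets were measured. As one possible illustration only, the sketch below computes a Population Stability Index (PSI) for a numeric variable between a reference sample and another sample. The function name, the equal-width binning and the floor of 1e-6 for empty bins are all assumptions made for this example, not the author's procedure.

```python
import math


def population_stability_index(reference, other, n_bins=10):
    """PSI between two numeric samples, binned on the reference sample's range.

    Hypothetical helper: the paper does not say which shift measure was used.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp values outside the reference range
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_share = bin_shares(reference)
    oth_share = bin_shares(other)
    return sum((o - r) * math.log(o / r) for r, o in zip(ref_share, oth_share))
```

A variable whose PSI between the modeling data and the leaderboard or prediction data is large would be a candidate for exclusion under this kind of check; any cut-off value would itself be a modeling choice.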

4    Modeling Technique
We adopt Stochastic Gradient Boosting for this problem because it can deal with diverse variable
types and missing data. Stochastic Gradient Boosting was introduced by J. Friedman [2].
It generates small trees whose outputs are summed to obtain an overall score.
Each tree builds on its predecessors, like a series expansion in which each added term
progressively improves the prediction. By setting a small shrinkage parameter, following the slow
learning strategy, we guard against overfitting and build a robust model.
Stochastic Gradient Boosting also gives importance values for the modeling variables (see
Table 1).
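
To make the idea of summed small trees with a small shrinkage parameter concrete, here is a minimal sketch in plain Python. It is not the software or the settings used in this project. It assumes binary 0/1 labels, numeric features without missing values, one-split trees (stumps), and a plain gradient step for the leaf values. All function names and default parameter values are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals, rows):
    """Least-squares one-split tree fitted to the residuals of the given rows."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({X[i][j] for i in rows}):
            left = [residuals[i] for i in rows if X[i][j] <= threshold]
            right = [residuals[i] for i in rows if X[i][j] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lm, rm)
    return best[1:] if best else None


def fit_sgb(X, y, n_trees=200, shrinkage=0.05, subsample=0.5, seed=0):
    """Stochastic gradient boosting for logistic loss (simplified sketch)."""
    rng = random.Random(seed)
    n = len(X)
    pos = sum(y) / n
    f0 = math.log(pos / (1.0 - pos))  # initial score; needs both classes present
    scores = [f0] * n
    stumps = []
    for _ in range(n_trees):
        # Negative gradient of the logistic loss: label minus predicted probability.
        residuals = [y[i] - sigmoid(scores[i]) for i in range(n)]
        # "Stochastic": each tree sees only a random subsample, drawn without replacement.
        rows = rng.sample(range(n), max(2, int(subsample * n)))
        stump = fit_stump(X, residuals, rows)
        if stump is None:
            continue
        j, t, left_value, right_value = stump
        stumps.append(stump)
        # "Slow learning": only a small fraction of each tree's output is added.
        for i in range(n):
            scores[i] += shrinkage * (left_value if X[i][j] <= t else right_value)
    return f0, shrinkage, stumps


def predict_proba(model, x):
    """Sum the shrunken stump outputs and map the score to a probability."""
    f0, shrinkage, stumps = model
    score = f0 + sum(shrinkage * (lv if x[j] <= t else rv) for j, t, lv, rv in stumps)
    return sigmoid(score)
```

Calling `fit_sgb(X, y)` on a list of feature rows and a list of 0/1 labels returns a model that `predict_proba` can score. A smaller `shrinkage` makes each tree contribute less, so more trees are needed, which is the trade-off behind the slow learning strategy.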

                                  Table 1. Variable Importance
                           Variable name                           Importance value
        AGE                                                              100.00
        PROFESSION_CODE                                                  64.64
        ID_SHOP                                                          62.03
        FLAG_RESIDENCIAL_PHONE$                                          59.27
        MONTHS_IN_THE_JOB                                                57.37
        MARITAL_STATUS$                                                  51.11
        MONTHS_IN_RESIDENCE                                              48.17
        RESIDENCE_TYPE$                                                  41.87
        SEX$                                                             39.96
        AREA_CODE_RESIDENCIAL_PHONE                                      39.72
        FLAG_RESIDENCE_TOWN$                                             33.79
        FLAG_FATHERS_NAME$                                               26.97
        QUANT_ADDITIONAL_CARDS_IN_THE_APPLICATION                        16.35
        SHOP_RANK                                                        14.15
        FLAG_RESIDENCIAL_ADDRESS$                                        13.17
        FLAG_RESIDENCE_STATE$                                            11.46
        FLAG_MOTHERS_NAME$                                               10.48
        QUANT_DEPENDANTS                                                  0.00
        FLAG_OTHER_CARD$                                                  0.00
        FLAG_MOBILE_PHONE$                                                0.00
        FLAG_CARD_INSURANCE_OPTION$                                       0.00
        QUANT_BANKING_ACCOUNTS                                            0.00
        EDUCATION                                                         0.00
        COD_APPLICATION_BOOTH                                             0.00
        FLAG_CONTACT_PHONE$                                               0.00
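
The largest value in Table 1 is exactly 100.00, which is consistent with relative importance scores rescaled so that the most important variable gets 100. A minimal sketch of that rescaling follows. The raw scores are hypothetical inputs (for example, the total split improvement credited to each variable), and the function name is made up for this example.

```python
def scale_importance(raw_scores):
    """Rescale raw importance scores so that the largest becomes 100.

    raw_scores: dict mapping variable name -> non-negative raw importance
    (hypothetical input; the paper reports only the scaled values).
    """
    top = max(raw_scores.values())
    if top <= 0:
        return {name: 0.0 for name in raw_scores}
    # Sort from most to least important, as in Table 1.
    ranked = sorted(raw_scores.items(), key=lambda kv: kv[1], reverse=True)
    return {name: round(100.0 * value / top, 2) for name, value in ranked}
```

Under this scaling, a value of 0.00 means the variable received no credit at all in the fitted model, not merely a small one.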

5   Result Analysis
The rank-ordering performance metric selected by the PAKDD 2009 competition judges was the
area under the ROC curve (AUC; see [3]), applied to an independent validation sample of
10000 accounts. The AUC on the testing data is 0.677. The ROC curve is shown in Figure 1.
                                     Figure 1.
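
For reference, the area under the ROC curve can be computed directly from scores and labels as the probability that a randomly chosen positive case is scored above a randomly chosen negative case, with ties counted as one half. The sketch below uses this rank-based formulation. The function name is an assumption, and this is not the competition's official scoring code.

```python
def auc(labels, scores):
    """Area under the ROC curve via average ranks (ties count one half).

    labels: list of 0/1 values, with at least one of each; scores: list of numbers.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the block of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported 0.677 lies between the two.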

References
[1] PAKDD2009, 2009.

[2] Friedman, J. H. Greedy function approximation: A gradient boosting machine.
Annals of Statistics, 29, 1189-1232, 2001.

[3] ROC Curve, 2009.
