					Towards Logistic Regression Models
  for Predicting Fault-prone Code
      across Software Projects
               Erika Camargo
                     and
              Ochimizu Koichiro
      Japan Advanced Institute of Science and Technology


                   ESEM 2009
                  Contents
1.   Abstract
2.   Background
3.   Problem Analysis
4.   Case study
5.   Results
6.   Conclusion and Future Work



                        Abstract
Challenge:
To make logistic regression (LR) models that use
design-complexity metrics able to predict fault-prone
object-oriented classes across software projects.
[Figure: an LR model takes X, a design-complexity metric,
 and yields P(y=1), the probability that a class is fault-prone]
First attempt at a solution:
simple log data transformations

                  Background
• Some design-complexity metrics have been shown to
  be good predictors of fault-prone classes in LR
  models

• Among these metrics are the Chidamber &
  Kemerer (CK) metrics

  – 80th and 20th percentiles of the distributions can be
    used to determine high and low values
  – Their thresholds cannot be determined before their
    use and should be derived and used locally
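
A minimal sketch of deriving such thresholds locally, one project at a time. The function name and the metric values are hypothetical (not taken from the study), and the percentile interpolation method is an assumption:

```python
from statistics import quantiles

def local_thresholds(values):
    """Return (low, high) cut-offs for one project's metric values:
    the 20th and 80th percentiles of that project's own distribution."""
    # n=5 yields four cut points: the 20th, 40th, 60th and 80th percentiles.
    cuts = quantiles(values, n=5, method="inclusive")
    return cuts[0], cuts[3]

# Hypothetical CBO measurements for two projects (illustration only).
project_a = [1, 2, 2, 3, 4, 5, 6, 8, 11, 15, 21]
project_b = [10, 12, 14, 15, 18, 20, 25, 30, 34, 40, 55]

for name, data in (("A", project_a), ("B", project_b)):
    low, high = local_thresholds(data)
    print(f"project {name}: low <= {low}, high >= {high}")
```

With these made-up values the two projects get different cut-offs, which is why a threshold derived on one project should not be reused on another.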

               Problem Analysis
Can an LR model built with this kind of
metric work effectively with different
software projects?
[Figure: P(y=1) plotted against X = Number of Methods (axis marks at 10 and 20),
 comparing a small-size SW project with a large-size SW project;
 regions labeled LEAST FAULTY and MOST FAULTY]
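
For reference, the standard univariate LR form behind this question can be written as below. The symbols \beta_0, \beta_1 and x^{*} are notation introduced here, not taken from the slides, and the 0.5 cut-off is an assumption:

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}},
\qquad
P(y = 1 \mid x^{*}) = 0.5 \iff x^{*} = -\frac{\beta_0}{\beta_1}
```

Once \beta_0 and \beta_1 are fitted on one project, the cut-off point x^{*} is fixed. If the most faulty classes of another project start at a different number of methods, the same x^{*} can misclassify them.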
                  Case Study

1. Data analysis of 7 different projects and
   application of simple log data transformations.
2. Construction of 3 univariate LR models using a
   large open source project (1st release of the
   Mylyn system, with 638 Java classes).
  – Independent variables: CK-CBO, CK-RFC, CK-WMC
  – Dependent variable: Defects (from Bugzilla & CVS)
3. Testing of these models with 2 other, smaller projects
   (with 11 and 13 Java classes).
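
Steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' tooling: the model is fitted with plain gradient descent, the 0.5 cut-off is assumed, and all data values and function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_univariate_lr(xs, ys, learning_rate=0.05, epochs=5000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by gradient descent on the log-loss.
    x is centered on its mean during fitting to speed up convergence."""
    n = len(xs)
    mean_x = sum(xs) / n
    a, b1 = 0.0, 0.0  # centered form: a + b1 * (x - mean_x)
    for _ in range(epochs):
        grad_a = grad_b1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(a + b1 * (x - mean_x)) - y
            grad_a += error
            grad_b1 += error * (x - mean_x)
        a -= learning_rate * grad_a / n
        b1 -= learning_rate * grad_b1 / n
    return a - b1 * mean_x, b1  # (b0, b1) on the original scale

def predict_fault_prone(model, x, cutoff=0.5):
    """Return 1 (most faulty) if the predicted probability reaches the cut-off."""
    b0, b1 = model
    return 1 if sigmoid(b0 + b1 * x) >= cutoff else 0

def count_correct(model, xs, ys):
    """Number of classes whose predicted label matches the observed one."""
    return sum(predict_fault_prone(model, x) == y for x, y in zip(xs, ys))

# Hypothetical data: one metric value per class, label 1 = most faulty.
train_x = [1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 18, 20]   # "large" project
train_y = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
test_x = [2, 3, 4, 12, 16]                            # "small" project
test_y = [0, 0, 1, 1, 1]

model = fit_univariate_lr(train_x, train_y)
print("b0, b1:", model)
print("correct on small project:", count_correct(model, test_x, test_y), "of", len(test_x))
```

One model per metric (CBO, RFC, WMC) would be fitted this way on the large project and then applied unchanged to each smaller project.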
  BNS: Banking system (2006) *
  CRS: Cruise control system (2005) *
  ECS: E-commerce system (2006) *
  ELCS: Elevator control system (2003) *
  FACS: Factory automation system (2005) *
  GMF: Graphical Modeling Framework **
  MYL: Mylyn system **

Challenge (annotation on the figure): biased regression estimates
and reduced predictive power of regression models

(**) Eclipse project
(*) Systems developed by students of JAIST, described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison-Wesley Object Technology Series, July 2000.
[Figure annotation] RFC data of BNS is more spread than the data of MYL.
                      Case Study
Solution: simple data transformation using “Log10”
  – Fewer outliers
  – More uniform data spread

Example:

    LCBO  = Log10(CBO + 1)
    LTCBO = Log10(CBO + 1) + dm

    where dm is the difference between the CBO medians of the
    Mylyn system and the system whose data is being
    transformed
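
A short sketch of the two transformations. The slide does not say whether the medians are taken before or after the log step; this sketch assumes they are taken on the log-transformed values, so that the shifted data shares its median with the reference project. Function names and CBO values are hypothetical.

```python
import math
from statistics import median

def log_transform(values):
    """LCBO = log10(CBO + 1); the +1 keeps a metric value of 0 defined."""
    return [math.log10(v + 1) for v in values]

def log_transform_shifted(values, reference_values):
    """LTCBO = log10(CBO + 1) + dm.
    Assumption: dm = median(log reference) - median(log target)."""
    logged = log_transform(values)
    dm = median(log_transform(reference_values)) - median(logged)
    return [v + dm for v in logged]

# Hypothetical CBO values (illustration only).
mylyn_cbo = [0, 2, 3, 5, 9, 14, 30]   # reference project used to build the model
small_cbo = [1, 1, 2, 4, 6]           # project whose data is being transformed

ltcbo = log_transform_shifted(small_cbo, mylyn_cbo)
print("median of reference (log):", median(log_transform(mylyn_cbo)))
print("median of transformed    :", median(ltcbo))
```

The same pattern would apply to RFC and WMC by swapping in the corresponding metric values.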
                    Results
Effects of the log data transformations:
• Elimination of a large number of outliers
• Better overall goodness of fit for all 3 models
• Discrimination (Most Faulty/Least Faulty)
  – All models discriminate well between most Faulty
    and Least Faulty classes of the Mylyn System
  – What about using different projects?

                             Results
                           BANKING SYSTEM
MF: Most Faulty
LF: Least Faulty
       Group          Model   Correct          Correct          Effect
                              Classification   Classification
                              (RAW DATA)       (LOG Tx DATA)
       MF             CBO     2                5                ↑
       (6 classes)    RFC     5                5                =
                      WMC     6                6                =
       LF             CBO     5                5                =
       (5 classes)    RFC     3                3                =
                      WMC     4                4                =
       BOTH           CBO     7                10               ↑
       (11 classes)   RFC     8                8                =
                      WMC     10               10               =
↑: more classes correctly classified with log-transformed data; =: no change
                              Results
                            E-COMMERCE SYSTEM
MF: Most Faulty
LF: Least Faulty
        Group          Model   Correct          Correct          Effect
                               Classification   Classification
                               (RAW DATA)       (LOG Tx DATA)
        MF             CBO     3                7                ↑
        (9 classes)    RFC     9                8                ↓
                       WMC     7                6                ↓
        LF             CBO     4                4                =
        (4 classes)    RFC     0                3                ↑
                       WMC     0                4                ↑
        BOTH           CBO     7                11               ↑
        (13 classes)   RFC     9                11               ↑
                       WMC     7                10               ↑
↑: more classes correctly classified with log-transformed data; ↓: fewer; =: no change
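
To compare the two systems on the same scale, the BOTH rows of the Banking and E-commerce results can be turned into correct-classification rates. The counts below are copied from those rows; the variable and function names are illustrative only.

```python
# (raw, log-transformed) counts of correctly classified classes,
# taken from the BOTH rows of the two results tables.
both_rows = {
    "Banking system": {"classes": 11, "CBO": (7, 10), "RFC": (8, 8), "WMC": (10, 10)},
    "E-commerce system": {"classes": 13, "CBO": (7, 11), "RFC": (9, 11), "WMC": (7, 10)},
}

def classification_rates(rows):
    """Map (system, model) to (raw rate, log-transformed rate)."""
    rates = {}
    for system, row in rows.items():
        total = row["classes"]
        for model in ("CBO", "RFC", "WMC"):
            raw, logged = row[model]
            rates[(system, model)] = (raw / total, logged / total)
    return rates

rates = classification_rates(both_rows)
for (system, model), (raw_rate, log_rate) in rates.items():
    print(f"{system:18s} {model}: raw {raw_rate:.0%} -> log {log_rate:.0%}")
```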

    Conclusions and Future Work
• CK-CBO, CK-RFC and CK-WMC can have
  different distributions in different projects
• Simple log transformations seem to improve
  the prediction ability of LR models, especially
  when the project measures are not as spread
  as those used in the construction of the
  model.
• Future work: further data exploration and study of data
  transformations

           Thank you!
     questions, comments …

contact: erika.camargo@jaist.ac.jp


