(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 2, February 2011
AN IMPROVED MULTIPERCEPTRON NEURAL NETWORK MODEL TO CLASSIFY SOFTWARE DEFECTS

M.V.P. Chandra Sekhara Rao and Aparna Chaparala
Department of CSE, R.V.R. & J.C. College of Engineering, Guntur, India

Dr. B. Raveendra Babu
Director (Operations), Delta Technologies (P) Ltd., Hyderabad, India

Dr. A. Damodaram
Department of CSE, JNTU College of Engineering, Kukatpally, Hyderabad, India




Abstract: Predicting software defects in modules not only helps in maintaining legacy systems but also supports the software development process and ensures higher reliability. Further advantages include better planning of project resources and minimization of budget. Research in this area has been carried out using statistical methodologies and machine learning techniques that are generic in nature. The dependence on legacy software systems to meet current, demanding requirements is a major challenge for any IT administrator, and estimating the cost of maintaining them is equally difficult. In this paper, it is proposed to modify the existing multilayer perceptron neural network, a popular supervised classification algorithm, to predict defects in a given module based on the available software metrics.

Keywords— Legacy software, Software metrics, Software reliability, Classification, Multilayer Perceptron Neural Network, Fault-proneness.

I. INTRODUCTION

Software reliability and software quality assurance are two major areas of software engineering that ensure high-quality software. Both concepts are applied throughout the development and maintenance process. The notable major activities used are performance analysis, functional tests, and quantification of time and budget, along with the measurement of metrics [1]. In addition, code reviews, key personnel assignment and automatic test-case generation are other strategies applied to reach high reliability [2].

Software quality can be viewed from different perspectives, including time, budget and mean time to failure. Alpha and beta testing help to improve the quality of software, but they do not ensure zero defects and are a very expensive proposition if not planned properly.

Software quality modeling therefore becomes an important means of ensuring that the software not only meets the desired quality but is also delivered within time and budget. Defect prediction based on quantifiable metrics, though not without controversy, has been used successfully to predict defects in modules. Defect prediction models have independent variables captured in the form of product and process metrics and one dependent variable that indicates whether a module is likely to contain a fault. Typically, researchers have used product metrics extensively to predict faults in modules. The independent variables used for defect prediction can be parameters captured in previous projects, available in the configuration management system, or computed from the current project.

Predicting module defects also finds application in legacy systems, where it may not be possible to retire and replace the applications; defect prediction provides a cost-effective process to enhance them.

The previous work carried out by the authors [3] investigated the KC1 dataset for defect classification using decision tree induction and Bayesian networks. Various pre-processing techniques were also investigated [4]. The results obtained are tabulated in Tables I and II.
TABLE I. CLASSIFICATION ACCURACY ON KC1 DATASET

    KC1 dataset                      Correctly        Incorrectly      Mean absolute
                                     classified (%)   classified (%)   error
    Random tree                      81.86            18.14            0.1924
    CART                             84.91            15.09            0.2095
    Bayesian logistic regression     86.03            13.97            0.1397

TABLE II. CLASSIFICATION ACCURACY AFTER PREPROCESSING ON THE KC1 DATASET

    KC1 dataset              Correctly        Incorrectly
                             classified (%)   classified (%)
    Random tree              94.5531          5.4469
    Logistic regression      95.6704          4.3296
    CART                     96.7877          3.2123

In this paper, the efficacy of neural networks for defect prediction is verified using both the existing model and our proposed model.

This paper is organized into the following sections. Section II describes software metrics, Section III describes data mining techniques for classification, Section IV gives an introduction to the neural networks used, Section V describes the dataset used in the work, and Section VI presents the improved neural network technique and the output obtained. The last section analyses the results and concludes the paper.

II. SOFTWARE METRICS

Software metrics are collected at various phases of the software development process. These metrics contain information about the software and can be used to predict software quality in the early stages of the software life cycle.

Software reliability engineering is one of the most important aspects of software quality. Recent studies show that software metrics can be used in software module fault-proneness prediction. A software module has a series of metrics, some of which are related to fault-proneness. Multiple research works on software quality prediction using the relationship between software metrics and a software module's fault-proneness have been carried out over the last decades, and several techniques have been proposed to classify modules in order to identify the fault-prone ones.

III. DATA MINING TECHNIQUES

Data mining (DM) aims to establish something new from the facts recorded in databases. Originally, data mining was a statistician's term for overusing data to draw illegitimate inferences. DM is the use of powerful tools to sift out important or significant traits, previously unknown, from databases or data warehouses.

Software is prone to errors and bugs. The process of software testing is to assess the quality of computer software and verify whether the software complies with the software specification and customer needs. There are two ways to find errors in software testing: manual and automated. Manual debugging is labour intensive and costly, while automated debugging can classify and locate software defects automatically. Data mining based software debugging is becoming more and more accepted, and it can significantly reduce the labour cost of software debugging.

Data mining extracts useful information and knowledge from huge amounts of data. DM methods can be applied to the data generated in every stage of the software life cycle, such as design, development, testing, deployment and maintenance, and can extract potential errors in the software.

IV. NEURAL NETWORKS

Neural networks consist of multiple layers of computational units, usually interconnected in a feed-forward way. Each neuron in one layer has directed connections to the neurons of the subsequent layer. In many applications the units of these networks apply a sigmoid function as an activation function.

The feed-forward neural network was the first and arguably simplest type of artificial neural network devised. Since the majority of faults are found in a small number of modules, there is a need to identify the modules that are affected more severely than others so that proper maintenance can be carried out on time, especially for critical applications (Ebru Ardil et al., 2009).

Algorithms based on neural networks have many applications in knowledge engineering. In data mining, the following neural network architectures are used:
    •   Multilayered feed-forward neural networks
    •   Kohonen's self-organizing maps

A) Multilayered feed-forward neural networks

Multilayered feed-forward neural networks (ANNs) are non-parametric regression methods which approximate the underlying functionality in the data by minimizing a loss function. The common loss function used for training an ANN is the quadratic error function. ANNs are used for supervised learning: the database forms the training set and, during training, specified items of the data records are presented as inputs to the neural network while its weights are changed in such a way that its output approximates the values in the data set. After the learning process finishes, the learned knowledge is represented by the values of the neural network weights. For training, the back-propagation-of-error algorithm is often used. Figure 1 shows the structure of such a network: inputs x1, ..., xi, ..., xn feed the hidden layers through weights wij, each hidden unit produces an output Oj, and the output layer produces the outputs Ok through weights Wjk.

Fig. 1. Multilayer neural network (input layer, hidden layers, output layer).
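To make this training procedure concrete, the following minimal sketch fits exactly this kind of feed-forward network by back-propagation using scikit-learn's multilayer perceptron. The synthetic data stands in for a table of module metrics with a fault/no-fault label; all sizes and parameters are illustrative assumptions, not settings from this study.

    # Sketch: supervised training of a multilayer feed-forward network by
    # back-propagation with scikit-learn. The synthetic data stands in for a
    # table of software metrics and a fault/no-fault label (assumptions only).
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.random((400, 21))                      # 400 modules x 21 metrics (illustrative)
    y = (X[:, 0] + X[:, 3] > 1.0).astype(int)      # stand-in fault label

    mlp = MLPClassifier(hidden_layer_sizes=(10,),  # one hidden layer of 10 units
                        activation="logistic",     # sigmoid hidden units
                        solver="sgd",              # gradient descent (back-propagation)
                        learning_rate_init=0.1,
                        max_iter=2000,
                        random_state=1)
    mlp.fit(X, y)                                  # weights adjusted to fit the training set
    print("training accuracy:", mlp.score(X, y))

After fitting, the learned knowledge is held entirely in the weight matrices (mlp.coefs_ and mlp.intercepts_), which mirrors the description above.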
B) Kohonen's self-organizing maps

Kohonen's self-organizing maps (SOMs) have become a promising technique in cluster analysis. They are adapted by unsupervised learning. In data mining, cluster techniques based on Kohonen's self-organizing maps have the following advantages over standard statistical methods.

DM typically deals with high-dimensional data: a record in a database typically consists of a large number of items. Such data do not follow a regular multivariate distribution, so traditional statistical methods have their limitations and are not effective. SOMs work with high-dimensional data efficiently.

Kohonen's self-organizing maps provide a means of visualizing multivariate data, because two clusters of similar members activate output neurons that lie close to each other in the output layer. In other words, neurons that share a topological resemblance will be sensitive to inputs that are similar. No other cluster-analysis algorithm has this property.

A SOM is a dynamic system which learns the abstract structure of a high-dimensional input space using a low-dimensional space for its representation.
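The update rule behind these properties is simple: each record is assigned to its best-matching unit, and that unit and its topological neighbours are pulled towards the record. The sketch below is a generic toy implementation; the grid size, learning rate, neighbourhood radius and random data are assumptions chosen only for illustration.

    # Toy sketch of Kohonen self-organizing-map training. Grid size, learning
    # rate, neighbourhood radius and the random data are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.random((500, 21))                 # 500 records x 21 attributes

    grid_h, grid_w = 8, 8                        # low-dimensional output grid
    weights = rng.random((grid_h, grid_w, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                                  indexing="ij"), axis=-1)

    learning_rate, radius = 0.5, 2.0
    for epoch in range(20):
        for x in rng.permutation(data):
            # best-matching unit: the neuron whose weight vector is closest to x
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # pull the BMU and its topological neighbours towards the input
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
            influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            weights += learning_rate * influence[..., None] * (x - weights)
        learning_rate *= 0.95                    # decay the gain
        radius *= 0.95                           # shrink the neighbourhood

Similar records end up mapped to nearby neurons on the grid, which is what makes the map useful for visualizing high-dimensional metric data.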
V. DATA SET

Data from NASA's Metrics Data Program (MDP) repository is used. The KC1 dataset contains the LOC measures, cyclomatic complexity, and the base and derived Halstead measures of various software modules.

The attributes used in this work are described briefly below.

LOC_BLANK - The number of blank lines in a module.
LOC_CODE_AND_COMMENT - The number of lines which contain both code and comment in a module.
LOC_COMMENTS - The number of lines of comments in a module.
CYCLOMATIC_COMPLEXITY - The cyclomatic complexity of a module.
DESIGN_COMPLEXITY - The design complexity of a module.
ESSENTIAL_COMPLEXITY - The essential complexity of a module.
LOC_EXECUTABLE - The number of lines of executable code for a module (not blank or comment).
HALSTEAD_CONTENT - The Halstead length content of a module.
HALSTEAD_DIFFICULTY - The Halstead difficulty metric of a module.
HALSTEAD_EFFORT - The Halstead effort metric of a module.
HALSTEAD_ERROR_EST - The Halstead error estimate metric of a module.
HALSTEAD_LENGTH - The Halstead length metric of a module.
HALSTEAD_LEVEL - The Halstead level metric of a module.
HALSTEAD_PROG_TIME - The Halstead programming time metric of a module.
HALSTEAD_VOLUME - The Halstead volume metric of a module.
NUM_OPERANDS - The number of operands contained in a module.
NUM_OPERATORS - The number of operators contained in a module.
NUM_UNIQUE_OPERANDS - The number of unique operands contained in a module.
NUM_UNIQUE_OPERATORS - The number of unique operators contained in a module.
LOC_TOTAL - The total number of lines for a given module.
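To make the attribute list concrete, the sketch below loads these metrics with pandas and applies the 65/35 train/test split used in Section VI. The file name kc1.csv and the label column name "defects" are assumptions made only for this example; the NASA MDP repository distributes the data in its own format.

    # Sketch only: loading the KC1 module metrics and splitting the data.
    # The file name "kc1.csv" and the label column "defects" are assumptions.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    metrics = [
        "LOC_BLANK", "LOC_CODE_AND_COMMENT", "LOC_COMMENTS",
        "CYCLOMATIC_COMPLEXITY", "DESIGN_COMPLEXITY", "ESSENTIAL_COMPLEXITY",
        "LOC_EXECUTABLE", "HALSTEAD_CONTENT", "HALSTEAD_DIFFICULTY",
        "HALSTEAD_EFFORT", "HALSTEAD_ERROR_EST", "HALSTEAD_LENGTH",
        "HALSTEAD_LEVEL", "HALSTEAD_PROG_TIME", "HALSTEAD_VOLUME",
        "NUM_OPERANDS", "NUM_OPERATORS", "NUM_UNIQUE_OPERANDS",
        "NUM_UNIQUE_OPERATORS", "LOC_TOTAL",
    ]

    data = pd.read_csv("kc1.csv")
    X = data[metrics]                  # independent variables: the module metrics above
    y = data["defects"]                # dependent variable: fault / no fault (assumed name)

    # 65% of the modules for training, the remainder for testing (see Section VI)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.65, random_state=1, stratify=y)
    print(X_train.shape, X_test.shape)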
VI. PROPOSED METHODOLOGY & EXPERIMENTAL INVESTIGATION

The multilayer perceptron is an example of a supervised-learning artificial neural network that is used extensively for the solution of a number of different problems, including classification, pattern recognition and interpolation. The perceptron learning algorithm is based on the back-propagation rule. The hidden layer typically uses either a sigmoid or a tanh activation function. The algorithm for the multilayer perceptron neural network is given below.

i. Present the input and desired output.
Present the input Yp = y0, y1, y2, ..., yn-1 and the target output Cp = c0, c1, ..., cm-1, where n is the number of input nodes and m is the number of output nodes.

ii. Calculate the actual output.
Each layer calculates

    xpj = f(w0 y0 + w1 y1 + ... + wn yn)

and passes this value to the next layer as its input. The final layer outputs the values opj.

iii. Adapt the weights, starting from the output and working backwards:

    wij(t+1) = wij(t) + η δpj opj

where η is a gain term and δpj is an error term for pattern p on node j. For output units

    δpj = k opj (1 - opj)(tpj - opj)

and for hidden units

    δpj = k opj (1 - opj)(δp0 wj0 + δp1 wj1 + ... + δpk wjk)

where the sum is over the k nodes in the layer above node j.
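The three steps read almost directly as code. The sketch below is a minimal NumPy rendering of one training pass for a network with a single hidden layer; the layer sizes, the gain term η and the constant k are illustrative assumptions, not values reported in the paper.

    # Minimal sketch of one back-propagation update following steps i-iii above.
    # Layer sizes, the gain term eta and the constant k are illustrative assumptions.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    n, hidden_units, m = 21, 10, 1              # input, hidden and output nodes
    W1 = rng.normal(scale=0.1, size=(hidden_units, n))    # weights w_ij
    W2 = rng.normal(scale=0.1, size=(m, hidden_units))    # weights w_jk
    eta, k = 0.1, 1.0                           # gain term and scaling constant

    # i. Present the input y_p and the desired output c_p (one illustrative pattern)
    y_p = rng.random(n)
    c_p = np.array([1.0])

    # ii. Calculate the actual output, layer by layer
    o_hidden = sigmoid(W1 @ y_p)                # hidden activations o_pj
    o_out = sigmoid(W2 @ o_hidden)              # final outputs o_pj

    # iii. Adapt the weights, starting from the output and working backwards
    delta_out = k * o_out * (1 - o_out) * (c_p - o_out)                 # output units
    delta_hidden = k * o_hidden * (1 - o_hidden) * (W2.T @ delta_out)   # hidden units
    W2 += eta * np.outer(delta_out, o_hidden)   # w_jk(t+1) = w_jk(t) + eta*delta_pk*o_pj
    W1 += eta * np.outer(delta_hidden, y_p)     # w_ij(t+1) = w_ij(t) + eta*delta_pj*y_pi

Iterating this update over all training patterns until the quadratic error stops decreasing gives the standard multilayer perceptron baseline against which the proposed model is compared below.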
In this paper, a fuzzy bell hidden layer is proposed: the hidden units use a bell-shaped curve as their fuzzy membership function, where y is the input and w is the weight. The L2 criterion is used to compute the cost function: the error computed for the supervised learning procedure is the squared Euclidean distance between the network's output and the desired response.
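A common bell-shaped choice is the generalized bell membership function, and the sketch below uses it as the hidden-layer activation purely to illustrate the idea; the specific parameterization (centre c, width a, slope b) and all sizes are assumptions of this example and not necessarily the exact function of the proposed model.

    # Illustrative sketch only: a generalized bell membership function used as the
    # hidden-layer activation in place of the sigmoid. The parameterization
    # (centre c, width a, slope b) is an assumption of this example.
    import numpy as np

    def bell(z, a=1.0, b=2.0, c=0.0):
        # generalized bell curve: 1 / (1 + |(z - c) / a|**(2*b))
        return 1.0 / (1.0 + np.abs((z - c) / a) ** (2 * b))

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    y = rng.random(21)                        # one module's metrics (illustrative input)
    W1 = rng.normal(scale=0.1, size=(10, 21)) # input -> hidden weights w
    W2 = rng.normal(scale=0.1, size=(1, 10))  # hidden -> output weights

    hidden = bell(W1 @ y)                     # bell-shaped fuzzy membership of the weighted input
    output = sigmoid(W2 @ hidden)             # network response
    desired = np.array([1.0])
    l2_error = np.sum((output - desired) ** 2)  # squared Euclidean distance (L2 criterion)
    print(output, l2_error)

Training then proceeds as in the back-propagation sketch above, with the sigmoid derivative of the hidden layer replaced by the derivative of the chosen bell curve.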
65 percent of the data was used as the training set and the remaining data as the test set. The classification accuracy obtained on the KC1 dataset is 98.2%.

The proposed fuzzy-based neural model classified better than Random Tree by 14.66%, CART by 11.41% and Bayesian logistic regression by 10.50%. However, the proposed method needs to be evaluated on other datasets to better test the consistency of its performance.

The results obtained by our proposed methodology improve on the regular multilayer perceptron model with a sigmoidal hidden function by 3.92%. Figure 2 displays the accuracy obtained by the various classification methods.

Fig. 2. Classification accuracy (%) on the KC1 data set for Random Tree, CART, Logistic Regression, Random Tree with preprocessing, Logistic Regression with preprocessing, CART with preprocessing, MLP NN and the proposed NN model.

CONCLUSION

In this paper, it has been observed that the proposed bell-fuzzy-based neural network model performs better than the existing neural network model and the other classification algorithms. Thus it can be said with confidence that the bell fuzzy function used in the multi-perceptron neural network improves the classification accuracy of software defect prediction.
REFERENCES

[1]  C. Yilmaz, C. Catal, O. Kalipsiz and A. Porter, "Distributed Quality Assurance", Proc. 2nd Turkish National Symposium on Software Engineering, Ankara, Turkey, 2005, pp. 189-198.

[2]  T. M. Khoshgoftaar and N. Seliya, "An Empirical Study of Predicting Software Faults with Case-Based Reasoning", Software Quality Journal, 14, 2006, pp. 85-111.

[3]  M.V.P. Chandra Sekhara Rao, B. Raveendra Babu, A. Damodaram and B. Madhusudhanan, "Business Intelligence Model Using Data Mining Techniques for Code Optimization in Legacy Systems".

[4]  M.V.P. Chandra Sekhara Rao, B. Raveendra Babu, A. Damodaram and Ch. Aparna, "Severity Based Code Optimization: A Data Mining Approach", International Journal of Computer Science and Engineering (IJCSE), Vol. 02, No. 05, 2010, pp. 1754-1757.

[5]  N. Nagappan, T. Ball and B. Murphy, "Using Historical Data and Product Metrics for Early Estimation of Software Failures", Proc. ISSRE 2006, Raleigh, NC, 2006.

[6]  S. Lessmann, "Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings", IEEE Transactions on Software Engineering, 34(4), 2008, pp. 485-496.

[7]  M. H. Halstead, Elements of Software Science, Elsevier, 1977.

[8]  NASA Metrics Data Repository, available: www.mdp.ivv.nasa.gov

[9]  J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Harcourt India Private Limited, 2001.

[10] H. Lu, R. Setiono and H. Liu, "Effective Data Mining Using Neural Networks", IEEE Transactions on Knowledge and Data Engineering, 8(6), 1996, pp. 957-961.

[11] B. M. Wilamowski, "Neural Network Architectures and Learning Algorithms", IEEE Industrial Electronics Magazine, Vol. 3, Issue 4, 2009, pp. 56-63.

ABOUT THE AUTHORS

1. M.V.P. Chandra Sekhara Rao is an Associate Professor in the Department of Computer Science and Engineering, R.V.R. & J.C. College of Engineering, Chowdavaram, Guntur. He has 15 years of teaching experience. He completed his B.E. and M.Tech in Computer Science & Engineering. He is doing his research in the area of data mining and is presently pursuing a Ph.D. from J.N.T.U., Hyderabad. He has published 3 papers in international journals and presented one paper at an international conference.

2. Aparna Chaparala is an Associate Professor in the Department of Computer Science and Engineering, R.V.R. & J.C. College of Engineering, Chowdavaram, Guntur. She has 9 years of teaching experience. She completed her M.Tech in Computer Science & Engineering. She is doing her research in the area of data mining and is presently pursuing a Ph.D. from J.N.T.U., Hyderabad. She has published 3 papers in international journals.

3. Dr. B. Raveendra Babu obtained his Masters in Computer Science and Engineering from Anna University, Chennai, and received his Ph.D. in Applied Mathematics from S.V. University, Tirupati. He is currently leading a team as Director (Operations), M/s. Delta Technologies (P) Ltd., Madhapur, Hyderabad. He has 26 years of teaching experience and more than 25 international and national publications to his credit. His research areas of interest include VLDB, image processing, pattern analysis and wavelets.

4. Dr. A. Damodaram received his B.Tech (CSE) and M.Tech (CSE) from JNTU, Hyderabad, and earned his Ph.D. in the area of image processing from JNTU, Hyderabad. He has been serving JNTU since 1989. He is a Professor in the Department of CSE and has worked as Director and Vice-Principal of the JNTU-UGC Academic Staff College. He is presently working as Director for Distance Education Learning, JNTU, Hyderabad. He has published more than 30 research papers in various national and international conferences, proceedings and journals.