60                                                                                       IEEE TRANSACTIONS ON RELIABILITY, VOL. 49, NO. 1, MARCH 2000

Using Simulation for Assessing the Real Impact of Test-Coverage on Defect-Coverage

Lionel C. Briand and Dietmar Pfahl

   Abstract—The use of test-coverage measures (e.g., block-coverage) to control the software test process has become an increasingly common practice. This is justified by the assumption that higher test-coverage helps achieve higher defect-coverage and therefore improves software quality. In practice, data often show that defect-coverage and test-coverage grow over time, as additional testing is performed. However, it is unclear whether this phenomenon of concurrent growth can be attributed to a causal dependency, or if it is coincidental, simply due to the cumulative nature of both measures. Answering such a question is important as it determines whether a given test-coverage measure should be monitored for quality control and used to drive testing.
   Although no general answer to this problem is given, a procedure is proposed to investigate whether any test-coverage criterion has a genuine additional impact on defect-coverage when compared to the impact of just running additional test cases. This procedure applies in typical testing conditions where
    • the software is tested once, according to a given strategy,
    • coverage measures are collected as well as defect data.
This procedure is tested on published data, and the results are compared with the original findings. The study outcomes do not support the assumption of a causal dependency between test-coverage and defect-coverage, a result for which several plausible explanations are provided.

   Index Terms—Defect-coverage, Monte Carlo simulation, Software test, Test-coverage, Test intensity.

                           I. INTRODUCTION

Abbreviations and Acronyms

AT           acceptance test
IT           integration test
UT           unit test
K–S          Kolmogorov–Smirnov
A–D          Anderson–Darling
s-           statistical

Notation

d            defect-coverage
tc           test-coverage
ti           test intensity
L            number of levels
M            number of measures
N            number of test objects
DC           set of d measurements
TC           set of tc measurements
F_l(x)       Cdf of coverage measurements at a certain level l, with x = d or x = tc
N(μ, σ)      Gaussian (s-normal) Cdf with mean, μ, and standard deviation, σ
H0           null hypothesis
p-value      Pr{wrongly rejecting H0}
α            s-significance level
R            number of repetitions in a Monte Carlo simulation
R²           coefficient of determination, [(multivariate) regression coefficient]
R²_sim       R² under H0 (generated through simulation)
R²_emp       R² derived from empirical data set
eps          error term in (multivariate) regression equation

Definitions of Test-Coverage Measures

   Test-coverage is measured as the fraction of constructs—as defined by the coverage criterion—that have been executed at least once during testing. In the literature, a variety of test-coverage criteria have been suggested, e.g., [13]:
    • Block coverage: A control flow based criterion that measures the portion of basic blocks executed during testing. Basic blocks are maximal code fragments without branching. Thus, a basic block can only be executed entirely from beginning to end, as it contains no internal flow of control change.
    • Decision coverage: A control flow based criterion that measures the portion of decisions executed during testing. A decision is a pair of basic blocks (n1, n2) such that n2 is a successor of n1. Decision coverage subsumes block coverage, i.e., 100% decision coverage implies 100% block coverage.
    • C-use coverage: A data flow based criterion that measures the portion of c-uses (computational uses) covered during testing. A c-use is:
        • a variable x, and
        • the set of all paths in the data flow graph from node n1 to node n2 such that:
             i) x is defined in n1;
            ii) x is not defined in any other node on the paths from n1 to n2;
           iii) x is used in a computational expression of node n2, e.g., as a procedure argument, as an initializer in a declaration, as a return value of a function call, as the second operand of the assignment operator (=).

   Manuscript received August 31, 1999; revised October 1, 1999.
   L. C. Briand is with the Systems and Computer Engineering Department, Carleton University, 1125 Colonel By Drive, Ottawa K1S 5B6 Canada.
   D. Pfahl is with the Fraunhofer Institute for Experimental Software Engineering (IESE), Sauerwiesen 6, D-67661 Kaiserslautern Fed. Rep. Germany.
   Responsible editor: M.A. Vouk.
   Publisher Item Identifier S 0018-9529(00)06203-5.
                                                            0018–9529/00$10.00 © 2000 IEEE

      A c-use is covered if at least one of the paths in the c-use is executed during testing.
    • P-use coverage: A data flow based criterion that measures the portion of p-uses (predicate uses) covered during testing. A p-use is:
        • a variable x, and
        • the set of all paths in the data flow graph from node n1 to node n2 such that:
             i) x is defined in n1;
            ii) x is not defined in any other node on the paths from n1 to n2, except possibly n2;
           iii) x is used in a predicate expression of node n2, e.g., as the first operand in the conditional expression of an if, for, while, do, or switch statement.
      A p-use is covered if at least one of the paths in the p-use is executed during testing.

   Testing is one of the most effort-intensive activities during software development [2]. Much research is directed toward developing new, improved test methods. One way to control testing better—and thus improve test resource allocation—is to measure estimators (referred to as test-coverage) of the fraction of defects detected during testing (referred to as defect-coverage). Many test-coverage measures have been proposed and studied. They range from
    • simple measures counting the program blocks covered, to
    • data-flow based measures looking at the definition and use of variables.
   Many of these test-coverage measures have been investigated in terms of their subsumption relationships, the ease with which complete coverage can be achieved, and the ways they can be used to drive test case generation and selection [9], [12], [16], [22], [24]. Several additional studies reporting the application of test-coverage measures to control or improve test efficiency have been published [3], [23], [29]. More importantly in the context of our research, researchers have attempted to build defect-coverage models based on test-coverage measures [1], [5]–[8], [10], [14], [15], [17], [19], [21], [26], [30]. The basic assumption, regardless of the modeling strategy, is that there is some (important) causal effect between test-coverage and defect-coverage.
   However, since both test-coverage and defect-coverage increase with test intensity or time, it is not surprising that empirical data usually show a relationship. But it does not necessarily mean that additional test-coverage drives the detection of new defects. The question investigated in this paper is how to test whether a given test-coverage measurement, or several of them combined, are actually having an important impact on defect-coverage. It is also important that any solution be usable in typical testing conditions. The main focus of this paper is to present an easily replicable procedure that is based on simulation techniques and is designed to investigate the relationship between test-coverage and defect-coverage. Data coming from [13] are used to exemplify the procedure and show how it can yield more precise results.
   Section II precisely defines the problems associated with using test-coverage measures for controlling test effectiveness.
   Section III presents a simulation-based procedure to test the impact of test-coverage on defect-coverage.
   Section IV provides the results of applying our simulation-based procedure.
   Section V discusses the work and proposes directions for future research.

                        II. PROBLEM STATEMENT

   When a relationship is observed between test-coverage and defect-coverage, it is commonly assumed to support the hypothesis that test-coverage leads to defect-coverage. However, does this assumption really capture reality? This has important practical implications as it justifies why testing should be coverage driven, or evaluated based on coverage achievement. Therefore, it is important that test-coverage measures be validated as important defect-coverage drivers.
   Another, perhaps more plausible, interpretation of any empirical relationship between a test-coverage measure and defect-coverage is that they are both driven by more testing (referred to as test intensity). This is the typical dilemma of interpreting a relationship as causal or coincidental.
   One way to approach this problem is to determine whether test-coverage has any additional impact on defect-coverage as compared to test intensity alone. In other words, this is equivalent to assessing whether test-coverage is still a s-significant indicator of defect-coverage, when the effect of test intensity has already been accounted for. One approach is to determine whether the combined effect of test intensity and test-coverage can better explain the variations in defect-coverage than test intensity alone. If this is the case, then one can conclude that evidence suggests that test-coverage has an important additional impact on defect-coverage.
   Where testing is mainly driven by test-coverage, test intensity and test-coverage cannot be differentiated. But in typical situations, this is not the case. In addition, defect-coverage and test-coverage data are usually collected at a few discrete points in time, e.g., at the end of each testing phase such as unit, integration, or system testing. This requires an analysis approach considering these practical constraints. In the data set used in this paper, the test intensity has 3 possible values. Due to the design of the original study from which the data are drawn, all systems showed identical test intensity (viz, number of test cases) at the end of each test phase, when coverage measurement was taken.
   In order to test the importance of the impact of test-coverage on defect-coverage, a procedure is defined that can be easily used in a context where defect and test-coverage data are collected at a few discrete points in the testing process. This procedure is based on Monte Carlo simulation and can easily be automated.

               III. TESTING THE IMPACT OF TEST-COVERAGE

   This section presents our procedure to test whether test-coverage has an important impact on defect-coverage. Section III-A presents the rationale and relates it to the fundamentals of statistical testing. Section III-B describes the procedure in detail.
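The comparison at the heart of the problem statement, whether test intensity and test-coverage together explain defect-coverage better than test intensity alone, can be sketched with ordinary least squares. The toy data below are invented for illustration and are not the data of [13]:

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept included)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

# Toy data: 12 programs x 3 phases; both coverages grow with test intensity.
rng = np.random.default_rng(1)
ti = np.repeat([1.0, 2.0, 3.0], 12)        # test intensity (phase level)
tc = 0.25 * ti + rng.normal(0, 0.05, 36)   # test-coverage
d = 0.20 * ti + rng.normal(0, 0.05, 36)    # defect-coverage

r2_ti = r_squared(ti.reshape(-1, 1), d)            # d explained by ti alone
r2_both = r_squared(np.column_stack([ti, tc]), d)  # d explained by ti and tc

print(round(r2_ti, 3), round(r2_both, 3))
```

Because the two models are nested, `r2_both` can never be smaller than `r2_ti`; the simulation procedure of Section III is what turns the size of the difference into a statistical test.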
Fig. 1. Relationship between Defect-Coverage and Test-Coverage
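The grouping shown in Fig. 1, (tc, d) pairs collected per test phase, can be sketched as follows; all values are invented for illustration:

```python
import statistics

# One (test-coverage, defect-coverage) pair per program, grouped by the
# test phase at whose completion the measurements were taken (invented).
sample = {
    "UT": [(0.55, 0.30), (0.60, 0.40), (0.48, 0.25)],
    "IT": [(0.72, 0.55), (0.78, 0.60), (0.70, 0.50)],
    "AT": [(0.85, 0.70), (0.90, 0.80), (0.88, 0.75)],
}

# Per-phase sample means, the starting point for estimating the unknown
# population distributions of tc and d at each test-intensity level.
for phase, pairs in sample.items():
    tc_mean = statistics.mean(tc for tc, _ in pairs)
    d_mean = statistics.mean(d for _, d in pairs)
    print(f"{phase}: mean tc = {tc_mean:.3f}, mean d = {d_mean:.3f}")
```

Each group corresponds to one rectangle in Fig. 1.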

This procedure has been designed to work in a typical context where testing is not coverage driven and where test and defect-coverage data are collected at a few points in time, e.g., at the end of testing activities such as unit, integration, or system testing.

A. Rationale

   Using statistical testing terminology, the goal is to test the null hypothesis H0:
    • test-coverage has no additional impact on defect-coverage as compared to test intensity alone.
In order to test this hypothesis, we need to estimate what would be the strength of the relationship between test-coverage and defect-coverage, assuming H0 is true. If the strength of this relationship is measured in terms of goodness-of-fit or correlation (e.g., R²), the s-expected R² distribution is estimated under H0. Such a distribution can then be used for statistical testing by comparing R² in the observed sample to the distribution, and assessing the probability of obtaining an equivalent or higher R². If this probability is small (say, below 0.05 or 0.01), we can assuredly reject H0 and assume that the impact of test-coverage on defect-coverage is plausible. Otherwise, H0 cannot be rejected and there is no supporting evidence that test-coverage has any impact on defect-coverage that is not already explained by test intensity.
   The main issue is to devise a method to compute the anticipated R² distribution under H0. We typically have to work with a sample of projects for which we have defect-coverage and test-coverage data (usually for several test-coverage measures), corresponding to certain test intensity values (e.g., number of test cases, test effort), and collected at the end of various testing activities or phases. At an intuitive level, what the procedure (in Section III-B) does is to use the sample test-coverage and defect-coverage distributions to estimate the respective population distributions. Then, it uses these estimated population distributions to generate the anticipated R² distribution by using only the test intensity information in the sample.
   To allow for an analysis across systems of varying functional size, test intensity should be normalized using any suitable measure of functional size, e.g., function points. The validity of specific test intensity and coverage measures is context-dependent and is not discussed here, because it does not affect the procedure in this paper. However, defining a meaningful test intensity measure requires carefully assessing its underlying assumptions. For example, if larger systems happen to be more difficult to test (diseconomies of scale), a measure of test effort per function point would not be a valid measure of test intensity. In other words, equal test intensity values might not be comparable across systems of different sizes.
   To illustrate the principles, assume that there is only 1 tc (e.g., block coverage) and d. For example, although the relationship between tc and d does not have to be linear,¹ the sample data could look like the scatterplot in Fig. 1.
   Typical test data sets are composed of data vectors containing
    • defect-coverage and test-coverage data,
    • test effort,
    • number of test cases run.
Across systems, such data vectors can be grouped according to test intensity levels (e.g., groups are depicted by rectangles in Fig. 1). Several strategies can be adopted for data collection; coverage and test intensity data can be collected
    a) at the completion of test phases (e.g., as in Fig. 1), or
    b) on a regular basis, e.g., daily, weekly.
Strategy a) applies in a context of organizations with a well-defined test process and strategy, which are presumed to yield similar test intensities across systems for each of the test phases. For

   ¹If the relationship is not linear, e.g., exponential, it can be linearized to facilitate the analysis.

strategy b), daily measurement is probably necessary for organizations that develop small and medium software programs. For organizations that develop large software systems, over long periods of time, weekly measurement might be sufficient. In any case, it is important to collect coverage and test intensity data at a sufficient granularity level to allow for the grouping, across systems, of data vectors showing similar test intensities. The number of groups does not matter much, as long as the number of observations within groups is large enough for the estimation of the population distributions.
   The example of Fig. 1 shows coverage data collected from measuring 12 programs (or any other type of object under test), and groups them according to 3 test phases:
    • unit test,
    • integration test,
    • system test.
Groups are depicted by rectangles. For the sake of simplicity, only one test-coverage measure is shown here. Corresponding to each test phase, there are unknown population distributions for tc and d. These distributions can be estimated using the sample distributions on these dimensions. The idea is to generate, through simulation, many scatterplots (e.g., more than 1000) such as the one in Fig. 1, by using the estimated population distributions on tc and d and sampling from them independently. Each simulated sample would have exactly the same number of observations corresponding to each group as the original sample. Following that simulation procedure, the test intensity effect is preserved since we sample from the corresponding tc and d distributions for each test phase. However, since we ignore the pairing between defect-coverage and test-coverage (because independent sampling is performed for the two distributions), no coverage relationship is preserved if not already accounted for by test intensity. If test-coverage is important, we anticipate that the simulated samples would show, on average, a poorer R²-correlation than the actual sample, where both test intensity and test-coverage effects on defect-coverage should be visible. In other words, the R² distribution of the simulated scatterplots is the distribution we would anticipate under H0 and it can be used to test how likely the sample R²-value is under this condition. Such a procedure can also be used when using several test-coverage measures and multivariate regression.

B. Procedure for Statistical Testing

   This section presents a procedure to compute the R² distribution characterizing the relationship between tc and d which is anticipated under H0, that is, when it is assumed that defect-coverage is driven only by test intensity. Such a distribution can then be used for testing whether our H0 can be rejected based on available evidence, viz, the actual sample R².
   1) Hypothesis Definition: R²_emp is the (multivariate) regression coefficient R², relating d to ti and tc, as calculated from the available data set. R²_sim is the (multivariate) regression coefficient under H0, when d is related only to ti.
   2) Formal Representation of Empirical Data Set: The available data on ti, tc, and d can be formally represented to facilitate further treatment.
    • Test intensity: 1 to L levels of ti: ti_1, …, ti_L, with ti_1 < ti_2 < … < ti_L; e.g., each level can correspond to the completion of a test phase. Depending on the specific context, ti can be measured in terms of test cases, test effort, or any other measure that adequately captures the amount of testing applied to a piece of software.
    • Test-coverage: 1 to M measures, e.g., block, decisions; 1 to N test objects, e.g., modules, units, programs, sub-systems. A set TC of test-coverage measurements tc_{l,m,n}, with l = 1, …, L; m = 1, …, M; n = 1, …, N.
    • Defect-coverage: A data set DC of measurements d_{l,n}, with l = 1, …, L; n = 1, …, N. d_{l,n} is calculated as the cumulated number of defects found at a given ti_l, divided by the total number of defects.
Both tc and d measurements are usually expressed in percentages.
   3) Procedure to Construct the Test Distribution: A 4-step procedure is defined to construct the test distribution for R² under H0 (probability distribution of R²_sim):

   Step P1) Estimate Theoretical Distributions for all Coverage Measures
              For each coverage criterion (test and defect), find the best fit distributions F_l(tc) and F_l(d) to the sample data of level l = 1, …, L. Finding the best distributions and fitting their parameters can be easily automated using tools such as BestFit [4]. Specific statistical tests (e.g., chi-square, K–S, A–D [27]) are usually helpful to find the analytic distribution, e.g., s-Normal (Gaussian), Beta, Weibull, with the closest fit to the data and to determine whether any subset plausibly represents the population distribution.
   Step P2) Derive Coverage Conditional Distributions for Subsequent Levels
              Because tc and d measurements are cumulated, the dependencies between the distributions of
coverage measurements of subsequent levels are anticipated. The dependencies between

              F_l(tc) and F_{l+1}(tc), l = 1, …, L−1, and
              F_l(d) and F_{l+1}(d), l = 1, …, L−1,

              can be modeled by using the envelope-method [27]. If s-normal distributions are used, and if linear dependency can be assumed, then the envelope-method involves running a least squares regression analysis between the same measurements of different test phases. Section IV gives details on how to model dependencies between distributions by using least squares regression, and illustrates them by an example.
              To make sure that a realistic population is generated when using simulation, it is important to guarantee that dependencies within samples are explicitly modeled. The dependencies due to the cumulative nature of the data sets are to a large degree conserved by using the envelope-method. However, it is still possible that simulated data sets are outside a realistic range, e.g., they may show a decreasing coverage. This can happen because fitting based on the empirical sample can yield distributions whose domain is larger than the realistic sampling ranges. Thus, to avoid unrealistic sampling from fitted distributions, the following lower and upper bounds for tc and d have to be enforced when necessary:

              tc_{l−1,m,n} ≤ tc_{l,m,n} ≤ 100%,
              d_{l−1,n} ≤ d_{l,n} ≤ 100%.

   Step P3) Perform the Monte Carlo Simulation
              By independently sampling from the tc and d distributions modeled in steps P1 and P2, the Monte Carlo simulation can be used to generate R data sets that conserve the distribution properties of the original data sets TC and DC. For large R, the generated data sets should provide a representative picture of what samples would look like under H0. This stems from the fact that tc and d distributions are sampled independently for each level, thus ignoring any possible relationship between these measures. Latin Hypercube sampling [27] can speed up the convergence of the simulated distributions toward the theoretical population distribution from which the sample is drawn. In this context, R = 1000 data sets usually provide an adequate level of precision for the estimated population distribution.
   Step P4) Derive the R² Distribution under H0
              For each of the R data sets generated in step P3, (multivariate) regression analysis is performed between the d and tc measures and the corresponding R²_sim is calculated. The sample of R²_sim-values can then be used for statistical testing, as described in Section III-B-4.

   4) Testing H0: To test H0, steps T1–T3 must be performed:

   Step T1) Set α
              Set α to a given risk level of falsely rejecting H0. Typical levels are 5% or 1%.
   Step T2) Calculate R²_emp
              Calculate R²_emp based on TC and DC by performing (multivariate) regression, e.g., by calculating the regression line

              d = b0 + b1·ti + b2·tc + eps

              for a linear relationship. tc is the measurement of a given coverage measure across all levels and test objects: tc_{l,n}, with l = 1, …, L and n = 1, …, N.
   Step T3) Perform Statistical Test
              To compare R²_emp with the sample of R²_sim-values as calculated in step P4, compute the number of R²_sim instances that are above R²_emp. If this number represents a fraction of R larger than α, then H0 is not rejected; and we cannot conclude that tc has a s-significant impact on d.

                            IV. CASE STUDY

   Based on data generated during an experiment conducted by the University of Iowa and the Rockwell/Collins Avionics Division [13], this section illustrates how to apply the statistical testing procedure defined in Section III-B.

A. Background Information

   The purpose of the experiment was to investigate the relationship between the “coverage of program constructs during testing” and defect-coverage. For this purpose, based on 1 specification, 12 program versions were developed independently, the program sizes ranging from 900–4000 uncommented lines of code. Then, tc and d were measured in 3 subsequent test phases: UT, IT, AT. Because the programs were also exposed to field trials, a realistic approximation of the total number of defects (ranging from 5–10 defects) contained in each program could be made; thus allowing for a sensible calculation of actual d during test. An important prerequisite for the procedure defined in Section III is: in each test phase, an equal level of ti is applied to the programs. In the experiment in [13], this prerequisite was fulfilled since each program was subject to exactly the same set of test cases in each test phase. Information on tc was not used to influence testing, e.g., by driving the generation of test cases such that tc is systematically increased.
   In the experiment, tc was measured for 4 criteria [18]:
    • block coverage,
    • decision coverage,

                                                                              TABLE I
                                                                            RAW DATA [13]

Fig. 2. Result of Fitting N (;    ) to d   Data Using the Tool, BestFit

    • c-use coverage,                                                               C. Generation of Test Distribution
    • p-use coverage.                                                                 This section describes, step by step, how the “procedure to
Blocks and decisions are constructs contained in the control                        construct the test distribution (Section III-B)” can be applied to
flow of a program. Typical examples of blocks in a program                          the raw data in Table I.
are: consecutive code fragments that execute together and do                          Step P1) Estimate Theoretical Distributions for all Coverage
not contain branches. Decisions are defined by the possible                                     Measures
values of a branch predicate. C-use and p-use are data flow                                        With the help of the tool BestFit, for each and
oriented coverage measures. They represent special cases of                                         criterion, suitable analytic distributions:
definition-use pairs associated with program variables:
    • first use of variable in a calculation,
    • first use of variable in a predicate after latest modification
      (or definition).                                                                          were derived by fitting 21 possible distributions
Each coverage criterion was measured for each program version                                   against the sample data. As an example, Fig. 2
at the end of each test phase using the ATAC (Automatic Test                                    shows the result of fitting the -normal distribution
Analysis for C) tool [18].                                                                      to the    data from phase UT. The distributions of
                                                                                                type            and             are derived in step P2,
                                                                                                because they are conditional on the distributions of
B. Description of Available Data Set                                                            coverage values at levels UT and IT, respectively.
                                                                                                   Tables II and III, show in all cases, and for each
   Table I shows the raw data for        and the four measures                                  of the 3 test statistics (chi-square, K–S, A–D), the
taken from 12 test objects at 3 levels of (UT, IT, AT). The mea-                                 -normal distribution is among the subgroup of
surements are expressed in terms of cumulated numbers over                                      plausible theoretical distributions (nonplausible
test phases, and presented in fractions.                                                        distributions having insufficient goodness-of-fit
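Step P1 can be sketched with standard tools. The study used BestFit; a rough modern equivalent fits an s-normal distribution by maximum likelihood and checks goodness of fit with the K–S statistic using scipy. The twelve UT coverage values below are made up for illustration and are not the values from Table I.

```python
# Sketch of Step P1: fit N(mu, sigma) to coverage data and run a K-S
# goodness-of-fit test. Illustrative data only, not Table I values.
import numpy as np
from scipy import stats

coverage_ut = np.array([0.55, 0.61, 0.58, 0.70, 0.64, 0.66,
                        0.59, 0.72, 0.63, 0.57, 0.68, 0.62])

# Maximum-likelihood fit of a normal distribution N(mu, sigma).
mu, sigma = stats.norm.fit(coverage_ut)

# Kolmogorov-Smirnov test against the fitted normal; a large p-value
# means the normal cannot be rejected as a plausible model.
ks_stat, ks_p = stats.kstest(coverage_ut, 'norm', args=(mu, sigma))

print(f"fitted N(mu={mu:.3f}, sigma={sigma:.3f}), KS p-value={ks_p:.3f}")
```

A full replication would repeat this fit for each of the 5 coverage measures and also compute the chi-square and Anderson–Darling statistics, as in Tables II and III.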

                         TABLE II

                         TABLE III

               In other words, there is a high probability that
            the empirical data could have been produced by
            the fitted s-normal distributions. Table III provides
            detailed information on the related goodness-of-fit
            values. Based on these results (the goodness-of-fit
            values for the remaining coverage measures are
            comparable), and to facilitate subsequent analysis
            steps (application of the envelope-method), we
            decided to use the s-normal distribution across
            the board, for all phases and criteria. Table IV
            shows the fitted s-normal distributions for all test
            phases, including UT. The main reason for this is
            to make sure that assuming an s-normal theoret-
            ical distribution makes sense for all test phases,
            although the fitted distributions for IT and AT
            are not actually used in the simulation procedure.
            If goodness-of-fit across theoretical distributions
            does not appear consistent among test statistics, it
            is easy to check the sensitivity of the overall test
            results to the selected theoretical distributions by
            performing steps P2–P4, using different distribu-
            tions and comparing the results. This is easily done
            since the whole procedure can be automated.
   Step P2) Derive Coverage Conditional Distributions for Sub-
            sequent Levels (Test Phases)
               The cumulative nature of the coverage measure-
            ments contained in the sample creates dependencies
            between distributions of a particular coverage mea-
            sure across subsequent test phases. The main depen-
            dency is caused by the monotonicity of cumulative
            data, e.g., for a particular program, block coverage
            at the end of phase IT cannot be smaller than at the
            end of phase UT. Fig. 3 illustrates:
               • the monotonicity of a coverage measure by the
                 fact that the related distributions shift from left
                 to right;
               • that, due to overlapping distributions of subse-
                 quent phases, independent random sampling
                 can violate the monotonicity condition.
            To ensure that this violation does not happen and
            that the Monte Carlo sampling presented in Step
            P3 is realistic, dependencies between the distribu-
            tions of subsequent test phases have to be modeled
            explicitly through conditional distributions, e.g., the
            IT block coverage distribution as a function of a spe-
            cific UT block coverage value.

                         TABLE IV
                      FITTED N(μ, σ)

Fig. 3. Fitted Distributions of Block Coverage Measurements for Subsequent
Test Phases

                         TABLE V
  REGRESSION SUMMARY BETWEEN BLOCK COVERAGES (UT) AND (IT)

               The regression results in Table V provide the
            equation of the least squares line and the standard
            deviation of the vertical distances of each point from
            the least-squares line. Least squares regression as-
            sumes that the error of the data about the least
            squares line is s-normally distributed. Thus, if
            y = a + b·x is the equation of the least squares line,
            the conditional distribution is modeled as
            N(a + b·x, σ_e), where σ_e is the residual standard
            deviation.
               Fig. 4 is an example where, for any value sam-
            pled from the fitted distribution of block coverage
            measurements in phase UT (block(UT)), the related
            conditional distribution of block coverage for IT is
            calculated as N(a + b·block(UT), σ_e).
               Table VI shows the full set of best-fit (UT) and
            conditional distributions. As anticipated, condi-
            tional distributions show a reduced variance. In
            order to guarantee monotonicity, constraints have
            been added to the entries of Table VI. If simulation
            results happen to violate these constraints, then
            they are modified according to the specified rules.
            In addition, the table entries specify the maximum
            boundary (no coverage measurement can be greater
            than 100%) and nonnegativity (no coverage mea-
            surement can be less than 0%) conditions. One
            special case should be noted: since 11 out of 12
            measurements of defect-coverage in phase AT were
            equal to 100%, the envelope-method was not ap-
            plicable. Thus, the actual fitted distribution, with
            almost no variance, was used. This is clearly an
            idiosyncrasy of this data set as, for real-life systems,
            defects are likely to slip to the field.
   Step P3) Perform the Monte Carlo Simulation
               Using the distributions in Table VI, 1000 data sets
            were generated with Monte Carlo simulation (Latin
            Hypercube sampling). For each of the 5 coverage
            measures (test and defect), 36 data points were gen-
            erated, 12 data points for each of the 3 test phases
            (UT, IT, AT). Simulated samples are therefore com-
            parable to the actual sample in the sense that they
            are based on the same coverage distributions.
               For this task we used Microsoft Excel [20] and
            @Risk [28].
   Step P4) Derive the R² Distribution under H0
               For each of the 1000 data sets generated in step
            P3, the (multivariate) R² is calculated.

Fig. 4.   Example of Applying the Envelope-Method
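The envelope-method and the constrained sampling of Steps P2–P3 can be sketched as follows. The distribution parameters and regression coefficients here are illustrative placeholders, not the fitted values of Tables IV–VI.

```python
# Sketch of Steps P2-P3: model IT coverage conditionally on UT coverage
# via the least-squares "envelope", then sample under H0 while enforcing
# the bounds and the monotonicity constraint (IT >= UT).
# All parameter values below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(1)

MU_UT, SIGMA_UT = 0.62, 0.05       # placeholder fitted N(mu, sigma) for UT
A, B, SIGMA_RES = 0.30, 0.75, 0.02  # placeholder least-squares line + residual sd

def sample_pair():
    """Sample a (UT, IT) coverage pair respecting bounds and monotonicity."""
    ut = rng.normal(MU_UT, SIGMA_UT)
    ut = min(max(ut, 0.0), 1.0)            # 0% <= coverage <= 100%
    # Conditional (envelope) distribution: N(a + b*ut, sigma_res).
    it = rng.normal(A + B * ut, SIGMA_RES)
    it = min(max(it, ut), 1.0)             # cumulative data: IT >= UT
    return ut, it

pairs = [sample_pair() for _ in range(1000)]
```

Clipping violating draws back into the feasible range is one simple way to implement the "modified according to the specified rules" constraint handling; rejection sampling would be an alternative.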

                                                                TABLE VI

               On each data set, a multivariate linear regres-
            sion analysis is performed, with defect-coverage as
            the dependent variable and the TC measures as
            predictors.
               This task used Stata [11], which allows for an easy
            automation of such an iterative procedure.
               The distribution of 1000 R²-values shown in
            Fig. 5 is the distribution anticipated under H0.
               As explained in Section 4.4, this distribution is
            used to test whether the observed R²-value is likely
            under H0: TC has no impact of its own on DC.

D. Result of Statistical Testing

   In order to test H0, the steps T1–T3 were performed using
the R² distribution from step P4:
   Step T1) Set α = 0.05.
   Step T2) Calculate R²(TC, DC)
               R²(TC, DC) was calculated from the empirical
            data sets TC and DC by multivariate linear regres-
            sion.² Table VII shows the regression summary;
            it provides the value of R²(TC, DC), which is
            surprisingly high.

   ²Just by looking at the data, there was no graphical evidence of nonlinearity.
In case of doubt, appropriate tests for linearity and scale transformations should
be performed.
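Step P4 can be sketched as follows: for each simulated data set, compute the multivariate R² of defect-coverage regressed on the TC measures. The study used Stata; the version below uses numpy, and the random inputs merely stand in for the simulated data sets.

```python
# Sketch of Step P4: build the R^2 distribution under H0 by regressing
# a defect-coverage column on TC columns for many independently
# simulated data sets. Random inputs stand in for the simulated sets.
import numpy as np

rng = np.random.default_rng(7)

def r_squared(X, y):
    """Multivariate linear-regression R^2 of y on the columns of X."""
    X1 = np.column_stack([np.ones(len(y)), X])   # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

# Under H0, TC and DC are sampled independently, so any R^2 is spurious.
n_points, n_tc = 36, 4                            # 12 programs x 3 phases; 4 TC criteria
r2_h0 = np.array([r_squared(rng.random((n_points, n_tc)),
                            rng.random(n_points))
                  for _ in range(1000)])

quantile_95 = np.quantile(r2_h0, 0.95)
print(f"95%-quantile of R^2 under H0: {quantile_95:.3f}")
```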

Fig. 5.   Distribution of Generated R²-Values with Median and 95%-Quantile
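The test of steps T1–T3 against a distribution like the one in Fig. 5 can be sketched as follows; the simulated R² values and the observed R² below are placeholders, not the study's results.

```python
# Sketch of Steps T1-T3: compare the empirical R^2 with the
# (1 - alpha)-quantile of the R^2 distribution generated under H0.
# Placeholder values throughout, not the study's results.
import numpy as np

rng = np.random.default_rng(42)
r2_h0 = rng.beta(2.0, 12.0, size=1000)   # stand-in for the simulated R^2 values

alpha = 0.05                              # Step T1
r2_observed = 0.20                        # Step T2 (placeholder value)

# Step T3: reject H0 only if the observed R^2 exceeds the quantile.
quantile = np.quantile(r2_h0, 1.0 - alpha)
p_value = float(np.mean(r2_h0 >= r2_observed))
reject = bool(r2_observed > quantile)
print(f"95%-quantile={quantile:.3f}, p-value={p_value:.3f}, reject H0: {reject}")
```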

               Moreover, the low p-values indicate that the var-
            ious TC criteria complement each other with respect
            to their predictive power on DC.

                          TABLE VII
   REGRESSION SUMMARY FOR MULTIVARIATE LINEAR REGRESSION
                 WITH EMPIRICAL SAMPLE

   Step T3) Perform Statistical Test
               To reject H0, since α = 0.05, the R²(TC, DC)
            must be greater than the 95%-quantile of the test
            distribution (see Fig. 5).
               With R²(TC, DC) below the 95%-quantile, H0
            cannot be rejected. For this data set, the results do
            not support the claim that TC has an important, ad-
            ditional impact on DC when the effect of the number
            of test cases is already accounted for. If this data-set
            were from real projects, then nothing in the results
            would suggest that test case generation would be
            improved by following a TC-driven strategy.

                        V. DISCUSSION

   The main issue in this paper is that a relationship between TC
and DC does not necessarily mean there is a causal relationship.
It is plausible to assume that both TC and DC are driven by the
amount of testing (e.g., number of test cases). This would then
lead to an empirical relationship between test- and defect-cov-
erage measures. Concluding on the existence of a causal rela-
tionship between TC and DC would have important practical
consequences, because it suggests that testing should be driven
by such TC measures. It is therefore important to test whether
TC really impacts DC when the effect of the amount of testing
is accounted for.
   This paper shows an appropriate procedure and in what cir-
cumstances it should be used. The data-set did not suggest that
any of the TC measures, even when used together, has any ad-
ditional effect on DC when the amount of testing was already
accounted for. Of course, such a result is anticipated to vary
across environments, depending on the distribution of defects,
the type of defects, etc.
   This result sheds new light on the conclusions of the original
study [13], where the authors suspected that: "there's a correla-
tion between the number of faults detected in a version and the
coverage of its program constructs." Based on their data, our
procedure provides an accurate answer to that question.
   One of the preconditions of using our procedure is that TC
is not the main driver of the testing process (e.g., the design
of test cases). When this is not true, then TC cannot really be
distinguished from the amount of testing. This can be easily
checked by looking at the relationship between the number of
test cases executed and the increase in TC.
   The relationship between TC, DC, testability, and reliability
(or its various definitions) is complicated. However, it needs
to be modeled and tested in environments where early testing
phases need to be controlled. In particular, we believe that testa-
bility might explain the large variations observed in the relation-
ship between defect-coverage and TC. Our future work includes
the development of a set of case studies and experiments to
study these complex relationships and find optimal ways to
model them.

                         REFERENCES

  [1] V. Basili and R. Selby, "Comparing the effectiveness of software testing
      strategies," IEEE Trans. Software Engineering, vol. 13, no. 12, pp.
      1278–1296, 1987.
  [2] B. Beizer, Software Testing Techniques: Van Nostrand Reinhold, 1990.
  [3] A. Bertolino and M. Marré, "How many paths are needed for branch
      testing?," J. Systems and Software, vol. 35, no. 2, pp. 95–106, 1996.
  [4] BestFit: Probability Distribution Fitting for Windows, User's Guide,
      Palisade Corp., 1997.
  [5] F. Del Frate, P. Garg, A. Mathur, and A. Pasquini, "On the correlation
      between code coverage and software reliability," in Proc. 4th Int'l. Symp.
      Software Reliability Engineering (ISSRE), 1995, pp. 124–132.

  [6] L. Foreman and S. Zweben, "A study of the effectiveness of control and
      data flow testing strategies," J. Software and Systems, vol. 21, no. 3, pp.
      213–228, 1993.
  [7] P. G. Frankl and S. N. Weiss, "An experimental comparison of the effec-
      tiveness of branch testing and data flow testing," IEEE Trans. Software
      Engineering, vol. 19, no. 8, pp. 774–787, 1993.
  [8] P. G. Frankl, S. N. Weiss, and C. Hu, "All-uses vs mutation testing: An
      experimental comparison of effectiveness," J. Systems and Software, vol.
      38, no. 3, pp. 235–253, 1997.
  [9] P. G. Frankl and E. J. Weyuker, "An applicable family of data flow
      testing criteria," IEEE Trans. Software Engineering, vol. 14, no. 10, pp.
      1483–1498, 1988.
 [10] ——, "Provable improvements on branch testing," IEEE Trans. Soft-
      ware Engineering, vol. 19, no. 10, pp. 962–975, 1993.
 [11] L. C. Hamilton, Statistics with Stata 5: Duxbury Press, 1998.
 [12] B. Haworth, "Adequacy criteria for object testing," in Proc. 2nd Int'l.
      Software Quality Week, Belgium, Nov. 1998.
 [13] J. R. Horgan, S. London, and M. R. Lyu, "Achieving software quality
      with testing coverage measures," IEEE Computer, pp. 60–69, Sept.
 [14] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand, "Experiments on
      the effectiveness of dataflow- and controlflow-based test adequacy cri-
      teria," in Proc. 16th Int'l. Conf. Software Engineering (ICSE), 1994, pp.
      191–200.
 [15] R. Jacoby and K. Masuzawa, "Test coverage dependent software relia-
      bility estimation by the HGD model," in Proc. 1st Int'l. Symp. Software
      Reliability Engineering (ISSRE), 1992, pp. 193–204.
 [16] Z. Jin and J. Offutt, "Coupling-based criteria for integration testing,"
      Software Testing, Verification and Reliability, vol. 8, no. 3, pp. 133–154,
      1998.
 [17] N. Li, Y. K. Malaiya, and J. Denton, "Estimating the number of defects:
      A simple and intuitive approach," in Proc. 7th Int'l. Symp. Software Re-
      liability Engineering (ISSRE), 1998, pp. 307–315.
 [18] M. R. Lyu, J. R. Horgan, and S. London, "A coverage analysis tool for
      the effectiveness of software testing," in Proc. 2nd Int'l. Symp. Software
      Reliability Engineering (ISSRE), 1993, pp. 25–34.
 [19] Y. K. Malaiya, N. Li, and J. Bieman et al., "The relationship between test
      coverage and reliability," in Proc. 3rd Int'l. Symp. Software Reliability
      Engineering (ISSRE), 1994, pp. 186–195.
 [20] Microsoft Excel 97, User's Guide, Microsoft Corp., 1997.
 [21] J. A. Morgan and G. J. Knafl, "Residual fault density prediction using
      regression methods," in Proc. 5th Int'l. Symp. Software Reliability En-
      gineering (ISSRE), 1996, pp. 87–92.
 [22] S. C. Ntafos, "A comparison of some structural testing strategies," IEEE
      Trans. Software Engineering, vol. 14, no. 6, pp. 868–874, 1988.
 [23] P. Piwowarski, M. Ohba, and J. Caruso, "Coverage measurement expe-
      rience during function test," in Proc. 15th Int'l. Conf. Software Engi-
      neering (ICSE), 1993, pp. 287–301.
 [24] S. Rapps and E. J. Weyuker, "Selecting software test data using data
      flow information," IEEE Trans. Software Engineering, vol. 11, no. 4,
      pp. 367–375, 1985.
 [25] RISKView, The Distribution Viewing Companion, User's Guide, Pal-
      isade Corp., 1996.
 [26] A. Veevers and A. C. Marshall, "A relationship between software cov-
      erage metrics and reliability," Software Testing, Verification and Relia-
      bility, vol. 4, no. 1, pp. 3–8, 1994.
 [27] D. Vose, Quantitative Risk Analysis: A Guide to Monte Carlo Simulation
      Modeling: John Wiley & Sons, 1996.
 [28] W. L. Winston, Simulation Modeling Using @Risk: Duxbury Press, 1996.
 [29] E. J. Weyuker, "More experience with data flow testing," IEEE Trans.
      Software Engineering, vol. 19, no. 9, pp. 912–919, 1993.
 [30] W. E. Wong, J. R. Horgan, S. London, and A. P. Mathur, "Effect of
      test set size and block coverage on the fault detection effectiveness," in
      Proc. 3rd Int'l. Symp. Software Reliability Engineering (ISSRE), 1994,
      pp. 230–238.

Lionel C. Briand is with the Department of Systems and Computer Engi-
neering, Carleton University, where he is an Associate Professor. Before that,
Lionel was a department head at the Fraunhofer Institute for Experimental
Software Engineering, Germany, and a Software Engineering group leader at
the Computer Research Institute of Montreal (CRIM), Canada. Lionel worked
as a research scientist for the Software Engineering Laboratory, a consortium
of the NASA Goddard Space Flight Center, CSC, and the University of
Maryland. He has been on the program, steering, or organization committees
of many international IEEE conferences such as ICSE, ICSM, ISSRE, and
METRICS. Lionel is on the editorial board of Empirical Software Engineering:
An International Journal (Kluwer). His research interests include software
testing and inspections, object-oriented software development, and quantitative
methods applied to software quality engineering.

Dietmar Pfahl studied applied mathematics, software engineering, and eco-
nomics at the University of Ulm, Germany, and the University of Southern Cal-
ifornia, USA. He received his M.Sc. in Applied Mathematics and Economics
from the University of Ulm. From 1987 to 1996, he was a research staff member
and software engineering consultant with two corporate research divisions of
Siemens AG, Germany. This affiliation was complemented by a one year stay
as a junior scientist at the German Aerospace Research Establishment (DLR)
in Oberpfaffenhofen. Since 1996, he has been with the Fraunhofer Institute for
Experimental Software Engineering (IESE), where he works as a project engi-
neer in various national and international research and transfer projects with the
software industry. His research interests include software process simulation,
and quantitative methods applied to software project management.
