60 IEEE TRANSACTIONS ON RELIABILITY, VOL. 49, NO. 1, MARCH 2000

Using Simulation for Assessing the Real Impact of Test-Coverage on Defect-Coverage

Lionel C. Briand and Dietmar Pfahl

Abstract—The use of test-coverage measures (e.g., block-coverage) to control the software test process has become an increasingly common practice. This is justified by the assumption that higher test-coverage helps achieve higher defect-coverage and therefore improves software quality. In practice, data often show that defect-coverage and test-coverage grow over time, as additional testing is performed. However, it is unclear whether this phenomenon of concurrent growth can be attributed to a causal dependency, or if it is coincidental, simply due to the cumulative nature of both measures. Answering such a question is important as it determines whether a given test-coverage measure should be monitored for quality control and used to drive testing. Although there is no general answer to this problem, a procedure is proposed to investigate whether any test-coverage criterion has a genuine additional impact on defect-coverage when compared to the impact of just running additional test cases. This procedure applies in typical testing conditions where
• the software is tested once, according to a given strategy,
• coverage measures are collected as well as defect data.
This procedure is tested on published data, and the results are compared with the original findings. The study outcomes do not support the assumption of a causal dependency between test-coverage and defect-coverage, a result for which several plausible explanations are provided.

Index Terms—Defect-coverage, Monte Carlo simulation, Software test, Test-coverage, Test intensity.

I. INTRODUCTION

Abbreviations and Acronyms
AT    acceptance test
IT    integration test
UT    unit test
K–S   Kolmogorov–Smirnov
A–D   Anderson–Darling
s-    statistical

Notation
d             defect-coverage
c             test-coverage
t             test intensity
n             number of levels of t
m             number of c measures
k             number of test objects
DC            set of d measurements
TC            set of c measurements
F             Cdf of coverage measurements at a certain t, with F = F(c) or F = F(d)
N(mu, sigma)  Gaussian (s-normal) Cdf with mean mu and standard deviation sigma
H0            null hypothesis
p-value       Pr{wrongly rejecting H0}
alpha         s-significance level
S             number of repetitions in a Monte Carlo simulation
R^2           coefficient of determination [(multivariate) regression coefficient]
R^2(sim)      R^2 under H0 (generated through simulation)
R^2(emp)      R^2 derived from the empirical data set
eps           error term in the (multivariate) regression equation

Definitions of Test-Coverage Measures

Test-coverage is measured as the fraction of constructs—as defined by the coverage criterion—that have been executed at least once during testing. In the literature, a variety of test-coverage criteria have been suggested, e.g., [13]:

• Block coverage: A control-flow based criterion that measures the portion of basic blocks executed during testing. Basic blocks are maximal code fragments without branching. Thus, a basic block can only be executed entirely, from beginning to end, as it contains no internal change of control flow.

• Decision coverage: A control-flow based criterion that measures the portion of decisions executed during testing. A decision is a pair of basic blocks such that the second is a successor of the first. Decision coverage subsumes block coverage, i.e., 100% decision coverage implies 100% block coverage.

• C-use coverage: A data-flow based criterion that measures the portion of c-uses (computational uses) covered during testing. A c-use is:
  • a variable x, and
  • the set of all paths in the data flow graph from node n1 to node n2 such that:
    i) x is defined in n1;
    ii) x is not defined in any other node on the paths from n1 to n2;
    iii) x is used in a computational expression of node n2, e.g., as a procedure argument, as an initializer in a declaration, as a return value of a function call, or as the second operand of the assignment operator (=).
  A c-use is covered if at least one of the paths in the c-use is executed during testing.

• P-use coverage: A data-flow based criterion that measures the portion of p-uses (predicate uses) covered during testing. A p-use is:
  • a variable x, and
  • the set of all paths in the data flow graph from node n1 to node n2 such that:
    i) x is defined in n1;
    ii) x is not defined in any other node on the paths from n1 to n2, except possibly n2;
    iii) x is used in a predicate expression of node n2, e.g., as the first operand in the conditional expression of an if, for, while, do, or switch statement.
  A p-use is covered if at least one of the paths in the p-use is executed during testing.

Manuscript received August 31, 1999; revised October 1, 1999.
L. C. Briand is with the Systems and Computer Engineering Department, Carleton University, 1125 Colonel By Drive, Ottawa K1S 5B6 Canada (e-mail: Briand@sce.carleton.ca).
D. Pfahl is with the Fraunhofer Institute for Experimental Software Engineering (IESE), Sauerwiesen 6, D-67661 Kaiserslautern, Germany (e-mail: Pfahl@iese.fhg.de).
Responsible editor: M. A. Vouk.
Publisher Item Identifier S 0018-9529(00)06203-5.
0018–9529/00$10.00 © 2000 IEEE
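The block-coverage definition above can be made concrete with a small sketch. The function and the per-block execution counts below are purely illustrative; in practice such profiles would come from an instrumentation tool.

```python
# Hypothetical illustration of block coverage: the fraction of basic blocks
# executed at least once during testing. The counts below are invented.

def block_coverage(execution_counts):
    """execution_counts: per-basic-block execution counts from one test run."""
    if not execution_counts:
        return 0.0
    executed = sum(1 for n in execution_counts if n > 0)
    return executed / len(execution_counts)

# A program with 8 basic blocks; this test run never reaches two of them.
counts = [3, 1, 0, 7, 2, 0, 5, 1]
print(block_coverage(counts))  # 6 of 8 blocks executed -> 0.75
```

The same shape of computation applies to decisions, c-uses, and p-uses: count the constructs exercised at least once and divide by the total number of constructs defined by the criterion.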
Testing is one of the most effort-intensive activities during software development [2]. Much research is directed toward developing new, improved test methods. One way to control testing better—and thus improve test resource allocation—is to measure estimators (referred to as test-coverage) of the fraction of defects detected during testing (referred to as defect-coverage). Many test-coverage measures have been proposed and studied. They range from
• simple measures counting the program blocks covered, to
• data-flow based measures looking at the definition and use of variables.
Many of these test-coverage measures have been investigated in terms of their subsumption relationships, the ease with which complete coverage can be achieved, and the ways they can be used to drive test case generation and selection [9], [12], [16], [22], [24]. Several additional studies reporting the application of test-coverage measures to control or improve test efficiency have been published [3], [23], [29]. More importantly in the context of our research, researchers have attempted to build defect-coverage models based on test-coverage measures [1], [5]–[8], [10], [14], [15], [17], [19], [21], [26], [30]. The basic assumption, regardless of the modeling strategy, is that there is some (important) causal effect between test-coverage and defect-coverage.

However, since both test-coverage and defect-coverage increase with test intensity or time, it is not surprising that empirical data usually show a relationship. But it does not necessarily mean that additional test-coverage drives the detection of new defects. The question investigated in this paper is how to test whether a given test-coverage measurement, or several of them combined, are actually having an important impact on defect-coverage. It is also important that any solution be usable in typical testing conditions. The main focus of this paper is to present an easily replicable procedure that is based on simulation techniques and is designed to investigate the relationship between test-coverage and defect-coverage. Data coming from [13] are used to exemplify the procedure and show how it can yield more precise results.

Section II precisely defines the problems associated with using test-coverage measures for controlling test effectiveness. Section III presents a simulation-based procedure to test the impact of test-coverage on defect-coverage. Section IV provides the results of applying our simulation-based procedure. Section V discusses the work and proposes directions for future research.

II. PROBLEM STATEMENT

When a relationship is observed between test-coverage and defect-coverage, it is commonly assumed to support the hypothesis that test-coverage leads to defect-coverage. However, does this assumption really capture reality? This has important practical implications, as it justifies why testing should be coverage driven, or evaluated based on coverage achievement. Therefore, it is important that test-coverage measures be validated as important defect-coverage drivers.

Another, perhaps more plausible, interpretation of any empirical relationship between a test-coverage measure and defect-coverage is that they are both driven by more testing (referred to as test intensity). This is the typical dilemma of interpreting a relationship as causal or coincidental.

One way to approach this problem is to determine whether test-coverage has any additional impact on defect-coverage as compared to test intensity alone. In other words, this is equivalent to assessing whether test-coverage is still a statistically significant indicator of defect-coverage when the effect of test intensity has already been accounted for. One approach is to determine whether the combined effect of test intensity and test-coverage can better explain the variations in defect-coverage than test intensity alone. If this is the case, then one can conclude that evidence suggests that test-coverage has an important additional impact on defect-coverage.

Where testing is mainly driven by test-coverage, test intensity and test-coverage cannot be differentiated. But in typical situations, this is not the case. In addition, defect-coverage and test-coverage data are usually collected at a few discrete points in time, e.g., at the end of each testing phase such as unit, integration, or system testing. This requires an analysis approach considering these practical constraints. In the data set used in this paper, the test intensity has 3 possible values. Due to the design of the original study from which the data are drawn, all systems showed identical test intensity (viz, number of test cases) at the end of each test phase, when coverage measurement was taken.

In order to test the importance of the impact of test-coverage on defect-coverage, a procedure is defined that can be easily used in a context where defect and test-coverage data are collected at a few discrete points in the testing process. This procedure is based on Monte Carlo simulation and can easily be automated.

III. TESTING THE IMPACT OF TEST-COVERAGE

This section presents our procedure to test whether test-coverage has an important impact on defect-coverage. Section III-A presents the rationale and relates it to the fundamentals of statistical testing. Section III-B describes the procedure in detail.

This procedure has been designed to work in a typical context where testing is not coverage driven and where test and defect-coverage data are collected at a few points in time, e.g., at the end of testing activities such as unit, integration, or system testing.
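The core comparison of Section II—does adding test-coverage c to test intensity t explain defect-coverage d better than t alone?—can be sketched with ordinary least squares. All data below are invented, and `r_squared` is a helper defined only for this example.

```python
import numpy as np

# Illustrative sketch (invented data): compare the fit of defect-coverage (d)
# on test intensity (t) alone vs. on t plus one coverage measure (c).

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
t = np.repeat([1.0, 2.0, 3.0], 12)             # 3 intensity levels, 12 objects each
c = 0.3 * t + rng.normal(0, 0.05, t.size)      # coverage grows with intensity
d = 0.25 * t + rng.normal(0, 0.05, t.size)     # defect-coverage grows with intensity

r2_t = r_squared(t[:, None], d)                # intensity alone
r2_tc = r_squared(np.column_stack([t, c]), d)  # intensity + coverage
print(r2_t <= r2_tc + 1e-12)                   # adding a regressor never lowers R^2
```

Note that plain R^2 can only increase when a regressor is added, which is exactly why a naive comparison is not enough: the procedure below builds the distribution of that increase under the null hypothesis before drawing conclusions.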
A. Rationale

Using statistical testing terminology, the goal is to test the null hypothesis H0:
• test-coverage has no additional impact on defect-coverage as compared to test intensity alone.

In order to test this hypothesis, we need to estimate what would be the strength of the relationship between test-coverage and defect-coverage, assuming H0 is true. If the strength of this relationship is measured in terms of goodness-of-fit or correlation (e.g., R^2), the expected R^2 distribution is estimated under H0. Such a distribution can then be used for statistical testing by comparing R^2 in the observed sample to the distribution, and assessing the probability of obtaining an equivalent or higher R^2. If this probability is small (say, below 0.05 or 0.01), we can assuredly reject H0 and assume that the impact of test-coverage on defect-coverage is plausible. Otherwise, H0 cannot be rejected, and there is no supporting evidence that test-coverage has any impact on defect-coverage that is not already explained by test intensity.

The main issue is to devise a method to compute the anticipated R^2 distribution under H0. We typically have to work with a sample of projects for which we have defect-coverage and test-coverage data (usually for several test-coverage measures), corresponding to certain test intensity values (e.g., number of test cases, test effort), and collected at the end of various testing activities or phases. At an intuitive level, what the procedure (in Section III-B) does is to use the sample test-coverage and defect-coverage distributions to estimate the respective population distributions. Then, it uses these estimated population distributions to generate the anticipated R^2 distribution by using only the test intensity information in the sample.

To allow for an analysis across systems of varying functional size, test intensity should be normalized using any suitable measure of functional size, e.g., function points. The validity of specific test intensity and coverage measures is context-dependent and is not discussed here, because it does not affect the procedure in this paper. However, defining a meaningful test intensity measure requires carefully assessing its underlying assumptions. For example, if larger systems happen to be more difficult to test (diseconomies of scale), a measure of test effort per function point would not be a valid measure of test intensity. In other words, equal test intensity values might not be comparable across systems of different sizes.

To illustrate the principles, assume that there is only 1 test-coverage measure c (e.g., block coverage). Although the relationship between c and d does not have to be linear,1 the sample data could look like the scatterplot in Fig. 1.

Fig. 1. Relationship between Defect-Coverage and Test-Coverage

Typical test data sets are composed of data vectors containing
• defect-coverage and test-coverage data,
• test effort,
• number of test cases run.
Across systems, such data vectors can be grouped according to test intensity levels (e.g., groups are depicted by rectangles in Fig. 1). Several strategies can be adopted for data collection; coverage and test intensity data can be collected
a) at the completion of test phases (e.g., as in Fig. 1), or
b) on a regular basis, e.g., daily or weekly.
Strategy a) applies in a context of organizations with a well-defined test process and strategy, which are presumed to yield similar test intensities across systems for each of the test phases. For strategy b), daily measurement is probably necessary for organizations that develop small and medium software programs. For organizations that develop large software systems over long periods of time, weekly measurement might be sufficient. In any case, it is important to collect coverage and test intensity data at a sufficient granularity level to allow for the grouping, across systems, of data vectors showing similar test intensities.

1If the relationship is not linear, e.g., exponential, it can be linearized to facilitate the analysis.
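Estimating a phase's population distribution from a small sample, as described above, might look like the following sketch. The 12 sample values are invented, and scipy.stats stands in for a commercial fitting tool; note that estimating the parameters from the same data that the K–S test then checks biases the p-value upward, so the result is only a plausibility screen.

```python
import numpy as np
from scipy import stats

# Sketch (invented sample): fit a Gaussian to one phase's coverage values for
# 12 test objects, then screen the fit's plausibility with a K-S test.

sample = np.array([0.61, 0.58, 0.70, 0.66, 0.72, 0.59,
                   0.64, 0.68, 0.63, 0.67, 0.60, 0.71])

mu, sigma = stats.norm.fit(sample)               # maximum-likelihood estimates
ks = stats.kstest(sample, 'norm', args=(mu, sigma))
print(round(mu, 3), round(sigma, 3), ks.pvalue > 0.05)
```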
The number of groups does not matter much, as long as the number of observations within groups is large enough for the estimation of the population distributions.

The example of Fig. 1 shows coverage data collected from measuring 12 programs (or any other type of object under test), and groups them according to 3 test phases:
• unit test,
• integration test,
• system test.
Groups are depicted by rectangles. For the sake of simplicity, only one test-coverage measure is shown here. Corresponding to each test phase, there are unknown population distributions for c and d. These distributions can be estimated using the sample distributions on these dimensions. The idea is to generate, through simulation, many scatterplots (e.g., more than 1000) such as the one in Fig. 1, by using the estimated population distributions on c and d and sampling from them independently. Each simulated sample would have exactly the same number of observations corresponding to each group as the original sample. Following that simulation procedure, the test intensity effect is preserved, since we sample from the corresponding c and d distributions for each test phase. However, since we ignore the pairing between defect-coverage and test-coverage (because independent sampling is performed for the two distributions), no coverage relationship is preserved if not already accounted for by test intensity.
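The independent per-phase sampling just described might be sketched as follows. The phase means and standard deviations are invented placeholders, not fitted values from any real data set.

```python
import numpy as np

# Sketch of the simulation idea (invented phase parameters): sample
# test-coverage (c) and defect-coverage (d) independently within each test
# phase, so any pairing beyond the shared test-intensity effect is destroyed.

rng = np.random.default_rng(1)
phases = {            # (mean, sd) per phase for c and d -- illustrative values
    'UT': {'c': (0.55, 0.08), 'd': (0.40, 0.10)},
    'IT': {'c': (0.70, 0.06), 'd': (0.60, 0.10)},
    'AT': {'c': (0.80, 0.05), 'd': (0.85, 0.08)},
}
K = 12  # test objects per phase, matching the group sizes of the real sample

def simulated_sample():
    rows = []
    for phase, p in phases.items():
        c = np.clip(rng.normal(*p['c'], K), 0.0, 1.0)  # enforce [0%, 100%]
        d = np.clip(rng.normal(*p['d'], K), 0.0, 1.0)
        rows += list(zip([phase] * K, c, d))
    return rows

sample = simulated_sample()
print(len(sample))  # 36 data points: 12 objects x 3 phases
```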
If test-coverage is important, we anticipate that the simulated samples would show, on average, a poorer correlation than the actual sample, where both test intensity and test-coverage effects on defect-coverage should be visible. In other words, the R^2 distribution of the simulated scatterplots is the distribution we would anticipate under H0, and it can be used to test how likely the sample R^2-value is under this condition. Such a procedure can also be used with several test-coverage measures and multivariate regression.

B. Procedure for Statistical Testing

This section presents a procedure to compute the distribution characterizing the relationship between c and d which is anticipated under H0, that is, when it is assumed that defect-coverage is driven only by test intensity. Such a distribution can then be used for testing whether our H0 can be rejected based on available evidence, viz, the actual sample R^2.

1) Hypothesis Definition:
• R^2(emp) is the (multivariate) regression coefficient relating d to t and c, as calculated from the available data set.
• R^2(sim) is the (multivariate) regression coefficient under H0, when d is related only to t.

2) Formal Representation of Empirical Data Set: The available data on t, c, and d can be formally represented to facilitate further treatment.

• Test intensity: 1 to n levels of t: t(1), ..., t(n), with t(1) < ... < t(n); e.g., each level can correspond to the completion of a test phase. Depending on the specific context, t can be measured in terms of test cases, test effort, or any other measure that adequately captures the amount of testing applied to a piece of software.

• Test-coverage: 1 to m measures, e.g., blocks, decisions; 1 to k test objects, e.g., modules, units, programs, sub-systems. A data set TC of test-coverage measurements:
  TC = { c(i, j, l) : i = 1, ..., n; j = 1, ..., m; l = 1, ..., k }.

• Defect-coverage: A data set DC of measurements:
  DC = { d(i, l) : i = 1, ..., n; l = 1, ..., k },
  where d(i, l) is calculated as the cumulated number of defects found at a given t(i), divided by the total number of defects.

Both c and d measurements are usually expressed in percentages.

3) Procedure to Construct the Test Distribution: A 4-step procedure is defined to construct the test distribution for R^2 under H0 (probability distribution of R^2(sim)):

Step P1) Estimate Theoretical Distributions for all Coverage Measures. For each coverage criterion (test and defect), find the best-fit distributions to the sample data of level t(1). Finding the best distributions and fitting their parameters can be easily automated using tools such as BestFit [4]. Specific statistical tests (e.g., chi-square, K–S, A–D [27]) are usually helpful to find the analytic distribution, e.g., s-normal (Gaussian), Beta, Weibull, with the closest fit to the data, and to determine whether any subset plausibly represents the population distribution.

Step P2) Derive Coverage Conditional Distributions for Subsequent Levels. Because c and d measurements are cumulated, dependencies between the distributions of coverage measurements of subsequent levels are anticipated, i.e., between the distributions at t(i) and t(i+1) for both c and d measures. These dependencies can be modeled by using the envelope-method [27]. If s-normal distributions are used, and if linear dependency can be assumed, then the envelope-method involves running a least-squares regression analysis between the same measurements of different test phases. Section IV gives details on how to model dependencies between distributions by using least-squares regression, and illustrates them by an example.

To make sure that a realistic population is generated when using simulation, it is important to guarantee that dependencies within samples are explicitly modeled. The dependencies due to the cumulative nature of the data sets are to a large degree conserved by using the envelope-method. However, it is still possible that simulated data sets are outside a realistic range, e.g., they may show a decreasing coverage. This can happen because fitting based on the empirical sample can yield distributions whose domain is larger than the realistic sampling ranges. Thus, to avoid unrealistic sampling from fitted distributions, lower and upper bounds for c and d have to be enforced when necessary: no measurement may fall below 0% or above 100%, and, because the data are cumulative, values at level t(i+1) must not be smaller than at t(i).

Step P3) Perform the Monte Carlo Simulation. By independently sampling from the c and d distributions modeled in steps P1 and P2, Monte Carlo simulation can be used to generate data sets that conserve the distribution properties of the original data sets TC and DC. For large S, the generated data sets should provide a representative picture of what samples would look like under H0. This stems from the fact that c and d distributions are sampled independently for each t level, thus ignoring any possible relationship between these measures. Latin Hypercube sampling [27] can speed up the convergence of the simulated distributions toward the theoretical population distribution from which the sample is drawn. In this context, S = 1000 data sets usually provide an adequate level of precision for the estimated population distribution.

Step P4) Derive the R^2 Distribution under H0. For each of the S data sets generated in step P3, (multivariate) regression analysis is performed and R^2(sim) is calculated. The sample of R^2-values can then be used for statistical testing, as described in Section III-B-4.

4) Testing H0: To test H0, steps T1–T3 must be performed:

Step T1) Set alpha. Set alpha to a given risk level of falsely rejecting H0. Typical levels are 5% or 1%.

Step T2) Calculate R^2(emp). Calculate R^2(emp) based on TC and DC by performing (multivariate) regression, e.g., by calculating the regression line

  d = b(0) + b(1)*t + b(2)*c(1) + ... + b(m+1)*c(m) + eps

for a linear relationship, where each c(j) is the measurement of a given coverage measure across all t levels and test objects.

Step T3) Perform Statistical Test. To compare R^2(emp) with the sample of R^2(sim)-values calculated in step P4, compute the number of instances that are above R^2(emp). If this number represents a fraction of S larger than alpha, then H0 is not rejected, and we cannot conclude that c has a statistically significant impact on d.
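Putting steps P3/P4 and T1–T3 together, a compact sketch on invented data follows. The integer `phase` index stands in for test intensity t, and no envelope-method conditioning is applied, so this illustrates only the overall test logic, not a faithful replication of the procedure.

```python
import numpy as np

# End-to-end sketch of steps P3/P4 and T1-T3 on invented data: build the R^2
# distribution under H0 by independent per-phase sampling, then compare the
# empirical R^2 against the (1 - alpha) quantile.

rng = np.random.default_rng(2)

def r2(X, y):
    """Plain least-squares R^2 with an intercept term."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    res = y - X1 @ beta
    return 1.0 - float(res @ res) / float(((y - y.mean()) ** 2).sum())

# Invented "empirical" sample: 3 phases x 12 objects; d is driven by phase only.
phase = np.repeat([0, 1, 2], 12)
c_mu, c_sd = np.array([0.55, 0.70, 0.80]), 0.06
d_mu, d_sd = np.array([0.40, 0.60, 0.85]), 0.08
c_emp = rng.normal(c_mu[phase], c_sd)
d_emp = rng.normal(d_mu[phase], d_sd)
r2_emp = r2(np.column_stack([phase, c_emp]), d_emp)   # step T2

S, alpha = 1000, 0.05                                  # steps P3 and T1
r2_sim = np.empty(S)
for i in range(S):                                     # step P4
    c_s = rng.normal(c_mu[phase], c_sd)                # c and d sampled
    d_s = rng.normal(d_mu[phase], d_sd)                # independently per phase
    r2_sim[i] = r2(np.column_stack([phase, c_s]), d_s)

threshold = np.quantile(r2_sim, 1 - alpha)             # step T3
print(r2_emp > threshold)  # True would reject H0; since d ignores c here,
                           # rejection should occur only about 5% of the time
```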
IV. CASE STUDY

Based on data generated during an experiment conducted by the University of Iowa and the Rockwell/Collins Avionics Division [13], this section illustrates how to apply the statistical testing procedure defined in Section III-B.

A. Background Information

The purpose of the experiment was to investigate the relationship between the "coverage of program constructs during testing" and defect-coverage. For this purpose, based on 1 specification, 12 program versions were developed independently, the program sizes ranging from 900–4000 uncommented lines of code. Then, c and d were measured in 3 subsequent test phases: UT, IT, AT. Because the programs were also exposed to field trials, a realistic approximation of the total number of defects (ranging from 5–10 defects) contained in each program could be made, thus allowing for a sensible calculation of actual d during test.

An important prerequisite for the procedure defined in Section III is that, in each test phase, an equal level of t is applied to the programs. In the experiment in [13], this prerequisite was fulfilled, since each program was subject to exactly the same set of test cases in each test phase. Information on c was not used to influence testing, e.g., by driving the generation of test cases such that c is systematically increased.

In the experiment, c was measured for 4 criteria [18]:
• block coverage,
• decision coverage,
• c-use coverage,
• p-use coverage.
Blocks and decisions are constructs contained in the control flow of a program. Typical examples of blocks in a program are consecutive code fragments that execute together and do not contain branches. Decisions are defined by the possible values of a branch predicate. C-use and p-use are data-flow oriented coverage measures. They represent special cases of definition-use pairs associated with program variables:
• first use of a variable in a calculation,
• first use of a variable in a predicate after its latest modification (or definition).
Each coverage criterion was measured for each program version at the end of each test phase using the ATAC (Automatic Test Analysis for C) tool [18].

B. Description of Available Data Set

Table I shows the raw data for d and the four c measures taken from 12 test objects at 3 levels of t (UT, IT, AT). The measurements are cumulated over test phases and presented as fractions.

TABLE I. RAW DATA [13]

C. Generation of Test Distribution

This section describes, step by step, how the "procedure to construct the test distribution" (Section III-B) can be applied to the raw data in Table I.

Step P1) Estimate Theoretical Distributions for all Coverage Measures. With the help of the tool BestFit, for each c and d criterion, suitable analytic distributions were derived by fitting 21 possible distributions against the sample data. As an example, Fig. 2 shows the result of fitting the s-normal distribution to the d data from phase UT.

Fig. 2. Result of Fitting N(mu, sigma) to d Data Using the Tool BestFit

The distributions for levels IT and AT are derived in step P2, because they are conditional on the distributions of coverage values at levels UT and IT, respectively. Tables II and III show that in all cases, and for each of the 3 test statistics (chi-square, K–S, A–D), the s-normal distribution is among the subgroup of plausible theoretical distributions (nonplausible distributions, having insufficient goodness-of-fit, are set in italics). In other words, there is a high probability that the empirical data could have been produced by the fitted N(mu, sigma). Table III provides detailed information on the related goodness-of-fit values for d(UT). Based on these results (the goodness-of-fit values for the remaining coverage measures are comparable), and to facilitate subsequent analysis steps (application of the envelope-method), we decided to use the s-normal distribution across the board, for all phases and criteria. Table IV shows the fitted s-normal distributions for all test phases, including UT. The main reason for this is to make sure that assuming an s-normal theoretical distribution makes sense for all test phases, although the fitted distributions for IT and AT are not actually used in the simulation procedure.

TABLE II. RANKED FITTED DISTRIBUTIONS FOR ALL COVERAGE MEASURES (AT t LEVEL UT) [25]
TABLE III. RANKED FITTED DISTRIBUTIONS FOR d(UT) WITH GOODNESS-OF-FIT VALUES
TABLE IV. FITTED N(mu, sigma)

If goodness-of-fit across theoretical distributions does not appear consistent among test statistics, it is easy to check the sensitivity of the overall test results to the selected theoretical distributions by performing steps P2–P4 using different distributions and comparing the results. This is easily done since the whole procedure can be automated.

Step P2) Derive Coverage Conditional Distributions for Subsequent Levels (Test Phases). The cumulative nature of the coverage measurements contained in the sample creates dependencies between distributions of a particular coverage measure across subsequent test phases. The main dependency is caused by the monotonicity of cumulative data; e.g., for a particular program, block coverage at the end of phase IT cannot be smaller than at the end of phase UT. Fig. 3 illustrates:
• the monotonicity of a coverage measure, by the fact that the related distributions shift from left to right;
• that, due to overlapping distributions of subsequent phases, independent random sampling can violate the monotonicity condition.

Fig. 3. Fitted Distributions of Block Coverage Measurements for Subsequent Test Phases

To ensure that this violation does not happen, and that the Monte Carlo sampling presented in step P3 is realistic, dependencies between the distributions of subsequent test phases have to be modeled explicitly through conditional distributions, e.g., the IT block coverage distribution as a function of a specific UT block coverage value. The regression results in Table V provide the equation of the least-squares line and the standard deviation, sigma(eps), of the vertical distances of each point from the least-squares line. Least-squares regression assumes that the error of the data about the least-squares line is s-normally distributed. Thus, if

  y = a*x + b

is the equation of the least-squares line, the conditional distribution is modeled as

  N(a*x + b, sigma(eps)).

TABLE V. REGRESSION SUMMARY BETWEEN BLOCK COVERAGES (UT) AND (IT)

Fig. 4 is an example where, for any value sampled from the fitted distribution of block coverage measurements in phase UT (block(UT)), the related conditional distribution of block coverage for IT is calculated as N(a*block(UT) + b, sigma(eps)).

Fig. 4. Example of Applying the Envelope-Method

Table VI shows the full set of best-fit (UT) and conditional distributions. As anticipated, conditional distributions show a reduced variance. In order to guarantee monotonicity, constraints have been added to the entries of Table VI; if simulation results happen to violate these constraints, they are modified according to the specified rules. In addition, the table entries specify the maximum boundary (no coverage measurement can be greater than 100%) and nonnegativity (no coverage measurement can be less than 0%) conditions.

TABLE VI. SUMMARY OF FITTED AND CONDITIONAL DISTRIBUTIONS USED FOR MONTE CARLO SIMULATION

One special case should be noted: since 11 out of 12 measurements of d in phase AT were equal to 100%, the envelope-method was not applicable. Thus, the actual fitted distribution, with almost no variance, was used. This is clearly an idiosyncrasy of this data set as, for real-life systems, defects are likely to slip to the field.

Step P3) Perform the Monte Carlo Simulation. Using the distributions in Table VI, 1000 data sets were generated with Monte Carlo simulation (Latin Hypercube sampling). For each of the 5 coverage measures (test and defect), 36 data points were generated: 12 data points for each of the 3 test phases (UT, IT, AT).
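The constrained conditional sampling just described might be sketched as follows. The fitted parameters and the regression coefficients are invented stand-ins for the Table VI entries.

```python
import numpy as np

# Sketch of the envelope-method sampling (invented parameters): draw a UT
# coverage value, then sample IT conditionally from the least-squares line
# N(a*UT + b, sigma_eps), clipping to keep the cumulative data monotone.

rng = np.random.default_rng(3)
mu_ut, sd_ut = 0.55, 0.08        # fitted N(mu, sigma) for block(UT)
a, b, sd_eps = 0.6, 0.35, 0.03   # regression of block(IT) on block(UT)

def draw_pair():
    ut = np.clip(rng.normal(mu_ut, sd_ut), 0.0, 1.0)
    it = rng.normal(a * ut + b, sd_eps)   # conditional distribution
    it = np.clip(it, ut, 1.0)             # monotonicity: UT <= IT <= 100%
    return ut, it

pairs = [draw_pair() for _ in range(1000)]
print(all(0.0 <= ut <= it <= 1.0 for ut, it in pairs))  # True
```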
Simulated samples are therefore comparable to the actual sample in the sense that they are based on the same coverage distributions. For this task, we used Microsoft Excel [20] and @Risk [28].

Step P4) Derive the R^2 Distribution under H0. For each of the 1000 data sets generated in step P3, the (multivariate) R^2 is calculated: on each data set, a multivariate linear regression analysis is performed. This task used Stata [11], which allows for an easy automation of such an iterative procedure. The distribution of 1000 R^2-values shown in Fig. 5 is the distribution anticipated under H0. As explained in Section III-B-4, this distribution is used to test whether the observed R^2-value is likely under H0: c has no impact of its own on d.

D. Result of Statistical Testing

In order to test H0, steps T1–T3 were performed using the R^2 distribution from step P4:

Step T1) Set alpha = 5%.

Step T2) Calculate R^2(emp). R^2(emp) was calculated from the empirical data sets TC and DC by multivariate linear regression.2 Table VII shows the regression summary; it provides the value of R^2(emp), which is surprisingly high.

2Just by looking at the data, there was no graphical evidence of nonlinearity. In case of doubt, appropriate tests for linearity and scale transformations should be performed.
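Step T2's multivariate regression can be sketched with plain least squares; the paper used Stata, and the data below are invented stand-ins for Table I.

```python
import numpy as np

# Sketch of step T2 (invented data standing in for Table I): regress
# defect-coverage d on test intensity t and four coverage measures, and
# report the resulting R^2(emp).

rng = np.random.default_rng(4)
n = 36                                      # 12 programs x 3 phases
t = np.repeat([1.0, 2.0, 3.0], 12)
C = 0.3 * t[:, None] + rng.normal(0, 0.05, (n, 4))  # block, decision, c-use, p-use
d = 0.25 * t + rng.normal(0, 0.08, n)

X = np.column_stack([np.ones(n), t, C])     # intercept, t, four c measures
beta, *_ = np.linalg.lstsq(X, d, rcond=None)
resid = d - X @ beta
r2_emp = 1 - float(resid @ resid) / float(((d - d.mean()) ** 2).sum())
print(0.0 <= r2_emp <= 1.0)  # True
```

With only 36 observations and five regressors, a high R^2 on its own is weak evidence, which is precisely why the simulated H0 distribution in step T3 is needed.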
Fig. 5. Distribution of Generated R²-Values with Median and 95%-Quantile
TABLE VII: REGRESSION SUMMARY FOR MULTIVARIATE LINEAR REGRESSION WITH EMPIRICAL SAMPLE

Moreover, the low p-values indicate that the various TC criteria complement each other with respect to their predictive power on DC.

Step T3) Perform Statistical Test
To reject H0 at α = 5%, the empirical R² must be greater than the 95%-quantile of the test distribution (see Fig. 5). With R² below the 95%-quantile, H0 cannot be rejected. For this data set, the results do not support the claim that TC has an important, additional impact on DC when the effect of the number of test cases is already accounted for. If this data-set is representative of real projects, then nothing in the results would suggest that test case generation would be improved by following a TC-driven strategy.

One of the preconditions of using our procedure is that TC is not the main driver of the testing process (e.g., of the design of test cases). When this is not true, TC cannot really be distinguished from the number of test cases executed. This can be easily checked by looking at the relationship between the number of test cases executed and the increase in TC.

V. DISCUSSION

The main issue in this paper is that a relationship between TC and DC does not necessarily mean there is a causal relationship. It is plausible to assume that both TC and DC are driven by the amount of testing (e.g., the number of test cases). This would then lead to an empirical relationship between test-coverage and defect-coverage measures. Concluding on the existence of a causal relationship between TC and DC would have important practical consequences, because it would suggest that testing should be driven by such TC measures. It is therefore important to test whether TC really impacts DC when the effect of the number of test cases is accounted for. This paper shows an appropriate procedure and in what circumstances it should be used.

The data-set did not suggest that any of the TC measures, even when used together, has any additional effect on DC when the number of test cases was already accounted for. Of course, such a result is anticipated to vary across environments, depending on the distribution of defects, the type of defects, etc. This result sheds new light on the conclusions of the original study [13], where the authors suspected that "there's a correlation between the number of faults detected in a version and the coverage of its program constructs." Based on their data, our procedure provides an accurate answer to that question.

The relationship between TC, DC, testability, and the amount of testing (or its various definitions) is complicated. However, it needs to be modeled and tested in environments where early testing phases need to be controlled. In particular, we believe that testability might explain the large variations observed in the relationship between defect-coverage and test-coverage. Our future work includes the development of a set of case studies and experiments to study these complex relationships and find optimal ways to model them.

REFERENCES

[1] V. Basili and R. Selby, "Comparing the effectiveness of software testing strategies," IEEE Trans. Software Engineering, vol. 13, no. 12, pp. 1278–1296, 1987.
[2] B. Beizer, Software Testing Techniques. Van Nostrand Reinhold, 1990.
[3] A. Bertolino and M. Marré, "How many paths are needed for branch testing?," J. Systems and Software, vol. 35, no. 2, pp. 95–106, 1996.
[4] BestFit: Probability Distribution Fitting for Windows, User's Guide, Palisade Corp., 1997.
[5] F. Del Frate, P. Garg, A. Mathur, and A. Pasquini, "On the correlation between code coverage and software reliability," in Proc. 4th Int'l. Symp. Software Reliability Engineering (ISSRE), 1995, pp. 124–132.
[6] L. Foreman and S. Zweben, "A study of the effectiveness of control and data flow testing strategies," J. Systems and Software, vol. 21, no. 3, pp. 213–228, 1993.
[7] P. G. Frankl and S. N. Weiss, "An experimental comparison of the effectiveness of branch testing and data flow testing," IEEE Trans. Software Engineering, vol. 19, no. 8, pp. 774–787, 1993.
[8] P. G. Frankl, S. N. Weiss, and C. Hu, "All-uses vs mutation testing: An experimental comparison of effectiveness," J. Systems and Software, vol. 38, no. 3, pp. 235–253, 1997.
[9] P. G. Frankl and E. J. Weyuker, "An applicable family of data flow testing criteria," IEEE Trans. Software Engineering, vol. 14, no. 10, pp. 1483–1498, 1988.
[10] P. G. Frankl and E. J. Weyuker, "Provable improvements on branch testing," IEEE Trans. Software Engineering, vol. 19, no. 10, pp. 962–975, 1993.
[11] L. C. Hamilton, Statistics with Stata 5. Duxbury Press, 1998.
[12] B. Haworth, "Adequacy criteria for object testing," in Proc. 2nd Int'l. Software Quality Week, Belgium, Nov. 1998.
[13] J. R. Horgan, S. London, and M. R. Lyu, "Achieving software quality with testing coverage measures," IEEE Computer, pp. 60–69, Sept. 1994.
[14] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand, "Experiments on the effectiveness of dataflow- and controlflow-based test adequacy criteria," in Proc. 16th Int'l. Conf. Software Engineering (ICSE), 1994, pp. 191–200.
[15] R. Jacoby and K. Masuzawa, "Test coverage dependent software reliability estimation by the HGD model," in Proc. 1st Int'l. Symp. Software Reliability Engineering (ISSRE), 1992, pp. 193–204.
[16] Z. Jin and J. Offutt, "Coupling-based criteria for integration testing," Software Testing, Verification and Reliability, vol. 8, no. 3, pp. 133–154, 1998.
[17] N. Li, Y. K. Malaiya, and J. Denton, "Estimating the number of defects: A simple and intuitive approach," in Proc. 7th Int'l. Symp. Software Reliability Engineering (ISSRE), 1998, pp. 307–315.
[18] M. R. Lyu, J. R. Horgan, and S. London, "A coverage analysis tool for the effectiveness of software testing," in Proc. 2nd Int'l. Symp. Software Reliability Engineering (ISSRE), 1993, pp. 25–34.
[19] Y. K. Malaiya, N. Li, J. Bieman, et al., "The relationship between test coverage and reliability," in Proc. 3rd Int'l. Symp. Software Reliability Engineering (ISSRE), 1994, pp. 186–195.
[20] Microsoft Excel 97, User's Guide, Microsoft Corp., 1997.
[21] J. A. Morgan and G. J. Knafl, "Residual fault density prediction using regression methods," in Proc. 5th Int'l. Symp. Software Reliability Engineering (ISSRE), 1996, pp. 87–92.
[22] S. C. Ntafos, "A comparison of some structural testing strategies," IEEE Trans. Software Engineering, vol. 14, no. 6, pp. 868–874, 1988.
[23] P. Piwowarski, M. Ohba, and J. Caruso, "Coverage measurement experience during function test," in Proc. 15th Int'l. Conf. Software Engineering (ICSE), 1993, pp. 287–301.
[24] S. Rapps and E. J. Weyuker, "Selecting software test data using data flow information," IEEE Trans. Software Engineering, vol. 11, no. 4, pp. 367–375, 1985.
[25] RISKView, The Distribution Viewing Companion, User's Guide, Palisade Corp., 1996.
[26] A. Veevers and A. C. Marshall, "A relationship between software coverage metrics and reliability," Software Testing, Verification and Reliability, vol. 4, no. 1, pp. 3–8, 1994.
[27] D. Vose, Quantitative Risk Analysis: A Guide to Monte Carlo Simulation Modeling. John Wiley & Sons, 1996.
[28] W. L. Winston, Simulation Modeling Using @Risk. Duxbury Press, 1996.
[29] E. J. Weyuker, "More experience with data flow testing," IEEE Trans. Software Engineering, vol. 19, no. 9, pp. 912–919, 1993.
[30] W. E. Wong, J. R. Horgan, S. London, and A. P. Mathur, "Effect of test set size and block coverage on the fault detection effectiveness," in Proc. 3rd Int'l. Symp. Software Reliability Engineering (ISSRE), 1994, pp. 230–238.

Lionel C. Briand is with the Department of Systems and Computer Engineering, Carleton University, where he is an Associate Professor. Before that, he was a department head at the Fraunhofer Institute for Experimental Software Engineering, Germany, and a Software Engineering group leader at the Computer Research Institute of Montreal (CRIM), Canada. He worked as a research scientist for the Software Engineering Laboratory, a consortium of the NASA Goddard Space Flight Center, CSC, and the University of Maryland. He has been on the program, steering, or organization committees of many international IEEE conferences, such as ICSE, ICSM, ISSRE, and METRICS, and is on the editorial board of Empirical Software Engineering: An International Journal (Kluwer). His research interests include software testing and inspections, object-oriented software development, and quantitative methods applied to software quality engineering.

Dietmar Pfahl studied applied mathematics, software engineering, and economics at the University of Ulm, Germany, and the University of Southern California, USA. He received his M.Sc. in Applied Mathematics and Economics from the University of Ulm. From 1987 to 1996, he was a research staff member and software engineering consultant with two corporate research divisions of Siemens AG, Germany. This affiliation was complemented by a one-year stay as a junior scientist at the German Aerospace Research Establishment (DLR) in Oberpfaffenhofen. Since 1996, he has been with the Fraunhofer Institute for Experimental Software Engineering (IESE), where he works as a project engineer in various national and international research and transfer projects with the software industry. His research interests include software process simulation and quantitative methods applied to software project management.