COMM83 SOFTWARE PRODUCTION MEASUREMENT
1. Software Metrics Data Collection
In this handout I will outline what constitutes good data and provide guidelines on how data collection assists decision making.
1.1 What is Good Data
Even when we have a well-set-out measure that maps a real-world attribute to a formal, relational system in a relevant way, we need to consider a number of questions about the data collected:
Are they correct? Correctness means that the data was collected based on the exact
requirements of the metric.
Are they accurate? Accuracy relates to the difference between the data and actual value.
Time measured by an analogue clock may be less accurate than a digital one.
Are they appropriately precise? Precision relates to the number of decimal places required to express the data. It is sufficient, for example, to express the length of a project in days.
Are they consistent? Data needs to be consistent from one measuring device and/or
person to the next, without large discrepancies.
Are they associated with a specific activity or time period? If so, the data should be time-stamped, so we know exactly when they were collected.
Can they be replicated? Unless the data collection can be replicated by others, amalgamation of results is not possible.
1.2 How to Define the Data
There are typically two types of data in which we are interested. There is the raw data resulting from the initial measurement of processes, products and resources, and there is refined data extracted from the raw data to give the analyst values for particular attributes. To illustrate the difference: for programmer effort, the raw data is a weekly time sheet outlining the hours worked on different parts of the project, while the refined data for the effort spent on design would be the sum of the design-related activities. Deciding what to measure is critical. We must state the direct measures and the indirect measures that can be derived from the direct ones.
1.2.1 The problem with problems
No software engineer produces perfect software the first time. Thus it is critical for
developers to measure components of software quality. Such information is used to determine:
The number of problems in the software.
The efficiency of the prevention, detection and removal processes.
Whether the product can be given to the customer or moved to the next stage of production.
How the product compares with previous or competing projects.
The terminology used to support this investigation and analysis needs to be precise, enabling us to comprehend cause and effect in quality assessment and improvement efforts. A fault occurs when a human error causes a mistake in some software product.
A failure is the departure of the system from the required behaviour. One of the
problems with problems is that the terminology is not consistent. If an organization
measures quality based on faults per thousand lines of code it is impossible to compare
this with another company if the two do not agree on what is meant by a fault. The software
engineering literature is full of different meanings for the same term. Below are just a few examples:
To many organizations, errors often mean faults.
Defects typically refer to faults and failures.
Bugs are faults in the code.
Crashes are special kinds of failures that stop the system functioning.
Until terminology is consistent across the software engineering community it is important
to clearly set out what you mean by the term, to assist all who supply, collect, analyse and
use the data.
A failure report concentrates on the external difficulties of the system: the installation, the
chain of events causing the failure, the impact on the user, and the cost to the user and the
developer. The normal failure report includes eight attributes:
Location: is normally a code that represents the installation and platform on which the
failure was identified.
Timing: real-time of the occurrence and execution time up to the occurrence.
Symptom: states what was observed, which is different from the end result, which measures the consequence.
End result: refers to the consequence of the failure.
Mechanism: describes how the failure happened. This application-dependent classification sets out the causal sequence leading from the source to the symptoms finally identified.
Cause: involves the trigger and the type of source. The trigger can be various things, such as a hardware failure, operating conditions, user error or an erroneous report. The source can be a hardware fault, a design fault or a usability problem.
Severity: describes how serious the failure is. For a safety-critical system, severity can be:
Catastrophic: failures involving loss of life or major injury.
Critical: failures involving serious injury to a single person.
Significant: failures involving light injuries.
Minor: failures involving no injury or reduction in the level of safety.
Cost: the amount of effort and resources needed to determine and respond to the failure.
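As an illustration only, the sketch below shows one way the eight failure-report attributes could be captured as a structured record in Python; the field names and types are hypothetical, not taken from any standard.

    from dataclasses import dataclass
    from datetime import datetime

    # Hypothetical record for a failure report; fields mirror the eight
    # attributes described above.
    @dataclass
    class FailureReport:
        location: str             # code for the installation and platform
        occurred_at: datetime     # real time of the occurrence
        exec_time_hours: float    # execution time up to the occurrence
        symptom: str              # what was observed
        end_result: str           # consequence of the failure
        mechanism: str            # causal sequence from source to symptom
        cause: str                # trigger and type of source
        severity: str             # catastrophic/critical/significant/minor
        cost_person_hours: float  # effort to determine and respond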
A failure relates to the user's view of the system; a fault is identified only by the developer. A fault report is arranged in the same way as the failure report but gives very different answers to the same questions. It concentrates on the internals of the system, considering the specific module where the fault occurred and the cost to locate and fix it.
In a fault report, location states the product or part of the product that contains the fault. The IEEE Standard Classification for Software Anomalies offers a high-level classification that can be used to report on location:
Plans and procedures
Often more detail is required to determine exactly where the fault happened. The IEEE standard can help with this: the high-level classes can be divided into more detailed ones.
For instance, specification can be made up of requirements, functionality, preliminary
design, detailed design, product design, interface, database and implementation.
Timing relates to the three events that set out the life of the fault:
When the fault is produced;
When the fault is identified;
When the fault is corrected.
The symptom classifies what is identified during diagnosis or inspection.
Mechanism outlines how the fault was produced, detected and corrected. Production states the kinds of operations that were being performed when the fault was created. Detection classifies how the fault was found, and correction refers to how the fault was removed.
Cause refers to the human error that led to the fault. The cause might be outlined using the classification identified by Collofello and Balcom:
Communication: imperfect transfer of information.
Clerical: typing or editing error.
Severity identifies the impact of the fault on the user.
The cost explains the total cost of the fault to the system provider.
Once a failure is suffered and identified, the problem is fixed via one or more alterations.
Change reports are used to outline the changes and track the products most influenced by
them. The location identifies the product, subsystem, element, module and subroutine
influenced by a particular change. Timing gives when the change occurred, and end result records how successful the change was. The cause entry of the change report gives the reason for the change: corrective, adaptive, preventive or perfective maintenance. The cost is what the change costs the developer.
1.3 How to Collect Data
The collection of data needs human observation and reporting. Managers, software producers, analysts and users must record raw data on forms. Manual recording is
subject to bias, error, omissions and delay. Automatic data capture is desirable, as in the recording of the execution time of real-time software. Nevertheless, manual data capture is typically the only option, and so we must plan the capture beforehand. This planning should ensure that:
Procedures are simple.
Unnecessary recording is avoided.
Staff have sufficient training to record data and follow the process.
Results of capture and analysis are provided in a timely fashion.
Data collected is validated at a central collection point.
Planning for data collection requires various stages. First there must be a decision as to
the products to measure. You may need to measure various products that are used
together or you may measure one part of a larger system. The next step is to ensure that
the product is under configuration control. We need to know the version of each product
that is being measured. Once we have decided the metrics to use and the components to
be measured, there is a need to create a scheme to establish all entities involved in the
measurement procedure. Finally, there is a need to establish processes for handling the
forms, carrying out the analysis and reporting the results.
The data collection form encourages the collection of good quality data. The form
includes the data needed for analysis and feedback. The form should enable both fixed
length and free-format data to be included. Boxes and separators should be used to enforce formats for dates, identifiers and other standard values.
1.4 When to Collect Data
It is clear that data-collection planning must start as soon as project planning begins, and careful form design and management are required to support good measurement. The
data collection occurs over many phases of development. As there are various kinds of
inspections, there are many inspection-associated measurement activities. For instance,
inspection-associated fault information is recorded after a high-level design, low-level
design and coding of each subsystem or module. Data related to faults gained by
inspections and tests need to be recorded consistently so the effectiveness of the activities
is established. Normally data should be collected at the start of the project to gain initial
values and then again to reflect the activities that have occurred. It is critical that data collection operations are part of the regular development procedure: if data collection is treated as an extra task outside the normal process, it will not get done. It is helpful to compare a model of the
normal development procedure with a list of desired measurements, and map the
measurements to the process model.
1.5 How to Store and Extract Data
Raw software-engineering data should be kept in a database, built using a database management system (DBMS). A DBMS has many benefits over paper records and computer-stored "flat" files. Its languages can be used to define data structures, to insert, alter and remove data, and to extract refined data. Constraints, such as checks on cross-references among records, can be defined to ensure the consistency of the data.
Once the database is designed and populated with data there is a need to take advantage
of the data's structure when extracting the data for analysis. Suppose we have reliability data and want to measure the reliability of a single baseline version of an individual product over all installations. We follow these steps:
1 Select each incident that cross-references the particular product version.
2 Group the resulting incident records by the fault to which they refer, and sort each group by time of occurrence.
3 Remove all except the first incident in each group.
4 Count the remaining incident records within each time period.
5 Sum the product use recorded in the product-version installation sessions, for all sessions in each period.
The outcome is a list of pairs of numbers: a count of faults first detected, and a measure
of the total use of the given product version, in each calendar period.
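As a concrete illustration, the Python sketch below implements the five steps using SQLite. The schema (the incident and session tables and their columns) is invented for the example; a real fault database would have its own design.

    import sqlite3

    # Minimal sketch of the five extraction steps; the schema is hypothetical.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE incident(fault_id TEXT, product_version TEXT,
                              occurred_at TEXT, period TEXT);
        CREATE TABLE session(product_version TEXT, period TEXT,
                             usage_hours REAL);
    """)

    version = "2.1"   # the baseline product version of interest

    # Steps 1-4: incidents for this version (step 1), grouped by fault and
    # reduced to the earliest occurrence (steps 2-3), counted per period (4).
    faults_per_period = db.execute("""
        SELECT period, COUNT(*) FROM (
            SELECT fault_id, period, MIN(occurred_at)
            FROM incident
            WHERE product_version = ?
            GROUP BY fault_id)
        GROUP BY period
    """, (version,)).fetchall()

    # Step 5: total product use over all installation sessions in each period.
    use_per_period = db.execute("""
        SELECT period, SUM(usage_hours) FROM session
        WHERE product_version = ? GROUP BY period
    """, (version,)).fetchall()

Pairing the two lists period by period yields the (fault count, total use) pairs described above.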
2. Empirical Investigation
Software engineers have many questions to deal with. Testers want to determine the best technique for finding faults, managers the kinds of skills that produce the best programmers, and designers the model that best predicts reliability. A software practitioner evaluating a technique, method or tool can use three approaches: surveys, case studies and formal experiments.
2.1 Four Principles of Investigation
If a project manager wishes to use a new tool, technique or method and decides to investigate it in a scientific manner, this can be done using a survey, a case study or a formal experiment. This handout examines these kinds of investigations and gives examples of situations in which each may be relevant.
2.1.1 Choosing an investigation technique
A survey is a retrospective study of the situation to try to document associations and
results. A survey happens after the event has occurred. Software engineering surveys are
used to poll a set of data from an event that has happened to establish how the population
reacted to a specific method, tool or technique, or to establish trends or relations. When
performing a survey, you have no control over the situation at hand. A case study is a
research technique that involves determining key features that may influence the outcome
of an activity: its inputs, restrictions, resources and outputs. By contrast, a formal
experiment is a rigorous, controlled investigation of activity, where key factors are
identified and controlled to document their effects on outcome.
Kitchenham, Pickard and Pfleeger note that the differences between the research methods are also reflected in their scale. Because formal experiments need a large amount of control, they are normally small, involving small numbers of people or events. Case studies consider a typical project, rather than trying to capture information on all possible situations, while surveys attempt to poll what is happening broadly over many projects.
Several guidelines exist to help decide if a survey, case study or formal experiment is
most suitable. If an activity has happened you must use a case study or survey. If the
activity is yet to happen, you can choose between a case study and a formal experiment. A central factor is the level of control required for a formal experiment. If you have a great deal of control over the variables that influence the outcome, then you can consider an experiment; if you do not have this control, then a case study is the better technique. If it is possible to control the variables but difficult to do so, because of the cost or risk involved, these factors should be weighed up.
Another key factor to consider is the extent to which you can replicate the basic scenario you are examining. Suppose you wish to investigate the impact of language on the software product. If it is not possible to produce the same product using different languages, formal experimentation is not possible. However, even when replication is possible, the cost may prevent it. These concerns are summarized in the table below. For instance, if the cost of replication is low, then an experiment is the appropriate technique.
Table: factors relating to the choice of research technique

Factor                     Experiment    Case study
Level of control           High          Low
Difficulty of control      Low           High
Level of replication       High          Low
Cost of replication        Low           High
A formal experiment is good for examining the performance of a specific, self-contained task. On the other hand, case studies may be better than formal experiments if the process changes caused by the independent variables are wide-ranging, so that the effects must be measured at a high level and across too many dependent variables to control and measure.
2.1.2 Stating the hypothesis
Before deciding on the research technique it is important to decide on what you are
investigating. The goal for the research can be stated as a hypothesis you want to test.
You must state what you want to know. The hypothesis is the theory or belief that you feel explains the observed behaviour. The hypothesis might be "using JSD produces better quality software than SSADM." Once this hypothesis is stated, you must decide whether to assess what occurred when groups used each of these methods (case studies), to evaluate a "snapshot" of your organization using JSD (a case study), or to perform a carefully controlled comparison of those using JSD and those using SSADM (a formal experiment). The data is then collected to support or refute the hypothesis you stated.
2.1.3 Maintaining control over variables
Once a hypothesis is devised, there is a need to decide what variables can influence its
truth. For each variable determined you must establish the amount of control you have
over it. A case study is most appropriate when examining events where actions cannot be
manipulated. The difference between case studies and formal experiments can be stated
by looking at state variables. A state variable is a factor that characterizes the project and affects the evaluation results. State variables are often called independent variables, as they can be altered to influence the outcome; the result is evidenced by the values held by the dependent variables. In a case study you sample from the state variables, whereas in a formal experiment you sample over them. That is, in a case study you take a value of each state variable that is typical for the organization and its projects. In a formal experiment, a state variable is used to distinguish the control context from the experimental one; when it is not possible to distinguish control from experiment, a case study should be used.
2.1.4 Making your investigation meaningful
There are various areas of software engineering that can be examined using surveys, case
studies and experiments. One main motivator for using formal experiments over case
studies or surveys is that the results are normally more generalizable. If a case study or survey is used, the outcomes are particular to the organization examined. However, a carefully controlled formal experiment that contrasts different values of the controlled variables is applicable to a wider community.
Many techniques and methods are used as “conventional wisdom” states that they are the
most appropriate. However, there is little quantitative evidence to support claims about
many commonly used tools and methods. Case studies and surveys can confirm these
claims for a single organization, and formal experiments can establish the contexts in which these claims hold.
Software producers are interested in the associations between various characteristics of
resources and software products. For example:
How does the team's experience with an application domain influence the quality of the product?
How does the design structure influence the maintainability of the code?
An association can be identified using a case study or survey. For example, a survey may find that software written in JAVA contains fewer errors than software produced using C++.
Understanding and verifying these associations is critical for better software projects.
Each association can be stated as a hypothesis and a formal experiment produced to test
the truth of the association.
Models are constantly used to predict the result of an action or guide the use of a method
or tool. Models are problematic when designing an experiment or case study, as their
predictions influence the results. The predictions become aims, and the developers
attempt to achieve these aims, intentionally or not. For this reason, experiments evaluating models should be designed as "double-blind" experiments, where those involved do not know the predictions until the experiment is over.
2.2 Planning Formal Experiments
If it is decided to evaluate the tool, method or approach using a formal experiment, this should be done with careful planning if the results are to be meaningful and useful.
2.2.1 Procedures for performing experiments
There are several steps to carrying out a formal experiment.
2.2.1.1 Conception
The first stage is to decide what you want to learn and to outline the objective of the experiment. Whatever the objective is, it needs to be stated in a way that can be evaluated at the end of the experiment.
2.2.1.2 Design
Once the objective is outlined, it needs to be restated as a hypothesis. Normally there are two hypotheses: the null hypothesis and the experimental hypothesis. The null hypothesis states that there is no real difference between the two treatments; the experimental hypothesis states that there is a difference. The experiment will include a set of tests of the method or tool, and the experimental design should outline these tests in an organized manner. There should also be an outline of the people involved in the experiment, known as the experimental subjects. The number of, and the associations between, subjects, objects and variables need to be set out in the experimental plan: the more subjects, objects and variables there are, the more complex the design and the more difficult the outcomes are to analyse.
2.2.1.3 Preparation
Preparation involves readying the subjects for the application of the treatment. For instance, preparation for the experiment might include buying tools, training staff or configuring the development environment.
2.2.1.4 Execution
Finally, the experiment can be carried out. You must ensure that items are measured and treatments applied in a consistent manner.
2.2.1.5 Analysis
The analysis stage has two components. First, the measurements must be reviewed to ensure that they are valid and useful. Second, the data must be analysed using statistical techniques.
2.2.1.6 Dissemination and decision making
At the end of the analysis phase conclusions will be produced about the features that were
examined and how they influence the outcome. It is important to document the
conclusions so that the experiment can be carried out again and the results confirmed. The experiment's results should support decisions on how to produce software in the
future, suggest changes and recommend future experiments.
2.2.2 Principles of experimental design
Useful results rely on careful, rigorous and complete experimental design. In this section
there is an examination of principles that must be considered when designing
experiments. There is a need to achieve simplicity and the maximization of information.
Involved in experimentation are two important concepts: experimental units and
experimental errors. An experimental unit is the experimental object that has a single
treatment performed on it. Experimental error describes the failure of two identically
treated experimental units to produce the same outcomes. The error comes from:
Errors of experiment
Errors of observation
Errors of measurement
Variation in experimental resources.
The three main principles examined below address the difficulty of variability by
providing assistance on producing experimental units to reduce experimental error.
2.2.2.1 Replication
Replication is the repetition of a task. It involves repeating an experiment under the same conditions, rather than repeating measurements on the same experimental unit. Replication gives an indication of the experimental error, which serves as a basis for assessing the importance of identified differences in the independent variables.
2.2.2.2 Randomization
Replication allows statistical tests of the significance of the outcomes, but it does not by itself ensure the validity of the results. Some element of the experiment must arrange the experimental trials so that the observations are distributed independently. Randomization is the random assignment of subjects to groups, or of treatments to experimental units, in pursuit of this independence. Randomization does not guarantee independence, but it allows us to assume that the correlation in any comparison of treatments is as small as possible.
2.2.2.3 Local control
Local control is the element of experimental design that reflects how much control there is over the placement of subjects in experimental units and the structuring of those units.
Local control ensures that the design is effective by limiting the magnitude of the
experimental error. Local control is normally discussed based on two characteristics of
the design: blocking and balancing the units. Blocking means allocating experimental units to blocks or groups so that the units within a block are relatively homogeneous. Balancing is the blocking and assigning of treatments so that an equal number of subjects is assigned to each treatment. Balancing is desirable because it simplifies the statistical analysis.
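These principles can be sketched in a few lines of Python: the subjects below (invented names and experience levels) are blocked by experience, then randomly and evenly assigned to two treatments within each block.

    import random

    # Invented subject pool: name -> years of experience.
    subjects = {"ann": 2, "bob": 2, "carol": 5, "dave": 5, "eve": 9, "fred": 9}

    # Blocking: group subjects so that each block is relatively homogeneous.
    blocks = {}
    for name, years in subjects.items():
        blocks.setdefault(years, []).append(name)

    # Randomization with balancing: shuffle each block, then split it evenly
    # between the two treatments.
    assignment = {}
    for members in blocks.values():
        random.shuffle(members)
        half = len(members) // 2
        for name in members[:half]:
            assignment[name] = "treatment A"
        for name in members[half:]:
            assignment[name] = "treatment B"

    print(assignment)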
2.2.3 Types of experimental designs
There are various kinds of experimental designs. It is helpful to know the different designs that can be used in software engineering, since the kind of design restricts the kind of analysis that can be carried out, and therefore the kinds of conclusions that can be drawn. Most designs in software engineering research are based on two basic associations between factors: crossing and nesting.
The design of an experiment can be stated in a notation that matches the number of
factors and how they relate to diverse treatments. Two factors, A and B, in a design are
crossed if each level of each factor appears with each level of the other factor. This
association is denoted A x B. A design with two levels of factor A and three levels of factor B is shown in the table below, where ai denotes the levels of factor A and bj the levels of factor B.

              Factor B
              Level 1    Level 2    Level 3
Factor A
  Level 1     a1 b1      a1 b2      a1 b3
  Level 2     a2 b1      a2 b2      a2 b3
Factor B is nested within factor A if each meaningful level of B occurs in conjunction with only one level of factor A. The association is written B(A), where B is the nested factor and A is the nesting factor. A two-factor nested design is depicted in the table below, with two levels of factor A and three levels of factor B. Now B is dependent on A, and each level of B occurs with only one level of A; that is, B is nested within A.

                    Factor A
        Level 1                            Level 2
        Factor B                           Factor B
  Level 1   Level 2   Level 3        Level 1   Level 2   Level 3
  a1 b1     a1 b2     a1 b3          a2 b1     a2 b2     a2 b3
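The difference between the two designs can be made concrete by enumerating their treatment combinations, as in the Python sketch below; the level names are only labels.

    from itertools import product

    A = ["a1", "a2"]          # levels of factor A
    B = ["b1", "b2", "b3"]    # levels of factor B

    # Crossed design A x B: every level of A occurs with every level of B.
    crossed = list(product(A, B))   # 2 x 3 = 6 treatment combinations

    # Nested design B(A): each level of B occurs with only one level of A,
    # so the B levels are defined separately for each level of A.
    nested = {"a1": ["b1", "b2", "b3"],
              "a2": ["b4", "b5", "b6"]}   # distinct units under each A level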
2.2.4 Selecting an experimental design
There are many choices on how to design an experiment. The ultimate choice relies on
two factors: the goals of the examination and the availability of resources. The rest of
this section explains how to determine which design is right for the context.
2.2.4.1 Choosing the number of factors
Many experiments relate to a single variable or factor. One-variable experiments are
fairly simple to examine, since the impact of the single factor is isolated from other
variables that might influence the outcome. However, it is not always possible to remove
the impact of other variables. Instead, we attempt to reduce the impact or at least
distribute the effects equally across all the possible conditions we are examining. There is much more information in a two-factor experiment than in two one-factor experiments: the two-factor experiment shows the association between the factors as well as the single-factor results. When considering whether to use one factor or more than one, you must decide what type of comparison is wanted. If you are simply comparing a set of competing treatments, you can use a single-factor experiment.
2.2.4.2 Factors versus blocks
Once the number of factors has been decided upon, you must establish how to use blocking to improve experimental precision. It can be difficult to decide whether a variable should be a block or a factor. In many experiments, we feel that the experience of subjects will influence the outcome. One approach in experimental design is to treat experience as a blocking factor: we match staff with similar experience and allocate staff randomly to the different treatments, ensuring that each block contains at least two subjects with the same experience. If instead experience is treated as a factor, we must define levels of experience and randomly allocate the subjects at each level to the alternative levels of the other factors. Which approach (factor or block) is best depends on the hypothesis. If, for example, we want to know whether design A is better than design B, experience should be a blocking factor. However, if we want to know whether the effects of designs A and B are influenced by experience, experience should be a factor.
2.2.4.3 Fixed and random effects
Some factors allow us to have total control over them. For instance, we can control the
language used to develop the system, or the processor the system is produced on. But
other factors are more difficult to control such as staff experience. The amount of control
over factor levels is a critical consideration in selecting an experimental design. A fixed-
effects model has factor levels or blocks that are controlled. A random-effects model has
factor levels or blocks that are random samples from a set of values. The difference
between fixed- and random-effects models influences the way the resulting data is analysed. For completely randomized experiments, there is no difference in the analysis, but in more complex designs the difference affects the statistical methods needed to assess the results.
2.2.4.4 Matched- or same-subject designs
Sometimes, economy or practicality prevents us from using different subjects for each treatment in the experimental design. We can use the same subjects for different treatments, or match the subjects on their characteristics, in order to limit the scale and cost of the
experiments. For instance, the same programmer could use tool A in one context and
then tool B in another. Hence, when designing the experiment there is a decision as to
the number and type of subject to use. With experiments that include one factor, you can
consider testing the levels of the factor with the same subjects or with different subjects.
For two or more variables, you can examine the question of same-or-different separately
for each variable.
2.2.4.5 Repeated measurement
In many experiments, one measure is produced for each item of interest. However, it can be useful to repeat measurements in particular situations: repeating a measure can assist validation and help to assess errors associated with the measurement process.
2.3 Planning Case Studies
When carrying out a case study many of the issues are the same as those involved in
experiments. Below is a consideration of some of the differences and the steps to follow. A case study typically compares one situation with another: the outcomes from one tool compared with another, say. To prevent bias and to make sure that you are testing the association you hypothesize, the study can be organized in one of three ways: sister project, baseline, or random selection.
2.3.1 Sister project
Suppose an organization wants to modify the way it carries out code inspections. To perform a case study, you select two projects, called sister projects, each of which is typical of the organization and has similar values for the state variables being measured. Then you perform inspections using the current approach on the first project and the new approach on the second. By selecting projects that are as similar as possible, you control as many of the state variables as possible.
2.3.2 Baseline
If you cannot find two projects that are close enough to be sister projects, you can compare the new inspection technique with a general baseline. In this approach, information is gathered from a number of projects and the average situation for the company is established. The case study involves completing a project with the new inspection technique and then comparing it with the baseline.
2.3.3 Random selection
It is often possible to split a project into sections, where one section uses the new
technique while the other does not. Such a case study resembles a formal experiment, in that randomization and replication can be used in performing the analysis. This type of case study is useful when the method being studied can take various values.
3. Analysing Software-Measurement Data
Data analysis involves various operations and assumptions:
We have various measurements of one or more attributes from different software entities.
The set of measurements is a data set or a batch.
We anticipate that the software items are comparable in some way. For instance, we can compare modules from the same software product by looking at the differences and similarities in the data, or we can compare various projects carried out by the same company to establish whether lessons can be learned about quality or productivity.
3.2 Analysing the Results of Experiments
After the measurement data has been collected it must be analysed in the relevant
fashion. This section describes the items that need to be considered in selecting the
analysis approach. There is also an examination of the situations in which you may wish to perform an experiment, and of the technique that is most appropriate for a given situation.
3.2.1 Nature of Data
To examine data we must look at the larger population depicted by the data, as well as the
distribution of that data.
3.2.1.1 Sampling, population and data distribution
The nature of the data assists with determining the analysis techniques that are available.
It is critical for you to comprehend the data as a sample of a larger population of all the
data you can gain. You are using a fairly small sample to generalize on a larger
population, so the characteristics of the population are critical.
From the sample data, you must determine if the measurement differences are the result
of the independent variables, or are obtained through pure chance. Care must be taken
when differentiating what is seen in the sample from what we infer about the full population. Sample statistics summarise measures produced on a finite group of subjects, while population parameters depict the values that would be obtained if all possible subjects were measured. We can describe a population or sample by considering the central tendency (mean, median and mode) and measures of dispersion (such as variance and standard deviation). These tell us how the data is distributed across the population or sample. Many sets of data have a normal distribution, with a bell-shaped curve. By definition, the mean, mode and median of a normal distribution are all the same, and almost all of the data (about 99.7%) lies within three standard deviations of the mean.
The example histogram below shows data that can be described as normal, as it resembles a bell-shaped curve.
Figure: data resembling a normal distribution.
There are other distributions where the data is skewed, so that there are more data points on one side of the mean than on the other. There are also distributions that vary radically from the bell-shaped curve. The kind of distribution influences the analysis that can be performed.
Figures: a distribution where the data is skewed to the left, and a non-normal distribution.
3.2.1.2 The distribution of software measurement
Many common statistical operations are not meaningful for measurements that are not on
an interval or ratio scale. Many software measures are ordinal. This scale results from
the wish to categorize and rank. For instance, we may ask customers to state how satisfied they are with our product, or ask designers to assign a quality measure to each requirement before beginning the design stage. Such ordinal scales do not give us interval, ratio or absolute data. Hence, we must choose analysis
techniques that are suitable for the data we have collected. As well as the scales used, we must also consider how the data has been collected. Many statistical techniques are based on the assumption that data sets are made up of measures drawn at random. Even when software measurements are on a ratio scale and drawn at random, the distribution may not be normal.
There are several ways to establish whether the data is normally distributed. A simple one is to compare the mean with the median: if they are roughly equal, the distribution is at least symmetric, as a normal distribution is (a small check is sketched after the list below). Until we know something about the data, we should be careful when selecting techniques that depend on a normal distribution. There are various approaches that can be used without knowing whether the data is normally distributed:
We can use robust statistics and non-parametric methods. Robust statistical methods are
descriptive statistics that are resilient to non-normality. Non-parametric statistical
approaches take into account that the data is not normally distributed. These approaches
often use characteristics of the ranking of the data.
We can try to transform the basic measurement onto a scale on which the measurement conforms more closely to a normal distribution.
We could examine the underlying distribution of the data and use techniques appropriate
to that distribution.
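The sketch below shows two quick checks in Python: comparing the mean with the median, and applying one standard normality test (the Shapiro-Wilk test from SciPy, a common choice rather than the only one). The sample data is invented.

    import numpy as np
    from scipy import stats

    # Invented sample, e.g. module sizes in lines of code.
    data = np.array([12, 15, 18, 22, 25, 30, 41, 55, 90, 240])

    print(np.mean(data), np.median(data))
    # A mean well above the median suggests skewed, non-normal data.

    stat, p = stats.shapiro(data)   # Shapiro-Wilk test of normality
    if p < 0.05:
        print("normality rejected: prefer robust or non-parametric methods")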
3.2.1.3 Statistical inference and hypothesis testing
Statistical inference is the procedure by which we draw conclusions about the population from the sample. Parametric statistical techniques are suitable only when the sample has been selected from a normally distributed population; otherwise we must use non-parametric tests. In both situations the techniques are used to determine whether the samples are a good representation of the population. Suppose we are considering whether programmer productivity increases as a result of training. We can calculate the mean productivity and standard deviation for the sample. Statistical inference then tells us whether the average productivity for any programmer after training would be the same for the full population.
The logic of statistical inference is founded on the two possible outcomes that happen in
any statistical comparison:
The measured differences in the experiment are the result of chance variation in measurement procedures alone.
The measured differences point to real treatment effects of the independent variables.
The first case corresponds to the null hypothesis, under which there is no real change. The null hypothesis is typically written H0. The second case is the alternative hypothesis, written H1. The role of statistical analysis is to establish whether the null hypothesis can be rejected. Rejecting the null hypothesis does not by itself prove the alternative hypothesis; more experimentation may be required for that. Statistical analysis is aimed only at whether we can reject the null hypothesis: in this sense, empirical evidence can disprove a hypothesis but never prove it.
3.2.2 Purpose of the experiment
The two major reasons to perform a formal investigation are: to confirm a theory or to
examine an association.
3.2.2.1 Confirming a theory
An investigation may be designed to examine the truth of a theory. The theory normally states that the use of a particular method, tool or technique has a certain effect on the subject, making it better than another treatment. For instance, you may want to identify the impact of using SSADM compared with an alternative technique. The typical approach in this situation is analysis of variance: you consider two populations, one using the new technique and the other the old, and you use the statistical technique to establish whether the difference between them is statistically significant. That is, you examine the variance in the two sets of data to see if it comes from one population or two.
There are two cases to consider: normal and non-normal data. If the data is normally distributed and two groups are being compared, then Student's t-test is suitable for examining the treatment. When the number of defects per line of code is used to compare development techniques, the data is typically not normally distributed; data on defects can instead be analysed using a ranking approach.
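As an illustration, the Python sketch below compares invented defect densities for two techniques using Student's t-test, and using the Mann-Whitney U test as one widely used rank-based alternative for non-normal data (the handout does not name a specific ranking test).

    from scipy import stats

    # Invented defect densities (defects per KLOC) under two techniques.
    old_technique = [3.1, 2.8, 3.5, 4.0, 2.9, 3.3]
    new_technique = [2.2, 2.6, 1.9, 2.4, 2.8, 2.1]

    t, p_t = stats.ttest_ind(old_technique, new_technique)     # assumes normality
    u, p_u = stats.mannwhitneyu(old_technique, new_technique)  # rank-based

    print(p_t, p_u)   # small p-values suggest a real treatment difference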
3.2.2.2 Exploring a relationship
Often analysis is used to establish a relationship between data points describing one or
multiple variables. There are three techniques to answer questions about relationships:
box plots, scatter plots and correlation.
Box plots are useful for comparing various data sets and describing the arrangement of
data by displaying the middle 50% of the data, the skew, range and any outliers. A
scatter plot represents the association between two variables. By observing the relative
location of pairs of data points it is possible to judge the likelihood of an association between the data. Correlation analysis goes further than a scatter plot by using statistical techniques to establish whether there is a true relationship between two attributes. Correlation analysis can be done either by creating measures of association that indicate the closeness of the behaviour of two variables, or by producing an equation that describes that behaviour.
When a measure of association is enough, it is critical to establish whether or not the data is normally distributed. When the data is normally distributed, the Pearson correlation coefficient is a measure of association that states whether two variables are highly correlated. For non-normally distributed data, you must rank the data and use the Spearman rank correlation coefficient as a measure of association. A further technique for non-normal data is the Kendall robust correlation coefficient, which examines the association between pairs of data points and can identify partial correlations.
When considering the nature of the association, you can use linear regression to produce an equation outlining the association between the two variables you are examining. For more than two variables, multivariate regression is appropriate.
3.2.3 Design Considerations
The investigation design must be considered when selecting the analysis techniques. At the same time, the complexity of the analysis can affect the design selected. Comparisons of multiple groups normally need the F statistic instead of the simple Student t-test used with two groups.
3.2.4 Tables of Suitable Techniques
To help decide which analysis techniques are appropriate, the following summary was produced.

Confirming a theory
  2 groups                       Student t-test
  > 2 groups                     F statistic

Exploring a relationship
  Baseline                       Box plot
  Statistical confirmation of association with correlation:
    Normal data                  Pearson
    Non-normal data              Kendall
  Equation:
    Normal data                  Linear regression
3.3 Examples of Simple Analysis Techniques
There are many robust techniques that can be used with software data, regardless of the underlying distribution of the data.
3.3.1 Box plots
As can be seen from the figure below, the box plot is made up of the median and a central box containing the middle 50% of the readings. The length of this box is the interquartile range (IQR), and its ends are the upper and lower hinges. The upper hinge is the median of the readings above the median, and the lower hinge is the median of the points below the median. The outliers are the readings that lie more than 1.5 times the IQR above the upper hinge or below the lower hinge, while the lower and upper fences are the last points that are not outliers. The box plot can be used to tell us whether the data is skewed, by looking at the positions of the median, the quartiles and the tails. If the data is symmetrical around the median, the median will be in the centre of the box and the tails will be of similar length.
Figure: example box plot, showing the lower fence, lower hinge, median, upper hinge and upper fence.
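The quantities above can be computed directly, as in the Python sketch below; the data is invented, and the hinge calculation (medians of the inclusive halves) is one common convention.

    import numpy as np

    data = np.sort(np.array([1, 2, 2, 3, 4, 4, 5, 6, 7, 9, 25]))  # invented

    median = np.median(data)
    lower_hinge = np.median(data[data <= median])   # median of lower half
    upper_hinge = np.median(data[data >= median])   # median of upper half
    iqr = upper_hinge - lower_hinge                 # length of the box

    high = upper_hinge + 1.5 * iqr
    low = lower_hinge - 1.5 * iqr
    outliers = data[(data > high) | (data < low)]           # here: 25
    inside = data[(data <= high) & (data >= low)]
    lower_fence, upper_fence = inside.min(), inside.max()   # last non-outliers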
3.3.2 Scatter plots
A useful depiction of data is the scatter plot. Here we simply plot all the points to see if
any patterns or trends can be identified. Looking at the scatter plot below it is possible to
see that most projects take less than 50 person hours, but the last two have required many
more hours than is typical. Unlike box plots, scatter plots allow us to consider more than one attribute.
A second scatter plot, of effort against size, seems to show a relationship between the two, with effort increasing as size increases.
3.3.3 Control charts
Another useful technique is the control chart, which helps us to see whether data falls within acceptable bounds. By watching data trends over time, you can decide whether to take action to prevent difficulties before they happen. To see how control charts work, consider a non-software example. Many processes have a normal variation for a given attribute. Steel manufacturers rarely produce a one-inch nail that is exactly one inch long; instead, they accept variations around one inch. We would expect the actual length values to be randomly distributed within two standard deviations of the mean.
Consider the table below, which gives the ratio between preparation hours and inspection hours for a series of design inspections. We determine the mean and the standard deviation of the data, and then the two control limits. The upper control limit is two standard deviations above the mean and the lower control limit two below. The upper and lower limits describe when the system is working within acceptable statistical bounds and when it is performing in an abnormal way.
Component number        Preparation hours/inspection hours
Standard deviation      0.5
Upper control limit     2.6
Lower control limit     0.4
To visualize the behaviour, it is useful to produce a graph called a control chart, which shows the upper control limit, the lower control limit and the mean.
Figure: control chart for the seven components, showing the upper control limit, the lower control limit and the mean.
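Computing the limits is straightforward, as the Python sketch below shows; the ratio values here are invented, so the limits will not match the table above.

    import numpy as np

    # Invented preparation/inspection ratios for seven components.
    ratios = np.array([1.2, 1.6, 1.4, 2.0, 1.1, 1.8, 1.4])

    mean = ratios.mean()
    sd = ratios.std(ddof=1)        # sample standard deviation
    upper_limit = mean + 2 * sd
    lower_limit = mean - 2 * sd

    out_of_control = ratios[(ratios > upper_limit) | (ratios < lower_limit)]
    print(upper_limit, lower_limit, out_of_control)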
3.3.4 Measure of Association
Scatter plots let us investigate the behaviour of two attributes and see whether those attributes appear to be related. Evidence of a relationship does not give proof of causality. However, there are statistical techniques that can help us examine the likelihood that an association seen now will hold in the future. This is known as the measurement of association, and it is supported by techniques that identify whether the association is significant.
For normally distributed data the Pearson correlation coefficient is a valuable measure of
association. If we want to determine the association between two attributes x and y, we can create pairs (xi, yi), one for each of the n software items measured. For each attribute we establish the mean and variance: the mean of the xs is denoted mx and the mean of the ys my, while var(x) is the variance of the set of xs and var(y) the variance of the ys. We then calculate:

    r = Σ (xi - mx)(yi - my) / ( n √( var(x) var(y) ) )
The value r is the correlation coefficient and lies between -1 and +1. When r is 1 there is a perfect positive relationship between x and y; when it is -1 there is a perfect negative relationship; and when r is 0 there is no relationship between x and y.
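The formula translates directly into Python, as the sketch below shows on invented data; the result agrees with library routines such as numpy's corrcoef.

    import numpy as np

    x = np.array([10.0, 20, 30, 40, 50])   # invented attribute values
    y = np.array([12.0, 25, 33, 38, 52])

    n = len(x)
    mx, my = x.mean(), y.mean()
    # np.var divides by n by default, matching var(x) in the formula above.
    r = ((x - mx) * (y - my)).sum() / (n * np.sqrt(x.var() * y.var()))

    print(r, np.corrcoef(x, y)[0, 1])   # the two values agree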
3.3.5 Robust correlation
The most commonly-used robust correlation coefficient is Spearman’s rank correlation
coefficient. It is produced in the same way as the Pearson correlation coefficient, but the x and y values are based on the ranks of the attributes instead of the raw values. That is, we put the values in order, with the smallest having rank 1, the next smallest rank 2, and so on. If two or more items have the same raw value, they are given the average of the corresponding rank values. For instance, if the 1st and 2nd smallest modules both have 40 lines of code, they would each be allocated rank 1.5, the average of 1 and 2.
Kendall's robust correlation coefficient τ varies from -1 to 1, as with Spearman's rank coefficient, but the approach is different. The Kendall coefficient considers every two pairs of attribute values (xi, yi) and (xj, yj): if there is a positive relationship between the attributes, then when xi is greater than xj it is likely that yi is greater than yj.
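Both coefficients are available in SciPy, as the sketch below illustrates on invented data; spearmanr assigns tied values their average rank, as described above.

    from scipy import stats

    x = [40, 40, 55, 70, 90]   # invented, e.g. lines of code (note the tie)
    y = [2, 3, 3, 5, 8]        # invented, e.g. faults found

    rho, p_rho = stats.spearmanr(x, y)    # correlation of the ranks
    tau, p_tau = stats.kendalltau(x, y)   # based on pairwise orderings

    print(rho, tau)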
3.3.6 Linear regression
Once an association has been identified and the level of association between the two variables determined, it is important to consider the nature of the association. Linear regression is a popular and useful method for expressing an association as a linear formula. The approach is based on a scatter plot: every pair of attributes is a data point (xi, yi), and the technique establishes the line of best fit through the points. The aim is to express attribute y (the dependent variable) in terms of attribute x (the independent variable), in the form:

    y = a + b x
The idea behind regression is to draw a vertical line from each actual point to the fitted line to represent the distance between them. The length of this line is the discrepancy, known as the residual, and the line of best fit is the one that produces the smallest overall discrepancy.
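A minimal least-squares fit in Python is sketched below on invented data; the closed-form expressions for b and a are the standard ones that minimise the sum of the squared residuals.

    import numpy as np

    x = np.array([2.0, 4, 6, 8, 10])     # invented independent variable
    y = np.array([5.0, 9, 12, 18, 21])   # invented dependent variable

    b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    a = y.mean() - b * x.mean()

    residuals = y - (a + b * x)   # vertical distances to the fitted line
    # Least squares chooses a and b to minimise (residuals ** 2).sum().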
3.3.7 Multivariate regression
The regression method considered above concentrates on the linear relationship between two attributes. The technique can be extended to examine a linear relationship between one dependent variable and two or more independent variables; this is known as multivariate regression (a small sketch follows the list below). There is a need for caution when using many attributes:
It is not easy to assess the association visually.
Relations between the independent variables can create unstable equations.
Robust multivariate regression can be complex.
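A sketch of a two-variable fit using numpy's least-squares routine follows; the data and the meanings attached to the variables are invented.

    import numpy as np

    x1 = np.array([1.0, 2, 3, 4, 5])       # e.g. size (invented)
    x2 = np.array([3.0, 1, 4, 2, 5])       # e.g. number of interfaces (invented)
    y = np.array([10.0, 12, 19, 18, 26])   # e.g. effort (invented)

    # Fit y = a + b1*x1 + b2*x2 by ordinary least squares.
    X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix
    (a, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)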
3.4 More Advanced Methods
There are many other techniques for analysing data. This section of the handout considers more advanced techniques for examining associations among attributes.
3.4.1 Classification tree analysis
Many statistical techniques deal with pairs of measures. However, often we want to
identify the measures that offer the best information associated with an aim or action.
That is, we collect many measures and decide which is best at predicting the behaviour of a specific attribute. A statistical technique called classification tree analysis can deal with this problem. Suppose you have collected data on code modules and wish to establish which measures are the best indicators of poor code quality. You define poor quality based on one of the measures, say whether the module has more than 3 faults. A classification tree analysis then produces a decision tree showing which of the other measures are associated with poor quality. The tree may look like the one shown below.
Figure: example classification tree, branching on module size (<100, 100-300, >300 lines of code), design measures and cyclomatic complexity (<5, >5) to separate poor-quality modules from the rest.
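The handout does not prescribe a particular tree-building algorithm; as one widely available option, the Python sketch below fits a small CART-style tree with scikit-learn. The module measures and fault counts are invented.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Invented module data: columns are lines of code and cyclomatic complexity.
    X = np.array([[80, 3], [120, 6], [350, 9], [200, 4], [90, 2], [400, 12]])
    faults = np.array([1, 2, 7, 5, 0, 9])
    poor_quality = faults > 3          # the quality criterion from the text

    tree = DecisionTreeClassifier(max_depth=2).fit(X, poor_quality)
    print(tree.predict([[250, 7]]))    # classify a new module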
3.4.2 Transformations
Sometimes it is difficult to understand data in its original form, but easier if it is transformed in some way. Generally, a transformation is a mathematical function applied to the measures, converting the original data set into a new one. When a relationship between two variables is non-linear, it may be useful to transform one of them so that the relationship becomes linear.
3.4.3 Multivariate data analysis
There are various techniques that can be used on data containing many variables. Principal components analysis is used to simplify a set of data by removing some of the dependencies between the variables. Cluster analysis takes the resulting principal components and enables us to group modules by some criterion. Discriminant analysis produces assessment criteria to distinguish one data set from another.
3.4.3.1 Principal components analysis
If we wish to examine the associations among diverse attributes, we do not want subsets of associations creating a misleading picture. Principal components analysis produces a linear transformation of a group of correlated attributes, so that the transformed variables are independent. The analysis identifies the amount of variability accounted for by each transformed variable. The transformation that accounts for the most variability is known as the first principal component, and the one that accounts for the second most is the second principal component. Typically, principal components that account for less than 5% of the variance are ignored.
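A sketch using scikit-learn's PCA on invented, deliberately correlated measures follows; the explained-variance ratios show how much variability each component accounts for.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    size = rng.normal(200, 50, 30)                # invented module sizes
    effort = 0.1 * size + rng.normal(0, 2, 30)    # correlated with size
    faults = 0.02 * size + rng.normal(0, 1, 30)   # also correlated with size
    X = np.column_stack([size, effort, faults])

    pca = PCA().fit(X)
    print(pca.explained_variance_ratio_)   # components under ~5% can be dropped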
3.4.3.2 Cluster analysis
Cluster analysis can be used to examine the closeness of modules based on their measurable characteristics. First a principal components analysis is performed, creating a reduced set of principal components that explain most of the variance. Cluster analysis then groups the modules into categories based on those components.
3.4.3.3 Discriminant analysis
Discriminant analysis enables us to split data into two groups and to establish the class to which a new data point should be allocated.
3.5 Overview of Statistical Techniques
In this handout several statistical techniques have been considered. These have been presented according to the kind of association we want (linear, multivariate, etc.), but our choice of technique must also take into account how many groups are being contrasted, the size of the sample, and more. In this section a few statistical tests are described, to explain how they are oriented to specific experimental situations.
3.5.1 One-group test
In a one-group design, measurements from one set of subjects are contrasted with an
anticipated population distribution of values. These tests are relevant when an experimenter has one set of data and a clear null hypothesis about the value of the population mean. For normal distributions, the parametric test used is the t-test. Since the mean of the data is used, the data must be on an interval scale or above.
There are several choices for non-parametric data:
3.5.1.1 Binomial test
The binomial test is used when:
The dependent variable can take only two distinct and mutually exclusive values.
The measurement trials in the experiment are independent.
3.5.1.2 Chi-squared test for goodness of fit
The chi-squared test is appropriate when:
The dependent variable can take two or more distinct, mutually exclusive and exhaustive values.
The measurement trials in the experiment are independent.
None of the categories has an expected frequency of less than 5.
3.5.1.3 Kolmogorov-Smirnov one-sample test
This test assumes that the variable is continuous, and it examines the similarity between the observed and predicted cumulative frequency distributions.
3.5.2 Two-group tests
Two-group designs enable you to contrast two samples of dependent measures drawn
from related or matched samples of subjects or two independent groups of subjects.
3.5.2.1 Test to compare two matched or related groups
The parametric t-test for matched groups applies when the dependent measurement is
taken under two diverse conditions, and when one of the following conditions is true:
The same subject is tested in both conditions.
Subjects are matched based on some criteria.
Pre-screening has created randomized blocks of subjects.
Non-parametric alternatives to this test include the McNemar change test, the sign test and the Wilcoxon signed ranks test. The McNemar change test is helpful for assessing changes on a dichotomous variable once the experimental treatment has been administered to a subject. The sign test is applied to related samples when you want to determine whether one condition is greater than another on the dependent measure. The Wilcoxon signed ranks test is similar to the sign test, but takes into account the magnitude as well as the direction of the differences.
3.5.2.2 Test to compare two independent groups
The parametric test relevant for independent groups or among-subjects designs is the t-
test for differences between populations.
3.5.3 Comparison involving more than two groups
If you have measured an attribute for more than two groups, the most appropriate statistical analysis technique is the analysis of variance (ANOVA). This class of parametric tests is appropriate for data from among-subjects designs, within-subjects designs, and designs that are a combination of both.
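A one-way ANOVA across three groups is sketched below using SciPy; the productivity figures are invented.

    from scipy import stats

    group_a = [12.0, 14, 11, 13]   # invented productivity measurements
    group_b = [15.0, 17, 16, 18]
    group_c = [11.0, 10, 12, 11]

    f, p = stats.f_oneway(group_a, group_b, group_c)   # computes the F statistic
    if p < 0.05:
        print("at least one group mean differs from the others")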