ASE-EVAL: Software Evaluation
Lecture 2: The Empirical Context
Qualitative or Quantitative?
Qualitative research is concerned with the study of 'objects' in their natural setting. It:
- involves making interpretations that are based upon explanations;
- accepts that there may well be different interpretations of a phenomenon;
- is concerned with discovering the causes of effects/behaviour/…
A software engineering example might be to assess why different inspection groups perform differently.
Quantitative research mainly seeks to quantify a relationship, or to compare two or more groups. It:
- is employed to test the effect of some intervention;
- lends itself to statistical analysis.
A software engineering example might be to determine whether introducing a new inspection method can increase the number of faults found during testing.
Quantitative forms used in SE:
- Laboratory experiments: provide a high degree of control, but outcomes may be hard to generalise.
- Surveys: descriptive forms can be used to study the distribution of certain characteristics or attributes.
- Case studies: offer less control than an experiment but can be conducted in a more realistic setting.
- Quasi-experiments: such as field trials, where the experimenter has little control.
Qualitative forms used in SE:
- Surveys: explanatory and exploratory forms can be used to explain particular effects, or to identify issues that might need fuller investigation. (Note that surveys often use ordinal-scale measures to assess personal opinion, so they use quantitative but subjective measures.)
- Case studies: when used in a comparative role (e.g. which method is preferred). May use interviews to gather data.
- Ethnographical studies: where the observer acts in a nonintrusive manner.
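The note that surveys often use ordinal-scale measures has a practical consequence for analysis: ordinal responses (e.g. a 1 to 5 agreement scale) can be ordered but not meaningfully averaged, so the median and the response frequencies are safer summaries than the mean. The following is a minimal Python sketch, assuming responses are coded as integers 1 to 5; the coding scheme and the function name are hypothetical.

```python
from collections import Counter
from statistics import median_low

# Hypothetical 5-point scale: 1 = strongly disagree ... 5 = strongly agree.
SCALE = range(1, 6)

def summarise_ordinal(responses: list[int]) -> dict:
    """Summarise ordinal survey responses without treating them as interval data."""
    if not responses:
        raise ValueError("no responses")
    for r in responses:
        if r not in SCALE:
            raise ValueError(f"response {r} is outside the scale")
    counts = Counter(responses)
    return {
        # median_low always returns an actual scale point, never a value such as 3.5
        "median": median_low(responses),
        # frequency of each scale point, including points nobody chose
        "frequencies": {point: counts.get(point, 0) for point in SCALE},
    }
```

Using `median_low` rather than `median` is a design choice: with an even number of responses it picks the lower of the two middle values, so the summary stays on the scale instead of inventing an in-between category.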
Independent and Dependent Variables
The independent variable(s) (other terms are 'stimulus' or 'input' variables) are associated with cause. They are changed by the activities of the investigator, and should not be affected by the other variables. Using more than one independent variable in a study can complicate the analysis of the outcomes.
The dependent variable (other terms include 'response' or 'outcome' variable) is associated with effect, and should change as a result of changes to the independent variable(s). Identifying and measuring it is the means by which the outcomes of the study are captured.
We are interested in whether an independent variable causes change to the dependent variable, but our methods are not always rigorous enough to allow us to make cause-effect claims, in which case we need additional arguments to support them.
A confounding factor is some (undesirable) element in an empirical study that produces an effect which makes it impossible to distinguish between two or more possible causes of an effect (as measured through the dependent variable). Typical examples are the skill levels of participants in experimental studies and the sampling processes used in surveys.
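Since participant skill level is a typical confounding factor, one common design response (not prescribed in these notes) is to block on skill before randomising. Each skill level is then split as evenly as possible between the treatment and control groups, so skill cannot masquerade as a treatment effect. A minimal Python sketch, in which the participant records, the skill labels and the function name are all hypothetical:

```python
import random
from collections import defaultdict

def blocked_assignment(participants: list[tuple[str, str]], seed: int = 0) -> dict[str, str]:
    """Assign each (participant_id, skill_level) pair to 'treatment' or 'control'.

    Participants are grouped (blocked) by skill level, each block is shuffled,
    and members are then assigned alternately, so every skill level is split
    between the two groups with a difference of at most one.
    """
    rng = random.Random(seed)  # fixed seed so the allocation can be reproduced and audited
    blocks: dict[str, list[str]] = defaultdict(list)
    for pid, skill in participants:
        blocks[skill].append(pid)

    assignment: dict[str, str] = {}
    for skill in sorted(blocks):
        members = blocks[skill]
        rng.shuffle(members)
        # randomise which group receives the first (possibly extra) member of an odd-sized block
        groups = ["treatment", "control"]
        rng.shuffle(groups)
        for i, pid in enumerate(members):
            assignment[pid] = groups[i % 2]
    return assignment
```

Blocking only balances the factors you know about and can record; other confounding factors still depend on the randomisation within each block.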
Threats to validity
Given that there are many factors affecting a study, a key question for its design is how valid its outcomes are. To be adequately valid, the results should hold for the "population of interest" (students, software developers, maintainers, …). A threat to validity is therefore a factor that may call the outcomes of a study into question. Such threats are particularly relevant for laboratory experiments, where extrapolating the results can be more problematic, so they are usually assessed as part of the reporting process for empirical studies, and for experiments in particular. Three major forms of threat are:
- internal validity, where we seek to identify factors that might have affected the outcomes (the dependent variable) without the researcher's knowledge, and which might put the causal relationship between treatment and outcome in question. For software engineering this might arise because of the lack of a control group (to which the treatment is not applied); because the membership of groups has possibly been unbalanced in some way (even if selected randomly); or because participants have not followed the prescribed activities correctly.
- external validity, which is concerned with how generalisable the results may be to the intended population of interest. Here the threat is that the results are only applicable to the particular context of the study. In software engineering this might arise from the selection processes (e.g. using students rather than experienced developers, testers etc.; having too few participants; using an inappropriate treatment), or because of the timing of the study.
- construct validity, which is concerned with how well the outcomes of the study are linked to the concepts or theory behind the study. For software engineering, threats might arise because the model used in the study is inadequate (e.g. comparing methods to see which is 'better' without a clear definition of what is meant by 'better'); because of possible bias in scoring the outcomes; or because of an interaction between the treatment and the way that it is being measured. This creates a major problem in SE, where we often employ surrogate measures (such as system test defects to assess 'quality').
Planning an evaluation
Having determined an appropriate form of study to use, we need to identify:
- the independent variables: these are ones that we can control and change (at least, for an experiment), or influence/measure in some way, and they will be determined by the question that we are seeking to answer (since they represent 'cause');
- the dependent variable: this may often not be directly measurable, so we might have to identify and use a surrogate (indirect) measure instead.
In addition, if we are going to conduct an experiment, we also need to determine the form of treatment to be used (how we will manipulate the independent variables in the study). In software engineering, the treatment might be a process (testing strategy, design method, …) or a product (web browser, development environment, programming language, …).
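Once the independent variable (for instance, which inspection method a group uses, as in the earlier example) and the dependent variable (faults found) are fixed, the groups can be compared statistically. As one illustration, not a method prescribed in these notes, a two-sample permutation test estimates how often a difference in group means at least as large as the observed one would arise if the group labels were assigned at random. A minimal Python sketch; the function name and parameters are hypothetical:

```python
import random
from statistics import mean

def permutation_test(treatment: list[float], control: list[float],
                     n_permutations: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """One-sided permutation test of H1: mean(treatment) > mean(control).

    Returns (observed difference in means, estimated p-value).
    """
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(treatment) + list(control)
    n_t = len(treatment)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel participants at random, as under the null hypothesis
        diff = mean(pooled[:n_t]) - mean(pooled[n_t:])
        if diff >= observed:
            at_least_as_extreme += 1
    # add-one correction: the observed labelling counts as one of the permutations
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value
```

A permutation test makes few distributional assumptions, which suits the small samples common in SE experiments, but it cannot by itself remove confounding factors or other threats to validity.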
Participants
Many empirical studies in SE involve people, usually referred to as participants. Recruiting participants:
- involves ethical issues. Recruiting should not put people under pressure to participate, and as far as possible they should be guaranteed anonymity. Plans for studies involving people usually need to be submitted to an ethics committee before they occur (the department has one);
- should aim to obtain a representative sample from the domain of interest, or from a surrogate domain;
- often uses students as participants, but care is needed about how representative they are.
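One practical way to keep participants' names out of the study data is to replace them with pseudonyms derived from a keyed hash, with the key stored separately from the data. A minimal sketch using Python's standard `hmac` module; the function name and key handling are assumptions. Pseudonymisation is weaker than true anonymity: whoever holds the key can re-identify participants, and in small groups other attributes may still identify individuals.

```python
import hashlib
import hmac
import secrets

def pseudonymise(name: str, key: bytes, length: int = 8) -> str:
    """Map a participant's name to a stable pseudonym: 'P-' followed by hex digits.

    The same name and key always give the same pseudonym, so records from
    different sessions can be linked without storing the name itself.
    """
    normalised = name.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalised, hashlib.sha256).hexdigest()
    return "P-" + digest[:length]

# Hypothetical setup: generate the key once per study and store it apart from the data.
study_key = secrets.token_bytes(32)
```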
The Protocol
This is a document that describes how the study will be conducted. It should be produced before the study begins, and any divergences from this plan that occur should be formally recorded. The aim is to anticipate possible problems and have a plan for addressing them in a systematic and consistent way. The protocol is a formal document (including abstract and references) and should address at least:
- background (why the study is needed)
- context (where it is to be performed)
- detailed form of the study (processes, activities, tasks etc.)
- how participants are to be recruited/selected
- data collection and analysis
- timetable
- limitations/constraints
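Because the protocol must be complete before the study starts, and divergences must be formally recorded afterwards, it can help to keep it as a structured record that flags missing sections and logs dated deviations. A minimal Python sketch; the class, the field names and the section keys are hypothetical and simply mirror the list of minimum contents:

```python
from dataclasses import dataclass, field
from datetime import date

# Minimum sections from the protocol contents list (key names are illustrative).
REQUIRED_SECTIONS = (
    "background", "context", "study_design", "recruitment",
    "data_collection_and_analysis", "timetable", "limitations",
)

@dataclass
class Protocol:
    title: str
    sections: dict[str, str] = field(default_factory=dict)
    deviations: list[tuple[date, str]] = field(default_factory=list)
    frozen: bool = False  # set once the study begins

    def missing_sections(self) -> list[str]:
        """Required sections that are absent or empty."""
        return [s for s in REQUIRED_SECTIONS if not self.sections.get(s, "").strip()]

    def start_study(self) -> None:
        """Freeze the protocol; refuses to start if any required section is missing."""
        missing = self.missing_sections()
        if missing:
            raise ValueError(f"protocol incomplete: {missing}")
        self.frozen = True

    def record_deviation(self, when: date, description: str) -> None:
        """Log a divergence from the plan; only allowed once the study has begun."""
        if not self.frozen:
            raise RuntimeError("edit the protocol itself before the study starts")
        self.deviations.append((when, description))
```

Freezing the record at `start_study` mirrors the distinction in the notes: before the study the plan is edited directly, and after it starts changes are logged as deviations rather than silently rewritten.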
Primary and Secondary Studies
The forms identified above are referred to as primary studies, in that they are individual studies, set within a particular context. A secondary study seeks to aggregate the outcomes of many primary studies, and in particular to do so in a systematic and objective manner. A major tool of this kind, used in clinical medicine, education, social science and other domains, is the systematic literature review. Where such secondary studies are used, they influence primary studies in a number of ways, including identifying reporting standards and also identifying where additional studies might be needed. Evaluation in SE is increasingly using systematic literature reviews to look at wider issues, such as the use of models and methods.
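When the primary studies in a review report comparable quantitative outcomes, one widely used aggregation technique (not described in these notes, and often not feasible in SE, where narrative synthesis is common) is fixed-effect meta-analysis. Each study's effect estimate is weighted by the inverse of its variance, so more precise studies count for more. The sketch assumes every study supplies an effect size and its standard error, and that all studies estimate the same underlying effect; where studies are heterogeneous a random-effects model is usually preferred. The function name is hypothetical.

```python
import math

def fixed_effect_pool(effects: list[float], std_errors: list[float]) -> tuple[float, float]:
    """Inverse-variance (fixed-effect) pooled estimate and its standard error.

    Each study i gets weight w_i = 1 / se_i**2; the pooled estimate is
    sum(w_i * y_i) / sum(w_i), with standard error sqrt(1 / sum(w_i)).
    """
    if not effects or len(effects) != len(std_errors):
        raise ValueError("need at least one study, and one standard error per effect")
    weights = [1.0 / se**2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, math.sqrt(1.0 / total)
```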