

TDT 4735 Project in software engineering

How to measure success?

By: Anders Person and Knut Steinar Engene
Subject supervisor: Maria Letizia Jaccheri
Department of Computer and Information Science at the Faculty of Information Technology, Mathematics and Electrical Engineering, NTNU, Trondheim, Norway
Fall 2004

Abstract

Since the term software engineering was established in the 1960s, much of the software developed has run into problems such as missed deadlines and poor quality. Software development has often been guided by gut feelings and "expert knowledge". In other areas of science, empirical engineering has been a resource in product development, because it helps us understand how and why things work. Empirical software engineering enables the use of statistics, and can therefore back up its claims with significance and probability data. Without empirical engineering, we cannot know which mechanisms drive the costs and benefits of software tools. Without this information, it is hard to determine whether we are basing our actions on credible interpretations or on faulty assumptions.

In this paper, we have conducted an empirical experiment on the organization Gentoo. This is a voluntary organization that produces a distribution called Gentoo Linux. The organization went through a process improvement initiative, and we want to find out whether or not this reorganization improved Gentoo's efficiency.

Acknowledgements

We wish to thank the following people: Our supervisor Thomas Østerlie, for doing a good job as a supervisor and keeping us on our toes. He has also provided us with articles and with knowledge about Gentoo and the writing of the report. In addition we would like to thank Professor Tor Stålhane for his help with statistics and the testing of the hypotheses. Jan Kjeran Kolsrud has also been a resource during the hypothesis testing.
Anders Person
Knut Steinar Engene

Contents

ABSTRACT
ACKNOWLEDGEMENTS
FIGURE LIST
TABLE LIST
1 INTRODUCTION
1.1 Motivation
1.2 Project context
1.3 Problem definition
1.4 Report outline
2 PRESTUDY
2.1 EMPIRICAL SOFTWARE ENGINEERING (ESE)
2.1.1 Why do empirical software engineering
2.1.2 What is ESE?
2.1.3 Why experiment?
2.2 HOW TO PERFORM EMPIRICAL SOFTWARE ENGINEERING
2.2.1 Introduction
2.2.2 The evolution
2.2.3 Alternative approaches
2.2.4 Our approach
2.3 OPEN SOURCE SOFTWARE (OSS)
2.3.1 An introduction to OSS
2.4 THE EVOLUTION OF LINUX AND GENTOO LINUX
2.4.1 Linux
2.4.2 A technical overview of Linux
2.4.3 Linux distributions
2.4.4 Gentoo Linux
2.4.5 Gentoo Linux in detail
2.4.6 Gentoo, organizational
2.4.6.1 Gentoo community
2.4.6.2 The Herds project
2.4.6.3 Concurrent Versions Control
2.5 REORGANIZATION OF GENTOO
3.1 RESEARCH AGENDA
3.2 FOCUS
3.3 QUESTIONS
3.4 ASSOCIATED RESEARCH METHOD/PROCESS
4 EXPERIMENT PLANNING
4.1 CONTEXT SELECTION
4.2 HYPOTHESIS EXPLANATION
4.4 VARIABLES SELECTION
4.4.1 Independent variables
4.4.2 Dependent variables
4.5 SELECTION OF SUBJECTS
4.6 EXPERIMENT DESIGN
4.6.1 Randomization
4.7 INSTRUMENTATION
4.7.1 Python scripting
4.8 VALIDITY EVALUATION
4.8.1 Conclusion validity
4.8.2 Internal validity
4.8.3 Construct validity
4.8.4 External validity
4.9 PRIORITY AMONG TYPES OF VALIDITY THREATS
5 EXPERIMENT OPERATION
5.1 INTRODUCTION
5.2 EXPERIMENT PREPARATION
5.3 EXPERIMENT EXECUTION
5.3.1 Data collection
5.3.2 Different methods
5.4 DATA VALIDATION
5.4.1 Data source integrity
5.4.2 Bugzilla
5.4.3 Manual bug inspection
5.4.4 The participants
5.4.5 Information included in the collected data
5.4.6 Possible improvements
6 ANALYSIS AND INTERPRETATION
6.1 DESCRIPTIVE STATISTICS
6.1.1 Hypothesis 1
6.1.2 Hypothesis 2
6.1.2.1 Before the reorganization
6.1.2.2 After the reorganization
6.1.2.3 Total number of open bugs
6.1.2.4 Plotting solved vs. new bugs per week
6.1.3 Hypothesis 3
6.2 DATA SET REDUCTION
6.3 HYPOTHESIS TESTING
6.3.1 Hypothesis 1
6.3.1.1 t-Test
6.3.1.2 Result interpretation
6.3.1.3 Linear regression
6.3.1.4 Result interpretation
6.3.1.5 t-Test II
6.3.1.6 Brief summary
6.3.2 Hypothesis 2
6.3.2.1 t-Test
6.3.2.2 Result interpretation
6.3.2.3 Brief summary
6.3.3 Hypothesis 3
6.3.3.1 Linear regression
6.3.3.2 Result interpretation
6.3.3.3 Brief summary
7 EVALUATION AND DISCUSSION OF RESULTS
7.1 DISCUSSING THE HYPOTHESES
7.2 EVALUATING QUESTION 1
7.3 THINGS WE COULD HAVE DONE DIFFERENTLY
8 CONCLUSIONS AND FURTHER WORK
8.1 PROJECT EVALUATION
8.2 CONCLUSION
8.3 FURTHER WORK
BIBLIOGRAPHY
ONLINE REFERENCES
ATTACHMENT A
ATTACHMENT B
ATTACHMENT C
ATTACHMENT D
ATTACHMENT E

Figure list

FIGURE 1: OVERVIEW OF VALIDATION METHODS [ZELKOWITZ & WALLACE, 1998]
FIGURE 2: HIGH-LEVEL STEPS OF GQM/MEDEA [L. C. BRIAND ET AL, 2002]
FIGURE 3: OVERVIEW OF THE EXPERIMENT PROCESS [WOHLIN ET AL, 2000]
FIGURE 4: LINUX ARCHITECTURE [LINUX, KLINGAUF]
FIGURE 5: GENTOO USER SURVEY [GWN 08.11.2004]
FIGURE 6: DATA FLOW DIAGRAM, GENTOO
FIGURE 7: THE GENTOO HERDS PROJECT
FIGURE 8: EXPERIMENT PLANNING
FIGURE 9: ILLUSTRATION OF INDEPENDENT AND DEPENDENT VARIABLES
FIGURE 10: RESULTS FROM AN EARLY VERSION OF THE MODIFIED BUGZILLA SCRIPT
FIGURE 11: EXPERIMENT PRINCIPLES [WOHLIN 2000]
FIGURE 12: EXPERIMENT OPERATION [WOHLIN ET AL, 2000]
FIGURE 13: BUG ACTIVITY LOG [BUGZILLA]
FIGURE 14: “MOVES, ADDS AND CHANGES” FROM GWN PUBLISHED 30TH JUNE 2003
FIGURE 15: ANALYSIS AND INTERPRETATION
FIGURE 16: DIAGRAM THAT ILLUSTRATES THE HANDLING TIME FOR EACH OF THE INSPECTED BUGS
FIGURE 17: AVERAGE HANDLING TIME
FIGURE 18: COMPARISON OF THE BUG HANDLING TIME
FIGURE 19: THE DIAGRAM PLOTS THE AVERAGE HANDLING TIME PER BUG ON A WEEKLY BASIS
FIGURE 20: THE PICTURE SHOWS THE DEVELOPMENT IN REPORTED AND CLOSED BUGS ON A WEEKLY BASIS
FIGURE 21: THE PICTURE SHOWS THE NUMBER OF CLOSED BUGS DIVIDED ON THE NUMBER OF NEW BUGS
FIGURE 22: NUMBER OF OPEN BUGS EACH WEEK
FIGURE 23: COMPARISON OF SOLVED VS NEW BUGS PER WEEK
FIGURE 24: THE EVOLUTION OF DEVELOPERS FROM 01.012003
FIGURE 25: NEW BUGS PER DEVELOPER PER WEEK
FIGURE 26: SOLVED BUGS PER DEVELOPER PER WEEK
FIGURE 27: THE DIAGRAM PLOTS THE AVERAGE HANDLING TIME PER BUG ON A WEEKLY BASIS
FIGURE 28: COMPARING LINEAR REGRESSION
FIGURE 29: HANDLING TIME PLOT
FIGURE 30: RESIDUAL PLOT

Table list

TABLE 1: GANTT CHART (SMALL VERSION)
TABLE 2: SUGGESTED HYPOTHESES
TABLE 3: CONCLUSION VALIDITY
TABLE 4: INTERNAL VALIDITY
TABLE 5: CONSTRUCT VALIDITY
TABLE 6: EXTERNAL VALIDITY
TABLE 8: SCALE FOR CATEGORIZING THE BUG HANDLING TIME
TABLE 9: FINAL SCALE FOR CATEGORIZING THE BUG HANDLING TIME
TABLE 10: SHOWS THE DISTRIBUTION OF THE BUGS WITHIN EACH CATEGORY
TABLE 11: T-TEST: TWO-SAMPLE ASSUMING UNEQUAL VARIANCES
TABLE 12: OUTPUT OF THE LINE REGRESSION TEST PRIOR REORGANIZATION
TABLE 13: OUTPUT OF THE LINE REGRESSION TEST AFTER REORGANIZATION
TABLE 14: T-TEST: TWO-SAMPLE ASSUMING UNEQUAL VARIANCES
TABLE 15: SUMMARY OUTPUT

1 Introduction

1.1 Motivation

In today’s software development society, efficiency, thoroughness and the constant need for improvement are just a few of the factors crucial to success. Competition is tough, and many organisations struggle to gain market share and keep themselves alive. In the commercial part of the software development community, many techniques and proposals for cost efficiency and process improvement have become available. This work is mainly based on empirical data collected from budgets and financial statements. Measuring any gained success is therefore fairly easy.

In non-commercial open source software (OSS) development, measuring improvement initiatives is harder. As in commercial organisations, a streamlined workflow and organisational architecture are needed to obtain a good result. However, some open source communities have a virtual organizational structure. This means that the participants in open source software projects rarely meet physically, and almost all communication takes place on the Internet. In addition, the participants don’t receive any salary for their contributions. Because the organizations rely on voluntary work, it is difficult to control the participants and the development progress.

These conditions create obstacles when trying to measure success in such projects, because organizational structures and techniques from the commercial world can't be adopted without adjustment.

There is a large number of OSS projects. At Sourceforge, the largest repository for open source applications, more than 91,000 projects are registered [Sourceforge]. This indicates that open source software is here to stay.

1.2 Project context

The project description for this project was given by the Department of Computer and Information Science at the Norwegian University of Science and Technology. The project is part of the 9th semester of the master's program, and it had to be completed within 13 weeks. We were given no financial support for the project. We were assigned semi-private booths with reserved computers. We have worked in a two-man team and have been appointed a supervisor. It has also been possible to consult our teachers. We did not have much experience with OSS, Linux or empirical software engineering when starting on this project.

The project goal was to "…determine the outcome of a real software process improvement initiative in an open source project". It was to be completed "By using state-of-the-art methodology within empirical software engineering,…" and the outcome was to "…determine whether or not this improvement initiative is a success or a failure." [Project assignment, M. L. Jaccheri].

We approached this challenge by first studying empirical software engineering, open source software and the Gentoo Linux project. After gaining this knowledge, we performed an empirical experiment in which we tried to determine Gentoo Linux's efficiency before and after the reorganization. This was done by formulating a question that was to be answered by the outcome of testing three hypotheses.

1.3 Problem definition

The project title is "How to measure success?". It refers to the reorganization done by Gentoo Linux, and to what extent it was a successful initiative [GLEP 4, D. Robbins]. To assess this, we had to determine appropriate ways of measuring the success of the reorganization.

After this reorganization, questions were raised. Did this OSS project really benefit from the reorganization? Can a voluntary virtual organization make such a radical change and still come out on top?

This project emphasized the empirical part of software engineering.
Therefore the main task was to conduct an empirical experiment. The report was then written, discussing the results of the experiment. We have also given an introduction to open source software, empirical software engineering and Gentoo Linux.

1.4 Report outline

We have used a report template given to us by the Department of Computer and Information Science at NTNU as a basis, then modified it to fit our project. The rest of the report is organized in the following parts:

2. Prestudy
Introduction to empirical software engineering, open source software, Linux in general and Gentoo Linux.

3. Problem statement
Presents an elaboration of the project, its challenges and our agenda.

4. Experiment planning
Here the project is defined. The hypotheses are discussed, as are variables, instrumentation and validity.

5. Experiment operation
This section presents the experiment preparation, its execution and data validation.

6. Analysis and interpretation
Descriptive statistics are used on the hypotheses, data set reduction is briefly mentioned and the hypotheses are tested.

7. Evaluation and discussion of results
Evaluates the theoretical and practical work. Our view on the project and the work process is described.

8. Conclusion and further work
The project is briefly summarized and we reach a conclusion. Suggestions for further work are presented.

2 Prestudy

The aim of this prestudy is to learn about empirical software engineering, open source software and Gentoo Linux. This was necessary for us to understand the scope of the project. The following chapter is a summary of the articles, books and web pages we have read. It gives an introduction to some of the most important areas of our research.

2.1 Empirical Software Engineering (ESE)

In this section we will present some of the work done in the field of empirical software engineering. The sources we have used are mainly from the syllabus at the section for empirical software engineering at our university [Syllabus].
In addition we have used one of the textbooks from the course “Software Quality and Empirical Work” [Tdt25], which both authors attend.

2.1.1 Why do empirical software engineering

For a product to evolve, it needs testing and experimenting. By doing empirical experiments and analyzing historical data, one might be able to make claims about improvements in future projects. Software is no exception, and the small amount of software experimentation might hinder its development [Basili et al., 1986]. Empirical software engineering is a good way of doing experiments because it backs up its claims with statistics. There are several different ways of doing empirical research: survey, case study and experiment.

Why do many development projects generate less-than-desirable products? Many of the approaches are chosen on the basis of gut feelings, expert opinions and poor research [Fenton et al., 1994]. Fenton et al. claim that quantitative data and well-designed experimental research should be used to substantiate any claims made for new or changed practices. Observing, making theories and experimenting is a formula that has been successful for other sciences such as medicine [Kitchenham et al., 2002]. Kitchenham et al. believe it is time for software engineering to embrace this practice.

2.1.2 What is ESE?

Empirical software engineering can be defined as collecting data, doing statistical research, and then using the results in order to reject or not reject a hypothesis. However, empirical engineering is not a complete science with set standards. It is still being developed, and there are several suggested templates that compete with each other to become a standard [Kitchenham et al., 2002, Basili et al., 1986]. Empirical software engineering is still in its early development, and for it to mature it might be useful to perform empirical experiments en masse. This could reveal trends and indicate what works and what doesn't.
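
The reject-or-not-reject step in the definition above can be illustrated with a small sketch. It computes the statistic of a two-sample t-test assuming unequal variances (Welch's test), the kind of test listed in this report's table list. The function name `welch_t` and the two samples are assumptions made purely for illustration; the numbers are hypothetical and are not data from the Gentoo experiment.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, df) for a two-sample t-test assuming unequal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variance (n - 1 denominator) of each group, divided by its size.
    var_a = statistics.variance(sample_a) / n_a
    var_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical handling times in days; NOT data from the Gentoo experiment.
before = [10, 12, 14, 16, 18]
after = [6, 8, 10, 12, 14]

t_stat, dof = welch_t(before, after)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")  # prints "t = 2.00, df = 8.0"
```

For these made-up samples, the two-sided 5% critical value of the t distribution with 8 degrees of freedom is about 2.306. Since 2.00 is smaller, the null hypothesis of equal means would not be rejected at that level, even though the sample means differ.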
Software engineering is not like manufacturing; its technologies are human based. It is hard to build models and verify them with experiments. Reasons for this are the many variables, the varying environments and the evolving technologies [Basili, 1996].

2.1.3 Why experiment?

Experiments can be used to test theories and to explore. Experimentation can help create a base of knowledge about the software in the experiment. This helps determine which theories, tools and methods are adequate. By experimenting, new, useful and unexpected insights may be gained. Whole new areas of investigation can be revealed. Tichy [1998] claims that in areas where engineering progress is slow, experimentation can push through. Experimenting can quickly eliminate fruitless approaches and erroneous assumptions, thereby accelerating progress. It can also orient engineering and theory in promising directions. The experimenting process itself can also produce results and knowledge, both in the area being experimented in and in the techniques used [Tichy, 1998].

Evidence that new software is superior to old is not often provided. Statements like "Productivity gains of 250%" and "Time to market reduced by half" might seem very tempting, but are often not backed up with statistical data. This makes it difficult to differentiate valid claims from invalid ones [Fenton et al., 1994]. Making such claims and being able to back them up with empirical data can give an advantage in the business market.

From a business perspective, it is necessary to develop products and processes that can help create quality systems productively and profitably, e.g., estimate the cost of a project, track its progress and evaluate the quality of a product [Basili, 1996]. These models of processes and products should be tailored based upon the data collected within the organization, and should be able to evolve continually based upon the organization's evolving experiences [Fenton et al., 1994].
However, empirical experimentation does not come for free. Experiments and data gathering need resources and manpower that could be used elsewhere. There are also direct expenses such as equipment and training. Many managers also dread the fact that experiments may take a long time to become profitable compared to other types of development. This is especially important in software engineering, as technology moves business and its boundaries extremely fast.

As Figure 1 [Zelkowitz & Wallace, 1998] shows, few experiments are done. Most of the papers either have no experimentation or use assertion to validate their claims. This looks grim for the software industry, as it seems to fool itself by posting all sorts of claims without being able to prove them. However, there are several positive trends. The percentage of papers with no experimentation almost halved from 1985 to 1995. Papers based on assertions also decreased in the same period, and the number of papers validated with case studies and lessons learned has risen. In fact, almost all the validation methods were used to a greater extent in 1995 than in 1985.

Figure 1: Overview of validation methods [Zelkowitz & Wallace, 1998]

2.2 How to perform Empirical Software Engineering

The following section describes some of the methods and techniques used to carry out empirical software engineering.

2.2.1 Introduction

In our preparations for this section of the report, we have studied work from different contributors in the software engineering research community. We have noticed that different theories and proposals are suggested. However, the authors seem to agree on one thing: the need for further work on, and more emphasis on, the empirical part of software engineering [Basili, 1996, Kitchenham et al., 2002].
One of the main challenges is to create a credible empirical discipline for software engineering with satisfactory guidelines for the research and reporting processes. It is claimed that empirical studies in software engineering research have not had the same success as in other parts of modern science [D. Perry et al., 2000]. This is widely discussed in different articles, and possible reasons are presented. N. Fenton et al. [1994] claim that software engineering research got off to a bad start. They characterise many of the published articles as “analytical advocacy research” with poor experimental and statistical design. Victor Basili [1996] mentions the differences between software engineering and other fields like physics, medicine and manufacturing, where empirical research is widespread. These differences could be the reason for the lack of success in software engineering. Basili also suggests that the distinctive characteristics of software projects often make it hard to compare different studies.

As software engineering doesn’t have long traditions in empirical research, parts of the research community have looked to other fields for ideas. This has resulted in both guidelines and templates for designing, conducting and evaluating empirical studies. One of the first articles to emphasise the need for experimentation in software engineering was published by Basili, Selby & Hutchens in 1986 [Basili et al., 1986]. This article includes both a framework for analysing and designing experimental work in software engineering, and recommendations for performing future experiments. The framework consists of four categories, each corresponding to a phase of the experimentation process: definition, planning, operation and interpretation. This article has been the inspiration and source for a lot of the research in the software engineering area.
2.2.2 The evolution

As mentioned above, a lot of the work in the software engineering research community has aimed at developing guidelines and templates for empirical research. In the past, software engineering tended to be driven by technology development and advocacy research. This is not acceptable in the long run if control of software development is desired. To gain this control, the ability to evaluate new methods and techniques before using them is necessary [Wohlin et al., 2000]. This can be achieved by performing empirical studies like surveys, experiments and case studies, thereby turning software engineering into a science.

This issue is covered by Fenton et al. [1994], where the authors present some suggestions to improve software engineering research practices. They emphasise the importance of claims based on valid evidence. To achieve this, the authors state that:

Five questions should be (but rarely are) asked about any claim arising from software engineering research:
· Is it based on empirical evaluation and data?
· Was the experiment designed correctly?
· Is it based on a toy or a real situation?
· Were the measurements used appropriate to the goals of the experiment?
· Was the experiment run for a long enough time?
[N. Fenton et al., 1994, p. 87]

The authors then examine each question in detail and present examples from real projects to illustrate the consequences of ignoring these questions. According to the article, evaluative research must involve realistic projects with realistic subjects. The proposed hypotheses have to be tested against satisfactory data. This is a time-consuming and expensive task, but it is a necessity for any valid empirical analysis. In addition to having satisfactory data, i.e. enough and valid data, the design of the experiment itself has to be correct. One way of reducing the risk of a faulty design is to use appropriate guidelines and to gain experience by carrying out several empirical experiments.
When it comes to the question of toy versus real situations, the authors call attention to the cost of carrying out a large-scale study. Cost and time constraints are often the reasons why many software engineering researchers choose to conduct experiments based on artificial problems in artificial situations. This creates a new problem: the results from toy studies cannot unconditionally be scaled up to larger and more realistic situations. But such experiments are not without value, even if their results are not conclusive. They can indicate directions for further investigation, meaning that it is often better to perform a small-scale experiment than none at all.

When defining the measurements used in an experiment, Fenton et al. [1994] emphasise the importance of measuring the correct attributes. If this is not done appropriately, wrong conclusions might be reached. In addition, the choice of scales is crucial. If the combination of scale and statistical technique is wrong, the researcher is in deep water. To support this assertion, they present a study performed at IBM on the relationship between faults and failures in software. This study claims that focusing on faults instead of failures can be fatal.

L. C. Briand et al. [2002] take a closer look at measurement definition. They point out that the principles and methods of software measurement are currently being defined and consolidated. They also claim that few of the measurements presented in publications are actually used in industry. This is due to several problems, which the authors point out. The article includes a proposal for defining measures that will be appropriate in software engineering, but the authors do not expect to find any generally valid quantitative laws. This is regarded as an ideal, long-term research goal. The measure definition process proposed in this article is based on the Goal/Question/Metric (GQM) paradigm with some extensions.
The authors have named their proposal GQM/MEDEA (GQM/MEtric DEfinition Approach), which is an exhaustive process. The high-level structure of GQM/MEDEA can be summarised in four steps: setting of the empirical study, definition of measures for the independent attributes, definition of measures for the dependent attributes, and hypothesis refinement and verification. Figure 2 shows this high-level structure.

Figure 2: High-level steps of GQM/MEDEA [L. C. Briand et al., 2002]

1. Setting of empirical study

The first step is the setting of the empirical study, which is done in two main tasks. As the authors indicate: “The definition of the measurement goals and empirical hypotheses are the fundamental phases since all the other steps in our approach are affected by them” [L. C. Briand et al., 2002, p. 1111]. Based on knowledge of the corporate objectives, development environment and available resources, measurement goals are defined. It is essential that the corporate objectives are prioritised to increase the probability of receiving adequate support. This information is merged with information about the specific environment, and results in tactical goals. These goals are more specific than the corporate objectives and, together with information regarding resources, are the foundation for the definition of the measurement goals.

The process of defining the measurement goals is quite exhaustive, and the authors approach it by using GQM. The suggested template includes five goal dimensions that are meant to help the researcher in the task. These goal dimensions are: object of study, purpose, quality focus, viewpoint and environment. Each dimension gives guidance in determining the measurement goals. The authors claim that "A hypothesis captures one’s own intuitive understanding of the studied phenomena and needs to be explicit so it can be discussed, questioned, and refined" [L. C. Briand et al., 2002, p. 1114].
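The five goal dimensions named above can be written down as a small record, which makes it easy to check that no dimension has been left out. This is only a sketch: the class name `MeasurementGoal`, its methods and the example values are hypothetical, and the sentence pattern in `as_sentence` is a commonly used GQM phrasing rather than a quotation from the article.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MeasurementGoal:
    """One GQM measurement goal, described along the five goal dimensions."""

    object_of_study: str
    purpose: str
    quality_focus: str
    viewpoint: str
    environment: str

    def as_sentence(self) -> str:
        # Render the goal as one readable sentence (assumed phrasing).
        return (
            f"Analyze {self.object_of_study} "
            f"for the purpose of {self.purpose} "
            f"with respect to {self.quality_focus} "
            f"from the viewpoint of {self.viewpoint} "
            f"in the context of {self.environment}."
        )

    def missing_dimensions(self) -> list[str]:
        # Names of dimensions that were left empty or blank.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical example values, not taken from Briand et al. or from this report.
goal = MeasurementGoal(
    object_of_study="the bug-handling process",
    purpose="evaluation",
    quality_focus="efficiency",
    viewpoint="the project management",
    environment="a volunteer-driven open source project",
)
print(goal.as_sentence())
```

Writing the goal out in this way forces each dimension to be stated explicitly before any measures or hypotheses are derived from it.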
The process of defining the hypotheses isn’t covered in detail by the authors, but some guidelines are given. In the definition of the empirical hypotheses, the authors emphasise that the hypotheses should be stated in terms of measures. They also define just one hypothesis per issue. This results in hypotheses that differ from the hypotheses used in a statistical test, which requires both a null hypothesis and an alternative hypothesis. In addition, the statistical hypotheses are defined in terms of measures. The empirical hypotheses are refined later in the process, when the measures are defined.

2/3. Definition of measures for independent/dependent attributes

After completion of the first step, the definition of measures for independent attributes is next. This phase uses the hypotheses from the previous phase, in addition to the process and product information, to come up with the measures needed. For all the attributes of each of the entities appearing in the empirical hypotheses, appropriate measures must be identified. This is an exhaustive process which includes formalising independent attributes and identifying abstractions for measuring independent attributes. It also instantiates and refines properties for measures of independent attributes. When the measures are defined, the phase is wrapped up with a validation of the measures. The definition of measures for dependent attributes follows an identical path, and is often a bit easier as the dependent attributes are usually more tangible.

4. Hypothesis refinement and verification

The last step in the measure definition process is hypothesis refinement and verification. After the definition of the measures for the dependent and independent attributes, some refining of the original empirical hypotheses might be necessary. This will hopefully result in more precise hypotheses that are consistent with the initial ones. When this is done, data gathering is the next task.
It is crucial that the data collected is consistent with the defined measures, and that adequate information is gathered to carry out the empirical validation.

Another issue mentioned above that is important for the research is the duration. If the study isn’t carried on long enough, misleading results may appear. By violating this requirement, the researcher might interpret the data incorrectly and reject the wrong hypotheses. This can, for example, happen if he/she credits some initiative as the cause of a change in the data, while the change is actually a tendency that began before the start of the data gathering.

2.2.3 Alternative approaches

As mentioned above, many articles have focused on the difficulties of performing empirical studies in software engineering and tried to reveal their causes. Perry, Porter and Votta [D. E. Perry et al., 2000] have another point of view on this matter. They claim that the main problem is the gap between the studies performed and the goals that these studies try to achieve. To deal with this problem, they continue, better designs and more credible interpretations are needed. According to the authors, the structure of an empirical study should include the following components:
· research context
· hypotheses
· experimental design
· threats to validity
· data analysis and presentation
· results and conclusions

In addition to using this structure, the authors claim that the most important thing a researcher can do is to ask insightful questions. They also point out that the quality of many computer science experiments could be improved by involving others with qualifications and experience. This particular issue is covered in other articles as well. A comprehensive effort made by a group of software engineering researchers and statisticians is presented in the article “Preliminary Guidelines for Empirical Research in Software Engineering” [Kitchenham et al., 2002].
The authors have based their work on publications in the medical and psychological fields, and tried to merge this with their own experiences from software engineering. They examine six basic topic areas and present a set of do’s and don’ts. The areas are:
· Experimental context
· Experimental design
· Conduct of the experiment and data collection
· Analysis
· Presentation of results
· Interpretation of results

From this examination the authors have come up with a set of guidelines on how to perform future empirical research in software engineering. But the authors stress that the guidelines alone will not improve the relevance and usefulness of empirical software engineering research.

2.2.4 Our approach

Much work has been done by the empirical software engineering research community. Different approaches have been taken and a wide range of solutions has been proposed. We chose to use “Experimentation in software engineering, an introduction” [Wohlin et al., 2000], which is the textbook in one of our classes, as a template. This book was released a few years ago, and many of the articles we have reviewed above were used as sources and inspiration during the compilation of the textbook. It is also convenient that this book addresses experimentation in software engineering in particular.

The experiment process can be divided into five main activities and is illustrated in figure 3 below. There is no requirement that an activity has to be finished before the next one begins, but the order in the model indicates the order in which the activities start. The five main activities in Wohlin’s experiment process are quite similar to the basic topic areas that Kitchenham et al. presented in their article [Kitchenham et al., 2002]. We believe that this reflects a tendency in the community, and the contributors seem to agree on many aspects of empirical software engineering.
Figure 3: Overview of the experiment process [Wohlin et al., 2000]

We will not detail the main activities in this section, but briefly describe some of them in chapters 4, 5 & 6.

2.3 Open Source Software (OSS)

The following section briefly describes open source software and some of its attributes. We continue by comparing OSS with standard commercial software. Then we briefly discuss communication in OSS communities.

2.3.1 An introduction to OSS

"The basic idea behind open source is very simple: When programmers can read, redistribute, and modify the source code for a piece of software, the software evolves. People improve it, people adapt it, and people fix bugs. And this can happen at a speed that, if one is used to the slow pace of conventional software development, seems astonishing" [Opensource].

Official open source definition [Definition]:
1. Free Redistribution
2. Source Code
3. Derived Works
4. Integrity of The Author's Source Code
5. No Discrimination Against Persons or Groups
6. No Discrimination Against Fields of Endeavor
7. Distribution of License
8. License Must Not Be Specific to a Product
9. License Must Not Restrict Other Software
10. License Must Be Technology-Neutral

OSS has gained increased support in recent years [Opensource]. In the beginning it might only have been an alternative for the computer elite and gurus, but the recent focus on usability and support has made it a serious competitor to traditional commercial software. As we interpret it, a popular argument against OSS has been the lack of support for both private households and companies. This has overshadowed the fact that the software itself is free. It has therefore long been assumed that the costs and risks of maintaining OSS outweigh the benefits of free downloads. To counter this, commercial companies like RedHat offer installation CDs, manuals and even personal support [RedHat].
This might have been the final push that convinced companies to try out Linux and open source software. Unlike open source operating systems, which had a slow start, smaller programs like open source FTP servers have flourished in the market for a long time. The Apache Web Server dominates its market with a share of almost 70 %, and this number is still increasing [Netcraft]. Although the Apache project differs from many other OSS projects in that the development process was defined before the actual development began, it can be seen as a sign of the OSS invasion of the commercial market.

OSS products often approach users in a very different way compared to commercial software. OSS applications are freely available for download on the Internet. This means that anyone, anywhere, at any time, can download the software for free, and legally use it at home or in a company. Traditionally, software had to be purchased, after which the customer received a download link, a CD in the mail or a serial number to enter into the trial version. This was logical because the companies that produced the software had the same goal as any other company: to enrich their shareholders.

That leads us to the next question: what motivates the OSS developers? Most of them obviously don’t make money on it, at least not directly. There are many different answers to this question, ranging from bazaar gift exchange [Raymond, 2001] and basic communist philosophies [Glass, 2004] to private-collective activities [Hippel and von Krogh, 2003]. The last of these suggested answers argues that programmers both contribute to the public good and simultaneously obtain private benefits in terms of learning, enjoyment and solutions to their own technical issues: a win-win situation. We believe that there may be other rewards that have not been investigated adequately.
These people spend a lot of time on chat networks like IRC; they have deep relationships and long-term friendships with other people on IRC. Being known as a skilled developer earns respect in these people's online lives. They have administrator rights on huge channels, and this means considerable power over everyone else there: hundreds, perhaps thousands of people. This might be compared to a man who is a father at home, but in his parallel work life owns a large company and influences thousands of people’s lives. Of course a developer might have less power than his mundane counterpart, but in principle there might be similarities.

2.4 The evolution of Linux and Gentoo Linux

In this section we give a brief introduction to Linux operating systems in general, but focus on Gentoo Linux. The section contains some of the historical background for the evolution of Linux, and a description of the distinct characteristics of Gentoo. The goal is to give the readers some information regarding the context of the empirical experiment performed later in the report.

2.4.1 Linux

Linux is an open source implementation of a UNIX-like operating system and was initially created by Linus Torvalds while he was studying at the University of Helsinki in Finland. Torvalds was interested in creating an operating system exceeding the standards of a small OS called MINIX, which is very similar to the powerful, interactive timesharing OS UNIX [Hyperdictionary]. After working on his project for some months, Torvalds made an announcement on Usenet to get feedback on his work. The response was overwhelming, and in September 1991 version 0.01 of Linux was released [Linus]. This was the start of a major open source project which resulted in an operating system with all the expected features like virtual memory, shared libraries, TCP/IP networking and true multitasking.
Linux was originally developed for the Intel 80386 microprocessor, but much of the platform-dependent code was later moved into platform-specific modules. Today Linux has gained middleware-like capabilities and support for a wide range of hardware architectures. This layered architecture is shown below in figure 4. In addition to the fact that it is freely distributed, Linux’s functionality and adaptability are some of the reasons that it has become probably the most popular UNIX-like OS in the world.

Figure 4: Linux architecture [Linux, Klingauf]

2.4.2 A technical overview of Linux

As figure 4 shows, the Linux kernel lies between the hardware and the software applications. The kernel is built up of many sub-elements and includes device driver support, processor and memory management features and support for many different types of file systems [UNIX, W. Knottenbelt]. A large group of developers is constantly improving the kernel and adding new features. Periodically this group releases new stable versions of the Linux kernel, and users can download these versions from servers all over the world.

2.4.3 Linux distributions

A Linux distribution is a complete Linux system. In addition to the Linux kernel, it includes a selection of packages built around it. These packages give Linux a set of compilers, libraries, utilities and other features, resulting in a full-scale, useful operating system. There is a huge number of different distributions, all with their own features and optimisations for different tasks and hardware. There are both commercial and non-commercial distributions on the market; RedHat, Debian, Mandrake and Gentoo Linux are just a few examples.

To be able to interact with the system, some sort of interface must be present. Linux supports two different sorts of command input: textual command line shells and graphical user interfaces (GUIs).
The composition of the Linux user group has changed from the early days until today. In the beginning, being a highly skilled computer user was almost a requirement to start using the system. Nowadays users with different levels of skill wish to use Linux, and this has resulted in more focus on user-friendliness and graphical environments. Many of the distributions (versions of Linux) on the market have therefore integrated a great deal of graphical user interfaces. The graphical environment can roughly be separated into two parts: the window manager and the desktop manager. The window manager controls the layout of the windows on the screen, while the desktop manager uses these windows to arrange menu bars, file managers and so on. Gnome and KDE are two of the most popular desktop managers on the market. The textual command line shell is still often used to connect remotely to a Linux server.

2.4.4 Gentoo Linux

The development of Gentoo Linux was initiated by Chief Architect Daniel Robbins in 2000. Robbins started the work because he didn't like the functionality that the other Linux distributions offered. The most fundamental issue for Gentoo "is designing a technology that allows us and others to do what they want to do, without restriction" [Philosophy, D. Robbins]. On Linux Online’s web site, Gentoo Linux has been given the following description: “Gentoo Linux is designed for the developer, power user and enthusiast. It incorporates the latest sources and technologies (such as ReiserFS and the Portage system).” [Linux Online]

Today Gentoo Linux has about 1.0 % market share of Linux distributions, and is also the fastest growing GNU/Linux distribution in terms of users [Market share]. The system is available for free over the Internet, and the install file is about 650 MB. Potential users can download a Gentoo LiveCD, which is a bootable CD that allows them to boot Linux from it.
This software detects the user’s hardware and loads the appropriate drivers during the boot process.

The Gentoo community releases a newsletter every week, and in the edition from the 8th of November 2004 a user survey was presented. This survey had gathered data from more than 9000 users, and was the first ever done. The figure below shows the results from the question: “What was the most important factor for you when choosing Gentoo?” As the pie chart shows, the package repository and the ability to customise the distribution are the main reasons for the lion’s share of the users.

Figure 5: Gentoo user survey [GWN 08.11.2004]

2.4.5 Gentoo Linux in detail

There are several attributes that distinguish Gentoo Linux from the other Linux distributions available. In the article “Gentoo Linux: The next generation of Linux” [Thiruvathukal, 2004] the author points out some of the features that give Gentoo Linux a competitive advantage compared to other distros. Thiruvathukal especially mentions Gentoo Linux’s use of metadata. This is not unique among the available distros, but Gentoo Linux takes it to another level. Gentoo Linux’s use of metadata gives the user information regarding what version of a package is installed, how that package was built, and whether a newer version is available. Thiruvathukal also mentions that the entire operating system is maintained from source code and that the user only needs to install it once. This is because upgrades are distributed continuously through the Portage system.

Modifiability has high priority in Gentoo Linux, and another feature that distinguishes Gentoo Linux from the other Linux distributions available is the Portage technology. This technology enables the user to build the entire system from source code using his/her choice of optimisations, and Gentoo is therefore called a meta-distribution.
Portage is a package management system which performs tasks like software distribution, package building and installation, and keeping the user's system up to date [Gentoo Portage]. This is done to remove many of the obstacles that users face with open source software. Take software distribution as an example: the only thing the users have to do is type a simple command to get the latest version of the system. As mentioned above, Portage also includes an installer. This feature makes it possible to customise the software and optimise it for the user’s hardware. As a result of the features that the Portage technology offers, the people behind Gentoo Linux hope that their system will cover the needs of the users.

The basis of the Portage system is the ebuild scripts. This is the format of the packages stored in the Portage system, and these scripts contain all the information required to download, unpack, compile and install a package. The ebuilds also contain information on how to perform any optional pre/post install/removal or configuration steps.

Downloading source code and compiling it locally gives Gentoo both advantages and disadvantages. The system potentially executes faster, as the applications only have to support the current system and need not be compatible with all other systems. The downside is that compiling takes time, often about two days for a Gentoo installation. This makes Gentoo very powerful, but it might need better hardware than other Linux distributions.

2.4.6 Gentoo, organizational

Gentoo Linux is an open source software project. The structure of Gentoo differs from the traditional organizational structure in the commercial world of software development. We will try to point out some of these differences in the following sections. Figure 6 shows an overview of the data flow in Gentoo.
Figure 6: Data flow diagram, Gentoo

2.4.6.1 Gentoo community

Since the start in 2000, the Gentoo development community has grown to a group of more than 250 developers [Developer list]. The title of Gentoo ‘developer’ is restricted, and a person can only use this title after being adopted by the Gentoo community. This process can be initiated in different ways. One way of getting approved by Gentoo and becoming a developer is to contribute by fixing bugs and submitting ebuilds, and thereby be recommended. It also happens that Gentoo is in urgent need of people with certain skills, and announces this in its weekly newsletter [GWN 01.11.2004]. People can then apply, and candidates satisfying the requirements are adopted.

When a person is adopted, he or she will be evaluated for some time before approval. During this period the new developer is given a mentor who is responsible for guidance, assistance and some evaluation. To manage all the processes involved in locating and adopting new developers, Gentoo has established a developer recruiters project. The members of this project have the final word in the selection of new developers. Their decision is based on feedback from the mentors and the results of a test the candidates have to pass [Monteiro et al., 2004], [Gentoo recruiters].

In August 2003 a project called Gentoo BugDay was organised by one of the developers, Brian Jackson. The motivation behind this event was to make a concerted effort to close as many bugs as possible, but also to create a setting where the users and developers could get to know each other. The participants worked together in an online chat channel on irc.freenode.net, testing, discussing and fixing bugs. The Gentoo Weekly Newsletter also says that: “…we may even have scouted a few candidates for future developers” [GWN 04.08.2003]. So it seems that this is also a gateway to being adopted as a developer.
In addition to the developers, a large number of other people contribute to the development and maintenance of Gentoo Linux by reporting bugs and submitting proposals for solving problems. This is one of the advantages of open source software development. A lot of the developers' work involves writing ebuilds and maintaining them. This is a challenging task, and since Gentoo is OSS, even more obstacles arise if developers don’t take their share of the workload or, in the worst case, become inactive.

2.4.6.2 The Herds project

In the Gentoo Linux development structure, a sub-project called the Gentoo herds project was introduced to gain better control of the ebuilds. This project aims to ensure that ebuilds are organised in groups that have maintainers, and that all ebuilds get maintainers assigned. Each herd is a collection of closely related ebuilds which a number of maintainers are given the responsibility to maintain. The maintainers are people who contribute to the development, and they are often assigned to maintain parts of the system that they have written themselves.

Figure 7: The Gentoo Herds Project

Since Gentoo Linux is OSS, and therefore a volunteer-driven distribution, high-quality documentation is vital. This is to ensure that interested users can easily get the information they need to be able to contribute to further development. To satisfy this requirement, all the documentation is gathered in one place, and users have the opportunity to report bugs or send proposals to a bug-tracking system. A project called the Gentoo Documentation Project handles all these reports.

2.4.6.3 Concurrent Versions System

Another part of the Gentoo system that is crucial in the OSS development is the Concurrent Versions System (CVS). This is a client/server system designed to keep track of changes made by different users to the same files.
This allows multiple developers to work on the same source code at the same time, and prevents work from getting lost [CVS]. Using the tool allows developers situated around the world to store their work in a central repository, and a complete history of the evolution of the system is created. CVS uses the Revision Control System (RCS) that was designed by Walter Tichy [RCS]. This is a software tool for the UNIX system. It allows an individual developer to maintain control over a certain item, such as a source file, while he or she implements and tests it.

Gentoo Linux Enhancement Proposals (GLEPs) are a particular type of text file maintained under CVS control. A GLEP is, according to Gentoo: “a design document providing information to the Gentoo Linux community, or describing a new feature for Gentoo Linux. The GLEP should provide a concise technical specification of the feature and rationale for the feature.” This means that the GLEP is the medium through which information regarding higher-level architectural subjects is distributed. The GLEP process is quite rigid, and the following criteria are stated on the GLEP website: “For a GLEP to be approved it must meet certain minimum criteria. It must be a clear and complete description of the proposed enhancement. The enhancement must represent a net improvement. The proposed implementation, if applicable, must be solid and must not complicate the distribution unduly. Finally, a proposed enhancement must satisfy the philosophy of Gentoo Linux.” [GLEP]

2.5 Reorganization of Gentoo

On the 24th of June 2003, Daniel Robbins, the chief architect of Gentoo Linux, posted a proposal for a new top-level management structure as a GLEP [GLEP 4]. In this proposal, Robbins points out some issues regarding the difficulty of tracking the status of projects in Gentoo.
Robbins describes the current situation as: “…we have no clearly defined top-level management structure, and no official, regular meetings to communicate status updates between developers serving in critical roles.” He also mentions the problem of not having clearly defined roles and scopes of executive decision-making authority for top-level developers. According to Robbins, this situation results in: “no one knows what is going on, and everyone defers to the Chief Architect for all executive decisions.”

To deal with these problems, Robbins suggests some changes to Gentoo. Firstly, he wants to alter the organizational structure of Gentoo by introducing an official top-level management structure. The chief architect and a chosen group of developers will be members of this management group. The developers will be given the title of “Top-level managers” and be responsible for communicating the status of their projects to the rest of the management group. The exchange of status reports will take place in fixed, weekly meetings. In addition, clearly defined areas of responsibility regarding the daily operations will be created.

On the 30th of June, GWN announced that Gentoo had adopted a new management structure for the Gentoo Linux Project [GWN 30.06.2003]. By adopting it, Gentoo hoped that: “…users will notice benefits as well through improved speed of delivery, increased quality control and other tangible benefits”. The final outcome of Daniel Robbins' proposal and the motivation behind the reorganization can be found in detail in the Gentoo documentation [Gentoo Management].

3 Problem statement

3.1 Research agenda

Our research agenda is to find out whether or not the reorganization of Gentoo Linux resulted in a more efficient organization. If it did, such a reorganization could be performed to improve other open source projects. If it didn't, this could be a warning for anyone thinking about doing such a reorganization.
A measure of the extent to which this reorganization was a success can also help create data on the costs and benefits of reorganizing in an open source environment. We wanted to perform an empirical experiment in order to verify our data statistically.

3.2 Focus

The focus of this project is to do an empirical study of the reorganization in Gentoo Linux. Doing the experiment with empirical engineering enables us to use known statistical templates and tests. This helps us verify to what extent our assumptions are correct. Gentoo Linux was chosen as the subject because of the PhD work of Thomas Østerlie [Østerlie]. He also works with Gentoo and is assigned as our supervisor.

3.3 Questions

We have stated one question that we want to answer in this project:

Q1: Did the reorganization in Gentoo Linux improve the efficiency of the organization?

We thought it would be difficult to answer this question directly, and therefore constructed three hypotheses that we wanted to test, so that we could draw conclusions from them to answer the question. These are listed and discussed in chapter 4.

3.4 Associated research method/process

As stated previously, we have been assigned the empirical research method. Our work process is indicated in the Gantt chart below.

Gantt diagram v0.3 (last update: 09.09.2004), weeks 35 to 48. Tasks:
Phase 1: Reading/prestudies
Phase 2: Definition
Phase 3: Planning
Phase 4: Operation
Phase 5: Analysis & interpretation
Phase 6: Presentation & package
Phase 7: Completion & refinement

Table 1: Gantt chart (small version)

The diagram has been followed very well, although we had to make some modifications as we discovered that some tasks needed more or less work than assumed. This is a scaled-down version of our Gantt chart. The full version can be found in attachment A.

4 Experiment planning

After the experiment has been defined, the planning takes place.
The aim of the planning process is to decide how the experiment is to be conducted. This is an important part of the process and a prerequisite for being able to control the experiment.

Figure 8: Experiment planning

An online project is one where the investigation is executed in the field under normal conditions [Wohlin et al., 2000]. This is the opposite of an offline project, where the research might be done after the experiment is complete or in a laboratory. Real-time data is a term we use to describe data that is collected as soon as it is generated.

4.1 Context selection

We do not classify the project as online, as we are not in the process ourselves. We monitor a process that has already happened, and the little real-time data we gather is collected in parallel with the process. According to Wohlin et al. [2000] this reduces the risks. We do not believe that we have much control over the process. We simply observe what people do and have done.

The project does not evaluate a special group of people like students or professionals. The subjects come from a variety of backgrounds. We do not know the age of the subjects.

The project addresses a real problem, not a toy problem, as we research a real organization. We don’t think the project can easily be generalized to the general software domain. This is because there are many differences between OSS organizations and commercial companies; for example, the "employees" in OSS organizations are not paid by the organization. However, the project might be valid for OSS projects and OSS-influenced companies.

In our case, several issues in the Gentoo Linux organization are compared before and after it was reorganized. As the old way of doing things ended when the new one began, it is hardly possible to use the old methods after the reorganization.

4.2 Hypothesis explanation

When our project was advertised by NTNU it had the working title "How to measure success?".
At the first briefing we were informed that the scope should be the Gentoo Linux community and the reorganization that was made in June 2003. Below, all the proposed hypotheses are listed with some attributes. An explanation of why we have used them in the report or discarded them is given in attachment B.

The hypotheses were created in a brainstorming process. After a brief introduction to ESE, OSS and Gentoo, we wrote down all the possible hypotheses that we could think of. Then all the hypotheses were discussed, evaluated and rated. We had some fundamental questions about the reorganization in the Gentoo community that were used as a basis for the chosen hypotheses. We wanted to know what effects this reorganization had on the massive Gentoo community. What was the motivation behind it, and did it fulfill its goals? Based on these questions we brainstormed the following hypotheses:

No. | Suggested hypothesis | Rate * | Source | Candidate
1 | Reorganizing has improved the efficiency of bug handling. | Good | Bugzilla | Yes
2 | Gentoo Linux will continue to exist in the following years. | Poor | Mail lists, market share, interviews | No
3 | Reorganizing led to more frequent release cycles. | Good | Inspect releases, cvs, mail lists | Yes
4 | Reorganizing led to improved communication. | Poor | Forums, interviews, newsletters | No
5 | Number of developers increased as a result of the reorganization. | Good | Developer records, mail lists, GWN | Yes
6 | Number of users and market share increased after the reorganization. | Medium | Independent sites with objective data | No
7 | Reorganization fulfilled its goals to a reasonable extent. | Bad | Community feedback, forums, mail lists, bugzilla | No
8 | New roles and scopes have simplified decision-making. | Medium | Developer experience, discussion, meeting logs, forums, interviews | No
9 | "The Cabal" and secret mail lists are negative for the OSS community. | Horrible | Fork documents, forums, official statements, meeting logs | No
10 | Meeting deadlines has improved after the reorganization. | Medium | Bugzilla, forums, dev discussions | No
11 | The number of bugs is threatening the future of Gentoo. | Poor | Bugzilla | No
12 | The increase in the total number of unsolved bugs is a threat to Gentoo Linux. | Poor | Bugzilla, articles | No
13 | The increase of unsolved bugs will eventually kill the Gentoo Linux Project. | Poor | Bugzilla | No
14 | In x years the number of unsolved bugs will leave Gentoo Linux as a non-competitive distro. | Bad | Bugzilla | No
15 | The increase of unsolved bugs doesn’t threaten Gentoo Linux. | Difficult | Bugzilla, mail, forums | No
16 | The reorganizing led to a decrease in the average time used to solve bugs. | Good | Bugzilla, forums, interviews | Yes
17 | Reorganization led to a greater share of solved bugs compared to new bugs. | Good | Bugzilla | Yes

Table 2: Suggested hypotheses
* More info on the hypothesis suggestions and rating in attachment B.

4.3 Hypothesis formulation

After the brainstorming, all the hypothesis proposals were debated and evaluated. We tried to find out whether or not the hypotheses were suited for effective data gathering. They were also evaluated on how interesting they were, and on what conclusions might be drawn from them. Then we selected the few that we believed were the best ones and tried to refine them. We wanted the cause to lead to the effect, and therefore some of them were changed. For example, in hypothesis 5, "Number of developers increased as a result of the reorganization”, it seemed difficult to determine that the increase in developers was actually caused by the reorganization.

The super-hypothesis was changed to a question that the hypotheses were supposed to answer. The question will not be evaluated directly, as a definite measurement of effect is outside the scope of this project. However, we wish to evaluate the results from the hypotheses and then draw our conclusions.
Some of the hypotheses were instantly discarded when they were first discussed, as they were little more than caffeine-fueled digressions that surfaced during the brainstorming. Others were systematically discarded as we discovered that they were too hard to back up with data. The Bugzilla bug system in particular varied in usability: it proved great for some data extraction, but other data was hard to extract. We do believe that the data in question is there; we just don’t have time to develop the tools to get it. The hypothesis we refer to is "Growing number of users report increasingly many bugs, result in more work for developers (who might not increase in numbers at the same ratio)." Testing it would require the Python script to first list all bugs, then open an individual page for each bug, fetch some data, then open another page and interpret a table.

Having said all that, we also wanted a collection of hypotheses that was real and not trivial. We believe that the ones we have chosen are meaningful and can be generalized to similar projects/companies.

Question 1:
Q1: Did the reorganization in Gentoo Linux improve the efficiency of the organization?

Hypothesis 1:
H1.0: The reorganizing did not lead to a decrease in the average time used to solve bugs.
H1.1: The reorganizing led to a decrease in the average time used to solve bugs.

Hypothesis 2:
H2.0: Reorganization did not lead to a greater share of solved bugs compared to new bugs per week.
H2.1: Reorganization led to a greater share of solved bugs compared to new bugs per week.

Hypothesis 3:
H3.0: Number of developers has no influence on the average time needed to solve bugs on a weekly basis.
H3.1: Number of developers has an influence on the average time needed to solve bugs on a weekly basis.

4.4 Variables selection

There are two kinds of variables in an experiment, independent and dependent. The figure below illustrates the information flow in the experiment.
Below, the variables in the figure are explained and justified.

Figure 9: Illustration of independent and dependent variables.

4.4.1 Independent variables

"An independent variable is a variable in a process that is manipulated and controlled" [Wohlin 2000, p. 33]. We chose "Organization structure" as our independent variable. The project compares data from before and after the reorganization. The structure of the organization should have an effect on the dependent variables because the entire aim of the reorganization was to improve the efficiency of the organization.

4.4.2 Dependent variables

The variables that we want to study to see the effect of the changes in the independent variables are called dependent variables. The effect of the treatments is measured in our dependent variable: "Efficiency". When we talk about efficiency we monitor several aspects of the organization, e.g. the time used to solve bugs and the Gentoo release cycles. The efficiency is not an exact value and is therefore measured indirectly by looking at different processes.

4.5 Selection of subjects

The automated data gathering will use the entire bug-reporting community as subjects. Therefore we cannot see any difficulties in generalizing this. The data we collect manually will have a far smaller sample pool. However, by examining all subjects within a given time period, it would be possible to generalize this (not flawlessly, though), as the selection gives a partially representative view of the population. The low number of samples will enlarge the errors if generalized.

The manual data gathering can be called systematic sampling, as we choose a period of time to sample from, and then every n:th period.

4.6 Experiment design

To achieve a better understanding of the experiment we are about to perform, a clear definition must be developed. The type of statistical analysis that we apply later in the project depends, among other factors, on the chosen design.
It is therefore an important task to describe the experiment as well as possible and define a design.

When defining the experiment design, the basis is the number of factors and treatments included. A factor is the combination of one or more independent variables that affect the dependent variables. A treatment is one particular value of the factors in the experiment. In our experiment the factor is the organization structure, and the treatments are the new and the old structure. Based on this we can determine that our experiment has the “one factor with two treatments” design.

4.6.1 Randomization

The automatic data gathering did not employ any randomization, as it gathered all the available data. When we collected data manually we let the users randomize it for us, and then took the first two bugs reported each day. This might mean that we only get bugs reported by people who are awake at the beginning of each day, but the different time zones should remedy this.

4.7 Instrumentation

Instrumentation is done to provide the means for performing the experiment, and to monitor it without affecting the control of the experiment. In our case we needed data that had to be collected in different ways. All the data was available from Bugzilla or GWN, but it had to be accessed differently. We realized that manual data gathering would be too time-consuming for some of the cases and not feasible. As a result we decided to search for ways of gathering the data automatically.

4.7.1 Python scripting

At first we chose not to spend any time learning Python and how to program data-gathering scripts. On Tuesday 28 September 2004 we wrote an e-mail to the GWN editor Ulrich Plate, asking which options to select in Bugzilla to recreate the bug data in the newsletters. This e-mail can be seen in attachment C. He responded by giving us the Python script used by the GWN staff to generate bug data.
Aided by our project guide Thomas Østerlie we were able, at least to some extent, to modify the script to collect the data we wanted. This was not at all planned, but when we received the script we just started to fiddle with it and managed to make it work. We modified it so that it extracted the total number of new bugs in a given period of time (a week) and the number of closed bugs in the same period.

Figure 10: Results from an early version of the modified Bugzilla script.

We also tried to count the total number of currently open bugs on a weekly basis. As shown in figure 10, for the "Total open bug reports" between the Bugzilla start date and a given date, the number of returned bugs is far too low. The last part, "Original total opened", counts the total number of open bugs up to today. This is the query we tried to modify, but we couldn’t get the numbers to match.

Initially we thought that this was caused by the fact that the script was run at random hours of the day: if the script was run early, bugs would still trickle in until 24.00. These bugs would not be counted in the numbers stated in GWN. However, when we tried on later dates to recreate the GWN data, the query would find all the bugs from the whole week INCLUDING the ones GWN didn’t have.

Later in this process, as we discussed our problems with Thomas, we discovered that it might not be possible to get the bugs whose status had changed, because the query did not look for status changes during the chosen time period; it checked the CURRENT status of the bugs found up to the specified date. This would mean that it wasn’t possible to find this data unless the query was run in real time during the actual week in question, more specifically at the exact time the GWN crew ran the script. If this was correct, it meant that the script wasn't able to do this search, and we would have to do it manually.
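The difference between querying the current status and reconstructing the status at a past date can be illustrated with a minimal sketch. This is not the GWN script and does not use any Bugzilla interface; the records, field names and status values below are hypothetical and hand-made for illustration only.

```python
from datetime import date

# Hypothetical, hand-made records (NOT real Gentoo Bugzilla data).
# Each bug has an opening date and a history of (date, new_status) changes.
BUGS = [
    {"id": 1, "opened": date(2003, 6, 2), "history": [(date(2003, 6, 20), "RESOLVED")]},
    {"id": 2, "opened": date(2003, 6, 3), "history": []},
    {"id": 3, "opened": date(2003, 6, 4), "history": [(date(2003, 6, 5), "RESOLVED")]},
]


def current_status(bug):
    """Status as it is today: the last recorded change, or NEW if none."""
    return bug["history"][-1][1] if bug["history"] else "NEW"


def status_on(bug, day):
    """Status as it was on `day`, replaying only the changes made up to then."""
    status = "NEW"
    for changed, new_status in bug["history"]:
        if changed <= day:
            status = new_status
    return status


def open_by_current_status(bugs, day):
    """Counts bugs opened up to `day` that are open NOW (what the query did)."""
    return sum(1 for b in bugs
               if b["opened"] <= day and current_status(b) != "RESOLVED")


def open_by_history(bugs, day):
    """Counts bugs opened up to `day` that were still open ON that day."""
    return sum(1 for b in bugs
               if b["opened"] <= day and status_on(b, day) != "RESOLVED")


cutoff = date(2003, 6, 8)
print(open_by_current_status(BUGS, cutoff), open_by_history(BUGS, cutoff))
```

In this toy data set, bug 1 was still open on the cutoff date but has since been resolved, so a count based on current status misses it. This is the same kind of undercount as the one described above, and it only goes away if the status history of each bug is taken into account.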
Provided that we are able to find this data, checking the numbers for a year or two should be manageable.

Our correspondence with the GWN crew has left us with the impression that having this data accurate isn’t a big priority. They don’t seem to mind whether 7045 or 7050 unsolved bugs are reported, and they can’t really be blamed. On the contrary, we think it is sporting of them to publicly announce Gentoo's inability to solve bugs fast enough, and frankly a little odd: we wouldn’t exactly call it good advertising when GWN reports the growing number of unsolved bugs on a weekly basis.

4.8 Validity evaluation

One important issue that appears during the experiment planning is the validity evaluation of the results. This task has to be done during the planning phase to ensure valid experiment results. Without the validity evaluation, one might end up with results that are not valid for the population from which the sample is drawn. In the past, different types of threats to the validity of an experiment have been suggested. In “Experimentation in Software Engineering: An Introduction” [Wohlin et al., 2000] four types of threats are presented. These threats are mapped to different steps of the experiment, as shown in figure 11 below.

Figure 11: Experiment principles [Wohlin 2000]

The figure presents the two areas of an experiment: the theory area and the observation area. In the theory area, the hypotheses that we want to test are defined based on data from the observation area. This will hopefully make it possible to draw some conclusions. The process of drawing these conclusions is divided into four steps, which are shown in the figure as the numbers 1 to 4. In each of these steps, one type of threat to the validity of the result is present. According to Wohlin et al. the threats are:

"1. Conclusion validity. This validity is concerned with the relationship between the treatment and the outcome. We want to make sure that there is a statistical relationship, i.e. with a given significance.
2. Internal validity. If a relationship is observed between the treatment and the outcome, we must make sure that it is a causal relationship, and that it is not a result of which we have no control or have not measured. In other words that the treatment causes the outcome.
3. Construct validity. This validity is concerned with the relation between theory and observation. If the relationship between cause and effect is causal, we must ensure two things: 1) that the treatment reflects the construct of the cause well (see left part of the figure) and 2) that the outcome reflects the construct of the effect well (see right part of the figure).
4. External validity. The external validity is concerned with generalisation. If there is a causal relationship between the construct of the cause, and the effect, can the result of the study be generalized outside the scope of our study? Is there a relation between the treatment and the outcome?" [Wohlin et al., 2000, p. 63-64]

In the following part of this section we will present a list of threats to the validity of the experiment. In addition, every threat is evaluated to determine if it might cause any problems in our experiment.
The markings used in the tables are as follows:
+: Threats that we believe will not be of any significance
/: Threats that might have an effect, but with low probability
-: Threats that could affect the result, with significant probability
n/a: Threats which are not applicable for our experiment

4.8.1 Conclusion validity

Threat | Mark
Low statistical power | /
Violated assumption of statistical tests | /
Fishing and the error rate | +
Reliability of measures | +
Reliability of treatment implementation | +
Random irrelevancies in experimental setting | +
Random heterogeneity of subjects | +

Table 3: Conclusion validity

· Low statistical power
The statistical power can be expressed as:
Power = P(reject H0 | H0 false) = 1 - P(type-II-error)
Based on the design of our experiment determined above, we will most likely perform a t-test. This test gives us the ability to determine the confidence in our statements, and thereby ensure high statistical power. However, we should be aware of this threat.
· Violated assumption of statistical tests
Our datasets will most likely be quite large, so any requirements regarding normal distribution should not be an issue. Other requirements might be violated, so we should be aware of this threat to some extent.
· Fishing and the error rate
Since the persons performing the experiment (i.e. Person, Engene) do not have any connections to the organisation investigated in this project, the probability of fishing for a specific result is low. As long as the confidence intervals of our tests are quite rigid, the threat from the error rate should not be extensive.
· Reliability of measures
In this experiment, the data is based on the number of bugs and the number of developers joining/leaving the organisation in a given period of time. These are objective and direct measures of attributes, which increases the reliability.
· Reliability of treatment implementation
The treatment that we are applying in our experiment, i.e. the structure of the organisation, is quite simple and should not lead to any differences in the implementations.
· Random irrelevancies in experimental setting
As we are using historical data in our experiment, it is hard to determine if there were any elements outside the normal setting that made an impact on the result. But the data is collected from a wide time period, and any minor irrelevancies should not influence the result in a way that will lead to the wrong conclusions.
· Random heterogeneity of subjects
The subjects that take part in the experiment are the developers/maintainers in Gentoo and the users of the distribution. These subjects are chosen by randomisation and should not pose any threats.

4.8.2 Internal validity

Threat | Mark
History | +
Maturation | /
Testing | +
Instrumentation | +
Statistical regression | +
Selection | +
Mortality | /
Ambiguity about direction of causal influence | /
Interactions with selection | n/a
Diffusion or imitation of treatments | n/a
Compensatory equalization of treatments | n/a
Compensatory rivalry | n/a
Resentful demoralization | n/a

Table 4: Internal validity

· History
The data in this experiment is collected from a wide period of time, so potential historical influences will neutralize each other and not affect the final results.
· Maturation
Since the experiment collects data from a long period of time, it is possible that some of the developers/maintainers will get bored or lose motivation for bug fixing.
· Testing
This experiment is based on historical data. There is no danger that the subjects know about the test and therefore perform differently.
· Instrumentation
The data is collected quantitatively by using queries in Bugzilla, so the experiment should not be affected negatively by badly designed instrumentation.
· Statistical regression
In our experiment all subjects involved are one big group, and their participation is included completely.
This should prevent the influence of regression that might be a problem when the subjects are classified into experimental groups.
· Selection
We have included all the persons involved in bug reporting/fixing in the Gentoo community. The effect of selection, i.e. that the selected group is not representative of the whole population, will therefore not be present.
· Mortality
As this experiment collects data from a long period of time, we believe that some of the initial developers and users in general have left the Gentoo community. These people might have had a higher motivation for contributing to the community than the new users/developers have. Another issue is the additional developers and users who have joined the community continually. All this might have an effect on the historical data that can influence the experimental results.
· Ambiguity about direction of causal influence
There might be factors other than the reorganisation that affect the outcome of our experiment. This could violate the validity of our statements. As an example, it might be hard to prove that an effect is caused by the reorganisation and nothing else.
· Interactions with selection
As our experiment doesn’t involve multiple groups, the threat due to different behaviours in different groups is not present.
· Diffusion or imitation of treatments
There are no control groups in this experiment, so the possible threats connected to diffusion or imitation of treatments are not present. In this experiment the whole population is included, and it is the same group that is evaluated before and after the reorganisation.
· Compensatory equalization of treatments
See Diffusion or imitation of treatments above.
· Compensatory rivalry
See Diffusion or imitation of treatments above.
· Resentful demoralization
See Diffusion or imitation of treatments above.
4.8.3 Construct validity

Threat | Mark
Inadequate preoperational explication of constructs | +
Mono-operation bias | +
Mono-method bias | +
Confounding constructs and levels of constructs | /
Interaction of different treatments | /
Interaction of testing and treatment | +
Restricted generalizability across constructs | n/a
Hypothesis guessing | +
Evaluation apprehension | +
Experimenter expectancies | +

Table 5: Construct validity

· Inadequate preoperational explication of constructs
In the selection of the hypotheses made above, we tried to separate the ambiguous and inadequate hypotheses from the rest. As a result, the hypotheses that are chosen are well formulated and the threats avoided.
· Mono-operation bias
The whole Gentoo community is included in the experiment, therefore any possible threats regarding mono-operation bias are avoided. This is also the case with the objects, as every bug from the start of Bugzilla is inspected.
· Mono-method bias
By measuring bugs, release cycle and the pool of developers, we involve different types of measures and observations that can be cross-checked against each other. This avoids the risks tied to mono-method bias.
· Confounding constructs and levels of constructs
As we’re not detailing all the aspects of the bug handling process, some factors like developer experience aren’t measured. This could influence the result, but hopefully the randomisation of subjects will even this out.
· Interaction of different treatments
The subjects involved in our experiment might be involved in other OSS projects as well. It is therefore possible that this has an influence on our results.
· Interaction of testing and treatment
The subjects involved in the experiment don’t know that they are participating in an experiment, and the data was collected after the reorganisation.
· Restricted generalizability across constructs
There is always the possibility that the reorganization did have some negative effects, but that is outside the project's scope.
· Hypothesis guessing
People don’t know that they are part of an experiment and therefore they will not base their behaviour on our hypothesis.
· Evaluation apprehension
Again, the subjects don’t know about our experiment and will not fear our evaluation and results.
· Experimenter expectancies
Subjects’ unawareness of the experiment prevents them from biasing the results.

4.8.4 External validity

Threat | Mark
Interaction of selection and treatment | -
Interaction of setting and treatment | -
Interaction of history and treatment | +

Table 6: External validity

· Interaction of selection and treatment
We don’t know if the subjects included in the study are representative of other commercial and non-commercial organisations.
· Interaction of setting and treatment
This is not a toy problem; we use the same tools that all the developers use. Bugzilla is an open source project that is used by a number of OSS projects. However, Gentoo Linux is an open source project. Generalizing the result to industrial practice in both commercial and non-commercial projects might be difficult.
· Interaction of history and treatment
The experiment is run over a long period of time, therefore the data should be representative.

4.9 Priority among types of validity threats

In this project, some of the scope was given to us in advance. This relates to the choice of organisation and process improvement initiative being studied. The aim of this project is not to generalize our result to industrial practice, but to complete an experiment with emphasis on conclusion, internal and construct validity. As a result, the external validity has suffered.

5 Experiment operation

In this chapter we will detail the operational phase of the experiment.
This section documents the task of carrying out the experiment in accordance with the design defined in the previous chapter. The aim is to give the reader adequate information regarding our execution of the experiment and validation of the data collected.

Figure 12: Experiment operation [Wohlin et al., 2000]

5.1 Introduction

Even a perfectly designed experiment can go seriously wrong if the operational phase is conducted with a lack of accuracy. As figure 12 shows, the operational phase of an experiment consists of three steps: preparation, execution and data validation. Each of these steps will be described in detail in the following section, except the preparation step. As the only participants in this experiment are the authors of this report (according to Wohlin et al.), some of the aspects of the preparation will not be applicable. This applies to the challenges regarding inducements, deception and obtaining consent from the participants [Wohlin, 2000].

5.2 Experiment preparation

During this phase, the last preparations prior to the execution of the experiment were accomplished. The first hypothesis that we wanted to collect data for was hypothesis 2. As mentioned in the previous section, we used a script to collect this data. So the only preparation we did, in addition to the alteration of the script, was to determine the scope of the data collection. As it didn’t cost us any extra effort to include data from all the weeks Gentoo has been using Bugzilla, we decided to do this. By including all this data we also hoped that any tendencies would become even more distinct.

The preparations for hypothesis 1 were a bit more extensive. The first issue that we looked into was the scope of the experiment. As the goal of the data gathering was to test our hypotheses and see which of them we could reject, any data exposing a possible influence of the reorganisation would be of interest.
Based on this we decided to include data gathered from the 1st of January 2003 until the 31st of July 2004. This data would hopefully give us a basis for comparing the efficiency of bug handling before and after the reorganisation. The reason why we intended to include data from more than a year after the reorganisation was to see if there had been an instant change in the bug handling time that faded away over time, or if the possible change still exists.

The next thing we had to decide was the format in which we should collect the data. As Bugzilla provides a table for every bug called Bug activity, as shown below, we had to come up with some way of categorizing the bug handling time.

Figure 13: Bug activity log [Bugzilla]

After some discussion, a scale with range 1-5 was drawn up. This scale is shown below. By using this scale we hoped that the analysis of the data would be simplified. We also believed that a finer granularity would just cause us more work and not give us any valuable increase in accuracy, since we were only trying to expose tendencies in bug handling time.

Bug handling time     Category
Less than 1 day       1
Less than 3 days      2
Less than 7 days      3
Less than 1 month     4
More than a month     5
Table 8: Scale for categorizing the bug handling time

When the period of time and the granularity were defined, we had to decide the size of the sample. As we had no idea how long it would take to gather the data, a test gathering was carried out. The test indicated that collecting one bug per day for 10 days would take one person approximately 7 minutes. We were not sure at this time if we should collect one or two bugs per day, so some rough estimates were made to help us in the decision making. The estimates concluded that it would take about 6 hours of effective work for one person to collect one bug per day, and 12 hours to collect two bugs. After some discussion we decided to include two bugs per day in the analysis.
This decision was based on the fact that 2 bugs per day would give us a more solid basis for the statistical test. We also decided to make a minor alteration to our scale for bug handling time after the test. It seemed that about 50 percent of the bugs fell into category 5, so we added one more category and ended up with the scale shown below.

Bug handling time      Category
Less than 1 day        1
Less than 3 days       2
Less than 7 days       3
Less than 1 month      4
Less than 3 months     5
More than 3 months     6
Table 9: Final scale for categorizing the bug handling time

As every bug in Bugzilla is given a priority from 1 to 5, we decided to include this when we collected the data. Our motivation for this was to see if there were any significant differences in the bug handling time between bugs with high and low priority.

The last hypothesis we made preparations for was hypothesis 3. To test this hypothesis we needed data that would show the evolution of the number of developers in the Gentoo community. In addition, information about the amount of time used to solve a bug was needed. To get the first data we decided to use Gentoo Weekly Newsletter (GWN) and its section "Moves, Adds and Changes". The reason for this was that it would be difficult to find this information using the script without major alterations. As we are not particularly experienced in writing scripts, this alteration would be too difficult for us. Every week GWN presents the previous week's changes in the developer list. An example of this information is shown in the figure below. To gather this information we manually went through all the released newsletters and registered the adds and moves.

Figure 14: "Moves, Adds and Changes" from GWN published 30th June 2003

All the tasks during the preparation phase were accomplished while both Anders and Knut Steinar were present.
We believed that this would reduce the risk of misunderstandings regarding the procedures for the data gathering, and illuminate as many aspects and challenges of the following tasks as possible.

5.3 Experiment execution

After the hypotheses were created, they were refined to depend on data that was gatherable within the scope of this project. Q1 did not directly need any distinct data, as it would be answered based on the conclusions from the hypotheses.

5.3.1 Data collection

In the following section, the data collection for each hypothesis is discussed.

Hypothesis 1
The hypotheses were much more specific in their demands compared to Q1. To answer the first one we needed to find out how much time was used to solve individual bugs. This information could be found in Bugzilla; however, it required quite a few clicks and some scrolling to find the time used on each bug. We decided to do it manually, as discussed in a previous section. To ensure sufficient validity and accuracy in the data pool, quite a few bugs had to be examined. When discussing how many bugs we wanted to analyze, we considered several aspects. We did not want to use too much time on this single hypothesis because we felt time was running short. Yet we needed quite a few bugs to make the data representative and generalizable. The process of deciding the research pattern and guidelines is discussed above in the experiment preparation section. The result was that we chose to gather 2 bugs every day from 01-01-2003 to 18-07-2004. In order to catch up, we decided to sacrifice a weekend. In the end we had examined 1130 bugs. At first we also inspected bugs reported in August, but then we found out that none of these bugs could be in category 6 (more than 3 months old), as it obviously was less than 3 months from August to October. Therefore we chose not to include any bugs found later than 18-07-2004.

Hypothesis 2
Data for the second hypothesis was found with a Python script.
This script was obtained by accident, as one of the staff in GWN sent it to us when we asked him some questions. The script queried Bugzilla and returned the results. We modified it to collect the number of new and closed bugs every week from 04-01-2002 to the current date, and to write the data to an Excel file. The fact that we could use a script to gather this data let us use all the data available, and not only a selection.

Hypothesis 3
In order to come to a conclusion on this hypothesis we calculated how many developers there were in a given week. Then we compared this number with the average time used to solve bugs on a weekly basis. The only place we found data about the number of developers was GWN. This restricted the scope of the experiment, as its duration would be from the date GWN started until fall 2004. The bug-solving data was collected for the first hypothesis and only required us to calculate the weekly average bug-solving time. The data from GWN was gathered manually. This was done by downloading all the GWN editions, reading them and copying the developer data to an Excel worksheet.

5.3.2 Different methods

The first hypothesis was not very suited for automatic data gathering. Or perhaps it was, but not for someone with our lack of Python experience. Each bug required several clicks and some scrolling, and then a table had to be interpreted. We did not believe that we could create such a script given the time limits that were upon us. Gathering the data manually was quite troublesome; we spent about 12 hours staring at the screen and copying data from the screen to our worksheet. Initially we wanted to inspect 500 bugs before the reorganization and 500 after. We chose to check 1000 bugs because it was quite a lot of bugs, and although the task of manually inspecting them bordered on madness, we felt that the task was manageable. Manual data gathering is very open for errors.
This, along with the small pool of inspected bugs, is probably the greatest threat to validity. The second hypothesis, however, enabled us to gather all the data automatically. This was of great help, as we did not have to select some weeks to extract samples from. In 60 minutes the script inspected and saved all the bugs from the day Bugzilla was implemented to the current date. We did spend a couple of days editing the script, but the results were worth it. When a script does the data gathering it doesn't err, provided it has been configured right; but if it is not properly configured it will only create bad data.

5.4 Data validation

This section deals with the degree of validity of the different data sources. The purpose of validating the data is to ensure that it is reasonable and that it has been collected correctly.

5.4.1 Data source integrity

During the project we have collected several types of data: number of developers, open bugs, closed bugs and time needed to solve bugs. We believe that the integrity of the data is good, as it has come from GWN and Bugzilla. Although both sources are vulnerable to human errors, we believe that GWN takes pride in not misleading its audience and the developers. The weekly numbers of open and closed bugs are published in the GWN. We could not base our data gathering solely on these numbers, as GWN didn't start until 23-12-2002. Therefore we used the script to gather data for all of 2003 and up to the current date in 2004. To verify that the script was correct, we compared the published results from GWN with the data that our script produced. The numbers were identical from the start of GWN until the current date; therefore we presume that the script also has correct data for the period before GWN.

5.4.2 Bugzilla

Bugzilla is a database with a web interface where Gentoo users can report bugs. We had to assume that the bugs reported were valid and did not have the time nor the knowledge to test this ourselves.
But the fact that developers use Bugzilla as a tool supports our assumption. Still, it is likely that these manual bug reports contain errors such as incomplete descriptions of bugs and misunderstandings. There is also a high probability that many identical bugs are reported multiple times by different users. To prevent this there are guidelines that are supposed to help users describe the bugs in a correct manner. Developers then manually compare bugs and close duplicates. When developers solve bugs they report how much time is used; this data we collected manually. We had to assume that the hours and days reported were correct, as we had no way of querying the developers about time usage on bugs fixed months or years ago.

5.4.3 Manual bug inspection

Our own manual research is also vulnerable to errors. The manual data gathering was very repetitive and, frankly, quite boring. The fact that we completed the gathering during a weekend and in long sessions lowered morale and enthusiasm. We were therefore more susceptible to making errors, as we were tired. However, the procedure of data gathering itself didn't give much room for errors. We increased the date by 1 day, pressed search, chose a bug, scrolled to a link and clicked it, and then read the contents of the time table. The chance of reading the same bug twice is very small, as its link would have changed color if it had been inspected earlier. The manual interpretation of the time table was another possible threat to the validity. We calculated the difference between the start and end dates and gave the bug a number from our scale. The calculation was done in our heads, and although it is fairly easy mathematics, it is still a possible error source. The GWN inspection for the third hypothesis did not leave much room for errors. The GWN pages were loaded and the developer data copied and pasted into an Excel sheet.
If there are any errors in the data they might come from the GWN editors or the tool they use to gather their data. The calculations of the average time needed to solve bugs were done in Excel and generalized for each week. So if the first calculation was wrong, then all of them are. But we have double- and triple-checked the calculation and haven't found any errors.

5.4.4 The participants

In this case the participants are the people submitting the bugs, and there is no easy way of knowing how well they understand the Bugzilla interface. But the fact that Bugzilla is doing well as a development tool suggests that most users know how to use it reasonably well. The seriousness of the participants cannot be checked either, but we have found no indications of false reports, and feel forced to assume that most of the bugs are real.

5.4.5 Information included in the collected data

In the experiment we did not include information regarding who did the bug solving. In other words, we did not check if most of the bugs were solved by a specific group of developers or other contributors. This might have had an effect on our conclusions. The number of developers might not be the deciding factor when looking at the rate at which bugs are solved. There is a possibility that developers are included in the lists long after they become inactive. As a result, we can't be certain if the number of developers is a problem regarding the efficiency of Gentoo.

5.4.6 Possible improvements

In retrospect there are some things we would have done differently. When gathering data manually we didn't really have any clear rules on which bugs to count and how to interpret them. This might have caused some deviation, because Knut Steinar sometimes used the search day as the start date for bug treatment while Anders exclusively used data from the table. As Knut Steinar gathered data both before and after the reorganization, we believe that any deviations will cancel each other out.
However, this method differs slightly from Anders' research in 2004. We don't think it will have a large impact, because we look for tendencies and don't use the data directly. This was probably only an issue for 10% of the bugs, and the given scale number did not often deviate much. The scale we made was created in about 10 minutes and evaluated in 5; although it seems to work well, it could have been more thoroughly thought through. If possible, the manual bug gathering would be done at a time when one of the project members hadn't broken his right arm.

6 Analysis and interpretation

In order to draw valid conclusions we must interpret the experiment data. The interpretations have been carried out as shown in the figure below. The aim of this chapter is to present the hypotheses and whether or not they should be rejected.

Figure 15: Analysis and interpretation

6.1 Descriptive statistics

After the data gathering was completed, analyzing the data was next on the schedule. As mentioned above, each hypothesis required different data. To get a feeling for how the different data sets were distributed, a preliminary phase to the hypothesis testing was carried out. In this phase we tried to visualize central tendencies to better understand the nature of the data.

6.1.1 Hypothesis 1

The formulation of this hypothesis is as follows:
H1.0: The reorganizing did not lead to a decrease in the average time used to solve bugs per week.
H1.1: The reorganizing led to a decrease in the average time used to solve bugs per week.

The foundation for the testing of this hypothesis was the handling time of more than 1100 bugs in the determined period of time. These data can be found in attachment D. Initially we visualized the data by making a diagram that showed all the bugs with their handling time. This diagram is shown in figure 16.
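The handling-time categories behind these data follow the final scale in Table 9. As a minimal sketch of that mapping, the Python function below assigns a category from a start and an end timestamp. The function name, and the choice to treat "1 month" as 30 days and "3 months" as 90 days, are our own assumptions for illustration; in the project the calculation was done by hand.

```python
from datetime import datetime

# Upper bounds in days for categories 1-5 of the final scale (Table 9).
# Assumption: "1 month" is taken as 30 days and "3 months" as 90 days.
CATEGORY_UPPER_BOUNDS = [(1, 1), (3, 2), (7, 3), (30, 4), (90, 5)]


def handling_category(opened: datetime, resolved: datetime) -> int:
    """Return the handling-time category (1-6) for a single bug."""
    days = (resolved - opened).total_seconds() / 86400
    if days < 0:
        raise ValueError("resolved must not be earlier than opened")
    for upper_bound, category in CATEGORY_UPPER_BOUNDS:
        if days < upper_bound:
            return category
    # Anything at or beyond the 90-day bound falls in the last category.
    return 6
```

A bug opened on 01-01-2003 and resolved on 21-01-2003 would, under these assumptions, be placed in category 4. Bugs that land exactly on a boundary are pushed to the higher category here, which is one of several defensible choices.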
Figure 16: Diagram that illustrates the handling time for each of the inspected bugs (bug handling time 01.01.2003-19.07.2004)

The scale on the y-axis is identical to the scale that we introduced prior to the data gathering, and the x-axis indicates the number of the bug. This diagram is hard to interpret and makes it difficult to see any tendencies. Based on this, we decided to calculate the average handling time for each of the periods before and after the reorganization. The initial intention for our project was to evaluate the change in efficiency in the organization as a result of the reorganization. We assumed that analyzing the data prior to and after this initiative collectively would expose some of the central tendencies. Based on the calculations we made, the diagram in figure 17 was generated.

Figure 17: Average handling time (1: prior to reorganization, 2: after reorganization, 3: whole period)

The scale on the y-axis is based on the scale that we introduced prior to the data gathering, only with finer granularity. We found that this diagram clearly exposed a difference in the handling time for the two periods. The handling time after the reorganization seems to have decreased by about 20% compared to the period prior to the initiative. This result made us curious, so we decided to see if this decrease was evenly distributed across the scale that we had introduced. By doing this, we hoped to find out if there was a certain category of bugs that had changed. The results of this calculation are shown in table 10.
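The comparison described above can be sketched in a few lines of Python, using the category counts reported in Table 10. This is an illustration only: it assumes that the average handling time is the plain mean of the category numbers, and the variable and function names are our own. The original calculations were done in Excel.

```python
# Number of bugs per handling-time category, as reported in Table 10.
PRIOR = {1: 90, 2: 20, 3: 17, 4: 53, 5: 63, 6: 107}
AFTER = {1: 293, 2: 67, 3: 62, 4: 138, 5: 111, 6: 109}


def shares(counts):
    """Fraction of all bugs that falls into each category."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def mean_category(counts):
    """Mean category number, weighted by the number of bugs per category."""
    total = sum(counts.values())
    return sum(category * n for category, n in counts.items()) / total


prior_mean = mean_category(PRIOR)
after_mean = mean_category(AFTER)

# Relative decrease in the mean category after the reorganization.
relative_decrease = (prior_mean - after_mean) / prior_mean

print(f"prior mean: {prior_mean:.2f}, after mean: {after_mean:.2f}")
print(f"relative decrease: {relative_decrease:.1%}")
```

Under this assumption the mean category drops from roughly 3.86 to roughly 3.04, which is consistent with the decrease of about 20% mentioned in the text. Note that the categories form an ordinal scale with unequal widths, so the mean is a rough indicator of tendency rather than an average number of days.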
            Prior to reorganization      After reorganization
Category    Number of bugs   Share       Number of bugs   Share
1           90               0.257       293              0.376
2           20               0.057       67               0.086
3           17               0.049       62               0.079
4           53               0.151       138              0.177
5           63               0.180       111              0.142
6           107              0.306       109              0.140
Table 10: Distribution of the bugs within each category

In addition to the table above, we displayed the same data in figure 18 shown below. There is a distinct change in the share of bugs in category 6, i.e. with handling time of more than 3 months. Before the reorganisation about 30 percent of all the bugs fell into this category, while the share decreased to less than 15 percent after the reorganisation. There is also a change in the share of bugs in category 1, i.e. bugs with handling time of less than one day. Prior to the reorganization about 25 percent of the bugs were fixed in less than a day, but this share increased to more than 37 percent after the initiative.

Figure 18: Comparison of the bug handling time (share of bugs per handling time category 1-6, prior to and after the reorganisation)

The hypothesis deals with the average handling time; therefore we also needed some plots regarding this. We decided to plot the average handling time on a weekly basis. This was done based on the average handling time that we calculated in Excel. In other words, the sample size per week was 14 bugs. The plot is shown in figure 19.

Figure 19: The diagram plots the average handling time per bug on a weekly basis. The x-axis is the week numbers in 2003/2004; the plot starts 01.01.2003. The y-axis is the scale we created to tag the bug handling times.

This diagram shows that the average handling time has decreased after the reorganisation. But it is also important that the reader notices that this tendency might have started even before the reorganisation.
Unfortunately we didn't collect data prior to 2003, but we hope that the statistical test will give us some answers in this matter.

6.1.2 Hypothesis 2

In order to partially answer the defined project question, the following hypothesis was examined.
H2.0: Reorganization did not lead to a greater share of solved bugs compared to new bugs per week.
H2.1: Reorganization led to a greater share of solved bugs compared to new bugs per week.

To decide whether this hypothesis should be rejected or not, we collected data on new and closed bugs. Then we examined the numbers before and after the reorganization. The picture below shows how the number of new and closed bugs increases on a weekly basis.

Figure 20: The picture shows the development in reported and closed bugs on a weekly basis. The x-axis is weeks, and the y-axis is number of incidents.

6.1.2.1 Before the reorganization

The graphs in figure 20 show that prior to the reorganization the numbers of new and solved bugs follow each other quite well. This means that the developers were able to solve about the same number of bugs as the users found. This looks like a sign of a healthy organization that manages its challenges well. There are some large spikes that deviate from the rest of the graph. They were not caused by BugDay efforts, as the BugDay phenomenon didn't start until August 2003. There seems to be a connection between releases and these spikes. Release candidates 1-3 came shortly after each of the first 3 peaks. This indicates that developers have made "all-out efforts" and solved a lot of bugs so the candidates could be released to the public. There are also negative peaks in holidays like Christmas.

Figure 21: The picture shows the number of closed bugs divided by the number of new bugs.

Figure 21 supports this trend. Before the reorganization the graph has several spikes above 1, meaning that the developers solve more bugs than the users report.
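The ratio shown in figure 21 is simply the number of closed bugs divided by the number of new bugs for each week. A minimal Python sketch of that calculation is given below. The weekly numbers are hypothetical and chosen only for illustration; they are not taken from the Gentoo data, and the function names are our own.

```python
from typing import List, Optional


def closed_to_new_ratio(new_bugs: List[int],
                        closed_bugs: List[int]) -> List[Optional[float]]:
    """Weekly ratio of closed to new bugs; None for weeks without new bugs."""
    if len(new_bugs) != len(closed_bugs):
        raise ValueError("both series must cover the same weeks")
    return [closed / new if new > 0 else None
            for new, closed in zip(new_bugs, closed_bugs)]


def weeks_above_one(ratios: List[Optional[float]]) -> List[int]:
    """1-based week numbers in which more bugs were closed than reported."""
    return [week for week, ratio in enumerate(ratios, start=1)
            if ratio is not None and ratio > 1]


# Hypothetical weekly counts, for illustration only.
new_per_week = [100, 120, 80, 150]
closed_per_week = [110, 90, 80, 120]

ratios = closed_to_new_ratio(new_per_week, closed_per_week)
spike_weeks = weeks_above_one(ratios)
```

A ratio above 1 means that the pool of open bugs shrinks in that week, while a ratio that stays below 1 means that the pool grows.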
However, the process of solving bugs seems to be quite disorderly, as the graph has big fluctuations.

6.1.2.2 After the reorganization

The reorganization was in week 26 in 2003; this is marked by the line. From this point on, the two graphs in figure 20 seem to deviate from each other. More bugs are found than are solved. A pool of unsolved bugs arises, and according to picture 19 it keeps growing. We discovered that the release cycle increased after the reorganization. According to pre-reorganization results, this should cause an increase in solved bugs. Yet we see the opposite. In figure 21 the ratio of solved to new bugs becomes more stable. This might be good for future strategies, as the workload gets easier to predict. However, it stabilizes below 1, meaning that Gentoo developers aren't able to solve all the reported bugs. This implies that the number of unsolved bugs rises.

6.1.2.3 Total number of open bugs

Although we have not done any extensive research on this topic, we have expressed some concern about the number of unsolved bugs. The diagram below shows how the number of bugs has almost quadrupled in less than 2 years. However, there is a variable that we haven't measured. If Gentoo Linux is measured in lines of code (LOC), and this number has increased, then the number of unsolved bugs might have increased proportionally. This means that the ratio between LOC and unsolved bugs might be stable. And as the organization grows and gains manpower, the increased pool of unsolved bugs might not be a problem after all.

Figure 22: Number of open bugs each week (y-axis: total open bugs; x-axis: dates from 06.01.2003 to 06.09.2004)

6.1.2.4 Plotting solved vs. new bugs per week

In figure 23 we have plotted the ratio between solved and new bugs per week.
The plot confirms our earlier assumptions and clearly shows how the workload spikes. This might indicate that the Gentoo project was less streamlined and depended on all-out working sprees. That is not an ideal situation, as Gentoo would be very vulnerable if some developers suddenly were unable to commit to these intensive bug fixing sessions before a release. The plot after the reorganizat