Investigations about replication of empirical studies in software engineering: A systematic mapping study

https://doi.org/10.1016/j.infsof.2015.02.001

Abstract

Context

Two recent mapping studies, which were intended to assess the current state of replication of empirical studies in Software Engineering (SE), identified two sets of studies: empirical studies actually reporting replications (published between 1994 and 2012), and a second group of studies concerned with definitions, classifications, processes, guidelines, and other research topics or themes about replication work in empirical software engineering research (published between 1996 and 2012).

Objective

In this current article, our goal is to analyze and discuss the contents of the second set of studies about replications to increase our understanding of the current state of the work on replication in empirical software engineering research.

Method

We applied the systematic literature review method to build a systematic mapping study, in which the primary studies were collected by two previous mapping studies covering the period 1996–2012, complemented by manual and automatic search procedures that collected articles published in 2013.

Results

We analyzed 37 papers reporting studies about replication published in the last 17 years. These papers explore different topics related to concepts and classifications, present guidelines, and discuss theoretical issues that are relevant for our understanding of replication in our field. We also investigated how these 37 papers have been cited in the 135 replication papers published between 1994 and 2012.

Conclusions

Replication in SE still lacks a set of standardized concepts and terminology, which has a negative impact on replication work in our field. To improve this situation, it is important that the SE research community engage in an effort to create and evaluate taxonomies, frameworks, guidelines, and methodologies to fully support the development of replications.

Introduction

Replications of empirical studies play important roles in the construction of knowledge. According to Schmidt, a replication that demonstrates the same findings obtained by other experiment “… is the proof that the experiment reflects knowledge that can be separated from the specific circumstances (such as time, place, or persons) under which it was gained” [2]. Replications are also important to identify the range of conditions under which findings from one experiment hold and the possible exceptions [3].

Considering the importance of replications to the advance of science in general, Schmidt [2] expected that one would find a body of knowledge that provides clear and unambiguous definitions for central questions like ‘what exactly is a replication experiment?’, ‘what exactly is a successful replication?’, and ‘what are all the types of replication and their corresponding roles?’. Furthermore, one would expect to find empirically evaluated guidelines on how to perform and report replications, complementing existing guidelines for performing experiments and other empirical studies.

However, Schmidt argues that this is not true for most scientific disciplines [2]. The published replications and the theoretical works about replication research have not used clear-cut definitions of terms and concepts, and there is no generally accepted taxonomy to distinguish between types of replications and their roles in generating scientific knowledge. According to Schmidt, “the word replication is used as a collective term to describe various meanings in different contexts” [2]. Carver et al. [4] report that a similar situation is also found in empirical software engineering research. Our findings reinforce the need to address these issues in software engineering.

The goal of this article is to contribute to the advancement of replication work in empirical software engineering. We expect that the results presented in our study will stimulate and support a debate in the scientific community on central questions related to replications. Although we do not expect to fully answer these questions in this article, we believe our work will contribute to some of the answers:

  • What should be considered a replication?

  • What should be considered a successful replication?

  • What are the types of replications and their functions?

  • How should replications be performed?

  • How should replications be reported?

In a recent mapping study, da Silva et al. [5] studied the current state of published replications of empirical studies in software engineering research. The mapping study selected and analyzed papers reporting replications of empirical studies published until 2010, and also found a second set of studies addressing several topics about replication work. These papers about replication were not analyzed further by da Silva et al. [5]. More recently, the same research group performed an update of the previously published mapping study, covering material published in 2011 and 2012 [6]. In this update as well, the same type of papers about replication was collected and saved for future analysis.

In this current article, we analyze and discuss the content of the papers about replications (hereafter referred to as ABO papers) published in the Software Engineering literature to increase our understanding of the current state of the work on replication in empirical software engineering research. We expect that this analysis will shed some light on the issues related to the five questions raised above.

Our goal is twofold. First, to classify the set of ABO studies in Software Engineering into categories related to the topics on which the articles focus (recommendations, frameworks, guidelines, among others). Second, to analyze how the replications performed between 1994 and 2012 have cited and used the ABO studies, in order to verify the impact of these studies on recent replication work.

The set of papers analyzed in this article is composed of those selected by da Silva et al. [5], those found in the update of the mapping study [6], and papers found through a search process performed to cover work published in 2013. We systematically structured and analyzed data extracted from these articles to answer the following six research questions:

  • RQ1: What was the evolution in the number of ABO studies over the years?

  • RQ2: Which individuals and organizations are most active in publishing ABO studies?

  • RQ3: How do the ABO studies define replication?

  • RQ4: What topics or themes have been addressed by the ABO studies?

  • RQ5: Which ABO studies are cited by the papers that reported replications?

  • RQ6: How have the results or propositions presented in the cited ABO studies been used in papers that report replications?

This article is organized as follows. In Section 2, we present background with a discussion of concepts and related work. In Section 3, we present the method used in this study. In Section 4, we present a comprehensive set of results of our review, and in Section 5 we discuss these results. Finally, in Section 6, we present some conclusions and proposals for future work.

Section snippets

Background and related work

As briefly discussed in the Introduction, there is little agreement about nomenclature and definition of concepts about replication in many empirical sciences and also in empirical software engineering. In this article, we expect to shed some light on the debate about some theoretical and practical issues related to performing, classifying, and reporting replications in SE research. In this section, we start by providing some preliminary definitions, we then briefly describe the two mapping

Method

The scientific literature differentiates at least two types of systematic reviews: conventional systematic reviews and mapping studies [13]. The former aims to aggregate results about the effectiveness of a treatment, intervention, or technology, and therefore seeks answers to causal or relational research questions (e.g., Is intervention I on population P more effective for obtaining outcome O in context C than comparison treatment C?). The latter aims to identify all research related to a

Results

Our results naturally fall into two groups. The first group of research questions (RQ1–RQ4) deals with the descriptive nature of ABO studies, and the second group of research questions (RQ5 and RQ6) describes how the papers reporting replications use the results or propositions presented in the ABO studies.

Discussions

Our goal in this review of research about replications in empirical software engineering is to plot the general landscape of the body of work about replications and to complement the reviews of replications produced by da Silva et al. [5] and Bezerra and da Silva [6]. In this section, we discuss our results, their implications for software engineering research, and the limitations of our work. We also briefly discuss the results of the RESER workshop with respect to the published studies about

Conclusions

In this article, we presented a review of 37 papers reporting studies on concepts, classifications, guidelines, frameworks, and other topics about replication in Software Engineering published between 1996 and 2013. We used the papers selected from two mapping studies that covered the period between 1996 and 2012, and from a search procedure performed by the authors to cover the year 2013. Over 67% (25/37) of the papers were published in conferences and workshops (19 full and 6 short papers).

Acknowledgments

Professor Fabio Q. B. da Silva holds a research grant from the Brazilian National Research Council (CNPq), process #314523/2009-0. Cleyton V. C. de Magalhães and Ronnie E. S. Santos are both master's students at the Center of Informatics of UFPE, where they receive scholarships from CAPES and FACEPE (process #IBPG-0651-1.03/12), respectively. We would like to thank the anonymous reviewers of this article for their comments and constructive criticisms that led to important improvements in the

References (25)

  • O.S. Gómez et al.

    Understanding replication of experiments in software engineering: a classification

    Inf. Softw. Technol.

    (2014)
  • C.V. de Magalhães et al.

    Investigations about replication of empirical studies in software engineering: preliminary findings from a mapping study

  • S. Schmidt

    Shall we really do it again? The powerful concept of replication is neglected in the social sciences

    Rev. Gen. Psychol.

    (2009)
  • R.M. Lindsay et al.

    The design of replicated studies

    Am. Stat.

    (1993)
  • J.C. Carver et al.

    Replications of software engineering experiments

    Empirical Softw. Eng.

    (2014)
  • F.Q. da Silva et al.

    Replication of empirical studies in software engineering research: a systematic mapping study

    Empirical Softw. Eng.

    (2012)
  • R. Bezerra et al.

    Replication of empirical studies in software engineering: a systematic mapping study...
  • M.A. La Sorte

    Replication as a verification technique in survey research: a paradigm

    Sociol. Quart.

    (1972)
  • J. Gould et al.

    Dictionary of the Social Sciences

    (1964)
  • J. Daly et al.

Verification of results in software maintenance through external replication

  • C.D. Knutson et al.

    Report from the 1st international workshop on replication in empirical software engineering research (RESER 2010)

    ACM SIGSOFT Softw. Eng. Notes

    (2010)
  • J.L. Krein et al.

    Report from the 2nd international workshop on replication in empirical software engineering research (RESER 2011)

    ACM SIGSOFT Softw. Eng. Notes

    (2012)

Article Notes: Preliminary and partial results of this study have been presented at the 18th International Conference on Evaluation and Assessment in Software Engineering (EASE’2014) and published in the Conference Proceedings [1].
