Cleaning ground truth data in software task assignment

https://doi.org/10.1016/j.infsof.2022.106956

Highlights

  • Devised a debiasing method to clean task assignment datasets.

  • Conducted experiments in two task assignment applications.

  • Debiasing the ground truth data improves learning-based techniques’ performance.

Abstract

Context:

In the context of collaborative software development, there are many application areas of task assignment such as assigning a developer to fix a bug, or assigning a code reviewer to a pull request. Most task assignment techniques in the literature build and evaluate their models based on datasets collected from real projects. The techniques invariably presume that these datasets reliably represent the “ground truth”. In a project dataset used to build an automated task assignment system, the recommended assignee for the task is usually assumed to be the best assignee for that task. However, in practice, the task assignee may not be the best possible task assignee, or even a sufficiently qualified one.

Objective:

We aim to clean up the ground truth by removing samples that are potentially problematic or suspect, on the assumption that removing such samples would reduce systematic labeling bias in the dataset and lead to performance improvements.

Method:

We devised a debiasing method to detect potentially problematic samples in task assignment datasets. We then evaluated the method’s impact on the performance of seven task assignment techniques by comparing the Mean Reciprocal Rank (MRR) scores before and after debiasing. We used two different task assignment applications for this purpose: Code Reviewer Recommendation (CRR) and Bug Assignment (BA).

Results:

In the CRR application, we achieved an average MRR improvement of 18.17% for the three learning-based techniques tested on two datasets. No significant improvements were observed for the two optimization-based techniques tested on the same datasets. In the BA application, we achieved a similar average MRR improvement of 18.40% for the two learning-based techniques tested on four different datasets.

Conclusion:

Debiasing the ground truth data by removing suspect samples can help improve the performance of learning-based techniques in software task assignment applications.

Introduction

Task assignment in software engineering is concerned with assigning one or more developers to a development-related task such that the assigned developers are capable of completing the task effectively, expediently, and with acceptable quality [1]. Several studies in the literature have proposed techniques to automate such assignments [2], [3], [4], [5]. These techniques develop models based on historical data, which need to be reliable and accurate enough for the models to perform optimally in practice. In previous work, Tuzun et al. [6] investigated problems that can plague historical data in this and other contexts in software engineering, and proposed strategies that can be applied to improve both data quality and model performance. These problems occur when the proposed techniques blindly rely on the data as absolute ground truth; however, historical data that involve human decisions come with biases that are baked into those decisions. Two instances of task assignment discussed in Tuzun et al. are Code Reviewer Recommendation (CRR) and Bug Assignment (BA). This paper presents these two applications to demonstrate how cleaning up the ground truth to eliminate suboptimal decisions from historical data can improve the performance of models that rely on that data.

The code review process is an important step in the software development lifecycle. Effective code reviews increase internal quality and reduce defect rates [7]. For code reviews to be effective, reviewers should be selected carefully. According to Google’s best practices for code reviews [8], “the best reviewer is the person who will be able to give you the most thorough and correct review for the piece of code you are writing”. Several CRR techniques exist in the literature [9], [10], [11], [12], [13], [14], [15], [16], [17]. These CRR techniques use different strategies, but they invariably build or evaluate their models based on datasets gathered from industrial or open-source projects. Hence they rely on the datasets accurately capturing the ground truth regarding past reviewer selections. The models assume that the code reviewer assigned to a review task in a dataset, typically recorded in a pull request (PR), is the best possible reviewer for that task. However, in practice, the selected code reviewer may not be the most qualified, or even sufficiently qualified, to review the submitted PR [18]. In several cases, reviewer assignments can be based on non-technical factors, which may invalidate the central assumption that the models are built on [18]. This situation was described in a study on code review practices at Microsoft [19], where reviewers are assigned to PRs according to their availability and their social relationship with the person who makes the reviewer assignments. According to Dogan et al. [18], availability is an important factor in reviewer assignments and is frequently substituted for technical or competency factors. Consequently, recommendation labels in datasets that originate from real practice may be suboptimal, and can negatively affect the accuracy and reliability of the CRR techniques that rely on them.

BA is another important task in managing software projects. Many open-source projects maintain open bug repositories, where both developers and users report the problems they encounter. Additionally, they can recommend changes and enhancements to improve the overall quality of the software [3]. As in CRR, for each reported bug, an assigner typically selects an appropriate developer who has the expertise to perform the fix. The assignment can be manual or automated. In automated BA, the assignment is decided by heuristics or trained models that use and combine text categorization [20], machine learning [3], [21], and artifact/dependency analysis [22], [23]. Regardless of the technique used, historical bug assignment data that originate from proprietary or open-source projects involving human decision makers play an important role. Thus, as in the case of CRR, the data are susceptible to biases resulting in suboptimal assignments, where an assigned developer might not have been the most qualified, or even sufficiently qualified, for the task.

In machine learning, the kind of labeling error that exists in CRR and BA datasets is generally referred to as systematic labeling bias. Supervised learning techniques require labels in the training samples. These labels indicate the real/actual classes of interest in past data so that models can be built to predict the classes of new data. For instance, to distinguish between apples and oranges, an actual label (i.e., apple or orange) is required for each training sample. Ground truth refers to these labels indicating the actual class of the training samples. In more complex pattern recognition tasks, such as classifying code review tasks according to who should review them, 100% “correct” labels may not be available for the training samples because several factors, including subjective ones, need to be considered; the notion of a “correct” label is not well-defined in such a context. Although the labels are not perfect, they are still treated as the ground truth. When the amount of problematic labels in the ground truth is relatively small, or inconsistencies in the class labels are negligible, the associated samples can be treated as normal noise. In some cases, however, the ground truth may include more pervasive problems due to basic/naïve assumptions in the labeling process [24] or intrinsic properties of the observed data [25] that can prevent the models from converging, learning generalizable patterns, or, as in the CRR and BA cases, being as effective in real practice as they could be. These cases are said to have systematic labeling bias. To the best of our knowledge, automatically cleaning ground truth data in the software task assignment problem has not been investigated by any other study in the literature.

With the goal of preventing labeling problems of the kind described above for task assignment automation in software engineering, we formulate two research questions:

RQ1: How can we eliminate systematic labeling bias in task assignment ground truth data?

RQ2: How does systematic labeling bias elimination in the ground truth data affect the performance of task assignment techniques that rely on the data?

For RQ1, we explore possible solutions and introduce a new approach to detect and eliminate potentially “incorrect” assignee labels in CRR and BA datasets. For RQ2, we measure the effects of the proposed approach by comparing the before and after accuracy rates of five CRR techniques and two BA techniques. The five CRR techniques are Naïve Bayes, k-Nearest Neighbor (k-NN), Decision Tree, RSTrace [26], and a profile-based technique [27]. The two BA techniques are a deep learning based bug triage technique (Deep Triage) [28] and a Convolutional Neural Network (CNN) based word representation technique (CNN Triage) [29].
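
To make the before/after comparison concrete, the following is a minimal sketch of how the Mean Reciprocal Rank (MRR) reported in the evaluation can be computed on an original and a debiased test set. The recommend function, the sample layout, and the variable names are hypothetical placeholders, not the paper's implementation.

    # Minimal sketch: Mean Reciprocal Rank (MRR) over task assignment samples.
    # `recommend(task)` is a hypothetical function returning candidate assignees
    # ranked from most to least suitable; each sample is a (task, actual_assignee)
    # pair taken from the historical data.
    def mean_reciprocal_rank(samples, recommend):
        reciprocal_ranks = []
        for task, actual_assignee in samples:
            ranking = recommend(task)
            if actual_assignee in ranking:
                # Rank positions are 1-based, so add 1 to the list index.
                reciprocal_ranks.append(1.0 / (ranking.index(actual_assignee) + 1))
            else:
                # The actual assignee was not recommended at all.
                reciprocal_ranks.append(0.0)
        return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

    # Hypothetical usage: compare a technique's MRR before and after debiasing.
    # mrr_before = mean_reciprocal_rank(original_samples, recommend)
    # mrr_after  = mean_reciprocal_rank(debiased_samples, recommend)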

Section 2 provides a summary of relevant previous work on CRR approaches, BA approaches, cognitive biases in software engineering, and ground truth problems in software engineering. Section 3 defines success criteria for correct assignments in CRR and BA and introduces our debiasing (data cleaning) approach. Section 4 describes the experiments underlying the two applications, first introducing the datasets, preprocessing steps, and experimental setups, and then presenting the results. Section 5 answers the research questions and discusses the limitations of our work. Finally, Section 6 summarizes the contributions and discusses future work.

Section snippets

Background and related work

In the following, we provide a summary of CRR and BA techniques, as well as related work on cognitive biases and ground truth problems in software engineering.

Strategies for fixing the ground truth in task assignment

Our goal is to eliminate systematic labeling bias in historical data about task assignment. To do this, we need a way to identify whether a task assignment decision can be considered successful after the fact: that is, whether the labels of the assignment sample, the assignees, turned out to be the right choices for the task represented by that sample. Thus, deciding label correctness requires a success measure. If the success measure is qualitative or subjective, we can only use
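
The idea of deciding label correctness via a success measure can be illustrated with a minimal sketch, assuming a binary, objective success measure that can be evaluated on each historical sample after the fact. The function names and record fields below are hypothetical illustrations, not the measures used in the paper.

    # Minimal sketch of debiasing as filtering: keep only the historical samples
    # whose recorded assignment satisfies an objective, after-the-fact success
    # measure. `samples` and `was_successful` are hypothetical placeholders.
    def debias(samples, was_successful):
        return [sample for sample in samples if was_successful(sample)]

    # Hypothetical success predicate for a bug assignment dataset: an assignment
    # is treated as unsuccessful if the bug it fixed was later reopened.
    def bug_fix_not_reopened(sample):
        return not sample.get("reopened", False)

    # cleaned_samples = debias(bug_assignment_samples, bug_fix_not_reopened)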

The CRR dataset

For the CRR application, we evaluate the debiasing method on two different datasets. These datasets belong to projects from two sources: Qt,1 a company that develops cross-platform software, and Apache.2 The projects are Qt Creator and HIVE, respectively. They were chosen because both are open-source, have a full PR and code review history, and have their PR information linked to bug tracking information, as required.

For Qt

Discussion

The debiasing method we propose can be applied to any assignment task in software engineering (e.g., feature assignment or test assignment tasks). However, three prerequisites need to be satisfied to apply the method successfully to a software engineering assignment task. (1) Previous task assignment data should be readily available to represent the ground truth. (2) There should be an objective success measure (e.g., in the bug assignment problem, the success measure was about

Conclusion and future work

Good assignee selection is central to effective task assignment in software development. Task assignment techniques attempt to automate the assignee selection process, but many techniques build their models and evaluate them using historical data whose ground truth may be unreliable. Ground truth problems often result from the susceptibility of human decision makers to cognitive biases, such as substituting a convenience attribute for a competence attribute. When the task assignments are

CRediT authorship contribution statement

K. Ayberk Tecimer: Methodology, Software, Investigation, Data curation, Writing – original draft, Visualization. Eray Tüzün: Conceptualization, Validation, Resources, Writing – review & editing, Supervision, Project administration. Cansu Moran: Methodology, Software, Investigation, Data curation, Writing – original draft, Visualization. Hakan Erdogmus: Conceptualization, Validation, Writing – review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  • Sadowski, C., et al. Modern code review: A case study at Google.

  • Code review developer guide (2020).

  • Balachandran, V. Reducing human effort and improving quality in peer code reviews using automatic static analysis and reviewer recommendation.

  • Lee, J.B., et al. Patch reviewer recommendation in OSS projects.

  • Thongtanunam, P., et al. Who should review my code?

  • Xia, X., et al. Who should review this change?

  • Ouni, A., et al. Search-based peer reviewers recommendation in modern code review.

  • Zanjani, M.B., et al. Automatically recommending peer reviewers in modern code review. IEEE Trans. Softw. Eng. (2016).

  • Sülün, E., et al. Reviewer recommendation using software artifact traceability graphs.

  • Jiang, J., et al. CoreDevRec: Automatic core member recommendation for contribution evaluation. J. Comput. Sci. Tech. (2015).

  • Xia, Z., et al. A hybrid approach to code reviewer recommendation with collaborative filtering.

  • Dogan, E., et al. Investigating the validity of ground truth in code reviewer recommendation studies.

  • Kovalenko, V., et al. Does reviewer recommendation help developers? IEEE Trans. Softw. Eng. (2018).

  • Cubranic, D., et al. Automatic bug triage using text categorization.

  • Jonsson, L., et al. Automated bug assignment: Ensemble-based machine learning in large scale industrial contexts. 21 (2016).

  • Hu, H., et al. Effective bug triage based on historical bug-fix information.

  • Naguib, H., et al. Bug report assignee recommendation using activity profiles.

  • Søgaard, A., Plank, B., Hovy, D. Selection bias, label bias, and bias in ground truth. In: Proceedings of COLING 2014, ...

  • Cabrera, G.F., et al. Systematic labeling bias: De-biasing where everyone is wrong.
