An approach to generate the bug report summaries using two-level feature extraction

doi:10.1016/j.eswa.2021.114816

Expert Systems with Applications

Volume 176, 15 August 2021, 114816

Highlights

• Classification of features into comment-specific and sentence-specific.
• Better results in terms of Recall and F-Score, with no training dataset.
• Reduces the data space of comments for further processing.

Abstract

A bug report is one of the major software artifacts generated during the software development process. Changing requirements lead to the continuous evolution of bugs, which poses challenges for project management. Bug reports are the artifact most consulted by the software community. A bug report contains not only information about the bug but also the resolution process, enhancements proposed by other contributors, and any suggestions from users. During software evolution and maintenance, a developer spends considerable time and effort searching for the appropriate bug report in order to resolve a bug quickly. Automatic bug report summarization is one approach to reducing this time and effort. It helps developers not only find the appropriate bug report quickly but also manage many tasks related to bug report maintenance. In this paper, we develop a two-level approach to generating bug report summaries in which the title, the description, and the comments of a resolved bug report are considered for the summary. We find the entities in the title to create a template-based sentence describing what the bug report is about. We use the PageRank algorithm together with the cosine similarity measure to summarize the Description field of the bug report. The two-level feature-based approach is used to find the relevant comments and then the relevant sentences within those comments. We use the BRC dataset, which has been used by most of the research community in this field. Finally, the summaries from the title, description, and comments are merged to create the final summary. The approach uses features that other researchers have applied successfully to text summarization, especially for conversational data such as meeting transcripts. Empirical results show that our approach performs comparably to other supervised and unsupervised approaches in terms of ROUGE scores.

Introduction

Changing requirements in this agile world demand the continuous maintenance and evolution of software. During software development, coding accounts for just 20 percent of the effort, while the other activities account for about 80 percent. Testing, maintenance, and evolution are the most important activities and require a lot of effort. Every phase of software development creates artifacts that serve as the information management components of the software, such as the requirement document during the requirement analysis phase, design documents during the design phase, test cases, bug reports, etc. Online software repositories are used to manage and maintain this information.

A bug report is one of the artifacts created during the testing and maintenance phases. Bug reports are usually stored in software bug repositories. A bug report is a versatile artifact: it may contain highly structured data or a lot of technical dumps, stack traces, code, etc. It also contains opinions and ideas. Bug reports help developers understand how a similar bug was fixed and how changes were made to the system in the past. Open software bug repositories include Bugzilla.² A bug report consists of information such as the title, description, comments, authorship information, and timestamps. It is a conversational artifact and resembles emails and discussion threads. Comments are usually written in the context of previous ones. Rastkar, Murphy, and Murray (2014) considered bug reports similar to email conversations, applied classifiers used for email conversations, and noticed that the classifiers work equally well on both artifacts. Bug reports contain not only bug-resolving information but sometimes also insights about enhancements and improvements to the system. In the informal conversation section of the bug report, developers also give suggestions or alternatives to existing approaches. A comment is often an agreement with the previous comment, sometimes a disagreement, and sometimes an alternative to the existing approach. Thus, bug reports are among the most valuable software artifacts and among the most mined for insights. To resolve a bug and understand the bug report, a developer has to go through all the comments, which is a time-consuming and tedious task (Kukkar & Mohana, 2019). Lotufo, Malik, and Czarnecki (2015) described how bug reports also refer to other bug reports. Bug reports need to be consulted especially often during regression. Comments help improve knowledge about the bug report. At the same time, they make the bug report harder to understand, because different contributors may discuss the bug in different contexts, which makes the discussion multi-threaded.

One solution for reducing the effort and time this task requires is automatic bug report summarization, which produces the summary of a bug report automatically to help developers understand the report quickly. Summarizing a bug report is not an easy task. Yang, Cheng-Min, and Chung (2018) noted that the presence of technical terms makes it hard for a developer to understand the bug report and thus makes the process tedious and time-consuming. Even searching for the right bug report takes a lot of time, as bug repositories usually contain a number of duplicate bug reports. Although bug reports can be very valuable for developers and the maintenance team, they are usually written without the intention of easy follow-up (Lotufo et al., 2015). Automatic bug report summarization therefore faces the challenge of a huge data space of comments, which contain many words and sentences, and selecting the appropriate sentences is not easy. Mani, Sankaran, and Aralikatte (2019) mentioned that a bug report typically contains more than 60 and sometimes more than 300 sentences. Sentence selection and scoring pose the further challenge of balancing the accuracy and speed of the summarization process. Including the semantic context during sentence selection is another challenge. Data sparsity and data reduction are also challenges for summarization, as is resolving the ambiguity and noisiness of the data in bug reports. Various studies have found that one out of five bug reports is a duplicate. Automatic summarization of bug reports not only increases the comprehensibility of the document but also helps in many related activities, such as identifying duplicate bug reports and classifying bug reports. Analyzing bug reports also helps improve the quality of software.

The research community has observed the following main points about bug reports:

• Lotufo et al. (2015) observed that most bug reports refer to previous bug reports or are referred to in other bug reports that were generated because of the existing bug report.
• In open source projects, the collaborators are random, which leads to issues with the quality of bug reports, and usually no proper structure is followed in the bug report.
• Anybody can comment on a bug report, so the bug report becomes more like an informal chat and resembles a meeting conversation or an email thread.

Researchers have used a number of techniques for summarizing bug reports, such as feature-based sentence selection, classifier-based extraction, unsupervised approaches that consider centrality and diversity, and topic-model-based approaches. For preprocessing the text in a bug report, a number of natural language processing tools, such as Stanford NLP³ and NLTK,⁴ help remove the noisiness of the data. Kukkar and Mohana (2019) observed that currently available techniques do not produce very informative summaries, so developers still have to read the whole bug report.

The hypotheses on which we base the approach are:

• The timestamp is a key factor in deciding which comments in the bug report are important.
• Sentences that are more similar to the title and the description of the bug report are important.
• Noise filtering is very important for removing irrelevant comments from bug reports.
• Authorship plays a significant role in determining the relevance of a comment.

To create a summary that includes information from the title, description, and comments together, we propose a solution that combines feature-based selection, the unsupervised PageRank approach, and natural language generation to produce a flexible and informative summary. We do not use machine learning techniques, since Mani et al. (2019) showed that unsupervised approaches outperform the BRC classifier. Moreover, the availability of a large training dataset is still an open issue in the field of summarization.
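The idea of ranking description sentences with PageRank over a cosine-similarity graph can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`tokenize`, `rank_sentences`, `summarize`), the plain term-frequency vectors, the damping factor of 0.85, and the fixed iteration count are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank on a cosine-similarity graph.

    damping and iterations are assumed defaults, not values from the paper.
    """
    n = len(sentences)
    if n == 0:
        return []
    vectors = [Counter(tokenize(s)) for s in sentences]
    # Edge weight between sentences i and j is their cosine similarity;
    # self-loops are excluded.
    weights = [[cosine_similarity(vectors[i], vectors[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the
            # weight of the edge j -> i. Sentences with no similar
            # neighbour contribute nothing and keep only the base score.
            incoming = sum(scores[j] * weights[j][i] / out_sums[j]
                           for j in range(n) if out_sums[j] > 0)
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores


def summarize(sentences, k=2):
    """Return the k best-ranked sentences in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Sentences that share vocabulary with many others accumulate score, while an off-topic sentence with no similar neighbour stays at the baseline and is left out of the summary.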

The main contributions of the paper are:

• An understanding of the bug report and its structure.
• A discussion and classification of the features that describe the bug report. We classify the features into comment-specific and sentence-specific features.
• Rather than applying all the features together to all the sentences of the bug report, we first find the important comments and then apply a few selective sentence-specific features to find the important sentences in those comments.
• Incorporation of domain-specific and semantic information into bug report summarization.
• Combination of the description summary and the comments summary to enrich the summaries.
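The two-level idea in the contributions above (select important comments first, then score only the sentences inside them) can be sketched as below. Every concrete choice here is an assumption for illustration and is not taken from the paper: the `Comment` fields, the Jaccard overlap used as the similarity measure, the equal feature weights, the rule that later comments score higher, and the three-token noise threshold.

```python
import re
from dataclasses import dataclass


@dataclass
class Comment:
    author: str
    position: int  # 0-based order in the thread, used as a timestamp proxy
    text: str


def tokens(text):
    """Set of lowercase alphanumeric tokens in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Jaccard overlap between two token sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def comment_score(comment, reference, reporter, total):
    """Level 1: comment-specific features, combined with equal weights (assumed)."""
    similarity = overlap(tokens(comment.text), reference)
    recency = (comment.position + 1) / total  # assumption: later comments matter more
    authorship = 1.0 if comment.author == reporter else 0.0
    return similarity + recency + authorship


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def two_level_summary(title, description, comments, reporter,
                      top_comments=2, top_sentences=2):
    reference = tokens(title + " " + description)
    total = len(comments)
    # Noise filter (assumed rule): drop comments with fewer than three tokens.
    usable = [c for c in comments if len(tokens(c.text)) >= 3]
    # Level 1: keep only the best comments, shrinking the data space.
    kept = sorted(usable,
                  key=lambda c: comment_score(c, reference, reporter, total),
                  reverse=True)[:top_comments]
    # Level 2: score sentences from the retained comments only.
    candidates = [(overlap(tokens(s), reference), c.position, i, s)
                  for c in kept
                  for i, s in enumerate(split_sentences(c.text))]
    best = sorted(candidates, key=lambda x: x[0], reverse=True)[:top_sentences]
    best.sort(key=lambda x: (x[1], x[2]))  # restore thread order
    return [s for _, _, _, s in best]
```

Because the second level only sees sentences from the retained comments, the number of sentences to score drops with the number of comments filtered out at the first level.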

The paper is organised as follows: Section 2 reviews the related work in this field. Section 3 describes our approach. Section 4 presents the results and their discussion. The final section gives the conclusion and future work.

Section snippets

Related work

Summarization techniques are divided into single-document and multi-document summarization according to the number of documents considered. They are divided into extractive and abstractive summarization according to how the important sentences are selected and represented. Extractive summarization extracts the important sentences from the text and arranges them to form the summary. Abstractive summarization is when the sentences

Proposed approach

Extractive summarization approaches are based on extracting the important sentences. We use the extractive approach in our work because abstractive summarization is a costly and complex process.

The proposed approach extends the concept of Murray and Carenini (2008), which Rastkar et al. (2010), He et al. (2017), and Yang et al. (2018) used in their papers to implement bug report summarization. (Rastkar et al., 2010) used the machine

Dataset

For the experiments, we used the modified BRC dataset⁵ as used in (He et al., 2017; Lotufo et al., 2015; Rastkar et al., 2014). The dataset consists of bug reports from four open source projects: Eclipse, Mozilla, Gnome, and KDE. The corpus consists of 28 bug reports: nine master bug reports and 19 duplicate bug reports. Along

Conclusion and future work

Bug reports are valuable resources for improving the quality of software and for resolving bugs during software evolution and maintenance. Reading a complete bug report and locating the right bug report is a very time-consuming and tedious task. Automatic summarization helps developers understand the bug report quickly and helps in locating the bug and finding duplicate bugs. We have created a two-level feature extraction approach where we have used selective features for comment

CRediT authorship contribution statement

Som Gupta: Conceptualization, Methodology, Writing - original draft, Visualization, Investigation, Writing - review & editing. Sanjai Kumar Gupta: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (26)

Ferreira, R., et al. (2013). Assessing sentence scoring techniques for extractive text summarization. Expert Systems with Applications.
Carenini, G., Ng, R. T., & Zhou, X. (2008). Summarizing emails with conversational cohesion and subjectivity. In...
Edmundson, H. P. (1969). New methods in automatic extracting. Journal of the ACM (JACM).
Gambhir, M., et al. (2017). Recent automatic text summarization techniques: A survey.
Gupta, S., & Gupta, S. (2018). Abstractive summarization: An overview of the state of the art. Expert Systems with...
He, J., et al. (2017). Prst: A pagerank-based summarization technique for summarizing bug reports with duplicates. International Journal of Software Engineering and Knowledge Engineering.
Huai, B., Li, W., Wu, Q., & Wang, M. (2018). Mining intentions to improve bug report summarization (pp....
Jafari, M., Wang, J., Qin, Y., Gheisari, M., Shahabi, A., & Tao, X. (2016). Automatic text summarization using fuzzy...
Jagadeesh, J., Varma, V., & Pingali, P. (2005). Sentence extraction based single document summarization. Workshop on...
Kukkar, A., & Mohana, R. (2019). Bug report summarization by using swarm intelligence approaches. Recent Patents on...
Kukkar, A., et al. An optimization technique for unsupervised automatic extractive bug report summarization.
Kumarasamy Mani, S.K., Catherine, R., Sinha, V., & Dubey, A. (2012). Ausum: Approach for unsupervised bug report...
Kurmi, R., et al. Text summarization using enhanced mmr technique.

¹: ORCID ID: https://orcid.org/0000-0002-1476-084X
