Automatically classifying software changes via discriminative topic model: Supporting multi-category and cross-project

doi:10.1016/j.jss.2015.12.019

Journal of Systems and Software

Volume 113, March 2016, Pages 296-308

https://doi.org/10.1016/j.jss.2015.12.019

Highlights

• The discovered topics have a one-to-one correspondence with category labels.
• The method performs both single-category and multi-category change classification.
• The method overcomes the ambiguity coming from manually assigning weights.
• The method is applicable to cross-project analysis without the need for re-learning.

Abstract

Accurate classification of software changes as corrective, adaptive and perfective can enhance software decision-making activities. However, a major remaining challenge is how to automatically classify multi-category changes. This paper presents a discriminative Probabilistic Latent Semantic Analysis (DPLSA) model with a novel initialization method that initializes the word distributions for different topics using labeled samples. This method creates a one-to-one correspondence between the discovered topics and the change categories. As a result, the largest topic entry in the discriminative semantic representation of a software change message corresponds directly to the category label of that message, so the representation can be used directly to perform single-category and multi-category change classification. In an evaluation on five open source projects, the experimental results show that the proposed approach is more accurate than the four baseline methods, especially on the multi-category classification task, where it improves the recall rate. Moreover, the different projects share the same vocabulary and the estimated model, so DPLSA is well suited to cross-project software change message analysis.

Introduction

To aid further software analysis, it is necessary to classify software changes as corrective, adaptive, or perfective. The proportion of each category provides a valuable window into software development practices, and project managers need this information to make well-informed decisions. For example, if 90% of the changes in a project are corrective, it may be time to intensify quality assurance work such as code reviews and unit tests. Change classification has been applied to many important software engineering activities, such as software maintenance (Mockus and Votta, 2000) and defect prediction (Kim et al., 2008). Various change cues have been used for classifying software changes, for example, change author (Hindle et al., 2009a), change file (Alali et al., 2008), change size (Hattori and Lanza, 2008) and change messages (Hassan, 2008). In particular, change messages are attractive for software change classification because they do not require retrieving and then analyzing the source code of the change. Moreover, retrieving only the message is significantly less expensive, and allows for efficient browsing and analysis of the changes and their constituent revisions. These characteristics are useful to anyone who needs to quickly categorize or filter out irrelevant revisions (Hindle et al., 2009a). However, to the best of our knowledge, there has been little work on classifying multi-category changes, even though studies have found that such changes occur in practice (Fu et al., 2015, Mauczka et al., 2012). In this work, we aim to accurately understand the category distribution by classifying both single-category and multi-category changes.

Researchers have proposed a variety of approaches that retrieve keywords in change messages to classify software changes (Hassan, 2008, Mauczka et al., 2012, Mockus and Votta, 2000). Despite the great success achieved, some issues remain unsolved, such as the ambiguity that comes from subjective interpretations of the relationship between relevant words and categories of changes. A similar work has demonstrated success in automatic software change classification by using semi-supervised Latent Dirichlet Allocation (LDA) (Fu et al., 2015). We noticed that both Mauczka et al. (2012) and Fu et al. (2015) found that treating every change as single-category is not necessarily realistic. A major remaining challenge is how to automatically classify multi-category changes. To address this challenge, we focus on automatically classifying software changes by developing a novel discriminative Probabilistic Latent Semantic Analysis model, referred to as DPLSA. There are two main differences from Fu et al. (2015). First, the three topics in this work have a one-to-one correspondence with the corrective, adaptive and perfective software change categories, so that change message categorization comes down to finding the single maximum entry (single-category) or the multiple maximum entries (multi-category) in the topic-document distributions. Second, we provide a method of cross-project classification without the need for re-learning. In particular, we motivate our investigation with three research questions:

RQ1 What is a better way to evaluate the relationship between relevant words and the categories of software changes? A change message is often a short description written by developers, and the version control system (VCS) does not enforce how a change message is written. Consequently, change messages are unstructured, free-format text. There are many salient words relevant to categories in change messages, such as “fix”, “create” and “correct”. The relationship between the relevant words and the categories is the key issue in the classification step. Mauczka et al. (2012) assigned weights to the salient words, which is a subjective interpretation. We wish to perform cross-project training using labeled messages to automatically determine a probabilistic relationship.
RQ2 How well do the discovered topics correspond to multi-category software changes? A change message usually indicates a particular maintenance task, such as fixing a defect or adding a new feature, although a few change messages indicate multiple purposes, as Mauczka et al. (2012) and Fu et al. (2015) reported in their validation steps. We wish to create a one-to-one correspondence between discovered topics and categories by using the discriminative topic model. The discovered topics can then be used directly to perform both single-category and multi-category classification.
RQ3 What is an accurate way to automatically obtain the distribution of software changes? A project manager would be interested in knowing the distribution of categories of software changes. We wish to quantify a more accurate distribution by classifying both single-category and multi-category changes.
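As an illustration of the category distribution that RQ3 refers to, the following minimal sketch computes the share of changes carrying each category label. The function name, the toy labels, and the choice to count a multi-category change once toward every category it carries are assumptions made for this example; the excerpt does not state how the paper counts such changes.

```python
from collections import Counter

CATEGORIES = ("corrective", "adaptive", "perfective")


def category_distribution(labels_per_change):
    """Return the share of changes that carry each category label.

    labels_per_change: iterable of sets of category names, one set per change.
    Assumption for this sketch: a multi-category change counts once toward
    every category it carries, so the shares may sum to more than 1.0.
    """
    changes = list(labels_per_change)
    counts = Counter(label for labels in changes for label in labels)
    total = len(changes)
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


# Hypothetical classification results for four changes.
changes = [
    {"corrective"},
    {"corrective", "perfective"},  # a multi-category change
    {"adaptive"},
    {"corrective"},
]
dist = category_distribution(changes)
print(dist)
```

If multi-category changes were forced into a single label, the second change would contribute to only one of the two shares, which is the kind of distortion the research question is concerned with.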

We address our research questions by proposing a topic modeling method. It is inspired by the recent success of topic modeling in mining software repositories (Grant et al., 2012, Hindle et al., 2011, Hindle et al., 2009b, Pollock et al., 2013, Thomas, 2012). Topic models, such as Probabilistic Latent Semantic Analysis (Hofmann, 2001), Latent Dirichlet Allocation (Blei et al., 2003), Correlated Topic Models (Lafferty and Blei, 2006) and their variants and extensions, have been applied to various software engineering research questions, such as software evolution and software defect prediction (Chen et al., 2012, Gethers and Poshyvanyk, 2010, Grant et al., 2012). Despite the great success achieved, some important issues remain unsolved in this line of research. First, in the original topic models, words are fully connected to the different topics, and some of these words are noisy and irrelevant for model construction. Disconnecting the irrelevant words helps to generate a sparse representation of a document over the different topics (Chien and Chang, 2014). In fact, the sparsity of the topic-document distribution (i.e., a small number of dominant entries, with most entries zero or close to zero) is helpful for directly performing the classification task. Second, a critical issue in understanding the latent topics uncovered from software repositories is how many topics should be sought (Grant et al., 2013). There is no one-to-one correspondence between topics and category labels in the traditional models. The topic-document distributions are only used to decide which topics are important for a particular document and cannot determine which category a particular document belongs to. This also limits their use in solving the multi-category problem.
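For readers unfamiliar with the terminology, the standard PLSA mixture (Hofmann, 2001) can be written as below. This is the textbook asymmetric formulation, not notation taken from this paper; the symbol \theta_d for the topic-document distribution and the topic count K are introduced here only for illustration.

```latex
% Standard PLSA (asymmetric form): each document d mixes K topics z_1..z_K.
P(w \mid d) = \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid d),
\qquad \sum_{k=1}^{K} P(z_k \mid d) = 1 .

% Topic-document distribution of d; "sparse" means a few entries dominate.
\theta_d = \bigl( P(z_1 \mid d), \ldots, P(z_K \mid d) \bigr)

% Log-likelihood maximized by EM, with n(d, w) the count of word w in d.
\mathcal{L} = \sum_{d} \sum_{w} n(d, w)\, \log \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid d)
```

In an unconstrained model the index k carries no fixed meaning, which is why \theta_d alone cannot name a category.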

The process of the proposed DPLSA is divided into three phases, as illustrated in Fig. 1. We select single-category change messages as our training datasets and use the semantically salient words derived from the work of Mauczka et al. (2012) to form the vocabulary, as illustrated in Fig. 1(a). Moreover, the training messages from the same category are employed to initialize the probability of each word conditioned on the corresponding topic. Hence, semantically salient words are only partially connected to the topics, each with a dominant probability on its corresponding topic. This creates a one-to-one correspondence between topics and categories. Owing to this special initialization approach, sparsity is achieved between words and their corresponding topics (Chien and Chang, 2014). Finally, the topic representation of a test sample is sparse, and its maximum entry directly determines the category to which the test sample belongs, because each topic coincides with a category. When multiple topic entries of a change message reach the same maximum, the change message is regarded as a multi-purpose one.
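The idea of label-based initialization followed by EM can be sketched as follows. This is a minimal illustration using standard PLSA updates, not the authors' implementation: the function names, the smoothing constant, the iteration count, and the toy vocabulary are all assumptions, and the paper's exact initialization formula is not given in this excerpt.

```python
import numpy as np


def init_word_topic(counts, labels, n_topics, smooth=1e-2):
    """Initialize P(w|z) from labeled single-category messages.

    counts: (n_docs, n_words) term-count matrix.
    labels: length n_docs integer array with labels[d] in [0, n_topics).
    Topic k starts from the smoothed word frequencies of category k
    (the smoothing value is an assumption of this sketch).
    """
    n_words = counts.shape[1]
    p_w_z = np.full((n_topics, n_words), smooth)
    for k in range(n_topics):
        p_w_z[k] += counts[labels == k].sum(axis=0)
    return p_w_z / p_w_z.sum(axis=1, keepdims=True)


def plsa_em(counts, p_w_z, n_iter=50, eps=1e-12):
    """Run standard PLSA EM starting from the given P(w|z)."""
    n_docs, n_topics = counts.shape[0], p_w_z.shape[0]
    p_z_d = np.full((n_docs, n_topics), 1.0 / n_topics)
    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]  # (docs, topics, words)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M-step: re-estimate both distributions from n(d,w) * P(z|d,w).
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_w_z, p_z_d


# Toy vocabulary: fix, bug, add, support, refactor, cleanup (hypothetical).
counts = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 0, 2, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 2],
    [0, 0, 0, 0, 2, 1],
], dtype=float)
labels = np.array([0, 0, 1, 1, 2, 2])  # 0=corrective, 1=adaptive, 2=perfective

p_w_z, p_z_d = plsa_em(counts, init_word_topic(counts, labels, n_topics=3))
print(p_z_d.argmax(axis=1))
```

Because topic k is seeded with the words of category k, the topic index keeps its meaning through the EM iterations, which is what allows the largest entry of a row of `p_z_d` to be read as a category.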

In our experiments, change messages of five open source projects are extracted using the CVSAnalY tool (Robles et al., 2004). The change messages are normalized with WordNet (Miller, 1995) and GATE (Cunningham et al., 2002). The five projects in the experiment share the same vocabulary and the estimated model, and the sparse probabilistic representation of software change messages is used directly to assign software changes to Swanson's maintenance categories (Swanson, 1976) by finding the maximum topic entry. Manual validation performed by professional developers indicates that the proposed approach classifies changes well, especially on the multi-category change classification task, where it improves the recall rate. The contributions of this paper can be summarized as follows:

• We explore the word-topic distributions learned from labeled change messages and find that they provide an ordered probabilistic relationship between relevant words and the categories of software changes. As a result, this overcomes the ambiguity that comes from manually assigned, subjective weights.
• We explore the discovered topic-document distributions and find a one-to-one correspondence between the discovered topics and change categories. The maximum topic entry directly determines the category to which a change belongs. If multiple topic entries reach the same maximum, the change is a multi-category one.
• We evaluate our approach on five projects and compare its performance with four baselines. The results indicate that performing multi-category classification improves the classification performance. As a result, this work provides a more accurate distribution of each category in a project. In addition, we provide a method of cross-project software change classification without the need for re-learning: the different projects share the same vocabulary and the estimated model.
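The classification rule and the cross-project reuse described in these contributions can be sketched together. The sketch below keeps an already estimated word-topic matrix fixed and estimates only the topic-document distribution for new messages (commonly called folding-in for PLSA), then reads off every category whose entry ties for the maximum. The function names, the tolerance `tol` used to decide that two entries are equal, and the toy probabilities are assumptions of this example rather than details taken from the paper.

```python
import numpy as np

CATEGORIES = ("corrective", "adaptive", "perfective")


def fold_in(counts, p_w_z, n_iter=50, eps=1e-12):
    """Estimate P(z|d) for new messages while keeping P(w|z) fixed."""
    n_docs, n_topics = counts.shape[0], p_w_z.shape[0]
    p_z_d = np.full((n_docs, n_topics), 1.0 / n_topics)
    for _ in range(n_iter):
        # E-step with the shared, frozen word-topic distributions.
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # Partial M-step: only the topic-document distribution is updated.
        p_z_d = (counts[:, None, :] * post).sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d


def classify(topic_row, tol=1e-6):
    """Return every category whose topic entry reaches the maximum.

    tol is a hypothetical tolerance for treating two entries as equal.
    """
    top = topic_row.max()
    return [c for c, p in zip(CATEGORIES, topic_row) if top - p <= tol]


# Hypothetical shared model over the vocabulary: fix, add, refactor.
p_w_z = np.array([
    [0.8, 0.1, 0.1],  # corrective
    [0.1, 0.8, 0.1],  # adaptive
    [0.1, 0.1, 0.8],  # perfective
])
# Two messages from another project, encoded with the same vocabulary.
new_counts = np.array([
    [2.0, 0.0, 0.0],  # "fix ... fix"
    [1.0, 1.0, 0.0],  # "fix ... add"
])
p_z_d = fold_in(new_counts, p_w_z)
results = [classify(row) for row in p_z_d]
print(results)
```

Since `p_w_z` is never updated in `fold_in`, no model is re-learned for the new project; only the per-message topic weights are computed.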

The structure of this paper is as follows. In Section 2 we present the work related to our research, including previous software change classification methods, software change classification rules, topic modeling in mining software repositories (MSR), and PLSA. We describe our research preparation, models and techniques in Section 3. In Sections 4 and 5, we provide the experiment design, results and validation. Finally, in Sections 6 and 7, we discuss the potential threats to our findings and draw conclusions.

Section snippets

Related work

In this section, we discuss related literature from several aspects: previous software change classification methods, software change classification rules, topic modeling in mining software repositories (MSR), and Probabilistic Latent Semantic Analysis.

Research methodology

Two steps are necessary to build a DPLSA model of a software change repository: (1) the change message extraction and preprocessing step and (2) the topic modeling step. These steps are detailed below and illustrated in Fig. 3.

Data sets

Five open source projects in Table 2 were chosen to build, test, and verify this method. The following criteria were used to select the projects:

Publicly accessible. The candidate open source projects were mature projects and had publicly accessible source control repositories.
Number of commits and developers. For this work, only projects with close to 30,000 commits were considered. In order to validate if the topic modeling understands natural languages and the expression diversity of different

Validation

To evaluate our classification results, we surveyed a small number of professional software developers who were accessible to us and whom we could easily interview to explain their replies when necessary. Five developers were surveyed with our questionnaires, each with 80 messages selected randomly from each project, including both single-category and multi-category messages. Too many surveyed messages place a heavy burden on a participant and make it hard to complete an accurate survey (Hassan,

Limitation and threats to validity

Size of evaluation set. We determined the test set size empirically as a trade-off value, which is similar to the works in Table 8. A more systematic approach to choosing the size of the evaluation dataset would be better. However, the number of surveyed messages needed cannot be controlled systematically. Too many surveyed messages place a heavy burden on participants and make it hard to complete an accurate survey. Also, too few surveyed messages may not contain any multi-category messages. It may

Conclusions and future work

In this paper, we investigated an artifact of software development, namely, the change messages attached to every change committed to a version control system. We presented a discriminative topic model technique that supports multi-category classification and cross-project analysis. A set of controlled experiments was carried out to validate whether it can evaluate the probabilistic relationship between relevant words and categories and provide a more accurate distribution of software changes. In

Acknowledgments

The work described in this paper was partially supported by the National Natural Science Foundation of China (grant nos. 91118005, 61173131), Changjiang Scholars and Innovative Research Team in University (grant no. IRT1196), Chongqing Graduate Student Research Innovation Project (grant no. CYS14008), and the Fundamental Research Funds for the Central Universities (grant nos. CDJZR12098801 and CDJZR11095501).

Meng Yan was born in 1989 and received his B.S. from Chongqing University in 2011, and his M.S. degree in Software Engineering in 2013. He is currently a Ph.D. candidate in the School of Software Engineering, Chongqing University. His research interests include data mining of software engineering and topic modeling.

References (37)

Fu, Y., et al., 2015. Automated classification of software change messages by semi-supervised latent Dirichlet allocation. Inf. Software Technol.
Grant, S., et al., 2013. Using heuristics to estimate an appropriate number of latent topics in source code analysis. Sci. Comput. Program.
Ahsan, S.N., et al. Automatic software bug triage system (bts) based on latent semantic indexing and support vector machine.
Alali, A., et al. What's a typical commit? A characterization of open source software repositories.
Asuncion, H.U., et al. Software traceability with topic modeling.
Blei, D.M., et al., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res.
Chen, T.H., et al. Explaining software defects using topic models.
Chien, J.-T., et al., 2014. Bayesian sparse topic model. J. Signal Process. Syst.
Commission, I.O.F.S.I.E., 2001. Software engineering–Product quality–Part 1: Quality model. ISO/IEC 9126,...
Cunningham, H., et al. GATE: an architecture for development of robust HLT applications.
Dempster, A.P., et al., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc., Ser. B (Methodol.)
Gethers, M., et al. Using relational topic models to capture coupling among classes in object-oriented software systems.
Gethers, M., et al. CodeTopics: which topic am I coding now?
Grant, S., et al. Using topic models to support software maintenance.
Hassan, A.E., 2008. Automated classification of change messages in open source projects. In: Proceedings of the 2008...
Hattori, L.P., et al. On the nature of commits.
Hindle, A., et al. Automated topic naming to support cross-project analysis of software maintenance activities.
Hindle, A., et al. Automatic classification of large changes into maintenance categories.

Ying Fu was born in 1991 and received her B.S. from Chongqing University in 2013. She is currently an M.S. candidate in the School of Software Engineering, Chongqing University. Her research interests include data mining of software engineering and topic modeling.

Xiaohong Zhang received the Ph.D. degree in Computer Software and Theory from Chongqing University, PR China in 2006, where he also received the M.S. degree in Applied Mathematics. He is a professor in the School of Software Engineering at Chongqing University. His current research interests include data mining of software engineering, topic modeling, image semantic analysis and video analysis.

Dan Yang received the Ph.D. degree from Chongqing University, PR China in 1997, where he also received the M.S. degree in 1985 and the B.S. degree in 1982. He is a professor in the School of Software Engineering at Chongqing University. His current research interests include data mining of software engineering, topic modeling, image semantic analysis and video analysis.

Ling Xu was born in 1975 and received her B.S. from Hefei University of Technology in 1998, and her M.S. degree in Software Engineering in 2004. She received her Ph.D. degree in Computer Science Technology from Chongqing University, PR China in 2009. Her research interests include data mining of software engineering, topic modeling, and image processing.

Jeffrey D. Kymer received a B.S. in Computer Science and a B.S. in Mathematics from Westfield State College in Westfield, Massachusetts, USA in 1986. He is a senior lecturer in the School of Software Engineering at Chongqing University. He has worked on a range of software products, from a top-selling CD-ROM (the MPC Wizard) to one of the first multi-track audio boards. His current research interests include software engineering, computational linguistics, teaching, data mining, and alternate computer interfaces.
