Elsevier

Neurocomputing

Volume 352, 4 August 2019, Pages 64-74

Software defect prediction via cost-sensitive Siamese parallel fully-connected neural networks

https://doi.org/10.1016/j.neucom.2019.03.076

Abstract

Software defect prediction (SDP), which aims to build a software defect predictor from historical data, has attracted widespread attention among software engineering researchers. However, it is still difficult to develop an effective SDP model on high-dimensional and limited data. In this study, a novel SDP model for this problem is proposed, called Siamese parallel fully-connected networks (SPFCNN), which combines the advantages of Siamese networks and deep learning in a unified method. The model is trained with the AdamW algorithm to find the best weights, and the training target is the minimum value of a single formula. We extensively compared the SPFCNN method with state-of-the-art SDP approaches on six openly available datasets from the NASA repository, using six indexes to evaluate performance. Experimental results showed that the SPFCNN method achieves significantly higher performance than the benchmarked SDP approaches, indicating that a cost-sensitive neural network can be developed successfully for SDP.

Introduction

Software defect prediction (SDP) aims to identify as many defects as possible before software is released, and it plays a vital role in ensuring software quality [1]. Moreover, it has become one of the most expensive stages as software scale and complexity increase. A survey showed that the United States Department of Defense spends $4 billion every year owing to software bugs [2], which indicates the importance of software defect detection and of ensuring software quality. Fortunately, SDP has received the attention of many researchers and has become one of the most active research fields in recent years [3]. An accurate SDP model can help prevent defects by suggesting that developers allocate limited time and human resources to defective modules. However, developing an accurate SDP model is often impeded by high-dimensional and limited software project data (also known as small data) [1], [4].

To build an effective SDP model, Dam et al. [5] used a deep tree-based LSTM for software defect prediction, but the method is mainly based on the abstract syntax tree representation of source code instead of software metrics. Lu et al. [6] used a deep belief network to predict software defects from software metrics, but the network has limited ability to extract features from high-dimensional metrics. Wang and Zhang [7] proposed a deep recurrent neural network to recognize software defects, but its computational complexity is high. Although these approaches are ingenious and well thought out, their limited effectiveness makes it difficult for them to meet the growing needs of the software business. It therefore remains an extremely challenging task to develop an effective SDP model on high-dimensional and limited software defect data.

This paper proposes a novel SDP model called Siamese parallel fully-connected neural networks (SPFCNN), which combines the advantages of Siamese networks and deep learning in a unified approach. Siamese networks have been shown by Koch [8] and Vinyals et al. [9] to be effective for few-shot learning, where little data is available. Deep learning has proven to be very good at learning features from high-dimensional data [10]. The results that follow show that the proposed SPFCNN is effective on software metrics with small data.

Most conventional machine learning approaches treat the misclassification cost of every class as equal. In practice, however, misclassifying a faulty module is far more costly than misclassifying a non-faulty module [4]. For example, predicting a faulty module as non-faulty often leads to more expensive fixing activities, whereas predicting a non-faulty module as faulty leads to more testing time, which is more acceptable than the former case. Cost-sensitive learning [11] has proven to be an effective technique for considering both cases while seeking the minimum total misclassification cost, so we use it to incorporate the different misclassification costs into the SDP procedure.
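To make the cost asymmetry concrete, the following minimal Python sketch computes a total and a per-module misclassification cost from predicted and true labels. The function name `misclassification_cost`, the cost values `c_fn` and `c_fp`, and the normalization by the number of modules are illustrative assumptions; they are not values or formulas taken from this paper.

```python
def misclassification_cost(y_true, y_pred, c_fn=5.0, c_fp=1.0):
    """Return (total_cost, cost_per_module) for binary defect labels.

    Labels: 1 = faulty, 0 = non-faulty.
    c_fn (missed defect) and c_fp (false alarm) are hypothetical costs.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one module is required")
    # False negative: a faulty module predicted as non-faulty.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # False positive: a non-faulty module predicted as faulty.
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    total = c_fn * fn + c_fp * fp
    return total, total / len(y_true)


y_true = [1, 1, 0, 0, 0, 0, 0, 0]
missed_defect = [0, 1, 0, 0, 0, 0, 0, 0]  # one false negative
false_alarm = [1, 1, 1, 0, 0, 0, 0, 0]    # one false positive

print(misclassification_cost(y_true, missed_defect))  # (5.0, 0.625)
print(misclassification_cost(y_true, false_alarm))    # (1.0, 0.125)
```

Both predictions contain exactly one error, so their plain accuracy is identical, yet under the assumed costs the missed defect is five times as expensive as the false alarm. A cost-sensitive learner is trained to prefer the second kind of error over the first.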

The main contributions of this study are summarized below:

(1) A novel SDP model called Siamese parallel fully-connected neural networks (SPFCNN) is proposed to address high-dimensional and limited software defect data. This approach predicts software defects by learning explicit information about the similarity or dissimilarity between sample pairs.

(2) A pair of parallel Siamese networks is used to extract the highest-level representation from the high-dimensional attributes for SPFCNN training and testing. Cost-sensitive features are integrated into SPFCNN to balance the classification performance on the minority and majority classes.

(3) The proposed method is evaluated and compared with existing state-of-the-art SDP methods on six common software defect datasets from the NASA repository. The experimental results show that the performance of the SPFCNN model is better than that of the benchmarked models.
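Learning from sample pairs, as described in contribution (1), can be illustrated with a short Python sketch that turns a small labeled dataset into similarity-labeled pairs. The helper `make_pairs`, the toy metric values, and the choice to enumerate all unordered pairs are assumptions made for illustration; the pairing strategy actually used for SPFCNN is not shown in this excerpt.

```python
from itertools import combinations


def make_pairs(samples, labels):
    """Build all unordered sample pairs with a similarity target.

    Target 1 means both samples share a class; 0 means they differ.
    """
    pairs = []
    for i, j in combinations(range(len(samples)), 2):
        target = 1 if labels[i] == labels[j] else 0
        pairs.append((samples[i], samples[j], target))
    return pairs


# Toy software-metric vectors (hypothetical values).
# Labels: 1 = faulty, 0 = non-faulty.
samples = [[12, 3], [40, 9], [11, 2], [38, 8]]
labels = [0, 1, 0, 1]

pairs = make_pairs(samples, labels)
print(len(pairs))                    # 6 pairs from 4 samples
print(sum(t for _, _, t in pairs))   # 2 pairs share a class
```

Because n samples yield n(n-1)/2 unordered pairs, pairing gives the network many more training examples than the original dataset contains, which is one reason pair-based learning suits limited data.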

The rest of this paper is organized as follows. Related work is summarized in Section 2. Section 3 explains the proposed method. The empirical study and results are given in Section 4. Section 5 presents the discussion. Section 6 concludes the paper.

Section snippets

Related work

Many machine learning approaches have been adopted to address the SDP problem. Logistic Regression (LR) [12], Decision Tree Classifier (DT) [13], and Linear Discriminant Analysis (LDA) [14] are some of the algorithms used for this problem. Their SDP performance differs according to their respective model characteristics. However, it is difficult for them to obtain accurate results when there is not enough software defect data [1]. In addition to these conventional machine-learning algorithms, some

Methodology

In this study, the proposed Siamese parallel fully-connected neural network (SPFCNN) is built from two Siamese networks. One is a shallow Siamese network, composed of twin fully-connected networks with two hidden layers. The other is a deep Siamese network, consisting of two identical fully-connected networks with eight hidden layers. These shallow and deep Siamese networks are merged to form the final Siamese parallel fully-connected neural network. Its
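The parallel structure described above can be sketched in plain Python as a forward pass only. Everything beyond the two-hidden-layer and eight-hidden-layer twin branches is an assumption made for illustration: the layer width, ReLU activations, merging by concatenating absolute embedding differences, the sigmoid output, and the names `init_mlp`, `forward`, and `spfcnn_score` are hypothetical, since the paper's actual merge step is cut off in this excerpt.

```python
import math
import random


def init_mlp(sizes, rng):
    """Create one (weights, biases) tuple per fully-connected layer."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, x):
    """Run a fully-connected network with ReLU after every layer."""
    for weights, biases in layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x


def spfcnn_score(shallow, deep, head, x1, x2):
    """Similarity score in (0, 1) for a pair of metric vectors."""
    features = []
    for branch in (shallow, deep):
        # Twin networks: the same weights embed both inputs.
        e1, e2 = forward(branch, x1), forward(branch, x2)
        # Assumed merge: absolute difference of the two embeddings.
        features.extend(abs(a - b) for a, b in zip(e1, e2))
    weights, bias = head
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
n_metrics, width = 21, 16                            # hypothetical sizes
shallow = init_mlp([n_metrics] + [width] * 2, rng)   # two hidden layers
deep = init_mlp([n_metrics] + [width] * 8, rng)      # eight hidden layers
head = ([rng.uniform(-0.1, 0.1) for _ in range(2 * width)], 0.0)

x1 = [rng.random() for _ in range(n_metrics)]
x2 = [rng.random() for _ in range(n_metrics)]
print(spfcnn_score(shallow, deep, head, x1, x2))
```

Sharing one set of weights per branch is what makes each branch Siamese: both inputs are embedded by the same function, so the score depends only on how the two modules differ, not on the order in which they are presented.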

Datasets

This study randomly selected six of the most commonly used datasets from the public NASA repository as baseline datasets for SDP. Each instance in a dataset is mainly composed of two parts: independent code attributes and a numeric attribute indicating how many defects it contains [41]. The static code attributes are primarily based on object-oriented metrics, such as the total number of operands and the total number of operators. A more detailed explanation is given in [41],

Discussion

As this is an empirical study, we need to consider the threats to the validity of our experiments, which are summarized in the following three aspects.

Internal validity is concerned with errors in the experiments, such as the implementation of the benchmarked methods, the choice of model parameters, and the quality of the baseline datasets. Although we checked our experiments carefully, some mistakes may still exist.

Concerning external validity, an obvious threat is the

Conclusion

To address the high-dimensional and limited data of SDP, this paper proposes the Siamese parallel fully-connected neural networks (SPFCNN). Cost-sensitive features are integrated into SPFCNN through the designed AdamW training with a single formula (NECM). Experimental results show that the proposed SPFCNN surpasses commonly used deep learning methods, and that the single performance index (NECM) of SPFCNN is better than that of the comparison methods.

Although the SPFCNN has achieved a more competitive

Conflict of interest

None.

Acknowledgments

This work has been partially supported by the National Natural Science Foundation of China (No. 61572087), in part by the Macao Special Project of the State Ministry of Science and Technology (No. 2015DFM10020), and in part by the Graduate Research and Innovation Foundation of Chongqing, China (No. CYS18006).

Linchang Zhao received the B.S. degree from the College of Computer Science, Northeast Petroleum University, Daqing, China, in 2013. In 2017, he received the master’s degree from the School of Mathematics and Statistics, Qiannan Normal College for Nationalities, Duyun, China. He is currently studying towards the Ph.D. degree with the College of Computer Science, Chongqing University. His research interests include pattern recognition, computer vision, machine learning, and deep learning.

References (45)

  • H. Wang et al.

    Software measurement data reduction using ensemble techniques

    Neurocomputing

    (2012)
  • B.W. Matthews

    Comparison of the predicted and observed secondary structure of T4 phage lysozyme

    BBA – Protein Struct.

    (1975)
  • V.B. Kampenes et al.

    A systematic review of effect size in software engineering experiments

    Inf. Softw. Technol.

    (2007)
  • Ö.F. Arar et al.

    Software defect prediction using cost-sensitive neural network

    Appl. Soft Comput.

    (2015)
  • H.K. Dam et al.

    A deep tree-based model for software defect prediction

    Softw. Engn.

    (2018)
  • G. Lu et al.

    Deep belief network software defect prediction model

    Comput. Sci.

    (2017)
  • G. Koch

    Siamese neural networks for one-shot image recognition

    (2015)
  • O. Vinyals et al.

    Matching networks for one shot learning

    Mach. Learn.

    (2016)
  • Y. LeCun et al.

    Deep learning

    Nature

    (2015)
  • S.S. Rathore et al.

    A decision tree regression based approach for the number of software faults prediction

    ACM SIGSOFT Softw. Eng. Notes

    (2016)
  • A. Kalsoom et al.

    A dimensionality reduction-based efficient software fault prediction using fisher linear discriminant analysis (FLDA)

    J. Supercomput.

    (2018)

Zhaowei Shang received the Ph.D. degree in electronics information from Xi’an Jiao University and completed postdoctoral research in computer science at Chongqing University, China, in 2005 and 2010, respectively. Currently, he is working as a professor in the Department of Computer Science at Chongqing University and as a visiting research fellow at the Faculty of Science and Technology, University of Macau. His research interests include pattern recognition, image processing, and machine learning. He has published extensively in the IEEE Transactions on Image Processing, Pattern Recognition, Neurocomputing, etc. He is a member of the IEEE.

Ling Zhao received the B.S. degree from the College of Materials Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2005. In 2008, he received the master’s degree from the School of Management, Fudan University, Shanghai, China. Currently, he is working as a manager at United Imaging (Guizhou) Healthcare Co., Ltd. His research interests include data mining, big data analysis, and machine learning. He is a member of the IEEE.

Taiping Zhang received the B.S. and M.S. degrees in computational mathematics, and the Ph.D. degree in computer science from Chongqing University, Chongqing, China, in 1999, 2001, and 2010, respectively. He is currently an associate professor in the Department of Computer Science, Chongqing University. His research interests include pattern recognition, image processing, machine learning, and computational mathematics. He has published extensively in the IEEE Transactions on Image Processing, the IEEE Transactions on Systems, Man, and Cybernetics, Part B (TSMCB), the IEEE Transactions on Knowledge and Data Engineering, Pattern Recognition, Neurocomputing, etc. He is a member of the IEEE.

Yuanyan Tang received the B.Sc. degree in electrical and computer engineering from Chongqing University, Chongqing, China, the M.Eng. degree in electrical engineering from the Beijing Institute of Posts and Telecommunications, Beijing, China, and the Ph.D. degree in computer science from Concordia University, Montreal, QC, Canada, in 1966, 1981, and 1990, respectively. He is a Chair Professor of the Faculty of Science and Technology, University of Macau, Macau, China, and a Professor/Adjunct Professor/Honorary Professor with several institutes, including Chongqing University, Concordia University, and Hong Kong Baptist University, Hong Kong. His current research interests include wavelets, pattern recognition, and image processing. He has published over 400 academic papers and has authored/coauthored more than 25 monographs/books/book chapters. He is the Founder and the Chair of the Macau Branch of the International Association for Pattern Recognition (IAPR). He is a fellow of the IAPR.
