
1 Introduction and Context

The development of a reliable software system, especially at a low cost, is a significant challenge, and the product also has to be market-ready in a reasonable time. Failure detection and defect-proneness prediction are therefore crucial tools for building reliable software, supporting decision making and resource allocation. However, the analysis of software-related data poses many problems and possible pitfalls due to intrinsic data difficulties. These difficulties, and the motivation for this work, are presented in detail in Sect. 2.

Various metrics, such as code complexity or the number of revisions, can help spot classes with a high probability of containing bugs. Bug prediction is therefore a classification problem. Numerous classification methods have been employed to deal with this challenge, among them Artificial Neural Networks (ANNs). While some researchers are reluctant to employ ANNs because of their lack of transparency, their prowess in modeling nonlinear functional relationships makes them well suited to the problem of defect prediction [1].

Software quality is a fundamental competitive factor for the success of contemporary software houses. The Horizon 2020 Q-Rapids project aims to augment the strategic decision-making procedures of software development by supplying strategic indicators of quality requirements. Fault-proneness is one such indicator, as it can significantly affect the overall cost of the software. The Q-Rapids strategic indicators stem directly from the metrics and factors calculated from software development-related data through the use of various data mining and machine learning procedures.

The paper is structured as follows: in Sect. 2 we discuss the problems and difficulties in analyzing realistic software-related data. Section 3 is devoted to Artificial Neural Networks and the algorithms used in this work, while Sect. 4 addresses the problem of data imbalance. The benchmark dataset we use is described in detail in Sect. 5.1. Results and a comparison with other standard machine learning approaches are presented in Sect. 5, and conclusions are given thereafter.

2 Problems and Difficulties in Real Software Related Data

In this paper we focus on pre-processing and recognizing (detecting faults/bugs in) software-related data. But why is software data considered difficult in the first place? There are many reasons; among others, the following aspects contribute:

  • Software-related data from real SW companies/developers is considered sensitive commercial data. Commercial companies and SW houses are not eager to share SW-related data, even when it is not directly the code. Most companies use software management and monitoring tools such as SonarQube, JIRA, GitLab, Jenkins and many others. The data retrieved from those tools provides information about processes, metrics, quality, testing aspects and much more, all of which can reveal information about companies and teams as such. Moreover, such data might contain personal information (such as the names of programmers), so privacy and GDPR aspects should also be taken into account [2].

  • From a machine learning perspective, software-related data often poses a one-shot learning problem. A system/classifier trained on data from one project or one company still does not represent other projects and companies, so training and adjusting the algorithms has to be repeated all the time. Indeed, especially now in the era of Rapid Software Development (RSD) and agile/lean methodologies, it is difficult to observe long-term patterns in the way of working. Projects, developers and approaches change often, causing sudden changes in the data as well, meaning that trained models might no longer be relevant. Therefore, a lifelong learning approach to machine learning is beneficial and indeed required.

  • Software-related data contains noise, and manual work on data adaptation is almost always needed. A good example: each team (even within the same company) might use different naming conventions for GitLab labels, and processes, e.g. ticket life cycles, can also vary. Software data also suffers from the problem of data imbalance. The classes containing faults/bugs are under-represented, which causes the typical difficult data (d2) problem for machine learning techniques [3].

The problems outlined above motivate our research into machine learning techniques that improve the quality of rapid software development.

3 Proposed Method

Artificial Neural Networks (ANNs) constitute a functional instrument for creating machine learning models. They are a widely used tool for data mining, as they equip the user with classification, regression, clustering and time series analysis abilities. The premise of an ANN is that it imitates, in an abstractly simplified form, the learning capabilities of a biological neural network [4].

Fig. 1. A general architecture of the neural network adapted for bug prediction (the depth of the network varies depending on the experiment).

The impressive modeling capability of an ANN in fields relying on pattern recognition stems directly from its striking adaptability to data. Its extensive approximation capability is notably beneficial in handling real-world data, where data is plentiful but the patterns buried in it are yet to be uncovered. Not only can the network discover the interconnections among the variables, it can also generalize well enough to provide satisfactory results on novel data [5]. An Artificial Neural Network is essentially like fitting a line, plane, or hyper-plane through a dataset, defining the relationships that might exist among the features [6].

Fig. 2. The procedure pipeline.

A multilayer neural network is constructed from multiple computational (hidden) layers. The data flows from the input layer to the next layer, undergoing the adequate arithmetic along the way, and is then supplied to the following layer, and so on, until it arrives at the output layer. A model illustrating the general architecture of a multilayer neural network can be seen in Fig. 1. This mechanism is dubbed a feed-forward neural network [7]. The number of neurons and the number of layers depend on the complexity of the required model and on the availability of data [5]. Using hidden layers with fewer nodes than inputs creates a loss in representation, which frequently improves the network's performance; this might come as a result of eliminating the noise in the data.

Designing a network with too many neurons can result in overfitting. Overfitting, or overtraining, means that the model has fitted itself to extremely specific patterns of the training dataset, and will thus perform poorly on new, unseen data, as it is not general enough [7].
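To make the architecture of Fig. 1 concrete, the following is a minimal sketch of such a feed-forward classifier written with the Keras API; the layer widths, depth and training settings are illustrative assumptions, not the exact configurations used in our experiments (those are summarized in Table 4).

```python
# Minimal feed-forward network sketch (hyperparameters are illustrative).
import tensorflow as tf

def build_model(n_features: int, hidden_layers=(16, 8)) -> tf.keras.Model:
    """Builds a feed-forward binary classifier; depth/widths are assumptions."""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Input(shape=(n_features,)))
    for width in hidden_layers:
        # Hidden layers narrower than the input can act as a noise filter.
        model.add(tf.keras.layers.Dense(width, activation="relu"))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))  # bug / no bug
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Example usage:
# model = build_model(n_features=15)
# model.fit(X_train, y_train, epochs=50, validation_split=0.2)
```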

The proposed method uses Principal Component Analysis (PCA) for dimensionality reduction, singles out the bug instances in the dataset, clusters the 'clean' examples into a number of clusters that balances the number of bugs, and re-merges the dataset to achieve a balanced dataset, which is then fed to the classifier. The pipeline is shown in Fig. 2.
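A simplified sketch of this balancing step in scikit-learn terms might look as follows; the choice of KMeans as the clustering algorithm, the use of cluster centroids as representatives of the clean class, and the number of principal components are illustrative assumptions rather than a verbatim reproduction of the pipeline.

```python
# Sketch of the balancing pipeline (KMeans and centroid representatives
# are illustrative assumptions; the exact clustering setup may differ).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def balance_dataset(X, y, n_components=10, random_state=0):
    """Reduce dimensionality, then shrink the majority ('clean') class to
    as many cluster centroids as there are bug instances."""
    X_red = PCA(n_components=n_components).fit_transform(X)
    bugs, clean = X_red[y == 1], X_red[y == 0]      # single out bug instances
    km = KMeans(n_clusters=len(bugs), n_init=10,
                random_state=random_state).fit(clean)
    X_bal = np.vstack([bugs, km.cluster_centers_])  # re-merge: bugs + centroids
    y_bal = np.concatenate([np.ones(len(bugs)), np.zeros(len(bugs))])
    return X_bal, y_bal
```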

4 Data Imbalance

A set is referred to as imbalanced when the classes are not represented in an equal manner [3]. What might initially seem like a negligible issue can cause machine learning algorithms to fail. An instance supplied in [8] describes a mammography dataset that includes no more than 2% of abnormalities. In that case, classifying all samples into the majority class would yield an accuracy of 98%, strikingly missing the point of creating a machine learning algorithm to identify the minority class. Dataset imbalance exists in most real-world research problems, and two families of resolutions to the challenge have been developed: resampling, i.e. subsampling the majority class or oversampling the minority class, and attaching a specific cost function to the training samples [8]; both are sketched below for contrast. Inspired by the emergence of the granular computing (GrC) paradigm, which has proved valuable in multiple scenarios [9], and after the successful application of the paradigm in [10], we are now investigating the feasibility of GrC for dataset balancing. An interesting approach using the k-NN algorithm and rough sets can be observed in [11]. Our proposed method stems from the idea that one can preserve the characteristics of the dataset by clustering the majority class so that the number of clusters matches the number of data samples in the minority class. While this method has its drawbacks, namely the problem of overlapping clean and bug granules, it still improves the accuracy by 10% as compared with a simple subsampling approach.
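For reference, the two standard resampling resolutions mentioned above can be sketched as follows; this is a minimal numpy illustration for contrast, not the cluster-based balancing we propose.

```python
# Minimal sketch of the two standard resampling resolutions (for contrast
# with the cluster-based balancing used in our pipeline).
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X, y):
    """Subsample the majority class down to the minority class size."""
    minority, majority = X[y == 1], X[y == 0]
    keep = rng.choice(len(majority), size=len(minority), replace=False)
    X_bal = np.vstack([minority, majority[keep]])
    y_bal = np.concatenate([np.ones(len(minority)), np.zeros(len(minority))])
    return X_bal, y_bal

def random_oversample(X, y):
    """Duplicate minority samples until both classes are equally sized."""
    minority, majority = X[y == 1], X[y == 0]
    pick = rng.choice(len(minority), size=len(majority), replace=True)
    X_bal = np.vstack([majority, minority[pick]])
    y_bal = np.concatenate([np.zeros(len(majority)), np.ones(len(majority))])
    return X_bal, y_bal
```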

Table 1. Results of all datasets combined, using 10-fold cross validation
Table 2. Results of 4 datasets combined, new project classification

5 Results and Comparison with Classic ML Methods

5.1 Bug Prediction Dataset

The dataset [12] which we utilise consists of an aggregation of class-level software development metrics. As mentioned in the accompanying paper, the main aim of the dataset is to provide a benchmark as an experimental field for test-running novel approaches. The set supplies characteristics derived from source code metrics in conjunction with historical and process information. The number of bugs and their impact is also supplied. Since the data is provided at the class level, defect prediction can also be performed at the class level. The data could, however, be combined into the package or subsystem level by summing class metrics.
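Such a package-level aggregation could look like the following sketch; the file name and the column names (class_name and the derived package column) are hypothetical placeholders, since the exact CSV layout of [12] differs per metric file.

```python
# Hypothetical aggregation of class-level metrics to package level.
import pandas as pd

df = pd.read_csv("class-metrics.csv")  # file name is an assumption
# Derive a package identifier from the fully qualified class name.
df["package"] = df["class_name"].str.rsplit(".", n=1).str[0]
package_level = df.groupby("package").sum(numeric_only=True)
```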

The dataset contains metrics of 5 projects: Eclipse JDT Core, Eclipse PDE UI, Equinox Framework, Lucene and Mylyn. Every project comes with a range of derived metrics, of which the change log data, in the form of comma-separated files, was used in this paper. The features were suggested by [13].

5.2 Results

In order to demonstrate that the analysed data is difficult and that the proposed method achieves superior results, we have compared it against various classical machine learning methods. To this end, we have used a ROC curve (Fig. 3) to report effectiveness in terms of the number of false positives (false alarms) and true positives (correctly predicted bugs); a sketch of the ROC computation follows the list below. In these experiments we have considered the following classifiers:

Table 3. Training on the ‘Eclipse’ project, classification on ‘pde’

  • Random Forest

  • AdaBoost

  • Ensemble of RepTrees

  • Gradient Boosted Trees
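
The ROC computation itself is standard; a minimal scikit-learn sketch on synthetic data (the toy dataset and the Random Forest choice are placeholders for any of the classifiers above) is:

```python
# Minimal ROC sketch (synthetic imbalanced data; any fitted classifier works).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

X, y = make_classification(weights=[0.9], random_state=0)  # 90/10 imbalance
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
fpr, tpr, _ = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])
print("AUC:", auc(fpr, tpr))
```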

All of these methods have been wrapped in a metaclassifier that makes the base classifier cost-sensitive. More precisely, the metaclassifier weights the training instances according to the total cost assigned to each class.
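In scikit-learn terms, such cost-sensitive wrapping can be approximated by per-instance weights derived from per-class costs; the cost values below are illustrative assumptions, not the ones used in the experiments.

```python
# Approximation of cost-sensitive wrapping via per-instance weights
# (cost values are illustrative assumptions).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class_cost = {0: 1.0, 1: 10.0}  # assumed: missing a bug costs 10x

def fit_cost_sensitive(clf, X, y):
    """Weights each training instance by the cost assigned to its class."""
    weights = np.array([class_cost[label] for label in y])
    return clf.fit(X, y, sample_weight=weights)

# Example usage:
# clf = fit_cost_sensitive(RandomForestClassifier(), X_train, y_train)
```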

In this experiment, the Random Forest classifier is composed of 300 Random Trees combined using the bagging technique. Each bag contains roughly 20% of the data. During training we control the depth of the trees, setting a hard limit of 10.
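In scikit-learn this configuration could be expressed roughly as follows (an approximation: the base learner here is a standard decision tree rather than a Random Tree):

```python
# Approximate scikit-learn counterpart of the described Random Forest setup.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,  # 300 trees combined by bagging
    max_samples=0.2,   # each bag holds roughly 20% of the data
    max_depth=10,      # hard depth limit used during training
)
```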

For the AdaBoost method we have used the classical approach. The ensemble is composed of one-level decision trees (decision stumps). We have noticed that increasing the ensemble size above 100 does not improve the quality.
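A scikit-learn sketch of this setup is given below; only the stump depth and the ensemble size are taken from the description above.

```python
# Classical AdaBoost over decision stumps (ensemble size as reported).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    # Note: scikit-learn versions before 1.2 call this 'base_estimator'.
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps
    n_estimators=100,  # larger ensembles brought no further improvement
)
```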

The ensemble of RepTrees is built similarly to a Random Forest. However, instead of a Random Tree as the base classifier we have adopted the well-known RepTree decision tree (Reduced Error Pruning Tree). This machine learning technique uses a pruned decision tree: in each iteration the method generates multiple regression trees and chooses the best one. It builds the tree using variance and information gain (measured via entropy), and prunes it using a back-fitting method.
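RepTree itself comes from the Weka toolkit and has no direct scikit-learn equivalent; as a rough stand-in, one could bag pruned decision trees, as in the sketch below. The substitution of cost-complexity pruning for reduced-error pruning is an explicit approximation, not the exact algorithm used.

```python
# Rough stand-in for a bagged RepTree ensemble: bagging over pruned trees.
# (RepTree's reduced-error pruning is approximated here by cost-complexity
# pruning via ccp_alpha; this is an illustrative substitute, not RepTree.)
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_of_trees = BaggingClassifier(
    # Note: scikit-learn versions before 1.2 call this 'base_estimator'.
    estimator=DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.001),
    n_estimators=100,
)
```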

Table 4. Best hyperparameter setups found for the ANN's accuracy, a summary of multiple grid searches.

GBT stands for Gradient Boosted Trees. The method uses an additive learning approach: in each iteration a single tree is trained and added to the ensemble in order to fix the errors (optimise the objective function) introduced in the previous iteration. The objective function measures both the loss and the complexity of the trees comprising the ensemble.
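Such a regularised objective is characteristic of XGBoost-style implementations; a sketch of one possible configuration (the parameter values are illustrative assumptions, not our experimental setup) is:

```python
# Illustrative Gradient Boosted Trees configuration (XGBoost-style;
# parameter values are assumptions, not the experimental setup).
from xgboost import XGBClassifier

gbt = XGBClassifier(
    n_estimators=100,  # one tree added per boosting iteration
    learning_rate=0.1,
    reg_lambda=1.0,    # penalises tree complexity in the objective
    max_depth=6,
)
```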

Table 5. Comparison of ML algorithm setups
Table 6. Comparison of methods on all datasets combined, using 10-fold cross validation
Fig. 3. ROC curve obtained for various classifiers: Random Forest, AdaBoost, Bag of RepTrees, Gradient Boosted Trees (GBT).

As shown in Fig. 3 and Table 6, we have achieved the best results for the Random Forest classifier. Although the recall for this method is higher than that achieved by our method in Table 1, the precision and f1-score remain far inferior.

When researching the ANN method, multiple scenarios were evaluated throughout the experiments. The best accuracy results, along with their respective hyperparameter setups, can be found in Table 4, followed by the specific parameters of the comparison algorithms in Table 5. As mentioned earlier, the dataset [12] provides the metrics of 5 different coding projects. This situation differs from the one evaluated in [14] and [15], where, in order to fulfil the requirements stated by the software house's executives, data from platforms like GitLab and SonarQube were proposed. One approach evaluated how the algorithm would perform if it was trained on one project and tested on another; the detailed results can be seen in Table 3. A different scenario evaluated how an ANN trained on 4 of the projects would perform on a new project, as seen in Table 2. Finally, a 10-fold cross validation of all the datasets combined resulted in the performance depicted in Table 1. A sketch of these evaluation protocols follows.
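The protocols can be sketched as follows; the data mapping, the short project keys, and clf are placeholders standing in for the loaded dataset [12] and for any of the classifiers above.

```python
# Sketch of the evaluation protocols; `data` maps a project name to its
# (X, y) arrays and is a placeholder for the loaded dataset [12].
import numpy as np
from sklearn.model_selection import cross_val_score

PROJECTS = ["jdt", "pde", "equinox", "lucene", "mylyn"]

def leave_one_project_out(clf, data):
    """Train on 4 projects, classify the held-out one (cf. Table 2)."""
    scores = {}
    for held_out in PROJECTS:
        X_tr = np.vstack([data[p][0] for p in PROJECTS if p != held_out])
        y_tr = np.concatenate([data[p][1] for p in PROJECTS if p != held_out])
        X_te, y_te = data[held_out]
        scores[held_out] = clf.fit(X_tr, y_tr).score(X_te, y_te)
    return scores

def ten_fold_all_combined(clf, data):
    """10-fold cross validation on all datasets combined (cf. Table 1)."""
    X = np.vstack([data[p][0] for p in PROJECTS])
    y = np.concatenate([data[p][1] for p in PROJECTS])
    return cross_val_score(clf, X, y, cv=10)
```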

6 Conclusions

In this paper we tackle the problem of analysing difficult software-related data. In general, such data can be analysed to improve software quality, detect faults and bugs, or improve programming patterns. However, quite often the results are hampered by the nature of the data. Here, we propose machine learning techniques to detect bugs while addressing the problem of data imbalance. The presented results (Table 7) are comparable to other approaches, and we are currently working on applying them in practice to real commercial data from industrial software products.

Table 7. An example of ANN optimisation results - 3 hidden layers