
Supervised machine learning approach to predict qualitative software product

Special Issue · Evolutionary Intelligence

Abstract

The software development process (SDP) is a framework imposed on software product development: a multi-stage process in which a wide range of tasks and activities pan out at each stage. Each stage requires careful observation to improve productivity, quality, and other attributes that ease development. During each stage, problems surface, such as the constraint of on-time completion, proper utilization of available resources, and appropriate traceability of work progress; these may lead to reiteration due to defects spotted during testing and result in negative walk-throughs due to unsatisfactory outcomes. Working on such defects helps steer activities properly and thus improve the expected performance of the software product. Handpicking the notable features of the SDP and then analyzing how they affect the outcome can greatly help in obtaining a reliable software product that meets the expected objectives. This paper proposes supervised machine learning (ML) models for SDP prediction, focusing in particular on cost estimation, defect prediction, and reusability. Experimental studies were conducted on primary data, and the evaluation reveals the models' suitability in terms of efficiency and effectiveness for SDP prediction (accuracy of cost estimation: 65%, defect prediction: 93%, and reusability: 82%).




References

  1. Ankita AA (2015) Cost evaluation framework of effort estimation models. Int Res J Manag Sci Technol 6(7):37–46. Retrieved from https://www.academia.edu/19641801/CostEvaluationFrameworkofEffortEstimationModels. Accessed 29 Sept 2019

  2. Boehm B, Clark B, Horowitz E, Westland C, Madachy R, Selby R (1995) Cost models for future software life cycle processes: COCOMO 2.0. Ann Softw Eng 1(1):57–94

  3. Brownlee J (2016) What is confusion matrix in machine learning [Blog post]. Retrieved from https://machinelearningmastery.com/confusion-matrix-machine-learning/. Accessed 27 Sept 2019

  4. Das S, Dey A, Pal A, Roy N (2015) Applications of artificial intelligence in machine learning: review and prospect. Int J Comput Appl 115(9):31–41

  5. Deng Z, Zhu X, Cheng D, Zong M, Zhang S (2016) Efficient kNN classification algorithm for big data. Neurocomputing 195:143–148

  6. Devi J, Seghal N (2017) A review of improving software quality using machine learning algorithms. Retrieved from https://www.semanticscholar.org/paper/A-Review-of-Improving-Software-Quality-using-Devi-Seghal/84b9b971e866acf011e8522e5537a96a1c65c689. Accessed 22 Aug 2019

  7. Gray D, Bowes D, Davey N, Sun Y, Christianson B (2011) The misuse of the NASA metrics data program data sets for automated software defect prediction. In: 15th annual conference on evaluation and assessment in software engineering (EASE 2011). IET, pp. 96–103

  8. Hassan AE (2009) Predicting faults using the complexity of code changes. In: Proceedings of the 31st international conference on software engineering. IEEE Computer Society, pp. 78–88

  9. Huang X, Ho D, Ren J, Capretz LF (2007) Improving the COCOMO model using a neuro-fuzzy approach. Appl Soft Comput 7(1):29–40

  10. ISTQB (2012) Why is testing necessary. In: Certified tester, foundation level syllabus, p. 11. Available: https://www.istqb.org/downloads/send/2-foundation-level-documents/3-foundation-level-syllabus-2011.html4. Accessed 22 Sept 2019

  11. JJ (2016) MAE and RMSE—which metric is better? [Blog post]. Retrieved from https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d. Accessed 22 Sept 2019

  12. Kalopsia (2019) Software engineering|COCOMO Model. [Blog post]. Retrieved from https://www.geeksforgeeks.org/software-engineering-cocomo-model/. Accessed 22 Sept 2019

  13. Kumari S, Pushkar S (2013) Performance analysis of the software cost estimation methods: a review. Int J Adv Res Comput Sci Softw Eng 3(7)

  14. Leszak M, Perry DE, Stoll D (2000) A case study in root cause defect analysis. In: Proceedings of the 22nd international conference on Software engineering. ACM, pp. 428–437

  15. Long A (2018) Understanding data science classification metrics in scikit-learn in python [Blog post]. Retrieved from https://towardsdatascience.com/understanding-data-science-classification-metrics-in-scikit-learn-in-python-3bc336865019. Accessed 28 Sept 2019

  16. Lounis H, Ait-Mehedine L (2004) Machine-learning techniques for software product quality assessment. In: Fourth international conference on quality software (QSIC 2004), proceedings. IEEE, pp. 102–109

  17. Marill KA (2004) Advanced statistics: linear regression, part II: multiple linear regression. Acad Emerg Med 11(1):94–102

  18. Menzies T, Di Stefano JS (2004) The PROMISE repository of software engineering databases. School of Information Technology and Engineering, University of Ottawa, Canada. Available at: https://promise.site.uottawa.ca/SERepository/datasets/reuse.arff. Accessed 1 Oct 2019

  19. Menzies T (2004) The PROMISE repository of software engineering databases. School of Information Technology and Engineering, University of Ottawa, Canada. Retrieved from http://promise.site.uottawa.ca/SERepository/datasets/cm1.arff. Accessed 22 Sept 2019

  20. Menzies T (2006) The PROMISE Repository of Software Engineering Databases. School of Information Technology and Engineering, University of Ottawa, Canada. Available at: https://promise.site.uottawa.ca/SERepository/datasets/cocomonasa_2.arff. Accessed 22 Sept 2019

  21. Musílek P, Pedrycz W, Succi G, Reformat M (2020) Software cost estimation with granular models

  22. Nassar B (2016) Prediction of software faults based on requirements and design interrelationships. Master's thesis, Department of Computer Science and Engineering, Chalmers University of Technology / University of Gothenburg, Gothenburg, Sweden

  23. Ross DT, Goodenough JB, Irvine CA (1975) Software engineering: Process, principles, and goals. Computer 8(5):17–27

  24. Sarker IH, Faruque F, Hossen U, Rahman A (2015) A survey of software development process models in software engineering. Int J Softw Eng Appl 9(11):55–70

  25. Shenvi AA (2009) Defect prevention with orthogonal defect classification. In: Proceedings of the 2nd India software engineering conference. ACM, pp. 83–88

  26. Shepperd M, Bowes D, Hall T (2014) Researcher bias: The use of machine learning in software defect prediction. IEEE Trans Software Eng 40(6):603–616

  27. Song Q, Jia Z, Shepperd M, Ying S, Liu J (2011) A general software defect-proneness prediction framework. IEEE Trans Softw Eng 37(3):356–370

  28. Tomar AB, Thakare VM (2011) A systematic study of software quality models. Int J Eng Appl 2(4):61

  29. Zhou Y, Xu B, Leung H, Chen L (2014) An in-depth study of the potentially confounded effect of class size in fault prediction. ACM Trans Softw Eng Methodol 23(1):1–51

Author information

Corresponding author

Correspondence to Rajat Kumar Behera.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

Appendix B

See Appendix Tables 12, 13 and 14.

Table 12 Data model for cost estimation prediction
Table 13 Data model for defect prediction
Table 14 Data model for reusability prediction

2.1 Glossary

To make the discussion more accessible to researchers with little or no prior experience with ML evaluation metrics, the relevant key terminology is defined below.

2.1.1 Mean absolute error (MAE)

MAE is the overall average of the error, obtained as the absolute difference between the predicted and the actual observation [11], and is presented in Eq. G.1, where y is the predicted result, \(\widehat{\mathrm{y}}\) is the actual result, and n is the number of observations. It measures how close the predictions are to the eventual outcomes; a lower value signifies better prediction.

$$MAE = \frac{1}{n}\sum\limits_{j = 1}^{n} {\left| {y_{j} - \hat{y}_{j} } \right|}$$
(G.1)
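As a minimal illustrative sketch (not the authors' code; the numbers are made up), Eq. G.1 can be computed directly in Python:

y_pred = [2.5, 0.0, 2.1, 7.8]   # predicted results (y, in the paper's notation)
y_true = [3.0, -0.5, 2.0, 8.0]  # actual results (y-hat)

# Mean of the absolute differences between predicted and actual values (Eq. G.1)
mae = sum(abs(p - a) for p, a in zip(y_pred, y_true)) / len(y_true)
print(f"MAE = {mae:.3f}")  # 0.325 here; lower is better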

2.1.2 Root mean square error (RMSE)

RMSE is the square root of the average of the squared differences between the predicted and the actual observations [11] and is presented in Eq. G.2, where y is the predicted result, \(\widehat{\mathrm{y}}\) is the actual result, and n is the number of observations. It represents the sample standard deviation of the differences between predicted and observed values; a lower value signifies better prediction.

$$RMSE = \sqrt{\frac{1}{n}\sum\limits_{j = 1}^{n} \left( y_{j} - \hat{y}_{j} \right)^{2}}$$
(G.2)
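Continuing the same made-up example, Eq. G.2 differs from MAE only in squaring the differences before averaging and taking a square root at the end:

import math

y_pred = [2.5, 0.0, 2.1, 7.8]   # same hypothetical data as in the MAE sketch
y_true = [3.0, -0.5, 2.0, 8.0]

# Square root of the mean squared difference (Eq. G.2)
rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(y_pred, y_true)) / len(y_true))
print(f"RMSE = {rmse:.3f}")  # about 0.371; squaring penalizes large errors more than MAE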

2.1.3 Confusion matrix

The confusion matrix is a technique for conveying the performance of a classification algorithm. It contains four performance counts obtained by comparing the predicted observations against the actual ones: True Positive, True Negative, False Positive and False Negative [3].

True Positive (TP): an event value correctly predicted as an event.

True Negative (TN): a no-event value correctly predicted as a no-event.

False Positive (FP): a no-event value incorrectly predicted as an event.

False Negative (FN): an event value incorrectly predicted as a no-event.
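For concreteness, the four counts can be tallied directly from binary labels; the label vectors below are hypothetical (1 = event, 0 = no-event), not data from this study:

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual observations
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # predicted observations

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # correctly predicted events
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correctly predicted no-events
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # no-events predicted as events
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # events predicted as no-events

print(tp, tn, fp, fn)  # 3 3 1 1 for these labels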

2.1.4 Accuracy

Accuracy is the fraction of samples predicted correctly [15] and is presented in Eq. G.3, where TP is True Positive, TN is True Negative, FP is False Positive, and FN is False Negative (refer to the confusion matrix).

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$
(G.3)
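Applied to the hypothetical counts tallied in the confusion-matrix sketch above, Eq. G.3 reduces to a single line:

accuracy = (tp + tn) / (tp + tn + fp + fn)  # (3 + 3) / 8 = 0.75 (Eq. G.3)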

2.1.5 Precision

Precision is the fraction of predicted positive events that are actually positive [15] and is presented in Eq. G.4, where TP is True Positive and FP is False Positive (refer to the confusion matrix). It answers the question of what proportion of positive identifications was actually correct; a higher value means more relevant predictions relative to irrelevant ones.

$${\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$
(G.4)

2.1.6 Recall

Recall is the fraction of actual positive cases that are predicted correctly [15] and is presented in Eq. G.5, where TP is True Positive and FN is False Negative (refer to the confusion matrix). It answers the question of what proportion of actual positives was identified correctly; a higher value represents better coverage of the relevant cases. It is also called the True Positive Rate (TPR).

$${\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(G.5)

High precision corresponds to a low false positive rate, and high recall corresponds to a low false negative rate. A classifier with both high precision and high recall performs very well.

2.1.7 F1-score

The F1-score is the harmonic mean of precision and recall, with a higher score signifying better model accuracy [15]; it is presented in Eq. G.6.

$$\text{F1-score} = \frac{2 \times (\text{precision} \times \text{recall})}{\text{precision} + \text{recall}}$$
(G.6)
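Eqs. G.4 to G.6 follow the same pattern. The sketch below reuses the hypothetical confusion-matrix counts from the earlier sketch; unlike the bare formulas, a real implementation should also guard against zero denominators (e.g., when a classifier predicts no positives at all):

precision = tp / (tp + fp)                            # 3 / 4 = 0.75 (Eq. G.4)
recall = tp / (tp + fn)                               # 3 / 4 = 0.75 (Eq. G.5)
f1 = 2 * (precision * recall) / (precision + recall)  # 0.75 (Eq. G.6)
print(f"Precision = {precision:.2f}, Recall = {recall:.2f}, F1 = {f1:.2f}")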

Cite this article

Sinha, H., Behera, R.K. Supervised machine learning approach to predict qualitative software product. Evol. Intel. 14, 741–758 (2021). https://doi.org/10.1007/s12065-020-00434-4
