DOI: 10.1145/3463274.3463315
Research Article

Human-level Ordinal Maintainability Prediction Based on Static Code Metrics

Published: 21 June 2021

Abstract

One of the greatest challenges in software quality control is the efficient and effective measurement of maintainability. Thorough expert assessments are precise yet slow and expensive, whereas automated static analysis yields imprecise yet rapid feedback. Several machine learning approaches aim to integrate the advantages of both concepts.
However, most prior studies did not rely on expert judgment but instead predicted the number of changed lines as a proxy for maintainability, or they were biased towards a small group of experts. In contrast, the present study builds on a manually labeled and validated dataset. Prediction is based on static code metrics; we found simple structural metrics, such as the size of a class and of its methods, to yield the highest predictive power for maintainability. Using just a small set of these metrics, our models distinguish easy-to-maintain from hard-to-maintain code with an F-score of 91.3% and an AUC of 82.3%.
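To make the binary setting concrete, the following minimal sketch (not the authors' implementation) trains a classifier on a handful of structural metrics and reports F-score and AUC. The metric names, the random-forest learner, and the synthetic data and labels are illustrative assumptions only.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 400
    # Hypothetical structural metrics per class: total size, average method size, method count.
    X = np.column_stack([
        rng.integers(20, 2000, n),   # class size in lines of code
        rng.integers(3, 120, n),     # average method size in lines of code
        rng.integers(1, 60, n),      # number of methods
    ])
    # Synthetic stand-in for the expert label: 1 = hard to maintain, 0 = easy to maintain.
    y = (X[:, 0] + 10 * X[:, 1] > 1500).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)

    print("F1 :", f1_score(y_test, model.predict(X_test)))
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

On real data the features would be extracted by a static analysis tool and the labels would come from the expert-annotated dataset; everything above is synthetic.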
In addition, we perform a more detailed ordinal classification and compare its quality with the performance of human experts. For this comparison, we use the deviations between the individual experts' ratings and the eventually determined consensus of all experts.
In sum, our models achieve the same level of performance as an average human expert; in terms of accuracy and mean squared error, they even outperform the human baseline. We hence argue that our models provide an automated and trustworthy prediction of software maintainability.
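The human-baseline comparison can be illustrated with a short sketch: both a single expert's ratings and a model's ordinal predictions are scored against the expert consensus using accuracy and mean squared error. All labels and values below are invented for illustration and do not come from the paper's dataset.

    import numpy as np
    from sklearn.metrics import accuracy_score, mean_squared_error

    # Ordinal maintainability labels, e.g. 0 = very easy ... 3 = very hard to maintain.
    consensus     = np.array([0, 1, 1, 2, 3, 2, 0, 1, 3, 2])   # consensus of all experts
    single_expert = np.array([0, 1, 2, 2, 3, 1, 0, 1, 2, 2])   # one expert's individual rating
    model_pred    = np.array([0, 1, 1, 2, 2, 2, 0, 2, 3, 2])   # the model's ordinal prediction

    for name, rating in [("expert", single_expert), ("model", model_pred)]:
        print(f"{name:>6}: accuracy = {accuracy_score(consensus, rating):.2f}, "
              f"MSE = {mean_squared_error(consensus, rating):.2f}")

A model whose accuracy is at least as high, and whose mean squared error is at least as low, as that of the individual experts performs at human level with respect to the consensus.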




Information

Published In

EASE '21: Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering
June 2021
417 pages
ISBN:9781450390538
DOI:10.1145/3463274
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 June 2021


Author Tags

  1. Expert Judgment
  2. Machine Learning
  3. Maintainability Prediction
  4. Ordinal Classification
  5. Software Maintainability

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EASE 2021

Acceptance Rates

Overall Acceptance Rate 71 of 232 submissions, 31%


Cited By

  • (2024) Ghost Echoes Revealed: Benchmarking Maintainability Metrics and Machine Learning Predictions Against Human Assessments. 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), 678–688. https://doi.org/10.1109/ICSME58944.2024.00072. Online publication date: 6-Oct-2024.
  • (2024) Actionable code smell identification with fusion learning of metrics and semantics. Science of Computer Programming 236(C), 103110. https://doi.org/10.1016/j.scico.2024.103110. Online publication date: 1-Sep-2024.
  • (2023) Revisiting Inter-Class Maintainability Indicators. 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 805–814. https://doi.org/10.1109/SANER56733.2023.00093. Online publication date: Mar-2023.
  • (2022) A Preliminary Study on Using Text- and Image-Based Machine Learning to Predict Software Maintainability. Software Quality: The Next Big Thing in Software Engineering and Quality, 41–60. https://doi.org/10.1007/978-3-031-04115-0_4. Online publication date: 12-Apr-2022.
  • (2021) Analyzing Static Analysis Metric Trends towards Early Identification of Non-Maintainable Software Components. Sustainability 13(22), 12848. https://doi.org/10.3390/su132212848. Online publication date: 20-Nov-2021.
