DOI: 10.1145/3568562.3568643
research-article

ml-Codesmell: A code smell prediction dataset for machine learning approaches

Published: 01 December 2022

Abstract

In recent years, many studies on detecting code smells in source code have published datasets, but these datasets have notable limitations: ambiguous code smell definitions lead to different interpretations of each code smell, the number of samples is small, and the features are heterogeneous. As a result, comparing the performance of code smell detection models is challenging, and the datasets are often not reusable in other code smell detection studies. In this work, we propose the ml-Codesmell dataset, created by analyzing source code and extracting a large number of source code metrics together with many labelled code smells. The proposed dataset has been used to train machine learning algorithms and predict code smells. Based on the high F1-scores obtained in the evaluation, the ml-Codesmell dataset demonstrates a strong correlation between features and labels. Given these advantages, the ml-Codesmell dataset is expected to be helpful for studies on detecting code smells using machine learning approaches in software development.
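
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of how an F1-score is computed for a binary code smell label. The function name, the label encoding (1 = smelly, 0 = clean), and the sample labels are illustrative assumptions; they are not taken from the ml-Codesmell dataset or from the paper's evaluation.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1-score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = code smell present, 0 = absent.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(f1_score(y_true, y_pred))
```

With these made-up labels there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.75 and the F1-score is 0.75.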

    Published In

    SoICT '22: Proceedings of the 11th International Symposium on Information and Communication Technology
    December 2022
    474 pages
    ISBN:9781450397254
    DOI:10.1145/3568562
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Code Smell Prediction
    2. Dataset
    3. Machine learning

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SoICT 2022

    Acceptance Rates

    Overall Acceptance Rate 147 of 318 submissions, 46%
