ml-Codesmell: A code smell prediction dataset for machine learning approaches

Authors:

Binh Nguyen Thanh,

Minh Nguyen N. H.,

Hanh Le Thi My,

Binh Nguyen Thanh

SoICT '22: Proceedings of the 11th International Symposium on Information and Communication Technology

Pages 368 - 374

https://doi.org/10.1145/3568562.3568643

Published: 01 December 2022

Abstract

In recent years, many studies on detecting code smells in source code have published datasets with notable limitations: ambiguous code smell definitions lead to different interpretations of each code smell, the datasets contain few samples, and their features are heterogeneous. As a result, comparing the performance of code smell detection models is challenging, and the datasets are often not reusable in other code smell detection studies. In this work, we propose the ml-Codesmell dataset, created by analyzing source code and extracting a large number of source code metrics together with many labelled code smells. The proposed dataset has been used to train machine learning algorithms and predict code smells. Based on the high F1-scores obtained in evaluation, the ml-Codesmell dataset demonstrates a strong correlation between features and labels. Given these advantages, the ml-Codesmell dataset is expected to be helpful for studies on detecting code smells using machine learning approaches in software development.
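For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of how an F1-score is computed for a binary "smelly / not smelly" prediction. The function name `f1_score` and the label vectors are hypothetical and purely illustrative; they are not taken from the ml-Codesmell dataset or from the authors' evaluation.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1-score (harmonic mean of precision and recall)
    for the class `positive`, given true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined),
        # so the score is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = code smell present, 0 = no code smell.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

Because F1 combines precision and recall, it is less easily inflated by class imbalance than plain accuracy, which is one reason it is commonly reported for code smell prediction.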

Index Terms

ml-Codesmell: A code smell prediction dataset for machine learning approaches
1. Software and its engineering
  1. Software notations and tools
    1. Software maintenance tools

Published In

SoICT '22: Proceedings of the 11th International Symposium on Information and Communication Technology

December 2022

474 pages

ISBN:9781450397254

DOI:10.1145/3568562

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SoICT 2022

SoICT 2022: The 11th International Symposium on Information and Communication Technology

December 1 - 3, 2022

Hanoi, Vietnam

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%
