From Reinvention to Reuse: An Empirical Example Study on Technical Debt Dataset

  • Conference paper
Product-Focused Software Process Improvement (PROFES 2024)

Abstract

Self-Admitted Technical Debt (SATD) is a subset of Technical Debt (TD) in which the developer leaves a comment in the source code, thus marking the place where debt has been taken on. Previous research on SATD relies on either the creation of new datasets or the reuse of existing ones. One seminal SATD dataset, containing over 4,000 SATD comments and their classification into five different TD categories, was published by Maldonado et al. [14]. The drawback of the dataset is its lack of any other information, e.g. static analysis results, which seriously limits its possible use cases. We remedy this situation by reforming the dataset. We combine the original comments with contextual information and static analysis results from the source code and recreate the dataset as an SQLite database. Our reformed dataset contains over 13,000 files, nearly 14,000 classes, almost 100,000 methods, and over 650,000 code violation instances. The reformed dataset allows varied and detailed analyses in the future, which we demonstrate by examining the relationship of SATD comments to code violations. The results show that on the method level, the most important predictors are the total number of code violations as well as the number of violations labelled as Priority 3 or belonging to the Documentation Rule Set. On the file level, LOC is an important predictor alongside the number of violations from the Documentation Rule Set or with a Priority 2 classification. Overall, our example study demonstrates the potential that reforming existing datasets can have.
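
To illustrate the kind of analysis that an SQLite form of such a dataset makes possible, the following is a minimal sketch that counts code violations per method and sets them beside a SATD flag. The table names, column names, and toy rows are hypothetical assumptions made for illustration only; they are not the schema or contents of the published dataset, which should be inspected before writing real queries.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the layout of the published dataset.
SCHEMA = """
CREATE TABLE method (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    has_satd INTEGER NOT NULL      -- 1 if the method contains a SATD comment
);
CREATE TABLE violation (
    id INTEGER PRIMARY KEY,
    method_id INTEGER NOT NULL REFERENCES method(id),
    rule_set TEXT NOT NULL,        -- e.g. 'Documentation'
    priority INTEGER NOT NULL      -- static-analysis priority level
);
"""


def build_toy_db():
    """Create an in-memory database filled with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO method (id, name, has_satd) VALUES (?, ?, ?)",
        [(1, "parse", 1), (2, "render", 0), (3, "close", 0)],
    )
    conn.executemany(
        "INSERT INTO violation (method_id, rule_set, priority) VALUES (?, ?, ?)",
        [
            (1, "Documentation", 3),
            (1, "Design", 3),
            (1, "Documentation", 2),
            (2, "Code Style", 4),
        ],
    )
    return conn


def violations_per_method(conn):
    """Return (name, has_satd, total, priority3, documentation) per method."""
    query = """
    SELECT m.name,
           m.has_satd,
           COUNT(v.id) AS total,
           COALESCE(SUM(v.priority = 3), 0) AS priority3,
           COALESCE(SUM(v.rule_set = 'Documentation'), 0) AS documentation
    FROM method AS m
    LEFT JOIN violation AS v ON v.method_id = m.id
    GROUP BY m.id
    ORDER BY m.id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in violations_per_method(build_toy_db()):
        print(row)
```

The LEFT JOIN keeps methods that have no violations at all, so they appear with counts of zero rather than dropping out of the result. Per-method counts of this shape could then be passed to a statistical model as candidate predictors of SATD presence.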

Notes

  1. https://pmd.github.io/latest/.

  2. https://pmd.github.io/latest/.

  3. https://docs.pmd-code.org/pmd-doc-7.0.0/pmd_rules_java.html.

  4. https://www.rdocumentation.org/packages/textreuse/versions/0.1.5.

References

  1. AlOmar, E.A., et al.: SATDBailiff-mining and tracking self-admitted technical debt. Sci. Comput. Program. 213, 102693 (2022)

  2. Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press (2011)

  3. Avgeriou, P., Kruchten, P., Ozkaya, I., Seaman, C.: Managing technical debt in software engineering (Dagstuhl Seminar 16162). In: Dagstuhl Reports, vol. 6. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2016)

  4. Broder, A.Z.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 1–10. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45123-4_1

  5. Huang, Q., Shihab, E., Xia, X., Lo, D., Li, S.: Identifying self-admitted technical debt in open source projects using text mining. Empir. Softw. Eng. 23, 418–451 (2018)

  6. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613 (1998)

  7. Li, Z., Avgeriou, P., Liang, P.: A systematic mapping study on technical debt and its management. J. Syst. Softw. 101, 193–220 (2015)

  8. da Silva Maldonado, E., Shihab, E.: Detecting and quantifying different types of self-admitted technical debt. In: 2015 IEEE 7th International Workshop on Managing Technical Debt (MTD), pp. 9–15. IEEE (2015)

  9. Potdar, A., Shihab, E.: An exploratory study on self-admitted technical debt. In: 2014 IEEE International Conference on Software Maintenance and Evolution, pp. 91–100. IEEE (2014)

  10. Rantala, L.: From reuse to reform: a sample study on technical debt dataset, final dataset (2024). https://doi.org/10.6084/m9.figshare.22778606

  11. Rantala, L.: From reuse to reform: a sample study on technical debt dataset, replication package (2024). https://doi.org/10.6084/m9.figshare.21959882

  12. Ren, X., Xing, Z., Xia, X., Lo, D., Wang, X., Grundy, J.: Neural network-based detection of self-admitted technical debt: from performance to explainability. ACM Trans. Softw. Eng. Methodol. (TOSEM) 28(3), 1–45 (2019)

  13. Rice, W.R.: Analyzing tables of statistical tests. Evolution 43(1), 223–225 (1989)

  14. da Silva Maldonado, E., Shihab, E., Tsantalis, N.: Using natural language processing to automatically detect self-admitted technical debt. IEEE Trans. Software Eng. 43(11), 1044–1062 (2017)

  15. Singh, D., Sekar, V.R., Stolee, K.T., Johnson, B.: Evaluating how static analysis tools can reduce code review effort. In: 2017 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 101–105. IEEE (2017)

  16. Skryseth, D., Shivashankar, K., Pilán, I., Martini, A.: Technical debt classification in issue trackers using natural language processing based on transformers. In: 2023 ACM/IEEE International Conference on Technical Debt (TechDebt), pp. 92–101 (2023). https://doi.org/10.1109/TechDebt59074.2023.00017

  17. Sridharan, M., Mäntylä, M., Claes, M., Rantala, L.: SoCCMiner: a source code-comments and comment-context miner. In: Proceedings of the 19th International Conference on Mining Software Repositories, pp. 242–246 (2022)

  18. Stol, K.J., Fitzgerald, B.: The ABC of software engineering research. ACM Trans. Softw. Eng. Methodol. (TOSEM) 27(3), 1–51 (2018)

  19. Sutoyo, E., Capiluppi, A.: SATDAUG–a balanced and augmented dataset for detecting self-admitted technical debt. In: Proceedings of the 21st International Conference on Mining Software Repositories (2024)

  20. Trautsch, A., Herbold, S., Grabowski, J.: A longitudinal study of static analysis warning evolution and the effects of PMD on software quality in Apache open source projects. Empir. Softw. Eng. 25(6), 5137–5192 (2020)

  21. Xiao, T., Zeng, Z., Wang, D., Hata, H., McIntosh, S., Matsumoto, K.: Quantifying and characterizing clones of self-admitted technical debt in build systems. Empir. Softw. Eng. 29(2), 1–31 (2024)

  22. Yan, M., Xia, X., Shihab, E., Lo, D., Yin, J., Yang, X.: Automating change-level self-admitted technical debt determination. IEEE Trans. Software Eng. 45(12), 1211–1229 (2018)

  23. Yu, J., Zhao, K., Liu, J., Liu, X., Xu, Z., Wang, X.: Exploiting gated graph neural network for detecting and explaining self-admitted technical debts. J. Syst. Softw. 187, 111219 (2022)

  24. Zhu, K., Yin, M., Zhu, D., Zhang, X., Gao, C., Jiang, J.: SCGRU: a general approach for identifying multiple classes of self-admitted technical debt with text generation oversampling. J. Syst. Softw. 195, 111514 (2023)

Acknowledgments

The authors have been supported by the Academy of Finland (grant number 328058).

Author information

Corresponding author

Correspondence to Leevi Rantala.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Rantala, L., Mäntylä, M.V., Sridharan, M. (2025). From Reinvention to Reuse: An Empirical Example Study on Technical Debt Dataset. In: Pfahl, D., Gonzalez Huerta, J., Klünder, J., Anwar, H. (eds) Product-Focused Software Process Improvement. PROFES 2024. Lecture Notes in Computer Science, vol 15452. Springer, Cham. https://doi.org/10.1007/978-3-031-78386-9_8

  • DOI: https://doi.org/10.1007/978-3-031-78386-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78385-2

  • Online ISBN: 978-3-031-78386-9

  • eBook Packages: Computer Science, Computer Science (R0)
