Abstract
Self-Admitted Technical Debt (SATD) is a subset of Technical Debt (TD) in which the developer leaves a comment in the source code, thus marking the place where debt has been taken on. Previous research on SATD relies on either the creation of new datasets or the reuse of existing ones. One seminal SATD dataset, containing over 4,000 SATD comments and their classification into five different TD categories, was published by Maldonado et al. [14]. The drawback of the dataset is its lack of any other information, e.g. static analysis results, which seriously limits its possible use cases. We remedy this situation by reforming the dataset. We combine the original comments with contextual information and static analysis from the source code and recreate the dataset as an SQLite database. Our reformed dataset contains over 13,000 files, nearly 14,000 classes, almost 100,000 methods, and over 650,000 code violation instances. The reformed dataset allows varied and detailed analyses in the future, which we demonstrate by examining the relationship of SATD comments to code violations. The results show that on the method level, the most important predictors are the total number of code violations as well as the number of violations labelled as Priority 3 or belonging to the Documentation Rule Set. On the file level, LOC is an important predictor alongside the number of violations from the Documentation Rule Set or having a Priority 2 classification. Overall, our example study demonstrates the potential that reforming existing datasets can have.
References
AlOmar, E.A., et al.: SATDBailiff-mining and tracking self-admitted technical debt. Sci. Comput. Program. 213, 102693 (2022)
Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press (2011)
Avgeriou, P., Kruchten, P., Ozkaya, I., Seaman, C.: Managing technical debt in software engineering (Dagstuhl Seminar 16162). In: Dagstuhl Reports, vol. 6. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2016)
Broder, A.Z.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 1–10. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45123-4_1
Huang, Q., Shihab, E., Xia, X., Lo, D., Li, S.: Identifying self-admitted technical debt in open source projects using text mining. Empir. Softw. Eng. 23, 418–451 (2018)
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613 (1998)
Li, Z., Avgeriou, P., Liang, P.: A systematic mapping study on technical debt and its management. J. Syst. Softw. 101, 193–220 (2015)
da Silva Maldonado, E., Shihab, E.: Detecting and quantifying different types of self-admitted technical debt. In: 2015 IEEE 7th International Workshop on Managing Technical Debt (MTD), pp. 9–15. IEEE (2015)
Potdar, A., Shihab, E.: An exploratory study on self-admitted technical debt. In: 2014 IEEE International Conference on Software Maintenance and Evolution, pp. 91–100. IEEE (2014)
Rantala, L.: From reuse to reform: a sample study on technical debt dataset, final dataset (2024). https://doi.org/10.6084/m9.figshare.22778606
Rantala, L.: From reuse to reform: a sample study on technical debt dataset, replication package (2024). https://doi.org/10.6084/m9.figshare.21959882
Ren, X., Xing, Z., Xia, X., Lo, D., Wang, X., Grundy, J.: Neural network-based detection of self-admitted technical debt: from performance to explainability. ACM Trans. Softw. Eng. Methodol. (TOSEM) 28(3), 1–45 (2019)
Rice, W.R.: Analyzing tables of statistical tests. Evolution 43(1), 223–225 (1989)
da Silva Maldonado, E., Shihab, E., Tsantalis, N.: Using natural language processing to automatically detect self-admitted technical debt. IEEE Trans. Software Eng. 43(11), 1044–1062 (2017)
Singh, D., Sekar, V.R., Stolee, K.T., Johnson, B.: Evaluating how static analysis tools can reduce code review effort. In: 2017 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 101–105. IEEE (2017)
Skryseth, D., Shivashankar, K., Pilán, I., Martini, A.: Technical debt classification in issue trackers using natural language processing based on transformers. In: 2023 ACM/IEEE International Conference on Technical Debt (TechDebt), pp. 92–101 (2023). https://doi.org/10.1109/TechDebt59074.2023.00017
Sridharan, M., Mäntylä, M., Claes, M., Rantala, L.: SoCCMiner: a source code-comments and comment-context miner. In: Proceedings of the 19th International Conference on Mining Software Repositories, pp. 242–246 (2022)
Stol, K.J., Fitzgerald, B.: The ABC of software engineering research. ACM Trans. Softw. Eng. Methodol. (TOSEM) 27(3), 1–51 (2018)
Sutoyo, E., Capiluppi, A.: SATDAUG–a balanced and augmented dataset for detecting self-admitted technical debt. In: Proceedings of the 21st International Conference on Mining Software Repositories (2024)
Trautsch, A., Herbold, S., Grabowski, J.: A longitudinal study of static analysis warning evolution and the effects of PMD on software quality in apache open source projects. Empir. Softw. Eng. 25(6), 5137–5192 (2020)
Xiao, T., Zeng, Z., Wang, D., Hata, H., McIntosh, S., Matsumoto, K.: Quantifying and characterizing clones of self-admitted technical debt in build systems. Empir. Softw. Eng. 29(2), 1–31 (2024)
Yan, M., Xia, X., Shihab, E., Lo, D., Yin, J., Yang, X.: Automating change-level self-admitted technical debt determination. IEEE Trans. Software Eng. 45(12), 1211–1229 (2018)
Yu, J., Zhao, K., Liu, J., Liu, X., Xu, Z., Wang, X.: Exploiting gated graph neural network for detecting and explaining self-admitted technical debts. J. Syst. Softw. 187, 111219 (2022)
Zhu, K., Yin, M., Zhu, D., Zhang, X., Gao, C., Jiang, J.: SCGRU: a general approach for identifying multiple classes of self-admitted technical debt with text generation oversampling. J. Syst. Softw. 195, 111514 (2023)
Acknowledgments
The authors have been supported by the Academy of Finland (grant number 328058).
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Rantala, L., Mäntylä, M.V., Sridharan, M. (2025). From Reuse to Reform: A Sample Study on Technical Debt Dataset. In: Pfahl, D., Gonzalez Huerta, J., Klünder, J., Anwar, H. (eds) Product-Focused Software Process Improvement. PROFES 2024. Lecture Notes in Computer Science, vol 15452. Springer, Cham. https://doi.org/10.1007/978-3-031-78386-9_8
DOI: https://doi.org/10.1007/978-3-031-78386-9_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78385-2
Online ISBN: 978-3-031-78386-9
eBook Packages: Computer Science, Computer Science (R0)