Abstract
Architecture refactoring is challenging: it requires thorough analysis and labor-intensive, error-prone activities to restructure functionality from a legacy architecture into a new intended one, and the source code must be adapted to match the new structure. In this context, automatically mapping source code to the intended architecture would significantly reduce manual work and help prevent technical debt. To this end, we map methods to architectural modules that are defined solely by textual descriptions, formulating the task as a machine learning text classification problem. Methods are mapped to modules using several approaches. We apply the proposed approach to an open-source software system. The results show that vectorizing text and code using large language models outperforms other modern methods. The machine learning classifiers we applied perform comparably, with the best attaining an accuracy of around 40% and an F1-score of around 30%.
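To make the task formulation concrete, the following minimal Python sketch maps a method to the module whose textual description is most similar to it. It is an illustration under stated assumptions, not the pipeline evaluated in the paper: a toy bag-of-words vectorizer stands in for large-language-model embeddings, nearest-description matching by cosine similarity stands in for the trained classifiers, and the module names, descriptions, and method snippets are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens, breaking camelCase and snake_case identifiers."""
    # Insert a space at lower-to-upper transitions so "nextWaypoint" becomes "next Waypoint".
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-z]+", text.lower())


def vectorize(text):
    """Toy stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def map_method(method_source, module_descriptions):
    """Return the name of the module whose description is most similar to the method."""
    method_vec = vectorize(method_source)
    scores = {
        name: cosine(method_vec, vectorize(description))
        for name, description in module_descriptions.items()
    }
    return max(scores, key=scores.get)


# Hypothetical modules, each defined solely by a textual description.
MODULES = {
    "battery_monitor": "Reads battery voltage and current and estimates remaining charge.",
    "navigation": "Plans waypoints and computes the route to the next mission position.",
}

# Hypothetical method snippets to be mapped.
BATTERY_METHOD = "float estimateRemainingCharge(float voltage, float current) { return voltage * current; }"
ROUTE_METHOD = "Waypoint nextWaypoint(Mission mission) { return mission.route[0]; }"

if __name__ == "__main__":
    print(map_method(BATTERY_METHOD, MODULES))
    print(map_method(ROUTE_METHOD, MODULES))
```

Swapping `vectorize` for a real embedding model and the nearest-description rule for a trained classifier would bring the sketch closer to the setting the abstract describes.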
Data Availability Statement
Data and Software developed to support the findings of this study are available at: https://github.com/nijoad/source-code-mapping-to-architecture/tree/main.
Notes
1. See GitHub: “PX4/PX4-Autopilot/tree/main/src/modules”.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Johansson, N., Caporuscio, M., Olsson, T. (2024). Mapping Source Code to Software Architecture by Leveraging Large Language Models. In: Ampatzoglou, A., et al. Software Architecture. ECSA 2024 Tracks and Workshops. ECSA 2024. Lecture Notes in Computer Science, vol 14937. Springer, Cham. https://doi.org/10.1007/978-3-031-71246-3_13
DOI: https://doi.org/10.1007/978-3-031-71246-3_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70945-6
Online ISBN: 978-3-031-71246-3
eBook Packages: Computer Science, Computer Science (R0)