Mapping Source Code to Software Architecture by Leveraging Large Language Models

Johansson, Nils; Caporuscio, Mauro; Olsson, Tobias

doi:10.1007/978-3-031-71246-3_13

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14937)

Included in the following conference series:

European Conference on Software Architecture

Abstract

Architecture refactoring is a major challenge: restructuring functionality from a legacy architecture to a new intended one requires thorough analysis and labor-intensive, error-prone activities, because the source code must be adapted to match the new structure. In this context, automatically mapping source code to the intended architecture would significantly reduce manual work and help prevent technical debt. To this end, we aim to map methods to architectural modules that are defined solely by textual descriptions, which we formulate as a machine learning text classification problem. Methods are mapped to modules using several approaches. We apply the proposed approach to an open-source software system. The results show that vectorizing text and code using large language models outperforms other modern methods. The different machine learning classifiers applied perform comparably well; the best attain an accuracy of around 40% and an F1-score of around 30%.
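To make the problem formulation concrete, the following is a minimal sketch of mapping methods to modules that are given only as text, and of scoring the result. It is not the authors' implementation. The bag-of-words `embed` function is a toy stand-in for a large language model embedding, the nearest-description rule replaces the trained classifiers the abstract refers to, macro-averaged F1 is an assumption (the abstract does not state the averaging), and all module descriptions and methods are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an LLM embedding: a bag-of-words count vector."""
    # Split camelCase so identifiers share tokens with prose descriptions.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return Counter(re.findall(r"[a-z]+", spaced.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def map_method(method_source, module_descriptions):
    """Return the module whose description is most similar to the method."""
    vec = embed(method_source)
    return max(
        module_descriptions,
        key=lambda name: cosine(vec, embed(module_descriptions[name])),
    )


def accuracy(y_true, y_pred):
    """Fraction of methods mapped to the correct module."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


# Hypothetical module descriptions and methods, invented for illustration.
MODULES = {
    "navigator": "plans the mission route and follows waypoints",
    "commander": "handles arming state and failsafe mode switching",
    "logger": "writes log messages to a file on storage",
}
METHODS = [
    ("void updateMissionWaypoints() { route.next_waypoint(); }", "navigator"),
    ("bool checkArmingState() { return failsafe.mode(); }", "commander"),
    ("void write_log_file(const char *messages) { storage.write(messages); }", "logger"),
]

predicted = [map_method(src, MODULES) for src, _ in METHODS]
expected = [label for _, label in METHODS]
print(predicted, accuracy(expected, predicted), macro_f1(expected, predicted))
```

Under the macro-averaging assumed here, a module that is never predicted correctly contributes an F1 of zero regardless of how few methods it contains, so the score can fall below accuracy when modules are unevenly sized. Swapping `embed` for a real embedding model and `map_method` for a trained classifier would leave the evaluation functions unchanged.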

Data Availability Statement

The data and software developed to support the findings of this study are available at: https://github.com/nijoad/source-code-mapping-to-architecture/tree/main.

Notes

1. See GitHub: “PX4/PX4-Autopilot/tree/main/src/modules”.

Author information

Authors and Affiliations

Linnaeus University, Växjö, Sweden
Nils Johansson, Mauro Caporuscio & Tobias Olsson
Volvo Construction Equipment, Braås, Sweden
Nils Johansson

Corresponding author

Correspondence to Nils Johansson.

Editor information

Editors and Affiliations

University of Macedonia, Thessaloniki, Greece
Apostolos Ampatzoglou
Universidad Politécnica de Madrid, Madrid, Spain
Jennifer Pérez
Masaryk University, Brno, Czech Republic
Barbora Buhnova
University of Oulu, Oulu, Finland
Valentina Lenarduzzi
University of Huddersfield, Huddersfield, UK
Colin C. Venters
University of Vienna, Vienna, Austria
Uwe Zdun
Université de Toulouse, Toulouse, France
Khalil Drira
Gran Sasso Science Institute, L'Aquila, Italy
Luciana Rebelo
University of L’Aquila, L'Aquila, Italy
Daniele Di Pompeo
University of L’Aquila, L'Aquila, Italy
Michele Tucci
University of São Paulo, São Carlos, Brazil
Elisa Yumi Nakagawa
University of Castilla-La Mancha, Albacete, Spain
Elena Navarro

About this paper

Cite this paper

Johansson, N., Caporuscio, M., Olsson, T. (2024). Mapping Source Code to Software Architecture by Leveraging Large Language Models. In: Ampatzoglou, A., et al. Software Architecture. ECSA 2024 Tracks and Workshops. ECSA 2024. Lecture Notes in Computer Science, vol 14937. Springer, Cham. https://doi.org/10.1007/978-3-031-71246-3_13

DOI: https://doi.org/10.1007/978-3-031-71246-3_13
Published: 01 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70945-6
Online ISBN: 978-3-031-71246-3
eBook Packages: Computer Science, Computer Science (R0)
