
Mapping Source Code to Software Architecture by Leveraging Large Language Models

  • Conference paper
  • First Online:
Software Architecture. ECSA 2024 Tracks and Workshops (ECSA 2024)

Abstract

Architecture refactoring is a major challenge: restructuring functionality from a legacy architecture to a new, intended one requires thorough analysis and labor-intensive, error-prone activities, since the source code must be adapted to match the new structure. In this context, automatically mapping source code to the intended architecture would significantly reduce manual work and help prevent technical debt. To this end, in this paper we map methods to architectural modules that are defined solely by textual descriptions, i.e., we formulate the task as a machine learning text classification problem. Methods are mapped to modules using several different approaches. We apply the proposed approach to an open-source software system; the results show that vectorizing text and code using large language models outperforms other modern methods. The various machine learning classifiers applied perform comparably well, with the best attaining an accuracy of around 40% and an F1-score of around 30%.
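The mapping task the abstract describes can be illustrated with a minimal sketch: given vector embeddings of each module's textual description and of a method's source code (in the paper these embeddings come from large language models; here they are stand-in toy vectors), a method is assigned to the module whose description embedding is most similar. All module names and vector values below are hypothetical and purely illustrative, not taken from the paper.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy stand-in embeddings; in the paper's setting these would come from
# an LLM encoder applied to each module's textual description.
module_embeddings = {
    "navigation": [0.9, 0.1, 0.0],
    "telemetry":  [0.1, 0.8, 0.2],
    "logging":    [0.0, 0.2, 0.9],
}

def map_method_to_module(method_embedding, modules):
    # Assign the method to the module whose description embedding
    # is most similar (a simple nearest-neighbor classification).
    return max(modules, key=lambda m: cosine(method_embedding, modules[m]))

method_vec = [0.85, 0.15, 0.05]  # hypothetical embedding of a method body
print(map_method_to_module(method_vec, module_embeddings))  # → navigation
```

In the paper's formulation a trained classifier stands in for this nearest-neighbor step, but the core idea is the same: both code and module descriptions live in one embedding space, where similarity drives the mapping.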


Data Availability Statement

Data and Software developed to support the findings of this study are available at: https://github.com/nijoad/source-code-mapping-to-architecture/tree/main.

Notes

  1. See GitHub: “PX4/PX4-Autopilot/tree/main/src/modules”.


Author information


Corresponding author

Correspondence to Nils Johansson.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Johansson, N., Caporuscio, M., Olsson, T. (2024). Mapping Source Code to Software Architecture by Leveraging Large Language Models. In: Ampatzoglou, A., et al. Software Architecture. ECSA 2024 Tracks and Workshops. ECSA 2024. Lecture Notes in Computer Science, vol 14937. Springer, Cham. https://doi.org/10.1007/978-3-031-71246-3_13

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-71246-3_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70945-6

  • Online ISBN: 978-3-031-71246-3

  • eBook Packages: Computer Science (R0)
