DOI:10.1145/3652620.3688338
research-article

Building deduplicated model repositories to assess domain-specific languages evolution

Published: 31 October 2024

Abstract

Software evolution and maintenance are real challenges in modern software engineering. In the context of model-driven development, which relies heavily on interconnected (meta-)models, tools, and generators, evolving both models and their associated meta-models is particularly complex. This issue is also prevalent in language engineering, where changes to a language's grammar or semantics must remain consistent with pre-existing models. In this paper, we explore how techniques inspired by repository mining can help a model designer or language engineer build a deduplicated dataset of existing models available in open-source repositories. Deduplication is essential to ensure that an evolution of the meta-model or language can be assessed efficiently. We apply the method to P4, an industrial domain-specific language (Intel, Linux Foundation) used to model software-defined networks.
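
The abstract does not detail how duplicates are detected, so the following is only a minimal sketch of one possible first step, not the authors' method: grouping byte-identical model files by a content hash computed the way Git hashes a blob. The function names `git_blob_sha1` and `group_exact_duplicates`, and the sample file paths, are hypothetical and chosen for illustration. Near-duplicate detection (files that differ slightly) is out of scope for this sketch.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Hash content as Git does for a blob: SHA-1 over b"blob <size>\\0" + content."""
    header = f"blob {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


def group_exact_duplicates(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each content hash to the sorted list of paths sharing that exact content.

    `files` is a hypothetical in-memory mapping from path to file bytes,
    standing in for model files collected from mined repositories.
    """
    groups: dict[str, list[str]] = {}
    for path in sorted(files):
        groups.setdefault(git_blob_sha1(files[path]), []).append(path)
    return groups


if __name__ == "__main__":
    # Hypothetical sample: two repositories ship the same P4 file byte for byte.
    sample = {
        "repo_a/switch.p4": b"control ingress() { apply { } }\n",
        "repo_b/copy_of_switch.p4": b"control ingress() { apply { } }\n",
        "repo_c/router.p4": b"control egress() { apply { } }\n",
    }
    for digest, paths in group_exact_duplicates(sample).items():
        # Keep one representative per group; the rest are exact duplicates.
        print(digest[:12], "keep:", paths[0], "drop:", paths[1:])
```

Keeping one representative per hash group removes only exact copies, such as files carried over by forks or vendoring; any similarity-based step would have to come after it.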

Published In

MODELS Companion '24: Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems
September 2024
1261 pages
ISBN:9798400706226
DOI:10.1145/3652620
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. compiler
  2. DSL
  3. model
  4. mining
  5. evolution

Qualifiers

  • Research-article

Funding Sources

  • NSERC Discovery
  • NSERC Alliance

Conference

MODELS Companion '24

Acceptance Rates

Overall Acceptance Rate 144 of 506 submissions, 28%
