research-article
Open access

ModelGo: A Practical Tool for Machine Learning License Analysis

Published: 13 May 2024

Abstract

Productionizing machine learning projects is inherently complex, involving a multitude of interconnected components that are assembled like LEGO blocks and evolve throughout the development lifecycle. These components encompass software, databases, and models, each subject to various licenses governing their reuse and redistribution. However, existing license analysis approaches for Open Source Software (OSS) are not well-suited for this context. For instance, some projects are licensed without explicitly granting sublicensing rights, or the granted rights can be revoked, potentially exposing their derivatives to legal risks. Indeed, the analysis of licenses in machine learning projects grows significantly more intricate as it involves interactions among diverse types of licenses and licensed materials. To the best of our knowledge, no prior research has explored license conflicts within this domain. In this paper, we introduce ModelGo, a practical tool for auditing potential legal risks in machine learning projects to enhance compliance and fairness. With ModelGo, we present license assessment reports based on five use cases with diverse model-reusing scenarios, rendered by real-world machine learning components. Finally, we summarize the reasons behind license conflicts and provide guidelines for minimizing them. Our code is publicly available at https://github.com/Xtra-Computing/ModelGo.

    Published In

    WWW '24: Proceedings of the ACM Web Conference 2024
    May 2024
    4826 pages
    ISBN: 9798400701719
    DOI: 10.1145/3589334
    This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. ai licensing
    2. license analysis
    3. model mining

    Qualifiers

    • Research-article

    Funding Sources

    • AI Singapore

    Conference

    WWW '24: The ACM Web Conference 2024
    May 13-17, 2024
    Singapore, Singapore

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
