Open access

ModelGo: A Practical Tool for Machine Learning License Analysis

Authors:

Bingsheng He

WWW '24: Proceedings of the ACM Web Conference 2024

Pages 1158 - 1169

https://doi.org/10.1145/3589334.3645520

Published: 13 May 2024

Abstract

Productionizing machine learning projects is inherently complex, involving a multitude of interconnected components that are assembled like LEGO blocks and evolve throughout the development lifecycle. These components encompass software, databases, and models, each subject to various licenses governing their reuse and redistribution. However, existing license analysis approaches for Open Source Software (OSS) are not well-suited for this context. For instance, some projects are licensed without explicitly granting sublicensing rights, or the granted rights can be revoked, potentially exposing their derivatives to legal risks. Indeed, the analysis of licenses in machine learning projects grows significantly more intricate as it involves interactions among diverse types of licenses and licensed materials. To the best of our knowledge, no prior research has explored license conflicts within this domain. In this paper, we introduce ModelGo, a practical tool for auditing potential legal risks in machine learning projects to enhance compliance and fairness. With ModelGo, we present license assessment reports based on five use cases with diverse model-reusing scenarios, rendered by real-world machine learning components. Finally, we summarize the reasons behind license conflicts and provide guidelines for minimizing them. Our code is publicly available at https://github.com/Xtra-Computing/ModelGo.

Index Terms

ModelGo: A Practical Tool for Machine Learning License Analysis
1. Software and its engineering
  1. Software creation and management
    1. Collaboration in software development
      1. Open source model

Published In

WWW '24: Proceedings of the ACM Web Conference 2024

May 2024

4826 pages

ISBN:9798400701719

DOI:10.1145/3589334

General Chairs:
Tat-Seng Chua (National University of Singapore), Chong-Wah Ngo (Singapore Management University)

Proceedings Chair:
Roy Ka-Wei Lee (Singapore University of Technology and Design)

Program Chairs:
Ravi Kumar (Google), Hady W. Lauw (Singapore Management University)

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License.

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

AI Singapore

Conference

WWW '24

Sponsor:

SIGWEB

WWW '24: The ACM Web Conference 2024

May 13 - 17, 2024

Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
