DOI: 10.1145/3510454.3516866

WhyGen: explaining ML-powered code generation by referring to training examples

Published: 19 October 2022

Abstract

Deep learning has demonstrated great ability on various code generation tasks. However, despite the convenience it brings to many developers, there is concern that code generators may recite or closely mimic copyrighted training data without the user's awareness, raising legal and ethical issues. To mitigate this problem, we introduce a tool, named WhyGen, that explains generated code by referring to training examples. Specifically, we first introduce a data structure, named the inference fingerprint, to represent the decision process of the model when generating a prediction. The fingerprints of all training examples are collected offline and saved to a database. When the model is used at runtime for code generation, the most relevant training examples can be retrieved by querying the fingerprint database. Our experiments show that WhyGen is able to precisely notify users about possible recitations and highly similar imitations, with a top-10 accuracy of 81.21%. The demo video can be found at https://youtu.be/EtoQP6850To.
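To make the described pipeline concrete, the sketch below shows one plausible realization of the offline indexing and runtime lookup in Python. It is a minimal sketch under stated assumptions, not the paper's exact implementation: it assumes a fingerprint is a fixed-size vector pooled from the model's hidden activations, uses a library such as FAISS for nearest-neighbor search, and the dimensionality, pooling choice, and stand-in data are all illustrative.

```python
# Sketch of a WhyGen-style fingerprint index (assumptions, not the paper's code):
# fingerprints are float vectors pooled from per-token hidden states, and
# FAISS provides the k-nearest-neighbor database.
import numpy as np
import faiss  # pip install faiss-cpu

DIM = 768    # assumed fingerprint dimensionality (e.g., a hidden size)
TOP_K = 10   # matches the top-10 retrieval evaluated in the paper

def fingerprint(hidden_states: np.ndarray) -> np.ndarray:
    """Collapse per-token hidden states (T x DIM) into one inference
    fingerprint. Mean pooling is an illustrative choice only."""
    return hidden_states.mean(axis=0).astype("float32")

# --- Offline: collect fingerprints of all training examples ---------------
rng = np.random.default_rng(0)
train_states = [rng.standard_normal((50, DIM)) for _ in range(1000)]  # stand-ins
train_fps = np.stack([fingerprint(h) for h in train_states])

index = faiss.IndexFlatL2(DIM)  # exact L2 search; IVF/PQ variants scale further
index.add(train_fps)

# --- Online: query with the fingerprint of a fresh generation -------------
query_fp = fingerprint(rng.standard_normal((40, DIM)))[None, :]
distances, ids = index.search(query_fp, TOP_K)
print("most relevant training examples:", ids[0])
```

In the actual tool, the hidden states would come from the code-generation model's forward pass, and each retrieved id would map back to a stored training snippet that is surfaced to the user as a possible recitation source.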


Cited By

  • (2024) X-TED: Massive Parallelization of Tree Edit Distance. Proceedings of the VLDB Endowment 17(7), 1683-1696. DOI: 10.14778/3654621.3654634.
  • (2023) Is GitHub's Copilot as bad as humans at introducing vulnerabilities in code? Empirical Software Engineering 28(6). DOI: 10.1007/s10664-023-10380-1.


Published In

ICSE '22: Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings
May 2022, 394 pages
ISBN: 9781450392235
DOI: 10.1145/3510454

In-Cooperation: IEEE CS

Publisher: Association for Computing Machinery, New York, NY, United States


            Author Tags

            1. code generation
            2. intellectual property
            3. machine learning
            4. recitation

            Qualifiers

            • Research-article

Conference

ICSE '22

Acceptance Rates

Overall acceptance rate: 276 of 1,856 submissions, 15%


