DOI: 10.1145/3510454.3516866

WhyGen: explaining ML-powered code generation by referring to training examples

Published: 19 October 2022

Abstract

Deep learning has demonstrated great ability on various code generation tasks. However, despite the convenience it brings to many developers, there is concern that code generators may recite or closely mimic copyrighted training data without the user's awareness, raising legal and ethical issues. To mitigate this problem, we introduce a tool, named WhyGen, that explains generated code by referring to training examples. Specifically, we first introduce a data structure, named the inference fingerprint, to represent the decision process of the model when generating a prediction. The fingerprints of all training examples are collected offline and saved to a database. When the model is used at runtime for code generation, the most relevant training examples can be retrieved by querying the fingerprint database. Our experiments show that WhyGen is able to precisely notify users about possible recitations and highly similar imitations, with a top-10 accuracy of 81.21%. The demo video can be found at https://youtu.be/EtoQP6850To.
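To make the described pipeline concrete, the sketch below shows one plausible realization of the offline indexing and runtime lookup in Python. It is a minimal sketch under stated assumptions, not the paper's exact implementation: it assumes a fingerprint is a fixed-size vector pooled from the model's hidden activations, uses a library such as FAISS for nearest-neighbor search, and the dimensionality, pooling choice, and stand-in data are all illustrative.

```python
# Sketch of a WhyGen-style fingerprint index (assumptions, not the paper's code):
# fingerprints are float vectors pooled from per-token hidden states, and
# FAISS provides the k-nearest-neighbor database.
import numpy as np
import faiss  # pip install faiss-cpu

DIM = 768    # assumed fingerprint dimensionality (e.g., a hidden size)
TOP_K = 10   # matches the top-10 retrieval evaluated in the paper

def fingerprint(hidden_states: np.ndarray) -> np.ndarray:
    """Collapse per-token hidden states (T x DIM) into one inference
    fingerprint. Mean pooling is an illustrative choice only."""
    return hidden_states.mean(axis=0).astype("float32")

# --- Offline: collect fingerprints of all training examples ---------------
rng = np.random.default_rng(0)
train_states = [rng.standard_normal((50, DIM)) for _ in range(1000)]  # stand-ins
train_fps = np.stack([fingerprint(h) for h in train_states])

index = faiss.IndexFlatL2(DIM)  # exact L2 search; IVF/PQ variants scale further
index.add(train_fps)

# --- Online: query with the fingerprint of a fresh generation -------------
query_fp = fingerprint(rng.standard_normal((40, DIM)))[None, :]
distances, ids = index.search(query_fp, TOP_K)
print("most relevant training examples:", ids[0])
```

In the actual tool, the hidden states would come from the code-generation model's forward pass, and each retrieved id would map back to a stored training snippet that is surfaced to the user as a possible recitation source.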


Cited By

  • (2024) X-TED: Massive Parallelization of Tree Edit Distance. Proceedings of the VLDB Endowment 17(7), 1683-1696. DOI: 10.14778/3654621.3654634.
  • (2023) Is GitHub's Copilot as bad as humans at introducing vulnerabilities in code? Empirical Software Engineering 28(6). DOI: 10.1007/s10664-023-10380-1.


Published In

ICSE '22: Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings
May 2022, 394 pages
ISBN: 9781450392235
DOI: 10.1145/3510454

In-Cooperation: IEEE CS

Publisher: Association for Computing Machinery, New York, NY, United States


            Author Tags

            1. code generation
            2. intellectual property
            3. machine learning
            4. recitation

            Qualifiers

            • Research-article

Conference

ICSE '22

Acceptance Rates

Overall acceptance rate: 276 of 1,856 submissions, 15%


