CodeWMBench: An Automated Benchmark for Code Watermarking Evaluation

Authors: Nenghai Yu

ACM-TURC '24: Proceedings of the ACM Turing Award Celebration Conference - China 2024

Pages 120 - 125

https://doi.org/10.1145/3674399.3674447

Published: 30 July 2024

Abstract

As deep learning progresses, programming language generation models such as CodeLlama, GitHub Copilot, and ChatGPT have been widely applied to intelligent code development. However, they also lower the cost of code plagiarism, posing challenges to copyright and academic integrity. To meet the specific need to distinguish human-written code from machine-generated code, this paper introduces CodeWMBench, a comprehensive automated benchmark for actively detecting machine-generated code through watermarking. Through a meticulous evaluation of eight code watermarking methods, we report their performance in terms of harmlessness, robustness, and transparency. Specifically, for the first time, we introduce watermark removal techniques based on large language models and conduct the first assessment of these watermarking methods against code rewriting and retranslating attacks. In the discussion, we examine the critical issues currently facing code watermarking, including why existing code watermarking methods struggle to resist removal by large language models and which future methods could potentially withstand such removal.

Index Terms

1. Security and privacy
  1. Software and application security
    1. Software security engineering

Published In

ACM-TURC '24: Proceedings of the ACM Turing Award Celebration Conference - China 2024

July 2024

261 pages

ISBN:9798400710117

DOI:10.1145/3674399

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

the Natural Science Foundation of China

Conference

ACM-TURC '24: ACM Turing Award Celebration Conference 2024

July 5 - 7, 2024

Changsha, China
