DOI: 10.1145/3643991.3644883
research-article

MalwareBench: Malware samples are not enough

Published: 02 July 2024

Abstract

The prevalent use of third-party components in modern software development, rapid modernization, and digitization have significantly amplified the risk of software supply chain attacks. Popular large registries like npm and PyPI are highly targeted malware distribution channels for attackers due to heavy growth and dependence on third-party components. Industry and academia are working towards building tools to detect malware in the software supply chain. However, a lack of benchmark datasets containing both malicious and neutral packages hampers the evaluation of the performance of these malware detection tools. The goal of our study is to aid researchers and tool developers in evaluating and improving malware detection tools by contributing a benchmark dataset built by systematically collecting malicious and neutral packages from the npm and PyPI ecosystems. We present MalwareBench, a labeled dataset of 20,792 packages (of which 6,659 are malicious) from the npm and PyPI ecosystems. We constructed the benchmark dataset by incorporating pre-existing malware datasets with the Socket internal benchmark data and including popular and newly released npm and PyPI packages. The ground truth labels of these neutral packages were determined using the Socket AI Scanner and manual inspection.
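As a rough illustration of how a labeled benchmark of this kind might be used, the sketch below derives the class balance from the two counts stated in the abstract and scores a detector's verdicts against ground-truth labels. The `score` helper, the `Scores` type, and the choice of precision, recall, and F1 are assumptions made for illustration; the abstract does not specify an evaluation protocol or a dataset file format.

```python
from dataclasses import dataclass
from typing import Sequence

# Counts taken from the abstract; the neutral count is derived from them.
TOTAL_PACKAGES = 20_792
MALICIOUS_PACKAGES = 6_659
NEUTRAL_PACKAGES = TOTAL_PACKAGES - MALICIOUS_PACKAGES  # 14,133


@dataclass(frozen=True)
class Scores:
    """Hypothetical container for detector metrics (not from the paper)."""
    precision: float
    recall: float
    f1: float


def score(labels: Sequence[bool], predictions: Sequence[bool]) -> Scores:
    """Compare detector verdicts with ground truth (True means malicious)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)        # malicious, flagged
    fp = sum(1 for y, p in pairs if not y and p)    # neutral, flagged
    fn = sum(1 for y, p in pairs if y and not p)    # malicious, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return Scores(precision, recall, f1)


if __name__ == "__main__":
    # Baseline: a detector that flags every package as malicious.
    labels = [True] * MALICIOUS_PACKAGES + [False] * NEUTRAL_PACKAGES
    flag_everything = [True] * TOTAL_PACKAGES
    print(score(labels, flag_everything))
```

Under these assumptions, the flag-everything baseline reaches a recall of 1.0 but a precision of only about 0.32 (the malicious share of the dataset), which is why a benchmark needs neutral packages as well as malware samples to expose false positives.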


Published In

MSR '24: Proceedings of the 21st International Conference on Mining Software Repositories
April 2024
788 pages
ISBN:9798400705878
DOI:10.1145/3643991
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. software engineering security
  2. software supply chain
  3. software supply chain security
  4. npm and PyPI ecosystems
  5. malicious packages
  6. benchmark dataset

Qualifiers

  • Research-article
