A Close Look at a Daily Dataset of Malware Samples

Authors: Xabier Ugarte-Pedrero, Mariano Graziano, Davide Balzarotti

ACM Transactions on Privacy and Security (TOPS), Volume 22, Issue 1

Article No.: 6, Pages 1 - 30

https://doi.org/10.1145/3291061

Published: 22 January 2019

Abstract

The number of unique malware samples is growing out of control. Over the years, security companies have designed and deployed complex infrastructures to collect and analyze this overwhelming number of samples. As a result, a security company can collect more than 1M unique files per day from its various feeds alone. These files are automatically stored and processed to extract actionable information derived from static and dynamic analysis. However, only a tiny fraction of this data is interesting to security researchers and attracts the attention of a human expert.

To the best of our knowledge, nobody has systematically dissected these datasets to understand precisely what they contain. The security community generally dismisses the problem because of the alleged prevalence of uninteresting samples.

In this article, we guide the reader through a step-by-step analysis of the hundreds of thousands of Windows executables collected in one day from these feeds. Our goal is to show how a company can employ existing state-of-the-art techniques to automatically process these samples and then perform manual experiments to understand and document the real content of this gigantic dataset. We present the filtering steps, and we discuss in detail how samples can be grouped together according to their behavior to support manual verification. Finally, we use the results of this measurement experiment to provide a rough estimate of both the human and computer resources that are required to get to the bottom of the catch of the day.

Index Terms

Security and privacy

Published In

ACM Transactions on Privacy and Security Volume 22, Issue 1

February 2019

226 pages

ISSN: 2471-2566

EISSN: 2471-2574

DOI: 10.1145/3287762

Editor:
David Basin
ETH Zurich, Switzerland

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 January 2019

Accepted: 01 October 2018

Revised: 01 October 2018

Received: 01 January 2018

Published in TOPS Volume 22, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

Horizon 2020

