A Close Look at a Daily Dataset of Malware Samples

Published: 22 January 2019

Abstract

The number of unique malware samples is growing out of control. Over the years, security companies have designed and deployed complex infrastructures to collect and analyze this overwhelming number of samples. As a result, a security company can collect more than 1M unique files per day from its different feeds alone. These files are automatically stored and processed to extract actionable information derived from static and dynamic analysis. However, only a tiny fraction of this data is interesting to security researchers and attracts the attention of a human expert.
To the best of our knowledge, nobody has systematically dissected these datasets to understand precisely what they contain. The security community generally dismisses the problem because of the alleged prevalence of uninteresting samples.
In this article, we guide the reader through a step-by-step analysis of the hundreds of thousands of Windows executables collected in one day from these feeds. Our goal is to show how a company can employ existing state-of-the-art techniques to automatically process these samples and then perform manual experiments to understand and document what this gigantic dataset really contains. We present the filtering steps, and we discuss in detail how samples can be grouped together according to their behavior to support manual verification. Finally, we use the results of this measurement experiment to provide a rough estimate of both the human and computer resources that are required to get to the bottom of the catch of the day.
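To make the idea of grouping samples by behavior concrete, the following is a minimal sketch, not the procedure used in the article. It assumes each sample has already been reduced to a set of behavioral feature strings (the sample names, feature strings, and the `group_by_behavior` helper are all hypothetical) and simply places samples with identical feature sets in the same group, so that an analyst could inspect one representative per group.

```python
from collections import defaultdict


def group_by_behavior(profiles):
    """Group sample IDs whose behavioral feature sets are identical.

    `profiles` maps a sample ID to an iterable of feature strings.
    Exact-match grouping is an illustrative assumption; a real
    pipeline would likely use a similarity-based clustering instead.
    """
    groups = defaultdict(list)
    for sample_id, features in profiles.items():
        # frozenset makes the profile hashable and order-independent
        groups[frozenset(features)].append(sample_id)
    # Largest groups first, so the most common behavior is reviewed first
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical behavioral profiles, e.g. extracted from sandbox reports
profiles = {
    "sample_a": {"file:write:%TEMP%", "net:dns"},
    "sample_b": {"net:dns", "file:write:%TEMP%"},
    "sample_c": {"registry:set:run_key"},
}

groups = group_by_behavior(profiles)
for members in groups:
    print(len(members), members)
```

Exact matching keeps the sketch short; it would split samples that differ in a single feature, which is why similarity-based grouping is the more plausible choice at the scale the abstract describes.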


Published In

ACM Transactions on Privacy and Security, Volume 22, Issue 1
February 2019
226 pages
ISSN: 2471-2566
EISSN: 2471-2574
DOI: 10.1145/3287762
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 January 2019
Accepted: 01 October 2018
Revised: 01 October 2018
Received: 01 January 2018
Published in TOPS Volume 22, Issue 1


Author Tags

  1. Malware
  2. classification
  3. measurement
  4. prioritization

Qualifiers

  • Research-article
  • Research
  • Refereed
