DOI: 10.1145/2046707.2046742

BitShred: feature hashing malware for scalable triage and semantic analysis

Published: 17 October 2011

Abstract

The sheer volume of new malware found each day is growing at an exponential pace. This growth has created a need for automatic malware triage techniques that determine what malware is similar, what malware is unique, and why. In this paper, we present BitShred, a system for large-scale malware similarity analysis and clustering, and for automatically uncovering semantic inter- and intra-family relationships within clusters. The key idea behind BitShred is using feature hashing to dramatically reduce the high-dimensional feature spaces that are common in malware analysis. Feature hashing also allows us to mine correlated features between malware families and samples using co-clustering techniques. Our evaluation shows that BitShred speeds up typical malware triage tasks by up to 2,365x and uses up to 82x less memory on a single CPU, all with comparable accuracy to previous approaches. We also develop a parallelized version of BitShred, and demonstrate scalability within the Hadoop framework.
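The feature-hashing idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the features, hash function, fingerprint size, or similarity metric, so everything below is an assumption made for illustration: byte n-grams as features, BLAKE2b as the hash, a 32,768-bit fingerprint, and the Jaccard index as the similarity measure. The names `ngrams`, `fingerprint`, and `similarity` are hypothetical and are not taken from the paper.

```python
import hashlib


def ngrams(data: bytes, n: int = 4):
    """Yield every contiguous n-byte window of data (assumed feature type)."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def fingerprint(data: bytes, m: int = 1 << 15, n: int = 4) -> int:
    """Hash each n-gram to one of m bit positions.

    The fingerprint is returned as a Python int used as a bit vector,
    so its size is fixed at m bits no matter how many distinct
    features the sample contains. m and n are illustrative values.
    """
    bits = 0
    for gram in ngrams(data, n):
        digest = hashlib.blake2b(gram, digest_size=8).digest()
        bits |= 1 << (int.from_bytes(digest, "big") % m)
    return bits


def similarity(a: int, b: int) -> float:
    """Jaccard index over the set bits of two fingerprints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return bin(a & b).count("1") / union
```

Under these assumptions, comparing two samples costs a few bitwise operations over fixed-size vectors rather than a set intersection over the full feature space, which is the kind of dimensionality reduction the abstract describes. Hash collisions can make unrelated samples look slightly more similar, so the fingerprint size trades memory against accuracy.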

      Published In

      CCS '11: Proceedings of the 18th ACM conference on Computer and communications security
      October 2011
      742 pages
      ISBN:9781450309486
      DOI:10.1145/2046707

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. co-clustering
      2. feature hashing
      3. hadoop
      4. malware triage

      Qualifiers

      • Research-article

      Conference

      CCS'11

      Acceptance Rates

      CCS '11 Paper Acceptance Rate 60 of 429 submissions, 14%;
      Overall Acceptance Rate 1,261 of 6,999 submissions, 18%

      Cited By

      • (2024) Evaluation of Malware Classification Models for Heterogeneous Data. Sensors 24:1 (288). DOI: 10.3390/s24010288. Online publication date: 3-Jan-2024
      • (2024) Pitfalls in Machine Learning for Computer Security. Communications of the ACM. DOI: 10.1145/3643456. Online publication date: 25-Oct-2024
      • (2024) Using Autoencoder as Feature Extractor for Malware Detection. 2024 International Conference on IT and Industrial Technologies (ICIT) (1-6). DOI: 10.1109/ICIT63607.2024.10860243. Online publication date: 10-Dec-2024
      • (2024) Identification of Unknown Malicious Flow Based on Adaptive Annotation and Deep Neural Networks. 2024 10th International Conference on Big Data and Information Analytics (BigDIA) (189-195). DOI: 10.1109/BigDIA63733.2024.10808643. Online publication date: 25-Oct-2024
      • (2024) Exploring the potential of deep learning and machine learning techniques for randomness analysis to enhance security on IoT. International Journal of Information Security 23:2 (1117-1130). DOI: 10.1007/s10207-023-00783-y. Online publication date: 1-Apr-2024
      • (2023) Generative intrusion detection and prevention on data stream. Proceedings of the 32nd USENIX Conference on Security Symposium (4319-4335). DOI: 10.5555/3620237.3620479. Online publication date: 9-Aug-2023
      • (2023) Deep Learning-Based Attack Detection and Classification in Android Devices. Electronics 12:15 (3253). DOI: 10.3390/electronics12153253. Online publication date: 28-Jul-2023
      • (2023) Revisiting Binary Code Similarity Analysis Using Interpretable Feature Engineering and Lessons Learned. IEEE Transactions on Software Engineering 49:4 (1661-1682). DOI: 10.1109/TSE.2022.3187689. Online publication date: 1-Apr-2023
      • (2023) Enhancing Machine Learning in Information Security: Power-Law Distribution and Dragon King. 2023 International Conference on Computer Science and Automation Technology (CSAT) (324-327). DOI: 10.1109/CSAT61646.2023.00088. Online publication date: 6-Oct-2023
      • (2023) Design and implementation of a sandbox for facilitating and automating IoT malware analysis with techniques to elicit malicious behavior: case studies of functionalities for dissecting IoT malware. Journal of Computer Virology and Hacking Techniques 19:2 (149-163). DOI: 10.1007/s11416-023-00478-x. Online publication date: 2-May-2023
