research-article

BitShred: feature hashing malware for scalable triage and semantic analysis

Authors:

Shobha Venkataraman

CCS '11: Proceedings of the 18th ACM conference on Computer and communications security

Pages 309 - 320

https://doi.org/10.1145/2046707.2046742

Published: 17 October 2011

Abstract

The sheer volume of new malware found each day is growing at an exponential pace. This growth has created a need for automatic malware triage techniques that determine what malware is similar, what malware is unique, and why. In this paper, we present BitShred, a system for large-scale malware similarity analysis and clustering, and for automatically uncovering semantic inter- and intra-family relationships within clusters. The key idea behind BitShred is using feature hashing to dramatically reduce the high-dimensional feature spaces that are common in malware analysis. Feature hashing also allows us to mine correlated features between malware families and samples using co-clustering techniques. Our evaluation shows that BitShred speeds up typical malware triage tasks by up to 2,365x and uses up to 82x less memory on a single CPU, all with comparable accuracy to previous approaches. We also develop a parallelized version of BitShred, and demonstrate scalability within the Hadoop framework.
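The feature-hashing idea in the abstract can be illustrated with a minimal sketch. It rests on several assumptions the abstract does not confirm: features are taken to be overlapping byte n-grams, the hash is the simple djb2 string hash, and the parameters `n = 4` and `m = 2**15` are arbitrary illustrative values. The function names `djb2`, `fingerprint`, and `jaccard` are hypothetical and are not taken from the BitShred implementation.

```python
def djb2(data: bytes) -> int:
    """djb2-style hash (h = h * 33 + byte), truncated to 32 bits."""
    h = 5381
    for b in data:
        h = (h * 33 + b) & 0xFFFFFFFF
    return h


def fingerprint(sample: bytes, n: int = 4, m: int = 1 << 15) -> int:
    """Hash every n-byte window of `sample` into an m-bit vector.

    The bit vector is stored as a Python int; n and m are assumed values.
    """
    bits = 0
    for i in range(len(sample) - n + 1):
        bits |= 1 << (djb2(sample[i:i + n]) % m)
    return bits


def jaccard(a: int, b: int) -> float:
    """Approximate Jaccard similarity of two bit-vector fingerprints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return bin(a & b).count("1") / union


if __name__ == "__main__":
    original = bytes(range(64))
    variant = original[:60] + b"\xff" * 4   # small change at the end
    unrelated = bytes(range(128, 192))      # shares no n-grams with original

    fp_original = fingerprint(original)
    print(jaccard(fp_original, fingerprint(variant)))
    print(jaccard(fp_original, fingerprint(unrelated)))
```

Each sample collapses to a fixed-size bit vector regardless of how many distinct features it contains, so a pairwise comparison reduces to bitwise AND/OR plus population counts. Hash collisions make the result an approximation of the Jaccard index over the underlying feature sets, and a larger `m` lowers the collision rate at the cost of memory.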

Index Terms

BitShred: feature hashing malware for scalable triage and semantic analysis
1. Security and privacy
  1. Intrusion/anomaly detection and malware mitigation
  2. Systems security
    1. Operating systems security

Published In

CCS '11: Proceedings of the 18th ACM conference on Computer and communications security

October 2011

742 pages

ISBN: 9781450309486

DOI: 10.1145/2046707

General Chair: Yan Chen (Northwestern University, USA)

Program Chairs: George Danezis (Microsoft Research Cambridge, UK), Vitaly Shmatikov (University of Texas at Austin, USA)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CCS'11: the ACM Conference on Computer and Communications Security

October 17 - 21, 2011

Chicago, Illinois, USA

Acceptance Rates

CCS '11 Paper Acceptance Rate 60 of 429 submissions, 14%

Overall Acceptance Rate 1,261 of 6,999 submissions, 18%
