research-article

Kam1n0: MapReduce-based Assembly Clone Search for Reverse Engineering

Authors: Steven H.H. Ding, Benjamin C.M. Fung, Philippe Charland

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 461 - 470

https://doi.org/10.1145/2939672.2939719

Published: 13 August 2016

Abstract

Assembly code analysis is one of the critical processes for detecting and proving software plagiarism and software patent infringement when the source code is unavailable. It is also commonly used to discover exploits and vulnerabilities in existing software. However, it is a labor-intensive and time-consuming process, even for experienced reverse engineers. An effective and efficient assembly code clone search engine can greatly reduce this effort, since it can identify cloned parts that have already been analyzed. The assembly code clone search problem belongs to the field of software engineering, but it depends strongly on practical nearest neighbor search techniques from data mining and databases. By collaborating closely with reverse engineers and Defence Research and Development Canada (DRDC), we study, from a data mining perspective, the concerns and challenges that make existing assembly code clone search approaches impractical. We propose a new variant of the locality-sensitive hashing (LSH) scheme and combine it with graph matching to address these challenges. We implement an integrated assembly clone search engine called Kam1n0. It is the first clone search engine that can efficiently identify the subgraph clones of a given query assembly function in a large assembly code repository. Kam1n0 is built upon the Apache Spark computation framework and Cassandra-like key-value distributed storage. A deployed demo system is publicly available. Extensive experimental results suggest that Kam1n0 is accurate, efficient, and scalable for handling large volumes of assembly code.
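To make the nearest neighbor idea in the abstract concrete, the following is a minimal sketch of generic sign-random-projection LSH for cosine similarity. It is not the LSH variant proposed in the paper, which the abstract does not describe. The feature vectors (for example, counts of instruction mnemonics per basic block) and all function names are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def build_index(blocks, hyperplanes):
    """Group block names by signature so similar vectors share a bucket."""
    index = defaultdict(list)
    for name, vec in blocks.items():
        index[signature(vec, hyperplanes)].append(name)
    return index


def query(index, vec, hyperplanes):
    """Return the names stored in the bucket that vec hashes to."""
    return index.get(signature(vec, hyperplanes), [])


# Hypothetical feature vectors: counts of (mov, add, cmp, jmp) per block.
blocks = {"block_a": [2.0, 0.0, 1.0, 3.0], "block_b": [0.0, 5.0, 0.0, 1.0]}
planes = make_hyperplanes(dim=4, n_bits=8, seed=1)
index = build_index(blocks, planes)
candidates = query(index, [4.0, 0.0, 2.0, 6.0], planes)
```

Because the signature depends only on the direction of a vector, a query pointing in the same direction as an indexed block always lands in that block's bucket. Vectors at a small angle collide with high probability, and a full system would use several such hash tables to trade recall against the number of candidates checked.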



Published In

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 2016, 2176 pages

ISBN: 9781450342322

DOI: 10.1145/2939672

General Chairs: Balaji Krishnapuram (IBM), Mohak Shah (Bosch)

Program Chairs: Alex Smola (Amazon), Charu Aggarwal (IBM), Dou Shen (Baidu), Rajeev Rastogi (Amazon)

Copyright © 2016 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was co-authored by an affiliate of the Canadian National Government. As such, the Crown in Right of Canada retains an equal interest in the copyright. Reprint requests should be forwarded to ACM, and reprints must include clear attribution to ACM and National Research Council Canada - NRC.


Publisher

Association for Computing Machinery

New York, NY, United States


Conference

KDD '16: The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 13 - 17, 2016

San Francisco, California, USA

Acceptance Rates

KDD '16 Paper Acceptance Rate 66 of 1,115 submissions, 6%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
