DOI:10.1145/2939672.2939719
research-article

Kam1n0: MapReduce-based Assembly Clone Search for Reverse Engineering

Published: 13 August 2016

Abstract

Assembly code analysis is a critical step in detecting and proving software plagiarism and software patent infringement when the source code is unavailable. It is also commonly used to discover exploits and vulnerabilities in existing software. However, it is a manual, time-consuming process even for experienced reverse engineers. An effective and efficient assembly code clone search engine can greatly reduce this effort, since it can identify cloned parts that have already been analyzed. The assembly code clone search problem belongs to the field of software engineering, but it depends strongly on practical nearest neighbor search techniques from data mining and databases. In close collaboration with reverse engineers and Defence Research and Development Canada (DRDC), we study, from a data mining perspective, the concerns and challenges that make existing assembly code clone search approaches impractical. We propose a new variant of the locality-sensitive hashing (LSH) scheme and combine it with graph matching to address these challenges. We implement an integrated assembly clone search engine called Kam1n0. It is the first clone search engine that can efficiently identify subgraph clones of a given query assembly function in a large assembly code repository. Kam1n0 is built on the Apache Spark computation framework and a Cassandra-like distributed key-value store. A deployed demo system is publicly available. Extensive experimental results suggest that Kam1n0 is accurate, efficient, and scalable for handling large volumes of assembly code.
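
To make the hashing idea in the abstract concrete, the sketch below shows plain random-hyperplane LSH for cosine similarity, used to bucket basic blocks so that a query block is compared only against blocks sharing its signature. It is not the LSH variant proposed in the paper, whose details the abstract does not give. The feature vectors (assumed here to be per-block counts of a few instruction mnemonics), the block names, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random Gaussian hyperplanes in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def build_index(blocks, planes):
    """Map each signature to the ids of the blocks that hash to it."""
    index = defaultdict(list)
    for block_id, vec in blocks.items():
        index[signature(vec, planes)].append(block_id)
    return index


def candidates(query_vec, index, planes):
    """Return the block ids that share the query's bucket."""
    return index.get(signature(query_vec, planes), [])


# Hypothetical feature vectors: counts of (mov, add, cmp, jmp) per block.
blocks = {
    "func_a:block_0": [3.0, 1.0, 0.0, 2.0],
    "func_b:block_4": [0.0, 0.0, 5.0, 1.0],
}
planes = make_hyperplanes(dim=4, bits=8)
index = build_index(blocks, planes)

# A block with the same instruction mix, repeated twice, points in the
# same direction and therefore lands in the same bucket.
matches = candidates([6.0, 2.0, 0.0, 4.0], index, planes)
```

In a full clone search engine, candidate blocks retrieved this way would then be assembled into subgraph matches against the query function's control flow graph; that step is omitted here.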

Published In

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN:9781450342322
DOI:10.1145/2939672
Publication rights licensed to ACM. ACM acknowledges that this contribution was co-authored by an affiliate of the Canadian National Government. As such, the Crown in Right of Canada retains an equal interest in the copyright. Reprint requests should be forwarded to ACM, and reprints must include clear attribution to ACM and National Research Council Canada -NRC.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. assembly clone search
  2. information retrieval
  3. mining software repositories

Qualifiers

  • Research-article

Conference

KDD '16

Acceptance Rates

KDD '16 Paper Acceptance Rate 66 of 1,115 submissions, 6%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

  • (2024) Semantic aware-based instruction embedding for binary code similarity detection. PLOS ONE 19:6 (e0305299). DOI:10.1371/journal.pone.0305299. Online publication date: 11-Jun-2024
  • (2024) Intelligent code search aids edge software development. Journal of Cloud Computing 13:1. DOI:10.1186/s13677-024-00629-5. Online publication date: 1-Apr-2024
  • (2024) Vulnerabilities and Security Patches Detection in OSS: A Survey. ACM Computing Surveys 57:1 (1-37). DOI:10.1145/3694782. Online publication date: 9-Sep-2024
  • (2024) RAPID: Zero-Shot Domain Adaptation for Code Search with Pre-Trained Models. ACM Transactions on Software Engineering and Methodology 33:5 (1-35). DOI:10.1145/3641542. Online publication date: 18-Jan-2024
  • (2024) Cross-Inlining Binary Function Similarity Detection. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). DOI:10.1145/3597503.3639080. Online publication date: 20-May-2024
  • (2024) BinCola: Diversity-Sensitive Contrastive Learning for Binary Code Similarity Detection. IEEE Transactions on Software Engineering 50:10 (2485-2497). DOI:10.1109/TSE.2024.3411072. Online publication date: Oct-2024
  • (2024) CSGraph2Vec: Distributed Graph-Based Representation Learning for Assembly Functions. 2024 IEEE International Conference on Recent Advances in Systems Science and Engineering (RASSE) (1-8). DOI:10.1109/RASSE64357.2024.10773827. Online publication date: 6-Nov-2024
  • (2024) OpTrans: enhancing binary code similarity detection with function inlining re-optimization. Empirical Software Engineering 30:2. DOI:10.1007/s10664-024-10605-x. Online publication date: 26-Dec-2024
  • (2024) Optir-SBERT: Cross-Architecture Binary Code Similarity Detection Based on Optimized LLVM IR. Digital Forensics and Cyber Crime (95-113). DOI:10.1007/978-3-031-56583-0_7. Online publication date: 3-Apr-2024
  • (2023) IoTSim: Internet of Things-Oriented Binary Code Similarity Detection with Multiple Block Relations. Sensors 23:18 (7789). DOI:10.3390/s23187789. Online publication date: 11-Sep-2023
