ABSTRACT
Assembly code analysis is one of the critical processes for detecting and proving software plagiarism and software patent infringement when the source code is unavailable. It is also common practice for discovering exploits and vulnerabilities in existing software. However, it is a manually intensive and time-consuming process, even for experienced reverse engineers. An effective and efficient assembly code clone search engine can greatly reduce the effort involved, since it can identify cloned parts that have already been analyzed. The assembly code clone search problem belongs to the field of software engineering, but it strongly depends on practical nearest neighbor search techniques from data mining and databases. By closely collaborating with reverse engineers and Defence Research and Development Canada (DRDC), we study, from the perspective of data mining, the concerns and challenges that make existing assembly code clone search approaches impractical. We propose a new variant of the locality-sensitive hashing (LSH) scheme and combine it with graph matching to address these challenges. We implement an integrated assembly clone search engine called Kam1n0. It is the first clone search engine that can efficiently identify the subgraph clones of a given query assembly function in a large assembly code repository. Kam1n0 is built upon the Apache Spark computation framework and Cassandra-like key-value distributed storage. A deployed demo system is publicly available. Extensive experimental results suggest that Kam1n0 is accurate, efficient, and scalable for handling large volumes of assembly code.
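To make the role of nearest neighbor search concrete, the following is a minimal sketch of textbook random-hyperplane LSH (sign of random projections), not the new LSH variant proposed in the paper. It assumes each assembly code block has already been turned into a numeric feature vector; the feature choice (counts of four mnemonics), the function names, and the parameters `num_bits` and `seed` are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """Hash a vector to a bit string: one bit per hyperplane, set by the sign of the dot product."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def build_index(vectors, hyperplanes):
    """Group block names into buckets keyed by their LSH signature."""
    index = defaultdict(list)
    for name, vec in vectors.items():
        index[signature(vec, hyperplanes)].append(name)
    return index


def query(index, vector, hyperplanes):
    """Return the names of indexed blocks that fall into the query's bucket."""
    return index.get(signature(vector, hyperplanes), [])


if __name__ == "__main__":
    # Hypothetical features: counts of (mov, push, call, add) in each block.
    blocks = {
        "block_a": [3.0, 1.0, 0.0, 2.0],
        "block_b": [0.0, 5.0, 1.0, 0.0],
    }
    planes = make_hyperplanes(num_bits=8, dim=4, seed=1)
    index = build_index(blocks, planes)
    # A block with the same instruction mix, repeated twice, lands in block_a's bucket.
    print(query(index, [6.0, 2.0, 0.0, 4.0], planes))
```

Because the signature depends only on the direction of the vector, blocks with the same instruction mix collide regardless of length, and only the candidates in the matching bucket need to be compared in detail. A full system would use several hash tables and a follow-up verification step; this sketch leaves both out.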
Index Terms
- Kam1n0: MapReduce-based Assembly Clone Search for Reverse Engineering