DOI: 10.1145/2939672.2939719
research-article

Kam1n0: MapReduce-based Assembly Clone Search for Reverse Engineering

Published: 13 August 2016

ABSTRACT

Assembly code analysis is one of the critical processes for detecting and proving software plagiarism and software patent infringement when the source code is unavailable. It is also common practice for discovering exploits and vulnerabilities in existing software. However, it is a manually intensive and time-consuming process, even for experienced reverse engineers. An effective and efficient assembly code clone search engine can greatly reduce this effort, since it can identify cloned parts that have already been analyzed. The assembly code clone search problem belongs to the field of software engineering, yet it depends strongly on practical nearest neighbor search techniques from data mining and databases. By closely collaborating with reverse engineers and Defence Research and Development Canada (DRDC), we study, from a data mining perspective, the concerns and challenges that make existing assembly code clone search approaches impractical. We propose a new variant of the LSH (locality-sensitive hashing) scheme and combine it with graph matching to address these challenges. We implement an integrated assembly clone search engine called Kam1n0. It is the first clone search engine that can efficiently identify subgraph clones of a given query assembly function in a large assembly code repository. Kam1n0 is built upon the Apache Spark computation framework and Cassandra-like distributed key-value storage. A deployed demo system is publicly available. Extensive experimental results suggest that Kam1n0 is accurate, efficient, and scalable for handling a large volume of assembly code.
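The abstract does not describe the proposed LSH variant, so the sketch below is background only: a generic sign-random-projection hash for cosine similarity, which is not Kam1n0's scheme. It shows how vectors pointing in similar directions can land in the same bucket, so that a lookup inspects one bucket instead of the whole repository. The names `make_hyperplanes`, `signature`, and `LshIndex`, and the idea of representing a piece of assembly code as a plain feature vector, are assumptions made for illustration.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


class LshIndex:
    """Single-table LSH index (hypothetical helper, not the paper's design)."""

    def __init__(self, num_bits, dim, seed=0):
        self.hyperplanes = make_hyperplanes(num_bits, dim, seed)
        self.buckets = defaultdict(list)

    def add(self, key, vec):
        # Store the key under the bit signature of its feature vector.
        self.buckets[signature(vec, self.hyperplanes)].append(key)

    def query(self, vec):
        # Return only the keys that share the query's bucket.
        return list(self.buckets.get(signature(vec, self.hyperplanes), []))


index = LshIndex(num_bits=8, dim=3, seed=42)
index.add("f1", [1.0, 2.0, 3.0])
# A positively scaled copy has the same direction, hence the same signature.
candidates = index.query([2.0, 4.0, 6.0])
```

A single table like this trades recall for speed; practical systems typically use several tables or probe neighboring buckets, and would then verify candidates with an exact comparison such as the graph matching step the abstract mentions.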


Published in

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN: 9781450342322
DOI: 10.1145/2939672

Copyright © 2016 ACM

Publication rights licensed to ACM. ACM acknowledges that this contribution was co-authored by an affiliate of the Canadian National Government. As such, the Crown in Right of Canada retains an equal interest in the copyright. Reprint requests should be forwarded to ACM, and reprints must include clear attribution to ACM and National Research Council Canada -NRC.

Publisher

Association for Computing Machinery

New York, NY, United States


Acceptance Rates

KDD '16 Paper Acceptance Rate: 66 of 1,115 submissions, 6%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
