DOI: 10.1145/2448556.2448630
research-article

Detecting source code similarity using code abstraction

Published: 17 January 2013

ABSTRACT

Various approaches have been proposed for measuring program similarity, and both commercial and freeware tools are available that measure it by comparing source code. These tools are useful for small to medium-scale software products but are of limited use for large-scale ones. In addition, they may report less credible similarity measures for source code that has been obfuscated by malicious users or generated by automatic program template generation tools. Handling large-scale software calls for more drastic measures. In this paper, we propose an automatic abstraction method that summarizes source code by eliminating the large portion of it that is less relevant to similarity comparison. With this abstraction, similarity comparison becomes more robust to obfuscation and automatic code generation. We evaluate the abstraction method by running its output through MOSS, a web-based similarity detection tool. In our experiments with multiple versions of the Apache HTTP server, the Putty SSH client, and the Lighttpd server, the method reports quite reliable results with abstracted source code that is only 23-35% of the size of the original. Because the execution time of pattern matching is linearly proportional to the length of the source code, our method can reduce execution time in proportion to the reduction in source code size.
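The abstract does not say which parts of the source code are kept, so the following is only a minimal sketch of the general idea, not the authors' method and not how MOSS works internally. It assumes a hypothetical abstraction rule for C code (keep control-flow keywords and the names of functions followed by a parenthesis, drop everything else) and a simple Jaccard score over k-grams of the abstracted sequence. The names `abstract`, `kgrams`, `similarity`, `ORIGINAL`, and `RENAMED` are illustrative only.

```python
import re

# Assumption: these keywords are "relevant" to similarity; the paper's actual
# selection criteria are not given in the abstract.
CONTROL = {"if", "else", "for", "while", "do", "switch", "case",
           "return", "break", "continue", "goto"}


def abstract(src):
    """Reduce C-like source to a sequence of control keywords and call names."""
    src = re.sub(r"/\*.*?\*/", " ", src, flags=re.DOTALL)  # block comments
    src = re.sub(r"//[^\n]*", " ", src)                     # line comments
    src = re.sub(r'"(?:\\.|[^"\\])*"', '""', src)           # string literals
    out = []
    for m in re.finditer(r"\b([A-Za-z_]\w*)(\s*\()?", src):
        word, followed_by_paren = m.group(1), m.group(2) is not None
        if word in CONTROL:
            out.append(word)
        elif followed_by_paren:
            # Function definitions and calls both land here in this sketch.
            out.append("call:" + word)
    return out


def kgrams(tokens, k=3):
    """Set of all length-k windows over the token sequence."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of the two k-gram sets (0.0 if either is empty)."""
    ga, gb = kgrams(a, k), kgrams(b, k)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


ORIGINAL = """
int sum(int *a, int n) {
    int s = 0; /* accumulator */
    for (int i = 0; i < n; i++) {
        if (a[i] > 0) s += a[i];
    }
    log_value(s);
    return s;
}
"""

# Same logic with renamed locals and a different comment.
RENAMED = """
int sum(int *v, int len) {
    int acc = 0;
    // add positive entries only
    for (int k = 0; k < len; k++) {
        if (v[k] > 0) acc += v[k];
    }
    log_value(acc);
    return acc;
}
"""

if __name__ == "__main__":
    print(abstract(ORIGINAL))
    print(similarity(abstract(ORIGINAL), abstract(RENAMED)))
```

Because variable names and comments are discarded before comparison, the two snippets abstract to the same sequence. The abstracted sequence is also much shorter than the original text, which is the property the abstract relies on when it ties matching time to source length.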

      • Published in

        ICUIMC '13: Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication
        January 2013
        772 pages
        ISBN:9781450319584
        DOI:10.1145/2448556

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 251 of 941 submissions, 27%
