ABSTRACT
Many approaches have been proposed for measuring program similarity, and both commercial and freeware tools are available that measure it by comparing source code. These tools work well for small to medium-scale software products but are of limited use for large-scale ones. In addition, their similarity measures are less credible when the source code has been obfuscated by malicious users or generated by automatic program template generation tools. Handling large-scale software calls for more drastic measures. In this paper, we propose an automatic abstraction method that summarizes source code by eliminating the large portion of it that is less relevant to similarity comparison. With this abstraction, our similarity comparison method provides measures that are more robust against obfuscation and automatic code generation. We evaluate the abstraction method by running the abstracted code through MOSS, a web-based similarity detection tool. In our experiments with multiple versions of the Apache HTTP server, the Putty SSH client, and the Lighttpd server, the method reports quite reliable results using abstracted source code that is only 23--35% of the size of the original. Because the execution time of pattern matching is linearly proportional to the length of the source code, our method can reduce execution time in proportion to the reduction in source code.
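The abstract does not say which parts of the source are eliminated, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes, purely for the example, that comments, preprocessor lines, and blank lines count as "less relevant", and it uses a simple token k-gram Jaccard score as a stand-in for a real comparison tool such as MOSS. All function names (`abstract_source`, `kgrams`, `jaccard`, `retained_fraction`) are hypothetical.

```python
import re


def abstract_source(code: str) -> str:
    """Toy abstraction of C-like code (an assumption, not the paper's rules).

    Drops block comments, line comments, preprocessor lines, and blank lines,
    and collapses runs of whitespace. Limitation: the regexes do not
    understand string literals, so a "//" inside a string is also cut.
    """
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", "", code)                     # line comments
    kept = []
    for line in code.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and preprocessor directives
        kept.append(" ".join(stripped.split()))
    return "\n".join(kept)


def kgrams(text: str, k: int = 5) -> set:
    """Set of k consecutive tokens (identifiers, numbers, single symbols)."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", text)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity of the two k-gram sets (stand-in for a real tool)."""
    ga, gb = kgrams(a, k), kgrams(b, k)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def retained_fraction(original: str, abstracted: str) -> float:
    """Share of the original text length that survives abstraction."""
    return len(abstracted) / len(original) if original else 0.0


if __name__ == "__main__":
    v1 = "/* v1 */\n#include <stdio.h>\nint add(int a, int b) {\n    // sum\n    return a + b;\n}\n"
    v2 = "/* rewritten header */\n#include <stdlib.h>\n\nint add(int a, int b) {\n    return a + b; // result\n}\n"
    a1, a2 = abstract_source(v1), abstract_source(v2)
    print("raw similarity:       ", jaccard(v1, v2))
    print("abstracted similarity:", jaccard(a1, a2))
    print("retained fraction:    ", retained_fraction(v1, a1))
```

Because the comparison cost in this sketch grows with the number of tokens, comparing the abstracted text does proportionally less work than comparing the raw text, which mirrors the linear-time argument made in the abstract.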
Index Terms
- Detecting source code similarity using code abstraction