Detecting source code similarity using code abstraction

Authors:
Seongsoo Park, Sungkyunkwan University, Suwon, Korea
Seungcheol Ko, Sungkyunkwan University, Suwon, Korea
Jungsik Choi, Sungkyunkwan University, Suwon, Korea
Hwansoo Han, Sungkyunkwan University, Suwon, Korea
Seong-Je Cho, Dankook University, Yongin, Korea
Jongmoo Choi, Dankook University, Yongin, Korea

ICUIMC '13: Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication, January 2013, Article No.: 74, Pages 1–9
https://doi.org/10.1145/2448556.2448630

Published: 17 January 2013

ABSTRACT

Various approaches have been proposed for measuring program similarity, and both commercial and freeware tools are available that measure it by comparing source code. These tools handle small to medium-scale software products well, but they are of limited use for large-scale ones. In addition, they may report less credible similarity measures for source code that has been obfuscated by malicious users or generated by automatic program template generation tools. Handling large-scale software calls for more drastic measures. In this paper, we propose an automatic abstraction method that summarizes source code by eliminating a large portion of the code that is less relevant to similarity comparison. With this abstraction, our similarity comparison method provides measures that are more robust against obfuscation and automatic code generation. We evaluate our abstraction method by running the abstracted code through MOSS, a web-based similarity detection tool. In our experiments with multiple versions of Apache HTTP server, Putty SSH client, and Lighttpd server, the abstraction method reports quite reliable results with abstracted source code that is only 23-35% of the original source code. Because the execution time for pattern matching is linearly proportional to the length of the source code, our method can reduce the execution time in proportion to the reduction in source code.
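
The linear-time argument in the abstract can be made concrete with a small calculation. The sketch below assumes matching time is exactly proportional to source length, which is the model the abstract states; the function names `relative_match_time` and `time_saved_percent` are hypothetical and introduced only for illustration, and the figures are the size fractions quoted in the abstract, not measured timings.

```python
def relative_match_time(abstracted_fraction: float) -> float:
    """Return matching time on abstracted code relative to the original.

    Assumes time = c * length (the linear model stated in the abstract),
    so the constant c cancels and the ratio equals the size fraction.
    """
    if not 0.0 < abstracted_fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return abstracted_fraction


def time_saved_percent(abstracted_fraction: float) -> float:
    """Percentage of matching time saved under the same linear assumption."""
    return (1.0 - relative_match_time(abstracted_fraction)) * 100.0


if __name__ == "__main__":
    # The abstract reports abstracted code at 23-35% of the original size.
    for fraction in (0.23, 0.35):
        print(f"{fraction:.0%} of original size -> "
              f"about {time_saved_percent(fraction):.0f}% less matching time")
```

Under this model, the reported 23-35% size range corresponds to roughly 65-77% less pattern-matching time, not counting the cost of the abstraction step itself.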

Index Terms

Detecting source code similarity using code abstraction
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Preprocessors

Published in

ICUIMC '13: Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication
January 2013
772 pages
ISBN:9781450319584
DOI:10.1145/2448556

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
large scale software
similarity
source code abstraction
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 251 of 941 submissions, 27%