Measuring similarity between Karel programs using character and word n-grams

Sidorov, G.; Ibarra Romero, M.; Markov, I.; Guzman-Cabrera, R.; Chanona-Hernández, L.; Velásquez, F.

doi:10.1134/S0361768817010066

Published: 23 February 2017

Programming and Computer Software, Volume 43, pages 47–50 (2017)

Abstract

We present a method for measuring similarity between source code programs. We approach the task from a machine learning perspective, using character and word n-grams as features and comparing several machine learning algorithms. We also examine the contribution of latent semantic analysis to this task. To evaluate the proposed approach, we developed a corpus of around 10,000 programs written in the Karel programming language that solve 100 different tasks. The results show that the highest classification accuracy is achieved with a Support Vector Machines classifier, with latent semantic analysis applied and word trigrams selected as features.
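
As an illustration of the feature types named in the abstract, the following minimal sketch extracts character and word n-grams from program text and compares the resulting count vectors. It is not the authors' pipeline: the whitespace tokenization, the cosine comparison (used here in place of a trained classifier and latent semantic analysis), the function names, and the three Karel-like snippets are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Return a Counter of all overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Return a Counter of overlapping word n-grams (whitespace tokens)."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical Karel-like snippets, invented for illustration only.
prog_a = "iterate ( 4 ) { move ( ) ; turnleft ( ) ; }"
prog_b = "iterate ( 4 ) { move ( ) ; pickbeeper ( ) ; }"
prog_c = "while ( frontIsClear ) { move ( ) ; }"

# Word trigrams: the feature type the abstract reports as most effective.
sim_ab = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_b, 3))
sim_ac = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_c, 3))

# Character trigrams computed over the same pair of programs.
sim_ab_char = cosine(char_ngrams(prog_a, 3), char_ngrams(prog_b, 3))
```

In this sketch, the two snippets that share the same loop structure score higher on word trigrams than the pair built around different control statements. In a classification setting such as the one the abstract describes, the n-gram counts would presumably serve as input features to a learning algorithm rather than being compared directly.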

Author information

Authors and Affiliations

Instituto Politécnico Nacional (IPN), Center for Computing Research (CIC), Mexico City, Mexico
G. Sidorov, M. Ibarra Romero & I. Markov
Engineering Division, University of Guanajuato, Campus Irapuato-Salamanca, Guanajuato, Mexico
R. Guzman-Cabrera
Instituto Politécnico Nacional, School of Mechanical and Electrical Engineering (ESIME), Mexico City, Mexico
L. Chanona-Hernández
Polytechnic University of Queretaro, Queretaro, Mexico
F. Velásquez

Corresponding authors

Correspondence to G. Sidorov, I. Markov, R. Guzman-Cabrera, L. Chanona-Hernández or F. Velásquez.

Additional information

Original Russian Text © G. Sidorov, M. Ibarra Romero, I. Markov, R. Guzman-Cabrera, L. Chanona-Hernández, F. Velásquez, 2017, published in Programmirovanie, 2017, Vol. 43, No. 1.

The article is published in the original.

About this article

Cite this article

Sidorov, G., Ibarra Romero, M., Markov, I. et al. Measuring similarity between Karel programs using character and word n-grams. Program Comput Soft 43, 47–50 (2017). https://doi.org/10.1134/S0361768817010066

Received: 05 August 2016
Published: 23 February 2017
Issue Date: January 2017
DOI: https://doi.org/10.1134/S0361768817010066
