
Measuring similarity between Karel programs using character and word n-grams

Programming and Computer Software

Abstract

We present a method for measuring the similarity between source code programs. We treat the task as a machine learning problem, using character and word n-grams as features and comparing several machine learning algorithms. We also examine the contribution of latent semantic analysis to this task. To evaluate the proposed approach, we developed a corpus of around 10,000 programs written in the Karel programming language that solve 100 different tasks. The results show that the highest classification accuracy is achieved with a Support Vector Machine classifier, with latent semantic analysis applied and word trigrams selected as features.
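As a rough illustration of the feature types named in the abstract, the following sketch extracts word and character n-grams from short Karel-like programs and compares them with plain cosine similarity. It is not the authors' pipeline: the tokenization rule, the sample programs, and the helper names `word_ngrams`, `char_ngrams`, and `cosine` are assumptions made for this example, and the classifier and latent semantic analysis steps are left out.

```python
import math
import re
from collections import Counter


def word_ngrams(source, n=3):
    """Count word n-grams. Assumed tokenization: identifiers, numbers, single symbols."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(source, n=3):
    """Count character n-grams after collapsing runs of whitespace."""
    text = " ".join(source.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical Karel-like programs, written only for this illustration.
prog_a = "iterate (3) { move(); } turnleft(); turnoff();"
prog_b = "iterate (3) { move(); } turnoff();"
prog_c = "while (frontIsClear) { pickbeeper(); }"

if __name__ == "__main__":
    for name, other in (("b", prog_b), ("c", prog_c)):
        w = cosine(word_ngrams(prog_a), word_ngrams(other))
        c = cosine(char_ngrams(prog_a), char_ngrams(other))
        print(f"a vs {name}: word trigrams {w:.3f}, char trigrams {c:.3f}")
```

In a classification setting such as the one the abstract describes, count vectors of this kind would serve as input features for a learning algorithm rather than being compared directly.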



Author information

Corresponding authors

Correspondence to G. Sidorov, I. Markov, R. Guzman-Cabrera, L. Chanona-Hernández or F. Velásquez.

Additional information

Original Russian Text © G. Sidorov, M. Ibarra Romero, I. Markov, R. Guzman-Cabrera, L. Chanona-Hernández, F. Velásquez, 2017, published in Programmirovanie, 2017, Vol. 43, No. 1.

The article is published in the original.


About this article


Cite this article

Sidorov, G., Ibarra Romero, M., Markov, I. et al. Measuring similarity between Karel programs using character and word n-grams. Program Comput Soft 43, 47–50 (2017). https://doi.org/10.1134/S0361768817010066


  • DOI: https://doi.org/10.1134/S0361768817010066
