Supporting the Cybercrime Investigation Process: Effective Discrimination of Source Code Authors Based on Byte-Level Information

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 3)

Abstract

Source code authorship analysis is the field that attempts to identify the author of a computer program by treating each program as a linguistically analyzable entity. The identification is usually based on other undisputed program samples from the same author. Such a method can be of major benefit in several cases, such as tracing the source of code left in a system after a cyber attack, resolving authorship disputes, and proving authorship in court. In this paper, we present an approach based on byte-level n-gram profiles, which extends a method that has been successfully applied to authorship attribution of natural language text. We propose a simplified profile and a new similarity measure that is less complicated than the algorithm used in text authorship attribution and seems more suitable for source code identification, since it is better able to deal with very small training sets. Experiments were performed on two data sets, one with programs written in C++ and the other with programs written in Java. Unlike the traditional language-dependent metrics used in previous studies, our approach can be applied to any programming language at no additional cost. The accuracy rates presented are much better than the best previously reported results for the same data sets.
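
The abstract does not spell out how a profile is built or how two profiles are compared. The following is a minimal sketch only, assuming that a profile is the set of the most frequent byte-level n-grams in an author's code and that similarity is the number of n-grams two profiles share. The function names, the n-gram length `n`, and the profile size `size` are illustrative assumptions, not values taken from the paper.

```python
from collections import Counter


def byte_ngram_profile(data: bytes, n: int = 3, size: int = 1500) -> set[bytes]:
    """Return the `size` most frequent byte-level n-grams of `data` as a set.

    Assumption: frequencies are used only to select n-grams and are then
    discarded, which is one way to read "simplified profile".
    """
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def profile_similarity(a: set[bytes], b: set[bytes]) -> int:
    """Assumed similarity measure: the number of n-grams shared by two profiles."""
    return len(a & b)


def attribute(unknown: bytes, training: dict[str, bytes],
              n: int = 3, size: int = 1500) -> str:
    """Return the candidate author whose profile is most similar to `unknown`.

    `training` maps each author name to the concatenated bytes of that
    author's undisputed programs.
    """
    target = byte_ngram_profile(unknown, n, size)
    return max(
        training,
        key=lambda author: profile_similarity(
            target, byte_ngram_profile(training[author], n, size)
        ),
    )


if __name__ == "__main__":
    # Hypothetical toy samples; real training data would be whole source files.
    samples = {
        "alice": b"for (int i = 0; i < n; ++i) { total += values[i]; }\n",
        "bob": b"for(int k=0;k<n;k++)\n{\n\tsum=sum+arr[k];\n}\n",
    }
    disputed = b"for (int j = 0; j < m; ++j) { count += items[j]; }\n"
    print(attribute(disputed, samples))
```

Because the sketch operates on raw bytes, it needs no tokenizer or parser, which illustrates why a byte-level approach carries over to any programming language without extra work. Layout habits such as spacing and brace placement end up in the n-grams alongside identifiers and keywords.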

Editor information

Joaquim Filipe, Helder Coelhas, Monica Saramago

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Frantzeskou, G., Stamatatos, E., Gritzalis, S. (2007). Supporting the Cybercrime Investigation Process: Effective Discrimination of Source Code Authors Based on Byte-Level Information. In: Filipe, J., Coelhas, H., Saramago, M. (eds) E-business and Telecommunication Networks. ICETE 2005. Communications in Computer and Information Science, vol 3. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75993-5_14

  • DOI: https://doi.org/10.1007/978-3-540-75993-5_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-75992-8

  • Online ISBN: 978-3-540-75993-5

  • eBook Packages: Computer Science, Computer Science (R0)
