Forensic Authorship Attribution Using Compression Distances to Prototypes

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 5718)

Abstract

In some situations, authors prefer to hide their identity. In forensic applications, examples include extortion and threats sent by email or posted in forum messages. Such messages are easily manipulated, so metadata referring to names and addresses is unreliable at best. In this paper, we propose a method to identify the authors of short informal messages based solely on the text content. The method uses compression distances between texts as features. Using these features, a supervised classifier is trained on a set of messages from known authors. For the experiments, we prepared a dataset from Dutch newsgroup texts. We compared several state-of-the-art methods with our proposed method on the identification of messages from up to 50 authors. Our method clearly outperformed the other methods: in 65% of the cases the author was correctly identified, and in 88% of the cases the true author was in the top 5 of the produced ranked list.
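
The idea of using compression distances to prototype texts as features can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the distance is taken to be the normalized compression distance (NCD), the compressor is zlib, and the classifier is a simple nearest-mean rule. The function names (`ncd`, `features`, `rank_authors`) are hypothetical and chosen only for illustration; the paper's actual distance, compressor, prototype selection, and classifier may differ.

```python
import math
import zlib
from collections import defaultdict


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts (assumed distance)."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def features(text: str, prototypes: list[str]) -> list[float]:
    """Represent a text by its distances to a fixed list of prototype texts."""
    return [ncd(text, p) for p in prototypes]


def rank_authors(text: str, prototypes: list[str],
                 training: list[tuple[str, str]]) -> list[str]:
    """Rank authors by closeness of the query's feature vector to each
    author's mean feature vector (nearest-mean rule, an assumption)."""
    sums = defaultdict(lambda: [0.0] * len(prototypes))
    counts = defaultdict(int)
    for author, sample in training:
        vec = features(sample, prototypes)
        sums[author] = [s + v for s, v in zip(sums[author], vec)]
        counts[author] += 1

    query = features(text, prototypes)

    def distance_to_mean(author: str) -> float:
        mean = [s / counts[author] for s in sums[author]]
        return math.dist(query, mean)

    # Most likely author first.
    return sorted(sums, key=distance_to_mean)
```

With a ranked list like this, a top-5 hit of the kind reported in the abstract corresponds to checking `true_author in rank_authors(message, prototypes, training)[:5]`.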

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lambers, M., Veenman, C.J. (2009). Forensic Authorship Attribution Using Compression Distances to Prototypes. In: Geradts, Z.J.M.H., Franke, K.Y., Veenman, C.J. (eds) Computational Forensics. IWCF 2009. Lecture Notes in Computer Science, vol 5718. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03521-0_2

  • DOI: https://doi.org/10.1007/978-3-642-03521-0_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03520-3

  • Online ISBN: 978-3-642-03521-0

  • eBook Packages: Computer Science, Computer Science (R0)
