Forensic Authorship Attribution Using Compression Distances to Prototypes

Lambers, Maarten; Veenman, Cor J.

doi:10.1007/978-3-642-03521-0_2

Maarten Lambers &
Cor J. Veenman

Conference paper


Part of the book series: Lecture Notes in Computer Science (LNIP, volume 5718)

Abstract

In several situations authors prefer to hide their identity. In forensic applications, one can think of extortion and threats in emails and forum messages. Such messages can easily be manipulated, so that metadata referring to names and addresses is unreliable at best. In this paper, we propose a method to identify authors of short informal messages based solely on the text content. The method uses compression distances between texts as features. Using these features, a supervised classifier is learned on a training set of known authors. For the experiments, we prepared a dataset from Dutch newsgroup texts. We compared several state-of-the-art methods to our proposed method for the identification of messages from up to 50 authors. Our method clearly outperformed the other methods. In 65% of the cases the author was correctly identified, while in 88% of the cases the true author was in the top 5 of the produced ranked list.
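The idea of using compression distances to a set of texts as features, then ranking candidate authors, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes zlib as a stand-in compressor, the normalized compression distance as the distance measure, training texts reused as prototypes, and a nearest-centroid rule as the supervised classifier. All function names are hypothetical.

```python
import math
import zlib
from collections import defaultdict


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_features(text, prototypes):
    """Feature vector: one compression distance per prototype text."""
    return [ncd(text, p) for p in prototypes]


def fit_centroids(train, prototypes):
    """train is a list of (author, text) pairs.

    Returns a dict mapping each author to the mean feature vector of
    that author's training texts (assumed nearest-centroid classifier).
    """
    grouped = defaultdict(list)
    for author, text in train:
        grouped[author].append(distance_features(text, prototypes))
    return {
        author: [sum(col) / len(vecs) for col in zip(*vecs)]
        for author, vecs in grouped.items()
    }


def rank_authors(text, prototypes, centroids):
    """Candidate authors ordered from most to least likely."""
    feats = distance_features(text, prototypes)
    return sorted(centroids, key=lambda a: math.dist(feats, centroids[a]))


def in_top_k(true_author, ranking, k):
    """True if the true author appears among the first k ranked candidates."""
    return true_author in ranking[:k]
```

Under these assumptions, `rank_authors` produces the kind of ranked list the abstract refers to, and averaging `in_top_k(..., k=1)` and `in_top_k(..., k=5)` over a labeled test set gives the corresponding identification rate and top-5 rate.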



Author information

Authors and Affiliations

Intelligent Systems Lab, University of Amsterdam, Amsterdam, The Netherlands
Cor J. Veenman
Digital Technology & Biometrics Department, Netherlands Forensic Institute, The Hague, The Netherlands
Maarten Lambers & Cor J. Veenman


Editor information

Editors and Affiliations

Netherlands Forensic Institute, The Hague, The Netherlands
Zeno J. M. H. Geradts
Norwegian Information Security Laboratory, Gjøvik University College, Norway
Katrin Y. Franke
Digital Technology & Biometrics Department, Netherlands Forensic Institute, The Hague, The Netherlands
Cor J. Veenman

About this paper

Cite this paper

Lambers, M., Veenman, C.J. (2009). Forensic Authorship Attribution Using Compression Distances to Prototypes. In: Geradts, Z.J.M.H., Franke, K.Y., Veenman, C.J. (eds) Computational Forensics. IWCF 2009. Lecture Notes in Computer Science, vol 5718. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03521-0_2

DOI: https://doi.org/10.1007/978-3-642-03521-0_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03520-3
Online ISBN: 978-3-642-03521-0
eBook Packages: Computer Science, Computer Science (R0)
