Abstract
Authors sometimes prefer to hide their identity. In forensic applications, examples include extortion and threats sent by email or posted in forum messages. Such messages are easily manipulated, so metadata referring to names and addresses is unreliable at best. In this paper, we propose a method to identify the authors of short informal messages based solely on the text content. The method uses compression distances between texts as features. A supervised classifier is then trained on these features using a training set of texts from known authors. For the experiments, we prepared a dataset from Dutch newsgroup texts. We compared our proposed method with several state-of-the-art methods for identifying messages from up to 50 authors. Our method clearly outperformed the other methods. In 65% of the cases the author was correctly identified, and in 88% of the cases the true author was in the top 5 of the produced ranked list.
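The idea of using compression distances to known texts as features can be illustrated with a minimal sketch. It assumes the standard normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with zlib standing in as the compressor C. The abstract does not specify the compressor, the choice of prototype texts, or the classifier, so the function names, the prototype dictionary, and the mean-distance ranking below are all illustrative assumptions, not the authors' implementation. In particular, the ranking step is a simple stand-in for the trained supervised classifier described in the abstract.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed text (zlib is an assumed compressor)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts (standard definition)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_features(text: str, prototypes: list[str]) -> list[float]:
    """Feature vector: one compression distance per prototype text."""
    return [ncd(text, p) for p in prototypes]


def rank_authors(text: str, prototypes_by_author: dict[str, list[str]]) -> list[str]:
    """Rank candidate authors by mean distance to their prototype texts.

    Hypothetical stand-in for a trained classifier: the closest author comes first.
    """
    scores = {
        author: sum(distance_features(text, protos)) / len(protos)
        for author, protos in prototypes_by_author.items()
    }
    return sorted(scores, key=scores.get)


# Toy data, invented for illustration only.
text_a = "the quick brown fox jumps over the lazy dog near the river bank today"
text_b = "zxcvbnm 1234567890 qwertyuiop asdfghjkl unrelated symbols !!! ???"
prototypes_by_author = {"author_a": [text_a], "author_b": [text_b]}

features = distance_features(text_a, [text_a, text_b])
ranking = rank_authors(text_a, prototypes_by_author)
```

In a fuller pipeline, the feature vectors from `distance_features` would be passed to a supervised classifier, and a top-5 result of the kind reported above would correspond to checking whether the true author appears among the first five entries of the ranked list.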
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lambers, M., Veenman, C.J. (2009). Forensic Authorship Attribution Using Compression Distances to Prototypes. In: Geradts, Z.J.M.H., Franke, K.Y., Veenman, C.J. (eds) Computational Forensics. IWCF 2009. Lecture Notes in Computer Science, vol 5718. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03521-0_2
DOI: https://doi.org/10.1007/978-3-642-03521-0_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03520-3
Online ISBN: 978-3-642-03521-0
eBook Packages: Computer Science, Computer Science (R0)