N-Gram Feature Selection for Authorship Identification

Houvardas, John; Stamatatos, Efstathios

doi:10.1007/11861461_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4183)

Abstract

Automatic authorship identification offers a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful way to represent text for stylistic purposes, since they capture nuances at the lexical, syntactic, and structural levels. So far, character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work on selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few members in common. Moreover, we explore the significance of digits for distinguishing between authors, showing that simple text pre-processing can increase performance.
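
To make the ingredients named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It counts character n-grams over a range of lengths, scores a candidate n-gram by information gain against author labels, and applies one possible digit pre-processing step. The function names, the n-gram length range, the binary "n-gram occurs in document" feature, and the `@` placeholder for digits are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import re
from collections import Counter


def mask_digits(text, placeholder="@"):
    """Replace every digit with one placeholder symbol.

    Assumption: the abstract only says 'simple text pre-processing';
    masking digits this way is one plausible reading, not a stated detail.
    """
    return re.sub(r"\d", placeholder, text)


def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams with n_min <= n <= n_max (assumed range)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(docs, labels, ngram):
    """Information gain of the binary feature 'ngram occurs in the document'."""
    present = [lab for doc, lab in zip(docs, labels) if ngram in doc]
    absent = [lab for doc, lab in zip(docs, labels) if ngram not in doc]
    total = len(labels)
    conditional = sum(len(part) / total * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional


# Toy usage: two hypothetical authors, two short texts each.
docs = ["aaa x", "aaa y", "bbb x", "bbb y"]
labels = ["A", "A", "B", "B"]
ranked = sorted(["aaa", "x"],
                key=lambda g: information_gain(docs, labels, g),
                reverse=True)
```

In this toy data the n-gram `aaa` separates the two authors perfectly while `x` carries no information about authorship, so `aaa` ranks first. A real experiment would score every n-gram in the training vocabulary and keep the top-ranked ones as features.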

Author information

Authors and Affiliations

Dept. of Information and Communication Systems Eng., University of the Aegean, 83200, Karlovassi, Greece
John Houvardas & Efstathios Stamatatos

Editor information

Editors and Affiliations

INRIA Rhône-Alpes & LIG, 655 Avenue de l’Europe, 38330, Montbonnot Saint-Martin, France
Jérôme Euzenat
Knowledge Media Institute, The Open University, Walton Hall, MK6 7AA, Milton Keynes, United Kingdom
John Domingue

Cite this paper

Houvardas, J., Stamatatos, E. (2006). N-Gram Feature Selection for Authorship Identification. In: Euzenat, J., Domingue, J. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2006. Lecture Notes in Computer Science, vol 4183. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11861461_10

DOI: https://doi.org/10.1007/11861461_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40930-4
Online ISBN: 978-3-540-40931-1
eBook Packages: Computer Science, Computer Science (R0)