N-Gram Feature Selection for Authorship Identification

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4183)

Abstract

Automatic authorship identification is a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful way to represent text for stylistic purposes, since they capture nuances at the lexical, syntactic, and structural levels. So far, character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work on selecting variable-length word sequences. Using a subset of the new Reuters corpus consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few members in common. Moreover, we explore the significance of digits for distinguishing between authors and show that performance can be improved with simple text pre-processing.
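As an illustration of the representation the abstract describes, the following minimal Python sketch counts character n-grams over a range of lengths after a simple digit pre-processing step. It is not the authors' implementation: the function names, the length range 3 to 5, and the placeholder symbol "@" are assumptions made for the example, and the sketch does not include the variable-length selection method or the information gain comparison.

```python
import re
from collections import Counter


def mask_digits(text, symbol="@"):
    """Replace every digit with one placeholder symbol (symbol is an assumption)."""
    return re.sub(r"\d", symbol, text)


def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams whose length n satisfies n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Hypothetical sample sentence, used only to show the calls.
sample = "Profits rose 12% in 1996 and 15% in 1997."
profile = char_ngrams(mask_digits(sample))
```

After masking, different numbers with the same shape (for example two-digit percentages) map to the same n-grams, so the counts reflect how an author formats numbers rather than which numbers appear.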

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Houvardas, J., Stamatatos, E. (2006). N-Gram Feature Selection for Authorship Identification. In: Euzenat, J., Domingue, J. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2006. Lecture Notes in Computer Science, vol 4183. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11861461_10

  • DOI: https://doi.org/10.1007/11861461_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40930-4

  • Online ISBN: 978-3-540-40931-1

  • eBook Packages: Computer Science, Computer Science (R0)
