Relevance of Named Entities in Authorship Attribution

  • Conference paper
Advances in Computational Intelligence (MICAI 2016)

Abstract

Named entities (NE) are words that refer to names of people, locations, organizations, etc. NE are present in every kind of document: e-mails, letters, essays, novels, poems. Automatic detection of these words is a very important task in natural language processing. NE are sometimes used in authorship attribution studies as a stylometric feature. The goal of this paper is to evaluate the effect of the presence of NE in texts on the authorship attribution task: are we really detecting the style of an author, or are we just detecting the recurrence of the same NE? We used a corpus of 91 novels by 7 authors of the XVIII century. These authors spoke and wrote English, their native language. All novels belong to the fiction genre. The stylometric features used were character n-grams, word n-grams, and n-grams of POS tags of various sizes (2-grams, 3-grams, etc.). Five novels were selected for each author; these novels contain between 4% and 7% NE. All novels were divided into blocks of 10,000 terms each. Two kinds of experiments were conducted: automatic classification of blocks containing NE and of the same blocks without NE. In some cases, we used only the most frequent n-grams (500, 2,000, and 4,000 n-grams). Three machine learning algorithms were used for the classification task: NB, SVM (SMO), and J48. The results show that, as a tendency, the presence of NE helps classification (improvements from 5% to 20%), but for specific authors NE do not help and even make the classification worse (about 10% of the experimental data).
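The preprocessing steps named in the abstract (fixed-size blocks, blocks with and without NE, n-gram features, and a cutoff on the most frequent n-grams) can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function name is hypothetical, the set of NE tokens is assumed to come from some external NE recognizer, and dropping a short final block is an assumption, since the abstract does not say how leftover terms were handled.

```python
from collections import Counter


def split_into_blocks(tokens, block_size=10_000):
    """Cut a token list into consecutive blocks of exactly block_size tokens.

    Assumption: a trailing block shorter than block_size is discarded.
    """
    last_start = len(tokens) - block_size + 1
    return [tokens[i:i + block_size] for i in range(0, last_start, block_size)]


def remove_named_entities(tokens, named_entities):
    """Return the tokens that are not in the given set of NE tokens.

    Assumption: named_entities was produced beforehand by an NE recognizer.
    """
    return [t for t in tokens if t not in named_entities]


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(tokens, n):
    """Return all overlapping word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def most_frequent(ngrams_per_block, k):
    """Return the k n-grams with the highest total count over all blocks."""
    totals = Counter()
    for grams in ngrams_per_block:
        totals.update(grams)
    return [gram for gram, _ in totals.most_common(k)]


def to_vector(grams, vocabulary):
    """Count how often each vocabulary n-gram occurs in one block."""
    counts = Counter(grams)
    return [counts[gram] for gram in vocabulary]
```

Under these assumptions, each block would be turned into two feature vectors, one built from the original tokens and one from the output of `remove_named_entities`, and both sets of vectors would then be passed to the same classifier so that the results can be compared.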

Acknowledgements

We thank the Mexican government for its support (CONACYT project 240844, SNI, SIP IPN projects 20161947, 20161958).

Corresponding author

Correspondence to Germán Ríos-Toledo.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Ríos-Toledo, G., Sidorov, G., Castro-Sánchez, N.A., Nava-Zea, A., Chanona-Hernández, L. (2017). Relevance of Named Entities in Authorship Attribution. In: Sidorov, G., Herrera-Alcántara, O. (eds) Advances in Computational Intelligence. MICAI 2016. Lecture Notes in Computer Science, vol 10061. Springer, Cham. https://doi.org/10.1007/978-3-319-62434-1_1

  • DOI: https://doi.org/10.1007/978-3-319-62434-1_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-62433-4

  • Online ISBN: 978-3-319-62434-1

  • eBook Packages: Computer Science (R0)
