Relevance of Named Entities in Authorship Attribution

Ríos-Toledo, Germán; Sidorov, Grigori; Castro-Sánchez, Noé Alejandro; Nava-Zea, Alondra; Chanona-Hernández, Liliana

doi:10.1007/978-3-319-62434-1_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10061)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

Named entities (NE) are words that refer to names of people, locations, organizations, etc. NE are present in every kind of document: e-mails, letters, essays, novels, poems. Automatic detection of these words is a very important task in natural language processing. Sometimes NE are used in authorship attribution studies as a stylometric feature. The goal of this paper is to evaluate the effect of the presence of NE in texts on the authorship attribution task: are we really detecting the style of an author, or are we just discovering the appearance of the same NE? We used a corpus that consists of 91 novels by 7 authors of the XVIII century. These authors spoke and wrote English, their native language. All novels belong to the fiction genre. The stylometric features used were character n-grams, word n-grams and n-grams of POS tags of various sizes (2-grams, 3-grams, etc.). Five novels were selected for each author; these novels contain between 4% and 7% NE. All novels were divided into blocks of 10,000 terms each. Two kinds of experiments were conducted: automatic classification of blocks containing NE and of the same blocks without NE. In some cases, we used only the most frequent n-grams (500, 2,000 and 4,000 n-grams). Three machine learning algorithms were used for the classification task: NB, SVM (SMO) and J48. The results show that, as a tendency, the presence of NE helps classification (improvements from 5% to 20%), but there are specific authors for whom NE do not help and even make the classification worse (about 10% of the experimental data).
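
The feature-preparation steps described in the abstract (fixed-size blocks, the same blocks with NE removed, n-gram extraction, and restriction to the most frequent n-grams) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the decision to drop a short trailing block, and the assumption that NE are supplied as a ready-made set of tokens are all hypothetical choices made for illustration. NE detection, POS tagging, and the classifiers (NB, SVM, J48) are not shown.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Sequence, Set, Tuple


def split_into_blocks(tokens: Sequence[str], block_size: int = 10_000) -> List[List[str]]:
    """Cut a token sequence into consecutive blocks of block_size terms.

    Assumption: a trailing block shorter than block_size is dropped.
    """
    n_full = len(tokens) // block_size
    return [list(tokens[i * block_size:(i + 1) * block_size]) for i in range(n_full)]


def remove_named_entities(tokens: Iterable[str], named_entities: Set[str]) -> List[str]:
    """Return the tokens with every NE token filtered out.

    Assumption: named_entities comes from some external NE recognizer.
    """
    return [t for t in tokens if t not in named_entities]


def word_ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous word n-grams of a block."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def most_frequent(ngrams_per_block: Iterable[Iterable[Hashable]], k: int) -> List[Hashable]:
    """The k most frequent n-grams over all blocks (e.g. k = 500, 2000, 4000)."""
    counts: Counter = Counter()
    for ngrams in ngrams_per_block:
        counts.update(ngrams)
    return [gram for gram, _ in counts.most_common(k)]


def block_vector(ngrams: Iterable[Hashable], vocabulary: Sequence[Hashable]) -> List[int]:
    """Frequency vector of one block over a fixed n-gram vocabulary."""
    counts = Counter(ngrams)
    return [counts[gram] for gram in vocabulary]


# Toy example with a block size of 5 instead of 10,000.
tokens = "Emma saw Harriet and Emma saw London and Emma smiled".split()
ne = {"Emma", "Harriet", "London"}  # hypothetical NE list

blocks_with_ne = split_into_blocks(tokens, block_size=5)
blocks_without_ne = [remove_named_entities(b, ne) for b in blocks_with_ne]

vocab_with = most_frequent([word_ngrams(b, 2) for b in blocks_with_ne], k=2)
vocab_without = most_frequent([word_ngrams(b, 2) for b in blocks_without_ne], k=2)
```

Each block then yields two feature vectors (one per condition) via `block_vector`, and these can be passed to any standard classifier to compare accuracy with and without NE.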

Acknowledgements

We gratefully acknowledge the support of the Mexican government (CONACYT project 240844, SNI, SIP IPN projects 20161947, 20161958).

Author information

Authors and Affiliations

Centro Nacional de Investigación y Desarrollo Tecnológico, Cuernavaca, Morelos, Mexico
Germán Ríos-Toledo, Noé Alejandro Castro-Sánchez & Alondra Nava-Zea
Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico City, Mexico
Grigori Sidorov
Instituto Politécnico Nacional, ESIMEZ, Mexico City, Mexico
Liliana Chanona-Hernández

Corresponding author

Correspondence to Germán Ríos-Toledo.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Centro de Investigación en Computación, Mexico City, Mexico
Grigori Sidorov
Universidad Autónoma Metropolitana, Mexico City, Mexico
Oscar Herrera-Alcántara

About this paper

Cite this paper

Ríos-Toledo, G., Sidorov, G., Castro-Sánchez, N.A., Nava-Zea, A., Chanona-Hernández, L. (2017). Relevance of Named Entities in Authorship Attribution. In: Sidorov, G., Herrera-Alcántara, O. (eds) Advances in Computational Intelligence. MICAI 2016. Lecture Notes in Computer Science, vol 10061. Springer, Cham. https://doi.org/10.1007/978-3-319-62434-1_1

DOI: https://doi.org/10.1007/978-3-319-62434-1_1
Published: 03 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62433-4
Online ISBN: 978-3-319-62434-1
eBook Packages: Computer Science, Computer Science (R0)
