Rhythmic and Psycholinguistic Features for Authorship Tasks in the Spanish Parliament: Evaluation and Analysis

Corbara, Silvia; Chulvi, Berta; Rosso, Paolo; Moreo, Alejandro

doi:10.1007/978-3-031-13643-6_6

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13390)

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages

Abstract

Among the many tasks of the authorship field, Authorship Identification aims at uncovering the author of a document, while Author Profiling focuses on the analysis of personal characteristics of the author(s), such as gender, age, etc. Methods devised for such tasks typically focus on the style of the writing, and are expected not to make inferences grounded on the topics that certain authors tend to write about. In this paper, we present a series of experiments evaluating the use of topic-agnostic feature sets for Authorship Identification and Author Profiling tasks in Spanish political language. In particular, we propose to employ features based on rhythmic and psycholinguistic patterns, obtained via different text-masking approaches that we use to actively mask the underlying topic. We feed these feature sets to an SVM learner, and show that they lead to results that are comparable to those obtained by a BETO transformer when the latter is trained on the original text, i.e., potentially learning from topical information. Moreover, we further investigate the results for the different authors, showing that variations in performance are partially explainable in terms of the authors’ political affiliation and communication style.

Notes

1. https://github.com/silvia-cor/Topic-agnostic_ParlaMintES.
2. https://pan.webis.de/.
3. Precisely, the PAN2021 event presented a particular case of AV where the dataset contained pairs of documents, and the aim was to infer whether the two documents shared the same author; we call this task Same-Authorship Verification (SAV).
4. https://www.clarin.si/repository/xmlui/handle/11356/1431.
5. Regionalist parties aim for more political power for regional entities.
6. Note that we use the decade of birth as representation of age group. We assign the closest decade label to each author’s birth; for example, an author born in 1984 is assigned the label ‘1980’, while an author born in 1987 is assigned the label ‘1990’.
7. https://www.nltk.org/.
8. https://github.com/linhd-postdata/rantanplan.
9. We employ the Spanish version of the dictionary, which is based on LIWC2007.
10. We use the following categories: (i) Yo, Nosotro, TuUtd, ElElla, VosUtds, Ellos, Pasado, Present, Futuro, Subjuntiv, Negacio, Cuantif, Numeros, verbYO, verbTU, verbNOS, verbVos, verbosEL, verbELLOS, formal, informal; (ii) MecCog, Insight, Causa, Discrep, Asentir, Tentat, Certeza, Inhib, Incl, Excl, Percept, Ver, Oir, Sentir, NoFluen, Relleno, Ingerir, Relativ, Movim; (iii) Maldec, Afect, EmoPos, EmoNeg, Ansiedad, Enfado, Triste, Placer. We avoid employing categories that would repeat information already captured by the POS tags, or topic-related categories (e.g., Dinero, Familia).
11. Formally, LIWC can be seen as a map \(m : w \rightarrow C\), where \(w\) is a word token and \(C \subset \mathcal{C}\) is a subset of the psycholinguistic categories \(\mathcal{C}\). Given a macro-category \(M \subset \mathcal{C}\), we replace each word \(w\) in a document by the categories \(m(w) \cap M\). If \(|m(w) \cap M| > 1\), then a new token is created which consists of a concatenation of the category names (following a consistent ordering). If \(m(w) \cap M = \emptyset\), then \(w\) is replaced with the ‘w’ symbol. (Note that some entries in LIWC have the suffix truncated and replaced with an asterisk ‘*’, e.g., president*; the asterisk is treated as a wildcard in the mapping function, and in case more than one match is possible, the match with the longest common prefix is returned).
12. We also carried out preliminary experiments with Random Forest (RF) and Logistic Regression (LR). SVM showed a remarkably better performance than RF, while no significant differences were noticed between SVM and LR.
13. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html.
14. Indeed, LIWC_GRAM, LIWC_COG and LIWC_FEELS create the highest number of features in our experiments, ranging from 3000 to more than 20000.
15. The selection is always carried out in the training set. During the 5-fold cross-validation optimization phase, feature selection is carried out in the corresponding \(80\%\) of the training set used as training.
16. https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased. This model obtained better results than the ‘uncased’ version in preliminary experiments.
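
The masking procedure described in note 11 can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the repository in note 1 for that); it only mirrors the description in the note under several assumptions. The names `lookup`, `mask_tokens`, `TOY_LEXICON` and `FEELS` are hypothetical, the toy lexicon entries and their category assignments are invented for illustration and are not taken from the LIWC dictionary, the "consistent ordering" is assumed to be alphabetical, category names are joined with an underscore for readability, and an exact dictionary entry is assumed to take precedence over any wildcard entry.

```python
from typing import Dict, FrozenSet, List

Lexicon = Dict[str, FrozenSet[str]]


def lookup(word: str, lexicon: Lexicon) -> FrozenSet[str]:
    """Return the categories m(w) of a word.

    Assumption: an exact entry wins; otherwise the wildcard entry
    ('prefix*') with the longest matching prefix is used.
    """
    word = word.lower()
    if word in lexicon:
        return lexicon[word]
    best_prefix, best_cats = None, frozenset()
    for entry, cats in lexicon.items():
        if not entry.endswith("*"):
            continue
        prefix = entry[:-1]
        longer = best_prefix is None or len(prefix) > len(best_prefix)
        if word.startswith(prefix) and longer:
            best_prefix, best_cats = prefix, cats
    return best_cats


def mask_tokens(tokens: List[str], lexicon: Lexicon,
                macro: FrozenSet[str], placeholder: str = "w") -> List[str]:
    """Replace each token by its categories restricted to the macro-category M."""
    masked = []
    for tok in tokens:
        cats = lookup(tok, lexicon) & macro  # m(w) ∩ M
        if cats:
            # Several categories become one token; alphabetical order is
            # an assumed stand-in for the "consistent ordering" of the note.
            masked.append("_".join(sorted(cats)))
        else:
            masked.append(placeholder)  # empty intersection -> 'w'
    return masked


# Toy lexicon: entries and category assignments are made up for illustration.
TOY_LEXICON: Lexicon = {
    "yo": frozenset({"Yo"}),
    "odi*": frozenset({"Afect", "EmoNeg", "Enfado"}),
    "feliz": frozenset({"Afect", "EmoPos"}),
    "president*": frozenset({"TopicCat"}),  # hypothetical topical category
}

# Macro-category (iii) as listed in note 10.
FEELS = frozenset({"Maldec", "Afect", "EmoPos", "EmoNeg",
                   "Ansiedad", "Enfado", "Triste", "Placer"})

example = mask_tokens(["Yo", "odio", "presidentes", "feliz", "casa"],
                      TOY_LEXICON, FEELS)
print(example)
# ['w', 'Afect_EmoNeg_Enfado', 'w', 'Afect_EmoPos', 'w']
```

In this sketch, words outside the chosen macro-category (including unknown words and words carrying only topical categories) collapse to the same ‘w’ symbol, which is what removes topical content from the masked sequence.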

Acknowledgment

The research work by Silvia Corbara was carried out during her visit at the Universitat Politècnica de València and was supported by the AI4Media project, funded by the EU Commission (Grant 951911, H2020 Programme ICT-48-2020).

The research work by Paolo Rosso was partially funded by the Generalitat Valenciana under DeepPattern (PROMETEO/2019/121).

Author information

Authors and Affiliations

Scuola Normale Superiore, Pisa, Italy
Silvia Corbara
Universitat Politècnica de València, Valencia, Spain
Berta Chulvi & Paolo Rosso
Universitat de València, Valencia, Spain
Berta Chulvi
Istituto di Scienza e Tecnologie dell’Informazione, CNR, Pisa, Italy
Alejandro Moreo

Corresponding author

Correspondence to Silvia Corbara.

Editor information

Editors and Affiliations

University of Bologna, Forlì, Italy
Alberto Barrón-Cedeño
University of Padua, Padova, Italy
Giovanni Da San Martino
University of Bologna, Bologna, Italy
Mirko Degli Esposti
Istituto di Scienza e Tecnologie dell'Informazione “Alessandro Faedo”, Pisa, Italy
Fabrizio Sebastiani
University of Glasgow, Glasgow, UK
Craig Macdonald
University Milano-Bicocca, Milan, Italy
Gabriella Pasi
TU Wien, Vienna, Austria
Allan Hanbury
Leipzig University, Leipzig, Germany
Martin Potthast
University of Padua, Padova, Italy
Guglielmo Faggioli
University of Padua, Padova, Italy
Nicola Ferro

Cite this paper

Corbara, S., Chulvi, B., Rosso, P., Moreo, A. (2022). Rhythmic and Psycholinguistic Features for Authorship Tasks in the Spanish Parliament: Evaluation and Analysis. In: Barrón-Cedeño, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2022. Lecture Notes in Computer Science, vol 13390. Springer, Cham. https://doi.org/10.1007/978-3-031-13643-6_6

DOI: https://doi.org/10.1007/978-3-031-13643-6_6
Published: 25 August 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13642-9
Online ISBN: 978-3-031-13643-6
eBook Packages: Computer Science, Computer Science (R0)