Richer Document Embeddings for Author Profiling tasks based on a heuristic search

https://doi.org/10.1016/j.ipm.2020.102227

Highlights

  • Users of Social Media can be profiled through their posts.

  • Word Embeddings capture semantic meaning in an n-dimensional vector space.

  • A Document can be represented as a weighted average of its Word Embeddings.

  • A newly proposed statistic, called Relevance Topic Value, is useful as a term weighting scheme.

  • Genetic Programming is useful for evolving competitive term weighting schemes.

Abstract

In this study we propose a novel method to generate Document Embeddings (DEs) by means of evolving mathematical equations that integrate classical term frequency statistics. To accomplish this, we employed a Genetic Programming (GP) strategy to build competitive formulae to weight custom Word Embeddings (WEs), produced by cutting-edge feature extraction techniques (e.g., word2vec, fastText, BERT), and then created DEs by weighted averaging. We exhaustively evaluated the proposed method over 9 datasets composed of several multilingual social media sources, with the aim of predicting personal attributes of authors (e.g., gender, age, personality traits) in 17 tasks. In each dataset we contrast the results obtained by our method against state-of-the-art competitors, placing our approach in the top quartile in all cases. Furthermore, we introduce a new numerical statistic feature called Relevance Topic Value (rtv), which can enhance the prediction of author characteristics by numerically describing the topic of a document and each user's personal use of words. Interestingly, based on a frequency analysis of terminals used by GP, rtv turned out to be the most likely feature to appear alone in a single equation, suggesting its usefulness as a WE weighting scheme.

Introduction

If you wanted to know some personal characteristics of us, the authors of this manuscript, such as our age, gender or even personality traits, just by looking at this and other texts we have produced, you would be doing Author Profiling (AP). As a Natural Language Processing (NLP) task, AP has proved its value in a variety of applications that span from sensitive scenarios such as digital security, spotting Internet predatory activities, and detecting fraud, cyber-terrorism or even plagiarism (Fatima, Hasan, Anwar, & Nawab, 2017), to more common settings like improving customer service, chatbots or the diagnosis of neurological disorders, among others (Rangel & Rosso, 2016).

To identify personal features of authors we must examine how they use words. A way to computationally manipulate these words is by representing them as numeric vectors. In this sense, Word Embeddings (WEs) can be viewed as dense vectors that encode the meaning of words, in such a way that similar concepts tend to cluster within an n-dimensional space; hence, the distance between WEs is a measure of similarity (Kocher & Savoy, 2017; Mikolov, Sutskever, Chen, Corrado, & Dean, 2013). WEs have delivered state-of-the-art results in tasks such as text classification, language translation and speech recognition. This idea can be extrapolated to represent bigger chunks of text (commonly known as Paragraph Vectors (PVs) or Document Embeddings (DEs)), which can be used to extract useful information, such as the “intention” of a whole document, an approach that has been exploited for AP and Sentiment Analysis (SA) tasks (Le & Mikolov, 2014). PVs or DEs can be produced by several techniques, the centroid method being one of the most popular and successful (Kusner, Sun, Kolkin, & Weinberger, 2015). The idea behind the centroid method is that a sentence, a paragraph or even a whole document can be viewed as the aggregation of its words; therefore, a DE can be generated by averaging the WEs of the words present in the document.
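To make the centroid method concrete, the following minimal Python sketch (an illustration, not the authors' implementation) builds a DE by averaging the WEs of a document's tokens; the word_vectors lookup is assumed to come from a pretrained model such as word2vec or fastText, and all names and dimensions are illustrative.

import numpy as np

def centroid_embedding(tokens, word_vectors, dim=300):
    """Average the embeddings of all in-vocabulary tokens of a document."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:                       # no known words in the document
        return np.zeros(dim)
    return np.mean(vectors, axis=0)       # one dense vector per document

# Toy example with 3-dimensional embeddings
word_vectors = {"social": np.array([0.2, 0.1, 0.5]),
                "media": np.array([0.4, 0.3, 0.1])}
doc_embedding = centroid_embedding(["social", "media", "users"], word_vectors, dim=3)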

Since it is based on a simple average, the centroid method may fail to capture subtle differences in authors' writing styles, making it less appropriate for AP endeavors. To overcome this issue, in this work we present a novel strategy to generate DEs. Our proposal relies on the hypothesis that it is possible to find novel and optimized weights for each word within a document, thus producing an improved aggregation strategy instead of just averaging terms. To test our hypothesis we employ Genetic Programming (GP), which is a sound approach to learn the intrinsic structure of data via mathematical equations (Bruns, Dunkel, & Offel, 2019). To the best of our knowledge, GP has not been employed for the purpose of evolving weighting schemes to aggregate WEs into DEs in the AP area, so a new application of GP is envisioned. Our proposed pipeline is as follows. GP employs statistical features of each word within a document (e.g., term frequency (tf), term frequency-inverse document frequency (tf-idf), Information Gain (IG)) to evolve equations that calculate the weights (importance) of terms. Then, using feature extraction techniques (e.g., word2vec, fastText, BERT), WEs are produced for the terms in the datasets. Next, WEs from users' posts are aggregated into DEs using a weighted average (with the importance established by GP). Finally, a Machine Learning approach uses the DEs to predict the gender, age, language variety and personality of authors. In addition, we introduce a novel numeric statistic feature (rtv), which is based on a frequency analysis of the use of words and topics by individuals. Moreover, rtv turned out to be the most likely feature to appear alone in a single equation, suggesting its usefulness as a WE weighting-scheme factor.
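As a sketch of the aggregation step just described, assuming the GP search has already produced a weighting equation over per-term statistics, the snippet below uses a hand-written stand-in formula in place of an evolved one; the function names and the particular combination of tf-idf and rtv are illustrative assumptions, not the authors' actual evolved scheme.

import numpy as np

def evolved_weight(stats):
    """Stand-in for a GP-evolved equation over per-term statistics."""
    # Example expression tree: tfidf * log(1 + rtv); the real formula is
    # learned by the GP search, not fixed by hand.
    return stats["tfidf"] * np.log1p(stats["rtv"])

def weighted_document_embedding(tokens, word_vectors, term_stats, dim=300):
    """Aggregate WEs into a DE using the per-term weights computed above."""
    weighted_sum, total_weight = np.zeros(dim), 0.0
    for token in tokens:
        if token in word_vectors and token in term_stats:
            w = evolved_weight(term_stats[token])
            weighted_sum += w * word_vectors[token]
            total_weight += w
    return weighted_sum / total_weight if total_weight > 0 else np.zeros(dim)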

We exhaustively evaluated our proposal using a total of 9 datasets (in 17 tasks) that were originally devised for the scientific event called Uncovering Plagiarism, Authorship and Social Software Misuse (PAN) (Rangel, Rosso, Montes-y-Gómez, Potthast, & Stein, 2018), within the period 2013–2018. Each year, the Conference and Labs of the Evaluation Forum (CLEF) organizes the PAN event, including a shared task where several teams compete to predict authors' features such as gender, age or personality traits (e.g., openness, extroversion) from a variety of multilingual social media sources. Within this context, we can contrast our proposal against all the teams that submitted an entry for each year's contest. For completeness, we also include two averaging baselines: a) a weighted average using only tf-idf values, and b) a simple mean of the WEs (centroids). The results of each comparison show that our proposed approach offers very competitive performance on AP-related tasks, ranking in the top quartile in every year's competition. This result also suggests the flexibility and robustness of our approach, since different AP tasks and datasets have been used across the years.
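For reference, a minimal sketch of baseline (a), the tf-idf weighted average, is shown below; it assumes the same kind of pretrained word_vectors lookup used in the earlier sketches, and the toy corpus, tokenization and scikit-learn usage are illustrative only. Baseline (b) corresponds to the plain centroid shown previously.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["example post written by some user",
          "another post written by a different user"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)      # documents x vocabulary
vocabulary = vectorizer.get_feature_names_out()

def tfidf_weighted_de(doc_index, word_vectors, dim=300):
    """Baseline (a): weight each WE by its tf-idf value in the document."""
    row = tfidf_matrix[doc_index].toarray().ravel()
    weighted_sum, total_weight = np.zeros(dim), 0.0
    for term, weight in zip(vocabulary, row):
        if weight > 0 and term in word_vectors:
            weighted_sum += weight * word_vectors[term]
            total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else np.zeros(dim)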

The rest of the document is organized as follows: In Section 2 we state our research objective. In Section 3 we discuss the related work for this study. In Section 4 we introduce our proposal. Section 5 presents the experimental setup and discusses the outcome of the comparisons made against competition results and baselines. Finally, in Section 6 we present conclusions and future work.

Section snippets

Research objective

The main objective of our work is to devise a novel approach to produce DEs in order to predict characteristics of authors, such as gender, age and personality traits, more accurately. Commonly in the literature, this has been done by aggregating WEs, either by a simple mean or a weighted average. Our approach consists of building task-specific weighting schemes that are learned from each particular dataset. In order to guide our work, we stated a research question and the specific objectives

Natural language processing and social media

Social media analytics has become fundamental for modern-life activities such as product placement, massive sentiment analysis, political marketing and even real-time information retrieval. Take as an example the work of Sánchez & Bellogín (2019), a method that delivers competitive results against cutting-edge recommendation techniques to address the preferences of users over the Internet. Furthermore, nowadays social networks such as Twitter act as live breaking news outlets. In

Methodology

Our aim in this study is to propose a strategy to compose DEs and to evaluate their suitability for different AP tasks. For this purpose we used datasets obtained from the PAN shared tasks 2013 through 2018 organized at CLEF (we will elaborate more on this later). The DEs devised are to be processed by either a classifier or a regressor, producing a profile of the individuals by gender, age, and personality traits. Fig. 1 depicts our full proposal.
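A minimal sketch of this final prediction stage is shown below; the choice of a linear SVM for categorical traits and a support vector regressor for continuous traits, as well as the random toy data, are illustrative assumptions rather than the exact estimators used in this work.

import numpy as np
from sklearn.svm import LinearSVC, SVR

# X holds one document embedding per author, built with any of the
# aggregation strategies discussed above; labels here are random toy values.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))            # 100 authors, 300-dimensional DEs
y_gender = rng.integers(0, 2, size=100)    # categorical trait -> classifier
y_openness = rng.uniform(-0.5, 0.5, 100)   # continuous trait  -> regressor

gender_classifier = LinearSVC().fit(X, y_gender)
openness_regressor = SVR().fit(X, y_openness)

print(gender_classifier.predict(X[:3]))    # predicted gender labels
print(openness_regressor.predict(X[:3]))   # predicted openness scores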

Our methodology can be summarized in

Results

We organize this section around two main sets of experiments. In the first experiment we evaluate the performance of the GPE-WS approach with respect to two well-known baseline strategies and our proposed statistic: a) a tf-idf weighted average, b) a plain average of WEs, and c) an rtv weighted average. The aim of this experiment is to contrast the performance of our proposal against other common methodologies to generate DEs. For the second experiment, we analyze how competitive is our

Conclusions and future work

We can assume that WEs trained over domain-specific datasets are useful as building blocks to construct DEs. In this sense, the methodology presented in this work, although straightforward, is arguably a sound strategy to produce text representations (DEs) for AP tasks. Our approach outperformed two baselines often depicted in the literature as solid references for NLP problems. Also, it proved to be a strong competitor against the top-quartile participants in six international competitions on AP

Acknowledgments

This work was supported by CONACYT project FC-2410.

References (41)

  • M.A. Álvarez-Carmona et al.

    INAOE's participation at PAN'15: Author Profiling task

  • S. Arora et al.

    A simple but tough-to-beat baseline for sentence embeddings

    5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings

    (2017)
  • A. Basile et al.

    N-GrAM: New Groningen Author-profiling Model

    CoRR

    (2017)
  • P. Bojanowski et al.

    Enriching word vectors with subword information

    Transactions of the Association for Computational Linguistics

    (2017)
  • M.Á.Á. Carmona et al.

    Evaluating topic-based representations for author profiling in social media

    IBERAMIA

    (2016)
  • S. Daneshvar et al.

    Gender identification in Twitter using n-grams and LSA: Notebook for PAN at CLEF 2018

    Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10–14, 2018

    (2018)
  • J. Devlin et al.

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers)

    (2019)
  • L.R. Goldberg

    The development of markers for the Big-Five factor structure

    Psychological Assessment

    (1992)
  • A. Khatua et al.

    A tale of two epidemics: Contextual word2vec for classifying Twitter streams during outbreaks

    Information Processing & Management

    (2019)
  • Q. Le et al.

    Distributed representations of sentences and documents

    Proceedings of the 31st International Conference on Machine Learning - Volume 32

    (2014)