Richer Document Embeddings for Author Profiling tasks based on a heuristic search

https://doi.org/10.1016/j.ipm.2020.102227

Highlights

  • Users of Social Media can be profiled through their posts.

  • Word Embeddings capture semantic meaning in an n-dimensional vector space.

  • A Document can be represented as a weighted average of its Word Embeddings.

  • A newly proposed statistic, called Relevance Topic Value, is useful as a term weighting scheme.

  • Genetic Programming is useful for evolving competitive term weighting schemes.

Abstract

In this study we propose a novel method to generate Document Embeddings (DEs) by means of evolving mathematical equations that integrate classical term frequency statistics. To accomplish this, we employed a Genetic Programming (GP) strategy to build competitive formulae to weight custom Word Embeddings (WEs), produced by cutting-edge feature extraction techniques (e.g., word2vec, fastText, BERT), and then created DEs by weighted averaging. We exhaustively evaluated the proposed method over 9 datasets composed of several multilingual social media sources, with the aim of predicting personal attributes of authors (e.g., gender, age, personality traits) in 17 tasks. In each dataset we contrast the results obtained by our method against state-of-the-art competitors, placing our approach in the top quartile in all cases. Furthermore, we introduce a new numerical statistic feature called Relevance Topic Value (rtv), which can enhance the prediction of author characteristics by numerically describing the topic of a document and each user's personal use of words. Interestingly, based on a frequency analysis of terminals used by GP, rtv turned out to be the most likely feature to appear alone in a single equation, suggesting its usefulness as a WE weighting scheme.

Introduction

If you wanted to know some personal characteristics of us, the authors of this manuscript, such as our age, gender or even personality traits, just by looking at this and other texts we have produced, you would be doing Author Profiling (AP). As a Natural Language Processing (NLP) task, AP has proved its value in a variety of applications that span from sensitive scenarios such as digital security, spotting Internet predatory activities, and detecting fraud, cyber-terrorism or even plagiarism (Fatima, Hasan, Anwar, & Nawab, 2017), to more common settings like improving customer service, chatbots or the diagnosis of neurological disorders, among others (Rangel & Rosso, 2016).

To identify personal features of authors we must examine how they use words. A way to computationally manipulate these words is by representing them as numeric vectors. In this sense, Word Embeddings (WEs) can be viewed as dense vectors that encode the meaning of words, in such a way that similar concepts tend to cluster within an n-dimensional space; hence, the distance between WEs is a measure of similarity (Kocher & Savoy, 2017; Mikolov, Sutskever, Chen, Corrado, & Dean, 2013). WEs have delivered state-of-the-art results in tasks such as text classification, language translation and speech recognition. This idea can be extrapolated to represent bigger chunks of text (commonly known as Paragraph Vectors (PVs) or Document Embeddings (DEs)), which can be used to extract useful information, such as the “intention” of a whole document, an approach that has been exploited for AP and Sentiment Analysis (SA) tasks (Le & Mikolov, 2014). PVs or DEs can be produced by several techniques, the centroid method being one of the most popular and successful (Kusner, Sun, Kolkin, & Weinberger, 2015). The idea behind the centroid method is that a sentence, a paragraph or even a whole document can be viewed as the aggregation of its words; therefore, a DE can be generated by averaging the WEs of the words present in the document.
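To make the centroid method concrete, the following minimal Python sketch (an illustration, not the authors' implementation) builds a DE by averaging the WEs of a document's tokens; the word_vectors lookup is assumed to come from a pretrained model such as word2vec or fastText, and all names and dimensions are illustrative.

import numpy as np

def centroid_embedding(tokens, word_vectors, dim=300):
    """Average the embeddings of all in-vocabulary tokens of a document."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:                       # no known words in the document
        return np.zeros(dim)
    return np.mean(vectors, axis=0)       # one dense vector per document

# Toy example with 3-dimensional embeddings
word_vectors = {"social": np.array([0.2, 0.1, 0.5]),
                "media": np.array([0.4, 0.3, 0.1])}
doc_embedding = centroid_embedding(["social", "media", "users"], word_vectors, dim=3)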

Since it is based on a simple average, the centroid method may fail to capture subtle differences in authors' writing styles, making it less appropriate for AP endeavors. To overcome this issue, in this work we present a novel strategy to generate DEs. Our proposal relies on the hypothesis that it is possible to find novel and optimized weights for each word within a document, thus producing an improved aggregation strategy instead of just averaging terms. To test our hypothesis we employ Genetic Programming (GP), which is a sound approach to learn the intrinsic structure of data via mathematical equations (Bruns, Dunkel, & Offel, 2019). To the best of our knowledge, GP has not been employed for the purpose of evolving weighting schemes to aggregate WEs into DEs in the AP area, so a new application of GP is envisioned. Our proposed pipeline is as follows. GP employs statistical features of each word within a document (e.g., term frequency (tf), term frequency-inverse document frequency (tf-idf), Information Gain (IG)) to evolve equations that calculate the weights (importance) of terms. Then, using feature extraction techniques (e.g., word2vec, fastText, BERT), WEs are produced for the terms in the datasets. Next, WEs from users' posts are aggregated into DEs using a weighted average (with the importance established by GP). Finally, a Machine Learning approach uses the DEs to predict the gender, age, language variety and personality of authors. In addition, we introduce a novel numeric statistic feature (rtv), which is based on a frequency analysis of the use of words and topics by individuals. Moreover, rtv turned out to be the most likely feature to appear alone in a single equation, suggesting its usefulness as a WE weighting-scheme factor.
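As a sketch of the aggregation step just described, assuming the GP search has already produced a weighting equation over per-term statistics, the snippet below uses a hand-written stand-in formula in place of an evolved one; the function names and the particular combination of tf-idf and rtv are illustrative assumptions, not the authors' actual evolved scheme.

import numpy as np

def evolved_weight(stats):
    """Stand-in for a GP-evolved equation over per-term statistics."""
    # Example expression tree: tfidf * log(1 + rtv); the real formula is
    # learned by the GP search, not fixed by hand.
    return stats["tfidf"] * np.log1p(stats["rtv"])

def weighted_document_embedding(tokens, word_vectors, term_stats, dim=300):
    """Aggregate WEs into a DE using the per-term weights computed above."""
    weighted_sum, total_weight = np.zeros(dim), 0.0
    for token in tokens:
        if token in word_vectors and token in term_stats:
            w = evolved_weight(term_stats[token])
            weighted_sum += w * word_vectors[token]
            total_weight += w
    return weighted_sum / total_weight if total_weight > 0 else np.zeros(dim)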

We exhaustively evaluated our proposal using a total of 9 datasets (in 17 tasks) that were originally devised for the scientific event called Uncovering Plagiarism, Authorship and Social Software Misuse (PAN) (Rangel, Rosso, Montes-y-Gómez, Potthast, & Stein, 2018), within the period 2013–2018. Each year, the Conference and Labs of the Evaluation Forum (CLEF) organizes the PAN event, including a shared task where several teams compete to predict authors' features such as gender, age or personality traits (e.g., openness, extroversion) from a variety of multilingual social media sources. Within this context, we can contrast our proposal against all the teams that submitted an entry for each year's contest. For completeness, we also include two averaging baselines: a) a weighted average using only tf-idf values, and b) a simple mean of the WEs (centroids). The results of each comparison show that our proposed approach offers very competitive performance on AP-related tasks, ranking in the top quartile in every year's competition. This result also suggests the flexibility and robustness of our approach, since different AP tasks and datasets have been used across the years.
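For reference, a minimal sketch of baseline (a), the tf-idf weighted average, is shown below; it assumes the same kind of pretrained word_vectors lookup used in the earlier sketches, and the toy corpus, tokenization and scikit-learn usage are illustrative only. Baseline (b) corresponds to the plain centroid shown previously.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["example post written by some user",
          "another post written by a different user"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)      # documents x vocabulary
vocabulary = vectorizer.get_feature_names_out()

def tfidf_weighted_de(doc_index, word_vectors, dim=300):
    """Baseline (a): weight each WE by its tf-idf value in the document."""
    row = tfidf_matrix[doc_index].toarray().ravel()
    weighted_sum, total_weight = np.zeros(dim), 0.0
    for term, weight in zip(vocabulary, row):
        if weight > 0 and term in word_vectors:
            weighted_sum += weight * word_vectors[term]
            total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else np.zeros(dim)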

The rest of the document is organized as follows: In Section 2 we state our research objective. In Section 3 we discuss the related work for this study. In Section 4 we introduce our proposal. Section 5 presents the experimental setup and discusses the outcome of the comparisons made against competition results and baselines. Finally, in Section 6 we present conclusions and future work.

Section snippets

Research objective

The main objective of our work is to devise a novel approach to produce DEs in order to predict characteristics of authors, such as gender, age and personality traits, more accurately. Commonly in the literature, this has been done by aggregating WEs, either by a simple mean or a weighted average. Our approach consists of building task-specific weighting schemes that are learned from each particular dataset. In order to guide our work, we stated a research question and the specific objectives

Natural language processing and social media

Social media analytics has become fundamental for modern-life activities such as product placement, massive sentiment analysis, political marketing and even real-time information retrieval. Take as an example the work of Sánchez & Bellogín (2019), a method that delivers competitive results against cutting-edge recommendation techniques to address the preferences of users over the Internet. Furthermore, nowadays social networks such as Twitter act as live breaking news outlets. In

Methodology

Our aim in this study is to propose a strategy to compose DEs and to evaluate their suitability for different AP tasks. For this purpose we used datasets obtained from the PAN shared tasks 2013 through 2018 organized at CLEF (we will elaborate more on this later). The DEs devised are to be processed by either a classifier or a regressor, producing a profile of the individuals by gender, age, and personality traits. Fig. 1 depicts our full proposal.
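A minimal sketch of this final prediction stage is shown below; the choice of a linear SVM for categorical traits and a support vector regressor for continuous traits, as well as the random toy data, are illustrative assumptions rather than the exact estimators used in this work.

import numpy as np
from sklearn.svm import LinearSVC, SVR

# X holds one document embedding per author, built with any of the
# aggregation strategies discussed above; labels here are random toy values.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))            # 100 authors, 300-dimensional DEs
y_gender = rng.integers(0, 2, size=100)    # categorical trait -> classifier
y_openness = rng.uniform(-0.5, 0.5, 100)   # continuous trait  -> regressor

gender_classifier = LinearSVC().fit(X, y_gender)
openness_regressor = SVR().fit(X, y_openness)

print(gender_classifier.predict(X[:3]))    # predicted gender labels
print(openness_regressor.predict(X[:3]))   # predicted openness scores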

Our methodology can be summarized in

Results

We organize this section around two main sets of experiments. In the first experiment we evaluate the performance of the GPE-WS approach with respect to two well-known baseline strategies and our proposed statistic: a) a tf-idf weighted average, b) a plain average of WEs, and c) an rtv weighted average. The aim of this experiment is to contrast the performance of our proposal against other common methodologies to generate DEs. For the second experiment, we analyze how competitive is our

Conclusions and future work

We can assume that WEs trained over domain-specific datasets are useful as building blocks to construct DEs. In this sense, the methodology presented in this work, although straightforward, is arguably a sound strategy to produce text representations (DEs) for AP tasks. Our approach outperformed two baselines often depicted in the literature as solid references for NLP problems. Also, it proved to be a strong competitor against the top-quartile participants in six international competitions on AP

Acknowledgments

This work was supported by CONACYT project FC-2410.

References (41)

  • M.A. Álvarez-Carmona et al.

    INAOE's participation at PAN'15: Author Profiling task

  • S. Arora et al.

    A simple but tough-to-beat baseline for sentence embeddings

    5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings

    (2017)
  • A. Basile et al.

    N-GrAM: New Groningen Author-profiling Model

    CoRR

    (2017)
  • P. Bojanowski et al.

    Enriching word vectors with subword information

    Transactions of the Association for Computational Linguistics

    (2017)
  • M.Á.Á. Carmona et al.

    Evaluating topic-based representations for author profiling in social media

    IBERAMIA

    (2016)
  • S. Daneshvar et al.

    Gender identification in Twitter using n-grams and LSA: Notebook for PAN at CLEF 2018

    Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10–14, 2018

    (2018)
  • J. Devlin et al.

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers)

    (2019)
  • L.R. Goldberg

    The development of markers for the Big-Five factor structure

    Psychological Assessment

    (1992)
  • A. Khatua et al.

    A tale of two epidemics: Contextual word2vec for classifying Twitter streams during outbreaks

    Information Processing & Management

    (2019)
  • Q. Le et al.

    Distributed representations of sentences and documents

    Proceedings of the 31st International Conference on Machine Learning - Volume 32

    (2014)