Richer Document Embeddings for Author Profiling tasks based on a heuristic search
Introduction
If you wanted to know some personal characteristics of us, the authors of this manuscript, such as our age, gender, or even personality traits, just by looking at this and other texts we have produced, you would be doing Author Profiling (AP). As a Natural Language Processing (NLP) task, AP has proved its value in a variety of applications, spanning from sensitive scenarios such as digital security, spotting Internet predatory activities, and detecting fraud, cyber-terrorism, or even plagiarism (Fatima, Hasan, Anwar, & Nawab, 2017), to more common settings such as improving customer service, chatbots, or the diagnosis of neurological disorders, among others (Rangel & Rosso, 2016).
To identify personal features of authors we must examine how they use words. A way to computationally manipulate these words is to represent them as numeric vectors. In this sense, Word Embeddings (WEs) can be viewed as dense vectors that encode the meaning of words, such that similar concepts tend to cluster within an n-dimensional space; hence, the distance between WEs is a measure of similarity (Kocher & Savoy, 2017; Mikolov, Sutskever, Chen, Corrado, & Dean, 2013). WEs have delivered state-of-the-art results in tasks such as text classification, language translation, and speech recognition. This idea can be extrapolated to represent larger chunks of text, commonly known as Paragraph Vectors (PVs) or Document Embeddings (DEs), which can be used to extract useful information, such as "intention", from a whole document; this approach has been exploited for AP and Sentiment Analysis (SA) tasks (Le & Mikolov, 2014). PVs or DEs can be produced by several techniques, the centroid method being one of the most popular and successful (Kusner, Sun, Kolkin, & Weinberger, 2015). The idea behind the centroid method is that a sentence, paragraph, or even a whole document can be viewed as the aggregation of its words; therefore, a DE can be generated by averaging the WEs of the words present in the document.
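The centroid construction can be sketched in a few lines. The 4-dimensional toy vectors below are illustrative only; real WEs such as word2vec or fastText vectors typically have hundreds of dimensions:

```python
import numpy as np

# Toy word embeddings (in practice obtained from word2vec, fastText, etc.)
embeddings = {
    "love":  np.array([0.9, 0.1, 0.3, 0.0]),
    "loud":  np.array([0.1, 0.7, 0.2, 0.5]),
    "music": np.array([0.2, 0.8, 0.1, 0.4]),
}

def centroid_embedding(tokens, embeddings):
    """Document embedding as the plain mean (centroid) of its word embeddings."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0)

doc = ["love", "loud", "music"]
de = centroid_embedding(doc, embeddings)
```

Note that every word contributes equally to the centroid, regardless of how informative it is for the task at hand.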
Since it is based on averaging, the centroid method may fail to capture subtle differences in authors' writing styles, making it less appropriate for AP endeavors. To overcome this issue, in this work we present a novel strategy to generate DEs. Our proposal relies on the hypothesis that it is possible to find optimized weights for each word within a document, producing an improved aggregation strategy instead of a plain average. To test our hypothesis we employ Genetic Programming (GP), a sound approach to learning intrinsic structure within data via mathematical equations (Bruns, Dunkel, & Offel, 2019). To the best of our knowledge, GP has not been employed for the purpose of evolving weighting schemes to aggregate WEs into DEs in the AP area, so a new application of GP is envisioned. Our proposed pipeline is as follows. GP uses statistical features of each word within a document (e.g., term frequency (tf), term frequency-inverse document frequency (tf-idf), Information Gain (IG)) to evolve equations that calculate the weights (importance) of terms. Then, using feature-extraction techniques (e.g., word2vec, fastText, BERT), WEs are produced for the terms in the datasets. Next, the WEs from users' posts are aggregated into DEs using a weighted average, with the importance established by GP. Finally, a Machine Learning approach uses the DEs to predict the gender, age, language variety, and personality of authors. In addition, we introduce a novel statistical feature (rtv), based on a frequency analysis of the words and themes used by a person. Moreover, rtv turned out to be the feature most likely to appear alone in a single evolved equation, suggesting its usefulness as a WE weighting-scheme factor.
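A minimal sketch of the weighted aggregation step follows. The weight function here is a hypothetical stand-in for an equation evolved by GP; in the actual pipeline the equations are learned per task from word-level statistics such as tf and tf-idf:

```python
import numpy as np

def weighted_document_embedding(tokens, embeddings, weight_fn, stats):
    """Weighted average of word embeddings, with per-word weights from weight_fn."""
    weights, vectors = [], []
    for t in tokens:
        if t in embeddings:
            weights.append(weight_fn(stats[t]))
            vectors.append(embeddings[t])
    return np.average(np.vstack(vectors), axis=0, weights=np.array(weights))

# Hypothetical weight equation standing in for one evolved by GP.
def gp_like_weight(s):
    return s["tf"] * s["idf"] + np.sqrt(s["idf"])

# Toy 2-d embeddings and word statistics, for illustration only.
embeddings = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
stats = {"a": {"tf": 2, "idf": 1.0}, "b": {"tf": 1, "idf": 4.0}}
de = weighted_document_embedding(["a", "b"], embeddings, gp_like_weight, stats)
```

Setting `weight_fn` to a constant recovers the centroid baseline, so the evolved schemes strictly generalize the plain average.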
We exhaustively evaluated our proposal on a total of 9 datasets (covering 17 tasks) originally devised for the scientific event Uncovering Plagiarism, Authorship and Social Software Misuse (PAN) (Rangel, Rosso, Montes-y-Gómez, Potthast, & Stein, 2018), within the period 2013–2018. Each year, the Conference and Labs of the Evaluation Forum (CLEF) organizes the PAN conference, which includes a shared task where several teams compete to predict authors' features such as gender, age, or personality traits (e.g., openness, extroversion) from a variety of multilingual social media sources. Within this context, we can contrast our proposal against all the teams that submitted an entry in each year's contest. For completeness, we also include two averaging baselines: a) a weighted average using only tf-idf values, and b) a simple mean of the WEs (centroids). The results of each comparison show that our approach offers very competitive performance on AP-related tasks, ranking in the top quartile of every year's competition. This result also suggests the flexibility and robustness of our approach, since different AP tasks and datasets have been used across the years.
The rest of the document is organized as follows: Section 2 states our research objective. Section 3 discusses the related work for this study. Section 4 introduces our proposal. Section 5 presents the experimental setup and discusses the outcome of the comparisons against competition results and baselines. Finally, Section 6 presents conclusions and future work.
Section snippets
Research objective
The main objective of our work is to devise a novel approach to produce DEs in order to predict characteristics of authors, such as gender, age, and personality traits, more accurately. Commonly in the literature, this has been done by aggregating WEs, either by a simple mean or a weighted average. Our approach consists of building task-specific weighting schemes, learned from each particular dataset. In order to guide our work, we stated a research question and the specific objectives
Natural language processing and social media
Social media analytics has become fundamental for modern-life activities such as product placement, massive sentiment analysis, political marketing, and even real-time information retrieval. Take as an example the work of Sánchez and Bellogín (2019), a method that delivers competitive results against cutting-edge recommendation techniques to address the preferences of users over the Internet. Furthermore, social networks such as Twitter nowadays act as live breaking-news outlets. In
Methodology
Our aim in this study is to propose a strategy to compose DEs and evaluate their suitability for different AP tasks. For this purpose we used datasets obtained from the PAN shared tasks 2013 through 2018 organized at CLEF (we will elaborate on this later). The DEs devised are then processed by either a classifier or a regressor, producing a profile of the individuals by gender, age, and personality traits. Fig. 1 depicts our full proposal.
Our methodology can be summarized in
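As a toy illustration of the final prediction step, the sketch below classifies hypothetical DEs with a minimal nearest-centroid rule; any standard classifier or regressor could take its place:

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test DE the label of the nearest class centroid."""
    classes = np.unique(y_train)
    centroids = np.vstack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Hypothetical 2-d DEs for six authors with binary labels (e.g., gender).
X_train = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
                    [5.0, 5.1], [5.1, 4.9], [4.8, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
preds = nearest_centroid_predict(X_train, y_train,
                                 np.array([[0.1, 0.1], [5.0, 5.0]]))
```

For continuous targets such as personality-trait scores, a regressor would replace the classifier in the same position of the pipeline.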
Results
We organize this section around two main sets of experiments. In the first experiment we evaluate the performance of the GPE-WS approach with respect to two well-known baseline strategies and our proposed statistic value: a) a tf-idf weighted average, b) a regular WE average, and c) an rtv weighted average. The aim of this experiment is to contrast the performance of our proposal against other common methodologies for generating DEs. For the second experiment, we analyze how competitive our
Conclusions and future work
We can conclude that WEs trained over domain-specific datasets are useful building blocks to construct DEs. In this sense, the methodology presented in this work, although straightforward, is arguably a sound strategy to produce text representations (DEs) for AP tasks. Our approach outperformed two baselines often depicted in the literature as solid references for NLP problems. It also proved to be a strong competitor against the top-quartile participants in six international competitions on AP
Acknowledgments
This work was supported by CONACYT project FC-2410.
References (41)
- et al.
Learning of complex event processing rules with genetic programming
Expert Systems with Applications
(2019) - et al.
Representation learning for very short texts using weighted word embedding aggregation
Pattern Recognition Letters
(2016) - et al.
Term-weighting learning via genetic programming for text classification
Knowledge-Based Systems
(2015) - et al.
Multilingual author profiling on Facebook
Information Processing & Management
(2017) - et al.
Expressive signals in social media languages to improve polarity detection
Information Processing & Management
(2016) - et al.
Distance measures in author profiling
Information Processing & Management
(2017) - et al.
From word embeddings to document distances
Proceedings of the 32nd International Conference on Machine Learning - Volume 37
(2015) - et al.
Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval
IEEE/ACM Transactions on Audio, Speech, and Language Processing
(2016) - et al.
Overview of the 2nd author profiling task at PAN 2014
CLEF 2014 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings, CEUR-WS.org (Sep 2014)
(2014) - et al.
GronUP: Groningen user profiling
Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum, Évora, Portugal, 5–8 September 2016
(2016)
INAOE's participation at PAN'15: Author profiling task
A simple but tough-to-beat baseline for sentence embeddings
5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings
N-GrAM: New Groningen Author-profiling Model
CoRR
Enriching word vectors with subword information
Transactions of the Association for Computational Linguistics
Evaluating topic-based representations for author profiling in social media
IBERAMIA
Gender identification in Twitter using n-grams and LSA: Notebook for PAN at CLEF 2018
Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10–14, 2018
BERT: Pre-training of deep bidirectional transformers for language understanding
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers)
The development of markers for the Big-Five factor structure
Psychological Assessment
A tale of two epidemics: Contextual word2vec for classifying Twitter streams during outbreaks
Information Processing & Management
Distributed representations of sentences and documents
Proceedings of the 31st International Conference on Machine Learning - Volume 32