Introduction

Author profiling helps in identifying demographic aspects of an author, such as gender, age, or native language, as well as psychographic ones, such as personality type. It is of growing importance for applications in forensics, security, and marketing. For instance, from forensic and security perspectives, it is important to infer the profile of the author of a harassing text message or a threat. Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products.

Demographic Profiling

Pioneering investigations in author profiling from computational linguistics [5] and social psychology [43] focused on formal and well-written texts in English [25]. With the rise of social media, researchers such as [28, 57] moved their interest to blogs and fora. Since 2013, within PAN,Footnote 1 we have been organising several author profiling tasks at CLEF,Footnote 2 as well as at FIRE, where we have addressed different problems (age, gender, language variety identification, personality recognition), in several languages (Arabic, Dutch, English, Italian, Portuguese, Russian) and genres (blogs, reviews, social media, Twitter, Facebook, source code in Python and Java). These tasks allowed us to create a common evaluation framework in which other researchers can investigate further.

Regarding age and gender identification, the best performing team in the first three editions of the author profiling shared task at PAN@CLEF used a second-order representation that relates documents to author profiles and subprofiles (e.g., males talking about video games) [4]. The authors of [61] used the text to be identified as a query for a search engine, showing the competitiveness of information retrieval-based features for identifying age and gender. In [35], the authors used MapReduce to approach the task with 3 million n-gram-based features, improving accuracy and reducing processing time. The graph-based EmoGraph approach [46] captures how users convey verbal emotions in the morphosyntactic structure of the discourse. The authors modelled the sequence of grammatical categories as a graph and enriched it with topics, the semantics of verbs, polarity, and emotions. They proved the competitiveness of the approach, as well as its robustness across genres and languages [45].
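As an illustration of this idea, the following is a minimal sketch (not the authors' implementation) that builds a graph from a sequence of grammatical categories using networkx; the POS tags and the toy sequence are our own assumptions, and the original approach further enriches the graph with topics, verb semantics, polarity, and emotions.

```python
# Illustrative sketch: model a sequence of grammatical categories (POS tags)
# as a directed graph, in the spirit of graph-based approaches such as EmoGraph.
# The tags and the toy sequence are hypothetical.
import networkx as nx

def pos_sequence_to_graph(pos_tags):
    """Nodes are POS tags; edge weights count how often one tag follows another."""
    g = nx.DiGraph()
    for prev_tag, next_tag in zip(pos_tags, pos_tags[1:]):
        if g.has_edge(prev_tag, next_tag):
            g[prev_tag][next_tag]["weight"] += 1
        else:
            g.add_edge(prev_tag, next_tag, weight=1)
    return g

# Toy example: POS tags of one (hypothetical) sentence.
graph = pos_sequence_to_graph(["PRON", "VERB", "DET", "NOUN", "ADJ", "NOUN"])
print(graph.edges(data=True))
```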

Although it may be considered a more basic problem, identifying the language variety of an author is an important aspect to take into account when, for instance, the author of a harassing text message or a threat needs to be profiled. Discriminating among similar languages (e.g., Malaysian vs. Indonesian) or varieties of the same language (e.g., UK vs. US English, Peruvian vs. Colombian Spanish) implies dealing with texts that are very similar not only at the lexical, syntactic, and semantic levels, but also at the pragmatic level, due to the cultural idiosyncrasies of the authors. In recent years, several researchers have addressed this task for different languages such as English [34], Chinese [26], Spanish [21, 36, 51], or Portuguese [66], among others. In this regard, the authors in [66] created a corpus for Portuguese by collecting 1000 articles from the Folha de S. PauloFootnote 3 and Diário de NotíciasFootnote 4 newspapers, for the Brazilian and Portugal varieties, respectively. They reported accuracies of 99.6%, 91.2%, and 99.8% with word unigrams, word bigrams, and character 4-grams, respectively. Also for Portuguese, the authors in [13] combined character 6-grams with word unigrams and bigrams to obtain an accuracy of 92.71% on Twitter texts. In the case of Spanish, the authors in [36] combined language models with n-grams and reported accuracies of 60–70% when discriminating among Argentinian, Chilean, Colombian, Mexican, and Spanish tweets. Similarly, the authors in [51] created the HispaBlogsFootnote 5 corpus, which covers Spanish varieties from Argentina, Chile, Mexico, Peru, and Spain. They proposed a low-dimensionality representation for the texts and reported an accuracy of 71.1%. In another investigation with HispaBlogs, the authors in [21] compared the previous representation with Skip-grams and Sentence Vectors, obtaining 72.2% and 70.8% accuracy, respectively. In the case of Chinese, the authors in [62] combined general features such as character and word n-grams with PMI-based and word alignment-based features to approach the task of discriminating among varieties of Mandarin Chinese in the Greater China Region: Mainland China, Hong Kong, Taiwan, Macao, Malaysia, and Singapore. They reported accuracies of up to 90.91%.
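To give a flavour of the n-gram-based approaches mentioned above, the following is a minimal sketch, not a reproduction of any of the cited systems, of a character 4-gram classifier for language variety using scikit-learn; the toy texts, labels, and pipeline settings are illustrative assumptions.

```python
# Minimal sketch of a character n-gram approach to language variety
# identification, in the spirit of the works cited above.
# The toy texts and labels are made up.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["ônibus lotado hoje de manhã",      # toy Brazilian Portuguese
         "apanhei o autocarro esta manhã"]   # toy European Portuguese
labels = ["pt-BR", "pt-PT"]

# Character 4-grams, reported to work well for Portuguese varieties.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4)),
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["perdi o autocarro outra vez"]))
```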

Psychographic Profiling

Psychographics is the study of personality, values, attitudes and lifestyles. For instance, psychographic segmentation involves dividing a market into segments based upon different personality traits, values, attitudes, interests, and lifestyles of consumers.

Regarding the pioneers of personality recognition, early investigations were carried out by Argamon et al. [6], who focused on the identification of extroversion and emotional stability. They used support vector machines with a combination of word categories and the relative frequency of function words to recognise these traits from self-reports. Similarly, Oberlander and Nowson [41] focused on the personality identification of bloggers. Mairesse et al. [37] analysed the impact of different sets of psycholinguistic features obtained with LIWCFootnote 6 and MRC,Footnote 7 showing the highest performance for the openness to experience trait.

More recently, researchers have focused on personality recognition in social media. In [14, 22, 44], the authors analysed different sets of linguistic features, as well as friend counts and daily activity. In [29], the authors reported a comprehensive analysis of features such as the size of the friendship network, the number of uploaded photos, or the events attended by the user. They analysed more than 180,000 Facebook users and found correlations between these features and the different traits, especially in the case of extroversion. With the same Facebook dataset and a similar set of features, Bachrach et al. [8] reported strong results for automatically predicting extroversion.

In [58], the authors analysed 75,000 Facebook messages of volunteers who completed a personality test and found interesting correlations between word usage and personality traits. According to them, extroverts use more social words and introverts more words related to solitary activities. Emotionally stable people use words related to sports, vacations, the beach, church, or teams, whereas neurotics use more words and sentences referring to depression. In [40], the author introduced a new vectorial semantics approach to personality assessment, which involves the construction of vectors representing personality dimensions and disorders, and the automatic measurement of the similarity between these vectors and texts written by human subjects.

Recently, at GermEval 2020, a task was organised on the Prediction of Intellectual Ability and Personality Traits from Text.Footnote 8

PAN Lab Tracks at FIRE

In the name of cross-fertilization across evaluation forums,Footnote 9 in 2011 we became involved in the organization of tracks at FIRE, most of them as PAN tracks at FIRE. Initially, we addressed the problems of text reuse (2011) and similarity search (2012, 2013), both from a cross-language perspective [10, 23, 24]. In the former two tracks, datasets with texts in English, Gujarati, and Hindi were provided. The problem of text reuse was also addressed on source code, both from mono- and cross-(programming)-language perspectives [19, 20]. The problem of plagiarism detection was addressed in 2015 and 2016 in Arabic and Persian, with the aim of also attracting to FIRE the research communities working with these languages [7, 11].

In 2015 and 2016, we were also partially involved in the organization of the track on Mixed Script Information Retrieval (MSIR) [9, 59] that will be described in another chapter of this special issue on FIRE 10 years’ anniversary.

More recently, we have been involved in the organization of several author profiling tracks, addressing problems such as personality recognition from source code (2016) [50], gender identification from a cross-genre perspective in Russian (2017) [33], and native language identification (Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu) from texts in English (2017) [33]. In 2019, in the framework of a track on author profiling and deception detection in Arabic, we organized a task on the identification of age, gender, and language variety from tweets.Footnote 10

In this chapter, we will present three author profiling shared tasks we have organized at FIRE, describing the resources that we created and made available to the research community, illustrating the obtained results and highlighting the main achievements.

The rest of the paper is structured as follows. In the next section, we introduce the track on author profiling in Arabic tweets that we organised in 2019. Then, the RusProfiling track on cross-genre gender identification in Russian, organised in 2017, is presented. The subsequent section is dedicated to the PR-SOCO track on personality recognition in source code, which was organised in 2016. In the final section, we draw some conclusions and discuss future directions for author profiling, also in Indian languages.

Age, Gender, and Language Variety Identification in Arabic

With respect to Arabic, research on age and gender identification is scarcer. The authors in [18] collected 8028 emails from 1030 native speakers of Egyptian Arabic. They proposed 518 features, tested several machine learning algorithms, and reported accuracies between 72.10% and 81.15% for gender and age identification, respectively. The authors in [3] approached gender identification in articles written in Modern Standard Arabic from well-known Arabic newspapers. With a combination of bag-of-words, sentiment, and emotion features, they reported an accuracy of 86.4%. Subsequently, the authors in [2] extended their work by experimenting with different machine learning algorithms, data subsets, and feature selection methods, reporting accuracies of up to 94%. The authors in [1] manually annotated tweets in Jordanian dialects with gender information. They showed how the name of the author of a tweet can significantly improve the performance. They also experimented with other stylistic features, such as the number of words per tweet or the average word length, achieving a best result of 99.50%.

The increasing interest in Arabic variety identification is reflected in the eighteen and six teams that participated, respectively, in the Arabic subtask of the third DSL track [38] and in the Arabic Dialect Identification (ADI) shared task [67], as well as in the twenty teams that participated in the Arabic subtask of the Author Profiling shared task [52] at PAN 2017. However, as the authors in [55] highlighted, there is still a lack of resources and investigations for that language. Among the few existing works are the following. The authors in [65] used a smoothed word unigram model and reported accuracies of 87.2%, 83.3%, and 87.9%, respectively, for the Levantine, Gulf, and Egyptian varieties. The authors in [56] achieved 98% accuracy discriminating among Egyptian, Iraqi, Gulf, Maghrebi, Levantine, and Sudanese varieties with n-grams. The authors in [17] combined content- and style-based features to obtain 85.5% accuracy discriminating between Egyptian and Modern Standard Arabic. Recently, the identification of language varieties and demographics in Arabic has been addressed in [54].

Task Description

Author profiling distinguishes between classes of authors by studying how language is shared among groups of people. This helps in identifying profiling aspects such as age, gender, and language variety, among others. The focus of this task is to identify the age, gender, and language variety of Arabic Twitter users.

Dataset

This corpus was developed at Carnegie Mellon University in Qatar (CMUQ) [63] with the aim of providing a fine-grained annotated corpus for Arabic. It contains 15 dialectal varieties corresponding to the 22 countries of the Arab League. For each variety, a total of 102 authors (78 for training, 24 for test) were annotated with age and gender,Footnote 11 keeping both variables balanced. The following groups were considered for the age annotation: under 25, between 25 and 34, and 35 and above. For each author, more than 2000 tweets were retrieved from her/his timeline. The included varieties are Algeria, Egypt, Iraq, Kuwait, Lebanon-Syria, Libya, Morocco, Oman, Palestine-Jordan, Qatar, Saudi Arabia, Sudan, Tunisia, United Arab Emirates, and Yemen. More information about this corpus is available in [64].

Evaluation Measures

Since the data are completely balanced, performance is evaluated in terms of accuracy, following what has been done in the author profiling tasks at PAN@CLEF. For each subtask (age, gender, language variety), we calculate individual accuracies. Systems are ranked by joint accuracy, i.e., the proportion of authors for whom age, gender, and language variety are all correctly identified together.
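For illustration, the following sketch shows how individual and joint accuracies can be computed; the variable names and toy data are ours, and this is not the official evaluation script.

```python
# Illustrative computation of per-subtask and joint accuracy.
# Each truth/prediction is an (age, gender, variety) triple; data are made up.
truths = [("25-34", "male", "Egypt"), ("under 25", "female", "Iraq")]
preds  = [("25-34", "male", "Egypt"), ("under 25", "male",   "Iraq")]

def accuracy(pairs):
    return sum(1 for t, p in pairs if t == p) / len(pairs)

for i, name in enumerate(["age", "gender", "variety"]):
    print(name, accuracy([(t[i], p[i]) for t, p in zip(truths, preds)]))

# Joint accuracy: all three aspects must be correct for the same author.
print("joint", accuracy(list(zip(truths, preds))))
```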

Result Analysis

Thirteen teams participated in the APDA shared task, submitting a total of 28 runs. Participants used different kinds of features, from classical approaches based on n-grams and support vector machines to novel representations such as BERT. The best overall result (45.56% joint accuracy) was achieved by DBMS-KU [39], who combined word n-grams, character n-grams, and function words to train support vector machines. The best result for gender identification (81.94%) was obtained by MagdalenaYVino, who did not send information about their system. In the case of age identification, the best result was achieved by Yutong [60] (62.50%), with a logistic regression classifier trained on a combination of word unigrams and character 2- to 5-grams. Finally, with regard to language variety identification, the best result (97.78%) was also achieved by DBMS-KU. More details can be found in [53] (Table 1).
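A minimal sketch in the spirit of the winning approach (word and character n-grams fed to a linear SVM) is shown below; the exact n-gram orders, the handling of function words, and the toy data are assumptions on our part, not DBMS-KU's actual configuration.

```python
# Sketch in the spirit of the best APDA run: combine word and character
# n-gram features and train a linear SVM. The concrete configuration and
# the toy data are assumptions, not the participants' actual system.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

features = FeatureUnion([
    ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
])

gender_clf = Pipeline([("features", features), ("svm", LinearSVC())])

# Toy data: one merged tweet collection per author, with a gender label.
authors = ["... tweets of author 1 ...", "... tweets of author 2 ..."]
genders = ["female", "male"]
gender_clf.fit(authors, genders)
print(gender_clf.predict(["... tweets of a new author ..."]))
```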

Table 1 Overall ranking in terms of accuracy

Cross-Genre Gender Identification in Russian

Few research works have addressed author profiling in Russian. A corpus for author profiling in Russian was created by [30]. Gender identification in Russian texts has been addressed in [31], and in social media in [32].

Task Description

In the RusProfiling track, we addressed the problem of predicting an author's gender in Russian from a cross-genre perspective: given a training set of Twitter data, the systems were evaluated on five different genres: essays, Facebook, Twitter, reviews, and gender-imitated texts, in which the authors imitate the opposite gender and thus change their idiostyle.

In the following, we describe the datasets that were made available to the participants and the research community. They were created using both manual and automated techniques.Footnote 12

Datasets

Twitter Dataset

This dataset was composed of 500 users per gender. It was split into a training set (300 users per gender) and a test set (200 users per gender). The number of tweets per user varied from 1 to 200, depending on how active the users were at the time the data were collected (September 2016). All tweets from one user were merged together and considered as one text. As the analysis suggests, the tweets contain a lot of non-original information (hashtags, hidden citations such as copied newsfeeds, hyperlinks, etc.), which makes them extremely challenging to analyze.

Facebook Dataset

This dataset was composed of 228 users (114 authors per gender) from different age groups (20+, 30+, 40+) and different Russian cities, chosen randomly (to minimise mutual friendships). We used the same principles for gender labeling as for Twitter. All posts from one user were merged into one text, with an average length of 1000 words. As with the Twitter data, Facebook pages of famous people involved in administration or government, as well as accounts of heads of major companies, were not used in the study.

Essay Dataset

This dataset is composed of 185 authors per gender, with one or two texts per author (in the case of two texts, they were merged together and considered as one text). The texts were taken randomly from the manually collected RusPersonality corpus [30]. RusPersonality is the first Russian-language corpus of written texts labeled with data on their authors. A unique aspect of the corpus is the breadth of the metadata (gender, age, personality, neuropsychological testing data, education level, etc.). The topics of the texts were: a letter to a friend, a picture description, and a letter intended to convince an employer to hire the respondent. The average text length in this dataset was 150 words.

Review Dataset

This dataset was composed of 388 authors per gender, with one text per author. The texts were collected from Trustpilot,Footnote 13 and the author's gender was identified on the basis of the profile information. The average text length was 80 words.

Gender-Imitated Dataset

In this dataset, 47 authors per gender were considered, with three texts per author that were merged together and considered as one text. The texts were randomly selected from the Gender Imitation corpus, which is the first Russian corpus for studies of stylistic deception. Each respondent (n = 142) was instructed to write three texts on the same topic (from a list). The first text was supposed to be written in the way usual for the writer (without any deception); the second one as if written by someone of the opposite gender ("imitation"); and the third one as if written by another individual of the same gender, so that the author's personal writing style would not be recognized ("obfuscation"). Most of the texts are 80–150 words long. All of the respondents are students of Russian universities. Besides the texts, the corpus includes metadata with the authors' characteristics: gender, age, native language, handedness, and psychological gender (femininity/masculinity). Therefore, the corpus provides the opportunity to investigate problems that arise when imitating properties of written speech in different respects, as well as gender (biological and psychological) imitation in texts.

In Table 2, a summary of the number of authors per dataset is shown.

Table 2 Distribution of authors per dataset (half per gender)

Performance Measures

Accuracy was used, as in the PAN author profiling tasks at CLEF. We calculated the accuracy per dataset as the number of authors correctly identified divided by the total number of authors in that dataset. The global ranking was obtained by calculating the average accuracy over all datasets, weighted by the number of documents in each dataset:

$$\begin{aligned} {\text {global}}\_{\text {acc}}=\frac{\sum _{{\text {ds}}}{\text {accuracy}}({\text {ds}})\times {\text {size}}({\text {ds}})}{\sum _{{\text {ds}}} {\text {size}}({\text {ds}})}. \end{aligned}$$
(1)
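In code, the weighted average of Formula (1) can be computed as in the following sketch; the per-dataset accuracies are illustrative, while the sizes correspond to the dataset descriptions above.

```python
# Sketch of the weighted global accuracy of Formula (1); the per-dataset
# accuracies below are illustrative, not real results.
def global_accuracy(results):
    """results: dict mapping dataset name -> (accuracy, size)."""
    total = sum(size for _, size in results.values())
    return sum(acc * size for acc, size in results.values()) / total

results = {"twitter": (0.55, 400), "facebook": (0.60, 228),
           "essays": (0.52, 370), "reviews": (0.50, 776),
           "gender_imitation": (0.48, 94)}
print(round(global_accuracy(results), 4))
```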

Result Analysis

Five teams submitted 22 runs (a total of 93 runs over the five different datasets). Participants used different kinds of approaches, from traditional ones based on hand-crafted features and machine learning techniques such as support vector machines, to the currently fashionable deep learning techniques. The best results were achieved not on Twitter but on Facebook. The reason may be that, although Facebook maintains the spontaneity of Twitter, its posts are usually longer and grammatically richer, with fewer syntactic errors and misspellings. On the other hand, the results obtained on reviews were among the worst.

In the case of the gender-imitated texts, most systems failed, with 11 runs at or below the majority baseline and 6 runs with less than a 5% improvement. Only two systems obtained results more than 10% above the baseline. In this more difficult scenario, the deep learning approaches showed their superiority over traditional approaches.

The overall ranking on the five datasets is shown in Table 3 and has been calculated following Formula (1). Most participants obtained a weighted accuracy between 47% and 57%, with a median of 54.42%. This means that most of the participants obtained results close to the majority-class baseline (50%) and to the support vector machine baseline with a bag-of-words text representation (53.13%).

It is worth mentioning that none of the systems outperformed the LDR baseline (71.21%), which performed 6.65% better than the best system. This method [51] represents documents on the basis of the probability distribution of occurrence of their words in the different classes. LDR takes advantage of the whole vocabulary: it relies on weights that represent the probability of a term belonging to each of the categories (e.g., female vs. male). The distribution of weights for a given document should be closer to the weights of its corresponding category.
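The following is a simplified sketch of the LDR intuition (class-conditional term weights summarised per document); the exact weighting scheme and the final feature set of the original method [51] differ, and the toy data are ours.

```python
# Simplified sketch of the LDR intuition: weight each term by the probability
# of it belonging to each class, then describe a document by summary statistics
# of those weights. Not the original LDR feature set; toy data below.
from collections import Counter

def term_weights(docs, labels):
    """For each term, estimate P(class | term) from class-wise frequencies."""
    counts = {c: Counter() for c in set(labels)}
    for doc, label in zip(docs, labels):
        counts[label].update(doc.split())
    vocab = {t for c in counts for t in counts[c]}
    weights = {}
    for term in vocab:
        total = sum(counts[c][term] for c in counts)
        weights[term] = {c: counts[c][term] / total for c in counts}
    return weights

def represent(doc, weights, classes):
    """Average per-class weight of the document's known terms."""
    ws = [weights[t] for t in doc.split() if t in weights]
    return [sum(w[c] for w in ws) / len(ws) for c in classes]

docs = ["i love playing video games", "i enjoy reading novels at night"]
labels = ["male", "female"]          # toy labels, echoing the example in [4]
w = term_weights(docs, labels)
print(represent("video games and novels", w, ["male", "female"]))
```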

The second best results were obtained by the CIC team, with accuracies ranging from 58.62% to 64.56%, showing the robustness of their approach, which employed support vector machines with combinations of n-grams and linguistic rules.

A more detailed description of each of the submitted approaches can be found in the overview paper of the task [33].

Table 3 Overall ranking by averaging the accuracies on the different datasets, weighting by the size of the dataset

Personality Recognition in Source Code

The relationship between personality and programming style was first investigated in [12], where the authors explored the connection between cognitive style, personality, and computer programming style. Later, the authors in [27] also related personality to programming style and performance.

Task Description

Personality influences most, if not all, human activities, such as the way people write [15, 49], interact with others, and make decisions; for developers, for instance, it influences the criteria they consider when selecting a software project they want to participate in [42], or the way they write and structure their source code. Personality recognition may have several practical applications, for example setting up high-performance teams. In software development, not only technical skills are required but also soft skills such as communication or teamwork. The possibility of using a tool to predict personality from source code, in order to know whether a candidate may fit in a team, may be very valuable for the recruitment process. Also in education, knowing students' personalities from their source code may help to improve the learning process by customising the educational offer.

Personality is often defined along five traits, following the Big Five theory [16], which is the most widely accepted in psychology. The five OCEAN traits are openness to experience, conscientiousness, extroversion, agreeableness, and neuroticism/emotional stability. In this task, we addressed the problem of predicting an author's personality from her source code. Concretely, given a collection of source codes by a programmer, the aim is to identify her OCEAN personality traits. In the training phase, participants were provided with Java source codes of computer science students together with their personality traits. At test time, participants were given the source codes of a few programmers and had to predict their personality traits. The number of source codes per programmer was rather small, reflecting a real scenario such as that of a job interview: the interviewer could be interested in knowing the interviewee's degree of conscientiousness by evaluating just a couple of programming problems.

Dataset

The dataset is composed of Java programs written by computer science students of a data structures class at the Universidad Nacional de Colombia. Students were asked to upload source code responding to the functional requirements of different programming tasks to an automated assessment tool. For each task, students could upload more than one attempted solution; the number of attempts per problem was not limited or discouraged in any way. There are very similar submissions among different attempts, and some of them contain compile-time or runtime errors.

Furthermore, in most cases students uploaded the right Java source code file, but some of them erroneously uploaded the compiler output, debug information, or even source code in another programming language (e.g., Python). A priori this seems to be noise in the dataset, and a sensible alternative could have been to remove these entries. However, we decided to keep them for the following reasons: first, teams could remove them easily if they decided to do so; second, it is possible that this kind of mistake is related to some personality traits, so this information can be used as a feature as well. Finally, although we encouraged the students to write their own code, some of them could have reused pieces of code from other exercises or even looked for code excerpts in books or on the Internet.

In addition, each student completed a Big Five personality test that allowed us to calculate a numerical score for each of the OCEAN personality traits. Overall, the dataset consists of 2492 source code programs, with an average of 5594 lines of code, written by 70 students, along with the scores of the five personality traits for each student, which are provided as floating point numbers in the continuous range [20, 80]. The source codes of each student were organised in a single text file, one after another, with a line separator between them. The dataset was split into training and test subsets, the first containing the data of 49 students and the second the data of the remaining 21. Participants only had access to the personality trait scores of the 49 students in the training dataset.

Performance Measures

To evaluate participants’ approaches, we used two complementary measures: the root mean square error (RMSE) and the Pearson product–moment correlation (PC). The motivation for using both measures is to try to understand whether the committed errors are due to random chance.

We calculated the RMSE for each trait with the following equation:

$$\begin{aligned} {\text {RMSE}}_{t} = \sqrt{\frac{1}{n}\sum _{i=1}^{n}\left( y_{i} - \hat{y}_{i} \right) ^{2}}, \end{aligned}$$
(2)

where \({\text {RMSE}}_{t}\) is the root mean square error for trait t; \(y_{i}\) and \(\hat{y}_{i}\) are the ground truth and predicted value, respectively, for author i.

Also for each trait, PC is calculated by the following equation:

$$\begin{aligned} r = \frac{\sum _{i=1}^{n}\left( x_{i}-\bar{x} \right) \left( y_{i}-\bar{y} \right) }{\sqrt{\sum _{i=1}^{n}\left( x_{i}-\bar{x} \right) ^{2}}\,\sqrt{\sum _{i=1}^{n}\left( y_{i}-\bar{y} \right) ^{2}}}, \end{aligned}$$
(3)

where \(x_{i}\) and \(y_{i}\) are, respectively, the ground truth and the predicted value for author i, and \(\bar{x}\) and \(\bar{y}\) are the corresponding average values.
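For reference, a minimal sketch of both measures computed with numpy is shown below; the trait scores are made-up toy values.

```python
# Sketch of the two evaluation measures, RMSE (Eq. 2) and Pearson
# correlation (Eq. 3), for a single trait. The trait scores are made up.
import numpy as np

truth = np.array([45.0, 52.5, 60.0, 38.0])   # ground-truth trait scores
pred  = np.array([48.0, 50.0, 55.0, 42.0])   # predicted trait scores

rmse = np.sqrt(np.mean((truth - pred) ** 2))
pearson = np.corrcoef(truth, pred)[0, 1]

print(f"RMSE = {rmse:.2f}, Pearson r = {pearson:.2f}")
```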

Result Analysis

Eleven teams participated in PR-SOCO, submitting a total of 48 runs. Participants used different kinds of features: from standard ones such as word or character n-grams to specific ones obtained by parsing the code and analysing its structure, style, or comments, with the aim of investigating whether the way the code is commented, variables are named, or lines are indented may also provide valuable information.

Results are presented in Table 4. In general, the different approaches performed quite similarly in terms of Pearson correlation for all the OCEAN traits. However, there seem to be larger differences with respect to RMSE. Depending on the trait, standard features obtained results competitive with specific ones in terms of RMSE. The best results were achieved for openness (6.95), as previously reported by Mairesse et al. [37]; this was also one of the traits with the lowest RMSE at PAN 2015 [49] for most languages.

The best result for both RMSE and Pearson correlation was obtained by uaemex in their first run. This run was generated using symbolic regression with three types of features: indentation, identifiers, and comments. The authors optimised this run by eliminating the source codes of five developers according to the following criteria: the person with the highest values in all the personality traits, the person with the lowest values in all the personality traits, the person with average values in all the personality traits, the person with the most source codes, and the person with the fewest source codes. They also obtained good results with their third run, where they trained a backpropagation neural network on the whole set of training codes.
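As an illustration of this kind of source-code features (indentation, identifiers, comments), the following sketch extracts a few simple counts from a toy Java snippet; it is not the participants' actual feature extractor, and the concrete features are our own assumptions.

```python
# Illustrative extraction of simple style features from Java source code
# (indentation, identifiers, comments), in the spirit of the features above.
# Not the participants' actual feature extractor.
import re

def style_features(java_source):
    lines = java_source.splitlines()
    indents = [len(l) - len(l.lstrip(" ")) for l in lines if l.strip()]
    comment_lines = [l for l in lines if l.strip().startswith("//")]
    # Crude identifier proxy: all word-like tokens (includes keywords).
    identifiers = re.findall(r"\b[a-zA-Z_]\w*\b", java_source)
    return {
        "avg_indent": sum(indents) / len(indents) if indents else 0.0,
        "comment_line_ratio": len(comment_lines) / max(len(lines), 1),
        "avg_identifier_len": sum(map(len, identifiers)) / max(len(identifiers), 1),
    }

snippet = """// toy example
public class Hello {
    public static void main(String[] args) {
        int counter = 0;
        System.out.println(counter);
    }
}"""
print(style_features(snippet))
```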

A more detailed description of each of the submitted approaches can be found in the overview paper of the task [50].

Table 4 Participants’ results in RMSE and Pearson product moment correlation

Conclusions and Future Work

The organisation of shared tasks allows the creation of a common evaluation framework to measure the state of the art and to foster research on new challenging problems. To this end, during the last 10 years of the FIRE initiative, we have organised more than ten tracks on different aspects of author profiling, in several genres and languages.

With respect to machine learning algorithms, according to the results, and partly due to the size of the datasets, traditional approaches obtained better performance than those using deep learning techniques, although approaches using long short-term memory (LSTM) networks or bidirectional encoder representations from transformers (BERT) obtained very promising results in the most recent track we organised [53].

The future of author profiling lies in addressing problems such as the profiling of bots and fake news spreaders, not only in English [48] but also in Indian languages.