Introduction

Author profiling helps in identifying demographic aspects of an author, such as gender, age, or native language, as well as psychographic ones, such as personality type. It is of growing importance for applications in forensics, security, and marketing. For instance, from forensic and security perspectives, it is important to infer the profile of the author of a harassing text message or a threat. Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products.

Demographic Profiling

Pioneering investigations in author profiling from computational linguistics [5] and social psychology [43] focused on formal and well-written texts in English [25]. With the rise of social media, researchers such as [28, 57] moved their interest to blogs and fora. Since 2013, within PAN,Footnote 1 we have been organising several author profiling tasks at CLEF,Footnote 2 as well as at FIRE, where we have addressed different problems (age, gender, language variety identification, personality recognition), in several languages (Arabic, Dutch, English, Italian, Portuguese, Russian) and genres (blogs, reviews, social media, Twitter, Facebook, source code in Python and Java). These tasks allowed us to create a common evaluation framework in which other researchers can investigate further.

Regarding age and gender identification, the best performing team in the first three editions of the author profiling shared task at PAN@CLEF used a second-order representation that relates documents to author profiles and subprofiles (e.g., males talking about video games) [4]. The authors of [61] used the text to be identified as a query for a search engine, showing the competitiveness of information retrieval-based features for identifying age and gender. In [35], the authors used MapReduce to approach the task with 3 million n-gram-based features, improving accuracy and reducing processing time. The graph-based EmoGraph approach [46] captures how users convey verbal emotions in the morphosyntactic structure of the discourse. The authors modelled the sequence of grammatical categories as a graph and enriched it with topics, the semantics of verbs, polarity, and emotions. They proved the competitiveness of the approach, as well as its robustness across genres and languages [45].
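As an illustration of this idea, the following is a minimal sketch (not the authors' implementation) that builds a graph from a sequence of grammatical categories using networkx; the POS tags and the toy sequence are our own assumptions, and the original approach further enriches the graph with topics, verb semantics, polarity, and emotions.

```python
# Illustrative sketch: model a sequence of grammatical categories (POS tags)
# as a directed graph, in the spirit of graph-based approaches such as EmoGraph.
# The tags and the toy sequence are hypothetical.
import networkx as nx

def pos_sequence_to_graph(pos_tags):
    """Nodes are POS tags; edge weights count how often one tag follows another."""
    g = nx.DiGraph()
    for prev_tag, next_tag in zip(pos_tags, pos_tags[1:]):
        if g.has_edge(prev_tag, next_tag):
            g[prev_tag][next_tag]["weight"] += 1
        else:
            g.add_edge(prev_tag, next_tag, weight=1)
    return g

# Toy example: POS tags of one (hypothetical) sentence.
graph = pos_sequence_to_graph(["PRON", "VERB", "DET", "NOUN", "ADJ", "NOUN"])
print(graph.edges(data=True))
```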

Although it may be considered a more basic problem, identifying the language variety of an author is an important aspect to take into account when, for instance, the author of a harassing text message or a threat needs to be profiled. Discriminating among similar languages (e.g., Malaysian vs. Indonesian) or varieties of the same language (e.g., UK vs. US English, Peruvian vs. Colombian Spanish) implies dealing with texts that are very similar not only at the lexical, syntactic, and semantic levels, but also at the pragmatic level, due to the cultural idiosyncrasies of the authors. In recent years, several researchers have addressed this task for different languages such as English [34], Chinese [26], Spanish [21, 36, 51], or Portuguese [66], among others. In this regard, the authors in [66] created a corpus for Portuguese by collecting 1000 articles from the Folha de S. PauloFootnote 3 and Diário de NotíciasFootnote 4 newspapers, for the Brazilian and Portugal varieties, respectively. They reported accuracies of 99.6%, 91.2%, and 99.8% with word unigrams, word bigrams, and character 4-grams, respectively. Also for Portuguese, the authors in [13] combined character 6-grams with word unigrams and bigrams to obtain an accuracy of 92.71% on Twitter texts. In the case of Spanish, the authors in [36] combined language models with n-grams and reported accuracies of 60–70% when discriminating among Argentinian, Chilean, Colombian, Mexican, and Spanish tweets. Similarly, the authors in [51] created the HispaBlogsFootnote 5 corpus, which covers Spanish varieties from Argentina, Chile, Mexico, Peru, and Spain. They proposed a low-dimensionality representation for the texts and reported an accuracy of 71.1%. In another investigation with HispaBlogs, the authors in [21] compared the previous representation with Skip-grams and Sentence Vectors, obtaining 72.2% and 70.8% accuracy, respectively. In the case of Chinese, the authors in [62] combined general features such as character and word n-grams with PMI-based and word alignment-based features to approach the task of discriminating among varieties of Mandarin Chinese in the Greater China Region: Mainland China, Hong Kong, Taiwan, Macao, Malaysia, and Singapore. They reported accuracies of up to 90.91%.
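To give a flavour of the n-gram-based approaches mentioned above, the following is a minimal sketch, not a reproduction of any of the cited systems, of a character 4-gram classifier for language variety using scikit-learn; the toy texts, labels, and pipeline settings are illustrative assumptions.

```python
# Minimal sketch of a character n-gram approach to language variety
# identification, in the spirit of the works cited above.
# The toy texts and labels are made up.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["ônibus lotado hoje de manhã",      # toy Brazilian Portuguese
         "apanhei o autocarro esta manhã"]   # toy European Portuguese
labels = ["pt-BR", "pt-PT"]

# Character 4-grams, reported to work well for Portuguese varieties.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4)),
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["perdi o autocarro outra vez"]))
```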

Psychographic Profiling

Psychographics is the study of personality, values, attitudes and lifestyles. For instance, psychographic segmentation involves dividing a market into segments based upon different personality traits, values, attitudes, interests, and lifestyles of consumers.

Regarding the pioneers of personality recognition, early investigations were carried out by Argamon et al. [6], who focused on the identification of extroversion and emotional stability. They used support vector machines with a combination of word categories and the relative frequency of function words to recognise these traits from self-reports. Similarly, Oberlander and Nowson [41] focused on the personality identification of bloggers. Mairesse et al. [37] analysed the impact of different sets of psycholinguistic features obtained with LIWCFootnote 6 and MRC,Footnote 7 showing the highest performance for the openness to experience trait.

More recently, researchers have focused on personality recognition in social media. In [14, 22, 44], the authors analysed different sets of linguistic features, as well as friend counts and daily activity. In [29], the authors reported a comprehensive analysis of features such as the size of the friendship network, the number of uploaded photos, or the events attended by the user. They analysed more than 180,000 Facebook users and found correlations between these features and the different traits, especially in the case of extroversion. With the same Facebook dataset and a similar set of features, Bachrach et al. [8] reported strong results for automatically predicting extroversion.

In [58], the authors analysed 75,000 Facebook messages of volunteers who completed a personality test and found interesting correlations between word usage and personality traits. According to them, extroverts use more social words and introverts more words related to solitary activities. Emotionally stable people use words related to sports, vacations, the beach, church, or teams, whereas neurotics use more words and sentences referring to depression. In [40], the author introduced a new vectorial semantics approach to personality assessment, which involves the construction of vectors representing personality dimensions and disorders, and the automatic measurement of the similarity between these vectors and texts written by human subjects.

Recently, at GermEval 2020, a task was organised on the Prediction of Intellectual Ability and Personality Traits from Text.Footnote 8

PAN Lab Tracks at FIRE

In the name of cross-fertilization across evaluation forums,Footnote 9 in 2011 we became involved in the organization of tracks at FIRE, most of them as PAN tracks at FIRE. Initially, we addressed the problems of text reuse (2011) and similarity search (2012, 2013), both from a cross-language perspective [10, 23, 24]. In the former two tracks, datasets with texts in English, Gujarati, and Hindi were provided. The problem of text reuse was also addressed on source code, both from mono- and cross-(programming)-language perspectives [19, 20]. The problem of plagiarism detection was addressed in 2015 and 2016 in Arabic and Persian, with the aim of also attracting to FIRE the research communities working with these languages [7, 11].

In 2015 and 2016, we were also partially involved in the organization of the track on Mixed Script Information Retrieval (MSIR) [9, 59] that will be described in another chapter of this special issue on FIRE 10 years’ anniversary.

More recently, we have been involved in the organization of several author profiling tracks, addressing problems such as personality recognition from source code (2016) [50], gender identification from a cross-genre perspective in Russian (2017) [33], and native language identification (Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu) from texts in English (2017) [33]. In 2019, in the framework of a track on author profiling and deception detection in Arabic, we organized a task on the identification of age, gender, and language variety from tweets.Footnote 10

In this chapter, we will present three author profiling shared tasks we have organized at FIRE, describing the resources that we created and made available to the research community, illustrating the obtained results and highlighting the main achievements.

The rest of the paper is structured as follows. In the next section, we introduce the track on author profiling in Arabic tweets that we organised in 2019. Then, the RusProfiling track on cross-genre gender identification in Russian, organised in 2017, is presented. The subsequent section is dedicated to the PR-SOCO track on personality recognition in source code, which was organised in 2016. In the final section, we draw some conclusions and discuss future directions for author profiling, also in Indian languages.

Age, Gender, and Language Variety Identification in Arabic

With respect to Arabic, research on age and gender identification is scarcer. The authors in [18] collected 8028 emails from 1030 native speakers of Egyptian Arabic. They proposed 518 features, tested several machine learning algorithms, and reported accuracies between 72.10% and 81.15% for gender and age identification, respectively. The authors in [3] approached gender identification in articles written in Modern Standard Arabic from well-known Arabic newspapers. With a combination of bag-of-words, sentiment, and emotion features, they reported an accuracy of 86.4%. Subsequently, the authors in [2] extended their work by experimenting with different machine learning algorithms, data subsets, and feature selection methods, reporting accuracies of up to 94%. The authors in [1] manually annotated tweets in Jordanian dialects with gender information. They showed how the name of the author of a tweet can significantly improve the performance. They also experimented with other stylistic features, such as the number of words per tweet or the average word length, achieving a best result of 99.50%.

The increasing interest in Arabic variety identification is reflected in the eighteen and six teams that participated, respectively, in the Arabic subtask of the third DSL track [38] and in the Arabic Dialect Identification (ADI) shared task [67], as well as in the twenty teams that participated in the Arabic subtask of the Author Profiling shared task [52] at PAN 2017. However, as the authors in [55] highlighted, there is still a lack of resources and investigations for that language. Among the few existing works are the following. The authors in [65] used a smoothed word unigram model and reported accuracies of 87.2%, 83.3%, and 87.9%, respectively, for the Levantine, Gulf, and Egyptian varieties. The authors in [56] achieved 98% accuracy discriminating among Egyptian, Iraqi, Gulf, Maghrebi, Levantine, and Sudanese varieties with n-grams. The authors in [17] combined content- and style-based features to obtain 85.5% accuracy discriminating between Egyptian and Modern Standard Arabic. Recently, the identification of language varieties and demographics in Arabic has been addressed in [54].

Task Description

Author profiling distinguishes between classes of authors by studying how language is shared among groups of people. This helps in identifying profiling aspects such as age, gender, and language variety, among others. The focus of this task is to identify the age, gender, and language variety of Arabic Twitter users.

Dataset

This corpus was developed at Carnegie Mellon University in Qatar (CMUQ) [63] with the aim of providing a fine-grained annotated corpus for Arabic. It contains 15 dialectal varieties corresponding to the 22 countries of the Arab League. For each variety, a total of 102 authors (78 for training, 24 for test) were annotated with age and gender,Footnote 11 keeping both variables balanced. The following groups were considered for the age annotation: under 25, between 25 and 34, and 35 and above. For each author, more than 2000 tweets were retrieved from her/his timeline. The included varieties are Algeria, Egypt, Iraq, Kuwait, Lebanon-Syria, Libya, Morocco, Oman, Palestine-Jordan, Qatar, Saudi Arabia, Sudan, Tunisia, United Arab Emirates, and Yemen. More information about this corpus is available in [64].

Evaluation Measures

Since the data are completely balanced, performance is evaluated in terms of accuracy, following what has been done in the author profiling tasks at PAN@CLEF. For each subtask (age, gender, language variety), we calculate individual accuracies. Systems are ranked by joint accuracy, i.e., the proportion of authors for whom age, gender, and language variety are all correctly identified together.
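For illustration, the following sketch shows how individual and joint accuracies can be computed; the variable names and toy data are ours, and this is not the official evaluation script.

```python
# Illustrative computation of per-subtask and joint accuracy.
# Each truth/prediction is an (age, gender, variety) triple; data are made up.
truths = [("25-34", "male", "Egypt"), ("under 25", "female", "Iraq")]
preds  = [("25-34", "male", "Egypt"), ("under 25", "male",   "Iraq")]

def accuracy(pairs):
    return sum(1 for t, p in pairs if t == p) / len(pairs)

for i, name in enumerate(["age", "gender", "variety"]):
    print(name, accuracy([(t[i], p[i]) for t, p in zip(truths, preds)]))

# Joint accuracy: all three aspects must be correct for the same author.
print("joint", accuracy(list(zip(truths, preds))))
```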

Result Analysis

Thirteen teams participated in the APDA shared task, submitting a total of 28 runs. Participants used different kinds of features, from classical approaches based on n-grams and support vector machines to novel representations such as BERT. The best overall result (45.56% joint accuracy) was achieved by DBMS-KU [39], who combined word n-grams, character n-grams, and function words to train support vector machines. The best result for gender identification (81.94%) was obtained by MagdalenaYVino, who did not send information about their system. In the case of age identification, the best result was achieved by Yutong [60] (62.50%), with a logistic regression classifier trained on a combination of word unigrams and character 2- to 5-grams. Finally, with regard to language variety identification, the best result (97.78%) was also achieved by DBMS-KU. More details can be found in [53] (Table 1).
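A minimal sketch in the spirit of the winning approach (word and character n-grams fed to a linear SVM) is shown below; the exact n-gram orders, the handling of function words, and the toy data are assumptions on our part, not DBMS-KU's actual configuration.

```python
# Sketch in the spirit of the best APDA run: combine word and character
# n-gram features and train a linear SVM. The concrete configuration and
# the toy data are assumptions, not the participants' actual system.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

features = FeatureUnion([
    ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
])

gender_clf = Pipeline([("features", features), ("svm", LinearSVC())])

# Toy data: one merged tweet collection per author, with a gender label.
authors = ["... tweets of author 1 ...", "... tweets of author 2 ..."]
genders = ["female", "male"]
gender_clf.fit(authors, genders)
print(gender_clf.predict(["... tweets of a new author ..."]))
```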

Table 1 Overall ranking in terms of accuracy

Cross-Genre Gender Identification in Russian

Few research works have addressed author profiling in Russian. A corpus for author profiling in Russian was created by [30]. Gender identification in Russian texts has been addressed in [31], and in social media in [32].

Task Description

In the RusProfiling track, we addressed the problem of predicting an author's gender in Russian from a cross-genre perspective: given a training set of Twitter data, the systems were evaluated on five different genres: essays, Facebook, Twitter, reviews, and gender-imitated texts, in which the authors imitate the opposite gender and thus change their idiostyle.

In the following, we describe the datasets that were made available to the participants and the research community. They were created using both manual and automated techniques.Footnote 12

Datasets

Twitter Dataset

This dataset was composed of 500 users per gender. It was split into a training set (300 users per gender) and a test set (200 users per gender). The number of tweets per user varied from 1 to 200, depending on how active the users were at the time the data were collected (September 2016). All tweets from one user were merged together and considered as one text. As the analysis suggests, the tweets contain a lot of non-original information (hashtags, hidden citations such as copied newsfeeds, hyperlinks, etc.), which makes them extremely challenging to analyze.

Facebook Dataset

This dataset was composed of 228 users (114 authors per gender) from different age groups (20+, 30+, 40+) and different Russian cities, chosen randomly (to minimise mutual friendships). We used the same principles for gender labeling as for Twitter. All posts from one user were merged into one text, with an average length of 1000 words. As with the Twitter data, Facebook pages of famous people involved in administration or government, as well as accounts of heads of major companies, were not used in the study.

Essay Dataset

This dataset is composed of 185 authors per gender, with one or two texts per author (in the case of two texts, they were merged together and considered as one text). The texts were taken randomly from the manually collected RusPersonality corpus [30]. RusPersonality is the first Russian-language corpus of written texts labeled with data on their authors. A unique aspect of the corpus is the breadth of the metadata (gender, age, personality, neuropsychological testing data, education level, etc.). The topics of the texts were: a letter to a friend, a picture description, and a letter intended to convince an employer to hire the respondent. The average text length in this dataset was 150 words.

Review Dataset

This dataset was composed of 388 authors per gender, with one text per author. The texts were collected from Trustpilot,Footnote 13 and the author's gender was identified on the basis of the profile information. The average text length was 80 words.

Gender-Imitated Dataset

In this dataset, 47 authors per gender were considered, with three texts per author that were merged together and considered as one text. The texts were randomly selected from the Gender Imitation corpus, which is the first Russian corpus for studies of stylistic deception. Each respondent (n = 142) was instructed to write three texts on the same topic (from a list). The first text was supposed to be written in the way usual for the writer (without any deception); the second one as if written by someone of the opposite gender ("imitation"); and the third one as if written by another individual of the same gender, so that the author's personal writing style would not be recognized ("obfuscation"). Most of the texts are 80–150 words long. All of the respondents are students of Russian universities. Besides the texts, the corpus includes metadata with the authors' characteristics: gender, age, native language, handedness, and psychological gender (femininity/masculinity). Therefore, the corpus provides the opportunity to investigate problems that arise when imitating properties of written speech in different respects, as well as gender (biological and psychological) imitation in texts.

In Table 2, a summary of the number of authors per dataset is shown.

Table 2 Distribution of authors per dataset (half per gender)

Performance Measures

Accuracy was used, as in the PAN author profiling tasks at CLEF. We calculated the accuracy per dataset as the number of authors correctly identified divided by the total number of authors in that dataset. The global ranking was obtained by calculating the average accuracy over all datasets, weighted by the number of documents in each dataset:

$$\begin{aligned} {\text {global}}\_{\text {acc}}=\frac{\sum _{{\text {ds}}}{\text {accuracy}}({\text {ds}})\times {\text {size}}({\text {ds}})}{\sum _{{\text {ds}}} {\text {size}}({\text {ds}})}. \end{aligned}$$
(1)
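In code, the weighted average of Formula (1) can be computed as in the following sketch; the per-dataset accuracies are illustrative, while the sizes correspond to the dataset descriptions above.

```python
# Sketch of the weighted global accuracy of Formula (1); the per-dataset
# accuracies below are illustrative, not real results.
def global_accuracy(results):
    """results: dict mapping dataset name -> (accuracy, size)."""
    total = sum(size for _, size in results.values())
    return sum(acc * size for acc, size in results.values()) / total

results = {"twitter": (0.55, 400), "facebook": (0.60, 228),
           "essays": (0.52, 370), "reviews": (0.50, 776),
           "gender_imitation": (0.48, 94)}
print(round(global_accuracy(results), 4))
```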

Result Analysis

Five teams submitted 22 runs (a total of 93 runs over the five different datasets). Participants used different kinds of approaches, from traditional ones based on hand-crafted features and machine learning techniques such as support vector machines, to the currently fashionable deep learning techniques. The best results were achieved not on Twitter but on Facebook. The reason may be that, although Facebook maintains the spontaneity of Twitter, its posts are usually longer and grammatically richer, with fewer syntactic errors and misspellings. On the other hand, the results obtained on reviews were among the worst.

In the case of the gender-imitated texts, most systems failed, with 11 runs at or below the majority baseline and 6 runs with less than a 5% improvement. Only two systems obtained results more than 10% above the baseline. In this more difficult scenario, the deep learning approaches showed their superiority over traditional approaches.

The overall ranking on the five datasets is shown in Table 3 and has been calculated following Formula (1). Most participants obtained a weighted accuracy between 47% and 57%, with a median of 54.42%. This means that most of the participants obtained results close to the majority-class baseline (50%) and to the support vector machine baseline with a bag-of-words text representation (53.13%).

It is worth mentioning that none of the systems outperformed the LDR baseline (71.21%), which performed 6.65% better than the best system. This method [51] represents documents on the basis of the probability distribution of occurrence of their words in the different classes. LDR takes advantage of the whole vocabulary: it relies on weights that represent the probability of a term belonging to each of the categories (e.g., female vs. male). The distribution of weights for a given document should be closer to the weights of its corresponding category.
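The following is a simplified sketch of the LDR intuition (class-conditional term weights summarised per document); the exact weighting scheme and the final feature set of the original method [51] differ, and the toy data are ours.

```python
# Simplified sketch of the LDR intuition: weight each term by the probability
# of it belonging to each class, then describe a document by summary statistics
# of those weights. Not the original LDR feature set; toy data below.
from collections import Counter

def term_weights(docs, labels):
    """For each term, estimate P(class | term) from class-wise frequencies."""
    counts = {c: Counter() for c in set(labels)}
    for doc, label in zip(docs, labels):
        counts[label].update(doc.split())
    vocab = {t for c in counts for t in counts[c]}
    weights = {}
    for term in vocab:
        total = sum(counts[c][term] for c in counts)
        weights[term] = {c: counts[c][term] / total for c in counts}
    return weights

def represent(doc, weights, classes):
    """Average per-class weight of the document's known terms."""
    ws = [weights[t] for t in doc.split() if t in weights]
    return [sum(w[c] for w in ws) / len(ws) for c in classes]

docs = ["i love playing video games", "i enjoy reading novels at night"]
labels = ["male", "female"]          # toy labels, echoing the example in [4]
w = term_weights(docs, labels)
print(represent("video games and novels", w, ["male", "female"]))
```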

The second best results were obtained by the CIC team, with accuracies ranging from 58.62% to 64.56%, showing the robustness of their approach, which employed support vector machines with combinations of n-grams and linguistic rules.

A more detailed description of each of the submitted approaches can be found in the overview paper of the task [33].

Table 3 Overall ranking by averaging the accuracies on the different datasets, weighting by the size of the dataset

Personality Recognition in Source Code

The relationship between personality and programming style was first investigated in [12], where the authors explored the connection between cognitive style, personality, and computer programming style. Later, the authors in [27] also related personality to programming style and performance.

Task Description

Personality influences most, if not all, human activities, such as the way people write [15, 49], interact with others, and make decisions; for developers, for instance, it influences the criteria they consider when selecting a software project they want to participate in [42], or the way they write and structure their source code. Personality recognition may have several practical applications, for example setting up high-performance teams. In software development, not only technical skills are required but also soft skills such as communication or teamwork. The possibility of using a tool to predict personality from source code, in order to know whether a candidate may fit in a team, may be very valuable for the recruitment process. Also in education, knowing students' personalities from their source code may help to improve the learning process by customising the educational offer.

Personality is often defined along five traits, following the Big Five theory [16], which is the most widely accepted in psychology. The five OCEAN traits are openness to experience, conscientiousness, extroversion, agreeableness, and neuroticism/emotional stability. In this task, we addressed the problem of predicting an author's personality from her source code. Concretely, given a collection of source codes by a programmer, the aim is to identify her OCEAN personality traits. In the training phase, participants were provided with Java source codes of computer science students together with their personality traits. At test time, participants were given the source codes of a few programmers and had to predict their personality traits. The number of source codes per programmer was rather small, reflecting a real scenario such as that of a job interview: the interviewer could be interested in knowing the interviewee's degree of conscientiousness by evaluating just a couple of programming problems.

Dataset

The dataset is composed of Java programs written by computer science students of a data structures class at the Universidad Nacional de Colombia. Students were asked to upload source code responding to the functional requirements of different programming tasks to an automated assessment tool. For each task, students could upload more than one attempted solution; the number of attempts per problem was not limited or discouraged in any way. There are very similar submissions among different attempts, and some of them contain compile-time or runtime errors.

Furthermore, in most cases students uploaded the right Java source code file, but some of them erroneously uploaded the compiler output, debug information, or even source code in another programming language (e.g., Python). A priori this seems to be noise in the dataset, and a sensible alternative could have been to remove these entries. However, we decided to keep them for the following reasons: first, teams could remove them easily if they decided to do so; second, it is possible that this kind of mistake is related to some personality traits, so this information can be used as a feature as well. Finally, although we encouraged the students to write their own code, some of them could have reused pieces of code from other exercises or even looked for code excerpts in books or on the Internet.

In addition, each student completed a Big Five personality test that allowed us to calculate a numerical score for each of the OCEAN personality traits. Overall, the dataset consists of 2492 source code programs, with an average of 5594 lines of code, written by 70 students, along with the scores of the five personality traits for each student, which are provided as floating point numbers in the continuous range [20, 80]. The source codes of each student were organised in a single text file, one after another, with a line separator between them. The dataset was split into training and test subsets, the first containing the data of 49 students and the second the data of the remaining 21. Participants only had access to the personality trait scores of the 49 students in the training dataset.

Performance Measures

To evaluate participants’ approaches, we used two complementary measures: the root mean square error (RMSE) and the Pearson product–moment correlation (PC). The motivation for using both measures is to try to understand whether the committed errors are due to random chance.

We calculated the RMSE for each trait with the following equation:

$$\begin{aligned} {\text {RMSE}}_{t} = \sqrt{\frac{1}{n}\sum _{i=1}^{n}\left( y_{i} - \hat{y}_{i} \right) ^{2}}, \end{aligned}$$
(2)

where \({\text {RMSE}}_{t}\) is the root mean square error for trait t; \(y_{i}\) and \(\hat{y}_{i}\) are the ground truth and predicted value, respectively, for author i.

Also for each trait, PC is calculated by the following equation:

$$\begin{aligned} r = \frac{\sum _{i=1}^{n}\left( x_{i}-\bar{x} \right) \left( y_{i}-\bar{y} \right) }{\sqrt{\sum _{i=1}^{n}\left( x_{i}-\bar{x} \right) ^{2}}\,\sqrt{\sum _{i=1}^{n}\left( y_{i}-\bar{y} \right) ^{2}}}, \end{aligned}$$
(3)

where \(x_{i}\) and \(y_{i}\) are, respectively, the ground truth and the predicted value for author i, and \(\bar{x}\) and \(\bar{y}\) are the corresponding average values.
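For reference, a minimal sketch of both measures computed with numpy is shown below; the trait scores are made-up toy values.

```python
# Sketch of the two evaluation measures, RMSE (Eq. 2) and Pearson
# correlation (Eq. 3), for a single trait. The trait scores are made up.
import numpy as np

truth = np.array([45.0, 52.5, 60.0, 38.0])   # ground-truth trait scores
pred  = np.array([48.0, 50.0, 55.0, 42.0])   # predicted trait scores

rmse = np.sqrt(np.mean((truth - pred) ** 2))
pearson = np.corrcoef(truth, pred)[0, 1]

print(f"RMSE = {rmse:.2f}, Pearson r = {pearson:.2f}")
```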

Result Analysis

Eleven teams participated in PR-SOCO, submitting a total of 48 runs. Participants used different kinds of features: from standard ones such as word or character n-grams to specific ones obtained by parsing the code and analysing its structure, style, or comments, with the aim of investigating whether the way the code is commented, variables are named, or lines are indented may also provide valuable information.

Results are presented in Table 4. In general, the different approaches performed quite similarly in terms of Pearson correlation for all the OCEAN traits. However, there seem to be larger differences with respect to RMSE. Depending on the trait, standard features obtained results competitive with specific ones in terms of RMSE. The best results were achieved for openness (6.95), as previously reported by Mairesse et al. [37]; this was also one of the traits with the lowest RMSE at PAN 2015 [49] for most languages.

The best result for both RMSE and Pearson correlation was obtained by uaemex in their first run. This run was generated using symbolic regression with three types of features: indentation, identifiers, and comments. The authors optimised this run by eliminating the source codes of five developers according to the following criteria: the person with the highest values in all the personality traits, the person with the lowest values in all the personality traits, the person with average values in all the personality traits, the person with the most source codes, and the person with the fewest source codes. They also obtained good results with their third run, where they trained a backpropagation neural network on the whole set of training codes.
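As an illustration of this kind of source-code features (indentation, identifiers, comments), the following sketch extracts a few simple counts from a toy Java snippet; it is not the participants' actual feature extractor, and the concrete features are our own assumptions.

```python
# Illustrative extraction of simple style features from Java source code
# (indentation, identifiers, comments), in the spirit of the features above.
# Not the participants' actual feature extractor.
import re

def style_features(java_source):
    lines = java_source.splitlines()
    indents = [len(l) - len(l.lstrip(" ")) for l in lines if l.strip()]
    comment_lines = [l for l in lines if l.strip().startswith("//")]
    # Crude identifier proxy: all word-like tokens (includes keywords).
    identifiers = re.findall(r"\b[a-zA-Z_]\w*\b", java_source)
    return {
        "avg_indent": sum(indents) / len(indents) if indents else 0.0,
        "comment_line_ratio": len(comment_lines) / max(len(lines), 1),
        "avg_identifier_len": sum(map(len, identifiers)) / max(len(identifiers), 1),
    }

snippet = """// toy example
public class Hello {
    public static void main(String[] args) {
        int counter = 0;
        System.out.println(counter);
    }
}"""
print(style_features(snippet))
```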

A more detailed description of each of the submitted approaches can be found in the overview paper of the task [50].

Table 4 Participants’ results in RMSE and Pearson product moment correlation

Conclusions and Future Work

The organisation of shared tasks allows the creation of a common evaluation framework to measure the state of the art and to foster research on new challenging problems. To this end, during the last 10 years of the FIRE initiative, we have organised more than ten tracks on different aspects of author profiling, in several genres and languages.

With respect to machine learning algorithms, according to the results, and partly due to the size of the datasets, traditional approaches obtained better performance than those using deep learning techniques, although approaches using long short-term memory (LSTM) networks or bidirectional encoder representations from transformers (BERT) obtained very promising results in the most recent track we organised [53].

The future of author profiling lies in addressing problems such as the profiling of bots and fake news spreaders, not only in English [48] but also in Indian languages.