Novel semantic and statistic features-based author profiling approach

Ouni, Sarra; Fkih, Fethi; Omri, Mohamed Nazih

doi:10.1007/s12652-022-04198-w

Original Research
Published: 02 July 2022

Volume 14, pages 12807–12823, (2023)

Journal of Ambient Intelligence and Humanized Computing

Abstract

The Author Profiling (AP) task aims to predict certain demographic attributes (e.g., age, gender) of authors from their documents. AP on social media networks has gained increasing research attention over the past decade, and it is of growing importance in applications related to security, marketing, psychology, and other fields. This article describes our solution to the author profiling problem posed at PAN 2019, part of an annual series of digital text forensics events. The goal of the AP task at PAN 2019 is to distinguish between bots and humans on Twitter and then to identify the gender of the human users. To achieve these goals, we propose two new models: (i) a first model, applied only to an English dataset, that uses semantic and stylistic features. This model is topic-based for semantic feature extraction from tweets, and the extracted stylistic and semantic features are fed into a convolutional neural network (CNN); and (ii) a second classification model, applied to a Spanish corpus, that uses various statistical features to feed a classifier based on random forests. The experimental study, which we conducted on various standard databases, shows the effectiveness of our proposed models in terms of accuracy, precision, recall, F1-score and G-mean. In addition, a comparative study between our models and other existing models shows the limits of the latter and confirms the performance of the solutions we have proposed.
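As an illustration of the evaluation measures named in the abstract, the following minimal sketch computes accuracy, precision, recall, F1-score and G-mean for a binary bot-versus-human task. It is not the authors' evaluation code: the function name `binary_metrics`, the label strings, and the toy predictions are hypothetical, and G-mean is assumed here to follow the common definition as the geometric mean of recall (sensitivity) and specificity, which the abstract does not spell out.

```python
import math


def binary_metrics(y_true, y_pred, positive="bot"):
    """Compute common binary classification metrics.

    Assumption: G-mean = sqrt(recall * specificity); the abstract names the
    metric but does not give its formula.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)        # sensitivity (true positive rate)
    specificity = safe_div(tn, tn + fp)   # true negative rate
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),
    }


if __name__ == "__main__":
    # Hypothetical toy labels, not data from PAN 2019.
    truth = ["bot", "bot", "bot", "bot", "human", "human", "human", "human"]
    preds = ["bot", "bot", "bot", "human", "human", "human", "human", "bot"]
    for name, value in binary_metrics(truth, preds).items():
        print(f"{name}: {value:.3f}")
```

G-mean is often reported alongside accuracy because it drops toward zero when either class is poorly recognised, which accuracy alone can hide on imbalanced data.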

Availability of data and materials

The data that support the findings of this study are available from PAN (email: pan@webis.de) but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of PAN (email: prosso@dsic.upv.es).

Author information

Authors and Affiliations

MARS Research Lab LR 17ES05, University of Sousse, Tunis, Tunisia
Sarra Ouni, Fethi Fkih & Mohamed Nazih Omri
Department of Computer Science, College of Computer, Qassim University, Buraydah, Saudi Arabia
Fethi Fkih

Corresponding author

Correspondence to Sarra Ouni.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ouni, S., Fkih, F. & Omri, M.N. Novel semantic and statistic features-based author profiling approach. J Ambient Intell Human Comput 14, 12807–12823 (2023). https://doi.org/10.1007/s12652-022-04198-w

Received: 14 February 2021
Accepted: 15 June 2022
Published: 02 July 2022
Issue Date: September 2023
DOI: https://doi.org/10.1007/s12652-022-04198-w