Gender Classification Models and Feature Impact for Social Media Author Profiling

  • Conference paper
Evaluation of Novel Approaches to Software Engineering (ENASE 2021)

Abstract

Automatic profiling models infer demographic characteristics of social network users from the content they generate or from their interactions. Because of its business applications (targeted advertising, market studies, and so on), automatic user profiling from social networks has become a popular task. Users’ demographic data is also crucial for more socially concerning tasks, such as the automatic early detection of mental disorders. For this type of user analysis, it has been demonstrated that the way users employ language is an essential indicator that contributes to the effectiveness of the models. For this reason, we believe that considering language use in terms of both psycho-linguistic and semantic characteristics is useful for detecting variables such as gender, age, and user’s origin. A proper selection of features is critical for the performance of retrieval, classification, and decision-making software systems. In this work, we discuss gender classification as part of the automated profiling task. We present an experimental analysis of the performance of existing gender classification models for automated profiling, based on external corpora and baselines. We also investigate the role of linguistic characteristics in the models’ classification accuracy and their impact on each gender. Following that analysis, we have developed a feature set for gender classification models in social networks that outperforms existing benchmarks in terms of accuracy.

Notes

  1. https://pan.webis.de/
  2. https://pan.webis.de/shared-tasks.html
  3. https://pan.webis.de/data.html
  4. https://pan.webis.de/data.html
  5. https://pan.webis.de/clef20/pan20-web/celebrity-profiling.html
  6. https://scikit-learn.org/stable/
  7. https://www.nltk.org/
  8. https://spacy.io/
  9. https://pyphen.org/
  10. https://github.com/carpedm20/emoji/
  11. https://pan.webis.de/data.html

Acknowledgements

This work was supported by projects RTI2018-093336-B-C21 and RTI2018-093336-B-C22 (Ministerio de Ciencia e Innovación & ERDF), and by the financial support supplied by the Consellería de Educación, Universidade e Formación Profesional (accreditation 2019-2022 ED431G/01, ED431B 2019/03) and the European Regional Development Fund, which acknowledges the CITIC Research Center in ICT of the University of A Coruña as a Research Center of the Galician University System.

Author information

Corresponding author

Correspondence to Paloma Piot-Perez-Abadin.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Piot-Perez-Abadin, P., Martin-Rodilla, P., Parapar, J. (2022). Gender Classification Models and Feature Impact for Social Media Author Profiling. In: Ali, R., Kaindl, H., Maciaszek, L.A. (eds) Evaluation of Novel Approaches to Software Engineering. ENASE 2021. Communications in Computer and Information Science, vol 1556. Springer, Cham. https://doi.org/10.1007/978-3-030-96648-5_12

  • DOI: https://doi.org/10.1007/978-3-030-96648-5_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-96647-8

  • Online ISBN: 978-3-030-96648-5

  • eBook Packages: Computer Science (R0)