Gender Classification Models and Feature Impact for Social Media Author Profiling

Piot-Perez-Abadin, Paloma; Martin-Rodilla, Patricia; Parapar, Javier

doi:10.1007/978-3-030-96648-5_12

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1556)

Included in the following conference series:

International Conference on Evaluation of Novel Approaches to Software Engineering

Abstract

Automatic profiling models infer demographic characteristics of social network users from the content they generate or from their interactions. Because of its business uses (targeted advertising, market studies, and so on), automatic user profiling from social networks has become a popular task. Users’ demographic data is also crucial for more socially sensitive tasks, such as the automatic early detection of mental disorders. For this type of user analysis, it has been demonstrated that the way users employ language is an essential indicator that contributes to the effectiveness of the models. For this reason, we believe that considering language use in terms of both psycho-linguistic and semantic characteristics is useful for detecting variables such as a user’s gender, age, and origin. A proper selection of features will be critical for the performance of retrieval, classification, and decision-making software systems. In this work, we discuss gender classification as part of the automated profiling task. We present an experimental analysis of the performance of existing gender classification models for automated profiling, based on external corpora and baselines. We also investigate the role of linguistic characteristics in the models’ classification accuracy and their impact on each gender. Following that analysis, we have developed a feature set for gender classification models in social networks that outperforms existing benchmarks in terms of accuracy.


Acknowledgements

This work was supported by projects RTI2018-093336-B-C21 and RTI2018-093336-B-C22 (Ministerio de Ciencia e Innovación & ERDF), by the Consellería de Educación, Universidade e Formación Profesional (accreditation 2019-2022 ED431G/01, ED431B 2019/03), and by the European Regional Development Fund, which acknowledges the CITIC Research Center in ICT of the University of A Coruña as a Research Center of the Galician University System.

Author information

Authors and Affiliations

IRLab, CITIC Research Centre, Universidade da Coruña, A Coruña, Spain
Paloma Piot-Perez-Abadin, Patricia Martin-Rodilla & Javier Parapar


Corresponding author

Correspondence to Paloma Piot-Perez-Abadin.

Editor information

Editors and Affiliations

Hamad bin Khalifa University, Doha, Qatar
Raian Ali
TU Wien, Vienna, Austria
Hermann Kaindl
Wrocław University of Economics, Wrocław, Poland
Leszek A. Maciaszek

About this paper

Cite this paper

Piot-Perez-Abadin, P., Martin-Rodilla, P., Parapar, J. (2022). Gender Classification Models and Feature Impact for Social Media Author Profiling. In: Ali, R., Kaindl, H., Maciaszek, L.A. (eds) Evaluation of Novel Approaches to Software Engineering. ENASE 2021. Communications in Computer and Information Science, vol 1556. Springer, Cham. https://doi.org/10.1007/978-3-030-96648-5_12

DOI: https://doi.org/10.1007/978-3-030-96648-5_12
Published: 11 February 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-96647-8
Online ISBN: 978-3-030-96648-5
eBook Packages: Computer Science, Computer Science (R0)
