Abstract
Automatic profiling models infer demographic characteristics of social network users from the content they generate or from their interactions. Because of its business uses (targeted advertising, market studies, and so on), automatic user profiling from social networks has become a popular task. Users’ demographic data is also crucial for more socially concerning tasks, such as the automatic early detection of mental disorders. For this type of user analysis, it has been demonstrated that the way users employ language is an essential indicator that contributes to the effectiveness of the models. For this reason, we believe that considering language use in terms of both psycho-linguistic and semantic characteristics is useful for detecting variables such as a user’s gender, age, and origin. A proper selection of features will be critical for the performance of retrieval, classification, and decision-making software systems. In this work, we discuss gender classification as part of the automated profiling task. We present an experimental analysis of the performance of existing gender classification models for automated profiling, based on an external corpus and baselines. We also investigate the role of linguistic characteristics in the models’ classification accuracy and their impact on each gender. Following that analysis, we have developed a feature set for gender classification models in social networks that outperforms existing benchmarks in terms of accuracy.
Acknowledgements
This work was supported by projects RTI2018-093336-B-C21 and RTI2018-093336-B-C22 (Ministerio de Ciencia e Innovación & ERDF), and by financial support from the Consellería de Educación, Universidade e Formación Profesional (accreditation 2019-2022 ED431G/01, ED431B 2019/03) and the European Regional Development Fund, which acknowledges the CITIC Research Center in ICT of the University of A Coruña as a Research Center of the Galician University System.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Piot-Perez-Abadin, P., Martin-Rodilla, P., Parapar, J. (2022). Gender Classification Models and Feature Impact for Social Media Author Profiling. In: Ali, R., Kaindl, H., Maciaszek, L.A. (eds) Evaluation of Novel Approaches to Software Engineering. ENASE 2021. Communications in Computer and Information Science, vol 1556. Springer, Cham. https://doi.org/10.1007/978-3-030-96648-5_12
DOI: https://doi.org/10.1007/978-3-030-96648-5_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-96647-8
Online ISBN: 978-3-030-96648-5
eBook Packages: Computer Science, Computer Science (R0)