Abstract
In recent years, author gender identification has gained considerable attention in the fields of computational linguistics and artificial intelligence. This task has been extensively investigated for resource-rich languages such as English and Spanish. However, it has received little attention for Urdu articles. First, I created a new Urdu corpus for the author gender identification task. I then extracted two types of features from each article: the 600 most frequent multi-word expressions and the 300 most frequent words. Next, I concatenated the two feature sets, so that each article was represented in a 900-dimensional feature space. Finally, I applied 10 well-known classifiers to these features to perform the author gender identification task and compared their performance against state-of-the-art pre-trained multilingual language models, such as mBERT, DistilBERT, XLM-RoBERTa and multilingual DeBERTa, as well as Convolutional Neural Networks (CNN). Extensive experiments show that (i) concatenating the 600 most frequent multi-word expressions with the 300 most frequent words as features improves the accuracy of author gender identification, and (ii) the support vector machine outperforms the other classifiers, as well as the fine-tuned pre-trained language models and the CNN. The code base and the corpus can be found at: https://github.com/raheem23/Gender_Identification_Urdu.
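The feature construction described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not say how multi-word expressions are detected, so the sketch assumes they can be approximated by contiguous word bigrams, assumes whitespace tokenisation, and uses relative frequencies as feature values. The function names (`word_ngrams`, `top_k`, `build_vocabularies`, `featurise`) are hypothetical.

```python
from collections import Counter
from itertools import chain


def word_ngrams(tokens, n):
    """Return the contiguous word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_k(items_per_doc, k):
    """Return the k most frequent items across all documents, most frequent first."""
    counts = Counter(chain.from_iterable(items_per_doc))
    return [item for item, _ in counts.most_common(k)]


def build_vocabularies(docs, n_words=300, n_mwes=600, mwe_len=2):
    """Select the most frequent words and (assumed) multi-word expressions.

    Assumption: a multi-word expression is approximated by a word n-gram of
    length mwe_len; the paper may use a different extraction method.
    """
    tokenised = [doc.split()  # assumption: whitespace tokenisation
                 for doc in docs]
    words = top_k(tokenised, n_words)
    mwes = top_k([word_ngrams(tokens, mwe_len) for tokens in tokenised], n_mwes)
    return words, mwes


def featurise(doc, words, mwes, mwe_len=2):
    """Map one article to the concatenated feature vector (MWE part, then word part)."""
    tokens = doc.split()
    total = max(len(tokens), 1)  # avoid division by zero for empty articles
    word_counts = Counter(tokens)
    mwe_counts = Counter(word_ngrams(tokens, mwe_len))
    mwe_part = [mwe_counts[m] / total for m in mwes]
    word_part = [word_counts[w] / total for w in words]
    return mwe_part + word_part
```

With the default sizes and a corpus large enough to supply 600 distinct bigrams and 300 distinct words, `featurise` returns a 900-dimensional vector per article, which could then be passed to any classifier such as a support vector machine.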
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sarwar, R. (2022). Author Gender Identification for Urdu Articles. In: Corpas Pastor, G., Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2022. Lecture Notes in Computer Science, vol 13528. Springer, Cham. https://doi.org/10.1007/978-3-031-15925-1_16
DOI: https://doi.org/10.1007/978-3-031-15925-1_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15924-4
Online ISBN: 978-3-031-15925-1
eBook Packages: Computer Science, Computer Science (R0)