Author Gender Identification for Urdu Articles

  • Conference paper
Computational and Corpus-Based Phraseology (EUROPHRAS 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13528)

Abstract

In recent years, author gender identification has gained considerable attention in computational linguistics and artificial intelligence. The task has been investigated extensively for resource-rich languages such as English and Spanish, but it has received little attention for Urdu articles. First, I created a new Urdu corpus for the author gender identification task. I then extracted two types of features from each article: the 600 most frequent multi-word expressions and the 300 most frequent words. After corpus creation and feature extraction, I concatenated the two feature sets, so that each article was represented in a 900-dimensional feature space. Finally, I applied 10 well-known classifiers to these features and compared their performance against state-of-the-art pre-trained multilingual language models, such as mBERT, DistilBERT, XLM-RoBERTa and multilingual DeBERTa, as well as Convolutional Neural Networks (CNN). Extensive experiments show that (i) concatenating the 600 most frequent multi-word expressions with the 300 most frequent words as features improves the accuracy of author gender identification, and (ii) the support vector machine outperforms the other classifiers, as well as the fine-tuned pre-trained language models and the CNN. The code base and the corpus can be found at: https://github.com/raheem23/Gender_Identification_Urdu.
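
The feature construction described in the abstract can be illustrated with a minimal sketch. It is not the implementation from the repository above, and several choices are assumptions made only for illustration: whitespace tokenization, adjacent word pairs as a stand-in for multi-word expressions, relative frequencies as feature values, and the helper names `tokenize`, `bigrams`, `top_k` and `build_features`. The classifier step is left out; the resulting vectors could be passed to any standard classifier, such as a support vector machine.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: whitespace tokenization. Real Urdu text would likely
    # need a dedicated tokenizer.
    return text.split()


def bigrams(tokens: Sequence[str]) -> List[Tuple[str, str]]:
    # Assumption: adjacent word pairs stand in for multi-word expressions.
    return list(zip(tokens, tokens[1:]))


def top_k(items_per_doc: Iterable[Sequence[Hashable]], k: int) -> list:
    # Count items over the whole corpus and keep the k most frequent.
    counts: Counter = Counter()
    for items in items_per_doc:
        counts.update(items)
    # Sort by descending count, then by item, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


def build_features(docs: Sequence[str], n_words: int = 300, n_mwes: int = 600):
    """Return one concatenated vector per document (MWE part + word part)."""
    token_lists = [tokenize(d) for d in docs]
    pair_lists = [bigrams(t) for t in token_lists]

    word_vocab = top_k(token_lists, n_words)
    mwe_vocab = top_k(pair_lists, n_mwes)

    vectors = []
    for tokens, pairs in zip(token_lists, pair_lists):
        word_counts = Counter(tokens)
        pair_counts = Counter(pairs)
        n_tok = max(len(tokens), 1)    # avoid division by zero
        n_pair = max(len(pairs), 1)
        # Assumption: relative frequency within the document as feature value.
        mwe_part = [pair_counts[m] / n_pair for m in mwe_vocab]
        word_part = [word_counts[w] / n_tok for w in word_vocab]
        # Concatenation: up to 600 + 300 = 900 dimensions with the defaults.
        vectors.append(mwe_part + word_part)
    return vectors, mwe_vocab, word_vocab


if __name__ == "__main__":
    # Toy placeholder documents, not taken from the corpus.
    toy_docs = ["a b a b c", "b c b c a"]
    vecs, mwes, words = build_features(toy_docs)
    print(len(vecs), len(vecs[0]), len(mwes), len(words))
```

With the default sizes, each vector has at most 900 entries; on a small toy corpus the vocabularies are shorter because fewer distinct words and pairs exist.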

Notes

  1. https://beautiful-soup-4.readthedocs.io/en/latest/.

  2. https://newspaper.readthedocs.io/en/latest/.

  3. https://catboost.ai/en/docs/.

  4. https://scikit-learn.org.

Author information

Corresponding author

Correspondence to Raheem Sarwar.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sarwar, R. (2022). Author Gender Identification for Urdu Articles. In: Corpas Pastor, G., Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2022. Lecture Notes in Computer Science, vol 13528. Springer, Cham. https://doi.org/10.1007/978-3-031-15925-1_16

  • DOI: https://doi.org/10.1007/978-3-031-15925-1_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15924-4

  • Online ISBN: 978-3-031-15925-1

  • eBook Packages: Computer Science, Computer Science (R0)
