Author Gender Identification for Urdu Articles

Sarwar, Raheem

doi:10.1007/978-3-031-15925-1_16

Raheem Sarwar ORCID: orcid.org/0000-0002-0640-807X

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13528)

Included in the following conference series:

International Conference on Computational and Corpus-Based Phraseology

Abstract

In recent years, author gender identification has gained considerable attention in the fields of computational linguistics and artificial intelligence. This task has been extensively investigated for resource-rich languages such as English and Spanish. However, it has received little attention for Urdu articles. First, I created a new Urdu corpus for the author gender identification task. I then extracted two types of features from each article: the 600 most frequent multi-word expressions and the 300 most frequent words. After completing corpus creation and feature extraction, I concatenated the two feature sets, so that each article was represented in a 900-dimensional feature space. Finally, I applied 10 well-known classifiers to these features and compared their performance against state-of-the-art pre-trained multilingual language models, such as mBERT, DistilBERT, XLM-RoBERTa and multilingual DeBERTa, as well as Convolutional Neural Networks (CNN). Extensive experiments show that (i) concatenating the 600 most frequent multi-word expressions with the 300 most frequent words as features improves the accuracy of author gender identification, and (ii) the support vector machine outperforms the other classifiers, as well as the fine-tuned pre-trained language models and the CNN. The code base and the corpus can be found at: https://github.com/raheem23/Gender_Identification_Urdu.
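
The feature construction summarised in the abstract can be illustrated with a minimal Python sketch. It is not taken from the linked repository. Several details are assumptions made only for illustration: whitespace tokenization, word bigrams as a stand-in for multi-word expressions (the abstract does not say how they are identified), relative frequencies as feature values, and the hypothetical helper names `build_vocabularies` and `featurize`.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: plain whitespace tokenization.
    return text.split()


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    # All contiguous n-token windows of the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_k(counter: Counter, k: int) -> list:
    # Most frequent items first; ties broken by item for a stable order.
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


def build_vocabularies(corpus: Sequence[str], n_words: int = 300, n_mwes: int = 600):
    # Count words and bigrams (an assumed proxy for multi-word
    # expressions) over the whole corpus, then keep the most frequent.
    word_counts: Counter = Counter()
    mwe_counts: Counter = Counter()
    for text in corpus:
        tokens = tokenize(text)
        word_counts.update(tokens)
        mwe_counts.update(ngrams(tokens, 2))
    return top_k(word_counts, n_words), top_k(mwe_counts, n_mwes)


def featurize(text: str, words: list, mwes: list) -> List[float]:
    # Relative frequency of each selected item in one article.
    tokens = tokenize(text)
    total = max(len(tokens), 1)  # avoid division by zero on empty text
    word_counts = Counter(tokens)
    mwe_counts = Counter(ngrams(tokens, 2))
    mwe_part = [mwe_counts[m] / total for m in mwes]
    word_part = [word_counts[w] / total for w in words]
    # Concatenation: multi-word-expression features followed by word features.
    return mwe_part + word_part
```

With the default sizes, each vector has 600 + 300 = 900 entries once the corpus contains at least that many distinct bigrams and words; on a smaller corpus the vectors are shorter. The resulting vectors could then be passed to any standard classifier, such as a support vector machine.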

Author information

Authors and Affiliations

Research Group in Computational Linguistics, RIILP, University of Wolverhampton, Wolverhampton, UK
Raheem Sarwar

Corresponding author

Correspondence to Raheem Sarwar.

Editor information

Editors and Affiliations

University of Malaga, Malaga, Spain
Gloria Corpas Pastor
University of Wolverhampton, Wolverhampton, UK
Ruslan Mitkov

About this paper

Cite this paper

Sarwar, R. (2022). Author Gender Identification for Urdu Articles. In: Corpas Pastor, G., Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2022. Lecture Notes in Computer Science, vol 13528. Springer, Cham. https://doi.org/10.1007/978-3-031-15925-1_16

DOI: https://doi.org/10.1007/978-3-031-15925-1_16
Published: 21 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15924-4
Online ISBN: 978-3-031-15925-1
eBook Packages: Computer Science, Computer Science (R0)
