Abstract
The rapid growth of technology and cyberspace has extended human identity from the physical to the virtual world. New forms of abuse have emerged alongside it, including Internet fraud, impersonation, identity theft, and plagiarism. Content theft and spoofing are prevalent in many kinds of text, such as news, social network messages, and email. The identities of people active in cyberspace, especially criminals, therefore need to be established. Gender is one of the most important components of a person's identity, so identifying the gender of a text's author helps address this problem. This article presents a method for improving author gender identification for Persian short texts using conceptual vectorization. Because no suitable Persian data set existed in which texts carry a valid author name tag and were written by the named individuals themselves, a data set was created from valid information. The approach relies on the conceptual vectorization method (CVM), which maps words to vectors in such a way that mathematical operations on the resulting vectors are meaningful. To train a system that identifies the gender of a text's author, a neural network capable of learning complex functions was developed and implemented using machine learning algorithms. The proposed method, which combines conceptual vectorization with neural network training, achieved a 22% improvement in author gender identification compared with related work. Under five-fold cross-validation, it obtained an accuracy of 81.09%.
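The two technical ideas named in the abstract, vectors on which arithmetic is meaningful and five-fold cross-validation, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three-dimensional `TOY_VECTORS` table is hand-made, averaging word vectors is only one common way to represent a short text, and `k_fold_indices` is a hypothetical helper. It does not reproduce the embeddings, network, or data set used in the article.

```python
from math import sqrt

# Hypothetical hand-made 3-dimensional embeddings (illustration only);
# trained embeddings typically have hundreds of dimensions.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
}


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    """Element-wise difference of two vectors."""
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def nearest(vector, vectors):
    """Return the word whose embedding is most similar to `vector`."""
    return max(vectors, key=lambda word: cosine(vector, vectors[word]))


def text_vector(tokens, vectors):
    """Average the vectors of known tokens (assumed text representation).

    Returns None when no token has a vector.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Arithmetic on word vectors: king - man + woman lands closest to queen
# in this toy table.
analogy = add(sub(TOY_VECTORS["king"], TOY_VECTORS["man"]), TOY_VECTORS["woman"])
closest_word = nearest(analogy, TOY_VECTORS)

# Five-fold split of 12 samples: each sample is used for testing exactly once.
folds = list(k_fold_indices(12, k=5))
```

In a full evaluation, a classifier would be trained on each `train` split, scored on the matching `test` split, and the five accuracies averaged into a single figure.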
Abbreviations
- CVM: Conceptual Vectorization Method
- CV: Cross Validation
Funding
The authors did not receive support from any organization for the submitted work.
Ethics declarations
Conflicts of interests/competing interests
The authors did not receive support from any organization for the submitted work.
About this article
Cite this article
Zarifi, A., Naghavi, M. Gender identification of short text author using conceptual vectorization. Multimed Tools Appl 82, 17097–17113 (2023). https://doi.org/10.1007/s11042-022-14141-y
DOI: https://doi.org/10.1007/s11042-022-14141-y