Gender identification of short text author using conceptual vectorization

Multimedia Tools and Applications

Abstract

The rapid growth of technology and cyberspace has shifted the scope and nature of human identity from the physical to the virtual. New forms of abuse have emerged, such as Internet fraud, impersonation, identity theft, and plagiarism. Content theft and spoofing are prevalent in texts such as news, social network messages, and email. The identity of people active in cyberspace, especially criminals, therefore needs to be established. Gender is one of the most important components of a person's identity, so identifying the gender of a text's author helps to address this problem. This article presents a method for improving gender identification of the authors of Persian short texts using conceptual vectorization. Because no suitable Persian data set was available in which texts carry a valid author name tag and were written by the named individuals themselves, a data set was created from valid information. The approach uses the conceptual vectorization method to perform mathematical operations on word vectors. Conceptual vectorization attempts to vectorize words in such a way that mathematical calculations on the resulting vectors are meaningful. To train a system that can identify the gender of a text's author, a neural network capable of learning complex functions was developed and implemented using artificial intelligence and machine learning algorithms. The proposed method, which combines conceptual vectorization with neural network training, achieved a 22% improvement in identifying author gender compared with related work. Under five-fold cross validation, it obtained an accuracy of 81.09%.
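Two ideas in the abstract can be sketched in a few lines of Python: representing a short text by arithmetic on its word vectors, and splitting a data set into five folds for cross validation. This is only an illustrative sketch, not the authors' implementation. The toy three-dimensional vectors, the function names `text_vector` and `k_fold_indices`, and the choice of averaging as the way to combine word vectors are all assumptions made for the example; the abstract does not state how the word vectors are combined or how the folds are drawn.

```python
from statistics import mean

# Hypothetical 3-dimensional word vectors, invented for illustration.
# Real embeddings would be learned from a corpus and have many more dimensions.
TOY_VECTORS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "phone": [0.1, 0.9, 0.3],
    "battery": [0.0, 0.8, 0.4],
}


def text_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped.

    Averaging is an assumed way of combining word vectors, chosen because
    it is the simplest arithmetic operation that yields a fixed-size input.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # no known token: fall back to the zero vector
    return [mean(component) for component in zip(*known)]


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


if __name__ == "__main__":
    print(text_vector(["good", "phone", "unknown"], TOY_VECTORS))
    for train, test in k_fold_indices(10, k=5):
        print("train:", train, "test:", test)
```

In a setup like this, each text vector would be fed to a classifier, the classifier would be trained on the four training folds and scored on the held-out fold, and the five scores would be averaged to give a single accuracy figure.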

Notes

  1. https://www.digikala.com

  2. http://rastan.parsiblog.com/posts/271

  3. https://code.google.com/archive/p/word2vec/

Abbreviations

CVM: Conceptual Vectorization Method

CV: Cross Validation

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Corresponding author

Correspondence to Ali Zarifi.

Ethics declarations

Conflicts of interests/competing interests

The authors did not receive support from any organization for the submitted work.

About this article

Cite this article

Zarifi, A., Naghavi, M. Gender identification of short text author using conceptual vectorization. Multimed Tools Appl 82, 17097–17113 (2023). https://doi.org/10.1007/s11042-022-14141-y

  • DOI: https://doi.org/10.1007/s11042-022-14141-y
