Detection of content-based cybercrime in Roman Kashmiri using ensemble learning

Farooq, Umar; Singh, Parvinder; Khurana, Surinder Singh; Kumar, Munish

doi:10.1007/s11042-023-16678-y

Published: 25 September 2023

Volume 83, pages 33071–33105, (2024)

Multimedia Tools and Applications

Umar Farooq¹,
Parvinder Singh¹,
Surinder Singh Khurana¹ &
Munish Kumar²

Abstract

Kashmiri (Koshur), the official language of Kashmir, is spoken by more than 7 million people, yet content-based cybercrime detection for it remains unexplored in both theoretical and experimental research. The absence of programming libraries for sentiment analysis and of a benchmark corpus has further impeded progress in this field. Kashmiri is written in several scripts (Perso-Arabic, Sharada, Devanagari, and Roman), which adds to the difficulty. Detecting cybercrime in the language is challenging because of its complex morphology, lack of resources, scarcity of annotated datasets, and varied linguistic characteristics, and these obstacles must be overcome to develop effective detection systems. This paper attempts to detect content-based cybercrime in the Roman Kashmiri script, which the Kashmiri community uses extensively on online platforms such as social media, chat rooms, and email. A well-balanced and meaningful dataset of positive and negative comments, the first of its kind in this context, was compiled, and three strategies were employed for analysis. The findings show that the Tf-Idf Vectorizer outperforms the other tokenization methods (Count Vectorizer and Tf-Idf Transformer), that the bi-gram notation performs better than the uni-gram and tri-gram notations, and that XGBM is the most effective classifier in terms of the evaluation metrics. Building on these strategies, Python applications were developed for text classification that successfully distinguish cyberbullying (unsafe) from non-cyberbullying (safe) instances, with XGBM achieving exceptional accuracy using the Tf-Idf Vectorizer with bi-grams, a Bag of Words, and lexical features.
This pioneering research underscores the urgent need for advances in content-based cybercrime detection in the Kashmiri language, paving the way for effective detection systems that address language-specific challenges and promote a safer online environment for the Kashmiri community. It also opens new avenues for detecting and preventing cybercrime in Kashmiri and potentially in other languages that lack robust cybercrime detection methodologies.
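
To make the feature representation named in the abstract more concrete, the following is a minimal sketch of Tf-Idf weighting over word bi-grams, written in plain Python. It is not the authors' implementation: the helper names `word_ngrams` and `tfidf_vectors`, the whitespace tokenization, and the toy English comments are all assumptions made for illustration. The smoothed IDF formula and the L2 normalisation are likewise assumed settings, chosen because they match the documented defaults of scikit-learn's `TfidfVectorizer`; the paper's actual configuration is not stated in the abstract.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams (joined by spaces) of lower-cased text."""
    tokens = text.lower().split()  # assumed tokenization: split on whitespace
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n=2):
    """Map each document to a dict {n-gram: L2-normalised Tf-Idf weight}."""
    grams_per_doc = [Counter(word_ngrams(d, n)) for d in docs]
    # Document frequency: the number of documents that contain each n-gram.
    df = Counter(g for counts in grams_per_doc for g in counts)
    n_docs = len(docs)
    vectors = []
    for counts in grams_per_doc:
        # Assumed weighting: raw term count times smoothed IDF,
        # idf = ln((1 + N) / (1 + df)) + 1.
        weights = {
            g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1.0)
            for g, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Documents with no n-grams (too short) get an empty vector.
        vectors.append({g: w / norm for g, w in weights.items()} if norm else {})
    return vectors


# Invented placeholder comments, not drawn from the authors' dataset.
toy_comments = ["you are kind", "you are stupid", "have a nice day"]
toy_vectors = tfidf_vectors(toy_comments, n=2)
```

In this toy corpus the bi-gram "you are" occurs in two comments, so it receives a lower weight than "are kind" or "are stupid", which each occur in only one. Sparse vectors of this kind would then be passed to a classifier; the abstract reports that XGBM performed best on such features.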

Data availability

The authors created their own dataset for the experimental work because no public dataset is available.

Author information

Authors and Affiliations

Central University of Punjab, Bathinda, India
Umar Farooq, Parvinder Singh & Surinder Singh Khurana
Maharaja Ranjit Singh Punjab Technical University, Bathinda, India
Munish Kumar

Corresponding author

Correspondence to Parvinder Singh.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Farooq, U., Singh, P., Khurana, S.S. et al. Detection of content-based cybercrime in Roman Kashmiri using ensemble learning. Multimed Tools Appl 83, 33071–33105 (2024). https://doi.org/10.1007/s11042-023-16678-y

Received: 03 October 2022
Revised: 21 August 2023
Accepted: 27 August 2023
Published: 25 September 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s11042-023-16678-y
