Abstract
Kashmiri (Koshur), the official language of Kashmir, is spoken by more than 7 million people, yet content-based cybercrime detection for it remains unexplored in both theoretical and experimental research. The absence of programming libraries for sentiment analysis and of a benchmark corpus has further impeded progress in this field. Kashmiri is written in several scripts, including Perso-Arabic, Sharada, Devanagari, and Roman, and detecting cybercrime in the language is difficult because of its complex morphology, the scarcity of resources and annotated datasets, and its varied linguistic characteristics. This paper attempts to detect content-based cybercrime in the Roman Kashmiri script, which the Kashmiri community uses extensively on online platforms such as social media, chat rooms, and email. A well-balanced and meaningful dataset of positive and negative comments, the first of its kind in this context, was compiled, and three strategies were employed for analysis. The findings reveal that the Tf-Idf Vectorizer outperforms the other tokenization methods (Count Vectorizer and Tf-Idf Transformer), that bi-grams perform better than uni-grams and tri-grams, and that XGBM is the most effective classifier in terms of the evaluation metrics. Building on these strategies, Python applications were developed for text classification that distinguish cyberbullying (unsafe) from non-cyberbullying (safe) instances, with XGBM exhibiting exceptional accuracy using the Tf-Idf Vectorizer with bi-grams, a Bag of Words, and lexical features.
This research underscores the urgent need for advances in content-based cybercrime detection for the Kashmiri language, paving the way for detection systems that address language-specific challenges and promote a safer online environment for the Kashmiri community. It also opens avenues for detecting and preventing cybercrime in Kashmiri and, potentially, in other languages that lack robust cybercrime detection methodologies.
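As an illustration of the bi-gram Tf-Idf representation mentioned in the abstract, the following is a minimal, standard-library sketch and not the authors' implementation. The function names `ngrams` and `tfidf_vectors`, the smoothed IDF formula, the L2 normalization, and the placeholder comments are all assumptions introduced here for illustration. The sketch does not cover the classifier stage.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n=2):
    """Map each document to a dict {n-gram: L2-normalized TF-IDF weight}."""
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    # Document frequency: number of documents that contain each n-gram.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Assumed smoothed IDF variant: ln((1 + N) / (1 + df)) + 1.
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical placeholder tokens, not real Roman Kashmiri data.
comments = [
    "aa bb cc dd",
    "aa bb ee ff",
    "gg hh cc dd",
]
vectors = tfidf_vectors(comments, n=2)
```

In this toy input, a bi-gram that occurs in only one comment receives a larger weight than one shared by two comments, which is the property that lets a downstream classifier focus on distinctive phrases. Setting `n=1` or `n=3` gives the uni-gram and tri-gram variants for comparison.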
Data availability
The authors created their own dataset for the experimental work because no public dataset is available.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Farooq, U., Singh, P., Khurana, S.S. et al. Detection of content-based cybercrime in Roman Kashmiri using ensemble learning. Multimed Tools Appl 83, 33071–33105 (2024). https://doi.org/10.1007/s11042-023-16678-y
DOI: https://doi.org/10.1007/s11042-023-16678-y