Detection of content-based cybercrime in Roman Kashmiri using ensemble learning

Multimedia Tools and Applications

Abstract

Kashmiri (Koshur), the official language of Kashmir, is spoken by more than 7 million people, yet content-based cybercrime detection for it remains unexplored in both theoretical and experimental research. The absence of programming libraries for sentiment analysis and of a benchmark corpus has impeded progress in this field. Kashmiri is also written in several scripts, including Perso-Arabic, Sharada, Devanagari, and Roman, which poses further challenges. Detecting cybercrime in this language is difficult because of its complex morphology, limited resources, scarcity of annotated datasets, and varied linguistic characteristics; overcoming these obstacles is essential to developing effective detection systems. This paper attempts to detect content-based cybercrime in the Roman Kashmiri script, which the Kashmiri community uses extensively on online platforms such as social media, chat rooms, and email. A well-balanced and meaningful dataset, the first of its kind in this context, was compiled from positive and negative comments, and three strategies were employed for analysis. The findings reveal that the Tf-Idf Vectorizer outperforms the other vectorization methods (Count Vectorizer and Tf-Idf Transformer), that bi-grams perform better than uni-grams and tri-grams, and that XGBM is the most effective model in terms of the evaluation metrics. Leveraging these strategies, Python applications were developed for text classification that successfully distinguish cyberbullying (unsafe) from non-cyberbullying (safe) instances, with XGBM achieving exceptional accuracy using the Tf-Idf Vectorizer with bi-grams, a Bag of Words, and lexical features.
This pioneering research underscores the urgent need for advances in content-based cybercrime detection for the Kashmiri language, paving the way for effective detection systems that address language-specific challenges and promote a safer online environment for the Kashmiri community. It also opens avenues for further work on detecting and preventing cybercrime in Kashmiri and potentially in other languages that lack robust cybercrime detection methodologies.
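
The Tf-Idf weighting over uni-grams and bi-grams named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the abstract gives no implementation details, so the function names `word_ngrams` and `tfidf_vectors`, the whitespace tokenization, and the placeholder comments below are all assumptions made for illustration. The weighting follows the smoothed formulation idf = ln((1 + N) / (1 + df)) + 1 with L2 normalization, which is the default behaviour documented for scikit-learn's TfidfVectorizer. The classification step (XGBM in the paper) is deliberately left out.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return word n-grams of length n_min..n_max from a whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(docs, n_min=1, n_max=2):
    """Return one {ngram: weight} dict per document, L2-normalized."""
    counts = [Counter(word_ngrams(d, n_min, n_max)) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    # Smoothed inverse document frequency (assumed formulation, see lead-in).
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1.0 for g, df in doc_freq.items()}

    vectors = []
    for c in counts:
        raw = {g: tf * idf[g] for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({g: w / norm for g, w in raw.items()})
    return vectors


# Placeholder comments (hypothetical tokens, not real Roman Kashmiri data).
docs = [
    "tok_a tok_b tok_c",
    "tok_a tok_b tok_d",
    "tok_e tok_f",
]
vectors = tfidf_vectors(docs, n_min=1, n_max=2)
```

In this sketch an n-gram that occurs in fewer documents receives a larger weight than one shared across many documents, which is the property that lets rare, distinctive word pairs stand out as features for a downstream classifier.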

Data availability

The authors created their own dataset for the experimental work because no public dataset is available.

Author information

Corresponding author

Correspondence to Parvinder Singh.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Farooq, U., Singh, P., Khurana, S.S. et al. Detection of content-based cybercrime in Roman Kashmiri using ensemble learning. Multimed Tools Appl 83, 33071–33105 (2024). https://doi.org/10.1007/s11042-023-16678-y

  • DOI: https://doi.org/10.1007/s11042-023-16678-y
