Abstract
In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the “context” in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also by the “type” of discursive role that the comment performs with respect to the previous comment(s). The dataset has been developed as part of the ComMA Project and consists of a total of 57,363 annotated comments, 1142 annotated memes, and around 70 hours of annotated audio (extracted from videos) in four languages: Meitei, Bangla, Hindi, and Indian English. This data has been collected from various social media platforms such as YouTube, Facebook, Twitter, and Telegram. As is usual on social media websites, a large number of these comments are multilingual, and many are code-mixed with English. This paper gives a detailed description of the tagset developed during the course of this project and elaborates on the process of developing and using a multi-label, fine-grained tagset for marking comments with aggression and bias of various kinds, including gender bias, religious intolerance (called communal bias in the tagset), class/caste bias, and ethnic/racial bias. We define and discuss the tags used for marking the different discursive roles performed through the comments, such as attack, defend, and so on. We also present a statistical analysis of the dataset as well as the results of our baseline experiments for developing an automatic aggression identification system using the dataset. Based on the results of the baseline experiments, we argue that our dataset provides diverse and ‘hard’ sets of instances, which makes it a good dataset for training and testing new techniques for aggressive and abusive language classification.
Dataset availability
The dataset is made freely available in our GitHub repository (https://github.com/kmi-linguistics/comma) under the AGPL 3.0 license for research. For commercial usage, the dataset will be made available through a separate commercial agreement, to be executed on a case-by-case basis.
Notes
For example, work on selecting the best representative sample for building cross-lingual parsing systems using the Universal Dependencies dataset (Schluter & Agić, 2017) has shown that it may not be necessary to have datasets for all languages; a careful, intelligent selection of diverse languages might also prove useful. In the case of abusive language research, no such study has been carried out; moreover, the languages currently represented may not be representative of the diversity of languages in the world.
This applies especially to cyberbullying, which is defined as hateful or offensive speech that is repetitive and involves a power differential: for any act to be classified as “bullying”, both conditions must hold, but this is not necessarily enforced across all datasets.
All the examples given in this document are “real-life” examples, reproduced exactly as they appear in the actual post/comment, and form part of our corpus. The comments written in the Devanagari and Bangla scripts have been transcribed in the Roman script for the benefit of the readers.
What we learnt in hindsight is that it would have been better to build these diversions and breaks into the annotation tool itself. We are now looking at good ways of integrating this into the annotation tool.
Each “file” represents all the annotated comments from one single video.
In all the tables, MN—Meitei, BN—Bangla, HN—Hindi, EN—English, CM—Code-Mix, Agg.—Aggression, Ag. In.—Aggression Intensity, Dis.—Discursive, Gen.—Gendered, Com.—Communal, Ca./Cl.—Caste/Class, Et./Ra.—Ethnic/Racial, Comm.—Total number of Comments on YouTube Video for that class and language, Cnt.—Count of the audio instances for that class and language, Reg—Regular, Sur—Surprise.
Major religious groups in India.
BJP is a Hindu Nationalist Party currently heading the central government of India.
It might be possible to improve on these by using the additional information provided by us. However, discussion of such experiments is out of the scope of the current paper and may be taken up as future work.
First Workshop on Trolling, Aggression and Cyberbullying.
Second Workshop on Trolling, Aggression and Cyberbullying.
References
Agha, A. (2007). Language and social relations. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511618284
Akhtar, S., Basile, V., & Patti, V. (2019). A new measure of polarization in the annotation of hate speech. In Proceedings of the international conference of the Italian association for artificial intelligence, pp. 588–603. https://doi.org/10.1007/978-3-030-35166-3_41
Al-Hassan, A., & Al-Dossari, H. (2019). Detection of hate speech in social networks: A survey on multilingual corpus. Computer Science and Information Technology, 2019, 83–100. https://doi.org/10.5121/csit.2019.90208
Albadi, N., Kurdi, M., & Mishra, S. (2018). Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere. In Proceedings of the 2018 IEEE/ACM international conference on advances in social networks analysis and mining, pp. 69–76. https://doi.org/10.1109/ASONAM.2018.8508247
Alfina, I., Mulia, R., Fanany, M. I., et al. (2017). Hate speech detection in the Indonesian language: A dataset and preliminary study. In Proceedings of 2017 international conference on advanced computer science and information systems (ICACSIS), IEEE. https://doi.org/10.1109/ICACSIS.2017.8355039
Amjad, M., Zhila, A., Sidorov, G., et al. (2021). Overview of abusive and threatening language detection in Urdu at FIRE 2021. In Proceedings of the 12th forum for information retrieval evaluation (FIRE). Association for Computing Machinery, New York, USA, pp. 744–762.
Aporna, A. A., Azad, I., Amlan, N. S., et al. (2022). Classifying offensive speech of Bangla text and analysis using explainable AI. In M. Singh, V. Tyagi, P. K. Gupta, et al. (Eds.), Advances in computing and data sciences (pp. 133–144). Springer.
Banik, N., & Rahman, M. H. H. (2019). Toxicity detection on Bengali social media comments using supervised models. In 2019 2nd international conference on Innovation in Engineering and Technology (ICIET), pp. 1–5. https://doi.org/10.1109/ICIET48527.2019.9290710
Bhattacharya, S., Singh, S., Kumar, R., et al. (2020). Developing a multilingual annotated corpus of misogyny and aggression. In Proceedings of the second workshop on trolling, aggression and cyberbullying. European Language Resources Association (ELRA), Marseille, France, pp. 158–168, https://aclanthology.org/2020.trac-1.25
Bohra, A., Vijay, D., Singh, V., et al. (2018). A dataset of Hindi-English code-mixed social media text for hate speech detection. In Proceedings of the second workshop on computational modeling of people’s opinions, personality, and emotions in social media. Association for Computational Linguistics, New Orleans, Louisiana, USA, pp. 36–41. https://doi.org/10.18653/v1/W18-1105. https://aclanthology.org/W18-1105
Chakraborty, P., & Seddiqui, M. H. (2019). Threat and abusive language detection on social media in the Bengali language. In 2019 1st international conference on Advances in Science, Engineering and Robotics Technology (ICASERT), pp. 1–6. https://doi.org/10.1109/ICASERT.2019.8934609
Chen, Y., Zhou, Y., Zhu, S., et al. (2012). Detecting offensive language in social media to protect adolescent online safety. In 2012 international conference on privacy, security, risk and trust and 2012 international conference on social computing, pp. 71–80. https://doi.org/10.1109/SocialCom-PASSAT.2012.55
Chung, Y.L., Kuzmenko, E., Tekiroglu, S.S., et al. (2019). CONAN - COunter NArratives through nichesourcing: a multilingual dataset of responses to fight online hate speech. In Proceedings of the 57th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pp. 2819–2829. https://doi.org/10.18653/v1/P19-1271. https://aclanthology.org/P19-1271
Conneau, A., Khandelwal, K., Goyal, N., et al. (2019). Unsupervised cross-lingual representation learning at scale. CoRR. arXiv:1911.02116
Das, A. K., Asif, A. A., Paul, A., et al. (2021). Bangla hate speech detection on social media using attention-based recurrent neural network. Journal of Intelligent Systems, 30(1), 578–591. https://doi.org/10.1515/jisys-2020-0060
David, A. B. (2015). Descriptive grammar of Bangla. De Gruyter. https://doi.org/10.1515/9781614512295. www.degruyter.com/document/doi/10.1515/9781614512295/html
Davidson, T., Warmsley, D., Macy, M., et al. (2017). Automated hate speech detection and the problem of offensive language. In Proceedings of the eleventh international conference on web and social media, AAAI, pp. 512–515.
de Pelle, R., & Moreira, V. P. (2016). Offensive comments in the Brazilian web: A dataset and baseline results. In Proceedings of the fifth Brazilian workshop on social network analysis and mining (BraSNAM 2016), pp. 510–519. https://doi.org/10.5753/brasnam.2017.3260
Del Vigna, F., Cimino, A., Dell’Orletta, F., et al. (2017). Hate me, hate me not: Hate speech detection on Facebook. In Proceedings of the First Italian conference on cybersecurity (ITASEC17), CEUR.org, pp. 86–95.
Devlin, J., Chang, M., Lee, K., et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR. arXiv:1810.04805
D’Orazio, V., Kenwick, M., Lane, M., et al. (2016). Crowdsourcing the measurement of interstate conflict. PLoS ONE, 11(6), e0156527. https://doi.org/10.1371/journal.pone.0156527
Eshan, S. C., & Hasan, M. S. (2017). An application of machine learning to detect abusive Bengali text. In 2017 20th international conference of Computer and Information Technology (ICCIT), pp. 1–6. https://doi.org/10.1109/ICCITECHN.2017.8281787
Fernquist, J., Lindholm, O., Kaati, L., et al. (2019). A study on the feasibility to detect hate speech in Swedish. In 2019 IEEE international conference on big data (Big Data), IEEE, pp. 4724–4729. https://doi.org/10.1109/BigData47090.2019.9005534
Fortuna, P., Rocha da Silva, J., Soler-Company, J., et al. (2019). A hierarchically-labeled Portuguese hate speech dataset. In Proceedings of the third workshop on abusive language online. Association for Computational Linguistics, Florence, Italy, pp. 94–104. https://doi.org/10.18653/v1/W19-3510. https://aclanthology.org/W19-3510
Haddad, H., Mulki, H., & Oueslati, A. (2019). T-HSAB: A Tunisian hate speech and abusive dataset. In 7th international conference on Arabic language processing, pp. 251–263. https://doi.org/10.1007/978-3-030-32959-4_18
Hammer, H. (2017). Automatic detection of hateful comments in online discussion. In Lecture notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, pp. 164–173. https://doi.org/10.1007/978-3-319-52569-3_15
Hussain, M. G., & Mahmud, T. A. (2019). A technique for perceiving abusive Bangla comments. Green University of Bangladesh Journal of Science and Engineering. https://doi.org/10.5281/zenodo.3544583
Ishmam, A., & Sharmin, S. (2019). Hateful speech detection in public Facebook pages for the Bengali language. In 18th IEEE international conference on machine learning and applications, ICMLA 2019, Boca Raton, FL, USA, pp. 555–560. https://doi.org/10.1109/ICMLA.2019.00104
Islam, T., Ahmed, N., & Latif, S. (2021). An evolutionary approach to comparative analysis of detecting Bangla abusive text. Bulletin of Electrical Engineering and Informatics, 10, 2163–2169. https://doi.org/10.11591/eei.v10i4.3107
Joshi, P., Santy, S., Budhiraja, A., et al. (2020). The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, pp. 6282–6293. https://doi.org/10.18653/v1/2020.acl-main.560. https://aclanthology.org/2020.acl-main.560
Jurgens, D., Hemphill, L., & Chandrasekharan, E. (2019). A just and comprehensive strategy for using NLP to address online abuse. In Proceedings of the 57th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pp. 3658–3666. https://doi.org/10.18653/v1/P19-1357. https://aclanthology.org/P19-1357
Kaggle (2020). Jigsaw multilingual toxic comment classification. https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/138198
Kakwani, D., Kunchukuttan, A., Golla, S., et al. (2020). IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of EMNLP.
Karim, M. R., Dey, S. K., Islam, T., et al. (2021). DeepHateExplainer: Explainable hate speech detection in under-resourced Bengali language. In 2021 IEEE 8th international conference on Data Science and Advanced Analytics (DSAA), pp. 1–10. https://doi.org/10.1109/DSAA53316.2021.9564230
Karim, M. R., Raja Chakravarthi, B., McCrae, J. P., et al. (2020). Classification benchmarks for under-resourced Bengali language based on multichannel convolutional-LSTM network. In 2020 IEEE 7th international conference on Data Science and Advanced Analytics (DSAA), pp. 390–399. https://doi.org/10.1109/DSAA49011.2020.00053
Khanuja, S., Bansal, D., Mehtani, S., et al. (2021). MuRIL: Multilingual representations for Indian languages. arXiv:2103.10730
Kolhatkar, V., Wu, H., Cavasso, L., et al. (2020). The SFU opinion and comments corpus: A corpus for the analysis of online news comments. Corpus Pragmatics. https://doi.org/10.1007/s41701-019-00065-w
Kumar, R., Lahiri, B., & Ojha, A. (2021). Aggressive and offensive language identification in Hindi, Bangla, and English: A comparative study. SN Computer Science. https://doi.org/10.1007/s42979-020-00414-6
Kumar, R., Ojha, A.K., Malmasi, S., et al. (2018a). Benchmarking aggression identification in social media. In Proceedings of the first workshop on trolling, aggression and cyberbullying (TRAC-2018). Association for Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1–11. https://aclanthology.org/W18-4401
Kumar, R., Ojha, A.K., Malmasi, S., et al. (2020). Evaluating aggression identification in social media. In Proceedings of the second workshop on trolling, aggression and cyberbullying. European Language Resources Association (ELRA), Marseille, France, pp. 1–5. https://aclanthology.org/2020.trac-1.1
Kumar, R., Ratan, S., Singh, S., et al. (2022). The ComMA dataset v0.2: Annotating aggression and bias in multilingual social media discourse. In Proceedings of the language resources and evaluation conference. European Language Resources Association, Marseille, France, pp. 4149–4161. https://aclanthology.org/2022.lrec-1.441
Kumar, R., Reganti, A. N., Bhatia, A., et al. (2018b). Aggression-annotated corpus of Hindi-English code-mixed data. In Proceedings of the eleventh international conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan. https://aclanthology.org/L18-1226
Malmasi, S., & Zampieri, M. (2017). Detecting hate speech in social media. In Proceedings of the international conference Recent Advances in Natural Language Processing, RANLP 2017. INCOMA Ltd., Varna, Bulgaria, pp. 467–472. https://doi.org/10.26615/978-954-452-049-6_062
Mandl, T., Modha, S., Shahi, G. K., et al. (2020). Overview of the HASOC track at FIRE 2020: Hate speech and offensive content identification in Indo-European languages. In Proceedings of the 11th forum for information retrieval evaluation (FIRE). Association for Computing Machinery, New York, USA, pp. 29–32.
Mandl, T., Modha, S., Shahi, G. K., et al. (2021). Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages. In Proceedings of the 12th forum for information retrieval evaluation (FIRE). Association for Computing Machinery, New York, USA, pp. 1–19.
Martins, R., Gomes, M., Almeida, J., et al. (2018). Hate speech classification in social media using emotional analysis. In Proceedings of the 2018 Brazilian conference on intelligent systems, BRACIS 2018, pp. 61–66. https://doi.org/10.1109/BRACIS.2018.00019
Mathur, P., Shah, R., Sawhney, R., et al. (2018). Detecting offensive tweets in Hindi-English code-switched language. In Proceedings of the Sixth international workshop on Natural Language Processing for Social Media. Association for Computational Linguistics, Melbourne, Australia, pp. 18–26. https://doi.org/10.18653/v1/W18-3504. https://aclanthology.org/W18-3504
Mubarak, H., Darwish, K., & Magdy, W. (2017). Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online. Association for Computational Linguistics, Vancouver, BC, Canada, pp. 52–56. https://doi.org/10.18653/v1/W17-3008. https://aclanthology.org/W17-3008
Nascimento, G., Carvalho, F., Cunha, A., et al. (2019). Hate speech detection using Brazilian imageboards. In Proceedings of the 25th Brazilian symposium on multimedia and the web, WebMedia 2019, pp. 325–328. https://doi.org/10.1145/3323503.3360619
Nobata, C., Tetreault, J., Thomas, A., et al. (2016). Abusive language detection in online user content. In Proceedings of the 25th international conference on world wide web (WWW’16). International World Wide Web Conferences Steering Committee, pp. 145–153. https://doi.org/10.1145/2872427.2883062
Ousidhoum, N., Lin, Z., Zhang, H., et al. (2019). Multilingual and multi-aspect hate speech analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp. 4675–4684. https://doi.org/10.18653/v1/D19-1474. https://aclanthology.org/D19-1474
Poletto, F., Basile, V., Sanguinetti, M., et al. (2021). Resources and benchmark corpora for hate speech detection: A systematic review. Language Resources and Evaluation, 55, 477–523. https://doi.org/10.1007/s10579-020-09502-8
Ranasinghe, T., & Zampieri, M. (2021). An evaluation of multilingual offensive language identification methods for the languages of India. Information, 12(8). https://doi.org/10.3390/info12080306. https://www.mdpi.com/2078-2489/12/8/306
Ritu, S.S., Mondal, J., Mia, M.M., et al. (2021). Bangla abusive language detection using machine learning on radio message gateway. In 2021 6th international conference on Communication and Electronics Systems (ICCES), pp. 1725–1729. https://doi.org/10.1109/ICCES51350.2021.9489131
Romim, N., Ahmed, M., Islam, M. S., et al. (2021a). HS-BAN: A benchmark dataset of social media comments for hate speech detection in Bangla. arXiv:2112.01902
Romim, N., Ahmed, M., Islam, M. S., et al. (2022). BD-SHS: A benchmark dataset for learning to detect online Bangla hate speech in different social contexts. https://doi.org/10.48550/ARXIV.2206.00372. arXiv:2206.00372
Romim, N., Ahmed, M., Talukder, H., et al. (2021b). Hate speech detection in the Bengali language: A dataset and its baseline evaluation. In M. S. Uddin & J. C. Bansal (Eds.), Proceedings of International Joint Conference on Advances in Computational Intelligence. Springer Singapore, Singapore, pp. 457–468.
Rosenthal, S., Atanasova, P., Karadzhov, G., et al. (2021). SOLID: A large-scale semi-supervised dataset for offensive language identification. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 915–928. https://doi.org/10.18653/v1/2021.findings-acl.80
Ross, B., Rist, M., Carbonell, G., et al. (2017). Measuring the reliability of hate speech annotations: The case of the European refugee crisis. In NLP4CMC III: 3rd workshop on natural language processing for computer-mediated communication. https://doi.org/10.17185/duepublico/42132
Sanguinetti, M., Poletto, F., Bosco, C., et al. (2018). An Italian Twitter corpus of hate speech against immigrants. In Proceedings of the Eleventh international conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan, pp. 2798–2805. https://aclanthology.org/L18-1443
Sanh, V., Debut, L., Chaumond, J., et al. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv:1910.01108
Sazzed, S. (2021a). Abusive content detection in transliterated Bengali-English social media corpus. In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching. Association for Computational Linguistics, Online, pp. 125–130. https://doi.org/10.18653/v1/2021.calcs-1.16. https://aclanthology.org/2021.calcs-1.16
Sazzed, S. (2021b). Identifying vulgarity in Bengali social media textual content. PeerJ Computer Science. https://doi.org/10.7717/peerj-cs.665
Schäfer, J., & Burtenshaw, B. (2019). Offence in dialogues: A corpus-based study. In Proceedings of the international conference on Recent Advances in Natural Language Processing (RANLP 2019). INCOMA Ltd., Varna, Bulgaria, pp. 1085–1093. https://doi.org/10.26615/978-954-452-056-4_125. https://aclanthology.org/R19-1125
Schluter, N., & Agić, Ž. (2017). Empirically sampling Universal Dependencies. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017). Association for Computational Linguistics, Gothenburg, Sweden, pp. 117–122, https://aclanthology.org/W17-0415.
Schmidt, A., & Wiegand, M. (2017). A survey on hate speech detection using natural language processing. In Proceedings of the Fifth international workshop on Natural Language Processing for Social Media. Association for Computational Linguistics, Valencia, Spain, pp. 1–10. https://doi.org/10.18653/v1/W17-1101. https://aclanthology.org/W17-1101
Sharif, O., & Hoque, M. M. (2022). Tackling cyber-aggression: Identification and fine-grained categorization of aggressive texts on social media using weighted ensemble of transformers. Neurocomputing, 490, 462–481. https://doi.org/10.1016/j.neucom.2021.12.022
Sharif, O., Hoque, M. M., et al. (2021). Identification and classification of textual aggression in social media: Resource creation and evaluation. In T. Chakraborty, K. Shu, & H. R. Bernard (Eds.), Combating Online Hostile Posts in Regional Languages during Emergency Situation (pp. 9–20). Cham: Springer International Publishing.
Shmueli, B., Fell, J., Ray, S., et al. (2021). Beyond fair pay: Ethical implications of NLP crowdsourcing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3758–3769. https://aclanthology.org/2021.naacl-main.295
Steinberger, J., Brychcín, T., Hercig, T., et al. (2017). Cross-lingual flames detection in news discussions. In Proceedings of the international conference Recent Advances in Natural Language Processing, RANLP 2017. INCOMA Ltd., Varna, Bulgaria, pp. 694–700. https://doi.org/10.26615/978-954-452-049-6_089
Vidgen, B., & Derczynski, L. (2020). Directions in abusive language training data, a systematic review: Garbage in, garbage out. PLoS ONE, 15(12), e0243300. https://doi.org/10.1371/journal.pone.0243300
Vidgen, B., & Yasseri, T. (2020). Detecting weak and strong Islamophobic hate speech on social media. Journal of Information Technology & Politics, 17, 66–78. https://doi.org/10.1080/19331681.2019.1702607
Wang, S., Liu, J., Ouyang, X., et al. (2020). Galileo at SemEval-2020 task 12: Multi-lingual learning for offensive language identification using pre-trained language models. In Proceedings of the Fourteenth Workshop on Semantic Evaluation. International Committee for Computational Linguistics, Barcelona (online), pp. 1448–1455. https://doi.org/10.18653/v1/2020.semeval-1.189. https://aclanthology.org/2020.semeval-1.189
Waseem, Z. (2016). Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the first workshop on NLP and computational social science. Association for Computational Linguistics (ACL), pp. 138–142. https://doi.org/10.18653/v1/W16-5618
Waseem, Z., Davidson, T., Warmsley, D., et al. (2017). Understanding abuse: A typology of abusive language detection subtasks. In Proceedings of the First Workshop on Abusive Language Online. Association for Computational Linguistics, Vancouver, BC, Canada, pp. 78–84. https://doi.org/10.18653/v1/W17-3012. https://aclanthology.org/W17-3012
Waseem, Z., & Hovy, D. (2016). Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop. Association for Computational Linguistics, San Diego, California, pp. 88–93. https://doi.org/10.18653/v1/N16-2013. https://aclanthology.org/N16-2013
Weingartner, S., & Stahel, L. (2019). Online aggression from a sociological perspective: An integrative view on determinants and possible countermeasures. In Proceedings of the third workshop on abusive language online. Association for Computational Linguistics, Florence, Italy, pp. 181–187. https://doi.org/10.18653/v1/W19-3520. https://aclanthology.org/W19-3520
Zampieri, M., Malmasi, S., Nakov, P., et al. (2019a). Predicting the type and target of offensive posts in social media. In Proceedings of the 2019 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp. 1415–1420. https://doi.org/10.18653/v1/N19-1144. https://aclanthology.org/N19-1144
Zampieri, M., Malmasi, S., Nakov, P., et al. (2019b). SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). In Proceedings of the 13th international workshop on semantic evaluation. Association for Computational Linguistics, Minneapolis, Minnesota, USA, pp. 75–86. https://doi.org/10.18653/v1/S19-2010. https://aclanthology.org/S19-2010
Zampieri, M., Nakov, P., Rosenthal, S., et al. (2020). SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020). In Proceedings of the fourteenth workshop on semantic evaluation. International Committee for Computational Linguistics, Barcelona (online), pp. 1425–1447. https://doi.org/10.18653/v1/2020.semeval-1.188. https://aclanthology.org/2020.semeval-1.188
Acknowledgements
This research is funded by Facebook Research under its Content Policy Research Initiative. We would like to thank Afrida Aainun Murshida, Sanju Pukhrambam, Diana Thingujam, and Sonal Sinha for helping us out with annotations at different points in the project.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Data statement
1.1 A.1 Header
- Dataset Title: ComMA Dataset v0.2
- Dataset Curator(s):
  - Akash Bhagat, Indian Institute of Technology-Kharagpur
  - Enakshi Nandi, Panlingua Language Processing LLP, New Delhi
  - Laishram Niranjana Devi, Panlingua Language Processing LLP, New Delhi
  - Mohit Raj, Panlingua Language Processing LLP, New Delhi
  - Shiladitya Bhattacharya, Jawaharlal Nehru University, New Delhi
  - Shyam Ratan, Dr. Bhimrao Ambedkar University, Agra
  - Siddharth Singh, Dr. Bhimrao Ambedkar University, Agra
  - Yogesh Dawer, Dr. Bhimrao Ambedkar University, Agra
- Dataset Version: Version 0.2, 2nd October 2021
- Dataset Citation: NA
- Data Statement Authors:
  - Enakshi Nandi, Panlingua Language Processing LLP, New Delhi
  - Laishram Niranjana Devi, Panlingua Language Processing LLP, New Delhi
  - Shyam Ratan, Dr. Bhimrao Ambedkar University
- Data Statement Version: 1, 17th November 2021
- Data Statement Citation and DOI: NA
- Links to versions of this data statement in other languages: NA
1.2 A.2 Executive summary
The objective of this dataset is to identify and tag aggression and various kinds of bias (gender, communal, caste/class, ethnic/racial) in social media discourse. To that end, the dataset has been compiled by collecting over 70 hours of annotated audio, over 1.2k annotated memes, and over 57k text data points (totalling over 172k manual annotations) from YouTube, Facebook, Twitter, and Telegram in Meitei, Bangla, Hindi, and English. The data was collected from videos/audio, memes, and posts that were politically, socially, sexually, religiously, racially, or otherwise polarized or controversial in nature, so as to elicit a wide and extensive range of hateful, aggressive, gendered, communal, casteist, classist, and racist speech data for our dataset.
1.3 A.3 Curation rationale
This dataset was created with the ultimate goal of developing a system that is able to identify and tag aggression, gender bias, communal bias, caste/class bias, and ethnic/racial bias in social media discourse. To that end, this dataset has been manually annotated by multiple annotators in order to identify the linguistic and pragmatic features that characterize aggression, gender bias, communal bias, caste/class bias, and ethnic/racial bias in the comments on posts, videos, memes, and articles posted on social media sites such as YouTube, Facebook, Twitter, and Telegram.
The specific social media posts and articles whose comments we collected were selected manually and then crawled with the help of the respective platforms' web crawlers. This selection process was contingent on many factors, chief of which was the need to collect as many aggressive, gender-biased, communal, casteist, classist, and racist comments as possible to create a robust dataset. To that end, we focused on identifying controversial posts of a politically, socially, sexually, communally, and racially charged nature that had elicited a significant number of the kind of comments described above. We then followed similar suggested posts, videos, or articles on the platform to collect more data of a comparable nature. The second factor was language: the comments needed to be in Meitei, Bangla, and Hindi for the most part, with English comments included because they are ubiquitous in the context of Indian social media.
The dataset was organized in the form of a spreadsheet, with each comment identified by a unique comment code that would help the annotators distinguish an independent comment from comments featuring in a thread, and each comment posted under an article or video constituting one data instance, regardless of its length, language, or content. In other words, a data instance can be a single letter or an essay-length comment, can be written in a single language or a combination of languages, and can contain text (in any script), numerals, and emojis individually or all in one comment. The data instance is annotated taking the entire comment as one single, compact unit.
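To make this organization concrete, the following is a minimal sketch in Python of what one such data instance might look like. The class and field names (CommentInstance, comment_code, parent_code, and so on) are our own illustrative choices, not the dataset's actual column headings.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CommentInstance:
    """One data instance: a single comment annotated as one compact unit.

    Field names are illustrative, not the dataset's actual schema.
    """
    comment_code: str           # unique code; encodes the comment's thread position
    parent_code: Optional[str]  # None for an independent (top-level) comment
    text: str                   # raw comment: any script(s), numerals, emojis
    language: str               # e.g. "Meitei", "Bangla", "Hindi", "English", "Code-Mix"
    aggression_tags: List[str] = field(default_factory=list)
    bias_tags: List[str] = field(default_factory=list)

def is_threaded(instance: CommentInstance) -> bool:
    """True if the comment is a reply within a thread rather than independent."""
    return instance.parent_code is not None
```

Whatever the concrete schema, the key design point is the one stated above: the whole comment, regardless of length, language, or content, is one indivisible unit of annotation.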
1.4 A.4 Documentation for source datasets
This dataset has been developed from a source dataset that marked only aggressive speech collected from public Facebook and Twitter pages, and a subsequent source dataset that extended the tagset to cover speech with aggression and gender bias, collected from Facebook, Twitter, and YouTube.
The links for the research papers published and the workshops conducted on the respective source datasets are listed below:
- 1.
- 2.
- 3.
- 4.
The current dataset builds on the foundation laid by these source datasets and adds several new, fine-grained tags, including two primary tags marking caste/class bias and ethnic/racial bias, and two secondary tags that mark the discursive roles and discursive effects of (overtly and covertly) aggressive speech.
1.5 A.5 Language varieties
The languages included in this dataset, listed with their respective BCP-47 language tags, are:
- code unavailable: Meitei as spoken by the Meitei community in Manipur, India.
- bn-IN and bn-BD: Bangla (and its varieties) as spoken in India and Bangladesh.
- hi-IN: Hindi (and its varieties) as spoken in various parts of India.
- en-IN: English (and its varieties) as spoken in India, otherwise known as Indian English.
Since this dataset has been collected exclusively from online sources, the users writing the comments are assumed to be multilingual and may be based in any part of the world, not just in the places where these languages are primarily spoken. However, the language varieties used in the dataset are primarily those mentioned in the list above.
1.6 A.6 Speaker demographic
This dataset has been sourced exclusively from the internet; hence, the speaker demographic cannot be identified beyond the language the speakers use. It is assumed that the speakers could be of any age, gender, sexual orientation, educational background, nationality, caste, class, religion, race, tribe, or ethnicity.
The speakers are probably multilingual as well, the language they post in being one of the many they know or are fluent in. Given the nature and reach of the topics selected, it is a safe assumption that many of these comments were made by Indians (specifically, people who have Meitei, Bangla, or Hindi as their first or primary language) and Bangladeshis, but this assumption is not backed by any data or statistical findings.
1.7 A.7 Annotator demographic
The annotation scheme and guidelines for this dataset have been developed by Dr. Ritesh Kumar, the principal investigator of the ComMA Project and a faculty member at the Department of Transdisciplinary Studies, Dr. Bhimrao Ambedkar University, Agra, India. He was assisted by the co-PIs of the project (Dr. Bornini Lahiri, Assistant Professor at IIT-Kharagpur, and Dr. Atul Kr. Ojha and Akanksha Bansal, co-founders of Panlingua Language Processing LLP) and by the annotators of this dataset, who are listed below. Further, these annotators manually identified the appropriate posts and videos to work on, crawled the data, and then annotated and analysed the processed data in their respective languages.
- annotator_1: A 31-year-old Bengali Muslim woman working from Gangtok, Sikkim. She has a PhD in English, speaks Bangla, Hindi, and English, and her ideological leanings are centrist. She is annotating the Bangla data.
- annotator_2: A 29-year-old Bengali Hindu man working from Malda, West Bengal. He has an MA in Linguistics, and speaks Bangla, English, Hindi, and Bhojpuri. He is annotating the Bangla data.
- annotator_3: A 24-year-old Meitei woman working from Imphal, Manipur. She has an MA in Linguistics, speaks Meiteilon, English, and Hindi, and her ideological leanings are centrist. She is annotating the Meitei data.
- annotator_4: A 33-year-old Bengali Hindu woman working from Kalyani, West Bengal. She has a PhD in Linguistics, speaks English, Hindi, Bangla, and Sylheti, and her ideological leanings are leftist. She is annotating the Bangla data.
- annotator_5: A 30-year-old Meitei Hindu woman working from Imphal, Manipur. She is pursuing a PhD in Linguistics, speaks English, Hindi, and Meitei, and her ideological leanings are centrist. She is annotating the Meitei data.
- annotator_6: A 24-year-old Meitei man working from Imphal, Manipur. He has an MA in Linguistics, and speaks Meiteilon, English, and Hindi. He is annotating the Meitei data.
- annotator_7: A 25-year-old North Indian Hindu man working from Agra, Uttar Pradesh. He is pursuing an MPhil in Linguistics, and speaks Braj, Hindi, and English. He is annotating the English and Hindi data.
- annotator_8: A 27-year-old North Indian Hindu man working from Agra, Uttar Pradesh. He is pursuing an MSc in Computational Linguistics, speaks Hindi, Bhojpuri, and English, and his ideological leanings are centrist. He is annotating the English and Hindi data.
- annotator_9: A 27-year-old North Indian Hindu woman working from Patna, Bihar. She is pursuing an MPhil in Linguistics and speaks Bhojpuri, Hindi, and English. She is annotating the Hindi and English data.
- annotator_10: A 32-year-old Punjabi Hindu man working from Agra, Uttar Pradesh. He has an MA in Journalism and in Linguistics, speaks English, Hindi, and Punjabi, and his ideological leanings are leftist. He is annotating the English and Hindi data.
- annotator_11: A 28-year-old North Indian Hindu man working from Patna, Bihar. He is pursuing a PhD in Linguistics, speaks English, Hindi, Magahi, and Bhojpuri, and his ideological leanings are centrist. He is annotating the English and Hindi data.
- annotator_12: A 33-year-old Bengali Hindu man working from Kolkata and New Delhi. He has a PhD in Computational Linguistics, speaks Bangla, Hindi, and English, and his ideological leanings are leftist. He is annotating the Bangla data.
1.8 A.8 Speech situation and text characteristics
This dataset comprises online comments written by users of various social media platforms. The comments collected range from 2012 to 2021 (and continuing) and form part of an extensive and intensive social media discourse.
- Time and place of linguistic activity: Online
- Date(s) of data collection: April to September 2021
- Modality: Written
- Scripted/edited vs. spontaneous: Spontaneous
- Synchronous vs. asynchronous interaction: Asynchronous (online comments)
- Speakers' intended audience: Other users of the respective social media platforms and channels
- Genre: Social media
- Topic: Socially or politically polarizing or controversial topics
- Non-linguistic context: The videos which provide the context for the comments generated
- Additional details about the cultural context: The sociopolitical climate and cultural context in which the commenters live have a huge influence on the nature, tone, and ideological underpinnings of the comments they write on social media
1.9 A.9 Preprocessing and data formatting
The preprocessing of the raw data involves deleting all duplicates of a data instance, deleting data instances that contain URLs or texts of fewer than three words, and removing data instances in languages other than Meitei, Bangla, Hindi, and English. In the Telegram data, all translations of texts have been deleted manually. The data instances are listed without the names of the commenters, but when someone has replied to a previous comment by tagging them with the ‘@’ symbol, that information is available to the annotator within the text itself.
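As an illustration, the filtering steps described above could be implemented along the following lines. This is a hedged sketch rather than the project's actual preprocessing script: in particular, the detect_language argument is an assumed stand-in for whatever language-identification step the team actually used.

```python
import re

ALLOWED_LANGUAGES = {"Meitei", "Bangla", "Hindi", "English"}
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def preprocess(comments, detect_language):
    """Filter raw comments following the steps described above.

    `detect_language` is an assumed stand-in: any callable mapping a
    text to a language name.
    """
    seen = set()
    kept = []
    for text in comments:
        normalized = text.strip()
        if normalized in seen:              # drop exact duplicates
            continue
        seen.add(normalized)
        if URL_PATTERN.search(normalized):  # drop instances containing URLs
            continue
        if len(normalized.split()) < 3:     # drop texts of fewer than three words
            continue
        if detect_language(normalized) not in ALLOWED_LANGUAGES:
            continue                        # drop instances in other languages
        kept.append(normalized)
    return kept
```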
Next, the processed data is arranged in a Google spreadsheet, columns are created with the relevant headings and tags (using the option for data validation), and copies of the spreadsheet are shared amongst the annotators working on a particular language so that they can annotate the files individually, without consulting each other. This ensures that no annotator is influenced in their annotation by the ideas of another. However, at no stage in the process are the annotators anonymous to each other or to anyone else in the team.
1.10 A.10 Capture quality
As with any other dataset, we have faced quality issues in data capture. Chief among these is the difficulty of finding every kind of data in every language. For instance, in Bangla it is very difficult to find racist or communal data, because most conversations of a communal or racist nature involving Bangla speakers that we have come across on social media platforms occur in English. Similar challenges have been faced in Meitei with regard to casteist data, and in Hindi with regard to ethnically and racially biased data. These discrepancies can be explained by the social, political, and cultural contexts of each of these language and speech communities, which differ significantly from one another.
1.11 A.11 Limitations
Following on from the previous section, another limitation of the data is the dearth of comments that can be tagged with the discursive effects of counterspeech, abet and instigate, and gaslighting. In contrast, the discursive effect of attack is very well represented, followed, though not closely, by defend. These factors combine to make it challenging for the dataset in each language to be equally representative of each of the primary tags, which in turn makes it difficult for researchers to embark on intensive comparative analyses of the characteristics of each of these phenomena across all of the languages being analysed.
This tagset also does not allow us to distinguish a personal attack from an identity-based one, to mark national/regional or political bias, or to distinguish sexual harassment from aggression, sexual threat, and gender bias. These are shortcomings that will have to be addressed and resolved in subsequent versions of the tagset.
1.12 A.12 Metadata
The relevant links to the metadata for this dataset are provided below:
- License: CC BY-NC-SA 4.0
- Annotation Guidelines:
- Annotation Process: Manual annotation
- Dataset Quality Metrics: Krippendorff's Alpha for inter-annotator agreement (IAA); a minimal computation sketch follows this list
- Errata: NA
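For reference, Krippendorff's Alpha can be computed with the open-source krippendorff Python package. The snippet below is a minimal sketch on toy data: the integer coding of the labels and the three-annotator matrix are purely illustrative, not values from this dataset.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Toy reliability matrix: one row per annotator, one column per comment.
# Values are integer-coded labels for a single tag (coding is illustrative);
# np.nan marks comments an annotator did not label.
reliability_data = np.array([
    [0, 1, 2, 2, np.nan, 1],
    [0, 1, 2, 1, 0,      1],
    [0, 2, 2, 2, 0,      np.nan],
])

# Nominal level of measurement, since the tags are unordered categories.
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```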
1.13 A.13 Disclosures and Ethical Review
This dataset has been funded by Facebook Research under the Content Policy Research Initiative, Phase 2.
1.14 A.14 Other
NA.
1.15 A.15 Glossary
NA.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kumar, R., Ratan, S., Singh, S. et al. A multilingual, multimodal dataset of aggression and bias: the ComMA dataset. Lang Resources & Evaluation 58, 757–837 (2024). https://doi.org/10.1007/s10579-023-09696-7