research-article

Improving the Detection of Multilingual South African Abusive Language via Skip-gram Using Joint Multilevel Domain Adaptation

Published: 08 February 2024

Abstract

Multilingual South African abusive language is distinctive, sparse, and low-resource, so detecting its different classes automatically with machine learning calls for a novel solution. Skip-gram has been used to address sparsity in machine learning classification problems, but it is inadequate for detecting South African abusive language because of the large number of rare features and the class imbalance. Joint Domain Adaptation enlarges the feature set of a low-resource target domain by learning jointly from the target domain and a large-resource source domain, which improves classification outcomes. This article therefore builds a Skip-gram model based on Joint Domain Adaptation to improve the detection of multilingual South African abusive language. In contrast to existing Joint Domain Adaptation approaches, the proposed Joint Multilevel Domain Adaptation model works at two levels: the first level adapts monolingual source-domain instances and multilingual target-domain instances that have a high frequency of rare features, and the second level adapts the target-domain features and the first-level features. Both surface-level and embedding word features were used to evaluate the proposed model. With surface-level features, the Joint Multilevel Domain Adaptation model outperformed the state-of-the-art models with an accuracy of 0.92 and an F1-score of 0.68. With embedding features, it outperformed the state-of-the-art models with an accuracy of 0.88 and an F1-score of 0.64. The model also significantly improved the average information gain of the rare features in the different language categories and reduced class imbalance.
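The abstract does not spell out how the surface-level features or the information gain are computed, so the following is only a minimal sketch under stated assumptions: surface-level Skip-gram features are taken to be k-skip-bigrams (pairs of tokens separated by at most k skipped tokens), and information gain is the usual entropy-based measure for a binary feature. The function names `skip_bigrams`, `entropy`, and `information_gain`, and the toy tokens and labels, are hypothetical and are not taken from the article.

```python
from collections import Counter
from math import log2


def skip_bigrams(tokens, k=2):
    """Return every ordered token pair with at most k tokens skipped between them.

    Assumption: this k-skip-bigram definition stands in for the article's
    surface-level Skip-gram features, which the abstract does not define.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # j may be at most k + 1 positions to the right of i.
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_present, labels):
    """Information gain of a binary feature: H(Y) - H(Y | feature)."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for f, y in zip(feature_present, labels) if f == value]
        if subset:
            gain -= (len(subset) / total) * entropy(subset)
    return gain


# Toy usage with neutral placeholder tokens and hypothetical labels.
tokens = ["a", "b", "c", "d"]
one_skip = skip_bigrams(tokens, k=1)

labels = [1, 1, 0, 0]
informative = information_gain([True, True, False, False], labels)
uninformative = information_gain([True, False, True, False], labels)
```

With `k=0` the function reduces to ordinary bigrams, which shows why skipping helps with sparsity: the same short text yields more candidate features. A rare feature that occurs in very few instances tends to have low information gain, which is the quantity the abstract reports as improved after adaptation.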


Cited By

  • (2024) Machine Learning Techniques for Identifying Child Abusive Texts in Online Platforms. 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1-6. DOI: 10.1109/ICCCNT61001.2024.10724830. Online publication date: 24-Jun-2024
  • (2024) Benchmarking Political Bias Classification with In-Context Learning: Insights from GPT-3.5, GPT-4o, LLaMA-3, and Gemma-2. Artificial Intelligence Research, 161-175. DOI: 10.1007/978-3-031-78255-8_10. Online publication date: 26-Nov-2024

      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 23, Issue 2
      February 2024
      340 pages
      EISSN:2375-4702
      DOI:10.1145/3613556

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 08 February 2024
      Online AM: 28 December 2023
      Accepted: 24 December 2023
      Revised: 08 November 2023
      Received: 31 July 2022
      Published in TALLIP Volume 23, Issue 2

      Author Tags

      1. Text classification
      2. South African abusive language
      3. Skip-gram
      4. Joint Domain Adaptation

      Qualifiers

      • Research-article
