Use of Distributed Machine Learning Toolkit for Searching Content Promoting Hate Speech on the Web

Woda, Marek; Torbiarczyk, Mateusz

doi:10.1007/978-3-319-91446-6_50

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 761)

Included in the following conference series:

International Conference on Dependability and Complex Systems


Abstract

This paper reports the results of research on the applicability of a new tool, the Distributed Machine Learning Toolkit (DMTK), to detecting hate speech on the Internet. For this purpose, the toolkit's Word Embedding module was used, which applies the word2vec method to create vector representations of words. These representations were used to encode entries posted on Twitter as vectors, which were then classified with LightGBM, a classifier based on gradient boosting. As a reference for comparison with the results provided by DMTK, two free machine learning tools, Gensim and GloVe, were also examined.
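The step from word vectors to a vector for a whole tweet can be illustrated with a minimal sketch. The abstract does not say how word vectors are combined, so the averaging used here is an assumption, and the toy embeddings, the function name `tweet_vector`, and the example texts are all hypothetical. In a full pipeline, the resulting feature vectors would be passed to a gradient boosting classifier such as LightGBM; that step is omitted here.

```python
from typing import Dict, List

# Toy 3-dimensional embeddings. These values are hypothetical and are NOT
# taken from the paper; a real run would load vectors trained with word2vec.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "good": [1.0, 0.0, 0.0],
    "day": [0.0, 1.0, 0.0],
    "bad": [0.0, 0.0, 1.0],
}


def tweet_vector(text: str, embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of all known words in the text.

    Averaging is an assumed aggregation strategy. Words missing from the
    vocabulary are skipped; a text with no known words maps to a zero vector.
    """
    tokens = text.lower().split()
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Each tweet becomes one fixed-length feature row for a downstream classifier.
features = [tweet_vector(t, TOY_EMBEDDINGS, 3) for t in ["Good day", "bad day today"]]
```

Averaging keeps the feature length fixed regardless of tweet length, which is what a tree-based classifier expects, at the cost of discarding word order.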



Author information

Authors and Affiliations

Department of Computer Engineering, Wrocław University of Technology, Janiszewskiego 11-17, 50-372, Wrocław, Poland
Marek Woda & Mateusz Torbiarczyk

Authors

Marek Woda
Mateusz Torbiarczyk

Corresponding author

Correspondence to Marek Woda.

Editor information

Editors and Affiliations

Department of Computer Engineering, Wrocław University of Technology, Wrocław, Poland
Wojciech Zamojski
Department of Computer Engineering, Wrocław University of Technology, Wrocław, Poland
Jacek Mazurkiewicz
Department of Computer Engineering, Wrocław University of Technology, Wrocław, Poland
Jarosław Sugier
Department of Computer Engineering, Wrocław University of Technology, Wrocław, Poland
Tomasz Walkowiak
Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland
Janusz Kacprzyk


About this paper

Cite this paper

Woda, M., Torbiarczyk, M. (2019). Use of Distributed Machine Learning Toolkit for Searching Content Promoting Hate Speech on the Web. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds) Contemporary Complex Systems and Their Dependability. DepCoS-RELCOMEX 2018. Advances in Intelligent Systems and Computing, vol 761. Springer, Cham. https://doi.org/10.1007/978-3-319-91446-6_50


DOI: https://doi.org/10.1007/978-3-319-91446-6_50
Published: 27 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91445-9
Online ISBN: 978-3-319-91446-6
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
