Abstract
The paper describes results of research on applicability of a new tool called Distributed Machine Learning Toolkit (DMTK) to detect hate speech on the Internet. For this purpose, the Word Embedding module was used, which uses the word2vec method to create a vector representation of the word. These representations were used for vector recording of entries posted on twitter and then they were subjected to classification using LightGBM, a classifier using gradient boosting methods. As a reference, in order to compare results provided by DMTK, two free of charge machine learning algorithms Gensim and GloVe were scrutinized.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Badjatiya, P., Gupta, S., Gupta, M., Varma, V.: Deep learning for hate speech detection in tweets. In: Proceedings of the 26th International Conference on World Wide Web Companion, pp. 759–760. International WWW Conferences Steering Committee (2017)
Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3(6), 1137–1155 (2003)
Bian, J., Gao, B., Liu, T.Y.: Knowledge-powered deep learning for word embedding. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 132–148. Springer, Heidelberg (2014)
Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning, pp. 160–167. ACM (2008)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detection and the problem of offensive language. In: Proceedings of the 11th International Conference on Web and Social Media (ICWSM) (2017). https://data.world/crowdflower/hate-speech-identification. Accessed 21 Jan 2018
Djuric, N., Zhou, J., Morris, R., Grbovic, M., Radosavljevic, V., Bhamidipati, N.: Hate speech detection with comment embeddings. In: Proceedings of the 24th International Conference on World Wide Web, pp. 29–30. ACM, May 2015
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics (2001)
Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., Wu, Y.: Exploring the limits of language modeling (2016). arXiv preprint arXiv:1602.02410
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013). arXiv preprint arXiv:1301.3781
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Kim, Y., Jernite, Y., Sontag, D., Rush, A.M.: Character-aware neural language models. In: AAAI, pp. 2741–2749, February 2016
Li, Y., Xu, L., Tian, F., Jiang, L., Zhong, X., Chen, E.: Word embedding revisited: a new representation learning and explicit matrix factorization perspective. In: IJCAI, pp. 3650–3656, July 2015
Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
Ross, B., Rist, M., Carbonell, G., Cabrera, B., Kurowsky, N., Wojatzki, M.: Measuring the reliability of hate speech annotations: the case of the European refugee crisis (2017). arXiv preprint arXiv:1701.08118
Rehurek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (2010)
Silva, L.A., Mondal, M., Correa, D., Benevenuto, F., Weber, I.: Analyzing the targets of hate in online social media. In: ICWSM, pp. 687–690, March 2016
Schmidt, A., Wiegand, M.: A survey on hate speech detection using natural language processing. In: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pp. 1–10 (2017)
Twitter Sentiment Analysis Training Corpus (Dataset). http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/. Accessed 21 Jan 2018
Waseem, Z., Hovy, D.: Hateful symbols or hateful people? Predictive features for hate speech detection on twitter. In: Proceedings of the NAACL Student Research Workshop, pp. 88–93 (2016)
Word2Vec algorithm. https://code.google.com/archive/p/word2vec/. Accessed 21 Jan 2018
https://www.quora.com/What-is-the-main-difference-between-word2vec-and-fastText. Accessed 15 Mar 2018
Baroni, M., Dinu, G., Kruszewski, G.: Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 238–247 (2014)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Woda, M., Torbiarczyk, M. (2019). Use of Distributed Machine Learning Toolkit for Searching Content Promoting Hate Speech on the Web. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds) Contemporary Complex Systems and Their Dependability. DepCoS-RELCOMEX 2018. Advances in Intelligent Systems and Computing, vol 761. Springer, Cham. https://doi.org/10.1007/978-3-319-91446-6_50
Download citation
DOI: https://doi.org/10.1007/978-3-319-91446-6_50
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91445-9
Online ISBN: 978-3-319-91446-6
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)