Use of Natural Language Processing to Identify Inappropriate Content in Text

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11734)

Abstract

The rapid development of communication through new technology media such as social networks and mobile phones has improved our lives. However, it also brings collateral problems such as insults and abusive comments. In this work, we address the problem of detecting violent content in text documents using Natural Language Processing techniques. Following a Machine Learning approach, we trained six models resulting from the combinations of two text encoders, Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), with three classifiers: Logistic Regression, Support Vector Machines and Naïve Bayes. We also assessed StarSpace, a Deep Learning approach proposed by Facebook, configured to use Hit@1 accuracy. We evaluated these seven alternatives on two publicly available datasets from the Wikipedia Detox Project: Attack and Aggression. StarSpace achieved accuracies of 0.938 and 0.937 on these datasets, respectively, making it the recommended algorithm for detecting violent content in text documents among the alternatives evaluated.
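The six encoder/classifier combinations described in the abstract could be set up roughly as follows. This is a minimal sketch, not the authors' implementation: it assumes scikit-learn, uses LinearSVC as a stand-in for the Support Vector Machine and MultinomialNB for Naïve Bayes, keeps default hyperparameters, and trains on a handful of made-up comments rather than the Wikipedia Detox data.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, invented comments (1 = attack, 0 = not an attack).
# These are illustrative only and are NOT taken from the Attack/Aggression datasets.
train_texts = [
    "you are a stupid idiot",
    "shut up you worthless fool",
    "nobody wants you here, idiot",
    "you are pathetic and stupid",
    "thanks for fixing the citation",
    "I added a source to the article",
    "could you review my latest edit",
    "great work on the history section",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["what a stupid fool", "thanks for the review"]
test_labels = [1, 0]

# Two text encoders: Bag of Words (raw counts) and TF-IDF weights.
encoders = {"BoW": CountVectorizer, "TF-IDF": TfidfVectorizer}

# Three classifiers; the concrete estimator classes are assumptions of this sketch.
classifiers = {
    "LogisticRegression": lambda: LogisticRegression(max_iter=1000),
    "LinearSVM": lambda: LinearSVC(),
    "NaiveBayes": lambda: MultinomialNB(),
}

# Train and score every encoder/classifier pair (2 x 3 = 6 models).
results = {}
for (enc_name, make_enc), (clf_name, make_clf) in product(
    encoders.items(), classifiers.items()
):
    model = make_pipeline(make_enc(), make_clf())
    model.fit(train_texts, train_labels)
    results[(enc_name, clf_name)] = model.score(test_texts, test_labels)

for (enc_name, clf_name), accuracy in sorted(results.items()):
    print(f"{enc_name:7s} + {clf_name:18s} accuracy = {accuracy:.3f}")
```

Wrapping each encoder and classifier in a pipeline keeps vocabulary fitting inside the training step, so the test comments never influence the features. StarSpace is not covered here because it is a separate tool with its own training procedure.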

Notes

  1. https://research.fb.com/downloads/starspace/
  2. http://archive.ics.uci.edu/ml/index.php
  3. https://kaggle.com
  4. https://github.com/facebookresearch/StarSpace
  5. https://developers.google.com/freebase/
  6. https://www.anaconda.com/distribution/
  7. https://scikit-learn.org/stable/
  8. https://pandas.pydata.org/
  9. http://www.numpy.org/
  10. https://meta.wikimedia.org/wiki/Research:Detox

Author information

Corresponding author

Correspondence to Víctor González-Castro.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Merayo-Alba, S., Fidalgo, E., González-Castro, V., Alaiz-Rodríguez, R., Velasco-Mata, J. (2019). Use of Natural Language Processing to Identify Inappropriate Content in Text. In: Pérez García, H., Sánchez González, L., Castejón Limas, M., Quintián Pardo, H., Corchado Rodríguez, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2019. Lecture Notes in Computer Science, vol 11734. Springer, Cham. https://doi.org/10.1007/978-3-030-29859-3_22

  • DOI: https://doi.org/10.1007/978-3-030-29859-3_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-29858-6

  • Online ISBN: 978-3-030-29859-3

  • eBook Packages: Computer Science, Computer Science (R0)
