
Transformers and Attention Mechanism for Website Classification and Porn Detection

  • Conference paper
New Trends in Database and Information Systems (ADBIS 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1850)


Abstract

Detecting pornographic content on the web is an important challenge in protecting users from inappropriate content. The heterogeneity of the web, the diversity of languages used, and the existence of implicit pornography, whose language cannot be detected by keywords, make this task difficult. There are very few published works on text-based web classification. In this paper, we propose a novel approach that addresses these challenges. We tackle web porn detection based on multiple pages of the same website by incorporating an attention mechanism to treat pages according to their respective importance, and we are the first to use transformers for web porn detection. Our method outperforms various other approaches that do not incorporate attention. With our multilingual solution, we achieved an accuracy of 91.59% on a hand-labeled test set for the task of porn detection.
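To make the idea of treating pages "according to their respective importance" concrete, the following is a minimal sketch of attention pooling over page embeddings. It is an illustration under stated assumptions, not the authors' architecture: the dot-product scoring, the `query` vector, the logistic classifier, and all names and toy values are hypothetical, and in practice the page vectors would come from a transformer encoder rather than being written by hand.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    page_vecs: Sequence[Sequence[float]], query: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Score each page by its dot product with a query vector (assumed to be
    learned), turn the scores into weights with softmax, and return the
    weighted sum of the page vectors together with the weights."""
    scores = [sum(p_i * q_i for p_i, q_i in zip(p, query)) for p in page_vecs]
    weights = softmax(scores)
    dim = len(query)
    pooled = [
        sum(w * p[d] for w, p in zip(weights, page_vecs)) for d in range(dim)
    ]
    return pooled, weights


def site_probability(
    pooled: Sequence[float], clf_weights: Sequence[float], bias: float
) -> float:
    """Hypothetical logistic classifier applied to the pooled website vector."""
    z = sum(x * w for x, w in zip(pooled, clf_weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy 3-dimensional "page embeddings" for one website (made-up values).
pages = [[0.1, 0.0, 0.2], [2.0, 1.5, 0.1], [0.0, 0.3, 0.1]]
query = [1.0, 1.0, 0.0]  # assumed learned attention query

pooled, weights = attention_pool(pages, query)
prob = site_probability(pooled, clf_weights=[1.0, 1.0, 0.0], bias=-1.0)
print("attention weights:", weights)
print("website-level probability:", prob)
```

In this toy setup the second page has the highest score and therefore dominates the pooled website vector, which is the behaviour an attention layer over pages is meant to provide.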




Author information


Correspondence to Lahcen Yamoun.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yamoun, L., Guessoum, Z., Girard, C. (2023). Transformers and Attention Mechanism for Website Classification and Porn Detection. In: Abelló, A., et al. New Trends in Database and Information Systems. ADBIS 2023. Communications in Computer and Information Science, vol 1850. Springer, Cham. https://doi.org/10.1007/978-3-031-42941-5_13


  • DOI: https://doi.org/10.1007/978-3-031-42941-5_13


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42940-8

  • Online ISBN: 978-3-031-42941-5

  • eBook Packages: Computer Science, Computer Science (R0)
