Abstract
Detecting pornographic content on the web is an important challenge in protecting users from inappropriate content. The heterogeneity of the web, the diversity of languages used, and the existence of implicit pornography, whose language cannot be detected by keywords, make this task difficult. There are very few published works on text-based web classification. In this paper, we propose a novel approach that addresses these challenges. We tackle web porn detection based on multiple pages of the same website, incorporating an attention mechanism that weights pages according to their respective importance, and we are the first to use transformers for web porn detection. Our method outperforms various other approaches that do not incorporate attention. With our multilingual solution, we achieved an accuracy of 91.59% on a hand-labeled test set for the task of porn detection.
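The idea of weighting several pages of one website by importance can be illustrated with a minimal attention-pooling sketch. This is not the architecture from the paper, which the abstract does not specify. It assumes each page has already been encoded into a fixed-size vector (for example by a transformer), and it uses a hypothetical query vector, which would be learned in practice, to score the pages. The names `attention_pool`, `pages`, and `query` are illustrative only.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(page_vectors, query):
    """Combine per-page vectors into one website vector.

    Each page is scored by its dot product with `query` (assumed to be a
    learned parameter in a real model); the scores become softmax weights,
    and the website vector is the weighted sum of the page vectors.
    """
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in page_vectors]
    weights = softmax(scores)
    dim = len(page_vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, page_vectors))
        for d in range(dim)
    ]
    return pooled, weights


# Toy example: three page embeddings of dimension 2 (made-up values).
pages = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
query = [1.0, 1.0]  # hypothetical; would be trained with the classifier
site_vector, page_weights = attention_pool(pages, query)
```

In such a setup, the pooled `site_vector` would be passed to a classifier, and `page_weights` indicates how much each page contributed to the website-level decision.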
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yamoun, L., Guessoum, Z., Girard, C. (2023). Transformers and Attention Mechanism for Website Classification and Porn Detection. In: Abelló, A., et al. New Trends in Database and Information Systems. ADBIS 2023. Communications in Computer and Information Science, vol 1850. Springer, Cham. https://doi.org/10.1007/978-3-031-42941-5_13
DOI: https://doi.org/10.1007/978-3-031-42941-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42940-8
Online ISBN: 978-3-031-42941-5
eBook Packages: Computer Science, Computer Science (R0)