Abstract
With the growing popularity of the Internet, websites can make significant profits by hosting illegal and harmful content, such as violence, sexual content, illegal gambling, and drug abuse. Such websites are a serious threat to a safe and secure Internet, and they are especially harmful to minors. Government agencies, ISPs, network administrators at various levels, and parents have been seeking accurate and robust solutions to block such illegal and harmful webpages. Existing solutions detect inappropriate pages based on content, e.g., using keyword matching or content-based image classification. They can be easily evaded by altering the internal format of text or images, e.g., by mixing different alphabets. In this paper, we propose to utilize relatively stable features extracted from the relationships between the targeted illegal/harmful webpages to discover and identify illegal webpages. We introduce a new mechanism, namely HinPage, that utilizes such features for the robust identification of PG (pornographic and gambling) pages. HinPage models the candidate PG pages and the resources on the pages with a heterogeneous information network (HIN). A transductive classification algorithm is then applied to the HIN to identify PG pages.
Through experiments on 10,033 candidate PG pages, we demonstrate that HinPage achieves an accuracy of 83.5% on PG page identification. In particular, it is able to identify illegal/harmful PG pages that cannot be recognized by state-of-the-art commercial products.
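To make the idea of transductive classification over page and resource relationships concrete, the following is a minimal sketch and not the HinPage algorithm itself, which the abstract does not specify. It assumes a simple two-type network (pages and the resources they embed), derives page-to-page affinity from shared resources, and spreads a few seed labels with the standard local-and-global-consistency label propagation rule. The page names, resource names, the `alpha` value, and both function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_page_affinity(page_resources):
    """Affinity W[a][b] = number of resources shared by pages a and b.

    This corresponds to a page-resource-page path in a two-type network.
    """
    pages = sorted(page_resources)
    resource_to_pages = defaultdict(set)
    for page, resources in page_resources.items():
        for resource in resources:
            resource_to_pages[resource].add(page)

    W = {page: defaultdict(float) for page in pages}
    for linked_pages in resource_to_pages.values():
        for a in linked_pages:
            for b in linked_pages:
                if a != b:
                    W[a][b] += 1.0
    return pages, W


def propagate_labels(page_resources, seeds, alpha=0.8, iterations=50):
    """Spread seed labels (0 = benign, 1 = PG) to unlabeled pages.

    Iterates F <- alpha * S * F + (1 - alpha) * Y, where S is the
    symmetrically normalized affinity matrix and Y holds the seed labels.
    """
    pages, W = build_page_affinity(page_resources)
    degree = {p: sum(W[p].values()) for p in pages}

    # Y[p] = [benign score, PG score]; unlabeled pages start at zero.
    Y = {p: [0.0, 0.0] for p in pages}
    for page, label in seeds.items():
        Y[page][label] = 1.0

    F = {p: list(Y[p]) for p in pages}
    for _ in range(iterations):
        updated = {}
        for p in pages:
            acc = [0.0, 0.0]
            for q, weight in W[p].items():
                s = weight / (degree[p] * degree[q]) ** 0.5
                acc[0] += s * F[q][0]
                acc[1] += s * F[q][1]
            updated[p] = [alpha * acc[k] + (1 - alpha) * Y[p][k] for k in range(2)]
        F = updated

    # Pages that no seed can reach keep equal scores and default to benign.
    return {p: (1 if F[p][1] > F[p][0] else 0) for p in pages}


# Hypothetical toy data: three pages share resources with a known PG page,
# two pages share a stylesheet with a known benign page.
example_pages = {
    "pageA": {"img1.png", "stat.js"},
    "pageB": {"img1.png", "pay.js"},
    "pageC": {"pay.js"},
    "pageD": {"news.css"},
    "pageE": {"news.css", "logo.svg"},
}
example_seeds = {"pageA": 1, "pageD": 0}
print(propagate_labels(example_pages, example_seeds))
```

In this toy network, labels reach unlabeled pages only through shared resources, which reflects the motivation stated above: changing the wording or pixels of a page does not change which resources it shares with other pages.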
Acknowledgment
This work is supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDC02030000), the National Key Research and Development Program of China (No. 2021YFB3101403), and the National Key R&D Program 2021 (Grant No. 2021YFB3101001).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, Y., Yu, L., Liu, Q. (2023). HinPage: Illegal and Harmful Webpage Identification Using Transductive Classification. In: Deng, Y., Yung, M. (eds) Information Security and Cryptology. Inscrypt 2022. Lecture Notes in Computer Science, vol 13837. Springer, Cham. https://doi.org/10.1007/978-3-031-26553-2_20
DOI: https://doi.org/10.1007/978-3-031-26553-2_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26552-5
Online ISBN: 978-3-031-26553-2
eBook Packages: Computer Science, Computer Science (R0)