HinPage: Illegal and Harmful Webpage Identification Using Transductive Classification

Li, Yunfan; Yu, Lingjing; Liu, Qingyun

doi:10.1007/978-3-031-26553-2_20

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13837)

Included in the following conference series:

International Conference on Information Security and Cryptology

Abstract

With the growing popularity of the Internet, websites can make significant profits by hosting illegal and harmful content, such as violent or sexual material, illegal gambling, and drug abuse. Such websites are a serious threat to a safe and secure Internet, and they are especially harmful to the underage population. Government agencies, ISPs, network administrators at various levels, and parents have been seeking accurate and robust solutions to block such illegal and harmful webpages. Existing solutions detect inappropriate pages based on content, e.g., using keyword matching or content-based image classification. These solutions can be easily evaded by altering the internal format of texts or images, e.g., by mixing different alphabets. In this paper, we propose to utilize relatively stable features extracted from the relationships between the targeted illegal/harmful webpages to discover and identify illegal webpages. We introduce a new mechanism, namely HinPage, that utilizes such features for the robust identification of PG (pornographic and gambling) pages. HinPage models the candidate PG pages and the resources on the pages with a heterogeneous information network (HIN). A transductive classification algorithm is then applied to the HIN to identify PG pages.

Through experiments on 10,033 candidate PG pages, we demonstrate that HinPage achieves an accuracy of 83.5% on PG page identification. In particular, it is able to identify illegal/harmful PG pages that cannot be recognized by state-of-the-art (SOTA) commercial products.
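The general idea of modeling pages and their resources as a HIN and then classifying transductively can be illustrated with a small sketch. The abstract does not specify HinPage's network schema or its classification algorithm, so everything below is an assumption made for illustration: the toy pages and resource names are hypothetical, page similarity is computed with PathSim over a page-resource-page meta-path, and labels are spread with the closed-form "local and global consistency" label propagation. It should not be read as the authors' implementation.

```python
import numpy as np

# Hypothetical toy data: each page lists the resources it embeds.
page_resources = {
    "p0": {"imgA", "jsX"},
    "p1": {"imgA", "jsX", "imgB"},
    "p2": {"imgB", "jsX"},
    "p3": {"cssQ", "imgC"},
    "p4": {"imgC", "cssQ", "fontZ"},
}
pages = sorted(page_resources)
resources = sorted({r for rs in page_resources.values() for r in rs})

# Page-by-resource incidence matrix of the (two-type) network.
incidence = np.array(
    [[1 if r in page_resources[p] else 0 for r in resources] for p in pages]
)

def pathsim(incidence):
    """PathSim similarity along the page-resource-page meta-path."""
    commuting = incidence @ incidence.T  # counts of shared resources
    diag = np.diag(commuting)
    denom = diag[:, None] + diag[None, :]
    sim = np.zeros(commuting.shape, dtype=float)
    np.divide(2.0 * commuting, denom, out=sim, where=denom > 0)
    return sim

def propagate_labels(sim, labels, alpha=0.8):
    """Closed-form label propagation: solve (I - alpha * S) F = Y.

    labels: +1 for known PG pages, -1 for known benign pages, 0 for unlabeled.
    Returns one real-valued score per page; the sign is the predicted class.
    """
    w = sim.copy()
    np.fill_diagonal(w, 0.0)  # no self-loops
    degree = w.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    np.divide(1.0, np.sqrt(degree), out=inv_sqrt, where=degree > 0)
    s = inv_sqrt[:, None] * w * inv_sqrt[None, :]  # symmetric normalization
    n = len(labels)
    return np.linalg.solve(np.eye(n) - alpha * s, labels.astype(float))

# Assumed seed labels: p0 is a known PG page, p3 is a known benign page.
labels = np.array([1, 0, 0, -1, 0])
sim = pathsim(incidence)
scores = propagate_labels(sim, labels)

for page, score in zip(pages, scores):
    print(page, "PG" if score > 0 else "benign", round(float(score), 3))
```

In this toy setup the unlabeled pages p1 and p2 share resources with the PG seed p0 and receive positive scores, while p4 shares resources only with the benign seed p3 and receives a negative score. The classification is transductive in the sense that unlabeled pages are scored jointly with the labeled ones over a single fixed graph, rather than by a model trained separately and applied to unseen pages.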

Acknowledgment

This work is supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDC02030000), the National Key Research and Development Program of China (No. 2021YFB3101403), and the National Key R&D Program 2021 (Grant No. 2021YFB3101001).

Author information

Authors and Affiliations

Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Yunfan Li, Lingjing Yu & Qingyun Liu
National Engineering Laboratory for Information Security Technologies, Beijing, China
Yunfan Li, Lingjing Yu & Qingyun Liu
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Yunfan Li

Corresponding author

Correspondence to Lingjing Yu.

Editor information

Editors and Affiliations

Institute of Information Engineering, CAS, Beijing, China
Yi Deng
Columbia University, New York, NY, USA
Moti Yung

About this paper

Cite this paper

Li, Y., Yu, L., Liu, Q. (2023). HinPage: Illegal and Harmful Webpage Identification Using Transductive Classification. In: Deng, Y., Yung, M. (eds) Information Security and Cryptology. Inscrypt 2022. Lecture Notes in Computer Science, vol 13837. Springer, Cham. https://doi.org/10.1007/978-3-031-26553-2_20

DOI: https://doi.org/10.1007/978-3-031-26553-2_20
Published: 19 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26552-5
Online ISBN: 978-3-031-26553-2
eBook Packages: Computer Science, Computer Science (R0)
