HinPage: Illegal and Harmful Webpage Identification Using Transductive Classification

Information Security and Cryptology (Inscrypt 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13837))

Abstract

With the growing popularity of the Internet, websites can make significant profit by hosting illegal and harmful content, such as violence, pornography, illegal gambling, and drug abuse. Such content is a serious threat to a safe and secure Internet, and it is especially harmful to the underage population. Government agencies, ISPs, network administrators at various levels, and parents have been seeking accurate and robust solutions to block such illegal and harmful webpages. Existing solutions detect inappropriate pages based on content, e.g., using keyword matching or content-based image classification. They can be easily evaded by altering the internal format of texts or images, e.g., by mixing different alphabets. In this paper, we propose to utilize relatively stable features extracted from the relationships between the targeted illegal/harmful webpages to discover and identify illegal webpages. We introduce a new mechanism, namely HinPage, that utilizes such features for the robust identification of PG (pornographic and gambling) pages. HinPage models the candidate PG pages and the resources on the pages with a heterogeneous information network (HIN). A transductive classification algorithm is then applied to the HIN to identify PG pages.

Through experiments on 10,033 candidate PG pages, we demonstrate that HinPage achieves an accuracy of 83.5% on PG page identification. In particular, it is able to identify illegal/harmful PG pages that cannot be recognized by state-of-the-art commercial products.
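
The abstract does not spell out HinPage's algorithm, so the following is only a minimal sketch of the general idea of transductive classification over pages linked by shared resources. It is not the authors' method. It assumes a simple page-resource-page affinity (the number of resources two pages share) and swaps in the well-known local and global consistency label propagation rule of Zhou et al. (2004). The page names, resource identifiers, seed labels, and the parameters `alpha` and `iters` are all hypothetical.

```python
from math import sqrt


def page_affinity(page_resources):
    """Build a page-page affinity matrix along the page-resource-page path.

    W[i][j] is the number of resources shared by pages i and j (assumed measure).
    """
    pages = sorted(page_resources)
    n = len(pages)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(page_resources[pages[i]] & page_resources[pages[j]])
            W[i][j] = W[j][i] = float(shared)
    return pages, W


def propagate_labels(W, seeds, n_classes=2, alpha=0.8, iters=50):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y with S = D^-1/2 W D^-1/2."""
    n = len(W)
    deg = [sum(row) for row in W]
    # A nonzero W[i][j] implies both degrees are positive, so the division is safe.
    S = [[W[i][j] / sqrt(deg[i] * deg[j]) if W[i][j] else 0.0
          for j in range(n)] for i in range(n)]
    # Y holds the known labels as one-hot rows; unlabeled pages stay all-zero.
    Y = [[0.0] * n_classes for _ in range(n)]
    for idx, label in seeds.items():
        Y[idx][label] = 1.0
    F = [row[:] for row in Y]
    for _ in range(iters):
        F = [[alpha * sum(S[i][k] * F[k][c] for k in range(n))
              + (1 - alpha) * Y[i][c]
              for c in range(n_classes)] for i in range(n)]
    return F


def classify_pages(page_resources, seed_labels):
    """Return a predicted class for every page, labeled or not."""
    pages, W = page_affinity(page_resources)
    seeds = {pages.index(p): lab for p, lab in seed_labels.items()}
    F = propagate_labels(W, seeds)
    return {p: max(range(len(F[i])), key=lambda c: F[i][c])
            for i, p in enumerate(pages)}


# Hypothetical toy data: class 1 = PG page, class 0 = benign page.
page_resources = {
    "a.example": {"img:1", "js:x"},
    "b.example": {"img:1", "js:x", "img:2"},
    "c.example": {"img:2"},
    "d.example": {"css:news", "js:cdn"},
    "e.example": {"css:news"},
    "f.example": {"js:cdn", "css:news"},
}
seed_labels = {"a.example": 1, "d.example": 0}
predictions = classify_pages(page_resources, seed_labels)
print(predictions)
```

In this toy network, `c.example` shares nothing with the labeled PG seed directly but is reached through `b.example`, which illustrates why relationship-based features can still label a page whose own content gives no signal. A page that shares no resource with any other page receives no propagated score and falls back to class 0 in this sketch.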


Acknowledgment

This work is supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDC02030000), the National Key Research and Development Program of China (No. 2021YFB3101403), and the National Key R&D Program 2021 (Grant No. 2021YFB3101001).

Author information

Corresponding author

Correspondence to Lingjing Yu.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Y., Yu, L., Liu, Q. (2023). HinPage: Illegal and Harmful Webpage Identification Using Transductive Classification. In: Deng, Y., Yung, M. (eds) Information Security and Cryptology. Inscrypt 2022. Lecture Notes in Computer Science, vol 13837. Springer, Cham. https://doi.org/10.1007/978-3-031-26553-2_20

  • DOI: https://doi.org/10.1007/978-3-031-26553-2_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-26552-5

  • Online ISBN: 978-3-031-26553-2

  • eBook Packages: Computer Science, Computer Science (R0)
