ABSTRACT
With the rapid development of the Internet, webpages containing abusive content such as pornography and gambling have emerged in an endless stream. These webpages use various techniques to evade traditional detection methods, seriously degrading the Internet environment. Accurately identifying them is therefore becoming increasingly important. To address this problem, this paper combines text recognition and text classification to propose an abused webpage detection method based on screenshots, which detects and classifies webpages efficiently by capturing the information actually visible to the user. A comparative experiment against a traditional web crawler method verifies the accuracy and advantages of the proposed method. This work provides technical support for fighting illegal activities and cleaning up the Internet environment.
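The two-stage idea described above (recognize the text visible in a screenshot, then classify that text) can be illustrated with a minimal sketch. It is not the paper's implementation: the keyword lists, the `min_hits` threshold, and the function names `classify_text` and `detect_from_screenshot` are all assumptions made for illustration, and the keyword count stands in for whatever trained text classifier the authors use. The text recognition step is passed in as a callable so that any OCR engine could be plugged in.

```python
import re
from collections import Counter
from typing import Callable, Dict, Set

# Hypothetical keyword lists, for illustration only; a real system
# would use a trained text classifier instead of keyword counts.
CATEGORY_KEYWORDS: Dict[str, Set[str]] = {
    "gambling": {"casino", "bet", "jackpot", "poker"},
    "pornography": {"adult", "xxx", "porn"},
}


def classify_text(text: str, min_hits: int = 2) -> str:
    """Return the best-matching category, or "normal" below min_hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Score each category by how often its keywords occur in the text.
    scores = {
        category: sum(counts[word] for word in words)
        for category, words in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else "normal"


def detect_from_screenshot(
    screenshot: bytes, recognize_text: Callable[[bytes], str]
) -> str:
    """Run text recognition on a screenshot, then classify the result.

    recognize_text is an assumed hook for an OCR engine: it takes raw
    image bytes and returns the text visible in the image.
    """
    return classify_text(recognize_text(screenshot))
```

Because the classifier only sees what the recognition step extracts from the rendered page, text that is drawn inside images or injected by scripts is covered in the same way as ordinary HTML text, which is the advantage over crawling the page source that the abstract points to.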
Index Terms
- An Abused Webpage Detection Method Based on Screenshots Text Recognition