research-article

An Abused Webpage Detection Method Based on Screenshots Text Recognition

Authors:
Yan-Ming Huang, Jinan University, Guangzhou, China
Dong-Jie Liu, Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
Zhi-Wei Yan, China Internet Network Information Center, Beijing, China
Yan-Ming Zhang, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Guang-Gang Geng, Jinan University, Guangzhou, China

ACM ICEA '21: Proceedings of the 2021 ACM International Conference on Intelligent Computing and its Emerging Applications, December 2021, Pages 106–110
https://doi.org/10.1145/3491396.3506562

Published: 07 January 2022

ABSTRACT

With the rapid development of the Internet, webpages containing abusive content such as pornography and gambling keep emerging. These webpages use various techniques to evade traditional detection methods and seriously degrade the Internet environment. Accurately identifying them is therefore becoming increasingly important. To address this problem, this paper combines text recognition and text classification to propose a screenshot-based abused webpage detection method, which detects and classifies webpages efficiently by acquiring the information actually visible to the user. The paper also runs a comparative experiment against the traditional web crawler method, which verifies the accuracy and advantage of the proposed method. This work provides technical support for fighting illegal activities and cleaning up the Internet environment.
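
The pipeline outlined in the abstract (screenshot, then text recognition, then text classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the recognition or classification models, so the recognizer here is a pluggable callable, `fake_ocr` is a stand-in that simply decodes bytes as text, and the keyword lists, the `min_hits` threshold, and all function names are hypothetical choices made for illustration.

```python
import re
from typing import Callable, Dict, List, Set

# Hypothetical keyword lists, for illustration only; the abstract does not
# describe the classifier the authors actually trained.
ABUSE_KEYWORDS: Dict[str, Set[str]] = {
    "gambling": {"casino", "bet", "jackpot", "poker"},
    "pornography": {"porn", "xxx", "nude"},
}


def classify_text(text: str, min_hits: int = 2) -> str:
    """Return the category with the most keyword hits, or "normal"."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {
        label: sum(1 for tok in tokens if tok in words)
        for label, words in ABUSE_KEYWORDS.items()
    }
    best = max(scores, key=lambda label: scores[label])
    # Below the (assumed) threshold, treat the page as benign.
    return best if scores[best] >= min_hits else "normal"


def detect_page(
    screenshot: bytes, recognize_text: Callable[[bytes], List[str]]
) -> str:
    """Screenshot -> recognized text lines -> category label."""
    lines = recognize_text(screenshot)
    return classify_text(" ".join(lines))


def fake_ocr(screenshot: bytes) -> List[str]:
    """Stand-in recognizer: treats the bytes as UTF-8 text, one line per row."""
    return screenshot.decode("utf-8").splitlines()


if __name__ == "__main__":
    page = b"Welcome to the casino\nBet now and win the jackpot"
    print(detect_page(page, fake_ocr))
```

In a real system, `fake_ocr` would be replaced by an actual text detection and recognition model run on a rendered screenshot, and `classify_text` by a trained text classifier. Passing the recognizer in as a callable keeps the classification stage independent of whichever recognition model is used.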

Index Terms

An Abused Webpage Detection Method Based on Screenshots Text Recognition
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
    2. Natural language processing
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection
  2. World Wide Web
    1. Web searching and information discovery


Published in

ACM ICEA '21: Proceedings of the 2021 ACM International Conference on Intelligent Computing and its Emerging Applications
December 2021
241 pages
ISBN:9781450391603
DOI:10.1145/3491396

Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 7 January 2022
Author Tags
Abused Webpages Detection
Text Classification
Text Detection
Text Recognition
Qualifiers
- research-article
- Research
- Refereed limited
