A Survey of Modern Crawler Methods

Author:
Zixiang Chang

School of Advanced Technology, Xi'an Jiaotong-Liverpool University, China

CCEAI '22: Proceedings of the 6th International Conference on Control Engineering and Artificial Intelligence, March 2022, Pages 21–28, https://doi.org/10.1145/3522749.3523076

Published: 13 April 2022

ABSTRACT

A web crawler is a computer program that browses the World Wide Web (WWW) automatically and efficiently. In the information age, the explosive growth in the number of Internet pages has made searching for information exceedingly difficult and time-consuming, so people rely on crawlers to obtain the information they need. This paper briefly summarizes existing crawler technology so that beginners and others interested in the field can gain a preliminary understanding of crawlers. It first introduces the background of crawler technology, its classification, and its uses. It then summarizes common libraries and frameworks of modern crawlers in tables, which makes them easy for readers to compare. Next, it reviews the literature on modern crawler technology and describes how crawler applications are built with these libraries and frameworks. Finally, it briefly explains practical applications of crawler technology.

Published in
CCEAI '22: Proceedings of the 6th International Conference on Control Engineering and Artificial Intelligence
March 2022
130 pages
ISBN: 9781450385916
DOI: 10.1145/3522749
Editors: Dan Zhang, Marek Ogiela
Copyright © 2022 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
crawler framework
crawler library
modern crawler