ABSTRACT
A web crawler is a computer program that browses the World Wide Web (WWW) automatically and efficiently. In the information age, the explosive growth in the number of Internet pages has made searching for information exceedingly difficult and time-consuming, so people rely on crawlers to retrieve the information they need. This paper briefly summarizes existing crawler technology so that beginners and others interested in the field can gain a preliminary understanding of crawlers. It first introduces the background of crawler technology, its classification, and its uses. It then summarizes common libraries and frameworks for modern crawlers in tables, which makes them easy for readers to compare. Next, it reviews the literature on modern crawler technology and describes how crawler applications are built with these libraries and frameworks. The practical applications of crawler technology are also briefly explained.