DOI: 10.1145/3522749.3523076
research-article

A Survey of Modern Crawler Methods

Published: 13 April 2022

ABSTRACT

A web crawler is a computer program that browses the World Wide Web (WWW) automatically and efficiently. In the information age, the explosive growth of Internet pages has made searching for information exceedingly difficult and time-consuming, so people rely on crawlers to retrieve the information they need. This paper briefly summarizes existing crawler technology so that beginners and others interested in the field can gain a preliminary understanding of crawlers. It first introduces the background of crawler technology, along with its classification and uses. It then summarizes some common libraries and frameworks of modern crawlers in tables, making them easy for readers to compare. Next, it reviews the literature on modern crawler technology and describes how crawler applications are built with a library or framework. Finally, it briefly explains practical applications of crawler technology.

  • Published in

    CCEAI '22: Proceedings of the 6th International Conference on Control Engineering and Artificial Intelligence
    March 2022
    130 pages
    ISBN:9781450385916
    DOI:10.1145/3522749

    Copyright © 2022 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited