A Hybrid Approach for Recognizing Web Crawlers

  • Conference paper
Wireless Algorithms, Systems, and Applications (WASA 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11604)

Abstract

In recent years, web crawlers have been widely used to collect data from the Internet. Accurately recognizing web crawlers makes it possible to make better use of friendly crawlers while blocking malicious ones. Existing research on web crawler recognition has difficulty handling newer kinds of crawlers, such as distributed crawlers, proxy-based crawlers, and browser-engine-based crawlers. Moreover, it is non-trivial to achieve high identification accuracy and fast response at the same time. To tackle these issues, we propose a novel approach to web crawler recognition that combines real-time recognition methods based on heuristic rules with offline recognition methods based on machine learning. This combination addresses the aforementioned problems and improves both accuracy and efficiency. We build a website and analyze its web access log using the proposed method. According to the results, the proposed approach achieves desirable performance in both accuracy and efficiency.
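
The abstract does not give the concrete rules or features, so the following is only a minimal sketch of how a two-stage pipeline of this kind might be organized: a cheap per-request heuristic check for the real-time stage, and session reconstruction plus feature extraction for the offline stage. Everything specific here is an assumption made for illustration and is not taken from the paper: the pattern list, the `/robots.txt` rule, the 30-minute session timeout, the feature names, and the helper names `realtime_check`, `sessionize`, and `session_features`. The classifier that would consume the features is left out.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical user-agent patterns; a deployment might load a maintained list instead.
KNOWN_CRAWLER_PATTERNS = [r"Googlebot", r"bingbot", r"python-requests", r"PhantomJS"]
CRAWLER_RE = re.compile("|".join(KNOWN_CRAWLER_PATTERNS), re.IGNORECASE)

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed inactivity threshold


@dataclass
class Request:
    ip: str
    user_agent: str
    timestamp: float  # seconds since the epoch
    path: str


def realtime_check(req: Request) -> bool:
    """Stage 1 (real time): return True if a cheap rule flags the request."""
    if not req.user_agent:                     # empty user agent (assumed rule)
        return True
    if CRAWLER_RE.search(req.user_agent):      # self-declared crawler
        return True
    if req.path == "/robots.txt":              # assumed rule: browsers rarely fetch it
        return True
    return False


def sessionize(requests):
    """Stage 2a (offline): group requests by (ip, user agent), split on long gaps."""
    by_client = defaultdict(list)
    for r in sorted(requests, key=lambda r: r.timestamp):
        by_client[(r.ip, r.user_agent)].append(r)

    sessions = []
    for reqs in by_client.values():
        current = [reqs[0]]
        for prev, cur in zip(reqs, reqs[1:]):
            if cur.timestamp - prev.timestamp > SESSION_TIMEOUT:
                sessions.append(current)       # gap too long: close the session
                current = []
            current.append(cur)
        sessions.append(current)
    return sessions


def session_features(session):
    """Stage 2b (offline): per-session features a classifier could be trained on."""
    duration = session[-1].timestamp - session[0].timestamp
    n = len(session)
    return {
        "requests": n,
        "duration_s": duration,
        "mean_gap_s": duration / (n - 1) if n > 1 else 0.0,
        "unique_paths": len({r.path for r in session}),
    }
```

In such a design, requests flagged by `realtime_check` can be handled immediately, while the remaining traffic is scored later from the access log, where slower but more accurate session-level analysis is affordable.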

Notes

  1. https://github.com/monperrus/crawler-user-agents

  2. Weka is a collection of machine learning algorithms for data mining tasks. Its website is https://www.cs.waikato.ac.nz/ml/weka/.

Acknowledgments

This research is supported in part by the National Key R&D Program of China No. 2018YFC1604000, the Chutian Scholars Program of Hubei, the Luojia Young Scholar Funds of Wuhan University No. 1503/600400001, the 2018 Science and Technology Transformation Project of the Grain Administration of Hubei Province “Grain and Oil Quality & Safety Assurance System Research”, and the Applied Basic Research Program of Wuhan City, China No. 2017010201010117.

Author information

Corresponding author

Correspondence to Weiping Zhu.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhu, W., Gao, H., He, Z., Qin, J., Han, B. (2019). A Hybrid Approach for Recognizing Web Crawlers. In: Biagioni, E., Zheng, Y., Cheng, S. (eds) Wireless Algorithms, Systems, and Applications. WASA 2019. Lecture Notes in Computer Science, vol 11604. Springer, Cham. https://doi.org/10.1007/978-3-030-23597-0_41

  • DOI: https://doi.org/10.1007/978-3-030-23597-0_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-23596-3

  • Online ISBN: 978-3-030-23597-0

  • eBook Packages: Computer Science, Computer Science (R0)
