A Hybrid Approach for Recognizing Web Crawlers

Zhu, Weiping; Gao, Hang; He, Zongjian; Qin, Jiangbo; Han, Bo

doi:10.1007/978-3-030-23597-0_41

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11604)

Included in the following conference series:

International Conference on Wireless Algorithms, Systems, and Applications

Abstract

In recent years, web crawlers have been widely used to collect data from the Internet. Accurately recognizing web crawlers helps a site make better use of friendly crawlers while blocking malicious ones. Existing web crawler recognition methods have difficulty handling newer kinds of crawlers, such as distributed crawlers, proxy-based crawlers, and browser-engine-based crawlers. Moreover, it is non-trivial to achieve both high identification accuracy and fast response time simultaneously. To tackle these issues, we propose a novel approach to web crawler recognition that combines real-time recognition based on heuristic rules with offline recognition based on machine learning. This combination addresses the problems above and improves both accuracy and efficiency. We build a website and analyze its web access log using the proposed method. According to the results, the proposed approach achieves desirable performance in both accuracy and efficiency.
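
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the paper's actual rules, features, or thresholds, so everything below is an assumption made for illustration: the user-agent patterns, the sliding-window rate limit, the names `RealTimeFilter` and `session_features`, and the chosen session features are all hypothetical. The online stage applies cheap per-request rules; requests it cannot decide would be logged, grouped into sessions, and summarized as features for an offline classifier (the classifier itself is not shown).

```python
import re
from collections import defaultdict, deque

# Hypothetical patterns for illustration; a deployment would likely load a
# maintained list of crawler user agents instead.
KNOWN_CRAWLER_PATTERNS = [r"Googlebot", r"bingbot", r"python-requests", r"PhantomJS"]
CRAWLER_RE = re.compile("|".join(KNOWN_CRAWLER_PATTERNS), re.IGNORECASE)


class RealTimeFilter:
    """Online stage: cheap heuristic rules applied to each request."""

    def __init__(self, max_requests=20, window_seconds=10.0):
        # Both thresholds are assumed values, not taken from the paper.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # client id -> recent timestamps

    def check(self, client_id, user_agent, timestamp):
        # Rule 1: missing or known-crawler user agent.
        if not user_agent or CRAWLER_RE.search(user_agent):
            return "crawler"
        # Rule 2: too many requests inside the sliding time window.
        window = self._recent[client_id]
        window.append(timestamp)
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        if len(window) > self.max_requests:
            return "crawler"
        # Otherwise leave the decision to the offline stage.
        return "undecided"


def session_features(requests):
    """Offline stage: summarize one session as features for a classifier.

    `requests` is a non-empty list of (timestamp, path, status) tuples,
    sorted by timestamp.
    """
    if not requests:
        raise ValueError("a session must contain at least one request")
    n = len(requests)
    duration = requests[-1][0] - requests[0][0]
    return {
        "request_count": n,
        "mean_interval": duration / (n - 1) if n > 1 else 0.0,
        "error_ratio": sum(1 for _, _, status in requests if status >= 400) / n,
        "robots_txt_hit": any(path == "/robots.txt" for _, path, _ in requests),
    }
```

In this sketch the online stage stays fast because it only does a regular-expression match and a bounded queue update per request, while the more expensive session-level analysis is deferred to the offline stage.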

Notes

1. https://github.com/monperrus/crawler-user-agents
2. Weka is a collection of machine learning algorithms for data mining tasks. Its website is https://www.cs.waikato.ac.nz/ml/weka/.

Acknowledgments

This research is supported in part by the National Key R&D Program of China (No. 2018YFC1604000), the Chutian Scholars Program of Hubei, the Luojia Young Scholar Funds of Wuhan University (No. 1503/600400001), the 2018 Science and Technology Transformation Project of the Grain Administration of Hubei Province, “Grain and Oil Quality & Safety Assurance System Research”, and the Applied Basic Research Program of Wuhan City, China (No. 2017010201010117).

Author information

Authors and Affiliations

School of Computer Science, Wuhan University, Wuhan, People’s Republic of China
Weiping Zhu, Hang Gao, Jiangbo Qin & Bo Han
Center for eResearch, University of Auckland, Auckland, New Zealand
Zongjian He

Corresponding author

Correspondence to Weiping Zhu.

Editor information

Editors and Affiliations

University of Hawaii at Manoa, Honolulu, HI, USA
Edoardo S. Biagioni
University of Hawaii at Manoa, Honolulu, USA
Yao Zheng
Harbin Institute of Technology, Harbin, China
Siyao Cheng

About this paper

Cite this paper

Zhu, W., Gao, H., He, Z., Qin, J., Han, B. (2019). A Hybrid Approach for Recognizing Web Crawlers. In: Biagioni, E., Zheng, Y., Cheng, S. (eds) Wireless Algorithms, Systems, and Applications. WASA 2019. Lecture Notes in Computer Science, vol 11604. Springer, Cham. https://doi.org/10.1007/978-3-030-23597-0_41

DOI: https://doi.org/10.1007/978-3-030-23597-0_41
Published: 21 June 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23596-3
Online ISBN: 978-3-030-23597-0
eBook Packages: Computer Science, Computer Science (R0)