Abstract
In recent years, web crawlers have been widely used to collect data from the Internet. Accurately recognizing web crawlers makes it possible to accommodate friendly crawlers while blocking malicious ones. Existing work on web crawler recognition has difficulty handling newer kinds of crawlers, such as distributed crawlers, proxy-based crawlers, and browser-engine-based crawlers. Moreover, it is non-trivial to achieve both high identification accuracy and short response time simultaneously. To tackle these issues, we propose a novel approach to web crawler recognition that combines real-time recognition based on heuristic rules with offline recognition based on machine learning. Combining the two addresses the problems above and improves both accuracy and efficiency. We build a website and analyze its web access log using the proposed method. According to the results, the proposed approach achieves desirable performance in both accuracy and efficiency.
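The two-stage idea in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the rule set (`BOT_UA_TOKENS`, `MAX_REQUESTS_PER_MINUTE`), the two session features, and the nearest-centroid classifier are all hypothetical stand-ins, since the abstract does not specify the actual rules, features, or learning algorithm.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Assumed rule parameters; the paper's actual heuristics are not given in the abstract.
BOT_UA_TOKENS = ("bot", "crawler", "spider", "phantomjs")
MAX_REQUESTS_PER_MINUTE = 120


@dataclass
class Request:
    ip: str
    user_agent: str
    path: str
    timestamp: float  # seconds


def realtime_is_crawler(req: Request, recent_count: int) -> bool:
    """Stage 1: cheap heuristic rules evaluated on every incoming request."""
    ua = req.user_agent.lower()
    if not ua or any(token in ua for token in BOT_UA_TOKENS):
        return True
    return recent_count > MAX_REQUESTS_PER_MINUTE


def session_features(requests: list[Request]) -> list[float]:
    """Hypothetical features: request rate and share of static-resource requests."""
    times = sorted(r.timestamp for r in requests)
    duration = max(times[-1] - times[0], 1.0)  # avoid division by very small spans
    static = sum(r.path.endswith((".css", ".js", ".png", ".jpg")) for r in requests)
    return [len(requests) / duration, static / len(requests)]


def train_centroids(labelled: list[tuple[list[float], str]]) -> dict[str, list[float]]:
    """Stage 2 training: one mean feature vector per label (nearest-centroid model)."""
    groups = defaultdict(list)
    for features, label in labelled:
        groups[label].append(features)
    return {label: [mean(col) for col in zip(*rows)] for label, rows in groups.items()}


def offline_classify(features: list[float], centroids: dict[str, list[float]]) -> str:
    """Stage 2 prediction: return the label of the closest centroid."""
    def squared_distance(centroid: list[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))
```

In such a design, the rule check would run inline with request handling, while the classifier would be trained and applied in batch over sessions reconstructed from the access log, so that its cost does not affect response time.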
Notes
- 1.
- 2. Weka is a collection of machine learning algorithms for data mining tasks. Its website is https://www.cs.waikato.ac.nz/ml/weka/.
Acknowledgments
This research is supported in part by the National Key R&D Program of China (No. 2018YFC1604000), the Chutian Scholars Program of Hubei, the Luojia Young Scholar Funds of Wuhan University (No. 1503/600400001), the 2018 Science and Technology Transformation Project of the Grain Administration of Hubei Province, “Grain and Oil Quality & Safety Assurance System Research”, and the Applied Basic Research Program of Wuhan City, China (No. 2017010201010117).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhu, W., Gao, H., He, Z., Qin, J., Han, B. (2019). A Hybrid Approach for Recognizing Web Crawlers. In: Biagioni, E., Zheng, Y., Cheng, S. (eds) Wireless Algorithms, Systems, and Applications. WASA 2019. Lecture Notes in Computer Science, vol 11604. Springer, Cham. https://doi.org/10.1007/978-3-030-23597-0_41
DOI: https://doi.org/10.1007/978-3-030-23597-0_41
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23596-3
Online ISBN: 978-3-030-23597-0
eBook Packages: Computer Science, Computer Science (R0)