Abstract
Most modern Web robots that crawl the Internet to support value-added services and technologies possess sophisticated data collection and analysis capabilities. Some of these robots, however, may be ill-behaved or malicious, and hence, may impose a significant strain on a Web server. It is thus necessary to detect Web robots in order to block undesirable ones from accessing the server. Such detection is also essential to ensure that the robot traffic is considered appropriately in the performance and capacity planning of Web servers. Despite a variety of Web robot detection techniques, there is no consensus regarding a single technique, or even a specific “type” of technique, that performs well in practice. Therefore, to aid in the development of a practically applicable robot detection technique, this survey presents a critical analysis and comparison of the prevalent detection approaches. We propose a framework to classify the existing detection techniques into four categories based on their underlying detection philosophy. We compare the different classes to gain insights into those characteristics that make up an effective robot detection scheme. Finally, we discuss why the contemporary techniques fail to offer a general solution to the robot detection problem and propose a set of key ingredients necessary for strong Web robot detection.
Additional information
Responsible editor: Charles Elkan.
About this article
Cite this article
Doran, D., Gokhale, S.S. Web robot detection techniques: overview and limitations. Data Min Knowl Disc 22, 183–210 (2011). https://doi.org/10.1007/s10618-010-0180-z
DOI: https://doi.org/10.1007/s10618-010-0180-z