Web robot detection techniques: overview and limitations

Doran, Derek; Gokhale, Swapna S.

doi:10.1007/s10618-010-0180-z

Published: 26 June 2010

Data Mining and Knowledge Discovery, Volume 22, pages 183–210 (2011)

Abstract

Most modern Web robots that crawl the Internet to support value-added services and technologies possess sophisticated data collection and analysis capabilities. Some of these robots, however, may be ill-behaved or malicious, and hence, may impose a significant strain on a Web server. It is thus necessary to detect Web robots in order to block undesirable ones from accessing the server. Such detection is also essential to ensure that the robot traffic is considered appropriately in the performance and capacity planning of Web servers. Despite a variety of Web robot detection techniques, there is no consensus regarding a single technique, or even a specific “type” of technique, that performs well in practice. Therefore, to aid in the development of a practically applicable robot detection technique, this survey presents a critical analysis and comparison of the prevalent detection approaches. We propose a framework to classify the existing detection techniques into four categories based on their underlying detection philosophy. We compare the different classes to gain insights into those characteristics that make up an effective robot detection scheme. Finally, we discuss why the contemporary techniques fail to offer a general solution to the robot detection problem and propose a set of key ingredients necessary for strong Web robot detection.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, 06269, USA
Derek Doran & Swapna S. Gokhale

Corresponding author

Correspondence to Derek Doran.

Additional information

Responsible editor: Charles Elkan.

About this article

Cite this article

Doran, D., Gokhale, S.S. Web robot detection techniques: overview and limitations. Data Min Knowl Disc 22, 183–210 (2011). https://doi.org/10.1007/s10618-010-0180-z

Received: 01 June 2009
Accepted: 16 June 2010
Published: 26 June 2010
Issue Date: January 2011
DOI: https://doi.org/10.1007/s10618-010-0180-z
