Web robot detection techniques: overview and limitations

Data Mining and Knowledge Discovery

Abstract

Most modern Web robots that crawl the Internet to support value-added services and technologies possess sophisticated data collection and analysis capabilities. Some of these robots, however, may be ill-behaved or malicious, and hence, may impose a significant strain on a Web server. It is thus necessary to detect Web robots in order to block undesirable ones from accessing the server. Such detection is also essential to ensure that the robot traffic is considered appropriately in the performance and capacity planning of Web servers. Despite a variety of Web robot detection techniques, there is no consensus regarding a single technique, or even a specific “type” of technique, that performs well in practice. Therefore, to aid in the development of a practically applicable robot detection technique, this survey presents a critical analysis and comparison of the prevalent detection approaches. We propose a framework to classify the existing detection techniques into four categories based on their underlying detection philosophy. We compare the different classes to gain insights into those characteristics that make up an effective robot detection scheme. Finally, we discuss why the contemporary techniques fail to offer a general solution to the robot detection problem and propose a set of key ingredients necessary for strong Web robot detection.

Author information

Corresponding author

Correspondence to Derek Doran.

Additional information

Responsible editor: Charles Elkan.

About this article

Cite this article

Doran, D., Gokhale, S.S. Web robot detection techniques: overview and limitations. Data Min Knowl Disc 22, 183–210 (2011). https://doi.org/10.1007/s10618-010-0180-z

  • DOI: https://doi.org/10.1007/s10618-010-0180-z
