Content-aware web robot detection

Applied Intelligence

Abstract

Web crawlers account for more than a third of total web traffic and threaten the security, privacy and veracity of web applications and their users. Businesses in finance, ticketing and publishing, as well as websites with rich and unique content, are the most affected. To address this problem, we present a novel web robot detection approach that exploits the content of a website, based on the assumption that human users are interested in specific topics, while web robots crawl the web randomly. Our approach extends the typical log-based feature representation of user sessions with a novel set of features that capture the semantics of the content of the requested resources. In addition, we contribute a new real-world dataset, which we make publicly available, to help alleviate the scarcity of open data in this field. Empirical results on this dataset validate our assumption and show that our approach outperforms state-of-the-art methods for web robot detection.
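
The core assumption above (human sessions stay on a few topics, robot sessions wander) can be illustrated with a small sketch. This is not the feature set defined in the article; it is a hypothetical example that assumes each requested resource has already been mapped to a topic vector (for instance by a topic model), and that a single made-up feature, the mean pairwise cosine similarity within a session, is appended to the usual log-based features. The function names `topical_coherence` and `session_features` are illustrative only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topical_coherence(session_vectors):
    """Mean pairwise cosine similarity of the topic vectors in one session.

    Hypothetical semantic feature: values near 1 suggest a topically focused
    session, values near 0 suggest requests scattered across unrelated topics.
    A session with fewer than two resources is treated as fully coherent.
    """
    pairs = list(combinations(session_vectors, 2))
    if not pairs:
        return 1.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def session_features(log_features, session_vectors):
    """Append the (assumed) semantic feature to existing log-based features."""
    return list(log_features) + [topical_coherence(session_vectors)]


# Toy topic vectors over three topics (made-up numbers, not real data).
focused = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.85, 0.1, 0.05]]
scattered = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(topical_coherence(focused))    # close to 1: requests share a topic
print(topical_coherence(scattered))  # 0.0: requests share no topic
```

The combined feature vector returned by `session_features` could then be passed to any standard classifier; the choice of classifier and of the actual semantic features is left to the full article.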


Notes

  1. http://search.lib.auth.gr/

  2. The country codes are obtained from the IP address using the Geolite2 database (https://dev.maxmind.com/geoip/geoip2/geolite2/)

  3. https://zenodo.org/record/3477932

  4. https://www.elastic.co/

  5. https://browscap.org/ - Version 6000031

  6. https://github.com/atmire/COUNTER-Robots - Accessed 28-Mar-2019

  7. https://www.projectcounter.org - Accessed 15-July-2019

  8. https://bit.ly/2XSDjzI - Accessed 28-Mar-2019

  9. https://matomo.org - Accessed 15-July-2019

  10. https://www.readcube.com/papers/ - Accessed 27-March-2019

  11. https://hc.apache.org - Accessed 27-March-2019

  12. https://github.com/peterwittek/somoclu

  13. https://github.com/RezaSadeghiWSU/SMART

  14. https://github.com/guyallard/markov_clustering


Acknowledgements

The authors would like to thank Theodoros Theodoropoulos and Aikaterini Nasta from Aristotle University’s Central Library for their help in providing the data.

Author information

Corresponding author

Correspondence to Athanasios Lagopoulos.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research is co-financed by Greece and the European Union (European Social Fund- ESF) through the Operational Programme Human Resources Development, Education and Lifelong Learning in the context of the project Strengthening Human Resources Research Potential via Doctorate Research (MIS-5000432), implemented by the State Scholarships Foundation (IKY).

About this article

Cite this article

Lagopoulos, A., Tsoumakas, G. Content-aware web robot detection. Appl Intell 50, 4017–4028 (2020). https://doi.org/10.1007/s10489-020-01754-9

  • DOI: https://doi.org/10.1007/s10489-020-01754-9
