Detection of crawler traps: formalization and implementation—defeating protection on internet and on the TOR network

  • Original Paper
  • Journal of Computer Virology and Hacking Techniques

Abstract

In the domain of web security, websites strive to protect themselves against data gathering performed by automatic programs called bots. To that end, crawler traps are an efficient brake on this kind of program. By dynamically creating similar pages or random content, crawler traps feed fake information to the bot, causing it to waste time and resources. At present, no available bot is able to detect the presence of a crawler trap. Our aim was to find a generic solution to escape any type of crawler trap. Since the random generation is potentially endless, the only way to perform crawler trap detection is on the fly. Using machine learning, it is possible to compare datasets of webpages extracted from regular websites with those generated by crawler traps. Since machine learning relies on distances, we designed our system using information theory. We compared widely used distances with a new one designed to take heterogeneous data into account. Indeed, two pages do not necessarily contain the same words, and it is operationally impossible to know all possible words in advance. To address this problem, our new distance compares two webpages, and the results showed that it is more accurate than the other distances tested. By extension, our distance has potential applications well beyond crawler trap detection. This opens many new possibilities in the scope of data classification and data mining.
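The idea of comparing two webpages through an information-theoretic distance can be illustrated with a minimal sketch. It is not the authors' new distance, which the abstract does not define. It assumes each page is reduced to a word-frequency distribution over whitespace-separated tokens, and it uses two standard measures chosen here only for illustration: the Hellinger distance and the Jensen-Shannon divergence. The function names `word_distribution`, `hellinger` and `jensen_shannon` are hypothetical. Both measures are evaluated over the union of the two vocabularies, so a word missing from one page simply gets probability zero there and no global dictionary is needed in advance.

```python
import math
from collections import Counter


def word_distribution(text):
    """Relative frequency of each lowercase whitespace-separated token (assumed tokenization)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    vocab = set(p) | set(q)  # union vocabulary: absent words have probability 0
    squared = sum(
        (math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2 for w in vocab
    )
    return math.sqrt(squared / 2.0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits (base-2 logarithm), in [0, 1]."""
    vocab = set(p) | set(q)
    divergence = 0.0
    for w in vocab:
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = (pw + qw) / 2.0  # mixture distribution; positive whenever pw or qw is
        if pw > 0.0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0.0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


# Illustrative pages (made-up text, not data from the paper).
page_a = word_distribution("the quick brown fox jumps over the lazy dog")
page_b = word_distribution("the quick brown fox sleeps near the lazy cat")
print(hellinger(page_a, page_b), jensen_shannon(page_a, page_b))
```

With these definitions, two identical pages are at distance 0 and two pages with no word in common are at distance 1 under both measures. A crawler could, under the same assumptions, compare each newly fetched page with pages already seen and treat persistently unusual distances as a hint of generated content; the thresholds and the classifier itself are outside the scope of this sketch.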

Author information

Corresponding author

Correspondence to Eric Filiol.

About this article

Cite this article

David, B., Delong, M. & Filiol, E. Detection of crawler traps: formalization and implementation—defeating protection on internet and on the TOR network. J Comput Virol Hack Tech 17, 185–198 (2021). https://doi.org/10.1007/s11416-021-00380-4

  • DOI: https://doi.org/10.1007/s11416-021-00380-4
