Detection of crawler traps: formalization and implementation—defeating protection on internet and on the TOR network

David, Baptiste; Delong, Maxence; Filiol, Eric

doi:10.1007/s11416-021-00380-4

Original Paper
Published: 01 April 2021

Journal of Computer Virology and Hacking Techniques, volume 17, pages 185–198 (2021)

Baptiste David¹, Maxence Delong¹ & Eric Filiol²,³

Abstract

In web security, websites strive to protect themselves against data gathering by automated programs called bots. Crawler traps are an efficient brake on such programs: by dynamically creating similar pages or random content, a crawler trap feeds fake information to the bot and makes it waste time and resources. At present, no available bot is able to detect the presence of a crawler trap. Our aim was to find a generic solution for escaping any type of crawler trap. Since the random generation is potentially endless, crawler trap detection can only be performed on the fly. Using machine learning, it is possible to compare datasets of webpages extracted from regular websites with datasets of pages generated by crawler traps. Since machine learning requires distances, we designed our system using information theory. We compared widely used distances with a new one designed to take heterogeneous data into account. Indeed, two pages do not necessarily contain the same words, and it is operationally impossible to know all possible words in advance. To address this problem, our new distance compares two webpages, and the results showed that it is more accurate than the other distances tested. By extension, our distance has a much wider potential scope than crawler trap detection alone. This opens many new possibilities in data classification and data mining.
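To make the idea of a distance between two webpages concrete, the sketch below compares the word-frequency distributions of two page texts with two commonly used information-theoretic measures, the Jensen-Shannon divergence and the Hellinger distance. It is only an illustration under stated assumptions: the abstract does not say which distances were tested, and this is not the new distance proposed by the authors. The function names and the simple tokenizer are hypothetical. Words missing from one page get probability zero, which is one basic way to cope with pages that do not share a vocabulary.

```python
import math
import re
from collections import Counter


def word_distribution(text):
    """Return the relative frequency of each word in a page's text.

    Assumption: a naive lowercase alphanumeric tokenizer stands in for
    whatever preprocessing a real crawler would apply.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    The vocabulary is the union of both pages' words, so neither page
    needs to know the other's words in advance.
    """
    divergence = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


def hellinger(p, q):
    """Hellinger distance between two word distributions (range 0 to 1)."""
    squared = sum(
        (math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2
        for w in set(p) | set(q)
    )
    return math.sqrt(squared / 2)


if __name__ == "__main__":
    page_a = word_distribution("cheap flights to paris and rome")
    page_b = word_distribution("cheap hotels in paris and madrid")
    print("Jensen-Shannon:", jensen_shannon(page_a, page_b))
    print("Hellinger:", hellinger(page_a, page_b))
```

Both measures are 0 for identical distributions and reach 1 when the two pages share no words. A crawler could, for instance, track such values over successively fetched pages, although the thresholds and the decision rule would have to come from the full paper.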

Author information

Authors and Affiliations

Laboratoire de Virologie et de cryptologie opérationnelles, ESIEA, Laval, France
Baptiste David & Maxence Delong
Department of Cyberdefense, ENSIBS, Vannes, France
Eric Filiol
Higher School of Economics, Moscow, Russian Federation
Eric Filiol

Corresponding author

Correspondence to Eric Filiol.

About this article

Cite this article

David, B., Delong, M. & Filiol, E. Detection of crawler traps: formalization and implementation—defeating protection on internet and on the TOR network. J Comput Virol Hack Tech 17, 185–198 (2021). https://doi.org/10.1007/s11416-021-00380-4

Received: 07 September 2020
Accepted: 16 March 2021
Published: 01 April 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s11416-021-00380-4
