Abstract
In web security, websites try to protect themselves against data gathering by automated programs called bots. Crawler traps are an efficient brake on such programs. By dynamically generating similar pages or random content, crawler traps feed fake information to the bot and make it waste time and resources. Currently, no available bot is able to detect the presence of a crawler trap. Our aim was to find a generic solution to escape any type of crawler trap. Since the random generation is potentially endless, crawler trap detection has to be performed on the fly. Using machine learning, it is possible to compare datasets of webpages extracted from regular websites with those generated by crawler traps. Since machine learning relies on distances, we designed our system using information theory. We compared widely used distances with a new one designed to take heterogeneous data into account. Indeed, two pages do not necessarily contain the same words, and it is operationally impossible to know all possible words in advance. To solve this problem, our new distance compares two webpages, and the results show that it is more accurate than the other distances tested. By extension, our distance has a much wider potential scope than crawler trap detection alone. This opens many new possibilities in data classification and data mining.
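To make the idea of a distance between two webpages concrete, here is a minimal sketch under stated assumptions. It is not the new distance proposed in the article, which the abstract does not define. It uses the classical Hellinger distance on word-frequency distributions, and the tokenizer and the function names `word_distribution` and `hellinger` are hypothetical choices made for illustration. Words absent from one page are given probability 0, so the two pages do not need to share a vocabulary.

```python
import math
import re
from collections import Counter


def word_distribution(text):
    """Map each lowercase word of `text` to its relative frequency.

    Assumption: words are runs of ASCII letters and digits; a real
    crawler would first strip HTML markup and might tokenize differently.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def hellinger(p, q):
    """Hellinger distance between two word distributions.

    The sum runs over the union of both vocabularies; a word missing
    from one page contributes probability 0 for that page.
    """
    vocab = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2
        for w in vocab
    )
    return math.sqrt(squared / 2.0)


# Illustrative page texts only, not data from the article.
page_a = word_distribution("cheap flights to Paris and cheap hotels in Paris")
page_b = word_distribution("cheap flights to Rome and cheap hotels in Rome")
page_c = word_distribution("xkq zzv plo wqe rty uio")

print(hellinger(page_a, page_b))  # pages sharing most words: smaller value
print(hellinger(page_a, page_c))  # pages sharing no words: largest value
```

For non-empty pages the distance lies between 0 (identical word distributions) and 1 (no word in common). A classifier could use such pairwise values as input features, but the choice of threshold and learning method is left open here.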
Cite this article
David, B., Delong, M. & Filiol, E. Detection of crawler traps: formalization and implementation—defeating protection on internet and on the TOR network. J Comput Virol Hack Tech 17, 185–198 (2021). https://doi.org/10.1007/s11416-021-00380-4
DOI: https://doi.org/10.1007/s11416-021-00380-4