
Abstract

The web is the largest repository of documents available and, to retrieve them for various purposes, we must use crawlers that navigate autonomously, selecting and processing documents according to the objectives pursued. However, even a cursory inspection shows that a significant number of documents are replicated, more or less abundantly. Detecting these duplicates is important because it lightens databases and improves the efficiency of information retrieval engines, and it also improves the precision of cybermetric analyses, web mining studies, etc. Standard hashing techniques used to detect duplicates find only exact duplicates, at the bit level. However, many of the duplicates found in the real world are not exactly alike. For example, we can find web pages with the same content but with different headers or meta tags, or rendered with different style sheets. A frequent case is the same document published in different formats; in these cases we have completely different documents at the binary level. The obvious solution is to compare plain-text conversions of all these formats, but such conversions are never identical, because converters treat formatting elements differently (textual characters, diacritics, spacing, paragraphs, etc.). In this work we introduce the possibility of using what is known as fuzzy hashing. The idea is to produce fingerprints of files (or documents, etc.), so that a comparison between two fingerprints gives an estimate of the closeness or distance between the corresponding files or documents. Based on the concept of a rolling hash, fuzzy hashing has been used successfully in computer security tasks such as malware identification, spam filtering, virus scanning, etc. We have added fuzzy hashing capabilities to a lightweight crawler and have run several tests in a heterogeneous network domain consisting of multiple servers with different software, static and dynamic pages, etc. These tests allowed us to measure similarity thresholds and to obtain useful data about the quantity and distribution of duplicate documents on web servers.
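As a concrete illustration of the fingerprint-and-compare workflow described above, the following minimal sketch uses the ssdeep implementation of context-triggered piecewise hashing through its Python bindings. The file names and the 75-point similarity cut-off are illustrative assumptions, not values taken from this paper, which derives its thresholds empirically from the crawled collection.

    # Minimal sketch: near-duplicate detection with fuzzy hashing (ssdeep).
    # Requires the python-ssdeep bindings; the file names and the threshold
    # below are illustrative assumptions, not values from the paper.
    import ssdeep

    def fingerprint(path):
        # Return the ssdeep fuzzy hash (fingerprint) of a file's contents.
        with open(path, "rb") as f:
            return ssdeep.hash(f.read())

    # Two plain-text conversions of the "same" document (e.g. HTML vs. PDF export).
    fp_a = fingerprint("page_a.txt")   # hypothetical file
    fp_b = fingerprint("page_b.txt")   # hypothetical file

    # ssdeep.compare returns a match score from 0 (unrelated) to 100 (identical).
    score = ssdeep.compare(fp_a, fp_b)

    SIMILARITY_THRESHOLD = 75  # assumed cut-off for this sketch
    if score >= SIMILARITY_THRESHOLD:
        print("Likely near-duplicates, score =", score)
    else:
        print("Probably distinct documents, score =", score)

In such a scheme, a crawler stores one fingerprint per fetched document and compares each new fingerprint against those already stored, flagging pairs whose score exceeds the chosen threshold as probable duplicates.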





Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Figuerola, C.G., Díaz, R.G., Alonso Berrocal, J.L., Zazo Rodríguez, A.F. (2011). Web Document Duplicate Detection Using Fuzzy Hashing. In: Corchado, J.M., Pérez, J.B., Hallenborg, K., Golinska, P., Corchuelo, R. (eds) Trends in Practical Applications of Agents and Multiagent Systems. Advances in Intelligent and Soft Computing, vol 90. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19931-8_15


  • DOI: https://doi.org/10.1007/978-3-642-19931-8_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19930-1

  • Online ISBN: 978-3-642-19931-8

  • eBook Packages: Engineering (R0)
