
Abstract

The web is the largest repository of documents available and, to retrieve them for various purposes, we must use crawlers that navigate autonomously, selecting and processing documents according to the objectives pursued. However, even a cursory inspection shows that a significant number of documents are replicated, more or less abundantly. Detecting these duplicates is important because it lightens databases and improves the efficiency of information retrieval engines, and it also improves the precision of cybermetric analyses, web mining studies, etc. Standard hashing techniques used to detect duplicates find only exact duplicates, at the bit level. However, many of the duplicates found in the real world are not exactly alike. For example, we can find web pages with the same content but with different headers or meta tags, or rendered with different style sheets. A frequent case is the same document published in different formats; in these cases we have completely different documents at the binary level. The obvious solution is to compare plain-text conversions of all these formats, but such conversions are never identical, because converters treat formatting elements differently (textual characters, diacritics, spacing, paragraphs, etc.). In this work we introduce the possibility of using what is known as fuzzy hashing. The idea is to produce fingerprints of files (or documents, etc.), so that a comparison between two fingerprints gives an estimate of the closeness or distance between the corresponding files or documents. Based on the concept of a rolling hash, fuzzy hashing has been used successfully in computer security tasks such as malware identification, spam filtering, virus scanning, etc. We have added fuzzy hashing capabilities to a lightweight crawler and have run several tests in a heterogeneous network domain consisting of multiple servers with different software, static and dynamic pages, etc. These tests allowed us to measure similarity thresholds and to obtain useful data about the quantity and distribution of duplicate documents on web servers.
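As a concrete illustration of the fingerprint-and-compare workflow described above, the following minimal sketch uses the ssdeep implementation of context-triggered piecewise hashing through its Python bindings. The file names and the 75-point similarity cut-off are illustrative assumptions, not values taken from this paper, which derives its thresholds empirically from the crawled collection.

    # Minimal sketch: near-duplicate detection with fuzzy hashing (ssdeep).
    # Requires the python-ssdeep bindings; the file names and the threshold
    # below are illustrative assumptions, not values from the paper.
    import ssdeep

    def fingerprint(path):
        # Return the ssdeep fuzzy hash (fingerprint) of a file's contents.
        with open(path, "rb") as f:
            return ssdeep.hash(f.read())

    # Two plain-text conversions of the "same" document (e.g. HTML vs. PDF export).
    fp_a = fingerprint("page_a.txt")   # hypothetical file
    fp_b = fingerprint("page_b.txt")   # hypothetical file

    # ssdeep.compare returns a match score from 0 (unrelated) to 100 (identical).
    score = ssdeep.compare(fp_a, fp_b)

    SIMILARITY_THRESHOLD = 75  # assumed cut-off for this sketch
    if score >= SIMILARITY_THRESHOLD:
        print("Likely near-duplicates, score =", score)
    else:
        print("Probably distinct documents, score =", score)

In such a scheme, a crawler stores one fingerprint per fetched document and compares each new fingerprint against those already stored, flagging pairs whose score exceeds the chosen threshold as probable duplicates.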





Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Figuerola, C.G., Díaz, R.G., Alonso Berrocal, J.L., Zazo Rodríguez, A.F. (2011). Web Document Duplicate Detection Using Fuzzy Hashing. In: Corchado, J.M., Pérez, J.B., Hallenborg, K., Golinska, P., Corchuelo, R. (eds) Trends in Practical Applications of Agents and Multiagent Systems. Advances in Intelligent and Soft Computing, vol 90. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19931-8_15


  • DOI: https://doi.org/10.1007/978-3-642-19931-8_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19930-1

  • Online ISBN: 978-3-642-19931-8

  • eBook Packages: Engineering (R0)
