Temporal Shingling for Version Identification in Web Archives

Schenkel, Ralf

doi:10.1007/978-3-642-12275-0_44

Ralf Schenkel²⁴

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 5993))

Included in the following conference series:

European Conference on Information Retrieval

2168 Accesses
3 Citations

Abstract

Building and preserving archives of the evolving Web has been an important problem in research. Given the huge volume of content that is added or updated daily, identifying the right versions of pages to store in the archive is an important building block of any large-scale archival system. This paper presents temporal shingling, an extension of the well-established shingling technique for measuring how similar two snapshots of a page are. This novel method considers the lifespan of shingles to differentiate between important updates that should be archived and transient changes that may be ignored. Extensive experiments demonstrate the tradeoff between archive size and version coverage, and show that the novel method yields better archive coverage at smaller sizes than existing techniques.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Anand, A., et al.: EverLast: a distributed architecture for preserving the web. In: JCDL, pp. 331–340 (2009)
Google Scholar
Brin, S., Davis, J., Garcia-Molina, H.: Copy detection mechanisms for digital documents. In: SIGMOD Conference, pp. 398–409 (1995)
Google Scholar
Broder, A.Z.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 1–10. Springer, Heidelberg (2000)
Chapter Google Scholar
Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Computer Networks 29(8-13), 1157–1166 (1997)
Google Scholar
Charikar, M.: Similarity estimation techniques from rounding algorithms. In: STOC, pp. 380–388 (2002)
Google Scholar
Cho, J., Garcia-Molina, H.: Effective page refresh policies for web crawlers. ACM Trans. Database Syst. 28(4), 390–426 (2003)
Article Google Scholar
Cho, J., Garcia-Molina, H.: Estimating frequency of change. ACM Trans. Internet Techn. 3(3), 256–290 (2003)
Article Google Scholar
Chowdhury, A., et al.: Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst. 20(2), 171–191 (2002)
Article Google Scholar
Conrad, J.G., et al.: Online duplicate document detection: signature reliability in a dynamic retrieval environment. In: CIKM, pp. 443–452 (2003)
Google Scholar
Henzinger, M.R.: Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: SIGIR, pp. 284–291 (2006)
Google Scholar
Hoad, T.C., Zobel, J.: Methods for identifying versioned and plagiarized documents. JASIST 54(3), 203–215 (2003)
Article Google Scholar
Kolcz, A., Chowdhury, A., Alspector, J.: Improved robustness of signature-based near-replica detection via lexicon randomization. In: KDD, pp. 605–610 (2004)
Google Scholar
Manber, U.: Finding similar files in a large file system. In: USENIX Winter, pp. 1–10 (1994)
Google Scholar
Manku, G.S., Jain, A., Sarma, A.D.: Detecting near-duplicates for web crawling. In: WWW, pp. 141–150 (2007)
Google Scholar
Olston, C., Pandey, S.: Recrawl scheduling based on information longevity. In: WWW, pp. 437–446 (2008)
Google Scholar
Theobald, M., Siddharth, J., Paepcke, A.: SpotSigs: robust and efficient near duplicate detection in large web collections. In: SIGIR, pp. 563–570 (2008)
Google Scholar

Download references

Author information

Authors and Affiliations

Saarland University, Saarbrücken, Germany
Ralf Schenkel

Authors

Ralf Schenkel
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Adaptive Information Cluster, Dublin City University, Dublin, 9, Ireland
Cathal Gurrin
The Open University, Walton Hall, MK7 6HF, Milton Keynes, UK
Yulan He
Microsoft Research Ltd, 7 JJ Thomson Avenue, CB3 0FB, Cambridge, UK
Gabriella Kazai
Department of Computer Science, University of Essex, Wivenhoe Park, CO4 3SQ, Colchester, UK
Udo Kruschwitz
The Open University, Walton Hall, Milton Keynes, UK
Suzanne Little
University of London, London, UK
Thomas Roelleke
Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
Stefan Rüger
Department of Computing Science, University of Glasgow, 17 Lilybank Gardens, G12 8QQ, Glasgow, UK
Keith van Rijsbergen

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Schenkel, R. (2010). Temporal Shingling for Version Identification in Web Archives. In: Gurrin, C., et al. Advances in Information Retrieval. ECIR 2010. Lecture Notes in Computer Science, vol 5993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12275-0_44

Download citation

DOI: https://doi.org/10.1007/978-3-642-12275-0_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12274-3
Online ISBN: 978-3-642-12275-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics