Abstract
We consider how to efficiently compute the overlap between all pairs of web documents. This information can be used to improve web crawlers and web archivers, and to improve the presentation of search results, among other applications. We report statistics on how common replication is on the web, and on the cost of computing the above information for a relatively large subset of the web: about 24 million web pages, corresponding to roughly 150 gigabytes of textual information.
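The abstract does not specify how overlap is measured; a standard technique in the near-duplicate detection literature is w-shingling, where each document is reduced to its set of w-word sequences and overlap between two documents is the Jaccard resemblance of their shingle sets. The sketch below is illustrative only, assuming whitespace-tokenized text and an exact (non-sampled) comparison; the function names and the choice of w are not from the paper.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (word n-grams) of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def overlap(a, b, w=4):
    """Jaccard resemblance of two documents over their shingle sets:
    |S(a) & S(b)| / |S(a) | S(b)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not (sa or sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over a sleeping dog"
print(overlap(doc1, doc2, w=3))  # -> 0.4
```

At web scale, exact set comparison over all pairs is infeasible; systems of this era instead fingerprinted shingles (e.g. via hashing) and sampled them so that near-replica pairs could be found without a quadratic all-pairs scan.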
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
Cite this paper
Shivakumar, N., Garcia-Molina, H. (1999). Finding Near-Replicas of Documents on the Web. In: Atzeni, P., Mendelzon, A., Mecca, G. (eds) The World Wide Web and Databases. WebDB 1998. Lecture Notes in Computer Science, vol 1590. Springer, Berlin, Heidelberg. https://doi.org/10.1007/10704656_13
DOI: https://doi.org/10.1007/10704656_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65890-0
Online ISBN: 978-3-540-48909-2