Abstract
We consider how to efficiently compute the overlap between all pairs of web documents. This information can be used to improve web crawlers and web archivers, and to improve the presentation of search results, among other applications. We report statistics on how common replication is on the web, and on the cost of computing the above information for a relatively large subset of the web: about 24 million web pages, corresponding to roughly 150 gigabytes of textual information.
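The abstract does not specify how overlap is measured; a standard technique in the near-duplicate detection literature is w-shingling, where each document is reduced to its set of w-word sequences and overlap between two documents is the Jaccard resemblance of their shingle sets. The sketch below is illustrative only, assuming whitespace-tokenized text and an exact (non-sampled) comparison; the function names and the choice of w are not from the paper.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (word n-grams) of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def overlap(a, b, w=4):
    """Jaccard resemblance of two documents over their shingle sets:
    |S(a) & S(b)| / |S(a) | S(b)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not (sa or sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over a sleeping dog"
print(overlap(doc1, doc2, w=3))  # -> 0.4
```

At web scale, exact set comparison over all pairs is infeasible; systems of this era instead fingerprinted shingles (e.g. via hashing) and sampled them so that near-replica pairs could be found without a quadratic all-pairs scan.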
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
Cite this paper
Shivakumar, N., Garcia-Molina, H. (1999). Finding Near-Replicas of Documents on the Web. In: Atzeni, P., Mendelzon, A., Mecca, G. (eds) The World Wide Web and Databases. WebDB 1998. Lecture Notes in Computer Science, vol 1590. Springer, Berlin, Heidelberg. https://doi.org/10.1007/10704656_13
DOI: https://doi.org/10.1007/10704656_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65890-0
Online ISBN: 978-3-540-48909-2