DOI: 10.1145/1099554.1099733
Article

Redundant documents and search effectiveness

Published: 31 October 2005

Abstract

The web contains a great many documents that are content-equivalent, that is, informationally redundant with respect to each other. The presence of such mutually redundant documents in search results can degrade the user search experience. Previous attempts to address this issue, most notably the TREC novelty track, were characterized by difficulties with accuracy and evaluation. In this paper we explore syntactic techniques, particularly document fingerprinting, for detecting content equivalence. Applying these techniques to the TREC GOV1 and GOV2 corpora revealed a high degree of redundancy; a user study confirmed that our metrics accurately identified content equivalence. We show, moreover, that content-equivalent documents have a significant effect on the search experience: we found that 16.6% of all relevant documents in runs submitted to the TREC 2004 terabyte track were redundant.
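
The abstract does not spell out the fingerprinting scheme used in the paper, so the following is only a minimal sketch of the general idea behind syntactic fingerprinting: break each document into overlapping word sequences, hash them, and compare the resulting hash sets. The function names `fingerprint` and `resemblance`, the chunk length `k`, the `modulus` sampling parameter, and the choice of BLAKE2b as the hash are all assumptions made for illustration, not details taken from the paper.

```python
import hashlib
import re
from typing import Set


def fingerprint(text: str, k: int = 4, modulus: int = 1) -> Set[int]:
    """Hash every run of k consecutive words; keep hashes divisible by `modulus`.

    Illustrative only: k, modulus and the hash function are assumed values,
    not parameters reported in the paper.
    """
    # Normalise case and punctuation so trivial formatting changes are ignored.
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        chunks = [" ".join(words)]
    else:
        chunks = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    hashes = set()
    for chunk in chunks:
        digest = hashlib.blake2b(chunk.encode("utf-8"), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        # modulus=1 keeps every hash; larger values keep a deterministic sample.
        if value % modulus == 0:
            hashes.add(value)
    return hashes


def resemblance(a: Set[int], b: Set[int]) -> float:
    """Jaccard overlap of two fingerprints (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


doc_a = "The quick brown fox jumps over the lazy dog."
doc_b = "the QUICK brown fox   jumps over the lazy dog"
doc_c = "An entirely different sentence about search engines and ranking."

print(resemblance(fingerprint(doc_a), fingerprint(doc_b)))  # same words after normalisation
print(resemblance(fingerprint(doc_a), fingerprint(doc_c)))  # no shared 4-word chunks
```

In a sketch like this, a pair of documents whose resemblance exceeds some chosen threshold would be treated as candidates for content equivalence; where to set that threshold is exactly the kind of question the user study mentioned in the abstract is meant to settle.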

Published In

CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management
October 2005
854 pages
ISBN: 1595931406
DOI: 10.1145/1099554
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. duplicate detection
  2. novelty
  3. search effectiveness

Qualifiers

  • Article

Conference

CIKM05: Conference on Information and Knowledge Management
October 31 - November 5, 2005
Bremen, Germany

Acceptance Rates

CIKM '05 Paper Acceptance Rate 77 of 425 submissions, 18%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2024) LLMs can be Fooled into Labelling a Document as Relevant: best café near me; this paper is perfectly relevant. Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pp. 32-41. DOI: 10.1145/3673791.3698431. Online publication date: 8-Dec-2024
  • (2023) Chuweb21D: A Deduped English Document Collection for Web Search Tasks. Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pp. 63-72. DOI: 10.1145/3624918.3625317. Online publication date: 26-Nov-2023
  • (2023) The Infinite Index: Information Retrieval on Generative Text-To-Image Models. Proceedings of the 2023 Conference on Human Information Interaction and Retrieval, pp. 172-186. DOI: 10.1145/3576840.3578327. Online publication date: 19-Mar-2023
  • (2022) Novelty Detection: A Perspective from Natural Language Processing. Computational Linguistics 48:1, pp. 77-117. DOI: 10.1162/coli_a_00429. Online publication date: 4-Apr-2022
  • (2021) CopyCat: Near-Duplicates Within and Between the ClueWeb and the Common Crawl. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2398-2404. DOI: 10.1145/3404835.3463246. Online publication date: 11-Jul-2021
  • (2021) Incentives for Item Duplication Under Fair Ranking Policies. Advances in Bias and Fairness in Information Retrieval, pp. 64-77. DOI: 10.1007/978-3-030-78818-6_7. Online publication date: 25-Jun-2021
  • (2020) Sampling Bias Due to Near-Duplicates in Learning to Rank. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1997-2000. DOI: 10.1145/3397271.3401212. Online publication date: 25-Jul-2020
  • (2015) Search Result Diversification. Foundations and Trends in Information Retrieval 9:1, pp. 1-90. DOI: 10.1561/1500000040. Online publication date: 1-Mar-2015
  • (2015) Automated News Suggestions for Populating Wikipedia Entity Pages. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 323-332. DOI: 10.1145/2806416.2806531. Online publication date: 17-Oct-2015
  • (2015) On the reliability of diversity and redundancy-based search metrics. 2015 7th International Conference on Information Technology and Electrical Engineering (ICITEE), pp. 615-620. DOI: 10.1109/ICITEED.2015.7409020. Online publication date: Oct-2015