Constructing a text corpus for inexact duplicate detection

Published: 25 July 2004

Abstract

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its negative effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling inexact duplicate documents.

Published In

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN:1581138814
DOI:10.1145/1008992

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. duplicate document detection
  2. test collections

Qualifiers

  • Article

Conference

SIGIR04

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

  • (2019) On Tradeoffs Between Document Signature Methods for a Legal Due Diligence Corpus. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1001-1004. DOI: 10.1145/3331184.3331311. Online publication date: 18-Jul-2019.
  • (2011) Detection of near-duplicate user generated contents. Proceedings of the 3rd international workshop on Search and mining user-generated contents, pages 27-34. DOI: 10.1145/2065023.2065031. Online publication date: 28-Oct-2011.
  • (2009) Coordinated weighted sampling for estimating aggregates over multiple weight assignments. Proceedings of the VLDB Endowment 2:1, pages 646-657. DOI: 10.14778/1687627.1687701. Online publication date: 1-Aug-2009.
  • (2009) Leveraging discarded samples for tighter estimation of multiple-set aggregates. ACM SIGMETRICS Performance Evaluation Review 37:1, pages 251-262. DOI: 10.1145/2492101.1555379. Online publication date: 15-Jun-2009.
  • (2009) Leveraging discarded samples for tighter estimation of multiple-set aggregates. Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems, pages 251-262. DOI: 10.1145/1555349.1555379. Online publication date: 15-Jun-2009.
  • (2008) Estimating Aggregates over Multiple Sets. Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 761-766. DOI: 10.1109/ICDM.2008.110. Online publication date: 15-Dec-2008.
  • (2008) New Issues in Near-duplicate Detection. Data Analysis, Machine Learning and Applications, pages 601-609. DOI: 10.1007/978-3-540-78246-9_71. Online publication date: 2008.
  • (2007) Detecting near-duplicates for web crawling. Proceedings of the 16th international conference on World Wide Web, pages 141-150. DOI: 10.1145/1242572.1242592. Online publication date: 8-May-2007.
  • (2005) Near-duplicate detection for eRulemaking. Proceedings of the 2005 national conference on Digital government research, pages 78-86. DOI: 10.5555/1065226.1065247. Online publication date: 15-May-2005.
