short-paper

CopyCat: Near-Duplicates Within and Between the ClueWeb and the Common Crawl

Authors:

Janek Bevendorff,

Michael Völske,

Martin Potthast,

Matthias Hagen

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 2398 - 2404

https://doi.org/10.1145/3404835.3463246

Published: 11 July 2021

Abstract

The number of near-duplicates in web crawls like the ClueWeb or the Common Crawl forces their users either to develop a preprocessing pipeline for deduplication, which is costly both computationally and in person hours, or to accept the undesired effects that near-duplicates have on the reliability and validity of experiments. We introduce ChatNoir-CopyCat-21, which simplifies deduplication significantly. It comes in two parts: (1) a compilation of near-duplicate documents within the ClueWeb09, the ClueWeb12, and two Common Crawl snapshots, as well as between selections of these crawls, and (2) a software library that implements the deduplication of arbitrary document sets. Our analysis shows that 14-52% of the documents within a crawl and around 0.7-2.5% between the crawls are near-duplicates. Two showcases demonstrate the application and usefulness of our resource.
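
The abstract does not say how the library detects near-duplicates, so the following is only a generic illustration of what deduplicating an arbitrary document set can look like, not the ChatNoir-CopyCat-21 implementation. It is a minimal sketch that assumes SimHash fingerprints over word 3-grams and a Hamming-distance threshold; the function names, the 64-bit fingerprint size, the MD5-based hashing, and the threshold of 3 are all assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of word n-grams of a lowercased text (assumed n=3)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def simhash(text, bits=64):
    """Compute a SimHash fingerprint from the text's word n-grams."""
    counts = [0] * bits
    for sh in shingles(text):
        digest = hashlib.md5(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each n-gram votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint if its votes are positive.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicate_pairs(docs, max_distance=3):
    """Brute-force all pairs of a dict {doc_id: text}; hypothetical threshold."""
    fps = {doc_id: simhash(text) for doc_id, text in docs.items()}
    ids = sorted(fps)
    return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
            if hamming(fps[a], fps[b]) <= max_distance]


docs = {
    "a": "Near-duplicate web documents are abundant in large web crawls.",
    "b": "NEAR-DUPLICATE WEB DOCUMENTS ARE ABUNDANT IN LARGE WEB CRAWLS!!!",
    "c": "Boilerplate detection relies on shallow text features of a page.",
}
pairs = near_duplicate_pairs(docs)
```

The all-pairs comparison is quadratic and only suitable for small sets; at the scale of a web crawl, fingerprints would have to be bucketed or indexed first, which this sketch deliberately leaves out.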

Index Terms

CopyCat: Near-Duplicates Within and Between the ClueWeb and the Common Crawl
1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results

Published In

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2021

2998 pages

ISBN:9781450380379

DOI:10.1145/3404835

General Chairs: Fernando Diaz (Google), Chirag Shah (University of Washington), Torsten Suel (New York University)

Program Chairs: Pablo Castells (Universidad Autónoma de Madrid, Amazon), Rosie Jones (Spotify), Tetsuya Sakai (Waseda University)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Conference

SIGIR '21

Sponsor:

SIGIR

SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 11 - 15, 2021

Virtual Event, Canada

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Total Citations: 6
Total Downloads: 223
Downloads (last 12 months): 49
Downloads (last 6 weeks): 3

Reflects downloads up to 16 Feb 2025

Citations

Cited By

Song M, Hua Z, Zheng Y, Xiang T, Jia X (2024). SimLESS: A Secure Deduplication System Over Similar Data in Cloud Media Sharing. IEEE Transactions on Information Forensics and Security 19, 4700-4715. Online publication date: 2024. https://doi.org/10.1109/TIFS.2024.3382603
Tang Z, Zeng S, Han S, Yu Q, Jiang S, Chen P (2024). DEFD: Dual-Entity Fuzzy Deduplication for Untrusted Environments. 2024 21st Annual International Conference on Privacy, Security and Trust (PST), 1-11. Online publication date: 28-Aug-2024. https://doi.org/10.1109/PST62714.2024.10788052
Chu Z, Sakai T, Ai Q, Liu Y (2023). Chuweb21D: A Deduped English Document Collection for Web Search Tasks. Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 63-72. Online publication date: 26-Nov-2023. https://dl.acm.org/doi/10.1145/3624918.3625317
Jiang T, Yuan X, Chen Y, Cheng K, Wang L, Chen X, Ma J (2023). FuzzyDedup: Secure Fuzzy Deduplication for Cloud Storage. IEEE Transactions on Dependable and Secure Computing 20:3, 2466-2483. Online publication date: 1-May-2023. https://dl.acm.org/doi/10.1109/TDSC.2022.3185313
Bondarenko A, Fröbe M, Kiesel J, Syed S, Gurcke T, Beloucif M, Panchenko A, Biemann C, Stein B, Wachsmuth H, Potthast M, Hagen M (2022). Overview of Touché 2022: Argument Retrieval. Advances in Information Retrieval, 339-346. Online publication date: 10-Apr-2022. https://dl.acm.org/doi/10.1007/978-3-030-99739-7_43
Bondarenko A, Gienapp L, Fröbe M, Beloucif M, Ajjour Y, Panchenko A, Biemann C, Stein B, Wachsmuth H, Potthast M, Hagen M (2021). Overview of Touché 2021: Argument Retrieval. Experimental IR Meets Multilinguality, Multimodality, and Interaction, 450-467. Online publication date: 21-Sep-2021. https://dl.acm.org/doi/10.1007/978-3-030-85251-1_28
