DOI: 10.1145/3404835.3463246
short-paper

CopyCat: Near-Duplicates Within and Between the ClueWeb and the Common Crawl

Published: 11 July 2021

Abstract

The number of near-duplicates in web crawls like the ClueWeb or Common Crawl forces their users either to develop a preprocessing pipeline for deduplication, which is costly both computationally and in person hours, or to accept the undesired effects that near-duplicates have on the reliability and validity of experiments. We introduce ChatNoir-CopyCat-21, which simplifies deduplication significantly. It comes in two parts: (1) a compilation of near-duplicate documents within the ClueWeb09, the ClueWeb12, and two Common Crawl snapshots, as well as between selections of these crawls, and (2) a software library that implements the deduplication of arbitrary document sets. Our analysis shows that 14–52% of the documents within a crawl and around 0.7–2.5% between the crawls are near-duplicates. Two showcases demonstrate the application and usefulness of our resource.

    Published In

    SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2021
    2998 pages
    ISBN:9781450380379
    DOI:10.1145/3404835
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. TREC evaluation
    2. near-duplicate detection
    3. relevance transfer

    Qualifiers

    • Short-paper

    Conference

    SIGIR '21

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 49
    • Downloads (last 6 weeks): 3
    Reflects downloads up to 16 Feb 2025

    Cited By

    • (2024) SimLESS: A Secure Deduplication System Over Similar Data in Cloud Media Sharing. IEEE Transactions on Information Forensics and Security 19, 4700-4715. DOI: 10.1109/TIFS.2024.3382603. Online publication date: 2024
    • (2024) DEFD: Dual-Entity Fuzzy Deduplication for Untrusted Environments. 2024 21st Annual International Conference on Privacy, Security and Trust (PST), 1-11. DOI: 10.1109/PST62714.2024.10788052. Online publication date: 28-Aug-2024
    • (2023) Chuweb21D: A Deduped English Document Collection for Web Search Tasks. Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 63-72. DOI: 10.1145/3624918.3625317. Online publication date: 26-Nov-2023
    • (2023) FuzzyDedup: Secure Fuzzy Deduplication for Cloud Storage. IEEE Transactions on Dependable and Secure Computing 20:3, 2466-2483. DOI: 10.1109/TDSC.2022.3185313. Online publication date: 1-May-2023
    • (2022) Overview of Touché 2022: Argument Retrieval. Advances in Information Retrieval, 339-346. DOI: 10.1007/978-3-030-99739-7_43. Online publication date: 10-Apr-2022
    • (2021) Overview of Touché 2021: Argument Retrieval. Experimental IR Meets Multilinguality, Multimodality, and Interaction, 450-467. DOI: 10.1007/978-3-030-85251-1_28. Online publication date: 21-Sep-2021
