DOI: 10.1145/3529372.3530934
Research article

Investigating bloom filters for web archives' holdings

Published: 20 June 2022

Abstract

What web archives hold is often opaque to the public, and even domain experts struggle to provide precise assessments. Given the increasing need for and use of crawled and archived web resources, discovery of individual records and sharing of entire holdings are pressing use cases. We investigate Bloom Filters (BFs) and their applicability to these use cases. We experiment with and analyze the parameters for their creation, measure their performance, outline an approach for scalability, and describe several pilot implementations that showcase their potential to meet these needs. BFs have beneficial characteristics and have therefore been popular in various domains. We highlight their suitability for web archiving use cases and how they can contribute to very fast and accurate search services.
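As a rough illustration of the idea in the abstract, the sketch below builds a Bloom filter over a set of archived URLs and tests membership. It is not the authors' implementation: the textbook sizing formulas (m = -n ln p / (ln 2)^2 bits and k = (m/n) ln 2 hash functions), the SHA-256 based double hashing, the names `optimal_parameters` and `BloomFilter`, and the `example.com` URLs are all assumptions made for this example. A real archive would likely also canonicalize URLs before insertion.

```python
import hashlib
import math


def optimal_parameters(n, p):
    """Return (m, k): bit count and hash count for n items at target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


class BloomFilter:
    """Minimal Bloom filter sketch; hypothetical, for illustration only."""

    def __init__(self, n, p):
        self.m, self.k = optimal_parameters(n, p)
        self.bits = bytearray((self.m + 7) // 8)  # m bits, rounded up to whole bytes

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest
        # (double hashing: position_i = h1 + i * h2 mod m).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly held"; False means "definitely not held".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


if __name__ == "__main__":
    holdings = BloomFilter(n=1000, p=0.01)
    holdings.add("https://example.com/")
    print("https://example.com/" in holdings)          # an added URL is always reported as present
    print(holdings.m, "bits,", holdings.k, "hashes")
```

A lookup that returns `False` is definitive, while `True` can be a false positive at roughly the configured rate. This is why such a filter suits a quick "does this archive possibly hold the URL?" check ahead of a slower, authoritative index query.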

Cited By

  • (2024) Exploiting the untapped functional potential of Memento aggregators beyond aggregation. International Journal on Digital Libraries 25:1 (93-104). DOI: 10.1007/s00799-023-00391-0. Online publication date: 27-Jan-2024
  • (2023) Double locality sensitive hashing Bloom filter for high-dimensional streaming anomaly detection. Information Processing and Management: an International Journal 60:3. DOI: 10.1016/j.ipm.2023.103306. Online publication date: 1-May-2023

Published In

JCDL '22: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries
June 2022
392 pages
ISBN: 9781450393454
DOI: 10.1145/3529372
  • General Chairs: Akiko Aizawa, Thomas Mandl, Zeljko Carevic
  • Program Chairs: Annika Hinze, Philipp Mayr, Philipp Schaer
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

In-Cooperation

  • IEEE Technical Committee on Digital Libraries (TC DL)

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Best Paper

Author Tags

  1. bloom filters
  2. index sharing
  3. web archive profiling
  4. web archives

Qualifiers

  • Research-article

Conference

JCDL '22

Acceptance Rates

JCDL '22 paper acceptance rate: 35 of 132 submissions (27%)
Overall acceptance rate: 415 of 1,482 submissions (28%)
