A Pareto optimal Bloom filter family with hash adaptivity

  • Regular Paper
  • Published in The VLDB Journal

Abstract

The Bloom filter is a compact, memory-efficient probabilistic data structure that supports membership testing, i.e., checking whether an element is in a given set. However, because the Bloom filter maps each element with random hash functions, it offers little flexibility even when information about negative keys (elements not in the set) is available, especially when misidentifying different negative keys incurs different costs. The problem worsens when the hash functions are non-uniform, i.e., they map elements into the Bloom filter non-uniformly. To address this problem, we propose a new hash adaptive Bloom filter (HABF) that supports customizing hash functions for keys. In addition, we propose a filter family that includes f-HABF (a fast-hashing version), c-HABF (a cache-friendly version), and s-HABF (a stacked version). We show that the HABF family is Pareto optimal among all compared filters in terms of accuracy and query latency. We conduct extensive experiments on representative datasets, and the results show that the HABF family overall outperforms the standard Bloom filter and its state-of-the-art variants in terms of accuracy, construction/query time, and memory consumption. All source code is available at https://github.com/njulands/HashAdaptiveBF.
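
For readers unfamiliar with the baseline structure, the sketch below shows a minimal standard Bloom filter of the kind the abstract contrasts HABF against: k random hash functions set and test bits in a shared bit array, so all negative keys are treated identically. This is an illustrative sketch only, not the HABF construction; the double-hashing trick and all class and parameter names here are our own assumptions.

```cpp
// Minimal standard Bloom filter sketch (illustrative; not the HABF construction).
// Two base hashes simulate k hash functions via the common double-hashing trick.
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

class BloomFilter {
public:
    BloomFilter(std::size_t m_bits, std::size_t k) : bits_(m_bits, false), k_(k) {}

    void insert(const std::string& key) {
        for (std::size_t i = 0; i < k_; ++i) bits_[index(key, i)] = true;
    }

    // May return true for a key that was never inserted (a false positive), but never
    // false for an inserted key; HABF targets the cost of exactly those false positives.
    bool contains(const std::string& key) const {
        for (std::size_t i = 0; i < k_; ++i)
            if (!bits_[index(key, i)]) return false;
        return true;
    }

private:
    std::size_t index(const std::string& key, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>{}(key);
        std::size_t h2 = std::hash<std::string>{}(key + "#");  // crude second hash for the sketch
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t k_;
};
```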

Notes

  1. The cache line size is usually 512 bits (64 bytes) on modern x86-64 processors.
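
To make the footnote concrete, the sketch below shows a cache-friendly (blocked) Bloom filter probe in which all of a key's bits fall inside one 512-bit block, so each query touches a single cache line. The block layout, hashing, and names are illustrative assumptions and do not reproduce the paper's c-HABF design.

```cpp
// Sketch of a blocked Bloom filter: every bit a key sets or tests lies in one
// 512-bit block, so a query touches a single cache line (in practice the block
// array would also be 64-byte aligned). Illustrative only, not c-HABF.
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct Block512 { std::array<std::uint64_t, 8> words{}; };  // 8 x 64 = 512 bits

class BlockedBloomFilter {
public:
    BlockedBloomFilter(std::size_t num_blocks, std::size_t k) : blocks_(num_blocks), k_(k) {}

    void insert(const std::string& key) {
        Block512& b = blocks_[block_of(key)];
        for (std::size_t i = 0; i < k_; ++i) {
            std::size_t bit = bit_of(key, i);              // position within the 512-bit block
            b.words[bit >> 6] |= 1ULL << (bit & 63);
        }
    }

    bool contains(const std::string& key) const {
        const Block512& b = blocks_[block_of(key)];
        for (std::size_t i = 0; i < k_; ++i) {
            std::size_t bit = bit_of(key, i);
            if (!(b.words[bit >> 6] & (1ULL << (bit & 63)))) return false;
        }
        return true;
    }

private:
    std::size_t block_of(const std::string& key) const {
        return std::hash<std::string>{}(key) % blocks_.size();
    }
    std::size_t bit_of(const std::string& key, std::size_t i) const {
        // crude per-probe hash for the sketch, restricted to one block's 512 bits
        return std::hash<std::string>{}(key + static_cast<char>('a' + i)) % 512;
    }

    std::vector<Block512> blocks_;
    std::size_t k_;
};
```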

References

  1. Our source codes. https://github.com/njulands/HashAdaptiveBF

  2. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)

  3. Sears, R., Ramakrishnan, R.: BLSM: a general purpose log structured merge tree. In: Proceedings of the International Conference on Management of Data. ACM (2012)

  4. O’Neil, P., Cheng, E., Gawlick, D., O’Neil, E.: The log-structured merge-tree (lsm-tree). Acta Informatica. Springer, pp. 351–385 (1996)

  5. LevelDB: a fast and lightweight key/value database library (2011). http://code.google.com/p/leveldb/

  6. RocksDB: a Facebook fork of LevelDB optimized for flash and big-memory machines (2013). https://rocksdb.org/

  7. Mackert, L.F., Lohman, G.M.: R* optimizer validation and performance evaluation for distributed queries. In: Proceedings of International Conference on Very Large Data Bases. VLDB Endowment (1986)

  8. Xiao, B., Chen, W., He, Y.: A novel approach to detecting ddos attacks at an early stage. J. Supercomput. 1, 235–248 (2006)

  9. Bruck, J., Gao, J., Jiang, A.: Weighted Bloom filter. In: Proceedings of International Symposium on Information Theory. IEEE (2006)

  10. Kraska, T., Beutel, A., Chi, E.H., Dean, J., Polyzotis, N.: The case for learned index structures. In: Proceedings of the International Conference on Management of Data. ACM (2018)

  11. Dai, Z., Shrivastava, A.: Adaptive learned Bloom filter (Ada-BF): efficient utilization of the classifier. arXiv preprint (2019)

  12. Mitzenmacher, M.: A model for learned Bloom filters and optimizing by sandwiching. In: Advances in Neural Information Processing Systems. Curran Associates (2018)

  13. Deeds, K., Hentschel, B., Idreos, S.: Stacked filters: learning to filter by structure. In: Proceedings of International Conference on Very Large Data Bases. VLDB Endowment (2021)

  14. Dai, H.P., Zhong, Y.K., Liu, A.X., Wang, W., Li, M.: Noisy Bloom filters for multi-set membership testing. In: Proceedings of the International Conference on Measurement and Modeling of Computer Science. ACM (2016)

  15. Graf, T.M., Lemire, D.: Xor filters: Faster and smaller than bloom and cuckoo filters. J. Experim. Algorithm. 1, 1–16 (2020)

  16. Cohen, S., Matias, Y.: Spectral Bloom filters. In: Proceedings of the International Conference on Management of Data. ACM (2003)

  17. Guo, D., Wu, J., Chen, H., Yuan, Y., Luo, X.: The dynamic Bloom filters. Trans. Knowl. Data Eng. 22, 120–133 (2009)

  18. Kirsch, A., Mitzenmacher, M.: Less hashing, same performance: building a better bloom filter. In: Proceedings of European Symposium on Algorithms. Springer (2006)

  19. Hao, F., Kodialam, M., Lakshman, T.: Building high accuracy Bloom filters using partitioned hashing. In: International Conference on Measurement and Modeling of Computer Systems. ACM (2007)

  20. Deng, F., Rafiei, D.: Approximately detecting duplicates for streaming data using stable Bloom filters. In: Proceedings of the international conference on Management of data. ACM (2006)

  21. Mitzenmacher, M.: Compressed bloom filters. Trans. Netw. 10, 604–612 (2002)

  22. Henke, C., Schmoll, C., Zseby, T.: Empirical evaluation of hash functions for multipoint measurements. ACM SIGCOMM Comput. Commun. Rev. 38(3), 39–50 (2008)

  23. Estébanez, C., Saez, Y., Recio, G., Isasi, P.: Performance of the most common non-cryptographic hash functions. Softw.: Pract. Experience 44(6), 681–698 (2014)

  24. Lovett, K.: Miscellaneous hash functions. http://www.call-with-current-continuation.org/eggs/hashes.html

  25. Rae, J.W., Bartunov, S., Lillicrap, T.P.: Meta-Learning Neural Bloom Filters. In: Proceedings of International Conference on Machine Learning. ACM (2019)

  26. Bhattacharya, A., Bedathur, S., Bagchi, A.: Adaptive learned bloom filters under incremental workloads. In: Proceedings of India Joint International Conference on Data Science and Management of Data. ACM (2020)

  27. Realtime URI Blacklist. http://uribl.com/

  28. Babcock, B., Olston, C.: Distributed top-k monitoring. In: Proceedings of the International Conference on Management of Data. ACM (2003)

  29. Cormode, G., Muthukrishnan, S.: What’s hot and what’s not: tracking most frequent items dynamically. Transactions on Database Systems. ACM, pp. 249–278 (2005)

  30. Wu, F., Yang, M.H., Zhang, B., Du, D.H.: Ac-key: Adaptive caching for lsm-based key-value stores. In: Proceedings of Annual Technical Conference. USENIX (2020)

  31. Breslau, L., Cao, P., Fan, L., Phillips, G., Shenker, S.: Web caching and zipf-like distributions: Evidence and implications. In: Proceedings of International Conference on Computer Communications. IEEE (1999)

  32. Li, Y., Tian, C., Guo, F., Li, C., Xu, Y.: Elasticbf: elastic bloom filter with hotness awareness for boosting read performance in large key-value stores. In: Proceedings of Annual Technical Conference. USENIX (2019)

  33. Xie, R.B., Li, M., Miao, Z.Y., Gu, R., Huang, H. Dai, H.P., Chen, G.H.: Hash Adaptive Bloom Filter. In: Proceedings of International Conference on Data Engineering. IEEE (2021)

  34. Fan, L., Cao, P., Almeida, J., Broder, A.Z.: Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw. 8(3), 281–293 (2000)

  35. Gosselin-Lavigne, M.A., Gonzalez, H., Stakhanova, N., Ghorbani, A.A.: A performance evaluation of hash functions for ip reputation lookup using Bloom filters. In: Proceedings of International Conference on Availability, Reliability and Security. IEEE (2015)

  36. Broder, A., Mitzenmacher, M.: Network applications of Bloom filters: a survey. Internet Mathematics, pp. 485–509 (2004)

  37. Dillinger, P.C., Hübschle-Schneider, L., Sanders, P., Walzer, S.: Fast succinct retrieval and approximate membership using ribbon. arXiv preprint arXiv:2109.01892 (2021)

  38. Zhong, M., Lu, P., Shen, K., Seiferas, J.: Optimizing data popularity conscious Bloom filters. In: Proceedings of symposium on Principles of distributed computing. ACM (2008)

  39. Dayan, N., Athanassoulis, M., Idreos, S.: Monkey: Optimal navigable key-value store. In: International Conference on Management of Data. ACM, pp. 79–94 (2017)

  40. Byun, H., Lim, H.: Learned FBF: learning-based functional bloom filter for key-value storage. Trans. Comput 1, 1 (2021)

  41. Dillinger, P.C.: Adaptive approximate state storage. Ph.D. thesis, Northeastern University (2010)

  42. Hardy, G.H., Littlewood, J.E., Pólya, G., Littlewood, D.: Inequalities. Cambridge University Press (1952)

  43. Appendix https://njulimn.github.io/assets/pdf/VLDBJ_Appendix.pdf

  44. Putze, F., Sanders, P., Singler, J.: Cache-, hash- and space-efficient bloom filters. In: Proceedings of International conference on Experimental algorithms. Springer (2007)

  45. Lang, H., Neumann, T., Kemper, A., Boncz, P.: Performance-optimal filtering: Bloom overtakes cuckoo at high throughput. In: Proceedings of the VLDB Endowment (2019)

  46. Fastfilter. https://github.com/FastFilter/fastfilter_cpp

  47. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)

  48. Hopfield, J.J.: Neural networks and physical systems with emergent collective computational abilities. In: Proceedings of the national academy of sciences. National Acad Sciences (1982)

  49. Keras. https://keras.io/

  50. xxhash. https://github.com/Cyan4973/xxHash

  51. Cityhash. https://github.com/google/cityhash

  52. Murmurhash. https://sites.google.com/site/murmurhash/

  53. Smhasher. https://github.com/rurban/smhasher

  54. Jenkins, R. http://www.burtleburtle.net/bob/hash/doobs.html

  55. Shalla’s blacklists. http://www.shallalist.de/index.html

  56. Singhal, K., Weiss, P.: DeepBloom. https://github.com/karan1149/DeepBloom

  57. Cooper, B.F., Silberstein, A., Tam, E., Ramakrishnan, R., Sears, R.: Benchmarking cloud serving systems with YCSB. In: Proceedings of Symposium on Cloud Computing. ACM (2010)

  58. Powers, D.M.: Applications and explanations of Zipf’s law. In: Proceedings of Association for Computational Linguistics. ACL (1998)

Acknowledgements

We thank all the reviewers for their constructive comments and Zheyu Miao for help along the way. This work was supported in part by the National Natural Science Foundation of China under Grant 61872178, in part by the National Natural Science Foundation of China (Nos. 61832005, 62072230, U1811461), in part by the Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University, in part by the Jiangsu High-level Innovation and Entrepreneurship (Shuangchuang) Program, and in part by the Alibaba Innovative Research Project.

Author information

Corresponding authors

Correspondence to Haipeng Dai or Guihai Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, M., Xie, R., Chen, D. et al. A Pareto optimal Bloom filter family with hash adaptivity. The VLDB Journal 32, 525–548 (2023). https://doi.org/10.1007/s00778-022-00755-z
