A survey of estimating number of distinct values

  • Review Article
  • Frontiers of Computer Science

Abstract

Estimating the number of distinct values (NDV) is a critical task in databases and data streams. Over time, various NDV estimation algorithms have been developed, each tailored to different requirements for time, I/O, and accuracy. These algorithms fall into two broad categories: sampling-based and sketch-based. Sampling-based algorithms improve efficiency by reading only a sample of the items rather than all of them, often at the cost of reduced accuracy. In contrast, sketch-based algorithms scan the entire dataset and maintain a compact hash-based sketch, typically offering higher accuracy at the expense of increased I/O cost. For large-scale data, scanning the entire table may be infeasible, so the challenge of estimating the NDV both efficiently and accurately has persisted for decades. This paper reviews the fundamental concepts and key techniques of NDV estimation and provides a comparative analysis of the various algorithms. We first briefly examine traditional estimators in chronological order, then discuss in depth the newer estimators developed over the past decade, highlighting the specific scenarios in which each is applicable. Furthermore, we illustrate how NDV estimation algorithms have been adapted to the complexities of modern real-world data environments. Despite significant progress in NDV estimation research, challenges remain in theoretical scalability and practical application. This paper also explores potential future directions, including block-sampling NDV estimation and learning-based NDV estimation, and their implications for database applications.
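
To make the two categories in the abstract concrete, the following minimal Python sketch contrasts one estimator of each kind. It is an illustration under stated assumptions, not the survey's own method: the sampling-based side uses Chao's frequency-of-frequencies estimator, the sketch-based side uses a K-Minimum-Values (KMV) sketch, and the names `chao_estimate`, `kmv_estimate`, and `hash01` as well as the default `k=256` are hypothetical choices made for this example.

```python
import hashlib
import heapq
import random
from collections import Counter


def chao_estimate(sample):
    """Sampling-based: estimate NDV from a sample only (Chao-style estimator)."""
    counts = Counter(sample)
    d = len(counts)                                  # distinct values seen in the sample
    f1 = sum(1 for c in counts.values() if c == 1)   # values seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)   # values seen exactly twice
    if f2 == 0:
        # Bias-corrected form, used here to avoid dividing by zero.
        return d + f1 * (f1 - 1) / 2.0
    return d + f1 * f1 / (2.0 * f2)


def hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) (illustrative helper)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(items, k=256):
    """Sketch-based: scan every item, keep only the k smallest distinct hashes."""
    heap = []        # max-heap via negation; holds the k smallest hashes so far
    members = set()  # hashes currently stored in the sketch
    for item in items:
        h = hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the largest stored hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes: the sketch holds all of them.
        return float(len(heap))
    return (k - 1) / (-heap[0])  # (k - 1) divided by the k-th smallest hash


if __name__ == "__main__":
    rng = random.Random(0)
    table = [rng.randrange(5000) for _ in range(100_000)]
    sample = rng.sample(table, 2000)           # reads 2% of the rows
    print("true NDV:      ", len(set(table)))
    print("sampling-based:", chao_estimate(sample))
    print("sketch-based:  ", kmv_estimate(table))  # reads every row
```

The trade-off described above shows up directly in the call sites: `chao_estimate` touches only the sampled rows, while `kmv_estimate` must visit every row but stores at most `k` hash values.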

Acknowledgements

This research was supported in part by the National Science and Technology Major Project (2022ZD0114802), the National Natural Science Foundation of China (Grant Nos. U2241212, 61932001), the Beijing Natural Science Foundation (No. 4222028), the Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098), and the Huawei-Renmin University joint program on Information Retrieval. We also wish to acknowledge the support provided by the fund for building world-class universities (disciplines) of Renmin University of China, the Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education, the Intelligent Social Governance Interdisciplinary Platform, the Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, the Public Policy and Decision-making Research Lab, and the Public Computing Cloud, Renmin University of China.

Author information

Corresponding author

Correspondence to Zhewei Wei.

Ethics declarations

Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.

Additional information

Jiajun LI is a PhD candidate at the Gaoling School of Artificial Intelligence, Renmin University of China, China, advised by Professor Zhewei Wei. He received his BE degree from the School of Statistics, Renmin University of China, China in 2019. His research focuses on approximate computing algorithms, AI for databases (AI4DB), and the design of data stream and sketch algorithms.

Runlin LEI is a PhD candidate at the Gaoling School of Artificial Intelligence, Renmin University of China, China, advised by Professor Zhewei Wei. He received his BE degree from the School of Information and Technology, Shanghai University of Finance and Economics, China in 2022. His research focuses on graph neural networks and AI for databases.

Zhewei WEI is currently a professor at the Gaoling School of Artificial Intelligence, Renmin University of China, China. He obtained his PhD degree from the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology (HKUST), China in 2012, and his BSc degree from the School of Mathematical Sciences, Peking University, China in 2008. His research interests include graph algorithms, massive data algorithms, and streaming algorithms. He was the Proceedings Chair of SIGMOD/PODS 2020 and ICDT 2021, and an Area Chair of ICML 2022/2023, NeurIPS 2022/2023, ICLR 2023, and WWW 2023. He has also served as a PC member of various top conferences, such as VLDB, KDD, ICDE, ICML, and NeurIPS.

About this article

Cite this article

Li, J., Lei, R. & Wei, Z. A survey of estimating number of distinct values. Front. Comput. Sci. 19, 199611 (2025). https://doi.org/10.1007/s11704-024-40952-3

  • DOI: https://doi.org/10.1007/s11704-024-40952-3
