Abstract
Estimating the Number of Distinct Values (NDVs) is a critical task in the fields of databases and data streams. Over time, various algorithms for estimating NDVs have been developed, each tailored to different requirements for time, I/O, and accuracy. These algorithms can be broadly categorized into two main types: sampling-based and sketch-based. Sampling-based NDV algorithms improve efficiency by sampling rather than accessing all items, often at the cost of reduced accuracy. In contrast, sketch-based NDV algorithms maintain a compact sketch using hashing to scan the entire dataset, typically offering higher accuracy but at the expense of increased I/O costs. When dealing with large-scale data, scanning the entire table may become infeasible. Thus, the challenge of efficiently and accurately estimating NDVs has persisted for decades. This paper provides a comprehensive review of the fundamental concepts, key techniques, and a comparative analysis of various NDV estimation algorithms. We first briefly examine traditional estimators in chronological order, followed by an in-depth discussion of the newer estimators developed over the past decade, highlighting the specific scenarios in which they are applicable. Furthermore, we illustrate how NDV estimation algorithms have been adapted to address the complexities of modern real-world data environments effectively. Despite significant progress in NDV estimation research, challenges remain in terms of theoretical scalability and practical application. This paper also explores potential future directions, including block sampling NDV estimation, learning-based NDV estimation, and their implications for database applications.
Similar content being viewed by others
References
Fisher R A, Corbet A S, Williams C B. The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 1943, 12(1): 42–58
Efron B, Thisted R. Estimating the number of unseen species: how many words did Shakespeare know? Biometrika, 1976, 63(3): 435–447
Valiant G, Valiant P. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In: Proceedings of the 43rd Annual ACM Symposium on Theory of Computing. 2011, 685–694
Hou W C, Ozsoyoglu G, Taneja B K. Processing aggregate relational queries with hard time constraints. In: Proceedings of 1989 ACM SIGMOD International Conference on Management of Data. 1989, 68–77
Ozsoyoglu G, Du K, Tjahjana A, Hou W C, Rowland D Y. On estimating count, sum, and average relational algebra queries. In: Proceedings of the International Conference on Database and Expert Systems Applications. 1991, 406–412
Haas P J, Naughton J F, Seshadri S, Stokes L. Sampling-based estimation of the number of distinct values of an attribute. In: Proceedings of the 21st International Conference on Very Large Data Bases. 1995, 311–322
Lemire D, Kaser O. Reordering columns for smaller indexes. Information Sciences, 2011, 181(12): 2550–2570
Li P, Wei W, Zhu R, Ding B, Zhou J, Lu H. ALECE: an attention-based learned cardinality estimator for SPJ queries on dynamic workloads. Proceedings of the VLDB Endowment, 2023, 17(2): 197–210
Chabchoub Y, Chiky R, Dogan B. How can sliding HyperLogLog and EWMA detect port scan attacks in IP traffic? EURASIP Journal on Information Security, 2014, 2014(1): 5
Cohen R, Nezri Y. Cardinality estimation in a virtualized network device using online machine learning. IEEE/ACM Transactions on Networking, 2019, 27(5): 2098–2110
Clemens V, Schulz L C, Gartner M, Hausheer D. DDoS detection in P4 using HYPERLOGLOG and COUNTMIN sketches. In: Proceedings of NOMS 2023-2023 IEEE/IFIP Network Operations and Management Symposium. 2023, 1–6
Kalai A T, Vempala S S. Calibrated language models must hallucinate. In: Proceedings of the 56th Annual ACM Symposium on Theory of Computing. 2024, 160–171
Bunge J, Fitzpatrick M. Estimating the number of species: a review. Journal of the American Statistical Association, 1993, 88(421): 364–373
Harmouch H, Naumann F. Cardinality estimation: an experimental survey. Proceedings of the VLDB Endowment, 2017, 11(4): 499–512
Batu T, Fortnow L, Rubinfeld R, Smith W D, White P. Testing that distributions are close. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science. 2000, 259–269
Paninski L. Estimation of entropy and mutual information. Neural Computation, 2003, 15(6): 1191–1253
Orlitsky A, Santhanam N P, Viswanathan K, Zhang J. On modeling profiles instead of values. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. 2004, 426–435
Brutlag J D, Richardson T S. A block sampling approach to distinct value estimation. Journal of Computational and Graphical Statistics, 2002, 11(2): 389–404
Chaudhuri S, Das G, Srivastava U. Effective use of block-level sampling in statistics estimation. In: Proceedings of 2004 ACM SIGMOD International Conference on Management of Data. 2004, 287–298
Cormode G, Muthukrishnan S. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 2005, 55(1): 58–75
Bloom B H. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 1970, 13(7): 422–426
Li B, Lu Y, Wang C, Kandula S. Q-error bounds of random uniform sampling for cardinality estimation. 2021, arXiv preprint arXiv: 2108.02715
Charikar M, Chaudhuri S, Motwani R, Narasayya V. Towards estimation error guarantees for distinct values. In: Proceedings of the 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. 2000, 268–279
Goodman L A. On the estimation of the number of classes in a population. The Annals of Mathematical Statistics, 1949, 20(4): 572–579
Good I J, Toulmin G H. The number of new species, and the increase in population coverage, when a sample is increased. Biometrika, 1956, 43(1–2): 45–63
Bromwich T J I A. An Introduction to the Theory of Infinite Series. Providence: American Mathematical Society, 2005
Shlosser A. On estimation of the size of the dictionary of a long text on the basis of a sample. Engineering Cybernetics, 1981, 19(1): 97–102
Haas P J, Stokes L. Estimating the number of classes in a finite population. Journal of the American Statistical Association, 1998, 93(444): 1475–1487
Quenouille M H. Problems in plane sampling. The Annals of Mathematical Statistics, 1949, 20(3): 355–375
Burnham K P, Overton W S. Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika, 1978, 65(3): 625–633
Burnham K P, Overton W S. Robust estimation of population size when capture probabilities vary among animals. Ecology, 1979, 60(5): 927–936
Heltshe J F, Forrester N E. Estimating species richness using the jackknife procedure. Biometrics, 1983, 39(1): 1–11
Smith E P, van Belle G. Nonparametric estimation of species richness. Biometrics, 1984, 40(1): 119–129
Efron B. Bootstrap methods: another look at the jackknife. In: Kotz S, Johnson N L, eds. Breakthroughs in Statistics: Methodology and Distribution. New York: Springer, 1992, 569–593
Horvitz D G, Thompson D J. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 1952, 47(260): 663–685
Särndal C E, Swensson B, Wretman J. Model Assisted Survey Sampling. New York: Springer, 2003
Chao A. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 1984, 11(4): 265–270
Chao A, Shen T. User’s guide for program spade (species prediction and diversity estimation). National Tsing Hua University, Dissertation, 2010
Chao A, Lee S M. Estimating the number of classes via sample coverage. Journal of the American statistical Association, 1992, 87(417): 210–217
Good I J. The population frequencies of species and the estimation of population parameters. Biometrika, 1953, 40(3–4): 237–264
Deolalikar V, Laffitte H. Extensive large-scale study of error in samping-based distinct value estimators for databases. 2016, arXiv preprint arXiv: 1612.00476
Motwani R, Vassilvitskii S. Distinct values estimators for power law distributions. In: Proceedings of the 3rd Workshop on Analytic Algorithmics and Combinatorics. 2006, 230–237
Korwar R M. On the observed number of classes from multivariate power series and hypergeometric distributions. Sankhyā: The Indian Journal of Statistics, Series B, 1988, 50(1): 39–59
Valiant P, Valiant G. Estimating the unseen: improved estimators for entropy and other properties. In: Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013, 2157–2165
Li J, Lei R, Wang S, Wei Z, Ding B. Learning-based property estimation with polynomials. Proceedings of the ACM on Management of Data, 2024, 2(3): 1–27
Valiant G, Valiant P. Instance optimal learning of discrete distributions. In: Proceedings of the 48th Annual ACM Symposium on Theory of Computing. 2016, 142–155
Raghunathan A, Valiant G, Zou J. Estimating the unseen from multiple populations. In: Proceedings of the 34th International Conference on Machine Learning. 2017, 2855–2863
Valiant G, Valiant P. Estimating the unseen: improved estimators for entropy and other properties. Journal of the ACM (JACM), 2017, 64(6): 37
Valiant, G., & Valiant, P. (2010). Estimating the unseen: A sublinear-sample canonical estimator of distributions. Electronic Colloquium on Computational Complexity (ECCC), TR10-180.
Acharya J, Das H, Orlitsky A, Suresh A T. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In: Proceedings of the 34th International Conference on Machine Learning. 2017, 11–21
Pavlichin D S, Jiao J, Weissman T. Approximate profile maximum likelihood. Journal of Machine Learning Research, 2019, 20(122): 1–55
Acharya J, Das H, Mohimani H, Orlitsky A, Pan S. Exact calculation of pattern probabilities. In: Proceedings of 2010 IEEE International Symposium on Information Theory. 2010, 1498–1502
Orlitsky A, Sajama S, Santhanam N P, Viswanathan K, Zhang J. Algorithms for modeling distributions over large alphabets. In: Proceedings of International Symposium onInformation Theory. 2004, 304
Orlitsky A, Santhanam N, Viswanathan K, Zhang J. Theoretical and experimental results on modeling low probabilities. In: Proceedings of 2006 IEEE Information Theory Workshop – ITW’ 06 Punta del Este. 2006, 242–246
Vontobel P O. The bethe approximation of the pattern maximum likelihood distribution. In: Proceedings of 2012 IEEE International Symposium on Information Theory Proceedings. 2012, 2012–2016
Vontobel P O. The bethe and sinkhorn approximations of the pattern maximum likelihood estimate and their connections to the valiant-valiant estimate. In: Proceedings of 2014 Information Theory and Applications Workshop (ITA). 2014, 1–10
Wu Y, Yang P. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 2016, 62(6): 3702–3720
Wu Y, Yang P. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics, 2019, 47(2): 857–883
Chien I. Regularized weighted chebyshev approximations for support estimation. University of Illinois at Urbana-Champaign, Dissertation, 2019
Eden T, Indyk P, Narayanan S, Rubinfeld R, Silwal S, Wagner T. Learning-based support estimation in sublinear time. In: Proceedings of the 9th International Conference on Learning Representations. 2021
Wu R, Ding B, Chu X, Wei Z, Dai X, Guan T, Zhou J. Learning to be a statistician: learned estimator for number of distinct values. Proceedings of the VLDB Endowment, 2021, 15(2): 272–284
Whang K Y, Vander-Zanden B T, Taylor H M. A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems (TODS), 1990, 15(2): 208–229
Swamidass S J, Baldi P. Mathematical correction for fingerprint similarity measures to improve chemical retrieval. Journal of Chemical Information and Modeling, 2007, 47(3): 952–964
Papapetrou O, Siberski W, Nejdl W. Cardinality estimation and dynamic length adaptation for bloom filters. Distributed and Parallel Databases, 2010, 28(2): 119–156
Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. In: Proceedings of the 28th Annual ACM Symposium on Theory of Computing. 1996, 20–29
Cormode G. Count-min sketch. In: Liu L, Özsu M T, eds. Encyclopedia of Database Systems. New York: Springer, 2009, 511–516
Gibbons P B, Tirthapura S. Estimating simple functions on the union of data streams. In: Proceedings of the 13th Annual ACM Symposium on Parallel Algorithms and Architectures. 2001, 281–291
Bar-Yossef Z, Jayram T S, Kumar R, Sivakumar D, Trevisan L. Counting distinct elements in a data stream. In: Proceedings of the 6th International Workshop on Randomization and Approximation Techniques in Computer Science. 2002, 1–10
Beyer K, Gemulla R, Haas P J, Reinwald B, Sismanis Y. Distinct-value synopses for multiset operations. Communications of the ACM, 2009, 52(10): 87–95
Cohen E. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 1997, 55(3): 441–453
Dasgupta A, Lang K J, Rhodes L, Thaler J. A framework for estimating stream expression cardinalities. In: Proceedings of the 19th International Conference on Database Theory. 2016
Flajolet P, Martin G N. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 1985, 31(2): 182–209
Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 1999, 58(1): 137–147
Durand M, Flajolet P. Loglog counting of large cardinalities. In: Proceedings of the 11th Annual European Symposium on Algorithms. 2003, 605–617
Flajolet P, Fusy É, Gandouet O, Meunier F. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Proceedings of 2007 Conference on Analysis of Algorithms. 2007, 127–146
Hall A, Bachmann O, Büssow R, Gänceanu S, Nunkesser M. Processing a trillion cells per mouse click. Proceedings of the VLDB Endowment, 2012, 5(11): 1436–1446
Heule S, Nunkesser M, Hall A. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In: Proceedings of the 16th International Conference on Extending Database Technology. 2013, 683–692
Kane D M, Nelson J, Woodruff D P. An optimal algorithm for the distinct elements problem. In: Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. 2010, 41–52
Ting D. Approximate distinct counts for billions of datasets. In: Proceedings of the 2019 International Conference on Management of Data. 2019, 69–86
Chiosa M, Preußer T B, Alonso G. SKT: a one-pass multi-sketch data analytics accelerator. Proceedings of the VLDB Endowment, 2021, 14(11): 2369–2382
Ertl O. SetSketch: filling the gap between MinHash and HyperLogLog. Proceedings of the VLDB Endowment, 2021, 14(11): 2244–2257
Cormode G, Yi K. Small Summaries for Big Data. Cambridge: Cambridge University Press, 2020
Li J, Wei Z, Ding B, Dai X, Lu L, Zhou J. Sampling-based estimation of the number of distinct values in distributed environment. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2022, 893–903
Charikar M, Chen K, Farach-Colton M. Finding frequent items in data streams. In: Proceedings of the 29th International Colloquium on International Colloquium on Automata, Languages, and Programming. 2002, 693–703
Wang P, Xie D, Zhao J, Li J, Li Z, Li R, Ren Y, Di J. Half-Xor: a fully-dynamic sketch for estimating the number of distinct values in big tables. IEEE Transactions on Knowledge and Data Engineering, 2024, 36(7): 3111–3125
Wang F, Chen Q, Li Y, Yang T, Tu Y, Yu L, Cui B. JoinSketch: a sketch algorithm for accurate and unbiased inner-product estimation. Proceedings of the ACM on Management of Data, 2023, 1(1): 81
Shekelyan M, Cormode G. Sequential random sampling revisited: hidden shuffle method. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics. 2021, 3628–3636
Wei C, Salloum S, Emara T Z, Zhang X, Huang J Z, He Y. A two-stage data processing algorithm to generate random sample partitions for big data analysis. In: Proceedings of the 11th International Conference on Cloud Computing. 2018, 347–364
Salloum S, Huang J Z, He Y. Random sample partition: a distributed data model for big data analysis. IEEE Transactions on Industrial Informatics, 2019, 15(11): 5846–5854
Debnath S K, Dutta R. Secure and efficient private set intersection cardinality using bloom filter. In: Proceedings of the 18th International Conference on Information Security. 2015, 209–226
Guo D, Wu J, Chen H, Yuan Y, Luo X. The dynamic bloom filters. IEEE Transactions on Knowledge and Data Engineering, 2010, 22(1): 120–133
Ertl O. New cardinality estimation algorithms for HyperLogLog sketches. 2017, arXiv preprint arXiv: 1702.01284
Ertl O. UltraLogLog: a practical and more space-efficient alternative to HyperLogLog for approximate distinct counting. Proceedings of the VLDB Endowment, 2024, 17(7): 1655–1668
Tsan B, Datta A, Izenov Y, Rusu F. Approximate sketches. Proceedings of the ACM on Management of Data, 2024, 2(1): 66
Wang D, Pettie S. Better cardinality estimators for HyperLogLog, PCSA, and beyond. In: Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 2023, 317–327
Pettie S, Wang D, Yin L. Non-mergeable sketching for cardinality estimation. In: Proceedings of the 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). 2021
Qiu Y, Wang Y, Yi K, Li F, Wu B, Zhan C. Weighted distinct sampling: cardinality estimation for SPJ queries. In: Proceedings of 2021 International Conference on Management of Data. 2021, 1465–1477
Dai B, Hu X, Yi K. Reservoir sampling over joins. Proceedings of the ACM on Management of Data, 2024, 2(3): 1–26
Kim K, Jung J, Seo I, Han W S, Choi K, Chong J. Learned cardinality estimation: an in-depth study. In: Proceedings of 2022 International Conference on Management of Data. 2022, 1214–1227
Kipf A, Kipf T, Radke B, Leis V, Boncz P A, Kemper A. Learned cardinalities: estimating correlated joins with deep learning. In: Proceedings of the 9th Biennial Conference on Innovative Data Systems Research. 2019
Sun J, Li G. An end-to-end learning-based cost estimator. Proceedings of the VLDB Endowment, 2019, 13(3): 307–319
Negi P, Wu Z, Kipf A, Tatbul N, Marcus R, Madden S, Kraska T, Alizadeh M. Robust query driven cardinality estimation under changing workloads. Proceedings of the VLDB Endowment, 2023, 16(6): 1520–1533
Hilprecht B, Schmidt A, Kulessa M, Molina A, Kersting K, Binnig C. DeepDB: learn from data, not from queries! Proceedings of the VLDB Endowment, 2020, 13(7): 992–1005
Yang Z, Kamsetty A, Luan S, Liang E, Duan Y, Chen X, Stoica I. NeuroCard: one cardinality estimator for all tables. Proceedings of the VLDB Endowment, 2020, 14(1): 61–73
Chien E, Milenkovic O, Nedich A. Support estimation with sampling artifacts and errors. In: Proceedings of 2021 IEEE International Symposium on Information Theory (ISIT). 2021, 244–249
Hao Y, Orlitsky A. Unified sample-optimal property estimation in near-linear time. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 996
Shan J, Fu Y, Ni G, Luo J, Wu Z. Fast counting the cardinality of flows for big traffic over sliding windows. Frontiers of Computer Science, 2017, 11(1): 119–129
Acknowledgements
This research was supported in part by the National Science and Technology Major Project (2022ZD0114802), the National Natural Science Foundation of China (Grant Nos. U2241212, 61932001), the Beijing Natural Science Foundation (No. 4222028), by the Beijing Outstanding Young Scientist Program (No.BJJWZYJH012019100020098), the Huawei-Renmin University joint program on Information Retrieval. We also wish to acknowledge the support provided by the fund for building world-class universities (disciplines) of Renmin University of China, by Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education, by Intelligent Social Governance Interdisciplinary Platform, Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Public Policy and Decision-making Research Lab, and Public Computing Cloud, Renmin University of China.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Competing interests The authors declare that they have no competing interests or financial conflicts to disclose.
Additional information
Jiajun LI is a PhD candidate at Gaolin School of Artificial Intelligence, Renmin University of China, China advised by Professor Zhewei Wei. He received his BE degree at School of Statistics, Renmin University of China, China in 2019. His research focuses on the approximate computing algorithm, AI for databases (AI4DB), and the design of data stream and sketch algorithms.
Runlin LEI is a PhD candidate at Gaolin School of Artificial Intelligence, Renmin University of China, China advised by Professor Zhewei Wei. He received his BE degree at School of Information and Technology, Shanghai University of Finance and Economics, China in 2022. His research focuses on graph neural networks and AI for databases.
Zhewei WEI is currently a professor at Gaoling School of Artificial Intelligence, Renmin University of China, China. He obtained his PhD degree at Department of Computer Science and Engineering, The Hong Kong University of Science and Technology (HKUST), China in 2012. He received the BSc degree in the School of Mathematical Sciences at Peking University, China in 2008. His research interests include graph algorithms, massive data algorithms, and streaming algorithms. He was the Proceeding Chair of SIGMOD/PODS2020 and ICDT2021, the Area Chair of ICML 2022/2023, NeurIPS 2022/2023, ICLR 2023, WWW 2023. He is also the PC member of various top conferences, such as VLDB, KDD, ICDE, ICML, and NeurIPS.
Electronic supplementary material
Rights and permissions
About this article
Cite this article
Li, J., Lei, R. & Wei, Z. A survey of estimating number of distinct values. Front. Comput. Sci. 19, 199611 (2025). https://doi.org/10.1007/s11704-024-40952-3
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11704-024-40952-3