ABSTRACT
The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques that are designed for use within a flexible and scalable "synopsis warehouse" architecture. In this setting, incoming data is split into partitions and a synopsis is created for each partition; each synopsis can then be used to quickly estimate the number of DVs in its corresponding partition. By combining and extending a number of results in the literature, we obtain both appropriate synopses and novel DV estimators to use in conjunction with these synopses. Our synopses can be created in parallel, and can then be easily combined to yield synopses and DV estimates for arbitrary unions, intersections or differences of partitions. Our synopses can also handle deletions of individual partition elements. We use the theory of order statistics to show that our DV estimators are unbiased, and to establish moment formulas and sharp error bounds. Based on a novel limit theorem, we can exploit results due to Cohen in order to select synopsis sizes when initially designing the warehouse. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.
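The abstract does not spell out how a synopsis is built, so the following is only a minimal sketch of one well-known construction from the order-statistics family, the k-minimum-values (KMV) synopsis. Each value is hashed to a pseudo-uniform number in (0, 1], the k smallest distinct hashes are kept, and the DV count is estimated as (k - 1)/U_(k), where U_(k) is the k-th smallest hash. The function names, the SHA-256 hash, and the choice k = 256 are illustrative assumptions and are not taken from the paper.

```python
import hashlib


def hash01(value) -> float:
    """Map a value to a pseudo-uniform number in (0, 1].

    SHA-256 is an illustrative choice, not one prescribed by the paper.
    """
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def build_synopsis(partition, k):
    """Return the k smallest distinct hash values of one partition, sorted."""
    return sorted({hash01(v) for v in partition})[:k]


def union_synopsis(a, b, k):
    """Combine two synopses (built with the same hash and the same k)
    into a size-k synopsis of the union of their partitions."""
    return sorted(set(a) | set(b))[:k]


def estimate_dv(synopsis, k):
    """Estimate the number of distinct values from a size-k synopsis."""
    if len(synopsis) < k:
        # Fewer than k distinct values were seen, so the count is exact.
        return float(len(synopsis))
    # (k - 1) / U_(k), with U_(k) the k-th smallest hash value.
    return (k - 1) / synopsis[k - 1]


if __name__ == "__main__":
    k = 256
    part_a = build_synopsis(range(0, 6000), k)      # built independently,
    part_b = build_synopsis(range(4000, 10000), k)  # e.g. in parallel
    combined = union_synopsis(part_a, part_b, k)
    print("estimated DVs in union:", round(estimate_dv(combined, k)))
```

Because each synopsis depends only on its own partition, the synopses can be built independently and merged afterwards. The merged result is identical to the synopsis that would have been built from the unioned data directly. Intersections, differences, and deletions need extra bookkeeping that this sketch omits.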
REFERENCES
- N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. Sys. Sci., 58:137--147, 1999.
- M. Astrahan, M. Schkolnick, and K. Whang. Approximating the number of unique values of an attribute without sorting. Inf. Sys., 12:11--15, 1987.
- Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. RANDOM, pages 1--10, 2002.
- P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. Toward automated large-scale information integration and discovery. In Data Management in a Connected World, pages 161--180. Springer, 2005.
- P. G. Brown and P. J. Haas. Techniques for warehousing of sample data. In Proc. ICDE, 2006.
- M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. In Proc. ACM PODS, pages 268--279, 2000.
- E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Sys. Sci., 55:441--453, 1997.
- T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In Proc. ACM SIGMOD, pages 240--251, 2002.
- H. A. David and H. N. Nagaraja. Order Statistics. Wiley, third edition, 2003.
- A. R. Didonato and A. H. Morris, Jr. Algorithm 708; significant digit computation of the incomplete beta function ratios. ACM Trans. Math. Software, 18(3):360--373, 1992.
- M. Durand and P. Flajolet. Loglog counting of large cardinalities. In Proc. 11th Eur. Symp. Algorithms (ESA 2003), volume 2832 of Lecture Notes in Computer Science. Springer, 2003.
- C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high speed links. In Proc. SIGCOMM '02, pages 323--336, 2002.
- P. Flajolet. Adaptive sampling. In M. Hazewinkel, editor, Encyclopaedia of Mathematics, Supplement I. Kluwer, 1997.
- P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Computer Sys. Sci., 31:182--209, 1985.
- S. Ganguly, M. Garofalakis, and R. Rastogi. Tracking set-expression cardinalities over continuous update streams. VLDB J., 13:354--369, 2004.
- P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proc. VLDB, pages 541--550, 2001.
- P. B. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. ACM Symp. Parallel Algorithms and Architecture, pages 281--291, 2001.
- F. Giroire. Order statistics and estimating cardinalities of massive data sets. In Proc. Intl. Conf. Analysis Algorithms, pages 157--166, 2005.
- P. J. Haas, Y. Liu, and L. Stokes. An estimator of the number of species from quadrat sampling. Biometrics, 62:135--141, 2006.
- P. J. Haas and L. Stokes. Estimating the number of classes in a finite population. J. Amer. Statist. Assoc., 93:1475--1487, 1998.
- P. Hellekalek and S. Wegenkittl. Empirical evidence concerning AES. ACM Trans. Modelling Comput. Simulation, 13:322--333, 2003.
- Y. E. Ioannidis. The history of histograms (abridged). In Proc. VLDB, pages 19--30, 2003.
- N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions-2. Wiley, 2nd edition, 1995.
- S. Karlin and H. M. Taylor. A Second Course in Stochastic Processes. Academic Press, 1981.
- D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, 1973.
- M. Matsumoto and T. Nishimura. Mersenne twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. Modeling Computer Simulation, 8(1):3--30, 1998.
- R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
- S. Padmanabhan, B. Bhattacharjee, T. Malkemus, L. Cranston, and M. Huras. Multi-dimensional clustering: a new data layout scheme in DB2. In Proc. ACM SIGMOD, pages 637--641, 2003.
- P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proc. ACM SIGMOD, pages 23--34, 1979.
- R. J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
- A. Shukla, P. Deshpande, J. F. Naughton, and K. Ramasamy. Storage estimation for multidimensional aggregates in the presence of hierarchies. In Proc. VLDB, pages 522--531, 1996.
- J. Vitter. Random Sampling with a Reservoir. ACM Trans. Math. Software, 11(1):37--57, 1985.
- K. Whang, B. T. Vander-Zanden, and H. M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Sys., 15:208--229, 1990.
Index Terms
- On synopses for distinct-value estimation under multiset operations