Article

On synopses for distinct-value estimation under multiset operations

Authors:
Kevin Beyer, IBM Almaden Research Center, San Jose, CA
Peter J. Haas, IBM Almaden Research Center, San Jose, CA
Berthold Reinwald, IBM Almaden Research Center, San Jose, CA
Yannis Sismanis, IBM Almaden Research Center, San Jose, CA
Rainer Gemulla, Technische Universität Dresden, Dresden, Germany

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, June 2007, Pages 199–210, https://doi.org/10.1145/1247480.1247504

Published: 11 June 2007

ABSTRACT

The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques that are designed for use within a flexible and scalable "synopsis warehouse" architecture. In this setting, incoming data is split into partitions and a synopsis is created for each partition; each synopsis can then be used to quickly estimate the number of DVs in its corresponding partition. By combining and extending a number of results in the literature, we obtain both appropriate synopses and novel DV estimators to use in conjunction with these synopses. Our synopses can be created in parallel, and can then be easily combined to yield synopses and DV estimates for arbitrary unions, intersections or differences of partitions. Our synopses can also handle deletions of individual partition elements. We use the theory of order statistics to show that our DV estimators are unbiased, and to establish moment formulas and sharp error bounds. Based on a novel limit theorem, we can exploit results due to Cohen in order to select synopsis sizes when initially designing the warehouse. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.
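The abstract does not spell out how a synopsis is built, so the following is only a minimal sketch under an assumption: a k-minimum-values (KMV) style synopsis that keeps the k smallest hash values of a partition. The helper names (`unit_hash`, `kmv_synopsis`, `estimate_dv`, `union_synopsis`), the choice of SHA-256 as the hash function, and the use of a single shared k are illustrative choices, not taken from the paper. The estimator uses the (k - 1)/U_(k) form, where U_(k) is the k-th smallest hash value scaled to the unit interval.

```python
import hashlib


def unit_hash(value):
    """Map a value to a pseudo-uniform number in [0, 1).

    Assumption: SHA-256 stands in for the hash function; any hash that
    spreads values roughly uniformly would serve the same purpose.
    """
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_synopsis(values, k):
    """Return the k smallest distinct hash values of a partition, sorted.

    Written for clarity: it materializes all distinct hashes first,
    whereas a streaming version would keep only k values at a time.
    """
    return sorted({unit_hash(v) for v in values})[:k]


def estimate_dv(synopsis, k):
    """Estimate the number of distinct values from a synopsis.

    If fewer than k hashes were collected, every distinct value is
    represented and the count is exact. Otherwise use (k - 1) / U_(k).
    """
    if len(synopsis) < k:
        return float(len(synopsis))
    return (k - 1) / synopsis[k - 1]


def union_synopsis(a, b, k):
    """Build the synopsis of a union of partitions from two synopses.

    The k smallest hashes of the union are always among the k smallest
    hashes of each side, so the raw data is not needed again.
    """
    return sorted(set(a) | set(b))[:k]
```

A typical use would build one synopsis per partition (possibly in parallel), then call `union_synopsis` on two of them and pass the result to `estimate_dv`. Duplicates within or across partitions hash to the same value, which is why the merged synopsis matches the one that would have been built from the combined data.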

References

N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. Sys. Sci., 58:137--147, 1999.
M. Astrahan, M. Schkolnick, and K. Whang. Approximating the number of unique values of an attribute without sorting. Inf. Sys., 12:11--15, 1987.
Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. RANDOM, pages 1--10, 2002.
P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. Toward automated large-scale information integration and discovery. In Data Management in a Connected World, pages 161--180. Springer, 2005.
P. G. Brown and P. J. Haas. Techniques for warehousing of sample data. In Proc. ICDE, 2006.
M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. In Proc. ACM PODS, pages 268--279, 2000.
E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Sys. Sci., 55:441--453, 1997.
T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In Proc. ACM SIGMOD, pages 240--251, 2002.
H. A. David and H. N. Nagaraja. Order Statistics. Wiley, third edition, 2003.
A. R. Didonato and A. H. Morris, Jr. Algorithm 708: significant digit computation of the incomplete beta function ratios. ACM Trans. Math. Software, 18(3):360--373, 1992.
M. Durand and P. Flajolet. Loglog counting of large cardinalities. In Proc. 11th Eur. Symp. Algorithms (ESA 2003), volume 2832 of Lecture Notes in Computer Science. Springer, 2003.
C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high speed links. In Proc. SIGCOMM '02, pages 323--336, 2002.
P. Flajolet. Adaptive sampling. In M. Hazewinkel, editor, Encyclopaedia of Mathematics, Supplement I. Kluwer, 1997.
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Computer Sys. Sci., 31:182--209, 1985.
S. Ganguly, M. Garofalakis, and R. Rastogi. Tracking set-expression cardinalities over continuous update streams. VLDB J., 13:354--369, 2004.
P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proc. VLDB, pages 541--550, 2001.
P. B. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. ACM Symp. Parallel Algorithms and Architecture, pages 281--291, 2001.
F. Giroire. Order statistics and estimating cardinalities of massive data sets. In Proc. Intl. Conf. Analysis Algorithms, pages 157--166, 2005.
P. J. Haas, Y. Liu, and L. Stokes. An estimator of the number of species from quadrat sampling. Biometrics, 62:135--141, 2006.
P. J. Haas and L. Stokes. Estimating the number of classes in a finite population. J. Amer. Statist. Assoc., 93:1475--1487, 1998.
P. Hellekalek and S. Wegenkittl. Empirical evidence concerning AES. ACM Trans. Modelling Comput. Simulation, 13:322--333, 2003.
Y. E. Ioannidis. The history of histograms (abridged). In Proc. VLDB, pages 19--30, 2003.
N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions-2. Wiley, 2nd edition, 1995.
S. Karlin and H. M. Taylor. A Second Course in Stochastic Processes. Academic Press, 1981.
D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, 1973.
M. Matsumoto and T. Nishimura. Mersenne twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. Modeling Computer Simulation, 8(1):3--30, 1998.
R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
S. Padmanabhan, B. Bhattacharjee, T. Malkemus, L. Cranston, and M. Huras. Multi-dimensional clustering: a new data layout scheme in DB2. In Proc. ACM SIGMOD, pages 637--641, 2003.
P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proc. ACM SIGMOD, pages 23--34, 1979.
R. J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
A. Shukla, P. Deshpande, J. F. Naughton, and K. Ramasamy. Storage estimation for multidimensional aggregates in the presence of hierarchies. In Proc. VLDB, pages 522--531, 1996.
J. Vitter. Random sampling with a reservoir. ACM Trans. Math. Software, 11(1):37--57, 1985.
K. Whang, B. T. Vander-Zanden, and H. M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Sys., 15:208--229, 1990.

Index Terms

On synopses for distinct-value estimation under multiset operations
1. Information systems
  1. Data management systems

Published in
SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
June 2007
1210 pages
ISBN: 9781595936868
DOI: 10.1145/1247480
General Chairs: Lizhu Zhou (Tsinghua University, China), Tok Wang Ling (National University of Singapore, Singapore)
Program Chair: Beng Chin Ooi (National University of Singapore, Singapore)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 11 June 2007
Author Tags
distinct-value estimation
synopsis warehouse
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%

Article Metrics
- Total Citations: 153
- Total Downloads: 288
- Downloads (last 12 months): 78
- Downloads (last 6 weeks): 13