DOI: 10.1145/1247480.1247504
Article

On synopses for distinct-value estimation under multiset operations

Published: 11 June 2007

ABSTRACT

The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques that are designed for use within a flexible and scalable "synopsis warehouse" architecture. In this setting, incoming data is split into partitions and a synopsis is created for each partition; each synopsis can then be used to quickly estimate the number of DVs in its corresponding partition. By combining and extending a number of results in the literature, we obtain both appropriate synopses and novel DV estimators to use in conjunction with these synopses. Our synopses can be created in parallel, and can then be easily combined to yield synopses and DV estimates for arbitrary unions, intersections or differences of partitions. Our synopses can also handle deletions of individual partition elements. We use the theory of order statistics to show that our DV estimators are unbiased, and to establish moment formulas and sharp error bounds. Based on a novel limit theorem, we can exploit results due to Cohen in order to select synopsis sizes when initially designing the warehouse. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.
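The abstract does not spell out the synopsis construction, so the following is only a minimal sketch of the general idea, assuming a k-minimum-values style synopsis: hash each value to a pseudo-uniform number in (0, 1], keep the k smallest distinct hashes per partition, and estimate the DV count from the k-th smallest hash, an order statistic. The helper `h01`, the function names, the choice of SHA-256, and the `(k - 1) / U_(k)` estimator form are assumptions made for illustration, not details taken from the text above.

```python
import hashlib


def h01(value):
    """Hypothetical helper: hash a value to a pseudo-uniform number in (0, 1]."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def kmv_synopsis(values, k):
    """Synopsis of one partition: its k smallest distinct hash values, sorted.

    Duplicates hash to the same number, so they collapse in the set.
    (A streaming version would keep a bounded heap instead of sorting.)
    """
    return sorted({h01(v) for v in values})[:k]


def union_synopsis(s1, s2, k):
    """Combine two size-k synopses into the size-k synopsis of the union."""
    return sorted(set(s1) | set(s2))[:k]


def estimate_dv(synopsis, k):
    """Estimate the number of distinct values from a synopsis.

    With fewer than k hashes the synopsis holds every distinct value,
    so the count is exact; otherwise use (k - 1) / U_(k), where U_(k)
    is the k-th smallest hash value.
    """
    if len(synopsis) < k:
        return float(len(synopsis))
    return (k - 1) / synopsis[k - 1]


if __name__ == "__main__":
    k = 256
    # Two overlapping partitions, built independently (e.g. in parallel).
    part_a = kmv_synopsis(range(0, 6000), k)
    part_b = kmv_synopsis(range(4000, 10000), k)
    merged = union_synopsis(part_a, part_b, k)
    print("estimated DVs in union:", round(estimate_dv(merged, k)))
```

Because each synopsis depends only on the hash values in its own partition, partitions can be summarized independently and merged later, which matches the parallel-creation and union properties the abstract describes. Intersections, differences, and deletions need extra bookkeeping that this sketch leaves out.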

References

  1. N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. Sys. Sci., 58:137--147, 1999.
  2. M. Astrahan, M. Schkolnick, and K. Whang. Approximating the number of unique values of an attribute without sorting. Inf. Sys., 12:11--15, 1987.
  3. Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. RANDOM, pages 1--10, 2002.
  4. P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. Toward automated large-scale information integration and discovery. In Data Management in a Connected World, pages 161--180. Springer, 2005.
  5. P. G. Brown and P. J. Haas. Techniques for warehousing of sample data. In Proc. ICDE, 2006.
  6. M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. In Proc. ACM PODS, pages 268--279, 2000.
  7. E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Sys. Sci., 55:441--453, 1997.
  8. T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In Proc. ACM SIGMOD, pages 240--251, 2002.
  9. H. A. David and H. N. Nagaraja. Order Statistics. Wiley, third edition, 2003.
  10. A. R. Didonato and A. H. Morris, Jr. Algorithm 708: significant digit computation of the incomplete beta function ratios. ACM Trans. Math. Software, 18(3):360--373, 1992.
  11. M. Durand and P. Flajolet. Loglog counting of large cardinalities. In Proc. 11th Eur. Symp. Algorithms (ESA 2003), volume 2832 of Lecture Notes in Computer Science. Springer, 2003.
  12. C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high speed links. In Proc. SIGCOMM '02, pages 323--336, 2002.
  13. P. Flajolet. Adaptive sampling. In M. Hazewinkel, editor, Encyclopaedia of Mathematics, Supplement I. Kluwer, 1997.
  14. P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Sys. Sci., 31:182--209, 1985.
  15. S. Ganguly, M. Garofalakis, and R. Rastogi. Tracking set-expression cardinalities over continuous update streams. VLDB J., 13:354--369, 2004.
  16. P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proc. VLDB, pages 541--550, 2001.
  17. P. B. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. ACM Symp. Parallel Algorithms and Architecture, pages 281--291, 2001.
  18. F. Giroire. Order statistics and estimating cardinalities of massive data sets. In Proc. Intl. Conf. Analysis Algorithms, pages 157--166, 2005.
  19. P. J. Haas, Y. Liu, and L. Stokes. An estimator of the number of species from quadrat sampling. Biometrics, 62:135--141, 2006.
  20. P. J. Haas and L. Stokes. Estimating the number of classes in a finite population. J. Amer. Statist. Assoc., 93:1475--1487, 1998.
  21. P. Hellekalek and S. Wegenkittl. Empirical evidence concerning AES. ACM Trans. Modeling Comput. Simulation, 13:322--333, 2003.
  22. Y. E. Ioannidis. The history of histograms (abridged). In Proc. VLDB, pages 19--30, 2003.
  23. N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions-2. Wiley, 2nd edition, 1995.
  24. S. Karlin and H. M. Taylor. A Second Course in Stochastic Processes. Academic Press, 1981.
  25. D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, 1973.
  26. M. Matsumoto and T. Nishimura. Mersenne twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. Modeling Comput. Simulation, 8(1):3--30, 1998.
  27. R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
  28. S. Padmanabhan, B. Bhattacharjee, T. Malkemus, L. Cranston, and M. Huras. Multi-dimensional clustering: a new data layout scheme in DB2. In Proc. ACM SIGMOD, pages 637--641, 2003.
  29. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proc. ACM SIGMOD, pages 23--34, 1979.
  30. R. J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
  31. A. Shukla, P. Deshpande, J. F. Naughton, and K. Ramasamy. Storage estimation for multidimensional aggregates in the presence of hierarchies. In Proc. VLDB, pages 522--531, 1996.
  32. J. Vitter. Random Sampling with a Reservoir. ACM Trans. Math. Software, 11(1):37--57, 1985.
  33. K. Whang, B. T. Vander-Zanden, and H. M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Sys., 15:208--229, 1990.

    • Published in

      SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
      June 2007
      1210 pages
      ISBN: 9781595936868
      DOI: 10.1145/1247480
      General Chairs: Lizhu Zhou, Tok Wang Ling
      Program Chair: Beng Chin Ooi

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 785 of 4,003 submissions, 20%
