
Sampling issues in parallel database systems

  • Conference paper
Advances in Database Technology — EDBT '92 (EDBT 1992)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 580)

Abstract

Sampling has proven useful in database systems, in applications including query size estimation and, most recently, probabilistic parallel query evaluation algorithms. In order to apply the full power of modern multiprocessor database systems, sampling techniques must (1) distribute the sampling workload evenly among the processors in the system, and (2) make use of all the data on the pages brought into main memory during the course of the sampling. In this paper we show how to achieve these two goals by proving that, for query size estimation, (1) stratified random sampling guarantees perfect load balancing without reducing the accuracy of the estimate, and (2) for a given number of I/O operations, page level sampling always produces a more accurate estimate than tuple level sampling. For probabilistic parallel query evaluation algorithms, high performance requires tight bounds on the expected skew in the allocation of work to processors as a function of the number of samples. Toward this end we prove a new bound on this skew, and show that our new bound is better than previously known bounds.
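
The two sampling ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm or proof; the function names `stratified_selectivity` and `page_level_selectivity`, the representation of partitions and pages as in-memory lists, and the assumption of equal-sized partitions and pages are all hypothetical choices made for illustration. The first function gives every processor's partition the same number of samples (so the sampling work is evenly divided), and the second uses every tuple on each sampled page rather than one tuple per page.

```python
import random


def stratified_selectivity(partitions, predicate, total_samples, rng):
    """Estimate the fraction of tuples satisfying `predicate`.

    Assumption: one partition per processor. Each partition draws the
    same number of samples, so every processor does equal sampling work.
    """
    per_partition = total_samples // len(partitions)
    total_tuples = sum(len(part) for part in partitions)
    estimate = 0.0
    for part in partitions:
        sample = rng.sample(part, per_partition)  # without replacement
        hit_rate = sum(1 for t in sample if predicate(t)) / per_partition
        # Weight each partition's hit rate by its share of the relation.
        estimate += hit_rate * len(part) / total_tuples
    return estimate


def page_level_selectivity(pages, predicate, num_pages, rng):
    """Estimate selectivity from every tuple on `num_pages` random pages.

    Assumption: pages hold equal numbers of tuples, so pooling the
    tuples of the sampled pages gives a simple hit-rate estimate.
    """
    chosen = rng.sample(pages, num_pages)
    tuples = [t for page in chosen for t in page]
    return sum(1 for t in tuples if predicate(t)) / len(tuples)


# Usage: a toy relation of 1000 integers, split two ways.
rng = random.Random(0)
relation = list(range(1000))
partitions = [relation[i:i + 250] for i in range(0, 1000, 250)]  # 4 processors
pages = [relation[i:i + 20] for i in range(0, 1000, 20)]         # 50 pages


def is_multiple_of_ten(t):
    return t % 10 == 0


stratified_estimate = stratified_selectivity(partitions, is_multiple_of_ten, 200, rng)
page_estimate = page_level_selectivity(pages, is_multiple_of_ten, 10, rng)
```

In this toy setup the true selectivity is 0.1. The sketch only shows how the work is divided and which tuples are inspected; the accuracy guarantees are the subject of the paper's proofs and are not demonstrated here.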

Work supported by NSF grant IRI-8909795 and by a gift of the IBM Corporation.

Work supported by an NSF Presidential Young Investigator Award and by a gift of the IBM Corporation.

Editor information

Alain Pirotte, Claude Delobel, Georg Gottlob

Copyright information

© 1992 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Seshadri, S., Naughton, J.F. (1992). Sampling issues in parallel database systems. In: Pirotte, A., Delobel, C., Gottlob, G. (eds) Advances in Database Technology — EDBT '92. EDBT 1992. Lecture Notes in Computer Science, vol 580. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0032440

  • DOI: https://doi.org/10.1007/BFb0032440

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-55270-3

  • Online ISBN: 978-3-540-47003-8

  • eBook Packages: Springer Book Archive
