Sampling issues in parallel database systems

Seshadri, S.; Naughton, Jeffrey F.

doi:10.1007/BFb0032440


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 580)

Included in the following conference series:

International Conference on Extending Database Technology


Abstract

Sampling has proven useful in database systems in applications including query size estimation and, most recently, probabilistic parallel query evaluation algorithms. In order to apply the full power of modern multiprocessor database systems, sampling techniques must (1) distribute the sampling workload evenly among the processors in the system, and (2) make use of all the data on the pages brought into main memory during the course of the sampling. In this paper we show how to achieve these two goals by proving that for query size estimation, (1) stratified random sampling guarantees perfect load balancing without reducing the accuracy of the estimate, and that (2) for a given number of I/O operations, page level sampling always produces a more accurate estimate than tuple level sampling. For probabilistic parallel query evaluation algorithms, high performance requires tight bounds on the expected skew in the allocation of work to processors as a function of the number of samples. Toward this end we prove a new bound on this skew, and show that our new bound is better than previously known bounds.
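
To make the first claim concrete, the following is a minimal Python sketch of stratified random sampling for query size estimation. It is not the authors' algorithm or proof. It assumes each processor's partition is an in-memory list treated as one stratum, and the names `stratified_estimate`, `partitions`, `predicate`, and `samples_per_partition` are hypothetical. Every partition draws the same number of samples, so the sampling work is spread evenly, and each stratum's hit rate is scaled by that stratum's size.

```python
import random


def stratified_estimate(partitions, predicate, samples_per_partition, rng):
    """Estimate how many tuples across all partitions satisfy `predicate`.

    Each partition (one per processor, an assumption of this sketch) is a
    stratum. Every stratum contributes the same number of samples, so each
    processor does the same amount of sampling work.
    """
    estimate = 0.0
    for part in partitions:
        n = min(samples_per_partition, len(part))
        if n == 0:
            continue  # an empty partition contributes nothing
        sample = rng.sample(part, n)  # simple random sample, no replacement
        hits = sum(1 for t in sample if predicate(t))
        # Scale the observed hit rate up to the size of this stratum.
        estimate += len(part) * hits / n
    return estimate


if __name__ == "__main__":
    rng = random.Random(0)
    # Four hypothetical partitions of 100 integer "tuples" each.
    partitions = [list(range(i * 100, (i + 1) * 100)) for i in range(4)]
    print(stratified_estimate(partitions, lambda t: t % 4 == 0, 20, rng))
```

The sketch samples individual tuples. A page level variant would instead draw whole pages and evaluate the predicate on every tuple of each sampled page, which is the setting of the abstract's second claim.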

Work supported by NSF grant IRI-8909795 and by a gift of the IBM Corporation.

Work supported by NSF Presidential Young Investigator Award and by a gift of the IBM Corporation.



Author information

Authors and Affiliations

Department of Computer Sciences, University of Wisconsin, Madison, WI 53706, USA
S. Seshadri & Jeffrey F. Naughton


Editor information

Alain Pirotte, Claude Delobel, Georg Gottlob


About this paper

Cite this paper

Seshadri, S., Naughton, J.F. (1992). Sampling issues in parallel database systems. In: Pirotte, A., Delobel, C., Gottlob, G. (eds) Advances in Database Technology — EDBT '92. EDBT 1992. Lecture Notes in Computer Science, vol 580. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0032440


DOI: https://doi.org/10.1007/BFb0032440
Published: 26 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-55270-3
Online ISBN: 978-3-540-47003-8
eBook Packages: Springer Book Archive
