Abstract
Sampling has proven useful in database systems in applications including query size estimation and, most recently, probabilistic parallel query evaluation algorithms. In order to apply the full power of modern multiprocessor database systems, sampling techniques must (1) distribute the sampling workload evenly among the processors in the system, and (2) make use of all the data on the pages brought into main memory during the course of the sampling. In this paper we show how to achieve these two goals by proving that for query size estimation, (1) stratified random sampling guarantees perfect load balancing without reducing the accuracy of the estimate, and that (2) for a given number of I/O operations, page level sampling always produces a more accurate estimate than tuple level sampling. For probabilistic parallel query evaluation algorithms, high performance requires tight bounds on the expected skew in the allocation of work to processors as a function of the number of samples. Toward this end we prove a new bound on this skew, and show that our new bound is better than previously known bounds.
Work supported by NSF grant IRI-8909795 and by a gift of the IBM Corporation.
Work supported by NSF Presidential Young Investigator Award and by a gift of the IBM Corporation.
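As an illustration of the stratified sampling idea in the abstract, the following is a minimal sketch, not the authors' algorithm. It assumes the relation is already split into one in-memory partition per processor, and the names `stratified_size_estimate`, `partitions`, `predicate`, and `samples_per_partition` are hypothetical. Each partition draws the same number of samples, so every processor does the same amount of sampling work, and each partition's hit fraction is scaled by that partition's size.

```python
import random


def stratified_size_estimate(partitions, predicate, samples_per_partition, seed=0):
    """Estimate how many tuples satisfy `predicate`.

    Assumption: `partitions` is a list of lists, one list per processor.
    Every partition contributes the same number of samples, so the
    sampling work is spread evenly across processors.
    """
    rng = random.Random(seed)
    estimate = 0.0
    for part in partitions:
        if not part:
            continue  # an empty partition contributes nothing
        n = min(samples_per_partition, len(part))
        sample = rng.sample(part, n)  # without replacement within the stratum
        hits = sum(1 for t in sample if predicate(t))
        # Scale the sampled hit fraction up to the size of this stratum.
        estimate += len(part) * hits / n
    return estimate


if __name__ == "__main__":
    # Four hypothetical partitions of 1000 integers each.
    parts = [list(range(i * 1000, (i + 1) * 1000)) for i in range(4)]
    print(stratified_size_estimate(parts, lambda t: t % 2 == 0, 50))
```

In this sketch the per-partition loop stands in for work that each processor would perform locally; only the per-partition counts would need to be combined.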
© 1992 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Seshadri, S., Naughton, J.F. (1992). Sampling issues in parallel database systems. In: Pirotte, A., Delobel, C., Gottlob, G. (eds) Advances in Database Technology — EDBT '92. EDBT 1992. Lecture Notes in Computer Science, vol 580. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0032440
DOI: https://doi.org/10.1007/BFb0032440
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-55270-3
Online ISBN: 978-3-540-47003-8
eBook Packages: Springer Book Archive