Efficient sampling strategies for relational database operations

https://doi.org/10.1016/0304-3975(93)90224-H
Under an Elsevier user license
open archive

Abstract

We recently proposed an adaptive, random-sampling algorithm for general query size estimation in databases. In earlier work we analyzed the asymptotic efficiency and accuracy of the algorithm; in this paper we investigate its practicality as applied to the relational database operations select, project, and join. We extend our previous analysis to provide significantly improved bounds on the amount of sampling necessary for a given level of accuracy. We also provide “sanity bounds” to deal with queries for which the underlying data are extremely skewed or the query result is very small. We investigate how existing indices can be used to build more efficient sampling algorithms for the project and join operations. Finally, we report on the performance of the estimation algorithm, both as implemented in “stand-alone” C programs and as implemented in a host language on a commercial relational system.
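To make the idea of adaptive sampling with a sanity bound concrete, the following is a minimal sketch for the select operation only. It is an illustration under stated assumptions, not the algorithm or the bounds analyzed in the paper: the function name `estimate_selection_size`, the parameters `hit_threshold` and `max_samples`, and their default values are all hypothetical. Sampling stops either when enough qualifying tuples have been seen (the adaptive part) or when a fixed cap on samples is reached (standing in for a sanity bound on very small results).

```python
import random


def estimate_selection_size(rows, predicate, hit_threshold=50,
                            max_samples=2000, rng=None):
    """Estimate how many rows satisfy `predicate` by random sampling.

    Illustrative sketch only: `hit_threshold` and `max_samples` are
    hypothetical parameters, not the bounds derived in the paper.
    Returns a pair (estimated_size, samples_taken).
    """
    if not rows:
        return 0.0, 0
    rng = rng or random.Random()
    hits = 0
    samples = 0
    # Adaptive stopping rule: keep sampling until enough qualifying
    # tuples have been seen, or until the cap on samples is reached.
    while hits < hit_threshold and samples < max_samples:
        row = rows[rng.randrange(len(rows))]  # uniform, with replacement
        samples += 1
        if predicate(row):
            hits += 1
    # Scale the observed hit fraction up to the full relation.
    return len(rows) * hits / samples, samples


if __name__ == "__main__":
    table = list(range(10_000))
    estimate, used = estimate_selection_size(
        table, lambda x: x % 4 == 0, rng=random.Random(1))
    print(f"estimated size: {estimate:.0f} from {used} samples")
```

In this sketch a selective predicate exhausts the sample cap and yields a coarse estimate, while a predicate that matches often stops early. That trade-off is the reason a separate bound for very small results is useful.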


Supported by DARPA and ONR contracts N00014-85-C-0456 and N00014-85-K-0465, and by NSF Cooperative Agreement DCR-8420948.

∗∗ Supported by NSF grant IRI-8909795.

∗∗∗ Supported by a DARPA/NASA Graduate Research Assistantship. Current address: HP Labs, Palo Alto, CA.

Supported by NSF grant IRI-8909795 and a grant of the Wisconsin Alumni Research Foundation.