ABSTRACT
A major bottleneck in implementing sampling as a primitive relational operation is the inefficiency of sampling the output of a query. It is not even known whether it is possible to generate a sample of a join tree without first evaluating the join tree completely. We undertake a detailed study of this problem and analyze it in a variety of settings. We present theoretical results that explain the difficulty of this problem and set limits on the efficiency that can be achieved. Based on new insights into the interaction between join and sampling, we develop join sampling techniques for the settings where our negative results do not apply. Our new sampling algorithms are significantly more efficient than those known earlier. We present an experimental evaluation of our techniques on Microsoft's SQL Server 7.0.
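To make the baseline concrete, the following is a minimal sketch of the naive strategy the abstract refers to: evaluate the join completely, then sample its output. It is not one of the paper's algorithms. The function names (`full_join`, `naive_join_sample`), the in-memory list-of-tuples representation, and the toy relations are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def full_join(r1, r2):
    """Equi-join two lists of (key, payload) tuples on the key (hypothetical helper)."""
    index = defaultdict(list)
    for key, payload in r2:
        index[key].append(payload)
    # Materialize every matching pair; this is the cost the naive approach pays.
    return [(key, p1, p2) for key, p1 in r1 for p2 in index.get(key, [])]


def naive_join_sample(r1, r2, k, rng):
    """Draw a uniform sample (without replacement) of up to k join tuples
    by first evaluating the whole join."""
    joined = full_join(r1, r2)
    return rng.sample(joined, min(k, len(joined)))


# Toy relations, assumed for illustration only.
r1 = [(1, "a"), (1, "b"), (2, "c")]
r2 = [(1, "x"), (2, "y"), (2, "z"), (3, "w")]

joined = full_join(r1, r2)
sample = naive_join_sample(r1, r2, 2, random.Random(0))
```

The sample is uniform over the join output, but the work is proportional to the size of the full join regardless of how small `k` is, which is the inefficiency the abstract describes.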