Abstract
A major research question in large databases is how to efficiently sample a random subset of records. Such a sample can then be used to estimate query results, optimize query execution plans, and support other tasks. To provide quick access to the data, the common practice is to create an index, which is often implemented as a B+Tree. Existing state-of-the-art algorithms for random sampling over B+Trees incur a significant performance overhead. This paper proposes novel approaches for efficient random sampling over B+Trees in very large databases. We analyze the algorithms’ correctness and present an extensive simulation study, which shows that they outperform previous work without affecting the quality of the random sample.
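For background only, the following is a minimal sketch of the classic acceptance/rejection idea for drawing a uniform record from a B+Tree-like index. It is not one of the algorithms proposed in this paper. The names `Node`, `try_sample`, `sample`, and `max_fanout` are hypothetical, and the sketch assumes that all leaves sit at the same depth and that no node holds more than `max_fanout` entries. Each walk picks a slot uniformly among `max_fanout` possible slots at every level and rejects the walk if the slot is empty, so every record is returned with the same probability regardless of how full its ancestors are.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy index node (hypothetical): internal nodes hold children, leaves hold records."""
    children: List["Node"] = field(default_factory=list)
    records: List[int] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        return not self.children


def try_sample(root: Node, max_fanout: int, rng: random.Random) -> Optional[int]:
    """Perform one root-to-leaf walk; return a record, or None if the walk is rejected."""
    node = root
    while not node.is_leaf:
        slot = rng.randrange(max_fanout)   # uniform over all possible slots
        if slot >= len(node.children):     # empty slot: reject the whole walk
            return None
        node = node.children[slot]
    slot = rng.randrange(max_fanout)
    if slot >= len(node.records):
        return None
    return node.records[slot]


def sample(root: Node, max_fanout: int, k: int,
           rng: Optional[random.Random] = None) -> List[int]:
    """Repeat walks until k records are accepted (sampling with replacement)."""
    rng = rng or random.Random()
    out: List[int] = []
    while len(out) < k:
        rec = try_sample(root, max_fanout, rng)
        if rec is not None:
            out.append(rec)
    return out
```

In a tree whose nodes are far from full, most walks are rejected, which illustrates one source of overhead in rejection-based sampling: every rejected walk still pays for the node accesses it made.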
Notes
- 1. The length of the path from the root to the leaves, with the root node counted as height one.
- 2. In addition to possible paths that will be visited more than once, which is negligible in large B+Trees.
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Cohen, I., Yehezkel, A., Yakhini, Z. (2024). Efficient Random Sampling from Very Large Databases. In: Strauss, C., Amagasa, T., Manco, G., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2024. Lecture Notes in Computer Science, vol 14910. Springer, Cham. https://doi.org/10.1007/978-3-031-68309-1_10
DOI: https://doi.org/10.1007/978-3-031-68309-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-68308-4
Online ISBN: 978-3-031-68309-1
eBook Packages: Computer Science, Computer Science (R0)