Efficient Random Sampling from Very Large Databases

Cohen, Idan; Yehezkel, Aviv; Yakhini, Zohar

doi:10.1007/978-3-031-68309-1_10

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14910)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

A major research question in large databases is how to efficiently sample a random subset of records. Such a sample can then be used to estimate query results, optimize query execution plans, and support other tasks. To provide quick access to the data, the common practice is to create an index, often implemented as a B+Tree. Existing state-of-the-art algorithms for random sampling over B+Trees incur significant performance overhead. This paper proposes novel approaches for efficient random sampling over B+Trees in very large databases. We analyze the algorithms’ correctness and present an extensive simulation study that shows their superior performance compared to previous work, without affecting the quality of the random sample.
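To make the problem setting concrete, the following is a minimal Python sketch of the classical acceptance/rejection idea for drawing a uniform record from a B+Tree-like index (the approach commonly attributed to Olken and Rotem). It is not the method proposed in this paper, whose details are not given in the abstract. The `Node` class, `build_toy_tree`, `try_sample`, `sample_one`, and the `max_fanout` parameter are all hypothetical names introduced for illustration, and the toy tree ignores real B+Tree occupancy rules. The sketch assumes all leaves are at the same depth and that `max_fanout` is at least as large as every node's child count and every leaf's record count.

```python
import random


class Node:
    """Toy index node (hypothetical): internal nodes hold children, leaves hold records."""

    def __init__(self, children=None, records=None):
        self.children = children or []
        self.records = records or []

    @property
    def is_leaf(self):
        return not self.children


def try_sample(root, max_fanout, rng):
    """One root-to-leaf attempt; returns a record, or None if the attempt is rejected.

    At each node with f entries we continue with probability f / max_fanout and
    then pick an entry uniformly (probability 1 / f), so every record is reached
    with the same probability (1 / max_fanout) ** (height of the tree).
    Records are assumed never to be None.
    """
    node = root
    while not node.is_leaf:
        if rng.random() >= len(node.children) / max_fanout:
            return None  # rejected: restart from the root
        node = rng.choice(node.children)
    if rng.random() >= len(node.records) / max_fanout:
        return None  # rejected at the leaf level
    return rng.choice(node.records)


def sample_one(root, max_fanout, rng):
    """Repeat attempts until one is accepted; the result is a uniform random record."""
    while True:
        record = try_sample(root, max_fanout, rng)
        if record is not None:
            return record


def build_toy_tree():
    """Small unbalanced-occupancy tree holding the records 1..12 (illustration only)."""
    left = Node(children=[Node(records=[1, 2, 3, 4]),
                          Node(records=[5, 6]),
                          Node(records=[7, 8, 9])])
    right = Node(children=[Node(records=[10]),
                           Node(records=[11, 12])])
    return Node(children=[left, right])


if __name__ == "__main__":
    rng = random.Random(0)
    tree = build_toy_tree()
    print([sample_one(tree, max_fanout=4, rng=rng) for _ in range(10)])
```

The rejected attempts are where the overhead of this baseline comes from: in the toy tree above only 12 of the 4³ = 64 equally likely outcomes are accepted, so most root-to-leaf walks are discarded. A naive walk without the rejection step would be cheaper but biased toward records in sparsely populated subtrees.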

Notes

1. Path’s length from the root to the leaves, with the root node being considered as height one.
2. In addition to possible paths that will be visited more than once, which is negligible in large B+Trees.
3. https://pypi.org/project/BTrees/.
4. https://github.com/idancohen88/efficient_sampling.

Author information

Authors and Affiliations

Reichman University, Herzliya, Israel
Idan Cohen, Aviv Yehezkel & Zohar Yakhini

Corresponding author

Correspondence to Idan Cohen.

Editor information

Editors and Affiliations

University of Vienna, Vienna, Austria
Christine Strauss
University of Tsukuba, Tsukuba, Japan
Toshiyuki Amagasa
National Research Council (CNR), Rende, Italy
Giuseppe Manco
Johannes Kepler University Linz, Linz, Austria
Gabriele Kotsis
Vienna University of Technology, Vienna, Austria
A Min Tjoa
Johannes Kepler University Linz, Linz, Austria
Ismail Khalil

Cite this paper

Cohen, I., Yehezkel, A., Yakhini, Z. (2024). Efficient Random Sampling from Very Large Databases. In: Strauss, C., Amagasa, T., Manco, G., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2024. Lecture Notes in Computer Science, vol 14910. Springer, Cham. https://doi.org/10.1007/978-3-031-68309-1_10

DOI: https://doi.org/10.1007/978-3-031-68309-1_10
Published: 18 August 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-68308-4
Online ISBN: 978-3-031-68309-1
eBook Packages: Computer Science, Computer Science (R0)
