Efficient Random Sampling from Very Large Databases

  • Conference paper
Database and Expert Systems Applications (DEXA 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14910)

Abstract

A central research question for large databases is how to efficiently draw a random sample of records. Such a sample can then be used to estimate query results, optimize query execution plans, and support other tasks. To provide quick access to the data, the common practice is to build an index, often implemented as a B+Tree. Existing state-of-the-art algorithms for random sampling over B+Trees incur significant performance overhead. This paper proposes novel approaches for efficient random sampling over B+Trees in very large databases. We analyze the algorithms’ correctness and present an extensive simulation study, which shows that they outperform previous work without affecting the quality of the random sample.
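To make the sampling problem concrete, the following is a minimal sketch of the classical acceptance/rejection descent for drawing a uniform sample from a B+Tree. It is not one of the algorithms proposed in this paper, whose details are not given in the abstract. The names `Node`, `build_toy_tree`, `sample_one`, and `max_fanout` are hypothetical, and the tree is a toy in-memory structure rather than a real index. Each walk picks one of `max_fanout` slots per level and restarts when the slot is empty, so every record is reached with the same probability; the restarts are one source of the overhead that this kind of baseline can incur.

```python
import random
from collections import Counter


class Node:
    """Toy B+Tree node: internal nodes hold children, leaves hold records."""

    def __init__(self, children=None, records=None):
        self.children = children or []
        self.records = records or []

    @property
    def is_leaf(self):
        return not self.children


def build_toy_tree():
    """Build a small, deliberately unbalanced-fanout tree with records 1..13."""
    left = Node(children=[Node(records=[1, 2, 3]),
                          Node(records=[4]),
                          Node(records=[5, 6, 7, 8])])
    right = Node(children=[Node(records=[9, 10]),
                           Node(records=[11, 12, 13])])
    return Node(children=[left, right])


def sample_one(root, max_fanout, rng):
    """Return one record drawn uniformly at random.

    Assumes all leaves are at the same depth and that no node holds more
    than max_fanout children or records. A walk that lands on an empty
    slot is rejected and restarted from the root.
    """
    while True:
        node = root
        rejected = False
        while not node.is_leaf:
            slot = rng.randrange(max_fanout)
            if slot >= len(node.children):
                rejected = True  # empty slot: abandon this walk
                break
            node = node.children[slot]
        if rejected:
            continue
        slot = rng.randrange(max_fanout)
        if slot < len(node.records):
            return node.records[slot]


if __name__ == "__main__":
    rng = random.Random(42)
    tree = build_toy_tree()
    counts = Counter(sample_one(tree, 4, rng) for _ in range(13000))
    for record in sorted(counts):
        print(record, counts[record])
```

In this toy tree, each walk succeeds with probability 13/64, since there are 13 records and 4 slots at each of the 3 levels. Sparsely filled nodes therefore translate directly into wasted root-to-leaf traversals.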

Notes

  1. The length of the path from the root to the leaves, with the root node considered to be at height one.

  2. In addition to possible paths that will be visited more than once, which is negligible in large B+Trees.

  3. https://pypi.org/project/BTrees/

  4. https://github.com/idancohen88/efficient_sampling

Author information

Corresponding author

Correspondence to Idan Cohen.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Cohen, I., Yehezkel, A., Yakhini, Z. (2024). Efficient Random Sampling from Very Large Databases. In: Strauss, C., Amagasa, T., Manco, G., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2024. Lecture Notes in Computer Science, vol 14910. Springer, Cham. https://doi.org/10.1007/978-3-031-68309-1_10

  • DOI: https://doi.org/10.1007/978-3-031-68309-1_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-68308-4

  • Online ISBN: 978-3-031-68309-1

  • eBook Packages: Computer Science, Computer Science (R0)
