DOI: 10.1145/3577193.3593734
research-article

DyVer: Dynamic Version Handling for Array Databases

Published: 21 June 2023

Abstract

Array databases are important data management systems for scientific applications. In array databases, version handling is an important problem due to the no-overwrite nature of scientific data. Existing work on optimizing data versioning in array databases is relatively simple: it focuses either on minimizing storage size or on improving simple version chains. In this paper, we address two challenges: (1) how to balance the tradeoff between storage size and query time across many data versions, which may be derived from one another; and (2) how to maintain this balance dynamically as new versions are continuously added. To address these challenges, this paper presents DyVer, a versioning framework for SciDB, one of the best-known array databases. DyVer comprises two techniques: an efficient storage layout optimizer that quickly reduces data query time under a storage capacity constraint, and a version segment technique that copes with dynamic version additions. We evaluate DyVer using real-world scientific datasets. Results show that DyVer achieves up to a 95% improvement in average query time compared to state-of-the-art data versioning techniques under the same storage capacity constraint.
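
To make the storage/query-time tradeoff concrete, the following is a minimal toy sketch and not DyVer's actual optimizer, which the abstract does not describe in detail. It assumes each version is stored either in full or as a delta against the version it was derived from, and that query time is proportional to the bytes read to rebuild a version. The `Version` class, the cost model, and the greedy rule are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Version:
    """Hypothetical version record (not a SciDB or DyVer type)."""
    name: str
    parent: Optional[str]  # version this one was derived from; None for a root
    full_size: int         # storage cost if stored in full
    delta_size: int        # storage cost if stored as a delta against parent


def recreation_cost(name: str, versions: Dict[str, Version],
                    materialized: Set[str]) -> int:
    """Bytes read to rebuild `name` (assumed proxy for query time)."""
    cost = 0
    v = versions[name]
    # Apply deltas upward until a fully stored ancestor is reached.
    while v.name not in materialized:
        cost += v.delta_size
        v = versions[v.parent]
    return cost + v.full_size


def total_storage(versions: Dict[str, Version], materialized: Set[str]) -> int:
    return sum(v.full_size if v.name in materialized else v.delta_size
               for v in versions.values())


def average_query_cost(versions: Dict[str, Version],
                       materialized: Set[str]) -> float:
    costs = [recreation_cost(n, versions, materialized) for n in versions]
    return sum(costs) / len(costs)


def greedy_layout(versions: Dict[str, Version], budget: int) -> Set[str]:
    """Pick versions to store in full without exceeding `budget`.

    Assumed heuristic: start from the smallest layout (only roots in full),
    then repeatedly materialize the version with the best reduction in
    average query cost per extra byte of storage.
    """
    materialized = {v.name for v in versions.values() if v.parent is None}
    while True:
        base_cost = average_query_cost(versions, materialized)
        base_storage = total_storage(versions, materialized)
        best, best_ratio = None, 0.0
        for v in versions.values():
            if v.name in materialized:
                continue
            extra = v.full_size - v.delta_size
            if base_storage + extra > budget:
                continue  # would violate the storage capacity constraint
            gain = base_cost - average_query_cost(versions,
                                                  materialized | {v.name})
            ratio = gain / extra if extra > 0 else float("inf")
            if gain > 0 and ratio > best_ratio:
                best, best_ratio = v.name, ratio
        if best is None:
            return materialized
        materialized.add(best)


# A simple chain v1 -> v2 -> v3 -> v4 with made-up sizes.
chain = {
    "v1": Version("v1", None, 100, 0),
    "v2": Version("v2", "v1", 100, 10),
    "v3": Version("v3", "v2", 100, 10),
    "v4": Version("v4", "v3", 100, 10),
}
layout = greedy_layout(chain, budget=220)
print(sorted(layout), total_storage(chain, layout),
      average_query_cost(chain, layout))
```

In this toy chain the budget allows only one extra full copy, and the heuristic places it mid-chain because that shortens the delta path for the most versions. A real optimizer would also have to handle branching derivation graphs and newly arriving versions, which is the part the abstract attributes to the version segment technique.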


Published In

ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing
June 2023
505 pages
ISBN:9798400700569
DOI:10.1145/3577193

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. scientific data management
  2. array database
  3. versioning

Qualifiers

  • Research-article

Conference

ICS '23

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Total Citations: 0
  • Total Downloads: 136
  • Downloads (Last 12 months): 75
  • Downloads (Last 6 weeks): 13

Reflects downloads up to 26 Jan 2025
