DOI: 10.1145/2588555.2612185

Parallel data analysis directly on scientific file formats

Published: 18 June 2014

Abstract

Scientific experiments and large-scale simulations produce massive amounts of data. Many of these scientific datasets are arrays, and are stored in file formats such as HDF5 and NetCDF. Although scientific data management systems, such as SciDB, are designed to manipulate arrays, there are challenges in integrating these systems into existing analysis workflows. Major barriers include the expensive task of preparing and loading data before querying, and converting the final results to a format that is understood by the existing post-processing and visualization tools. As a consequence, integrating a data management system into an existing scientific data analysis workflow is time-consuming and requires extensive user involvement. In this paper, we present the design of a new scientific data analysis system that efficiently processes queries directly over data stored in the HDF5 file format. This design choice eliminates the tedious and error-prone data loading process, and makes the query results readily available to the next processing steps of the analysis workflow. Our design leverages the increasing main memory capacities found in supercomputers through bitmap indexing and in-memory query execution. In addition, query processing over the HDF5 data format can be effortlessly parallelized to utilize the ample concurrency available in large-scale supercomputers and modern parallel file systems. We evaluate the performance of our system on a large supercomputing system and experiment with both a synthetic dataset and a real cosmology observation dataset. Our system frequently outperforms the relational database system that the cosmology team currently uses, and is more than 10X faster than Hive when processing data in parallel. Overall, by eliminating the data loading step, our query processing system is more effective in supporting in situ scientific analysis workflows.
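
The bitmap-indexing idea mentioned in the abstract can be illustrated with a small sketch. This is not the system described in the paper: the binning scheme, the function names (`build_bitmap_index`, `range_query`, `iter_set_bits`), and the plain Python list standing in for a one-dimensional HDF5 dataset are all assumptions made for illustration. The sketch keeps one bitmap per value bin, answers a range predicate from the bitmaps alone for bins that lie fully inside the range, and re-checks the raw values only for bins that straddle a range boundary.

```python
from bisect import bisect_right


def build_bitmap_index(values, bin_edges):
    """Build one bitmap per bin; bin i covers [bin_edges[i], bin_edges[i + 1]).

    Each bitmap is a Python int used as a bitset: bit p is set when
    values[p] falls into that bin. (Hypothetical helper, not from the paper.)
    """
    bitmaps = [0] * (len(bin_edges) - 1)
    for pos, v in enumerate(values):
        b = bisect_right(bin_edges, v) - 1
        if not 0 <= b < len(bitmaps):
            raise ValueError(f"value {v!r} lies outside the bin edges")
        bitmaps[b] |= 1 << pos
    return bitmaps


def iter_set_bits(bitmap):
    """Yield the positions of the set bits, lowest first."""
    while bitmap:
        low = bitmap & -bitmap          # isolate the lowest set bit
        yield low.bit_length() - 1
        bitmap ^= low                   # clear it


def range_query(values, bin_edges, bitmaps, lo, hi):
    """Return sorted positions p with lo <= values[p] < hi."""
    hits = 0        # positions known to match from the index alone
    candidates = 0  # positions in boundary bins that need a value check
    for b, bitmap in enumerate(bitmaps):
        left, right = bin_edges[b], bin_edges[b + 1]
        if right <= lo or left >= hi:
            continue                    # bin cannot overlap [lo, hi)
        if lo <= left and right <= hi:
            hits |= bitmap              # bin fully inside the range
        else:
            candidates |= bitmap        # bin straddles a range boundary
    result = [p for p in iter_set_bits(candidates) if lo <= values[p] < hi]
    result.extend(iter_set_bits(hits))
    return sorted(result)


# Illustrative data only; a real dataset would be read from an HDF5 file.
values = [0.5, 3.2, 7.9, 1.1, 5.0, 9.4, 2.8]
edges = [0, 2, 4, 6, 8, 10]
index = build_bitmap_index(values, edges)
print(range_query(values, edges, index, 2.5, 8.0))  # prints [1, 2, 4, 6]
```

Because each position belongs to exactly one bin, disjoint slices of the array can be indexed and queried independently, which is one reason this kind of index lends itself to parallel execution.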

Published In

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014
1645 pages
ISBN:9781450323765
DOI:10.1145/2588555

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. HDF5
  2. in situ
  3. parallel processing

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'14

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%
