Parallel data analysis directly on scientific file formats

Authors:

Arie Shoshani

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 385 - 396

https://doi.org/10.1145/2588555.2612185

Published: 18 June 2014

Abstract

Scientific experiments and large-scale simulations produce massive amounts of data. Many of these scientific datasets are arrays, and are stored in file formats such as HDF5 and NetCDF. Although scientific data management systems, such as SciDB, are designed to manipulate arrays, there are challenges in integrating these systems into existing analysis workflows. Major barriers include the expensive task of preparing and loading data before querying, and converting the final results to a format that is understood by the existing post-processing and visualization tools. As a consequence, integrating a data management system into an existing scientific data analysis workflow is time-consuming and requires extensive user involvement. In this paper, we present the design of a new scientific data analysis system that efficiently processes queries directly over data stored in the HDF5 file format. This design choice eliminates the tedious and error-prone data loading process, and makes the query results readily available to the next processing steps of the analysis workflow. Our design leverages the increasing main memory capacities found in supercomputers through bitmap indexing and in-memory query execution. In addition, query processing over the HDF5 data format can be effortlessly parallelized to utilize the ample concurrency available in large-scale supercomputers and modern parallel file systems. We evaluate the performance of our system on a large supercomputing system and experiment with both a synthetic dataset and a real cosmology observation dataset. Our system frequently outperforms the relational database system that the cosmology team currently uses, and is more than 10X faster than Hive when processing data in parallel. Overall, by eliminating the data loading step, our query processing system is more effective in supporting in situ scientific analysis workflows.
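
The abstract names bitmap indexing and in-memory query execution but does not describe the index layout. As a rough illustration only, the following Python sketch builds a generic binned bitmap index over a one-dimensional array and answers a `value >= threshold` query with it. The function names (`build_bitmap_index`, `query_at_least`, `positions`), the bin edges, and the use of Python integers as uncompressed bitmaps are all assumptions made for this example; they are not taken from the paper, and the system described there may work quite differently.

```python
from bisect import bisect_right


def positions(bitmap):
    """Yield the positions of the set bits in an integer bitmap, lowest first."""
    i = 0
    while bitmap:
        if bitmap & 1:
            yield i
        bitmap >>= 1
        i += 1


def build_bitmap_index(values, bin_edges):
    """Build one bitmap per bin; bit i is set when values[i] falls in that bin.

    Bin 0 holds v < bin_edges[0]; bin k holds bin_edges[k-1] <= v < bin_edges[k];
    the last bin holds v >= bin_edges[-1]. (Hypothetical layout for illustration.)
    """
    bitmaps = [0] * (len(bin_edges) + 1)
    for i, v in enumerate(values):
        bitmaps[bisect_right(bin_edges, v)] |= 1 << i
    return bitmaps


def query_at_least(values, bin_edges, bitmaps, threshold):
    """Return the sorted positions i where values[i] >= threshold."""
    boundary = bisect_right(bin_edges, threshold)
    # Every bin above the boundary bin qualifies entirely: OR the bitmaps together.
    hits = 0
    for bitmap in bitmaps[boundary + 1:]:
        hits |= bitmap
    result = list(positions(hits))
    # The boundary bin only yields candidates; check those against the raw values.
    result += [i for i in positions(bitmaps[boundary]) if values[i] >= threshold]
    return sorted(result)


values = [0.5, 3.2, 7.9, 1.1, 9.4, 5.0]
edges = [2.0, 4.0, 6.0, 8.0]
index = build_bitmap_index(values, edges)
print(query_at_least(values, edges, index, 5.0))  # [2, 4, 5]
```

Only the boundary bin requires reading raw values, which is why an index of this kind can reduce how much of the underlying array a selective query touches. In a parallel setting, each process could in principle build and query such an index over its own partition of the array.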

Index Terms

Parallel data analysis directly on scientific file formats
1. Human-centered computing
  1. Visualization
    1. Visualization application domains
      1. Scientific visualization
2. Information systems
  1. Information systems applications

Published In

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

June 2014

1645 pages

ISBN:9781450323765

DOI:10.1145/2588555

General Chairs: Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)

Program Chair: M. Tamer Özsu (University of Waterloo, Canada)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

U.S. Department of Energy

Conference

SIGMOD/PODS'14

Sponsor:

SIGMOD

SIGMOD/PODS'14: International Conference on Management of Data

June 22 - 27, 2014

Snowbird, Utah, USA

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
