ArrayStore: a storage manager for complex parallel array processing

Authors:

Magdalena Balazinska,

Daniel Wang

SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Pages 253 - 264

https://doi.org/10.1145/1989323.1989351

Published: 12 June 2011

Abstract

We present the design, implementation, and evaluation of ArrayStore, a new storage manager for complex, parallel array processing. ArrayStore builds on prior work in the area of multidimensional data storage, but considers the new problem of supporting a parallel and more varied workload comprising not only range queries, but also binary operations such as joins and complex user-defined functions.

This paper makes two key contributions. First, it examines several existing single-site storage management strategies and array partitioning strategies to identify which combination is best suited for the array-processing workload above. Second, it develops a new and efficient storage-management mechanism that enables parallel processing of operations that must access data from adjacent partitions.

We evaluate ArrayStore on over 80GB of real data from two scientific domains and real operators used in these domains. We show that ArrayStore outperforms previously proposed storage management strategies in the context of its diverse target workload.
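The abstract does not spell out how operations reach data in adjacent partitions. As a generic illustration only, and not ArrayStore's actual mechanism, the sketch below splits a one-dimensional array into regular chunks and pads each chunk with a few cells copied from its neighbours, so that a windowed operation can run on every chunk independently. The function names, the dictionary layout, and the `overlap` parameter are all hypothetical.

```python
def chunk_with_overlap(data, chunk_size, overlap):
    """Split `data` into regular chunks, each padded with up to `overlap`
    cells copied from its neighbours (hypothetical layout, for illustration)."""
    chunks = []
    for start in range(0, len(data), chunk_size):
        end = min(start + chunk_size, len(data))
        lo = max(0, start - overlap)          # first cell stored with this chunk
        hi = min(len(data), end + overlap)    # one past the last stored cell
        chunks.append({"core": (start, end), "offset": lo, "cells": data[lo:hi]})
    return chunks


def windowed_sum(chunk, radius, n):
    """For each core cell, sum the cells within `radius` of it, reading only
    this chunk's stored cells. Assumes overlap >= radius; `n` is the array length."""
    start, end = chunk["core"]
    off, cells = chunk["offset"], chunk["cells"]
    out = []
    for i in range(start, end):
        lo = max(0, i - radius) - off
        hi = min(n, i + radius + 1) - off
        out.append(sum(cells[lo:hi]))
    return out


def reference_windowed_sum(data, radius):
    """Same operation computed over the whole array, used as a check."""
    n = len(data)
    return [sum(data[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]


data = list(range(10))
chunks = chunk_with_overlap(data, chunk_size=4, overlap=1)
# Each chunk is processed on its own; results are concatenated in chunk order.
parallel_result = [v for c in chunks for v in windowed_sum(c, radius=1, n=len(data))]
```

In this sketch the per-chunk results match the whole-array computation as long as the overlap is at least as large as the window radius; the cost is that boundary cells are stored more than once.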


Index Terms

ArrayStore: a storage manager for complex parallel array processing
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information systems applications


Published In

SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

June 2011, 1364 pages

ISBN: 9781450306614

DOI: 10.1145/1989323

General Chair: Timos Sellis (IMIS/RC Athena)

Program Chair: Renée J. Miller (University of Toronto)

Publications Chairs: Anastasios Kementsietsidis (IBM T.J. Watson Research Center), Yannis Velegrakis (University of Trento)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

SIGMOD/PODS '11: International Conference on Management of Data

June 12 - 16, 2011

Athens, Greece

Sponsor: SIGMOD

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
