Research article. DOI: 10.1145/1989323.1989351

ArrayStore: a storage manager for complex parallel array processing

Published: 12 June 2011

Abstract

We present the design, implementation, and evaluation of ArrayStore, a new storage manager for complex, parallel array processing. ArrayStore builds on prior work on multidimensional data storage but addresses the new problem of supporting a parallel and more varied workload that comprises not only range queries but also binary operations, such as joins, and complex user-defined functions.
This paper makes two key contributions. First, it examines several existing single-site storage management strategies and array partitioning strategies to identify which combination is best suited for the array-processing workload above. Second, it develops a new and efficient storage-management mechanism that enables parallel processing of operations that must access data from adjacent partitions.
We evaluate ArrayStore on over 80 GB of real data from two scientific domains, using real operators from these domains. We show that ArrayStore outperforms previously proposed storage management strategies in the context of its diverse target workload.
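
To make the idea of operations that "must access data from adjacent partitions" concrete, the following is a minimal sketch, not ArrayStore's actual mechanism. It assumes a 2-D array split into regular square chunks, with each chunk stored together with a fixed-width border of cells copied from its neighbors, so that a neighborhood operator can process every chunk independently. All names here (`chunk_bounds`, `partition_with_overlap`, `neighbor_sum_chunk`) and the 3x3 neighborhood sum are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]
Bounds = Tuple[int, int, int, int]  # (row_start, row_stop, col_start, col_stop)


def chunk_bounds(size: int, chunk: int) -> List[Tuple[int, int]]:
    """Split [0, size) into consecutive [start, stop) intervals of length `chunk`."""
    return [(s, min(s + chunk, size)) for s in range(0, size, chunk)]


def partition_with_overlap(grid: Grid, chunk: int, overlap: int) -> List[Tuple[Bounds, Bounds, Grid]]:
    """Return (core_bounds, padded_bounds, padded_data) for each regular chunk.

    The padded region is the core chunk extended by `overlap` cells on every
    side and clipped to the array boundary (an assumed, illustrative layout).
    """
    rows, cols = len(grid), len(grid[0])
    parts = []
    for r0, r1 in chunk_bounds(rows, chunk):
        for c0, c1 in chunk_bounds(cols, chunk):
            pr0, pr1 = max(r0 - overlap, 0), min(r1 + overlap, rows)
            pc0, pc1 = max(c0 - overlap, 0), min(c1 + overlap, cols)
            data = [row[pc0:pc1] for row in grid[pr0:pr1]]
            parts.append(((r0, r1, c0, c1), (pr0, pr1, pc0, pc1), data))
    return parts


def neighbor_sum_chunk(core: Bounds, padded: Bounds, data: Grid) -> Dict[Tuple[int, int], int]:
    """Sum the 3x3 neighborhood of every core cell, reading only this chunk's padded data."""
    r0, r1, c0, c1 = core
    pr0, pr1, pc0, pc1 = padded
    out = {}
    for r in range(r0, r1):
        for c in range(c0, c1):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # With overlap >= 1, a neighbor outside the padded region
                    # can only lie outside the whole array, so it is skipped.
                    if pr0 <= rr < pr1 and pc0 <= cc < pc1:
                        total += data[rr - pr0][cc - pc0]
            out[(r, c)] = total
    return out


# Usage: each chunk could be handled by a different worker; here they run in a loop.
grid = [[r * 5 + c for c in range(5)] for r in range(5)]
result = {}
for core, padded, data in partition_with_overlap(grid, chunk=2, overlap=1):
    result.update(neighbor_sum_chunk(core, padded, data))
print(result[(2, 2)])  # 108
```

The trade-off in this sketch is that border cells are stored more than once in exchange for chunks that need no communication during processing; an overlap narrower than the operator's neighborhood radius would give wrong results at chunk edges.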

Published In

SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
June 2011
1364 pages
ISBN: 9781450306614
DOI: 10.1145/1989323
Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. overlap execution strategy
2. parallel databases
3. query processing
4. scientific databases

Conference

SIGMOD/PODS '11

Acceptance Rates

Overall Acceptance Rate: 785 of 4,003 submissions, 20%
