ABSTRACT
We present the design, implementation, and evaluation of ArrayStore, a new storage manager for complex, parallel array processing. ArrayStore builds on prior work in multidimensional data storage, but addresses the new problem of supporting a parallel and more varied workload comprising not only range queries, but also binary operations such as joins and complex user-defined functions.
This paper makes two key contributions. First, it examines several existing single-site storage management strategies and array partitioning strategies to identify which combination is best suited for the array-processing workload above. Second, it develops a new and efficient storage-management mechanism that enables parallel processing of operations that must access data from adjacent partitions.
We evaluate ArrayStore on over 80 GB of real data from two scientific domains, using real operators from these domains. We show that ArrayStore outperforms previously proposed storage management strategies on its diverse target workload.
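The abstract does not spell out how operations that need data from adjacent partitions are supported. One common approach, shown here purely as an illustrative sketch and not as ArrayStore's actual mechanism, is to split an array into regular chunks and give each chunk a small margin of cells copied from its neighbors, so that a windowed operator can run on each chunk independently. All names below (`chunk_bounds`, `window_sum`, `window_sum_by_chunk`) and the choice of a 2D grid with a neighborhood sum are assumptions made for the example.

```python
from typing import Iterator, List, Tuple

Grid = List[List[int]]
Bounds = Tuple[int, int, int, int]  # (row_start, row_end, col_start, col_end), half-open


def chunk_bounds(n_rows: int, n_cols: int, chunk: int,
                 overlap: int) -> Iterator[Tuple[Bounds, Bounds]]:
    """Yield (core, padded) bounds for each regular chunk of the grid.

    `padded` extends `core` by `overlap` cells on every side, clipped to the grid.
    """
    for r0 in range(0, n_rows, chunk):
        for c0 in range(0, n_cols, chunk):
            r1, c1 = min(r0 + chunk, n_rows), min(c0 + chunk, n_cols)
            padded = (max(r0 - overlap, 0), min(r1 + overlap, n_rows),
                      max(c0 - overlap, 0), min(c1 + overlap, n_cols))
            yield (r0, r1, c0, c1), padded


def window_sum(grid: Grid, r: int, c: int, radius: int) -> int:
    """Sum the cells within `radius` of (r, c), clipped at the grid edges."""
    rows, cols = len(grid), len(grid[0])
    return sum(grid[i][j]
               for i in range(max(r - radius, 0), min(r + radius + 1, rows))
               for j in range(max(c - radius, 0), min(c + radius + 1, cols)))


def window_sum_by_chunk(grid: Grid, chunk: int, radius: int) -> Grid:
    """Compute the same neighborhood sums one chunk at a time.

    Each chunk only reads its own padded copy, so chunks could in principle
    be processed in parallel without contacting neighboring partitions.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for (r0, r1, c0, c1), (pr0, pr1, pc0, pc1) in chunk_bounds(rows, cols, chunk, radius):
        local = [row[pc0:pc1] for row in grid[pr0:pr1]]  # chunk plus its margin
        for r in range(r0, r1):
            for c in range(c0, c1):
                out[r][c] = window_sum(local, r - pr0, c - pc0, radius)
    return out
```

In this sketch the margin width equals the operator's radius, which is the smallest margin that makes the per-chunk result match the result computed over the whole array; a larger radius would require a larger margin and therefore more duplicated storage.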