DOI: 10.1145/1989323.1989351
research-article

ArrayStore: a storage manager for complex parallel array processing

Published: 12 June 2011

ABSTRACT

We present the design, implementation, and evaluation of ArrayStore, a new storage manager for complex, parallel array processing. ArrayStore builds on prior work in the area of multidimensional data storage, but considers the new problem of supporting a parallel and more varied workload comprising not only range queries, but also binary operations such as joins and complex user-defined functions.

This paper makes two key contributions. First, it examines several existing single-site storage management strategies and array partitioning strategies to identify which combination is best suited for the array-processing workload above. Second, it develops a new and efficient storage-management mechanism that enables parallel processing of operations that must access data from adjacent partitions.

We evaluate ArrayStore on over 80GB of real data from two scientific domains and real operators used in these domains. We show that ArrayStore outperforms previously proposed storage management strategies in the context of its diverse target workload.


        Published in

        SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
        June 2011
        1364 pages
        ISBN: 9781450306614
        DOI: 10.1145/1989323

        Copyright © 2011 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        Overall Acceptance Rate: 741 of 3,710 submissions, 20%
