DOI: 10.1145/2513190.2517827
invited-talk

Revisiting aggregation techniques for big data

Published: 28 October 2013

ABSTRACT

In this talk we first present an introduction to AsterixDB [1], a parallel, semistructured platform to ingest, store, index, query, analyze, and publish "big data" (http://asterixdb.ics.uci.edu), and the various challenges we addressed while building it. AsterixDB combines ideas from semistructured data management, parallel database systems, and first-generation data-intensive computing platforms (MapReduce and Hadoop). The full AsterixDB software stack provides support for big data applications, from the storage and processing engine (Hyracks [2], available at http://hyracks.googlecode.com), to the flexible query optimization layer (Algebricks), to the interfaces for user-level interaction (AQL, HiveQL, Pregelix, etc.). Hyracks is a partitioned-parallel engine for data-intensive computing jobs in the form of DAGs. Algebricks is a model-agnostic, algebraic layer for compiling and optimizing parallel queries to be processed by Hyracks. Queries for AsterixDB can be expressed either in popular higher-level data analysis languages like Pig, Hive, or Jaql, or in its native query language (AQL) and data model (ADM), with support for semistructured information and fuzzy data.

Fundamental data processing operations, like joins and aggregations, are natively supported in AsterixDB. The second part of the talk focuses on our experiences while designing efficient local (per-node) aggregation algorithms for AsterixDB. In particular, there are two challenges for local aggregations in a big data system: first, if the aggregation is group-based (like the "group-by" in SQL), the aggregation result may not fit in main memory; second, to allow multiple operations to be processed simultaneously, an aggregation operation should work within a strict memory budget provided by the platform. Despite their importance and challenges, the design and evaluation of local aggregation algorithms have not received the same level of attention in the literature that other basic operators, such as joins, have received. Facing a lack of "off the shelf" local aggregation algorithms for big data, we present low-level implementation details for engineering the aggregation operator, utilizing (i) sort-based, (ii) hash-based, and (iii) sort-hash-hybrid approaches. We present six algorithms, all of which work within a strictly bounded memory budget and can easily adapt between in-memory and external processing. Among them, two are novel and four are based on extending existing join algorithms.
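
To make the two challenges above concrete, the following is a minimal Python sketch of a hash-based group-by that stays within a fixed budget and falls back to external processing when the groups do not fit. It is an illustration only and is not one of the six algorithms from the talk, nor Hyracks code. The names hash_aggregate, max_groups, and num_partitions are hypothetical, the budget is approximated by a count of in-memory groups rather than bytes, and the aggregate is assumed to be a sum.

```python
import pickle
import tempfile


def _read_spilled(f):
    """Yield the (key, value) records previously pickled into file f."""
    f.seek(0)
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def hash_aggregate(records, max_groups, num_partitions=4, _level=0):
    """Sum values per key while keeping at most max_groups groups in memory.

    records: iterable of (key, value) pairs.
    max_groups: hypothetical stand-in for the platform's memory budget.
    """
    table = {}                                # in-memory groups
    spill_files = [None] * num_partitions     # one temp file per partition

    for key, value in records:
        if key in table:
            table[key] += value               # group is resident: aggregate
        elif len(table) < max_groups:
            table[key] = value                # room left: admit a new group
        else:
            # Budget exhausted: write the raw record to a partition file.
            # Salting with the recursion level changes the split per level.
            p = hash((key, _level)) % num_partitions
            if spill_files[p] is None:
                spill_files[p] = tempfile.TemporaryFile()
            pickle.dump((key, value), spill_files[p])

    # Resident groups are complete: the table never evicts, so a key is
    # either always in memory or always spilled.
    yield from table.items()

    # Aggregate each spilled partition recursively under the same budget.
    for f in spill_files:
        if f is not None:
            yield from hash_aggregate(
                _read_spilled(f), max_groups, num_partitions, _level + 1
            )
            f.close()


if __name__ == "__main__":
    data = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("d", 6)]
    print(sorted(hash_aggregate(data, max_groups=2)))
```

When every group fits within max_groups, no file is created and the sketch behaves as a plain in-memory hash aggregation; otherwise it degrades to external processing one partition at a time.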

We deployed all algorithms as operators in the Hyracks platform and evaluated their performance through extensive experimentation. Our experiments cover many different performance factors, including input cardinality, memory, data distribution, and hash table structure. Our study guided our selection of the local aggregation algorithms supported in the recent release of AsterixDB, namely: the hybrid-hash Pre-Partitioning algorithm, for its tolerance of errors in the estimation of the input grouping key cardinality; the Hash-Sort algorithm, for its good performance when aggregating skewed data; and the Sort-Based algorithm, for input data that is already sorted. This local aggregation work is the first part of a two-part big data aggregation study, as it addresses the "map" phase. Our findings provide the foundation for the global aggregation strategy we are currently investigating for the "reduce" phase. We hope our experience can help developers of other big data platforms build a solid local aggregation operator.
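
The reason a sort-based approach suits already-sorted input can be sketched in a few lines of Python: when all records of a group are adjacent, each group can be aggregated and emitted as soon as the key changes, so only one running aggregate is held at a time. This is a generic illustration under the assumption of a sum aggregate, not the Sort-Based operator as implemented in AsterixDB; the function name sorted_aggregate is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sorted_aggregate(sorted_records):
    """Sum values per key for (key, value) pairs already sorted by key.

    Only the current group's running sum is kept in memory, because
    groupby emits a group as soon as the key changes.
    """
    for key, group in groupby(sorted_records, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


if __name__ == "__main__":
    data = [("a", 1), ("a", 3), ("b", 2), ("c", 4), ("c", 5)]
    print(list(sorted_aggregate(data)))
```

If the input is not sorted by key, the same key can appear in several non-adjacent runs and would be emitted more than once, which is why this shortcut applies only to sorted input.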

References

  1. A. Behm, V. R. Borkar, M. J. Carey, R. Grover, N. Onose, C. Li, R. Vernica, A. Deutsch, Y. Papakonstantinou, and V. J. Tsotras. Asterix: towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases, 29(3):185--216, 2011.
  2. V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, pages 1151--1162, 2011.

      • Published in

        DOLAP '13: Proceedings of the sixteenth international workshop on Data warehousing and OLAP
        October 2013
        110 pages
        ISBN:9781450324120
        DOI:10.1145/2513190

        Copyright © 2013 Owner/Author

        Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        DOLAP '13 Paper Acceptance Rate: 13 of 26 submissions, 50%
        Overall Acceptance Rate: 29 of 79 submissions, 37%
