invited-talk

Revisiting aggregation techniques for big data

Author:
Vassilis J. Tsotras

University of California, Riverside, Riverside, CA, USA

DOLAP '13: Proceedings of the sixteenth international workshop on Data warehousing and OLAP, October 2013, Pages 1–2, https://doi.org/10.1145/2513190.2517827

Published: 28 October 2013

ABSTRACT

In this talk, we first present an introduction to AsterixDB [1], a parallel, semistructured platform to ingest, store, index, query, analyze, and publish "big data" (http://asterixdb.ics.uci.edu), and the various challenges we addressed while building it. AsterixDB combines ideas from semistructured data management, parallel database systems, and first-generation data-intensive computing platforms (MapReduce and Hadoop). The full AsterixDB software stack provides support for big data applications from the storage and processing engine (Hyracks [2], available at http://hyracks.googlecode.com), to the flexible query optimization layer (Algebricks), to the interfaces for user-level interaction (AQL, HiveQL, Pregelix, etc.). Hyracks is a partitioned-parallel engine for data-intensive computing jobs in the form of DAGs. Algebricks is a model-agnostic, algebraic layer for compiling and optimizing parallel queries to be processed by Hyracks. Queries for AsterixDB can be expressed either in popular higher-level data analysis languages like Pig, Hive, or Jaql, or in its native query language (AQL) and data model (ADM), with support for semistructured information and fuzzy data.

Fundamental data processing operations, like joins and aggregations, are natively supported in AsterixDB. The second part of the talk focuses on our experiences while designing efficient local (per-node) aggregation algorithms for AsterixDB. In particular, there are two challenges for local aggregations in a big data system: first, if the aggregation is group-based (like the "group-by" in SQL), the aggregation result may not fit in main memory; second, in order to allow multiple operations to be processed simultaneously, an aggregation operation should work within a strict memory budget provided by the platform. Despite their importance and challenges, the design and evaluation of local aggregation algorithms have not received the same level of attention in the literature that other basic operators, such as joins, have received. Facing a lack of "off the shelf" local aggregation algorithms for big data, we present low-level implementation details for engineering the aggregation operator, utilizing (i) sort-based, (ii) hash-based, and (iii) sort-hash-hybrid approaches. We present six algorithms, all of which work within a strictly bounded memory budget and can easily adapt between in-memory and external processing. Among them, two are novel and four are based on extending existing join algorithms.
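As a rough illustration of the two challenges above, the following Python sketch aggregates under a fixed budget and falls back to external processing when the budget is exhausted. It is not one of the six algorithms from the talk; it is a simplified stand-in that assumes the budget is counted in groups rather than bytes, that the aggregate is a sum, and that keys are sortable. The names `bounded_sum_by_key`, `_spill`, and `_read_run` are hypothetical.

```python
import heapq
import pickle
import tempfile
from itertools import groupby
from operator import itemgetter


def _spill(table):
    """Write the in-memory partial sums to a temp file as a key-sorted run."""
    run = tempfile.TemporaryFile()
    for item in sorted(table.items()):
        pickle.dump(item, run)
    run.seek(0)
    return run


def _read_run(run):
    """Yield (key, partial_sum) pairs back from a spilled run."""
    while True:
        try:
            yield pickle.load(run)
        except EOFError:
            return


def bounded_sum_by_key(records, max_groups):
    """Yield (key, sum) in key order, holding at most max_groups groups in the table."""
    table, runs = {}, []
    for key, value in records:
        if key not in table and len(table) >= max_groups:
            runs.append(_spill(table))  # budget exhausted: flush partial results
            table = {}
        table[key] = table.get(key, 0) + value
    # Treat the final in-memory table as one more sorted run.
    streams = [_read_run(r) for r in runs] + [iter(sorted(table.items()))]
    merged = heapq.merge(*streams, key=itemgetter(0))
    # Partial sums for the same key are adjacent after the merge; combine them.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, sum(partial for _, partial in group)
    for r in runs:
        r.close()
```

When every group fits in the table, nothing is spilled and the function behaves like a plain in-memory hash aggregation; otherwise it degrades to merging sorted runs of partial aggregates, which is the kind of adaptation between in-memory and external processing the abstract describes.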

We deployed all algorithms as operators in the Hyracks platform and evaluated their performance through extensive experimentation. Our experiments cover many different performance factors, including input cardinality, memory, data distribution, and hash table structure. Our study guided our selection of the local aggregation algorithms supported in the recent release of AsterixDB, namely the Hybrid-Hash Pre-Partitioning algorithm for its tolerance of errors in estimating the input grouping key cardinality, the Hash-Sort algorithm for its good performance when aggregating skewed data, and the Sort-Based algorithm for input data that is already sorted. This local aggregation work is the first part of a two-part big data aggregation study, as it addresses the "map" phase. Our findings provide the foundation for the global aggregation strategy we are currently investigating for the "reduce" phase. We hope our experience can help developers of other big data platforms build a solid local aggregation operator.
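To see why sorted input favors a sort-based approach, consider the minimal Python sketch below. It is an illustration under stated assumptions, not the Sort-Based operator shipped in AsterixDB: the function name `aggregate_sorted` is hypothetical, records are assumed to be `(key, value)` pairs already ordered by key, and the aggregates are a count and a sum.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(records):
    """Yield (key, count, total) per group; assumes records are sorted by key."""
    for key, group in groupby(records, key=itemgetter(0)):
        count = total = 0
        for _, value in group:
            count += 1
            total += value
        # All records for this key have been seen, so the group can be emitted.
        yield key, count, total
```

Because equal keys are adjacent, only the state of the current group is held at any time, so memory use does not grow with the number of groups.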

References

[1] A. Behm, V. R. Borkar, M. J. Carey, R. Grover, N. Onose, C. Li, R. Vernica, A. Deutsch, Y. Papakonstantinou, and V. J. Tsotras. Asterix: towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases, 29(3):185–216, 2011.
[2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, pages 1151–1162, 2011.

Index Terms

Revisiting aggregation techniques for big data
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs
2. Software and its engineering

Published in
DOLAP '13: Proceedings of the sixteenth international workshop on Data warehousing and OLAP
October 2013
110 pages
ISBN:9781450324120
DOI:10.1145/2513190
General Chair: Il-Yeol Song (Drexel University, USA)
Program Chairs: Ladjel Bellatreche (ISAE-ENSMA, France), Alfredo Cuzzocrea (ICAR-CNR and University of Calabria, Italy)
Copyright © 2013 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
aggregation
big data management system
Qualifiers
- invited-talk
Conference

Acceptance Rates
DOLAP '13 Paper Acceptance Rate: 13 of 26 submissions, 50%
Overall Acceptance Rate: 29 of 79 submissions, 37%