Orca: a modular query optimizer architecture for big data

Authors:

Mohamed A. Soliman,

Lyublena Antova,

Venkatesh Raghavan,

George C. Caragea,

Carlos Garcia-Alvarado,

Michalis Petropoulos,

Sivaramakrishnan Narayanan,

Konstantinos Krikellas,

Rhonda Baldwin

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 337 - 348

https://doi.org/10.1145/2588555.2595637

Published: 18 June 2014

Abstract

The performance of analytical query processing in data management systems depends primarily on the capabilities of the system's query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer.

In this paper we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development that unites state-of-the-art query optimization technology with our own original research, resulting in a modular and portable optimizer architecture.

In addition to describing the overall architecture, we highlight several unique features and present performance comparisons against other systems.

Index Terms

Orca: a modular query optimizer architecture for big data
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published In

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

June 2014

1645 pages

ISBN:9781450323765

DOI:10.1145/2588555

General Chairs:
Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)

Program Chair:
M. Tamer Özsu (University of Waterloo, Canada)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'14: International Conference on Management of Data

June 22 - 27, 2014

Snowbird, Utah, USA

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
