research-article

QMapper for Smart Grid: Migrating SQL-based Application to Hive

Authors:

Songlin Hu

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 647 - 658

https://doi.org/10.1145/2723372.2742792

Published: 27 May 2015

Abstract

Apache Hive has been widely used by Internet companies for big data analytics applications. It compiles high-level language queries into efficient MapReduce workflows, which frees users from complicated and time-consuming programming. The popularity of Hive and HiveQL-compatible systems such as Impala and Shark has attracted attention from traditional enterprises as well. However, enterprise big data processing systems, such as Smart Grid applications, often have to migrate their RDBMS-based legacy applications to Hive rather than write new logic directly in HiveQL. Because the two differ in syntax and cost model, manually translating SQL written for an RDBMS into HiveQL is difficult and error-prone, and often leads to poor performance.

In this paper, we propose QMapper, a tool for automatically translating SQL into proper HiveQL. QMapper consists of a rule-based rewriter and a cost-based optimizer. Experiments based on the TPC-H benchmark demonstrate that, compared to manually rewritten Hive queries provided by Hive contributors, QMapper dramatically reduces query latency on average. Our real-world Smart Grid application also demonstrates QMapper's efficiency.
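
To make the two-stage idea (rule-based rewriting followed by cost-based selection) concrete, the following is a minimal sketch in Python. It is not QMapper's implementation: the regular expression, the two rewrite rules, and the `toy_cost` function are all hypothetical and only illustrate the general shape of such a pipeline. The rules rewrite an uncorrelated `IN` subquery either as a `LEFT SEMI JOIN` (a standard HiveQL construct) or as a join against a `DISTINCT` subquery. The cost function simply counts `SELECT` blocks as a stand-in for a real cost model.

```python
import re

# Toy pattern (an assumption for illustration) matching queries of the form:
#   SELECT <cols> FROM <t> WHERE <c> IN (SELECT <sc> FROM <st>)
IN_SUBQUERY = re.compile(
    r"^SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<t>\w+)\s+WHERE\s+(?P<c>\w+)\s+IN\s*"
    r"\(\s*SELECT\s+(?P<sc>\w+)\s+FROM\s+(?P<st>\w+)\s*\)\s*$",
    re.IGNORECASE,
)


def to_semi_join(g):
    """Hypothetical rule 1: express the IN subquery as a LEFT SEMI JOIN."""
    return (f"SELECT {g['cols']} FROM {g['t']} LEFT SEMI JOIN {g['st']} "
            f"ON ({g['t']}.{g['c']} = {g['st']}.{g['sc']})")


def to_distinct_join(g):
    """Hypothetical rule 2: join against a de-duplicated subquery."""
    return (f"SELECT {g['cols']} FROM {g['t']} JOIN "
            f"(SELECT DISTINCT {g['sc']} FROM {g['st']}) sub "
            f"ON ({g['t']}.{g['c']} = sub.{g['sc']})")


RULES = [to_semi_join, to_distinct_join]


def toy_cost(hiveql):
    """Placeholder cost: number of SELECT blocks (not a real cost model)."""
    return len(re.findall(r"\bSELECT\b", hiveql, re.IGNORECASE))


def translate(sql):
    """Generate candidate rewrites, then keep the cheapest under toy_cost."""
    m = IN_SUBQUERY.match(sql.strip())
    if m is None:
        return sql  # no rule applies; return the query unchanged
    candidates = [rule(m.groupdict()) for rule in RULES]
    return min(candidates, key=toy_cost)


sql = ("SELECT orders.o_orderkey FROM orders "
       "WHERE o_custkey IN (SELECT c_custkey FROM customer)")
print(translate(sql))
```

With this toy cost, the semi-join candidate wins because it contains a single `SELECT` block. A real optimizer would instead estimate the cost of the MapReduce workflow each candidate compiles to.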

Index Terms

QMapper for Smart Grid: Migrating SQL-based Application to Hive
1. Information systems
  1. Data management systems
    1. Database management system engines

Information

Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

May 2015

2110 pages

ISBN:9781450327589

DOI:10.1145/2723372

General Chair: Timos Sellis (RMIT University, Australia)

Program Chairs: Susan B. Davidson (University of Pennsylvania, USA), Zack Ives (University of Pennsylvania, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China

Conference

SIGMOD/PODS'15: International Conference on Management of Data

Sponsor: SIGMOD

May 31 - June 4, 2015

Melbourne, Victoria, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Cited By

C. Yang, T. Chen, E. Kristiani, and S. Wu. The implementation of data storage and analytics platform for big data lake of electricity usage with Spark. The Journal of Supercomputing, 77(6):5934--5959, 2021. Online publication date: 1-Jun-2021.
https://dl.acm.org/doi/10.1007/s11227-020-03505-6

R. Odarchenko, Z. Hassan, and A. Zaman. Use of Big Data in Aviation. In Automated Systems in the Aviation and Aerospace Industries, pages 436--452, 2019. Online publication date: 2019.
https://doi.org/10.4018/978-1-5225-7709-6.ch017

M. Ahmad. Leveraging Data Analytics by Transforming Relational Database Schema in to Big Data. Trends in Computer Science and Information Technology, 1(1):012--017, 2016. Online publication date: 30-Dec-2016.
https://doi.org/10.17352/tcsit.000002

K. Rajan, D. Kakadia, C. Curino, and S. Krishnan. PerfOrator. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages 415--427, 2016. Online publication date: 5-Oct-2016.
https://dl.acm.org/doi/10.1145/2987550.2987566

F. Zhu, J. Liu, S. Wang, J. Xu, L. Xu, J. Ren, D. Ye, J. Wei, and T. Huang. Hug the Elephant: Migrating a Legacy Data Analytics Application to Hadoop Ecosystem. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 177--187, 2016. Online publication date: Oct-2016.
https://doi.org/10.1109/ICSME.2016.14

R. Liu, C. Kuo, C. Yang, S. Chen, and J. Liu. On Construction of an Energy Monitoring Service Using Big Data Technology for Smart Campus. In 2016 7th International Conference on Cloud Computing and Big Data (CCBD), pages 81--86, 2016. Online publication date: Nov-2016.
https://doi.org/10.1109/CCBD.2016.026
