Cost-Based Optimization of Logical Partitions for a Query Workload in a Hadoop Data Warehouse

Peng, Shu; Gu, Jun; Wang, X. Sean; Rao, Weixiong; Yang, Min; Cao, Yu

doi:10.1007/978-3-319-11116-2_52

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8709)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

Hadoop has recently become a common programming framework for big data analysis on clusters of commodity machines. To optimize queries over large amounts of data managed by the Hadoop Distributed File System (HDFS), it is particularly important to optimize how the data is read. Previous work either designed file formats that cluster data belonging to the same column, or proposed placing correlated data on the same physical nodes. In a query-workload-aware setting, a possible optimization strategy is to place data that may not be used by the same query into different logical partitions, so that a query does not need every partition, while physically distributing the data in each partition evenly across the compute nodes. This paper proposes a condition-based partitioning scheme to implement this optimization strategy. Experiments show that the proposed scheme not only reduces the I/O cost but also keeps the workload balanced across the compute nodes of the cluster.
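
The strategy described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm or cost model, which are not shown here. It assumes a hypothetical workload of half-open range predicates `[lo, hi)` on a single column, uses the predicate boundaries as partition conditions, and spreads each logical partition round-robin over the nodes. All function names and the row layout are made up for illustration.

```python
from collections import defaultdict


def cut_points(queries):
    """Collect the boundaries of all range predicates in the workload.

    Assumption: each query is a half-open range (lo, hi) on one column.
    """
    return sorted({bound for query in queries for bound in query})


def partition_id(value, cuts):
    """Index of the range between consecutive cuts that contains value.

    Partition 0 holds values below the first cut.
    """
    return sum(1 for cut in cuts if value >= cut)


def build_partitions(rows, key, cuts, num_nodes):
    """Assign rows to logical partitions, then spread each partition
    round-robin over the nodes so every node holds a share of it."""
    layout = defaultdict(lambda: [[] for _ in range(num_nodes)])
    for row in rows:
        shards = layout[partition_id(row[key], cuts)]
        # The next node in round-robin order for this partition.
        node = sum(len(shard) for shard in shards) % num_nodes
        shards[node].append(row)
    return layout


def partitions_for_query(query, cuts):
    """Logical partitions that a workload query (lo, hi) has to read.

    Valid only for queries whose bounds were used to build cuts.
    """
    lo, hi = query
    return list(range(partition_id(lo, cuts), partition_id(hi, cuts)))
```

With 40 rows whose `price` values run from 0 to 39 and the workload `[(0, 10), (10, 30)]`, the query `(0, 10)` touches a single logical partition of 10 rows instead of scanning all 40, and those rows are split over the nodes with shard sizes that differ by at most one. A real scheme would also have to weigh the number of partitions against per-file overhead, which this sketch ignores.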

Author information

Authors and Affiliations

Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
Shu Peng, Jun Gu, X. Sean Wang & Min Yang
School of Software Engineering, Tongji University, Shanghai, China
Weixiong Rao
EMC Labs, Tsinghua Science Park, Beijing, China
Yu Cao

Editor information

Editors and Affiliations

Beijing Institute of Spacecraft System Engineering, Beijing, China
Lei Chen
School of Computer Science, National University of Defense Technology, 410073, Changsha, Hunan, China
Yan Jia
RMIT University, Melbourne, Australia
Timos Sellis
School of Computer Science and Technology, Soochow University, 215006, Suzhou, China
Guanfeng Liu

About this paper

Cite this paper

Peng, S., Gu, J., Wang, X.S., Rao, W., Yang, M., Cao, Y. (2014). Cost-Based Optimization of Logical Partitions for a Query Workload in a Hadoop Data Warehouse. In: Chen, L., Jia, Y., Sellis, T., Liu, G. (eds) Web Technologies and Applications. APWeb 2014. Lecture Notes in Computer Science, vol 8709. Springer, Cham. https://doi.org/10.1007/978-3-319-11116-2_52

DOI: https://doi.org/10.1007/978-3-319-11116-2_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11115-5
Online ISBN: 978-3-319-11116-2
eBook Packages: Computer Science, Computer Science (R0)
