Efficient query processing framework for big data warehouse: an almost join-free approach

Wang, Huiju; Qin, Xiongpai; Zhou, Xuan; Li, Furong; Qin, Zuoyan; Zhu, Qing; Wang, Shan

doi:10.1007/s11704-014-4025-6

Research Article
Published: 26 January 2015

Volume 9, pages 224–236, (2015)

Frontiers of Computer Science

Huiju Wang^1,2,3, Xiongpai Qin^1,2, Xuan Zhou^1, Furong Li^1,2, Zuoyan Qin^1, Qing Zhu^1,2 & Shan Wang^1,2

Abstract

The rapidly increasing scale of data warehouses is challenging today’s data analytics technologies. A conventional data analytics platform processes data warehouse queries using a star schema: it normalizes the data into a fact table and a number of dimension tables, and during query processing it selectively joins the tables according to users’ demands. This model is space-economical. However, it faces two problems when applied to big data. First, join is an expensive operation, which prevents a parallel database or a MapReduce-based system from achieving efficiency and scalability simultaneously. Second, join operations have to be executed repeatedly, even though many join results could be reused by different queries.
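
As a minimal illustration of the conventional model described above, the following Python sketch stores a tiny star schema as one fact table plus two dimension tables and answers a query by joining them at query time. All table names, column names and values are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical star schema: one fact table and two dimension tables.
# All names and values are illustrative, not taken from the paper.
customer_dim = {1: {"region": "ASIA"}, 2: {"region": "EUROPE"}}
date_dim = {10: {"year": 2013}, 11: {"year": 2014}}
sales_fact = [
    {"cust_key": 1, "date_key": 10, "revenue": 100},
    {"cust_key": 1, "date_key": 11, "revenue": 250},
    {"cust_key": 2, "date_key": 11, "revenue": 400},
    {"cust_key": 2, "date_key": 10, "revenue": 50},
]

def revenue_by_region(year):
    """Join the fact table with both dimensions, keep one year, sum revenue per region."""
    totals = defaultdict(int)
    for row in sales_fact:
        customer = customer_dim[row["cust_key"]]  # join with the customer dimension
        date = date_dim[row["date_key"]]          # join with the date dimension
        if date["year"] == year:
            totals[customer["region"]] += row["revenue"]
    return dict(totals)

result_2014 = revenue_by_region(2014)
```

Each call repeats the same dimension lookups for every fact row, which is the repeated-join cost that the second problem refers to.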

In this paper, we propose a new query processing framework for data warehouses. It pushes the join operations partially to the pre-processing phase and partially to the post-processing phase, so that data warehouse queries can be transformed into massively parallel filter-aggregation operations on the fact table. In contrast to conventional query processing models, our approach is efficient, scalable and stable regardless of the number of tables involved in the join. It is especially suitable for a large-scale parallel data warehouse. Our empirical evaluation on Hadoop shows that our framework exhibits linear scalability and outperforms some existing approaches by an order of magnitude.
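
The abstract does not spell out the mechanism, so the following Python sketch is only one possible reading of the idea, under stated assumptions: a one-time pre-processing step copies the filter attribute into the fact rows, each query then scans partitions of the fact table independently with a filter and an aggregation, and a final post-processing step joins only the small aggregated result with the dimension table. The data, the names `prejoined`, `filter_aggregate` and `query`, and the use of threads in place of cluster workers are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data; names and values are illustrative, not from the paper.
nation_dim = {
    1: {"name": "CHINA", "region": "ASIA"},
    2: {"name": "FRANCE", "region": "EUROPE"},
    3: {"name": "JAPAN", "region": "ASIA"},
}
fact = [(1, 100), (2, 400), (3, 70), (1, 250), (3, 30)]  # (nation_key, revenue)

# Pre-processing (done once): copy the filter attribute into each fact row.
prejoined = [(key, nation_dim[key]["region"], rev) for key, rev in fact]

def filter_aggregate(partition, region):
    """Join-free scan of one partition: filter on region, sum revenue per key."""
    sums = Counter()
    for key, row_region, rev in partition:
        if row_region == region:
            sums[key] += rev
    return sums

def query(region, n_partitions=2):
    # Split the pre-joined fact table and scan the partitions independently.
    parts = [prejoined[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(filter_aggregate, parts, [region] * n_partitions))
    merged = sum(partials, Counter())
    # Post-processing: join only the small aggregated result with the dimension.
    return {nation_dim[key]["name"]: total for key, total in merged.items()}

asia_revenue = query("ASIA")
```

In this sketch the per-partition scans never touch a dimension table, so they can run independently. The only remaining join operates on the aggregated result, which has one row per group rather than one per fact row.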

Author information

Authors and Affiliations

DEKE Lab (Renmin University of China), Beijing, 100872, China
Huiju Wang, Xiongpai Qin, Xuan Zhou, Furong Li, Zuoyan Qin, Qing Zhu & Shan Wang
School of Information, Renmin University of China, Beijing, 100872, China
Huiju Wang, Xiongpai Qin, Furong Li, Qing Zhu & Shan Wang
School of Computing, National University of Singapore, Singapore, 117417, Singapore
Huiju Wang

Corresponding author

Correspondence to Huiju Wang.

Additional information

Huiju Wang graduated from Renmin University of China in 2012 and works as a postdoctoral research fellow at the School of Computing, National University of Singapore. His research spans big data, cloud computing, databases and data management, with emphasis on graph databases, graph indexing and graph data exploration.

Xiongpai Qin received his MS and PhD degrees in computer science from Renmin University of China in 1998 and 2009, respectively, and works as a lecturer at the School of Information, Renmin University of China. His research interests include semantic-based information retrieval, high-performance databases and big data.

Xuan Zhou is an associate professor at Renmin University of China. He obtained his BS in computer science from Fudan University, China in 2001, and his PhD from the National University of Singapore in 2005. His research interests include database and information management. He has published his work in top conferences and journals on data management.

Furong Li is a PhD candidate at the National University of Singapore. She obtained her BS from Renmin University of China in 2012. Her research interests include data integration, social networks and big data management.

Zuoyan Qin received his BS and MS from Renmin University of China in 2008 and 2011, respectively. He is a senior engineer at Baidu. Before joining Baidu, he worked at Tencent. His main focus is big data processing and cloud computing.

Qing Zhu is an associate professor at the School of Information, Renmin University of China. She completed her PhD in 2005 at Renmin University and her MS in 1991 at Beijing University of Technology, China. Her research interests include grid computing, distributed algorithms, Semantic Web services and high-performance databases.

Professor Shan Wang finished her undergraduate studies at Peking University, China in 1968, and completed her master's studies at Renmin University of China in 1981. Her research interests include high-performance databases, data warehouses, knowledge engineering and information retrieval.

About this article

Cite this article

Wang, H., Qin, X., Zhou, X. et al. Efficient query processing framework for big data warehouse: an almost join-free approach. Front. Comput. Sci. 9, 224–236 (2015). https://doi.org/10.1007/s11704-014-4025-6

Received: 20 January 2014
Accepted: 20 August 2014
Published: 26 January 2015
Issue Date: April 2015
DOI: https://doi.org/10.1007/s11704-014-4025-6
