Scatter-Gather-Merge: An efficient star-join query processing algorithm for data-parallel frameworks

Han, Hyuck; Jung, Hyungsoo; Eom, Hyeonsang; Yeom, Heon Y.

doi:10.1007/s10586-010-0144-5

Published: 06 November 2010

Cluster Computing, Volume 14, pages 183–197 (2011)

Abstract

A data-parallel framework is very attractive for large-scale data processing because it lets an application process a huge amount of data on commodity machines with little effort. MapReduce, a popular data-parallel framework, is used in various fields such as web search, data mining and data warehousing, and it has proven very practical for such data-parallel applications. The star-join query is a common query type in data warehouses, which are a current target domain of data-parallel frameworks. This article proposes a new algorithm that efficiently processes star-join queries in data-parallel frameworks such as MapReduce and Dryad. Our star-join algorithm for general data-parallel frameworks is called Scatter-Gather-Merge, and it processes star-join queries in a constant number of computation steps even as the number of participating dimension tables increases. By adopting Bloom filters, Scatter-Gather-Merge avoids a non-trivial amount of I/O. We also show that Scatter-Gather-Merge can be easily applied to MapReduce. Our experimental results in both cluster and cloud environments show that Scatter-Gather-Merge outperforms existing approaches.
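The abstract does not spell out how Scatter-Gather-Merge uses Bloom filters, so the following is only a minimal single-process sketch of the general idea: build one Bloom filter per (already filtered) dimension table, then drop fact rows that cannot match before doing the exact join. It is not the authors' algorithm and does not model the scatter, gather, or merge steps. The names `BloomFilter` and `star_join`, the filter sizing, and the column names in the usage note are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are arbitrary illustration values, not tuned parameters.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def star_join(fact_rows, dimensions):
    """Join fact rows against every dimension table.

    dimensions maps a foreign-key column name to a dict {key: attributes}
    that holds only the dimension rows surviving the query's predicates.
    """
    # One filter per dimension, built from the surviving dimension keys.
    filters = {}
    for fk, table in dimensions.items():
        bloom = BloomFilter()
        for key in table:
            bloom.add(key)
        filters[fk] = bloom

    results = []
    for row in fact_rows:
        # Cheap pre-check: a miss in any filter means the row cannot join.
        if not all(filters[fk].might_contain(row[fk]) for fk in dimensions):
            continue
        # Exact check removes any Bloom-filter false positives.
        if all(row[fk] in dimensions[fk] for fk in dimensions):
            joined = dict(row)
            for fk, table in dimensions.items():
                joined.update(table[row[fk]])
            results.append(joined)
    return results
```

As a usage example under the same assumptions, calling `star_join` with fact rows keyed by `cust_id` and `date_id` and with dimension dicts `{"cust_id": {1: {"region": "ASIA"}}, "date_id": {10: {"year": 1997}}}` returns only the fact rows whose two keys both appear in the filtered dimensions, each extended with the matching dimension attributes. In a distributed setting the pre-check is what would save I/O, since rejected fact rows never need to be shuffled.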

Author information

Authors and Affiliations

School of Computer Science and Engineering, Seoul National University, Seoul, 151-742, Korea
Hyuck Han, Hyeonsang Eom & Heon Y. Yeom
School of Information Technologies, University of Sydney, Sydney, NSW, 2006, Australia
Hyungsoo Jung

Corresponding author

Correspondence to Hyungsoo Jung.

About this article

Cite this article

Han, H., Jung, H., Eom, H. et al. Scatter-Gather-Merge: An efficient star-join query processing algorithm for data-parallel frameworks. Cluster Comput 14, 183–197 (2011). https://doi.org/10.1007/s10586-010-0144-5

Received: 02 February 2010
Accepted: 11 October 2010
Published: 06 November 2010
Issue Date: June 2011
DOI: https://doi.org/10.1007/s10586-010-0144-5
