Abstract
Data-parallel frameworks are attractive for large-scale data processing because they let an application process huge amounts of data on commodity machines with little effort. MapReduce, a popular data-parallel framework, is used in fields such as web search, data mining, and data warehousing, and has proven practical for such data-parallel applications. Star-join queries are common in data warehouses, which are a current target domain of data-parallel frameworks. This article proposes a new algorithm that efficiently processes star-join queries in data-parallel frameworks such as MapReduce and Dryad. The algorithm, called Scatter-Gather-Merge, processes a star-join query in a constant number of computation steps, even as the number of participating dimension tables increases. By adopting Bloom filters, Scatter-Gather-Merge eliminates a non-trivial amount of I/O. We also show that Scatter-Gather-Merge can be easily applied to MapReduce. Our experimental results in both cluster and cloud environments show that Scatter-Gather-Merge outperforms existing approaches.
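The abstract does not say how the Bloom filters are applied, so the following is only a generic, single-process sketch of the underlying idea: build one Bloom filter per dimension table from the keys that satisfy that dimension's predicate, then use the filters to discard fact rows early. It is not the Scatter-Gather-Merge algorithm itself. The names `BloomFilter`, `star_join`, and the column names used with them are hypothetical, chosen for illustration.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def star_join(fact_rows, dimensions):
    """Join fact rows against filtered dimension tables.

    `dimensions` maps a foreign-key column name to a pair
    (rows_by_key, predicate), where rows_by_key maps key -> dimension row.
    """
    # Build one filter per dimension from the keys that pass its predicate.
    filters, qualifying = {}, {}
    for fk, (rows_by_key, predicate) in dimensions.items():
        qualifying[fk] = {k: r for k, r in rows_by_key.items() if predicate(r)}
        bloom = BloomFilter()
        for key in qualifying[fk]:
            bloom.add(key)
        filters[fk] = bloom

    results = []
    for row in fact_rows:
        # Cheap pre-filter: drop rows that cannot match some dimension.
        if not all(filters[fk].might_contain(row[fk]) for fk in filters):
            continue
        # Exact check removes Bloom-filter false positives.
        if all(row[fk] in qualifying[fk] for fk in qualifying):
            joined = dict(row)
            for fk in qualifying:
                joined.update(qualifying[fk][row[fk]])
            results.append(joined)
    return results
```

In a distributed setting the filters would typically be shipped to the workers that scan the fact table, so that non-matching fact rows are dropped before they are written out or shuffled; that is where an I/O saving would come from. The exact lookup after the filter is needed because a Bloom filter can report false positives.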
Cite this article
Han, H., Jung, H., Eom, H. et al. Scatter-Gather-Merge: An efficient star-join query processing algorithm for data-parallel frameworks. Cluster Comput 14, 183–197 (2011). https://doi.org/10.1007/s10586-010-0144-5