Scatter-Gather-Merge: An efficient star-join query processing algorithm for data-parallel frameworks

Cluster Computing

Abstract

Data-parallel frameworks are attractive for large-scale data processing because they let applications process huge amounts of data on commodity machines. MapReduce, a popular data-parallel framework, is used in fields such as web search, data mining, and data warehousing, and it has proven very practical for such data-parallel applications. Star-join queries are common in data warehouses, which are a current target domain of data-parallel frameworks. This article proposes a new algorithm that efficiently processes star-join queries in data-parallel frameworks such as MapReduce and Dryad. Our star-join algorithm for general data-parallel frameworks is called Scatter-Gather-Merge, and it processes star-join queries in a constant number of computation steps, even as the number of participating dimension tables increases. By adopting Bloom filters, Scatter-Gather-Merge eliminates a non-trivial amount of I/O. We also show that Scatter-Gather-Merge can be easily applied to MapReduce. Our experimental results in both cluster and cloud environments show that Scatter-Gather-Merge outperforms existing approaches.
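The abstract does not spell out the individual steps of Scatter-Gather-Merge, so the following is only a minimal single-process sketch of the general idea behind Bloom-filter pre-filtering in a star join, not the authors' algorithm. It assumes each dimension table has already been reduced to the rows that satisfy the query predicates. The names `BloomFilter`, `star_join`, and the column names in the usage note are hypothetical and chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def star_join(fact_rows, dimensions):
    """Join fact rows against several dimension tables.

    `dimensions` maps a foreign-key column name to a dict of
    {key: dimension_row}, assumed to be pre-filtered by the query predicates.
    """
    # Build one compact filter per dimension from its surviving keys.
    filters = {}
    for fk, dim in dimensions.items():
        bf = BloomFilter()
        for key in dim:
            bf.add(key)
        filters[fk] = bf

    results = []
    for row in fact_rows:
        # Cheap probe: discard fact rows that cannot match some dimension.
        if not all(filters[fk].might_contain(row[fk]) for fk in dimensions):
            continue
        # Exact lookup removes any Bloom-filter false positives.
        if all(row[fk] in dim for fk, dim in dimensions.items()):
            joined = dict(row)
            for fk, dim in dimensions.items():
                joined.update(dim[row[fk]])
            results.append(joined)
    return results
```

For example, with a hypothetical `date_id` dimension holding key 1 and a `cust_id` dimension holding key 10, only fact rows carrying both keys survive the join. In a distributed setting the filters would be small enough to ship to every worker, which is where the I/O saving would come from: most non-matching fact rows can be dropped before they are shuffled.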

Author information

Corresponding author

Correspondence to Hyungsoo Jung.

About this article

Cite this article

Han, H., Jung, H., Eom, H. et al. Scatter-Gather-Merge: An efficient star-join query processing algorithm for data-parallel frameworks. Cluster Comput 14, 183–197 (2011). https://doi.org/10.1007/s10586-010-0144-5

  • DOI: https://doi.org/10.1007/s10586-010-0144-5
