Executing Multiple Group by Query Using MapReduce Approach: Implementation and Optimization

Pan, Jie; Magoulès, Frédéric; Le Biannic, Yann

doi:10.1007/978-3-642-13067-0_67

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6104)

Included in the following conference series:

International Conference on Grid and Pervasive Computing

Abstract

The MapReduce model is a new parallel programming model initially developed for large-scale web content processing. Data analysis faces the problem of how to perform calculations over extremely large datasets. The arrival of MapReduce provides an opportunity to use commodity hardware for massively parallel data analysis applications. The translation of relational algebra operators into MapReduce programs, and the optimization of those programs, is still an open and dynamic research field. In this paper, we focus on a special type of data analysis query, namely the multiple group by query. We first study the communication cost of the MapReduce model, then give an initial implementation of the multiple group by query. We then propose an optimized version that addresses and reduces the communication cost. Our optimized version shows better speed-up and better scalability than the initial version.
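
To make the query type concrete, the following is a minimal in-memory sketch of a multiple group by query expressed as map and reduce steps. It is not the implementation from the paper: the records, column names, and function names (`map_naive`, `map_preaggregated`, `reduce_sum`, `run`) are hypothetical, and local pre-aggregation in the mapper is only one common way to lower communication cost, assumed here for illustration. The sketch counts the intermediate pairs that would cross the network in each variant.

```python
from collections import defaultdict

# Hypothetical toy records; column names are illustrative only.
ROWS = [
    {"region": "EU", "product": "A", "revenue": 10},
    {"region": "EU", "product": "B", "revenue": 5},
    {"region": "US", "product": "A", "revenue": 7},
    {"region": "EU", "product": "A", "revenue": 3},
]
GROUP_COLUMNS = ["region", "product"]  # one GROUP BY per column, same table


def map_naive(rows, group_columns, measure):
    """Emit one ((column, value), measure) pair per row and per grouping column."""
    for row in rows:
        for col in group_columns:
            yield (col, row[col]), row[measure]


def map_preaggregated(rows, group_columns, measure):
    """Sum locally inside the mapper, then emit one pair per distinct key."""
    partial = defaultdict(int)
    for row in rows:
        for col in group_columns:
            partial[(col, row[col])] += row[measure]
    yield from partial.items()


def reduce_sum(pairs):
    """Sum the values that share a key, as a reducer would after the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(mapper, splits, group_columns, measure):
    """Apply the mapper to each split; return the result and the shuffled pair count."""
    shuffled = [
        pair
        for split in splits
        for pair in mapper(split, group_columns, measure)
    ]
    return reduce_sum(shuffled), len(shuffled)


if __name__ == "__main__":
    splits = [ROWS[:2], ROWS[2:]]  # two simulated input splits
    for mapper in (map_naive, map_preaggregated):
        result, pair_count = run(mapper, splits, GROUP_COLUMNS, "revenue")
        print(mapper.__name__, pair_count, result)
```

Both variants return the same aggregates. They differ only in how many intermediate pairs the mappers emit, which is the quantity a communication-cost analysis would care about.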

Author information

Authors and Affiliations

Ecole Centrale Paris, Grande Voie des Vignes, 92295, Châtenay-Malabry Cedex, France
Jie Pan & Frédéric Magoulès
SAP BusinessObjects, 157-159, rue Anatole France, 92309, Levallois-Perret Cedex, France
Yann Le Biannic

Editor information

Editors and Affiliations

Dipartimento di Informatica, Elettronica e Sistemistica (DEIS), Università di Bologna, Viale Risorgimento 2, 40136, Bologna, Italy
Paolo Bellavista
Department of Computer Science and Information Engineering, National Dong Hwa University, 974, Hualien, Taiwan, Republic of China
Ruay-Shiung Chang
Department of Electronic Engineering, National Ilan University, 260, Ilan, Taiwan, Republic of China
Han-Chieh Chao
National Dong Hwa University, 974, Hualien, Taiwan, Republic of China
Shin-Feng Lin
Faculty of Science, Informatics Institute, University of Amsterdam, 1098 XG, Amsterdam, The Netherlands
Peter M. A. Sloot

About this paper

Cite this paper

Pan, J., Magoulès, F., Le Biannic, Y. (2010). Executing Multiple Group by Query Using MapReduce Approach: Implementation and Optimization. In: Bellavista, P., Chang, RS., Chao, HC., Lin, SF., Sloot, P.M.A. (eds) Advances in Grid and Pervasive Computing. GPC 2010. Lecture Notes in Computer Science, vol 6104. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13067-0_67

DOI: https://doi.org/10.1007/978-3-642-13067-0_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13066-3
Online ISBN: 978-3-642-13067-0
eBook Packages: Computer Science, Computer Science (R0)
