An improved partitioning mechanism for optimizing massive data analysis using MapReduce

Slagter, Kenn; Hsu, Ching-Hsien; Chung, Yeh-Ching; Zhang, Daqiang

doi:10.1007/s11227-013-0924-9

Published: 11 April 2013

The Journal of Supercomputing, volume 66, pages 539–555 (2013)

Kenn Slagter¹,
Ching-Hsien Hsu²,
Yeh-Ching Chung¹ &
…
Daqiang Zhang³

Abstract

In the era of Big Data, huge amounts of structured and unstructured data are produced daily by a myriad of ubiquitous sources. Big Data is difficult to work with and requires massively parallel software running on a large number of computers. MapReduce is a recent programming model that simplifies writing distributed applications that handle Big Data. For MapReduce to work, it has to divide the workload among the computers in a network. Consequently, the performance of MapReduce depends strongly on how evenly it distributes this workload. This can be a challenge, especially in the presence of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data. One way to avoid the problems inherent in data skew is to use data sampling. How evenly the partitioner distributes the data depends on how large and representative the sample is and on how well the partitioning mechanism analyzes the samples. This paper proposes a partitioning algorithm that improves load balancing and reduces memory consumption, by means of an improved sampling algorithm and an improved partitioner. To evaluate the proposed algorithm, its performance was compared against the state-of-the-art partitioning mechanism employed by TeraSort. Experiments show that the proposed algorithm is faster, more memory efficient, and more accurate than the current implementation.
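The general idea of sample-based partitioning described in the abstract can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper, nor TeraSort's actual code; it is a generic range partitioner that picks cut keys from a sorted sample, and the function names (`choose_split_points`, `partition_for`, `partition_sizes`) and the toy data are assumptions made for illustration only.

```python
import bisect
import random
from collections import Counter


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 cut keys at evenly spaced ranks of the sorted sample."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Return the index of the partition (reducer) whose key range contains key."""
    return bisect.bisect_right(split_points, key)


def partition_sizes(keys, split_points, num_partitions):
    """Count how many keys land in each partition, to inspect load balance."""
    counts = Counter(partition_for(k, split_points) for k in keys)
    return [counts.get(i, 0) for i in range(num_partitions)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical skewed data set: most keys are small, a few are very large.
    keys = [int(rng.expovariate(1.0) * 1000) for _ in range(100_000)]
    # A small random sample stands in for the sampling phase.
    sample = rng.sample(keys, 1_000)
    splits = choose_split_points(sample, 4)
    print("split points:", splits)
    print("partition sizes:", partition_sizes(keys, splits, 4))
```

Because the cut keys follow the sample's distribution rather than the key space, dense key ranges are given narrower partitions. How close the partition sizes come to an even split depends on how large and representative the sample is, which is the dependency the abstract points out.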

Acknowledgements

We would like to thank our colleagues in the System Software Laboratory at National Tsing Hua University and in the Department of Computer Science and Information Engineering at Chung Hua University for their support and for their help with earlier drafts of this paper.

Author information

Authors and Affiliations

Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
Kenn Slagter & Yeh-Ching Chung
Department of Computer Science, Chung Hua University, Hsinchu, Taiwan
Ching-Hsien Hsu
School of Software Engineering, Tongji University, Shanghai, China
Daqiang Zhang

Corresponding author

Correspondence to Ching-Hsien Hsu.

About this article

Cite this article

Slagter, K., Hsu, CH., Chung, YC. et al. An improved partitioning mechanism for optimizing massive data analysis using MapReduce. J Supercomput 66, 539–555 (2013). https://doi.org/10.1007/s11227-013-0924-9

Published: 11 April 2013
Issue Date: October 2013
DOI: https://doi.org/10.1007/s11227-013-0924-9
