Abstract
In the era of Big Data, huge amounts of structured and unstructured data are produced daily by a myriad of ubiquitous sources. Big Data is difficult to work with and requires massively parallel software running on a large number of computers. MapReduce is a recent programming model that simplifies writing distributed applications that handle Big Data. For MapReduce to work, it has to divide the workload among the computers in a network. Consequently, the performance of MapReduce strongly depends on how evenly it distributes this workload. This can be a challenge, especially in the presence of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data. One way to avoid the problems caused by data skew is to use data sampling. How evenly the partitioner distributes the data depends on how large and representative the sample is and on how well the partitioning mechanism analyzes it. This paper proposes an improved partitioning algorithm that improves load balancing and reduces memory consumption by means of an improved sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against a state-of-the-art partitioning mechanism employed by TeraSort. Experiments show that the proposed algorithm is faster, more memory efficient, and more accurate than the current implementation.
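To make the idea of sample-based partitioning concrete, here is a minimal Python sketch of generic range partitioning driven by a random sample. It is not the algorithm proposed in this paper, nor TeraSort's actual implementation; the function names (`choose_split_points`, `partition_of`, `partition_sizes`) and the parameters `sample_size` and `seed` are assumptions introduced only for illustration.

```python
import bisect
import random
from collections import Counter


def choose_split_points(keys, num_partitions, sample_size, seed=0):
    """Pick num_partitions - 1 boundaries from a random sample of the keys.

    Hypothetical helper: assumes keys is a non-empty list of comparable items.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Use evenly spaced quantiles of the sorted sample as partition boundaries.
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_of(key, split_points):
    """Return the index of the partition (reducer) that receives key."""
    # Binary search keeps the lookup logarithmic in the number of partitions.
    return bisect.bisect_right(split_points, key)


def partition_sizes(keys, split_points):
    """Count how many keys land in each partition, to inspect load balance."""
    counts = Counter(partition_of(k, split_points) for k in keys)
    return [counts.get(i, 0) for i in range(len(split_points) + 1)]
```

Calling `partition_sizes(keys, choose_split_points(keys, 4, 200))` on a skewed key set shows how evenly the sampled boundaries spread the load. A larger or more representative sample generally gives more even partitions, which is the dependency the abstract describes.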
Acknowledgements
We would like to thank our colleagues in the System Software Laboratory at National Tsing Hua University and in the Department of Computer Science and Information Engineering at Chung Hua University for their support and for their help with earlier drafts of this paper.
Cite this article
Slagter, K., Hsu, CH., Chung, YC. et al. An improved partitioning mechanism for optimizing massive data analysis using MapReduce. J Supercomput 66, 539–555 (2013). https://doi.org/10.1007/s11227-013-0924-9
DOI: https://doi.org/10.1007/s11227-013-0924-9