Skip to main content
Log in

An Adaptive and Memory Efficient Sampling Mechanism for Partitioning in MapReduce

International Journal of Parallel Programming Aims and scope Submit manuscript

Abstract

Big Data refers to the massive amounts of structured and unstructured data being produced every day from a wide range of sources. Big Data is difficult to work with and needs a large number of machines to process it, as well as software capable of running in a distributed environment. MapReduce is a recent programming model that simplifies writing distributed programs on distributed systems. For MapReduce to work it needs to divide work amongst computers in a network. Consequently, the performance of MapReduce is dependent on how evenly it distributes the workload. This paper proposes an adaptive sampling mechanism for total order partitioning that can reduce memory consumption whilst partitioning with a trie-based sampling mechanism (ATrie). The performance of the proposed algorithm is compared to a state of the art trie-based partitioning system (ETrie). Experiments show the proposed mechanism is more adaptive and more memory efficient than previous implementations. Since ATrie is adaptive, its performance depended on the type of data used. A performance evaluation of a 2-level ATrie shows it uses 2.43 times less memory for case insensitive email addresses, and uses 1,024 times less memory for birthdates compared to that of a 2-level ETrie. These results show the potential of the proposed method.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

References

  1. Apache Software Foundation, “Hadoop”, http://hadoop.apache.org/core

  2. Candan, K., Kim, J.W., Nagarkar, P., Nagendra, M., Yu, R.: RanKloud: scalable multimedia data processing in server clusters. IEEE MultiMed. 18, 64–77 (2011)

    Article  Google Scholar 

  3. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. In: 7th USENIX Symposium on Operating Systems Design and Implementation, pp. 205–218 (2006)

  4. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)

    Article  Google Scholar 

  5. Fadika, Z., Govindaraju, M.: Delma: dynamically elastic mapreduce framework for cpu-intensive applications. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid, Computing, pp.454–463 (2011)

  6. Ghemawat, S., Gobioff, H., Leung, S.-T.: The google file system. In: ACM SIGOPS Operating Systems Review, ACM, pp. 29–43 (2003)

  7. Groot, S., Kitsuregawa, M.: Jumbo: Beyond MapReduce for Workload Balancing. VLDB, Phd Workshop (2010)

  8. HBase, http://hadoop.apache.org/hbase/

  9. Heinz, S., Zobel, J., Williams, H.: Burst tries: a fast, efficient data structure for string keys. Trans. Inf. Syst. (TOIS) 20(12), 192–223 (2002)

    Article  Google Scholar 

  10. Jiang, W., Agrawal, G.: Ex-mate: data intensive computing with large reduction objects and its application to graph mining. In: Cluster, Cloud and Grid Computing (CCGrid): 11th IEEE/ACM International Symposium on, IEEE 2011, pp. 475–484 (2011)

  11. Jin, C., Vecchiola, C., Buyya, R.: MRPGA: an extension of MapReduce for parallelizing genetic algorithms. In: IEEE Fourth International Conference on eScience pp. 214–221 (2008)

  12. Kavulya, S., Tan, J., Gandhi, R., Narasimhan, P.: An analysis of traces from a production mapreduce cluster. In: Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference, pp. 94–103 (2010)

  13. Krishnan, A.: GridBLAST: a globus-based high-throughput implementation of BLAST in a Grid computing framework. Concurr. Comput. Pract. Exp. 17(13), 1607–1623 (2005)

    Article  Google Scholar 

  14. Liu, H., Orban, D.: Cloud mapreduce: a mapreduce implementation on top of a cloud operating system. In: 11th IEEE/ACM International Symposium, pp. 464–474 (2011)

  15. Lynden, S., Tanimura, Y., Kojima, I., Matono, A.: Dynamic data redistribution for MapReduce joins. In: IEEE International Conference on Coud Computing Technology and Science, pp. 713–717 (2011)

  16. Matsunaga, A., Tsugawa, M., Fortes, J.: “Programming abstractions for data intensive computing on clouds and grids. In: IEEE Fourth International Conference on eScience, pp. 489–493 (2008)

  17. Miceli, C., Miceli, M., Jha, S., Kaiser, H., Merzky, A.: Programming abstractions for data intensive computing on clouds and grids. In: IEEE/ACM International Symposium on Cluster Computing and the Grid, pp. 480–483 (2009)

  18. O’Malley, O.: TeraByte Sort on Apache Hadoop (2008)

  19. Panda, B., Riedewald, M., Fink, D.: The model-summary problem and a solution for trees. In: Data Engineering, International Conference on Data, Engineering, pp. 452–455 (2010)

  20. Papadimitriou, S., Sun, J.: Disco: distributed co-clustering with map-reduce: a case study towards petabyte-scale end-to-end mining. In: IEEE International Conference on Data Mining, p. 519 (2008)

  21. Shafer, J., Rixner, S., Cox, A.L.: The Hadoop distributed filesystem: balancing portability and performance. In: IEEE International Symposium on Performance Analysis of System and Software(ISPASS), p. 123 (2010)

  22. Slagter, K., Hsu, C.-H., Chung, Y.-C., Zhang, D.: An improved partitioning mechanism for optimizing massive data analysis using MapReduce. J. Supercomput. 66(1), 539–555 (2013)

    Google Scholar 

  23. Stockinger, H., Pagni, M., Cerutti, L., Falquet, L.: Grid approach to embarrassingly parallel CPU-intensive bioinformatics problems. In: IEEE International Conference on e-Science and Grid Computing (2006)

  24. Tan, J., Pan, X., Kavulya, S., Gandhi, R., Narasimhan, P.: Mochi: visual log-analysis based tools for debugging Hadoop. In: USENIX Workshop on Hot Topics in Cloud Computing (HotCloud) (2009)

  25. Vashishtha, H., Smit, M., Stroulia, E.: Moving text analysis tools to the cloud. In: IEEE World Congress on Services, pp. 110–112 (2010)

  26. Verma, A., Llora, X., Goldberg, D.E., Campbell, R.H.: Scaling genetic algorithms using mapreduce. In: Intelligent Systems Design and Applications (2009)

  27. White, T.: “Hadoop the definitive guide 2nd edition”, Published Oreilly (2010)

  28. Xu, W., Huang, L., Fox, A., Patterson, D., Jordan, M.I.: Detecting large-scale system problems by mining console logs. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (2009)

Download references

Acknowledgments

We would like to thank the various colleagues in the System Software Laboratory at National Tsing Hua University as well as my colleagues at the Department of Computer Science and Information Engineering in Chung Hua University for their support and for their help on earlier drafts of this paper.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Kenn Slagter or Ching-Hsien Hsu.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Slagter, K., Hsu, CH. & Chung, YC. An Adaptive and Memory Efficient Sampling Mechanism for Partitioning in MapReduce. Int J Parallel Prog 43, 489–507 (2015). https://doi.org/10.1007/s10766-013-0288-z

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10766-013-0288-z

Keywords

Navigation