
An experimental analysis of limitations of MapReduce for iterative algorithms on Spark

Published in: Cluster Computing

Abstract

MapReduce is the most popular framework for large-scale distributed processing, and it has substantially improved the scalability of data mining and machine learning algorithms. However, MapReduce handles iterative algorithms inefficiently, which is a serious problem because many data mining and machine learning algorithms are iterative by nature. To overcome these limitations, several advanced distributed systems have been developed, including HaLoop, iMapReduce, Twister, and Spark. In this paper, we identify and categorize the limitations of MapReduce in handling iterative algorithms and then experimentally investigate the consequences of these limitations using Spark, the most flexible and stable of these systems. According to our experimental results, network I/O overhead was the primary factor affecting system performance; disk I/O overhead also affected performance, but less significantly. Synchronization overhead affected system performance only when the static data was not cached.
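The caching effect described in the abstract can be sketched in plain Python (this is an illustration, not the Spark API; in Spark the analogous mechanism is `rdd.cache()`/`rdd.persist()` on the static RDD). Without a cache, every iteration re-reads the static input, multiplying disk and network I/O:

```python
io_reads = {"count": 0}

def load_static():
    """Simulate reading the static dataset (e.g., from HDFS over the network)."""
    io_reads["count"] += 1
    return [1, 2, 3]

def run_iterations(num_iters, cache=True):
    """Dummy iterative job: each iteration consumes the static data."""
    cached = load_static() if cache else None
    state = 0
    for _ in range(num_iters):
        data = cached if cache else load_static()
        state += sum(data)  # the per-iteration "update" step
    return state

run_iterations(10, cache=True)
reads_cached = io_reads["count"]     # static data loaded once

io_reads["count"] = 0
run_iterations(10, cache=False)
reads_uncached = io_reads["count"]   # static data reloaded every iteration

print(reads_cached, reads_uncached)  # -> 1 10
```

With caching the static data is loaded once regardless of the number of iterations; without it, the I/O cost grows linearly with the iteration count, which is the per-iteration overhead the paper measures.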


[Figures 1–9 are available in the full article.]


References

  1. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)


  2. Doulkeridis, C., Nørvåg, K.: A survey of large-scale analytical query processing in MapReduce. VLDB J. Int. J. Very Large Data Bases 23(3), 355–380 (2014)


  3. Lee, S., Kim, J., Moon, Y.S., Lee, W.: Efficient level-based top-down data cube computation using MapReduce. Trans. Large-Scale Data-Knowl.-Cent. Syst. XXI, 1–9 (2015)

  4. Shim, K.: MapReduce algorithms for big data analysis. Proc. VLDB Endow. 5(12), 2016–2017 (2012)


  5. Apache. Apache Hadoop. https://hadoop.apache.org/

  6. Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow. 3(1–2), 285–296 (2010)


  7. Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: The HaLoop approach to large-scale iterative data analysis. VLDB J. Int. J. Very Large Data Bases 21(2), 169–190 (2012)


  8. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce Online. NSDI 10(4), 20 (2010)

  9. Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.H., Qiu, J., Fox, G.: Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 810–818 (2010, June)

  10. Lee, H., Kang, M., Youn, S.B., Lee, J. G., Kwon, Y.: An experimental comparison of iterative MapReduce frameworks. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 2089–2094 (2016, October)

  11. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, vol. 10, pp. 10 (2010, June)

  12. Zhang, Y., Gao, Q., Gao, L., Wang, C.: iMapReduce: a distributed computing framework for iterative computation. J. Grid Comput. 10(1), 47–68 (2012)

  13. Jiang, X., Li, C., Sun, J.: A modified K-means clustering for mining of multimedia databases based on dimensionality reduction and similarity measures. Clust. Comput. 1–8 (2017)

  14. Miner, D., Shook, A.: MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O’Reilly Media, Inc. (2012)

  15. Li, M., Tan, J., Wang, Y., Zhang, L., Salapura, V.: SparkBench: a spark benchmarking suite characterizing large-scale in-memory data analytics. Clust. Comput. 1–15 (2017)

  16. Kang, M., Lee, J.: A comparative analysis of iterative MapReduce systems. In: Proceedings of the 6th International Conference on Emerging Databases: Technologies, Applications, and Theory (EDB), pp. 61–64 (2016)

  17. Lin, J., Dyer, C.: Data-intensive text processing with MapReduce. Synth. Lect. Hum. Lang. Technol. 3(1), 1–177 (2010)


  18. Apache. Apache Spark. https://spark.apache.org/

  19. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 135–146 (2010, June)

  20. Leskovec, J., Krevl, A.: SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data, June (2014)

  21. The Lemur Project. The ClueWeb09 Collection. http://lemurproject.org/clueweb09, May (2011)

  22. Kwon, Y., Nunley, D., Gardner, J.P., Balazinska, M., Howe, B., Loebman, S.: Scalable clustering algorithm for N-body simulations in a shared-nothing cluster. In: Scientific and Statistical Database Management, pp. 132–150. Springer, Berlin, Heidelberg (2010, January)

  23. Kim, J., Lee, W., Song, J.J., Lee, S.B.: Optimized combinatorial clustering for stochastic processes. Clust. Comput. 20(2), 1135–1148 (2017)


  24. Chu, C., Kim, S.K., Lin, Y.A., Yu, Y., Bradski, G., Ng, A.Y., Olukotun, K.: Map-reduce for machine learning on multicore. Adv. Neural Inf. Process. Syst. 6, 281–288 (2007)


  25. Karau, H., Warren, R.: High Performance Spark: Best Practices for Scaling & Optimizing Apache Spark. O’Reilly Media, Inc. (2016)

  26. Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., Chun, B.G., ICSI, V.: Making sense of performance in data analytics frameworks. NSDI 15, 293–307 (2015, May)

  27. Han, M., Daudjee, K., Ammar, K., Özsu, M.T., Wang, X., Jin, T.: An experimental comparison of pregel-like graph processing systems. Proc. VLDB Endow. 7(12), 1047–1058 (2014)



Acknowledgements

This research, “Geospatial Big Data Management, Analysis and Service Platform Technology Development”, was supported by MOLIT (the Ministry of Land, Infrastructure and Transport), Korea, under the national spatial information research program supervised by KAIA (the Korea Agency for Infrastructure Technology Advancement) (17NSIP-B081011-04). In addition, this project was supported by Microsoft Research through the “Azure for Research” global RFP program.

Author information

Corresponding author: Jae-Gil Lee.

Additional information

This paper is a revised and expanded version of a paper entitled ‘A Comparative Analysis of Iterative MapReduce Systems’, which received the Runner-Up Paper Award at the 6th International Conference on Emerging Databases: Technologies, Applications, and Theory (EDB 2016), 17–19 October 2016, Jeju Island, Korea.


Cite this article

Kang, M., Lee, JG. An experimental analysis of limitations of MapReduce for iterative algorithms on Spark. Cluster Comput 20, 3593–3604 (2017). https://doi.org/10.1007/s10586-017-1167-y
