
An experimental analysis of limitations of MapReduce for iterative algorithms on Spark

Published in: Cluster Computing

Abstract

MapReduce is the most popular framework for large-scale distributed processing, and it has substantially improved the scalability of data mining and machine learning algorithms. However, MapReduce handles iterative algorithms inefficiently, which is a serious problem because many data mining and machine learning algorithms are iterative by nature. To overcome these limitations, several advanced distributed systems have been developed, including HaLoop, iMapReduce, Twister, and Spark. In this paper, we identify and categorize the limitations of MapReduce in handling iterative algorithms and then experimentally investigate the consequences of these limitations using Spark, the most flexible and stable of these systems. According to our experimental results, network I/O overhead was the primary factor affecting system performance; disk I/O overhead also affected performance, but less significantly. Synchronization overhead affected system performance only when the static data was not cached.
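The caching effect described in the abstract can be sketched in plain Python (this is an illustration, not the Spark API; in Spark the analogous mechanism is `rdd.cache()`/`rdd.persist()` on the static RDD). Without a cache, every iteration re-reads the static input, multiplying disk and network I/O:

```python
io_reads = {"count": 0}

def load_static():
    """Simulate reading the static dataset (e.g., from HDFS over the network)."""
    io_reads["count"] += 1
    return [1, 2, 3]

def run_iterations(num_iters, cache=True):
    """Dummy iterative job: each iteration consumes the static data."""
    cached = load_static() if cache else None
    state = 0
    for _ in range(num_iters):
        data = cached if cache else load_static()
        state += sum(data)  # the per-iteration "update" step
    return state

run_iterations(10, cache=True)
reads_cached = io_reads["count"]     # static data loaded once

io_reads["count"] = 0
run_iterations(10, cache=False)
reads_uncached = io_reads["count"]   # static data reloaded every iteration

print(reads_cached, reads_uncached)  # -> 1 10
```

With caching the static data is loaded once regardless of the number of iterations; without it, the I/O cost grows linearly with the iteration count, which is the per-iteration overhead the paper measures.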


[Figures 1–9 are available in the full article.]


References

  1. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)


  2. Doulkeridis, C., Nørvåg, K.: A survey of large-scale analytical query processing in MapReduce. VLDB J. Int. J. Very Large Data Bases 23(3), 355–380 (2014)


  3. Lee, S., Kim, J., Moon, Y.S., Lee, W.: Efficient level-based top-down data cube computation using MapReduce. Trans. Large-Scale Data-Knowl.-Cent. Syst. XXI, 1–9 (2015)

  4. Shim, K.: MapReduce algorithms for big data analysis. Proc. VLDB Endow. 5(12), 2016–2017 (2012)


  5. Apache. Apache Hadoop. https://hadoop.apache.org/

  6. Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow. 3(1–2), 285–296 (2010)


  7. Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: The HaLoop approach to large-scale iterative data analysis. VLDB J. Int. J. Very Large Data Bases 21(2), 169–190 (2012)


  8. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce Online. NSDI 10(4), 20 (2010)

  9. Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.H., Qiu, J., Fox, G.: Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 810–818 (2010, June)

  10. Lee, H., Kang, M., Youn, S.B., Lee, J. G., Kwon, Y.: An experimental comparison of iterative MapReduce frameworks. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 2089–2094 (2016, October)

  11. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, vol. 10, pp. 10 (2010, June)

  12. Zhang, Y., Gao, Q., Gao, L., Wang, C.: iMapReduce: a distributed computing framework for iterative computation. J. Grid Comput. 10(1), 47–68 (2012)

  13. Jiang, X., Li, C., Sun, J.: A modified K-means clustering for mining of multimedia databases based on dimensionality reduction and similarity measures. Clust. Comput. 1–8 (2017)

  14. Miner, D., Shook, A.: MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O’Reilly Media, Inc. (2012)

  15. Li, M., Tan, J., Wang, Y., Zhang, L., Salapura, V.: SparkBench: a spark benchmarking suite characterizing large-scale in-memory data analytics. Clust. Comput. 1–15 (2017)

  16. Kang, M., Lee, J.: A comparative analysis of iterative MapReduce systems. In: Proceedings of the 6th International Conference on Emerging Databases: Technologies, Applications, and Theory (EDB), pp. 61–64 (2016)

  17. Lin, J., Dyer, C.: Data-intensive text processing with MapReduce. Synth. Lect. Hum. Lang. Technol. 3(1), 1–177 (2010)


  18. Apache. Apache Spark. https://spark.apache.org/

  19. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 135–146 (2010, June)

  20. Leskovec, J., Krevl, A.: SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data, June (2014)

  21. The Lemur Project. The ClueWeb09 Collection. http://lemurproject.org/clueweb09, May (2011)

  22. Kwon, Y., Nunley, D., Gardner, J.P., Balazinska, M., Howe, B., Loebman, S.: Scalable clustering algorithm for N-body simulations in a shared-nothing cluster. In: Scientific and Statistical Database Management, pp. 132–150. Springer, Berlin, Heidelberg (2010, January)

  23. Kim, J., Lee, W., Song, J.J., Lee, S.B.: Optimized combinatorial clustering for stochastic processes. Clust. Comput. 20(2), 1135–1148 (2017)


  24. Chu, C., Kim, S.K., Lin, Y.A., Yu, Y., Bradski, G., Ng, A.Y., Olukotun, K.: Map-reduce for machine learning on multicore. Adv. Neural Inf. Process. Syst. 6, 281–288 (2007)


  25. Karau, H., Warren, R.: High Performance Spark: Best Practices for Scaling & Optimizing Apache Spark. O’Reilly Media, Inc. (2016)

  26. Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., Chun, B.G., ICSI, V.: Making sense of performance in data analytics frameworks. NSDI 15, 293–307 (2015, May)

  27. Han, M., Daudjee, K., Ammar, K., Özsu, M.T., Wang, X., Jin, T.: An experimental comparison of pregel-like graph processing systems. Proc. VLDB Endow. 7(12), 1047–1058 (2014)



Acknowledgements

This research, “Geospatial Big Data Management, Analysis and Service Platform Technology Development”, was supported by MOLIT (the Ministry of Land, Infrastructure and Transport), Korea, under the national spatial information research program supervised by KAIA (the Korea Agency for Infrastructure Technology Advancement) (17NSIP-B081011-04). In addition, this project was supported by Microsoft Research through the “Azure for Research” global RFP program.

Author information

Corresponding author: Jae-Gil Lee.

Additional information

This paper is a revised and expanded version of a paper entitled ‘A Comparative Analysis of Iterative MapReduce Systems’, which received the Runner-Up Paper Award at the 6th International Conference on Emerging Databases: Technologies, Applications, and Theory (EDB 2016), 17–19 October 2016, Jeju Island, Korea.


Cite this article

Kang, M., Lee, JG. An experimental analysis of limitations of MapReduce for iterative algorithms on Spark. Cluster Comput 20, 3593–3604 (2017). https://doi.org/10.1007/s10586-017-1167-y
