Migration-Aware Genetic Optimization for MapReduce Scheduling and Replica Placement in Hadoop

Guerrero, Carlos; Lera, Isaac; Juiz, Carlos

doi:10.1007/s10723-018-9432-8

Published: 14 February 2018

Journal of Grid Computing, volume 16, pages 265–284 (2018)

Abstract

This work addresses the optimization of file locality, file availability, and replica migration cost in a Hadoop architecture. Our optimization algorithm is based on the Non-dominated Sorting Genetic Algorithm-II (NSGA-II), and it simultaneously determines file block placement, with a variable replication factor, and MapReduce job scheduling. Our proposal was tested with experiments that considered three data center sizes (8, 16 and 32 nodes) with the same workload and number of files (150 files and 3519 file blocks). In general terms, a placement policy with a variable replication factor obtains greater improvements for our three optimization objectives. In contrast, a job scheduling policy only improves these objectives when it is used along with a variable replication factor. The results also show that the migration cost is a suitable optimization objective, as significant improvements of up to 34% were observed between the experiments.
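As a rough illustration of the ideas named in the abstract, the following Python sketch shows the non-dominated sorting step that NSGA-II uses to rank candidate solutions, together with one possible way to count migration cost. It is not the authors' implementation. The placement encoding (a dict from block id to a set of node ids, so each block can have its own replication factor), the function names, and the definition of migration cost as the number of new replica copies are all assumptions made for the example, and every objective is treated as a value to minimize.

```python
from typing import Dict, List, Sequence, Set

# Hypothetical encoding: block id -> set of node ids holding a replica.
# The set size varies per block, i.e. a variable replication factor.
Placement = Dict[str, Set[int]]


def migration_cost(current: Placement, candidate: Placement) -> int:
    """Assumed cost model: number of replicas that must be copied to a
    node that does not already hold the block."""
    return sum(len(nodes - current.get(block, set()))
               for block, nodes in candidate.items())


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Pareto dominance for minimization: a is no worse than b in every
    objective and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated_sort(scores: List[Sequence[float]]) -> List[List[int]]:
    """Group solution indices into Pareto fronts, best front first."""
    n = len(scores)
    dominated = [[] for _ in range(n)]  # indices that solution i dominates
    counts = [0] * n                    # how many solutions dominate i
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(scores[i], scores[j]):
                dominated[i].append(j)
            elif dominates(scores[j], scores[i]):
                counts[i] += 1
    fronts: List[List[int]] = []
    current = [i for i in range(n) if counts[i] == 0]
    while current:
        fronts.append(current)
        nxt = []
        for i in current:
            for j in dominated[i]:
                counts[j] -= 1
                if counts[j] == 0:
                    nxt.append(j)
        current = nxt
    return fronts


# Illustrative values only; they are not taken from the paper.
old = {"b1": {0, 1}, "b2": {2}}
new = {"b1": {0, 3}, "b2": {2, 4, 5}}
cost = migration_cost(old, new)

# Each tuple: (locality penalty, unavailability, migration cost).
scores = [(0.2, 0.1, 5.0), (0.3, 0.2, 6.0), (0.1, 0.3, 4.0), (0.3, 0.3, 7.0)]
fronts = non_dominated_sort(scores)
```

Here `cost` counts three new replica copies, and `fronts` places solutions 0 and 2 in the first front because neither dominates the other. A full NSGA-II loop would add crowding-distance selection, crossover, and mutation on top of this ranking.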

Acknowledgements

This research was supported by the Ministerio de Economía, Industria y Competitividad (MINECO) of Spain and the European Commission (FEDER funds) through grant number TIN2017-88547-P.

Author information

Authors and Affiliations

Computer Science Department, University of Balearic Islands, Crta. Valldemossa km 7.5, E07122, Palma, Spain
Carlos Guerrero, Isaac Lera & Carlos Juiz

Corresponding author

Correspondence to Carlos Guerrero.

About this article

Cite this article

Guerrero, C., Lera, I. & Juiz, C. Migration-Aware Genetic Optimization for MapReduce Scheduling and Replica Placement in Hadoop. J Grid Computing 16, 265–284 (2018). https://doi.org/10.1007/s10723-018-9432-8

Received: 23 June 2017
Accepted: 04 February 2018
Published: 14 February 2018
Issue Date: June 2018
DOI: https://doi.org/10.1007/s10723-018-9432-8
