Abstract
Today, data-intensive applications rely on geographically distributed systems to leverage data collection, storing and processing. Data locality has been seen as a prominent technique to improve application performance and reduce the impact of network latency by scheduling jobs directly in the nodes hosting the data to be processed. MapReduce and Dryad are examples of frameworks which exploit locality by splitting jobs into multiple tasks that are dispatched to process portions of data locally. However, as the ecosystem of big data analysis has shifted from single clusters to span geo-distributed data centers, it is unavoidable that data may still be transferred through the network in order reduce the schedule length. Nevertheless, there is a lack of mechanism to efficiently blend data locality and inter-data center data transfer requirement in the existing scheduling techniques to address data-intensive processing across dispersed data centers. Therefore, the objective of this work is to propose and solve the makespan optimization problem for data-intensive job scheduling on geo-distributed data centers. To this end, we first formulate the task placement and the data access as a linear programming and use the GLPK solver to solve it. We then present a low complexity heuristic scheduling algorithm called GeoDis which allows data locality to cope with the data transfer requirement to achieve a greater performance on the makespan. The experiments with various realistic traces and synthetic generated workload show that GeoDis can reduce makespan of processing jobs by 44% as compared to the state-of-the-art algorithms and remain within \(91\%\) closer to the optimal solution by the LP solver.
Similar content being viewed by others
References
Abad CL, Lu Y, Campbell RH (2011) Dare: adaptive data replication for efficient cluster scheduling. In: 2011 IEEE international conference on cluster computing, pp 159–168. doi:10.1109/CLUSTER.2011.26
Abawajy JH, Deris MM (2014) Data replication approach with consistency guarantee for data grid. IEEE Trans Comput 63(12):2975–2987. doi:10.1109/TC.2013.183
AWS: Amazon Web Service (2006). http://aws.amazon.com
Ananthanarayanan G, Ghodsi A, Shenker S, Stoica I (2013) Effective straggler mitigation: attack of the clones. In: Presented as part of the 10th USENIX symposium on networked systems design and implementation (NSDI 13). USENIX, Lombard, IL, pp 185–198. https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/ananthanarayanan
Ananthanarayanan G, Kandula S, Greenberg A, Stoica I, Lu Y, Saha B, Harris E (2010) Reining in the outliers in map-reduce clusters using mantri. In: Proceedings of the 9th USENIX conference on operating systems design and implementation, OSDI’10. USENIX Association, Berkeley, CA, USA, pp 265–278. http://dl.acm.org/citation.cfm?id=1924943.1924962
Anikode LR, Tang B (2011) Integrating scheduling and replication in data grids with performance guarantee. In: Global telecommunications conference (GLOBECOM 2011), 2011 IEEE, pp 1–6. doi:10.1109/GLOCOM.2011.6134492
Breslau L, Cao P, Fan L, Phillips G, Shenker S (1999) Web caching and Zipf-like distributions: evidence and implications. In: INFOCOM ’99. Eighteenth annual joint conference of the IEEE computer and communications societies. Proceedings. IEEE, vol 1, pp 126–134. doi:10.1109/INFCOM.1999.749260
Cameron DG, Carvajal-Schiaffino R, Millar AP, Nicholson C, Stockinger K, Zini F (2003) Evaluating scheduling and replica optimisation strategies in optorsim. In: Proceedings. First Latin American Web Congress, pp 52–59 (2003). doi:10.1109/GRID.2003.1261698
Cardosa M, Wang C, Nangia A, Chandra A, Weissman J (2011) Exploring MapReduce efficiency with highly-distributed data. In: Proceedings of the second international workshop on MapReduce and its applications, MapReduce ’11, ACM, New York, NY, USA, pp 27–34. doi:10.1145/1996092.1996100
Cavallo M, Modica GD, Polito C, Tomarchio O (2016) Application profiling in hierarchical Hadoop for geo-distributed computing environments. In: 2016 IEEE symposium on computers and communication (ISCC), pp 555–560. doi:10.1109/ISCC.2016.7543796
Chen W, Paik I, Li Z (2016) Cost-aware streaming workflow allocation on geo-distributed data centers. IEEE Trans Comput 66(2):256–271. doi:10.1109/TC.2016.2595579
Chen Y, Ganapathi A, Griffith R, Katz R (2011) The case for evaluating mapreduce performance using workload suites. In: 2011 IEEE 19th annual international symposium on modelling, analysis, and simulation of computer and telecommunication systems, pp 390–399. doi:10.1109/MASCOTS.2011.12
Cheng D, Rao J, Guo Y, Jiang C, Zhou X (2017) Improving performance of heterogeneous mapreduce clusters with adaptive task tuning. IEEE Trans Parallel Distrib Syst 28(3):774–786. doi:10.1109/TPDS.2016.2594765
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113. doi:10.1145/1327452.1327492
Elghirani A, Subrata R, Zomaya AY (2007) Intelligent scheduling and replication in datagrids: a synergistic approach. In: Seventh IEEE international symposium on cluster computing and the grid (CCGrid ’07), pp 179–182. doi:10.1109/CCGRID.2007.65
Garg N, Kumar A, Pandit V (2007) Order scheduling models: hardness and algorithms. In: Proceedings of the 27th international conference on foundations of software technology and theoretical computer science, FSTTCS’07, Springer, Berlin, pp 96–107
Google Compute Engine (2011). https://cloud.google.com/compute/
Greenberg A, Hamilton J, Maltz DA, Patel P (2008) The cost of a cloud: research problems in data center networks. SIGCOMM Comput Commun Rev 39(1):68–73. doi:10.1145/1496091.1496103
Apache Hadoop Project (2013). http://hadoop.apache.org
Heintz B, Chandra A, Sitaraman RK, Weissman J (2016) End-to-end optimization for geo-distributed mapreduce. IEEE Trans Cloud Comput 4(3):293–306. doi:10.1109/TCC.2014.2355225
Herodotou H, Dong F, Babu S (2011) No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics. In: Proceedings of the 2nd ACM symposium on cloud computing, SOCC ’11, ACM, New York, NY, USA, pp 18:1–18:14. doi:10.1145/2038916.2038934
Hu Z, Li B, Luo J (2016) Flutter: scheduling tasks closer to data across geo-distributed datacenters. In: IEEE INFOCOM 2016 - the 35th annual IEEE international conference on computer communications, pp 1–9. doi:10.1109/INFOCOM.2016.7524469
Hung CC, Golubchik L, Yu M (2015) Scheduling jobs across geo-distributed datacenters. In: Proceedings of the sixth acm symposium on cloud computing, SoCC ’15, ACM, New York, NY, USA , pp 111–124. doi:10.1145/2806777.2806780
Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the 2007 Eurosys conference. Association for Computing Machinery, Inc., Lisbon, Portugal. http://research.microsoft.com/apps/pubs/default.aspx?id=63785
Jalaparti V, Ballani H, Costa P, Karagiannis T, Rowstron A (2012) Bridging the tenant-provider gap in cloud services. In: Proceedings of the third ACM symposium on cloud computing, SoCC ’12, ACM, New York, NY, USA, pp 10:1–10:14. doi:10.1145/2391229.2391239
Jalaparti V, Bodik P, Menache I, Rao S, Makarychev K, Caesar M (2015) Network-aware scheduling for data-parallel jobs: plan when you can. SIGCOMM Comput Commun Rev 45(4):407–420. doi:10.1145/2829988.2787488
Jin Y, Gao Y, Qian Z, Zhai M, Peng H, Lu S (2016) Workload-aware scheduling across geo-distributed data centers. In: 2016 IEEE Trustcom/BigDataSE/ISPA, pp 1455–1462. doi:10.1109/TrustCom.2016.0228
Jolfaei F, Haghighat AT (2012) The impact of bandwidth and storage space on job scheduling and data replication strategies in data grids. In: Computing technology and information management (ICCM), 2012 8th international conference on, vol 1, pp 283–288
Kloudas K, Mamede M, Preguiça N, Rodrigues R (2015) Pixida: optimizing data parallel jobs in wide-area data analytics. Proc VLDB Endow 9(2):72–83. doi:10.14778/2850578.2850582
Koshiba Y, Chen W, Yamada Y, Tanaka T, Paik I (2015) Investigation of network traffic in geo-distributed data centers. In: 2015 IEEE 7th international conference on awareness science and technology (iCAST), pp 174–179 (2015). doi:10.1109/ICAwST.2015.7314042
Kwok YK, Ahmad I (1999) Fastest: a practical low-complexity algorithm for compile-time assignment of parallel programs to multiprocessors. IEEE Trans Parallel Distrib Syst 10(2):147–159. doi:10.1109/71.752781
Lee YC, Zomaya AY (2007) Practical scheduling of bag-of-tasks applications on grids with dynamic resilience. IEEE Trans Comput 56(6):815–825. doi:10.1109/TC.2007.1042
Li P, Guo S, Miyazaki T, Liao X, Jin H, Zomaya A, Wang K (2016) Traffic-aware geo-distributed big data analytics with predictable job completion time. IEEE Trans Parallel Distrib Syst 28(6):1785–1796. doi:10.1109/TPDS.2016.2626285
Li P, Guo S, Yu S, Zhuang W (2015) Cross-cloud mapreduce for big data. IEEE Trans Cloud Comput 26(3):1–14. doi:10.1109/TCC.2015.2474385
Li S, Lu Q, Zhang W, Zhu L (2015) A mapreduce cluster deployment optimization framework with geo-distributed data. In: 2015 IEEE 12th Intl Conf on ubiquitous intelligence and computing and 2015 IEEE 12th intl conf on autonomic and trusted computing and 2015 IEEE 15th intl conf on scalable computing and communications and its associated workshops (UIC-ATC-ScalCom), pp 943–949. doi:10.1109/UIC-ATC-ScalCom-CBDCom-IoP.2015.179
Li, W., Yang, Y., Yuan, D.: A novel cost-effective dynamic data replication strategy for reliability in cloud data centres. In: Dependable, autonomic and secure computing (DASC), 2011 IEEE ninth international conference on, pp 496–502. doi:10.1109/DASC.2011.95
Liao X, Gao Z, Ji W, Wang Y (2015) An enforcement of real time scheduling in spark streaming. In: Green computing conference and sustainable computing conference (IGSC), 2015 sixth international, pp 1–6. doi:10.1109/IGCC.2015.7393730
Lin W, Qian Z, Xu J, Yang S, Zhou J, Zhou L (2016) Streamscope: continuous reliable distributed processing of big data streams. In: 13th USENIX symposium on networked systems design and implementation (NSDI 16), USENIX Association, Santa Clara, CA, pp 439–453. https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/lin
Makhorin A (2012) Gnu linear programming kit, version 4.52. http://www.gnu.org/software/glpk/glpk.html
Mandal A, Xin Y, Baldine I, Ruth P, Heerman C, Chase J, Orlikowski V, Yumerefendi A (2011) Provisioning and evaluating multi-domain networked clouds for Hadoop-based applications. In: 2011 IEEE third international conference on cloud computing technology and science, pp 690–697. doi:10.1109/CloudCom.2011.107
Microsoft Azure (2010). https://azure.microsoft.com/
Nguyen VH, Tuong NH, Tran VH, Thoai N (2013) An MILP-based makespan minimization model for single-machine scheduling problem with splitable jobs and availability constraints. In: Computing, management and telecommunications (ComManTel), 2013 international conference on, pp 397–400. doi:10.1109/ComManTel.2013.6482427
Pu Q, Ananthanarayanan G, Bodik P, Kandula S, Akella A, Bahl P, Stoica I (2015) Low latency geo-distributed data analytics. SIGCOMM Comput Commun Rev 45(4):421–434. doi:10.1145/2829988.2787505
Pu Q, Ananthanarayanan G, Bodik P, Kandula S, Akella A, Bahl P, Stoica I (2015) Low latency geo-distributed data analytics. In: Proceedings of the 2015 ACM conference on special interest group on data communication, SIGCOMM ’15, ACM, New York, NY, USA, pp 421–434. doi:10.1145/2785956.2787505
Rackspace (1998). https://www.rackspace.com/
Schrage L (1968) A proof of the optimality of the shortest remaining processing time discipline. Oper Res 16(3):687–690. https://www.rackspace.com/
Sih GC, Lee EA (1993) A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Trans Parallel Distrib Syst 4(2):175–187. doi:10.1109/71.207593
Sooezi N, Abrishami S, Lotfian M (2015) Scheduling data-driven workflows in multi-cloud environment. In: 2015 IEEE 7th international conference on cloud computing technology and science (CloudCom), pp 163–167. doi:10.1109/CloudCom.2015.95
Apache Spark? (2013). http://spark.apache.org/
Toosi AN, Buyya R (2015) A fuzzy logic-based controller for cost and energy efficient load balancing in geo-distributed data centers. In: 2015 IEEE/ACM 8th international conference on utility and cloud computing (UCC), pp 186–194. doi:10.1109/UCC.2015.35
Tripathi R, Vignesh S, Tamarapalli V, Medhi D (2017) Cost efficient design of fault tolerant geo-distributed data centers. IEEE Trans Network Service Manag 14(2):289–301. doi:10.1109/TNSM.2017.2691007
Tudoran R, Costan A, Antoniu G (2016) Overflow: multi-site aware big data management for scientific workflows on clouds. IEEE Trans Cloud Comput 4(1):76–89. doi:10.1109/TCC.2015.2440254
Venugopal S, Buyya R (2008) An SCP-based heuristic approach for scheduling distributed data-intensive applications on global grids. J Parallel Distrib Comput 68(4):471–487. doi:10.1016/j.jpdc.2007.07.004
Vulimiri A, Curino C, Godfrey PB, Jungblut T, Karanasos K, Padhye J, Varghese G (2015) Wanalytics: geo-distributed analytics for a data intensive world. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, SIGMOD ’15, ACM, New York, NY, USA, pp 1087–1092. doi:10.1145/2723372.2735365
Vulimiri A, Curino C, Godfrey PB, Jungblut T, Padhye J, Varghese G (2015) Global analytics in the face of bandwidth and regulatory constraints. In: 12th usenix symposium on networked systems design and implementation (NSDI 15), USENIX Association, Oakland, CA, pp 323–336. https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/vulimiri
Wang L, Tao J, Ranjan R, Marten H, Streit A, Chen J, Chen D (2013) G-Hadoop: mapreduce across distributed data centers for data-intensive computing. Future Gener Comput Syst 29(3):739–750. doi:10.1016/j.future.2012.09.001. Special section: recent developments in high performance computing and security
Zarina M, Ahmad F, bin Mohd Rose AN, Nordin M, Deris MM (2013) Job scheduling for dynamic data replication strategy in heterogeneous federation data grid systems. In: Informatics and applications (ICIA), 2013 second international conference on, pp 203–206. doi:10.1109/ICoIA.2013.6650256
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Convolbo, M.W., Chou, J., Hsu, CH. et al. GEODIS: towards the optimization of data locality-aware job scheduling in geo-distributed data centers. Computing 100, 21–46 (2018). https://doi.org/10.1007/s00607-017-0564-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00607-017-0564-7