Abstract
To provide timely results for big data analytics, it is crucial to satisfy deadline requirements for MapReduce jobs in today’s production environments. Much effort has been devoted to the problem of meeting deadlines, and existing solutions are typically of two kinds. The first allocates appropriate resources to complete the entire job before the specified time limit, but misses the deadline when the deadline is tight or resources are insufficient; the second runs a pre-constructed sample chosen according to the deadline constraint, which satisfies the time requirement but fails to maximize the volume of processed data. In this paper, we propose a deadline-oriented task scheduling approach, named ‘Dart’, to address these problems. Given a specified deadline and restricted resources, Dart uses an iterative estimation method, based on both historical data and the running status of the job, to precisely estimate the job completion time in real time. Based on the estimated time, Dart uses an approach–revise algorithm to make dynamic scheduling decisions that meet deadlines while maximizing the amount of processed data and mitigating stragglers. Dart also handles task failures and data skew efficiently, so that they do not degrade its performance. We have validated our approach using workloads from OpenCloud and Facebook on a cluster of 64 virtual machines. The results show that Dart not only meets deadlines effectively but also processes near-maximum volumes of data, even with tight deadlines and limited resources.
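The idea of combining historical data with running status to decide how much data can be processed before a deadline can be illustrated with a minimal sketch. This is not Dart's actual estimation or approach–revise algorithm, which the abstract does not specify; the function names, the weighted-average blend, the `prior_weight` parameter, and the assumption that tasks run in full waves across a fixed number of slots are all hypothetical simplifications.

```python
import math

def estimate_task_time(historical_mean, observed_durations, prior_weight=5.0):
    """Blend a historical mean task duration with durations observed so far.

    Hypothetical estimator: the historical mean counts as `prior_weight`
    pseudo-observations, so the estimate shifts toward the observed
    durations as more tasks of the running job complete.
    """
    n = len(observed_durations)
    return (prior_weight * historical_mean + sum(observed_durations)) / (prior_weight + n)

def tasks_that_fit(time_left, task_time, slots):
    """Number of tasks that can finish in `time_left`, assuming full waves over `slots`."""
    if time_left <= 0 or task_time <= 0:
        return 0
    waves = math.floor(time_left / task_time)
    return waves * slots

def plan(deadline, now, historical_mean, observed_durations, slots, pending):
    """Number of pending tasks to schedule so the job is expected to meet the deadline."""
    est = estimate_task_time(historical_mean, observed_durations)
    return min(pending, tasks_that_fit(deadline - now, est, slots))
```

Calling `plan` again each time a task finishes re-estimates the task time from the latest observations and revises how many of the remaining tasks are scheduled, which is the general shape of an estimate-then-adjust loop under these assumptions.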
Additional information
Project supported by the National Key Research and Development Program of China (No. 2016YFB1000101)
Cite this article
Hu, Mh., Wang, Cj. & Peng, Yx. Meeting deadlines for approximation processing in MapReduce environments. Frontiers Inf Technol Electronic Eng 18, 1754–1772 (2017). https://doi.org/10.1631/FITEE.1601056
DOI: https://doi.org/10.1631/FITEE.1601056