Meeting deadlines for approximation processing in MapReduce environments

Frontiers of Information Technology & Electronic Engineering

Abstract

To provide timely results for big data analytics, it is crucial to satisfy deadline requirements for MapReduce jobs in today’s production environments. Much effort has been devoted to meeting deadlines, and existing solutions typically fall into two kinds. The first allocates appropriate resources to complete the entire job before the specified time limit; it misses deadlines when the deadline constraints are tight or resources are lacking. The second runs a pre-constructed sample chosen according to the deadline constraints; it can satisfy the time requirement but fails to maximize the volume of processed data. In this paper, we propose a deadline-oriented task scheduling approach, named ‘Dart’, to address these problems. Given a specified deadline and restricted resources, Dart uses an iterative estimation method, based on both historical data and job running status, to precisely estimate the real-time job completion time. Based on the estimated time, Dart uses an approach–revise algorithm to make dynamic scheduling decisions that meet deadlines while maximizing the amount of processed data and mitigating stragglers. Dart also efficiently handles task failures and data skew, so that its performance is not harmed. We have validated our approach using workloads from OpenCloud and Facebook on a cluster of 64 virtual machines. The results show that Dart not only meets the deadline effectively but also processes near-maximum volumes of data, even with tight deadlines and limited resources.
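The abstract does not give the details of Dart's estimator or of its approach–revise algorithm, so the following is only a minimal sketch of the general idea it describes: blend a historical task-time estimate with times observed in the running job, and use the current estimate to decide how many more tasks can be started before the deadline. The names `Estimator`, `tasks_that_fit`, and `simulate`, the smoothing factor `weight`, and the assumption that every task in a wave takes the same time are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Estimator:
    """Blend a historical mean task time with times observed in the running job."""
    historical_mean: float  # seconds per task from past runs (assumed input)
    weight: float = 0.5     # hypothetical smoothing factor, not from the paper
    estimate: float = 0.0

    def __post_init__(self) -> None:
        # Start from history; observations revise the estimate later.
        self.estimate = self.historical_mean

    def observe(self, task_seconds: float) -> float:
        # Exponentially weighted update toward the most recent observation.
        self.estimate = (1 - self.weight) * self.estimate + self.weight * task_seconds
        return self.estimate


def tasks_that_fit(time_left: float, slots: int, per_task: float) -> int:
    """Tasks that can finish within time_left when run in full waves on `slots` slots."""
    if time_left <= 0 or per_task <= 0:
        return 0
    waves = int(time_left // per_task)
    return waves * slots


def simulate(deadline, slots, total_tasks, historical_mean, actual_task_seconds):
    """Run waves of tasks, re-estimating after each wave, until no wave fits."""
    est = Estimator(historical_mean)
    now, done = 0.0, 0
    while done < total_tasks:
        budget = tasks_that_fit(deadline - now, slots, est.estimate)
        if budget == 0:
            break  # the current estimate says another wave would miss the deadline
        wave = min(slots, total_tasks - done, budget)
        now += actual_task_seconds  # simplification: one wave costs one task time
        done += wave
        est.observe(actual_task_seconds)
    return done, now


if __name__ == "__main__":
    processed, finished_at = simulate(
        deadline=100, slots=4, total_tasks=100,
        historical_mean=10, actual_task_seconds=20,
    )
    print(processed, finished_at)
```

In this toy run the historical mean (10 s) is too optimistic, and each observed wave pulls the estimate toward the real task time (20 s), so the loop stops launching work once another wave no longer fits. Because the estimate can lag behind the true task time, a sketch like this does not by itself guarantee that the deadline is met; it only illustrates the estimate-then-revise loop.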

Author information

Corresponding author

Correspondence to Ming-hao Hu.

Additional information

Project supported by the National Key Research and Development Program of China (No. 2016YFB1000101)

Cite this article

Hu, Mh., Wang, Cj. & Peng, Yx. Meeting deadlines for approximation processing in MapReduce environments. Frontiers Inf Technol Electronic Eng 18, 1754–1772 (2017). https://doi.org/10.1631/FITEE.1601056

  • DOI: https://doi.org/10.1631/FITEE.1601056
