
Accelerating DAG-Style Job Execution via Optimizing Resource Pipeline Scheduling

  • Regular Paper
  • Published in Journal of Computer Science and Technology

Abstract

The volume of data processed in big data clusters is growing rapidly, making time-efficient data analysis critical. However, simply adding more computation resources may not speed up the analysis significantly, because data analysis jobs usually consist of multiple stages organized as a directed acyclic graph (DAG), and the precedence relationships between stages pose scheduling challenges. General DAG scheduling is a well-known NP-hard problem. Moreover, we observe that in parallel computing frameworks such as Spark, the execution of a stage in the DAG contains multiple phases that use different resources. Carefully pipelining these phases can reduce resource idle time and improve average resource utilization. We therefore propose a resource pipeline scheme with the objective of minimizing the job makespan. For perfectly parallel stages, we propose a contention-free scheduler with detailed theoretical analysis, and we further extend it to three-phase stages, considering that the computation phase of some stages can be partitioned. Additionally, since job stages in real-world applications are usually not perfectly parallel, the parallelism levels must be adjusted frequently during DAG execution. Because reinforcement learning (RL) techniques can adjust the scheduling policy on the fly, we investigate an RL-based scheduler for jobs arriving online, which adapts to resource contention. We evaluate both the contention-free and the RL-based schedulers on a Spark cluster, using a real-world cluster trace dataset to simulate different DAG styles. The evaluation results show that our pipelined scheme significantly improves CPU and network utilization.
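To make the pipelining idea concrete, below is a minimal illustrative sketch in Python (not the authors' scheduler). It models each stage as a hypothetical pair of phase durations, a network fetch followed by a CPU compute, and compares the makespan of strictly serial execution with a pipelined schedule in which the fetch of the next stage overlaps the compute of the current one. Johnson's classic two-machine flow-shop rule is shown only as one possible ordering heuristic; all durations and function names are assumptions for illustration.

# Hedged sketch, not the paper's algorithm: a toy two-phase stage model.
# Each stage is a (fetch, compute) pair of durations on two different
# resources (network, CPU), so the next stage's fetch can overlap the
# current stage's compute.

def makespan_serial(stages):
    # No pipelining: every phase runs back to back on an otherwise idle cluster.
    return sum(fetch + compute for fetch, compute in stages)

def makespan_pipelined(stages):
    # Classic two-machine flow-shop recurrence: the network serves fetches
    # one after another, and each compute starts once both the CPU is free
    # and its own fetch has finished.
    fetch_done = 0.0    # completion time on the network resource
    compute_done = 0.0  # completion time on the CPU resource
    for fetch, compute in stages:
        fetch_done += fetch
        compute_done = max(compute_done, fetch_done) + compute
    return compute_done

def johnson_order(stages):
    # Johnson's rule for two-machine flow shops: stages whose fetch is no
    # longer than their compute go first (shortest fetch first); the rest
    # go last (longest compute first).
    front = sorted((s for s in stages if s[0] <= s[1]), key=lambda s: s[0])
    back = sorted((s for s in stages if s[0] > s[1]), key=lambda s: s[1], reverse=True)
    return front + back

if __name__ == "__main__":
    # Hypothetical (fetch, compute) durations in seconds for four stages.
    stages = [(4, 2), (1, 5), (3, 3), (2, 6)]
    print("serial             :", makespan_serial(stages))                    # 26
    print("pipelined          :", makespan_pipelined(stages))                 # 20
    print("pipelined + Johnson:", makespan_pipelined(johnson_order(stages)))  # 17

Even in this toy setting, overlapping the two resources shortens the makespan, and reordering the stages shortens it further; the paper's contention-free and RL-based schedulers address the much harder case of DAG precedence constraints and imperfect parallelism.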



Author information

Corresponding author: Correspondence to Jie Wu.

Supplementary Information

ESM 1 (PDF 154 kb)


About this article


Cite this article

Duan, Y., Wang, N. & Wu, J. Accelerating DAG-Style Job Execution via Optimizing Resource Pipeline Scheduling. J. Comput. Sci. Technol. 37, 852–868 (2022). https://doi.org/10.1007/s11390-021-1488-4
