DOI: 10.1145/2785956.2787488
Research Article | Free Access

Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can

Published: 17 August 2015

Abstract

To reduce the impact of network congestion on big data jobs, cluster management frameworks use various heuristics to schedule compute tasks and/or network flows. Most of these schedulers treat the job input data as fixed and greedily schedule the tasks and flows that are ready to run. However, a large fraction of production jobs are recurring with predictable characteristics, which allows us to plan ahead for them. Coordinating the placement of data and tasks of these jobs significantly improves their network locality and frees up bandwidth, which can be used by other jobs running on the cluster. With this intuition, we develop Corral, a scheduling framework that uses characteristics of future workloads to determine an offline schedule which (i) jointly places data and compute to achieve better data locality, and (ii) isolates jobs both spatially (by scheduling them in different parts of the cluster) and temporally, improving their performance. We implement Corral on Apache YARN and evaluate it on a 210-machine cluster using production workloads. Compared to YARN's capacity scheduler, Corral reduces the makespan of these workloads by up to 33% and the median job completion time by up to 56%, with a 20-90% reduction in data transferred across racks.
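
To make the abstract's core idea concrete, the following is a minimal, hypothetical sketch of offline joint data-and-compute placement for recurring jobs. It is not Corral's actual algorithm (the paper formulates and solves a richer offline scheduling problem); the Job and place_jobs names, the slot model, and the greedy heuristic are all illustrative assumptions.

```python
# Illustrative sketch only -- NOT Corral's algorithm. It captures the
# intuition from the abstract: because recurring jobs are predictable,
# a planner can decide ahead of time which racks will hold both a job's
# input data and its tasks, making reads and shuffles rack-local and
# keeping jobs largely on disjoint racks (spatial isolation).

from dataclasses import dataclass

@dataclass
class Job:                       # hypothetical job profile
    name: str
    task_slots: int              # predicted compute demand (task slots)

def place_jobs(jobs, racks, slots_per_rack):
    """Greedily assign each job a set of racks; the job's input data is
    then written to those racks and its tasks are constrained to them."""
    free = {r: slots_per_rack for r in racks}
    plan = {}
    # Place the largest jobs first so they get contiguous rack sets.
    for job in sorted(jobs, key=lambda j: j.task_slots, reverse=True):
        chosen, needed = [], job.task_slots
        # Prefer the emptiest racks, so distinct jobs tend to land on
        # disjoint racks and stay isolated from one another.
        for r in sorted(free, key=free.get, reverse=True):
            if needed <= 0:
                break
            if free[r] > 0:
                take = min(free[r], needed)
                free[r] -= take
                needed -= take
                chosen.append(r)
        plan[job.name] = chosen  # data AND compute both go here
    return plan

if __name__ == "__main__":
    jobs = [Job("daily-etl", 40), Job("hourly-report", 8)]
    racks = [f"rack{i}" for i in range(4)]
    print(place_jobs(jobs, racks, slots_per_rack=16))
    # e.g. {'daily-etl': ['rack0', 'rack1', 'rack2'], 'hourly-report': ['rack3']}
```

Because recurring jobs make input sizes and task counts predictable, a plan like this can be computed before the jobs arrive; an online scheduler then only needs to enforce the precomputed placement.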

Supplementary Material

WEBM File (p407-jalaparti.webm)

Published In

SIGCOMM '15: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication
August 2015
684 pages
ISBN:9781450335423
DOI:10.1145/2785956
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 August 2015

Author Tags

  1. cluster schedulers
  2. cross-layer optimization
  3. data-intensive applications
  4. joint data and compute placement

Qualifiers

  • Research-article

Conference

SIGCOMM '15: ACM SIGCOMM 2015 Conference
Sponsor: ACM SIGCOMM
August 17-21, 2015
London, United Kingdom

Acceptance Rates

SIGCOMM '15 paper acceptance rate: 40 of 242 submissions, 17%
Overall acceptance rate: 462 of 3,389 submissions, 14%

Cited By

  • (2024) "When will my ML job finish? Toward providing completion time estimates through predictability-centric scheduling." Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (OSDI), pp. 487-505. DOI: 10.5555/3691938.3691964. Online publication date: 10-Jul-2024.
  • (2024) "Zero-sided RDMA: Network-driven Data Shuffling for Disaggregated Heterogeneous Cloud DBMSs." Proceedings of the ACM on Management of Data, 2(1), pp. 1-28. DOI: 10.1145/3639291. Online publication date: 26-Mar-2024.
  • (2024) "Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing." 2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS), pp. 1-10. DOI: 10.1109/IWQoS61813.2024.10682877. Online publication date: 19-Jun-2024.
  • (2023) "CarbonScaler: Leveraging Cloud Workload Elasticity for Optimizing Carbon-Efficiency." Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7(3), pp. 1-28. DOI: 10.1145/3626788. Online publication date: 7-Dec-2023.
  • (2023) "Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs." Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS), pp. 457-472. DOI: 10.1145/3575693.3575705. Online publication date: 27-Jan-2023.
  • (2023) "Cougar: A General Framework for Jobs Optimization In Cloud." 2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 3417-3429. DOI: 10.1109/ICDE55515.2023.00262. Online publication date: Apr-2023.
  • (2023) "An Efficient Approach for Resilience and Reliability Against Cascading Failure." 2023 15th International Conference on Developments in eSystems Engineering (DeSE), pp. 71-76. DOI: 10.1109/DeSE58274.2023.10100283. Online publication date: 9-Jan-2023.
  • (2023) "Dynamic Resource Management for Machine Learning Pipeline Workloads." SN Computer Science, 4(5). DOI: 10.1007/s42979-023-02101-8. Online publication date: 30-Aug-2023.
  • (2023) "Mixtran: an efficient and fair scheduler for mixed deep learning workloads in heterogeneous GPU environments." Cluster Computing, 27(3), pp. 2775-2784. DOI: 10.1007/s10586-023-04104-9. Online publication date: 12-Aug-2023.
  • (2022) "PushBox: Making Use of Every Bit of Time to Accelerate Completion of Data-Parallel Jobs." IEEE Transactions on Parallel and Distributed Systems, 33(12), pp. 4256-4269. DOI: 10.1109/TPDS.2022.3182037. Online publication date: 1-Dec-2022.
