Abstract
Offloading technologies on smart devices push ever more jobs into cloud data centers, where limited physical resources and competition among jobs make performance a central concern. Since many of these jobs are task-parallel, improving their performance is important. However, because transistor feature sizes are approaching their physical limits, the number of transistors that can be integrated into a single CPU core is severely restricted. Moreover, constrained by cooling efficiency, CPU frequency cannot be raised without bound, since higher frequencies cause energy consumption and heat production to grow rapidly. As the performance gains of new hardware generations have slowed, the era of serial computing is over, and programmers can no longer obtain free application speedups simply by upgrading hardware. Computer architecture is shifting toward parallelism, and squeezing the last bit of performance out of current state-of-the-art architectures is an urgent task for the whole cloud computing community. In this paper we present Function Flow, a C++11-based generic framework for task parallelism. Our insight is that heavy use of generic parallel algorithms in task-parallel programs can introduce numerous unnecessary synchronization operations, degrading application performance. To address this problem, Function Flow provides a DAG-driven task scheduler for programs that can be expressed as a directed acyclic graph (DAG) of tasks with dependency edges. Function Flow distributes worker threads across cores and schedules tasks based purely on each task's state in the DAG constructed by the programmer. Because our implementation is based on a callback mechanism, DAGs are represented compactly and the Function Flow scheduler works in a dynamic, fully distributed manner.
To achieve high performance, the only thing programmers need to do is characterize the dependencies between tasks using the user-friendly interfaces that Function Flow provides. We use several micro-benchmarks to demonstrate the efficiency of our approach and to analyze the framework's performance.
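The scheduling idea described above can be illustrated with a minimal sketch. This is not Function Flow's actual API (the names `Task`, `DagScheduler`, `add_edge`, and `run` are illustrative): each task carries a count of unmet dependencies, and finishing a task acts as a callback that decrements its successors' counts and releases any that become ready. A real implementation would pop ready tasks from per-thread work queues rather than a single serial queue.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Illustrative sketch of DAG-driven task scheduling via callbacks.
struct Task {
    std::function<void()> body;    // work to run when the task is scheduled
    int pending;                   // number of unmet dependencies
    std::vector<Task*> successors; // outgoing dependency edges in the DAG
    explicit Task(std::function<void()> b) : body(std::move(b)), pending(0) {}
};

struct DagScheduler {
    std::queue<Task*> ready; // tasks whose dependencies are all satisfied

    // Record a dependency edge: `to` cannot start until `from` finishes.
    void add_edge(Task& from, Task& to) {
        from.successors.push_back(&to);
        ++to.pending;
    }

    // Run every task exactly once, in an order consistent with the DAG.
    void run(std::vector<Task*> all) {
        for (Task* t : all)
            if (t->pending == 0) ready.push(t); // sources start ready
        while (!ready.empty()) {
            Task* t = ready.front();
            ready.pop();
            t->body();
            // "Callback" on completion: release successors whose last
            // dependency was this task.
            for (Task* s : t->successors)
                if (--s->pending == 0) ready.push(s);
        }
    }
};
```

The scheduler needs no global view of the graph: each completion touches only the finished task's successors, which is what allows a compact DAG representation and fully distributed scheduling decisions.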
Additional information
This article is part of the Topical Collection: Special Issue on Software Defined Networking: Trends, Challenges and Prospective Smart Solutions
Guest Editors: Ahmed E. Kamal, Liangxiu Han, Sohail Jabbar, and Liu Lu
Li, C., Liao, X. & Jin, H. Enhancing application performance via DAG-driven scheduling in task parallelism for cloud center. Peer-to-Peer Netw. Appl. 12, 381–391 (2019). https://doi.org/10.1007/s12083-017-0576-2