Abstract
In a general-purpose computing system, several parallel applications run simultaneously on the same platform. Even if each application is highly tuned for that specific platform, additional performance issues arise in such a dynamic environment, in which multiple applications compete for the resources. Different scheduling and resource management techniques have been proposed, at either the operating system or the user level, to improve the performance of concurrent workloads. In this paper, we propose a task-based strategy called “Steal Locally, Share Globally”, implemented in the runtime of our parallel programming model GPRM (Glasgow Parallel Reduction Machine). We have chosen a state-of-the-art manycore parallel machine, the Intel Xeon Phi, to compare GPRM with three well-known parallel programming models, OpenMP, Intel Cilk Plus and Intel TBB, in both single-programming and multiprogramming scenarios. We show that GPRM not only performs well for single workloads, but also outperforms the other models for multiprogramming workloads. There are three considerations regarding our task-based scheme: (i) it is implemented inside the parallel framework, not as a separate layer; (ii) it improves performance without the need to change the number of threads for each application; (iii) it can be further tuned and improved, not only for GPRM applications, but also for other, equivalent parallel programming models.
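The abstract names the policy but does not spell out its mechanics. As a rough illustration only, and not GPRM's actual runtime code, the following minimal C++ sketch shows one plausible shape of a "steal locally, share globally" scheme, under the assumption that each worker owns a task queue and that workers are grouped into fixed-size tiles; all identifiers here (Scheduler, Worker, TILE_SIZE, share_globally, find_work) are our own, not GPRM's.

    // Illustrative sketch only: a minimal model of a "steal locally,
    // share globally" policy, NOT the GPRM runtime. Coarse-grained
    // locking is used for brevity; a real runtime would use lock-free
    // deques or finer-grained synchronisation.
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <vector>

    using Task = std::function<void()>;

    struct Worker {
        std::deque<Task> local;   // per-worker task queue
        std::mutex lock;          // guards 'local'
    };

    class Scheduler {
        static constexpr int TILE_SIZE = 4;  // workers per tile (assumed)
        std::vector<Worker> workers;
        std::deque<Task> global;             // globally shared overflow queue
        std::mutex global_lock;

    public:
        explicit Scheduler(int n) : workers(static_cast<size_t>(n)) {}

        // A worker pushes newly created tasks onto its own queue.
        void spawn(int me, Task t) {
            std::lock_guard<std::mutex> l(workers[me].lock);
            workers[me].local.push_back(std::move(t));
        }

        // "Share globally": a worker with surplus tasks moves all but one
        // of them to the global queue, where idle workers in other tiles
        // can pick them up.
        void share_globally(int me) {
            std::scoped_lock both(workers[me].lock, global_lock);
            while (workers[me].local.size() > 1) {
                global.push_back(std::move(workers[me].local.back()));
                workers[me].local.pop_back();
            }
        }

        // "Steal locally": an idle worker first tries its neighbours in
        // the same tile (stealing from the front, opposite to the owner's
        // end), and only then falls back to the global queue.
        std::optional<Task> find_work(int me) {
            int tile = me / TILE_SIZE;
            int begin = tile * TILE_SIZE;
            int end = std::min(begin + TILE_SIZE, static_cast<int>(workers.size()));
            for (int i = begin; i < end; ++i) {
                if (i == me) continue;
                std::lock_guard<std::mutex> l(workers[i].lock);
                if (!workers[i].local.empty()) {
                    Task t = std::move(workers[i].local.front());
                    workers[i].local.pop_front();
                    return t;
                }
            }
            std::lock_guard<std::mutex> g(global_lock);
            if (!global.empty()) {
                Task t = std::move(global.front());
                global.pop_front();
                return t;
            }
            return std::nullopt;
        }
    };

The intuition captured by the sketch is that an idle worker prefers cheap, cache-friendly steals from its own tile and resorts to the shared global queue only when its neighbourhood runs dry.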







Notes
TCB is called the “subtask list” in previous GPRM papers.
These features can be enabled via command-line switches when compiling the GPRM runtime system. In that sense, they can be considered runtime support features.
We use the noun “steal” (OED “the act of stealing”) rather than “theft”.
The “sequential tile” is responsible for the sequential tasks, but it also contributes to the parallel execution whenever required.
For all experiments, the results shown in figures (a) and (b) are measured from the benchmark kernels alone, while the remaining results, taken from the VTune Amplifier, account for everything from the start of the application, including its initial phase and the CPU time consumed by the shared libraries.
The sequential phase of the MergeSort benchmark with the input size 80M is around 2 s, and the initial phase of the MatMul benchmark with the input size 4096 × 4096 is about half a second.
Cite this article
Tousimojarad, A., Vanderbauwhede, W. Steal Locally, Share Globally. Int J Parallel Prog 43, 894–917 (2015). https://doi.org/10.1007/s10766-015-0350-0