Abstract
As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads they use as they run, enable sophisticated and flexible resource management. Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed runtime environments provide neither an interface for intelligently allocating hardware threads nor any strategy for preventing oversubscription. Prior research methods either depend on profiling applications ahead of time to make good allocation decisions or do not account for process efficiency at all, leading to poor performance. None of these prior methods has been adopted widely in practice. This article presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution that supports existing malleable applications in making intelligent allocation decisions based on observed efficiency, without any changes to semantics, program modification, offline profiling, or even recompilation. Our existing implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also be supported with small modifications, without requiring application modification or recompilation.
In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters and demonstrate its effectiveness in aiding allocation decisions.
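To make the efficiency-based allocation idea concrete, the following is a minimal sketch of estimating a process's parallel efficiency from hardware-counter readings at runtime. The counter quantities, function names, and the IPC-based formula are illustrative assumptions for this sketch, not SCAF's actual implementation:

```python
# Hypothetical sketch: approximate a process's parallel efficiency as
# achieved speedup per thread, derived from instructions-per-cycle (IPC)
# readings such as those exposed by PAPI. All names here are assumptions.

def estimate_efficiency(serial_ipc, parallel_ipc_per_thread):
    """Estimate parallel efficiency from hardware-counter measurements.

    serial_ipc: IPC measured during a brief serial execution of the work.
    parallel_ipc_per_thread: list of per-thread IPC values measured while
        the same work runs in parallel.
    Returns efficiency in (0, 1], where 1.0 means perfect scaling.
    """
    nthreads = len(parallel_ipc_per_thread)
    # Aggregate useful work rate across all threads, relative to the
    # serial rate, as a stand-in for achieved speedup.
    speedup = sum(parallel_ipc_per_thread) / serial_ipc
    return speedup / nthreads

# Example: 4 threads each sustain 0.6 IPC vs. 1.0 IPC serially,
# giving an estimated speedup of 2.4 and efficiency of 0.6.
eff = estimate_efficiency(1.0, [0.6, 0.6, 0.6, 0.6])
```

A runtime scheduler could then favor processes with higher estimated efficiency when dividing hardware threads, which is the kind of feedback-driven decision SCAF automates.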
We evaluated SCAF using the NAS Parallel Benchmarks (NPB) on five commodity parallel platforms, enumerating architectural features and their effects on our scheme. We measured the benefit of SCAF in terms of improvement in sum of speedups (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently, compared to equipartitioning, the best existing competing scheme in the literature. We found that SCAF improves on equipartitioning on four of the five machines, with a mean improvement factor in sum of speedups of 1.04x to 1.11x for benchmark pairs, depending on the machine, and 1.09x on average.
Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming with unmodified OpenMP, which is the only environment available to end users today. SCAF improves on the unmodified OpenMP runtimes on all five machines, with a mean improvement of 1.08x to 2.07x, depending on the machine, and 1.59x on average.
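The sum-of-speedups metric used above can be illustrated with a small worked example. The runtimes below are invented purely for illustration and do not reflect measured results:

```python
# Illustrative computation of the sum-of-speedups metric for two
# benchmarks sharing a machine. All runtimes are made-up numbers.

def sum_of_speedups(runtimes_alone, runtimes_shared):
    """Each process's speedup is its runtime on the dedicated machine
    divided by its runtime while sharing the machine; the metric for a
    multiprogrammed workload is the sum over all processes."""
    return sum(alone / shared
               for alone, shared in zip(runtimes_alone, runtimes_shared))

# Two benchmarks, each taking 100s alone on the full machine.
# Suppose a static equal partition yields shared runtimes of 160s and
# 220s, while a feedback-driven allocation yields 150s and 200s.
equi = sum_of_speedups([100, 100], [160, 220])   # about 1.08
feedback = sum_of_speedups([100, 100], [150, 200])  # about 1.17
improvement = feedback / equi                    # about 1.08x
```

Comparing schemes by the ratio of their sums of speedups, as in the final line, is how the per-machine improvement factors above are expressed.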